Comparison of Gigablast vs SOLR Open Source Search Engine

Comparing Gigablast to SOLR

Feature	Gigablast	Solr
Package Installation	Download packages for Ubuntu or RedHat	Instructions
Source Language	C/C++	Java
Runs on Linux	Yes.	Yes.
Runs on Windows	Yes, with VirtualBox. Native support coming soon.	Yes.
License	Apache License 2.0	Apache License 2.0
Release Date	2000	2007
Scalability	Has scaled to over 12 billion unique web pages. Can scale to over 100 billion pages in a single collection.	Good luck!
HTTP API	here	here
Search Results	here	here
Source Repository	github	github
Github Star Ratings	326 (8/2/2014)	767 (8/2/2014)
Source Installation	Just a few simple steps	Source download instructions
Complete Web GUI	Yes.	???
Operating Layout	A single binary containing web server, database, admin tools, spider logic, etc.	Many different packages quilted together. Apache, MySQL, Lucene, Tika, Zookeeper, Solr, Nutch, ...
Indexing a Single File Containing Multiple Documents via cmdline	Use curl with the args (including delim) listed here	Unsupported.
Indexing an Individual File via cmdline	Use curl to post the content of the file with args listed here	You can index individual local files like this: curl "http://127.0.0.1:8080/solr/update" --data-binary @myfile.html -H 'Content-type: text/html' However, this does not seem to work unless your HTML meets stringent requirements.
Indexing an Individual URL via cmdline	Use curl to inject the url with args listed here	???
Indexing a File of URLs via cmdline	Use one curl command for each url, using the interface described here	???
Deleting Documents via cmdline	Use curl command to delete a url, using the interface described here	You can delete individual documents by specifying queries that match just those documents: java -Dcommit....
Getting Results via cmdline	Use curl command to do a search, using the interface described here	???
Facets	Yes. Basic support. See gbfacet operators in the help file.	Yes.
Search Result Limitations Based on Facet Value Counts	Coming soon.	Yes.
Numeric Fields	You can forward/reverse sort by and constrain by numeric fields.	You can forward/reverse sort by and constrain by numeric fields.
Boolean Search	Fully nested boolean search with AND OR NOT.	Fully nested boolean search with AND OR NOT.
Searchable Fields	Yes. Any meta tag, or any field when indexing JSON or XML.	???
Site Restricted Searches	Yes. Use the site: query operator, or use &sites=... to constrain your search to up to 500 sites.	???
Spell Checker	Yes. But currently disabled until improved.	Yes.
Language Identification	Yes. On a per word level for searching purposes.	Yes. Not on a per word level for searching purposes.
Index Multiple Languages	Yes. Can expand words in many languages to all their different forms. More forms coming soon, too.	Yes, but stemming/expansion may be limited.
Show Images in Search Results	Yes.	No.
Related Concepts	Yes. Called Gigabits.	No.
Query Expansion (Synonyms)	Yes. Also uses a mysynonyms.txt file so you can add your own expansion terms.	???
Cached Pages	Yes.	???
RESTful/XML/JSON APIs	Yes, XML. JSON coming soon.	???
Schemas	You do not need to define schemas to begin indexing files and urls.	You have to define annoying schemas.
Spidering	Gigablast has a complete distributed web spider with powerful controls.	SOLR has no spider. You can try to integrate Nutch.
Document Filters	antiword (for Microsoft Word), pdftohtml (for PDF), xlstohtml (for Excel), ppthtml (for PowerPoint), pstotext (for PostScript)	Uses Apache Tika for several formats.
Scalability	Highly scalable. Has scaled to over 12 billion pages while serving millions of queries per day. To add capacity, add new servers to the hosts.conf file and click "rebalance shards" to rebalance the data.	Has not scaled nearly as high to our knowledge. Not originally built for more than one server.
Cluster Administration	Built into the web GUI.	Requires separate Zookeeper package installation.
Performance	High performance. Written in C/C++.	Slower. Written in Java. Has garbage collection, etc.
Ranking Algorithm	Custom query term proximity based algorithm. Superior to TF/IDF or Cosine methods.	Old school TF/IDF based on simple statistics.
Scoring Explanations	Complete scoring information provided.	Complete scoring information provided.
Inlink Text	Indexes incoming link text and compensates for link spam.	None. Not geared for web search.
Page Rank	Uses Site Rank, based on the number of incoming links to a site from other sites. Detects link spam and compensates accordingly.	None. Not geared for web search.
On-Page Spam	Demotes terms deemed spammy on a page.	None.
Reliability	Pretty good.	Pretty good.
Developer Documentation	Yes. Here.	Yes. Lots of documentation.
Graphing	Graphs performance of various subroutines and query times.	Unknown.
Monitoring	Monitors drive temperature, disk space, query latency and shard uptime. Sends email alerts.	None known.
Geospatial	Can use the numeric gbminint: and gbmaxint: query operators on lat/lon fields. See help file for examples using these operators.	Yes.
Dynamic Summaries	Yes. Contain query terms.	Yes. Contain query terms.
Site Clustering	Yes.	???
More Like This	Coming soon.	Yes.
Sort by Date	Use gbsortbyint:gbspiderdate, gbsortbyint:gbindexdate, gbrevsortbyint:gbspiderdate, or gbrevsortbyint:gbindexdate. See help file for examples using these operators.	???
Query Completion	Coming soon.	Available with additional module.
Document Collections	Supports tens of thousands of separate collections, and federated search across them.	???
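
The table mentions several Gigablast query operators (site: for site-restricted searches, gbsortbyint:gbspiderdate for sorting by date). As a rough illustration of how such operators might be combined into a single search request, here is a minimal Python sketch. The base URL, port, path, and the "q" parameter name are assumptions made for this example and are not taken from the table; consult the HTTP API documentation for the actual interface.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port, and path are assumptions for illustration.
BASE_URL = "http://127.0.0.1:8000/search"


def build_search_url(terms, site=None, sort_operator=None, base_url=BASE_URL):
    """Compose a search URL from plain terms plus optional query operators.

    The operators (site:, gbsortbyint:...) come from the comparison table;
    the "q" parameter name is an assumption.
    """
    parts = [terms]
    if site:
        # Restrict results to one site with the site: operator.
        parts.append(f"site:{site}")
    if sort_operator:
        # For example "gbsortbyint:gbspiderdate" or "gbrevsortbyint:gbindexdate".
        parts.append(sort_operator)
    # urlencode percent-encodes the colons and turns spaces into "+".
    return f"{base_url}?{urlencode({'q': ' '.join(parts)})}"


if __name__ == "__main__":
    print(build_search_url(
        "open source search",
        site="example.com",
        sort_operator="gbsortbyint:gbspiderdate",
    ))
```

The resulting URL could then be fetched with curl, as the command-line rows of the table suggest. Keeping the operators inside the query text, rather than as separate parameters, mirrors how the table describes them.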