|
Gigablast |
Solr |
Package Installation |
Download packages for Ubuntu or RedHat
|
Instructions
|
Source Language |
C/C++
|
Java
|
Runs on Linux |
Yes.
|
Yes.
|
Runs on Windows |
Yes, with VirtualBox. Native support is coming soon.
|
Yes.
|
License |
Apache License 2.0
|
Apache License 2.0
|
Release Date |
2000
|
2007
|
Scalability |
Has scaled to over 12 billion unique web pages.
Can scale to over 100 billion pages in a single collection.
|
Good luck!
|
HTTP API |
here
|
here
|
Search Results |
here
|
here
|
Source Repository |
github
|
github
|
Github Star Ratings |
326 (8/2/2014)
|
767 (8/2/2014)
|
Source Installation |
Just a few simple steps
|
Source download instructions
|
Complete Web GUI
|
Yes.
|
???
|
Operating Layout
|
A single binary containing web server, database, admin tools, spider logic, etc.
|
Many different packages quilted together. Apache, MySQL, Lucene, Tika, Zookeeper, Solr, Nutch, ...
|
Indexing a Single File Containing Multiple Documents via cmdline
|
Use curl with the args (including delim) listed here
|
unsupported
|
Indexing an Individual File via cmdline
|
Use curl to post the content of the file with args listed
here
|
You can index individual local files like this:
curl "http://127.0.0.1:8080/solr/update" --data-binary @myfile.html -H 'Content-type: text/html'
However, this does not seem to work unless your HTML meets stringent requirements.
|
Indexing an Individual URL via cmdline
|
Use curl to inject the url with args listed
here
|
???
|
Indexing a File of URLs via cmdline
|
Use one curl command for each url, using the interface described
here
|
???
|
Deleting Documents via cmdline
|
Use curl command to delete a url, using the interface described
here
|
You can delete individual documents by specifying queries that match just those documents:
java -Dcommit....
|
Getting Results via cmdline |
Use curl command to do a search, using the interface described
here
|
???
|
Facets |
Yes. Basic support. See gbfacet operators in the help file.
|
Yes.
|
Search Result Limitations Based on Facet Value Counts |
Coming soon.
|
Yes.
|
Numeric Fields |
You can forward/reverse sort by and constrain by numeric fields.
|
You can forward/reverse sort by and constrain by numeric fields.
|
Boolean Search |
Fully nested boolean search with AND, OR, and NOT.
|
Fully nested boolean search with AND, OR, and NOT.
|
Searchable Fields |
Yes. Any meta tag, or any field when indexing JSON or XML.
|
???
|
Site Restricted Searches |
Yes. Use the site: query operator, or use &sites=... to constrain your search to up to 500 sites.
|
???
|
Spell Checker |
Yes. But currently disabled until improved.
|
Yes.
|
Language Identification |
Yes. At the per-word level, for searching purposes.
|
Yes, but not at the per-word level for searching purposes.
|
Index Multiple Languages |
Yes. Can expand words in many languages to all their different forms. More forms coming soon, too.
|
Yes, but stemming/expansion may be limited.
|
Show Images in Search Results |
Yes.
|
No.
|
Related Concepts |
Yes. Called Gigabits.
|
No.
|
Query Expansion (Synonyms) |
Yes. Also reads a mysynonyms.txt file so you can add your own expansion terms.
|
???
|
Cached Pages |
Yes.
|
???
|
RESTful/XML/JSON APIs |
Yes, XML. JSON coming soon.
|
???
|
Schemas |
You do not need to define schemas to begin indexing files and urls.
|
You have to define annoying schemas.
|
Spidering |
Gigablast has a complete distributed web spider with powerful controls.
|
Solr has no spider. You can try to integrate Nutch.
|
Document Filters |
antiword (for Microsoft Word)
pdftohtml (for PDF)
xlstohtml (for Excel)
ppthtml (for PowerPoint)
pstotext (for PostScript)
|
Uses Apache Tika for several formats.
|
Scalability |
Highly scalable. Has scaled to over 12 billion pages while serving millions of queries per day.
To add capacity, add new servers to the hosts.conf file and click rebalance shards to rebalance the data.
|
To our knowledge, has not scaled nearly as far. Not originally built for more than one server.
|
Cluster Administration |
Built into the web GUI.
|
Requires separate Zookeeper package installation.
|
Performance |
High performance. Written in C/C++.
|
Slower. Written in Java. Has garbage collection, etc.
|
Ranking Algorithm |
Custom algorithm based on query term proximity. Superior to TF/IDF or cosine methods.
|
Old school TF/IDF based on simple statistics.
|
Scoring Explanations |
Complete scoring information provided.
|
Complete scoring information provided.
|
Inlink Text |
Indexes incoming link text and compensates for link spam.
|
None. Not geared for web search.
|
Page Rank |
Uses Site Rank, based on the number of incoming links to a site from other sites. Detects link spam and compensates accordingly.
|
None. Not geared for web search.
|
On-Page Spam |
Demotes terms deemed spammy on a page.
|
None.
|
Reliability |
Pretty good.
|
Pretty good.
|
Developer Documentation |
Yes. Here.
|
Yes. Lots of documentation.
|
Graphing |
Graphs performance of various subroutines and query times.
|
Unknown.
|
Monitoring |
Monitors drive temperature, disk space, query latency and shard uptime. Sends email alerts.
|
None known.
|
Geospatial |
Can use the numeric gbminint: and gbmaxint: query operators on lat/lon fields.
See the help file for examples using these operators.
|
Yes.
|
Dynamic Summaries |
Yes. Contain query terms.
|
Yes. Contain query terms.
|
Site Clustering |
Yes.
|
???
|
More Like This |
Coming soon.
|
Yes.
|
Sort by Date |
gbsortbyint:gbspiderdate
gbsortbyint:gbindexdate
gbrevsortbyint:gbspiderdate
gbrevsortbyint:gbindexdate
See help file for examples using these operators.
|
???
|
Query Completion |
Coming soon.
|
Available with additional module.
|
Document Collections |
Supports tens of thousands of separate collections, and federated search across them.
|
???
|
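The Gigablast query operators named in the table (site:, gbminint:, gbmaxint:, gbsortbyint:, gbrevsortbyint:) can be combined into one query string. The following is a minimal sketch of how that might be composed from Python, not the documented API. Several details are assumptions that the table does not confirm: the `operator:field:value` layout used for gbminint: and gbmaxint:, the `/search` path, the `q` parameter name, and the host and port. The helper names `build_query` and `search_url` are hypothetical. Check the help file and the HTTP API page before relying on any of them.

```python
from urllib.parse import urlencode


def build_query(terms, site=None, sort_field=None, reverse=False,
                int_field=None, min_int=None, max_int=None):
    """Compose a query string from the operators named in the table."""
    parts = [terms]
    if site:
        # site: restricts results to one site.
        parts.append(f"site:{site}")
    if int_field and min_int is not None:
        # ASSUMPTION: the operator:field:value layout is not confirmed here.
        parts.append(f"gbminint:{int_field}:{min_int}")
    if int_field and max_int is not None:
        parts.append(f"gbmaxint:{int_field}:{max_int}")
    if sort_field:
        # gbsortbyint: and gbrevsortbyint: sort on an integer field
        # such as gbspiderdate or gbindexdate.
        op = "gbrevsortbyint" if reverse else "gbsortbyint"
        parts.append(f"{op}:{sort_field}")
    return " ".join(parts)


def search_url(base, query):
    """Build a search URL for the composed query."""
    # ASSUMPTION: the "/search" path and "q" parameter are hypothetical.
    return f"{base}/search?{urlencode({'q': query})}"


q = build_query("open source", site="example.com", sort_field="gbspiderdate")
print(q)
print(search_url("http://127.0.0.1:8000", q))
```

The resulting URL can be passed to curl in the same way as the command-line rows above describe. `urlencode` handles the escaping of the colons and spaces in the operators.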