Gigablast - syntax

Example Query	Description
Basic Query Syntax
cat dog	Search results have the word cat and the word dog in them. They could also have cats and dogs.
+cat	Search results have the word cat in them. A result that only has the word cats is not included. The plus sign requests an exact match, so synonyms, hypernyms, hyponyms, and other forms of the word are not used.
mp3 "take five"	Search results have the word mp3 and the exact phrase take five in them.
"john smith" -"bob dole"	Search results have the phrase john smith but NOT the phrase bob dole in them.
bmx -game	Search results have the word bmx but not game.
cat | dog	Match documents that have cat and dog in them, but do not allow cat to affect the ranking score, only dog. This is called a query refinement.
document.title:paper	That query will match a JSON document like { "document":{"title":"This is a good paper." }} or, alternatively, an XML document like <document><title>This is a good paper</title></document>
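
The basic operators above are plain string conventions, so a query can be assembled programmatically. The following is a minimal sketch in Python; the function name build_query and its parameters are hypothetical helpers for illustration and are not part of Gigablast.

```python
def build_query(words=(), exact_words=(), phrases=(), excluded=()):
    """Assemble a query string from the basic operators in the table above."""
    parts = list(words)
    # A leading plus sign asks for the exact word form only.
    parts += ["+" + word for word in exact_words]
    # Double quotes mark an exact phrase.
    parts += ['"%s"' % phrase for phrase in phrases]
    # A leading minus sign excludes a term; multi-word exclusions are quoted.
    for term in excluded:
        parts.append('-"%s"' % term if " " in term else "-" + term)
    return " ".join(parts)


print(build_query(words=["mp3"], phrases=["take five"]))
print(build_query(phrases=["john smith"], excluded=["bob dole"]))
print(build_query(words=["bmx"], excluded=["game"]))
```

The three calls reproduce the mp3, john smith, and bmx example queries from the table.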

Example Query	Description
Advanced Query Operators
gbfieldmatch:strings.vendor:"My Vendor Inc."	Matches all the meta tag, JSON, or XML fields that have the name "strings.vendor" and contain exactly the provided value, in this case My Vendor Inc. (with the trailing period). The match is CASE SENSITIVE and includes punctuation, so it is an exact match. The resulting termlist is generally very short, so the query should be fast.
url:www.abc.com/page.html	Matches the page with that exact url. Uses the first url, not the url it redirects to, if any.
ext:doc	Match documents whose url ends in the .doc file extension.
link:www.gigablast.com/foo.html	Matches all the documents that have a link to http://www.gigablast.com/foo.html
sitelink:abc.foobar.com	Matches all documents that link to any page on the abc.foobar.com site.
site:mysite.com	Matches all documents on the mysite.com domain.
site:www.mysite.com/dir1/dir2/	Matches all documents whose url starts with www.mysite.com/dir1/dir2/
ip:1.2.3.4	Matches all documents whose IP is 1.2.3.4.
ip:1.2.3	Matches all documents whose IP STARTS with 1.2.3.
inurl:dog	Matches all documents that have the word dog in their url, like http://www.mysite.com/dog/food.html. However, it will not match http://www.mysite.com/dogfood.html because dog is not an individual word there. The word must be delineated by punctuation.
suburl:dog	Same as inurl.
title:cat	Matches all the documents that have the word cat in their title.
title:"cat food"	Matches all the documents that have the phrase "cat food" in their title.
intitle:cat	Same as title:
gbinrss:1	Matches all documents that are in RSS feeds. Likewise, use gbinrss:0 to match all documents that are NOT in RSS feeds.
type:json	Matches all documents that are in JSON format. Other possible types include html, text, xml, pdf, doc, xls, ppt, ps, css, json, status. status matches special documents that are stored every time a url is spidered so you can see all the spider attempts and when they occurred as well as the outcome.
filetype:json	Same as type: above.
gbisadult:1	Matches all documents that have been detected as adult documents and may be unsuitable for children. Likewise, use gbisadult:0 to match all documents that were NOT detected as adult documents.
gbimage:site.com/image.jpg	Matches all documents that contain the specified image.
gbhasthumbnail:1	Matches all documents for which Gigablast detected a thumbnail. Likewise use gbhasthumbnail:0 to match all documents that do not have thumbnails.
gbtag*	Matches all documents whose tag named * has the specified value in the tagdb entry for the url. Example: gbtagsitenuminlinks:2 matches all documents that have 2 qualified inlinks pointing to their site based on the tagdb record. You can also provide your own tags in addition to the tags already present. See the tagdb menu for more information.
gbzip:90210	Matches all documents that have the specified zip code in their meta zip code tag.
gbcharset:windows-1252	Matches all documents originally in the Windows-1252 charset. Available character sets are listed in the iana_charset.cpp file in the open source distribution. There are a lot. Some more popular ones are: us, latin1, iso-8859-1, csascii, ascii, latin2, latin3, latin4, greek, utf-8, shift_jis.
gblang:de	Matches all documents in German. The supported language abbreviations are at the bottom of the url filters page. Some more common ones are gblang:en, gblang:es, gblang:fr, gblang:"zh_cn" (note the quotes for zh_cn!).
gbpathdepth:3	Matches all documents whose url has 3 path components to it like http://somedomain.com/dir1/dir2/dir3/foo.html
gbhopcount:2	Matches all documents that are a minimum of two link hops away from a root url.
gbhasfilename:1	Matches all documents whose url ends in a filename like http://somedomain.com/dir1/myfile and not http://somedomain.com/dir1/dir2/. Likewise, use gbhasfilename:0 to match all the documents that do not have a filename in their url.
gbiscgi:1	Matches all documents that have a question mark in their url. Likewise gbiscgi:0 matches all documents that do not.
gbhasext:1	Matches all documents that have a file extension in their url. Likewise, gbhasext:0 matches all documents that do not have a file extension in their url.
gbsubmiturl:domain.com/process.php	Matches all documents that have a form that submits to the specified url.
gbparenturl:www.xyz.com/abc.html	Diffbot only. Matches the JSON urls that were extracted from this parent url. Example: gbparenturl:www.gigablast.com/addurl.htm
gbcountry:us	Matches documents determined by Gigablast to be from the United States. See the country abbreviations in the CountryCode.cpp file in the open source distribution. Some more popular examples include: de, fr, uk, ca, cn.
gbpermalink:1	Matches documents that are permalinks. Use gbpermalink:0 to match documents that are NOT permalinks.
gbdocid:123456	Matches the document with the docid 123456
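
The inurl: rule above (the word must be delineated by punctuation) can be illustrated with a small sketch. This is only an approximation under the assumption that a "word" is any run of letters or digits; Gigablast's actual tokenizer may differ, and the function names are hypothetical.

```python
import re


def url_words(url):
    """Split a url into lowercase words at every run of non-alphanumeric characters."""
    return {word for word in re.split(r"[^0-9A-Za-z]+", url.lower()) if word}


def matches_inurl(url, word):
    """Approximate inurl:word, where the word must stand alone between punctuation."""
    return word.lower() in url_words(url)


def is_cgi(url):
    """Approximate gbiscgi:1, which matches urls that have a question mark."""
    return "?" in url


print(matches_inurl("http://www.mysite.com/dog/food.html", "dog"))
print(matches_inurl("http://www.mysite.com/dogfood.html", "dog"))
```

The first url contains dog as its own path component, while in the second it is fused with food and so does not count as an individual word.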

Example Query	Description
Numeric Field Query Operators
cameras gbsortbyfloat:price	Sort all documents that contain 'camera' by price. price can be a root JSON field or in a meta tag, or in an xml <price> tag.
cameras gbsortbyfloat:product.price	Sort all documents that contain 'camera' by price. price can be in a JSON document like { "product":{"price":1500.00}} or, alternatively, an XML document like <product><price>1500.00</price></product>
cameras gbrevsortbyfloat:product.price	Like above example but sorted with highest prices on top.
pilots gbsortbyint:employees	Sort all documents that contain 'pilots' by employees. employees can be a root JSON field, in a meta tag, or in an xml <employees> tag. The value it contains is interpreted as a 32-bit integer.
gbsortbyint:gbdocspiderdate	Sort all documents by the date they were spidered/downloaded.
gbsortbyint:company.employees	Sort all documents by employees. Documents can contain employees in a JSON document like { "company":{"employees":13}} or, alternatively, an XML document like <company><employees>13</employees></company>
gbsortbyint:gbsitenuminlinks	Sort all documents by the number of distinct inlinks the document's site has.
gbrevsortbyint:gbdocspiderdate	Sort all documents by the date they were spidered/downloaded but with the oldest on top.
cameras gbminfloat:price:109.99	Matches all documents that contain 'camera' or 'cameras' and have a price of at least 109.99. price can be a root JSON field or in a meta tag name price, or in an xml <price> tag.
cameras gbminfloat:product.price:109.99	Matches all documents that contain 'camera' or 'cameras' and have a price of at least 109.99 in a JSON document like { "product":{"price":1500.00}} or, alternatively, an XML document like <product><price>1500.00</price></product>
cameras gbmaxfloat:price:109.99	Like the gbminfloat examples above, but is an upper bound.
gbequalfloat:product.price:1.23	Similar to gbminfloat and gbmaxfloat but is an equality constraint.
gbminint:gbspiderdate:1391749680	Matches all documents with a spider timestamp of at least 1391749680. Use this as opposed to gbminfloat when you need 32 bits of integer precision.
gbmaxint:company.employees:20	Matches all companies with 20 or fewer employees in a JSON document like { "company":{"employees":13}} or, alternatively, an XML document like <company><employees>13</employees></company>
gbequalint:company.employees:13	Similar to gbminint and gbmaxint but is an equality constraint.
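
The dotted field names used above (product.price, company.employees) name a path through nested JSON objects. The sketch below shows that mapping on a local Python dict and builds a price-range query. The helper names are hypothetical, and combining gbminfloat, gbmaxfloat, and gbsortbyfloat in a single query is an assumption that the table does not explicitly confirm.

```python
def lookup(document, dotted_name):
    """Follow a dotted field name such as product.price through nested dicts."""
    value = document
    for key in dotted_name.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # the field is absent from this document
        value = value[key]
    return value


def float_range_query(keywords, field, low=None, high=None, sort=False):
    """Build a query that constrains a float field to [low, high]."""
    terms = [keywords]
    if low is not None:
        terms.append("gbminfloat:%s:%s" % (field, low))
    if high is not None:
        terms.append("gbmaxfloat:%s:%s" % (field, high))
    if sort:
        terms.append("gbsortbyfloat:%s" % field)
    return " ".join(terms)


print(lookup({"product": {"price": 1500.00}}, "product.price"))
print(float_range_query("cameras", "product.price", low=109.99))
```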

Example Query	Description
Date Related Query Operators
gbdocspiderdate:1400081479	Matches documents that have that spider date timestamp (UTC). This is the time the document completed downloading.
gbspiderdate:1400081479	Like above.
gbdocindexdate:1400081479	Like above, but is the time the document was last indexed. This time is slightly greater than or equal to the spider date.
gbindexdate:1400081479	Like above.
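
The date operators take timestamps in seconds since the epoch (UTC). A minimal sketch of producing such a value from a calendar date with Python's standard library follows; the helper names are hypothetical, and the gbminint:gbspiderdate form is taken from the numeric operators table.

```python
from datetime import datetime, timezone


def to_spider_timestamp(moment):
    """Convert a timezone-aware datetime to whole seconds since the epoch."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return int(moment.timestamp())


def spidered_since_query(moment):
    """Build a query for documents spidered at or after the given moment."""
    return "gbminint:gbspiderdate:%d" % to_spider_timestamp(moment)


when = datetime(2014, 5, 14, 15, 31, 19, tzinfo=timezone.utc)
print(spidered_since_query(when))  # gbminint:gbspiderdate:1400081479
```

Requiring a timezone-aware datetime avoids silently interpreting a naive value in the local time zone.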

Example Query	Description
Facet Related Query Operators
gbfacetstr:color	Returns facets in the search results by their color field. color is case INsensitive.
gbfacetstr:product.color	Returns facets in the color field in a JSON document like { "product":{"color":"red"}} or, alternatively, an XML document like <product><color>red</color></product>. product.color is case INsensitive.
gbfacetstr:gbtagsite cat	Returns facets from the site names of all pages that contain the word 'cat' or 'cats', etc. gbtagsite is case insensitive.
gbfacetint:product.cores	Returns facets of the cores field in a JSON document like { "product":{"cores":10}} or, alternatively, an XML document like <product><cores>10</cores></product>. product.cores is case INsensitive.
gbfacetint:gbhopcount	Returns facets of the gbhopcount field over the documents so you can see the distribution of hopcounts over the index. gbhopcount is case INsensitive.
gbfacetint:gbtagsitenuminlinks	Returns facets of the sitenuminlinks field, taken from the sitenuminlinks tag in the tagdb record for each site. Any numeric tag in tagdb can be faceted in this manner, so you can add your own facets on a per site or per url basis by making tagdb entries. Case INsensitive.
gbfacetint:size,0-10,10-20,30-100,100-200,200-1000,1000-10000	Returns facets of the size field (either a JSON field or a meta tag) and clusters the results into the specified ranges. size is case INsensitive.
gbfacetint:gbsitenuminlinks	Returns facets based on # of site inlinks the site of each result has. gbsitenuminlinks is case INsensitive.
gbfacetfloat:product.weight	Returns facets of the weight field in a JSON document like { "product":{"weight":1.45}} or, alternatively, an XML document like <product><weight>1.45</weight></product>. product.weight is case INsensitive.
gbfacetfloat:product.price,0-1.5,1.5-5,5.0-20,20-100.0	Similar to above but clusters the prices into the specified ranges. product.price is case INsensitive.
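
The ranged facet operators append comma-separated low-high pairs to the field name. The following sketch builds such a term from a list of ranges; the function name facet_ranges is a hypothetical helper, and it only formats the string without checking that the ranges are sorted or non-overlapping.

```python
def facet_ranges(operator, field, ranges=()):
    """Format a facet term such as gbfacetint:size,0-10,10-20."""
    if operator not in ("gbfacetint", "gbfacetfloat"):
        raise ValueError("ranges are only shown for gbfacetint and gbfacetfloat")
    term = "%s:%s" % (operator, field)
    # Each range is written as low-high and joined to the field with commas.
    for low, high in ranges:
        term += ",%s-%s" % (low, high)
    return term


print(facet_ranges("gbfacetint", "size", [(0, 10), (10, 20), (30, 100)]))
print(facet_ranges("gbfacetfloat", "product.price", [(0, 1.5), (1.5, 5)]))
```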

Example Query	Description
Spider Status Documents
gbssUrl:com	Query the url of a spider status document.
gbssFinalRedirectUrl:abc.com/page2.html	Query on the last url redirected to, if any.
gbssStatusCode:0	Query on the status code of the index attempt. 0 means no error.
gbssStatusMsg:"Tcp timed"	Like gbssStatusCode but a textual representation.
gbssHttpStatus:200	Query on the HTTP status returned from the web server.
gbssWasIndexed:0	Was the document in the index before attempting to index? Use 0 or 1 to find all documents that were not or were, respectively.
gbssIsDiffbotObject:1	This field is only present if the document was an object from a diffbot reply. Use gbssIsDiffbotObject:0 to find the non-diffbot objects.
gbsortby:gbssAgeInIndex	If the document was in the index at the time we attempted to reindex it, how long has it been since it was last indexed?
gbssDomain:yahoo.com	Query on the domain of the url.
gbssSubdomain:www.yahoo.com	Query on the subdomain of the url.
gbfacetint:gbssNumRedirects	Query on the number of times the url redirected when attempting to index it.
gbssDocId:1234567	Show all the spider status docs for the document with this docId.
gbfacetint:gbssHopCount	Query on the hop count of the document.
gbfacetint:gbssCrawlRound	Query on the crawl round number.
gbssDupOfDocId:123456	Show all the documents that were considered dups of this docId.
gbssPrevTotalNumIndexAttempts:1	Before this index attempt, how many attempts were there?
gbssPrevTotalNumIndexSuccesses:1	Before this index attempt, how many successful attempts were there?
gbssPrevTotalNumIndexFailures:1	Before this index attempt, how many failed attempts were there?
gbrevsortbyint:gbssFirstIndexed	The date in UTC that the document was first indexed.
gbfacetint:gbssContentHash32	The hash of the document content, excluding dates and times. Used internally for deduping.
gbsortbyint:gbssDownloadDurationMS	How long it took in milliseconds to download the document.
gbsortbyint:gbssDownloadStartTime	When the download started, in seconds since the epoch, UTC.
gbsortbyint:gbssDownloadEndTime	When the download ended, in seconds since the epoch, UTC.
gbfacetint:gbssUsedRobotsTxt	This is 0 or 1 depending on if robots.txt was not obeyed or obeyed, respectively.
gbfacetint:gbssConsecutiveErrors	For the last set of indexing attempts how many were errors?
gbssIp:1.2.3.4	The IP address of the document being indexed. Is 0.0.0.0 if unknown.
gbsortby:gbssIpLookupTimeMS	How long it took to lookup the IP of the document. Might have been in the cache.
gbsortby:gbssSiteNumInlinks	How many good inlinks the document's site had.
gbsortby:gbssSiteRank	The site rank of the document. Based directly on the number of inlinks the site had.
gbfacetint:gbssContentInjected	This is 0 or 1 if the content was not injected or injected, respectively.
gbfacetfloat:gbssPercentContentChanged	A float between 0 and 100, inclusive. Represents how much the document has changed since the last time we indexed it. This is only valid if the document was successfully indexed this time.
gbfacetint:gbssSpiderPriority	The spider priority, from 0 to 127, inclusive, of the document according to the url filters table.
gbfacetstr:gbssMatchingUrlFilter	The url filter expression the document matched.
gbfacetstr:gbssLanguage	The language of the document. If document was empty or not downloaded then this will not be present. Uses xx to mean unknown language. Uses the language abbreviations found at the bottom of the url filters page.
gbfacetstr:gbssContentType	The content type of the document. Like html, xml, json, pdf, etc. This field is not present if unknown.
gbsortbyint:gbssContentLen	The content length of the document. 0 if empty or not downloaded.
gbfacetint:gbssCrawlDelay	The crawl delay according to the robots.txt of the document. This is -1 if not specified in the robots.txt or not found.
gbssSentToDiffbotThisTime:1	Was the document's url sent to diffbot for processing this time of spidering the url?
gbssSentToDiffbotAtSomeTime:1	Was the document's url sent to diffbot for processing, either this time or some time before?
gbssDiffbotReplyCode:0	The reply received from diffbot. 0 means success, otherwise, it indicates an error code.
gbfacetstr:gbssDiffbotReplyMsg:0	The reply received from diffbot represented in text.
gbsortbyint:gbssDiffbotReplyLen	The length of the reply received from diffbot.
gbsortbyint:gbssDiffbotReplyResponseTimeMS	The time in milliseconds it took to get a reply from diffbot.
gbfacetint:gbssDiffbotReplyRetries	The number of times we had to resend the request to diffbot because diffbot returned a 504 gateway timed out error.
gbfacetint:gbssDiffbotReplyNumObjects	The number of JSON objects diffbot excavated from the provided url.

Example Query	Description
Boolean Queries
Note: boolean operators must be in UPPER CASE.
cat AND dog	Search results have the word cat AND the word dog in them.
cat OR dog	Search results have the word cat OR the word dog in them, but preference is given to results that have both words.
cat dog OR pig	Search results have the two words cat and dog OR search results have the word pig, but preference is given to results that have all three words. This illustrates how the individual words of one operand are all required for that operand to be true.
"cat dog" OR pig	Search results have the phrase "cat dog" in them OR they have the word pig, but preference is given to results that have both.
title:"cat dog" OR pig	Search results have the phrase "cat dog" in their title OR they have the word pig, but preference is given to results that have both.
cat OR dog OR pig	Search results need only have one word, cat or dog or pig, but preference is given to results that have the most of the words.
cat OR dog AND pig	Search results have dog and pig, but they may or may not have cat. Preference is given to results that have all three. To evaluate expressions with more than two operands, as in this case where we have three, you can divide the expression up into sub-expressions that consist of only one operator each. In this case we would have the following two sub-expressions: cat OR dog and dog AND pig. Then, for the original expression to be true, at least one of the sub-expressions that have an OR operator must be true, and, in addition, all of the sub-expressions that have AND operators must be true. Using this logic you can evaluate expressions with more than one boolean operator.
cat AND NOT dog	Search results have cat but do not have dog.
cat AND NOT (dog OR pig)	Search results have cat but do not have dog and do not have pig. When evaluating a boolean expression that contains ()'s you can evaluate the sub-expression in the ()'s first. So if a document has dog or it has pig or it has both, then the expression, (dog OR pig) would be true. So you could, in this case, substitute true for that expression to get the following: cat AND NOT (true) = cat AND false = false. Does anyone actually read this far?
(cat OR dog) AND NOT (cat AND dog)	Search results have cat or dog but not both.
left-operand OPERATOR right-operand	This is the general format of a boolean expression. The possible operators are: OR and AND. The operands can themselves be boolean expressions and can be optionally enclosed in parentheses. A NOT operator can optionally precede the left or the right operand.
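
The parenthesized examples above can be checked with a small evaluator. This is a sketch only: expressions are written as nested tuples that mirror the left-operand OPERATOR right-operand format, the function name evaluate is hypothetical, and it models neither ranking preference nor how Gigablast treats unparenthesized mixes of AND and OR.

```python
def evaluate(expr, doc_words):
    """Return True if the set of words in a document satisfies the expression.

    An expression is a word, ("NOT", operand), or (operator, left, right)
    where operator is "AND" or "OR".
    """
    if isinstance(expr, str):
        return expr in doc_words
    operator = expr[0]
    if operator == "NOT":
        return not evaluate(expr[1], doc_words)
    left = evaluate(expr[1], doc_words)
    right = evaluate(expr[2], doc_words)
    if operator == "AND":
        return left and right
    if operator == "OR":
        return left or right
    raise ValueError("unknown operator: %r" % (operator,))


# (cat OR dog) AND NOT (cat AND dog)
one_but_not_both = ("AND", ("OR", "cat", "dog"), ("NOT", ("AND", "cat", "dog")))
# cat AND NOT (dog OR pig)
cat_only = ("AND", "cat", ("NOT", ("OR", "dog", "pig")))

print(evaluate(one_but_not_both, {"cat"}))         # True
print(evaluate(one_but_not_both, {"cat", "dog"}))  # False
print(evaluate(cat_only, {"cat", "pig"}))          # False
```

The last call follows the substitution described above: (dog OR pig) is true for the document, so the expression reduces to cat AND NOT (true), which is false.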