|
Basic Query Syntax |
Example Query | Description |
---|
cat dog | Search results have the word cat and the word dog in them. They could also have cats and dogs. |
+cat | Search results have the word cat in them. A result that only has the word cats is not included. The plus sign requests an exact match, so synonyms, hypernyms, hyponyms and other forms of the word are not used. |
mp3 "take five" | Search results have the word mp3 and the exact phrase take five in them. |
"john smith" -"bob dole" | Search results have the phrase john smith but NOT the phrase bob dole in them. |
bmx -game | Search results have the word bmx but not game. |
cat | dog | Matches documents that have cat and dog in them, but only dog affects the ranking score, not cat. This is called a query refinement. |
document.title:paper | That query will match a JSON document like { "document":{"title":"This is a good paper." }} or, alternatively, an XML document like <document><title>This is a good paper</title></document> |
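The basic syntax above can be illustrated with a small tokenizer. This is only a sketch under stated assumptions: `parse_basic_query` is a hypothetical helper, not Gigablast's own parser, and it handles only plain words, `+word`, quoted phrases and `-` exclusions. It does not treat the `|` refinement operator or `field:value` terms specially.

```python
import re

# Hypothetical helper, not Gigablast's parser. Each token is an optional
# + or - sign followed by either a quoted phrase or a run of non-spaces.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_basic_query(query):
    """Sort the terms of a basic query into four buckets."""
    parsed = {"words": [], "exact_words": [], "phrases": [], "excluded": []}
    for sign, phrase, word in TOKEN.findall(query):
        term = phrase if phrase else word
        if not term:
            continue  # skip empty quotes such as ""
        if sign == "-":
            parsed["excluded"].append(term)      # -game, -"bob dole"
        elif sign == "+":
            parsed["exact_words"].append(term)   # +cat
        elif phrase:
            parsed["phrases"].append(term)       # "take five"
        else:
            parsed["words"].append(term)         # cat dog
    return parsed
```

For example, `parse_basic_query('bmx -game')` places bmx under `words` and game under `excluded`.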
Advanced Query Operators |
Example Query | Description |
gbfieldmatch:strings.vendor:"My Vendor Inc." | Matches all the meta tag, JSON or XML fields that have the name "strings.vendor" and contain exactly the provided value, in this case My Vendor Inc. (including the trailing period). The match is CASE SENSITIVE and includes punctuation, so it is an exact match. The term list should generally be very short, so the query should be fast. |
url:www.abc.com/page.html | Matches the page with that exact url. Uses the first url, not the url it redirects to, if any. |
ext:doc | Match documents whose url ends in the .doc file extension. |
link:www.gigablast.com/foo.html | Matches all the documents that have a link to http://www.gigablast.com/foo.html |
sitelink:abc.foobar.com | Matches all documents that link to any page on the abc.foobar.com site. |
site:mysite.com | Matches all documents on the mysite.com domain. |
site:www.mysite.com/dir1/dir2/ | Matches all documents whose url starts with www.mysite.com/dir1/dir2/ |
ip:1.2.3.4 | Matches all documents whose IP is 1.2.3.4. |
ip:1.2.3 | Matches all documents whose IP STARTS with 1.2.3. |
inurl:dog | Matches all documents that have the word dog in their url, like http://www.mysite.com/dog/food.html. It will not match http://www.mysite.com/dogfood.html, because dog is not an individual word there. The word must be delineated by punctuation. |
suburl:dog | Same as inurl. |
title:cat | Matches all the documents that have the word cat in their title. |
title:"cat food" | Matches all the documents that have the phrase "cat food" in their title. |
intitle:cat | Same as title:cat. |
gbinrss:1 | Matches all documents that are in RSS feeds. Likewise, use gbinrss:0 to match all documents that are NOT in RSS feeds. |
type:json | Matches all documents that are in JSON format. Other possible types include html, text, xml, pdf, doc, xls, ppt, ps, css, json, status. status matches special documents that are stored every time a url is spidered so you can see all the spider attempts and when they occurred as well as the outcome. |
filetype:json | Same as type: above. |
gbisadult:1 | Matches all documents that have been detected as adult documents and may be unsuitable for children. Likewise, use gbisadult:0 to match all documents that were NOT detected as adult documents. |
gbimage:site.com/image.jpg | Matches all documents that contain the specified image. |
gbhasthumbnail:1 | Matches all documents for which Gigablast detected a thumbnail. Likewise use gbhasthumbnail:0 to match all documents that do not have thumbnails. |
gbtag* | Matches all documents whose tag named * has the specified value in the tagdb entry for the url. Example: gbtagsitenuminlinks:2 matches all documents that have 2 qualified inlinks pointing to their site based on the tagdb record. You can also provide your own tags in addition to the tags already present. See the tagdb menu for more information. |
gbzip:90210 | Matches all documents that have the specified zip code in their meta zip code tag. |
gbcharset:windows-1252 | Matches all documents originally in the Windows-1252 charset. Available character sets are listed in the iana_charset.cpp file in the open source distribution. There are a lot. Some more popular ones are: us, latin1, iso-8859-1, csascii, ascii, latin2, latin3, latin4, greek, utf-8, shift_jis. |
gblang:de | Matches all documents in German. The supported language abbreviations are at the bottom of the url filters page. Some more common ones are gblang:en, gblang:es, gblang:fr, gblang:"zh_cn" (note the quotes for zh_cn!). |
gbpathdepth:3 | Matches all documents whose url has 3 path components to it like http://somedomain.com/dir1/dir2/dir3/foo.html |
gbhopcount:2 | Matches all documents that are a minimum of two link hops away from a root url. |
gbhasfilename:1 | Matches all documents whose url ends in a filename like http://somedomain.com/dir1/myfile and not http://somedomain.com/dir1/dir2/. Likewise, use gbhasfilename:0 to match all the documents that do not have a filename in their url. |
gbiscgi:1 | Matches all documents that have a question mark in their url. Likewise gbiscgi:0 matches all documents that do not. |
gbhasext:1 | Matches all documents that have a file extension in their url. Likewise, gbhasext:0 matches all documents that do not have a file extension in their url. |
gbsubmiturl:domain.com/process.php | Matches all documents that have a form that submits to the specified url. |
gbparenturl:www.xyz.com/abc.html | Diffbot only. Matches the JSON urls that were extracted from this parent url. Example: gbparenturl:www.gigablast.com/addurl.htm |
gbcountry:us | Matches documents determined by Gigablast to be from the United States. See the country abbreviations in the CountryCode.cpp file in the open source distribution. Some more popular examples include: de, fr, uk, ca, cn. |
gbpermalink:1 | Matches documents that are permalinks. Use gbpermalink:0 to match documents that are NOT permalinks. |
gbdocid:123456 | Matches the document with the docid 123456 |
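Several of the url operators above (inurl, gbpathdepth, gbhasfilename, gbiscgi) describe properties that can be derived from the url string alone. The sketch below shows one way such properties could be computed. The function names are hypothetical and the rules are inferred only from the examples in the table, so treat this as an illustration rather than Gigablast's implementation.

```python
import re
from urllib.parse import urlsplit

def url_words(url):
    """Split a url into lowercase words at every non-alphanumeric character."""
    return [w for w in re.split(r"[^0-9A-Za-z]+", url.lower()) if w]

def matches_inurl(url, word):
    """True if word appears in the url as an individual word (cf. inurl:)."""
    return word.lower() in url_words(url)

def path_depth(url):
    """Count directory components before the last path segment.

    Assumption: the filename itself is not counted, which agrees with the
    gbpathdepth:3 example (dir1/dir2/dir3/foo.html).
    """
    segments = urlsplit(url).path.split("/")
    # segments[0] is the empty string before the leading slash and
    # segments[-1] is the filename (or empty when the path ends in a slash).
    return len([s for s in segments[1:-1] if s])

def has_filename(url):
    """True if the url path does not end in a slash (cf. gbhasfilename:)."""
    last_segment = urlsplit(url).path.split("/")[-1]
    return last_segment != ""

def is_cgi(url):
    """True if the url contains a question mark (cf. gbiscgi:)."""
    return "?" in url
```

With these definitions, `matches_inurl("http://www.mysite.com/dogfood.html", "dog")` is false because dogfood is a single word once the url is split on punctuation.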
Numeric Field Query Operators |
Example Query | Description |
cameras gbsortbyfloat:price | Sort all documents that contain 'camera' by price. price can be a root JSON field or in a meta tag, or in an xml <price> tag. |
cameras gbsortbyfloat:product.price | Sort all documents that contain 'camera' by price. price can be in a JSON document like { "product":{"price":1500.00}} or, alternatively, an XML document like <product><price>1500.00</price></product> |
cameras gbrevsortbyfloat:product.price | Like above example but sorted with highest prices on top. |
pilots gbsortbyint:employees | Sort all documents that contain 'pilots' by employees. employees can be a root JSON field or in a meta tag, or in an xml <employees> tag. The value it contains is interpreted as a 32-bit integer. |
gbsortbyint:gbdocspiderdate | Sort all documents by the date they were spidered/downloaded. |
gbsortbyint:company.employees | Sort all documents by employees. Documents can contain employees in a JSON document like { "company":{"employees":13}} or, alternatively, an XML document like <company><employees>13</employees></company> |
gbsortbyint:gbsitenuminlinks | Sort all documents by the number of distinct inlinks the document's site has. |
gbrevsortbyint:gbdocspiderdate | Sort all documents by the date they were spidered/downloaded but with the oldest on top. |
cameras gbminfloat:price:109.99 | Matches all documents that contain 'camera' or 'cameras' and have a price of at least 109.99. price can be a root JSON field or in a meta tag named price, or in an xml <price> tag. |
cameras gbminfloat:product.price:109.99 | Matches all documents that contain 'camera' or 'cameras' and have a price of at least 109.99 in a JSON document like { "product":{"price":1500.00}} or, alternatively, an XML document like <product><price>1500.00</price></product> |
cameras gbmaxfloat:price:109.99 | Like the gbminfloat examples above, but is an upper bound. |
gbequalfloat:product.price:1.23 | Similar to gbminfloat and gbmaxfloat but is an equality constraint. |
gbminint:gbspiderdate:1391749680 | Matches all documents with a spider timestamp of at least 1391749680. Use this as opposed to gbminfloat when you need 32 bits of integer precision. |
gbmaxint:company.employees:20 | Matches all companies with 20 or fewer employees in a JSON document like { "company":{"employees":13}} or, alternatively, an XML document like <company><employees>13</employees></company> |
gbequalint:company.employees:13 | Similar to gbminint and gbmaxint but is an equality constraint. |
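The dotted field names used by the numeric operators (product.price, company.employees) name a path into a nested document. The following sketch shows that lookup, plus a lower-bound filter and a sort, over plain Python dicts standing in for JSON documents. All function names are hypothetical, documents without the field are simply dropped (an assumption), and the sort direction is left as a parameter because this sketch makes no claim about the engine's default order.

```python
def get_field(doc, dotted_name):
    """Follow a dotted field name such as 'product.price' into nested dicts."""
    value = doc
    for key in dotted_name.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def filter_min_float(docs, field, minimum):
    """Keep documents whose field is a number >= minimum (cf. gbminfloat)."""
    kept = []
    for doc in docs:
        value = get_field(doc, field)
        if isinstance(value, (int, float)) and value >= minimum:
            kept.append(doc)
    return kept

def sort_by_float(docs, field, reverse=False):
    """Sort documents that have a numeric field by its value."""
    with_field = [d for d in docs if isinstance(get_field(d, field), (int, float))]
    return sorted(with_field, key=lambda d: get_field(d, field), reverse=reverse)

docs = [
    {"product": {"price": 1500.00}},
    {"product": {"price": 99.5}},
    {"product": {"price": 109.99}},
    {"title": "no price here"},
]
at_least = filter_min_float(docs, "product.price", 109.99)
ordered = sort_by_float(docs, "product.price")
```

Here `at_least` keeps the 1500.00 and 109.99 documents, and `ordered` lists the three priced documents from lowest to highest price.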
Facet Related Query Operators |
Example Query | Description |
gbfacetstr:color | Returns facets in the search results by their color field. color is case INsensitive. |
gbfacetstr:product.color | Returns facets in the color field in a JSON document like { "product":{"color":"red"}} or, alternatively, an XML document like <product><color>red</color></product>. product.color is case INsensitive. |
gbfacetstr:gbtagsite cat | Returns facets from the site names of all pages that contain the word 'cat' or 'cats', etc. gbtagsite is case insensitive. |
gbfacetint:product.cores | Returns facets of the cores field in a JSON document like { "product":{"cores":10}} or, alternatively, an XML document like <product><cores>10</cores></product>. product.cores is case INsensitive. |
gbfacetint:gbhopcount | Returns facets of the gbhopcount field over the documents so you can search the distribution of hopcounts over the index. gbhopcount is case INsensitive. |
gbfacetint:gbtagsitenuminlinks | Returns facets of the sitenuminlinks tag stored in tagdb for each site. Any numeric tag in tagdb can be faceted in this manner, so you can add your own facets on a per site or per url basis by making tagdb entries. Case insensitive. |
gbfacetint:size,0-10,10-20,30-100,100-200,200-1000,1000-10000 | Returns facets of the size field (either in a JSON field or a meta tag) and clusters the results into the specified ranges. size is case INsensitive. |
gbfacetint:gbsitenuminlinks | Returns facets based on # of site inlinks the site of each result has. gbsitenuminlinks is case INsensitive. |
gbfacetfloat:product.weight | Returns facets of the weight field in a JSON document like { "product":{"weight":1.45}} or, alternatively, an XML document like <product><weight>1.45</weight></product>. product.weight is case INsensitive. |
gbfacetfloat:product.price,0-1.5,1.5-5,5.0-20,20-100.0 | Similar to above but clusters the prices into the specified ranges. product.price is case insensitive. |
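The range form of the facet operators (a field name followed by comma-separated low-high pairs) can be illustrated with a small bucketing sketch. The function names are hypothetical. The sketch assumes each range includes its lower bound and excludes its upper bound, which the table above does not specify, and it does not handle negative bounds.

```python
def parse_facet_ranges(operator_value):
    """Split 'size,0-10,10-20' into ('size', [(0.0, 10.0), (10.0, 20.0)])."""
    field, *range_specs = operator_value.split(",")
    ranges = []
    for spec in range_specs:
        low, high = spec.split("-")  # assumes non-negative bounds
        ranges.append((float(low), float(high)))
    return field, ranges

def count_by_range(values, ranges):
    """Count how many values fall in each range (assumed low <= v < high)."""
    counts = {r: 0 for r in ranges}
    for value in values:
        for low, high in ranges:
            if low <= value < high:
                counts[(low, high)] += 1
                break
    return counts

field, ranges = parse_facet_ranges(
    "size,0-10,10-20,30-100,100-200,200-1000,1000-10000")
counts = count_by_range([5, 10, 25, 150, 5000], ranges)
```

Note that the example ranges leave a gap between 20 and 30, so under this sketch a value of 25 is not counted in any bucket.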
Spider Status Documents |
Example Query | Description |
gbssUrl:com | Query the url of a spider status document. |
gbssFinalRedirectUrl:abc.com/page2.html | Query on the last url redirected to, if any. |
gbssStatusCode:0 | Query on the status code of the index attempt. 0 means no error. |
gbssStatusMsg:"Tcp timed" | Like gbssStatusCode but a textual representation. |
gbssHttpStatus:200 | Query on the HTTP status returned from the web server. |
gbssWasIndexed:0 | Was the document in the index before attempting to index? Use 0 or 1 to find all documents that were not or were, respectively. |
gbssIsDiffbotObject:1 | This field is only present if the document was an object from a diffbot reply. Use gbssIsDiffbotObject:0 to find the non-diffbot objects. |
gbsortby:gbssAgeInIndex | If the document was in the index at the time we attempted to reindex it, how long has it been since it was last indexed? |
gbssDomain:yahoo.com | Query on the domain of the url. |
gbssSubdomain:www.yahoo.com | Query on the subdomain of the url. |
gbfacetint:gbssNumRedirects | Query on the number of times the url redirected when attempting to index it. |
gbssDocId:1234567 | Show all the spider status docs for the document with this docId. |
gbfacetint:gbssHopCount | Query on the hop count of the document. |
gbfacetint:gbssCrawlRound | Query on the crawl round number. |
gbssDupOfDocId:123456 | Show all the documents that were considered dups of this docId. |
gbssPrevTotalNumIndexAttempts:1 | Before this index attempt, how many attempts were there? |
gbssPrevTotalNumIndexSuccesses:1 | Before this index attempt, how many successful attempts were there? |
gbssPrevTotalNumIndexFailures:1 | Before this index attempt, how many failed attempts were there? |
gbrevsortbyint:gbssFirstIndexed | The date in UTC that the document was first indexed. |
gbfacetint:gbssContentHash32 | The hash of the document content, excluding dates and times. Used internally for deduping. |
gbsortbyint:gbssDownloadDurationMS | How long it took in milliseconds to download the document. |
gbsortbyint:gbssDownloadStartTime | When the download started, in seconds since the epoch, UTC. |
gbsortbyint:gbssDownloadEndTime | When the download ended, in seconds since the epoch, UTC. |
gbfacetint:gbssUsedRobotsTxt | This is 0 or 1 depending on if robots.txt was not obeyed or obeyed, respectively. |
gbfacetint:gbssConsecutiveErrors | For the last set of indexing attempts how many were errors? |
gbssIp:1.2.3.4 | The IP address of the document being indexed. Is 0.0.0.0 if unknown. |
gbsortby:gbssIpLookupTimeMS | How long it took to look up the IP of the document. Might have been in the cache. |
gbsortby:gbssSiteNumInlinks | How many good inlinks the document's site had. |
gbsortby:gbssSiteRank | The site rank of the document. Based directly on the number of inlinks the site had. |
gbfacetint:gbssContentInjected | This is 0 or 1 if the content was not injected or injected, respectively. |
gbfacetfloat:gbssPercentContentChanged | A float between 0 and 100, inclusive. Represents how much the document has changed since the last time we indexed it. This is only valid if the document was successfully indexed this time. |
gbfacetint:gbssSpiderPriority | The spider priority, from 0 to 127, inclusive, of the document according to the url filters table. |
gbfacetstr:gbssMatchingUrlFilter | The url filter expression the document matched. |
gbfacetstr:gbssLanguage | The language of the document. If document was empty or not downloaded then this will not be present. Uses xx to mean unknown language. Uses the language abbreviations found at the bottom of the url filters page. |
gbfacetstr:gbssContentType | The content type of the document. Like html, xml, json, pdf, etc. This field is not present if unknown. |
gbsortbyint:gbssContentLen | The content length of the document. 0 if empty or not downloaded. |
gbfacetint:gbssCrawlDelay | The crawl delay according to the robots.txt of the document. This is -1 if not specified in the robots.txt or not found. |
gbssSentToDiffbotThisTime:1 | Was the document's url sent to diffbot for processing this time of spidering the url? |
gbssSentToDiffbotAtSomeTime:1 | Was the document's url sent to diffbot for processing, either this time or some time before? |
gbssDiffbotReplyCode:0 | The reply received from diffbot. 0 means success, otherwise, it indicates an error code. |
gbfacetstr:gbssDiffbotReplyMsg:0 | The reply received from diffbot represented in text. |
gbsortbyint:gbssDiffbotReplyLen | The length of the reply received from diffbot. |
gbsortbyint:gbssDiffbotReplyResponseTimeMS | The time in milliseconds it took to get a reply from diffbot. |
gbfacetint:gbssDiffbotReplyRetries | The number of times we had to resend the request to diffbot because diffbot returned a 504 gateway timed out error. |
gbfacetint:gbssDiffbotReplyNumObjects | The number of JSON objects diffbot excavated from the provided url. |
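Queries against spider status documents are built from operator:value terms like the ones in the table above, with quotes around values that contain spaces (as in gbssStatusMsg:"Tcp timed"). The sketch below composes such query strings. Both helpers are hypothetical, and joining several terms with spaces assumes they combine the same way plain words do in the basic syntax, which this page does not state explicitly for gbss fields.

```python
def field_term(operator, value):
    """Format one operator:value term, quoting values that contain spaces."""
    text = str(value)
    if " " in text:
        text = '"' + text + '"'
    return operator + ":" + text

def build_query(*terms):
    """Join already formatted terms into one query string."""
    return " ".join(terms)

query = build_query(
    field_term("gbssDomain", "yahoo.com"),
    field_term("gbssStatusCode", 0),
    "gbsortbyint:gbssDownloadDurationMS",
)
```

The resulting string is `gbssDomain:yahoo.com gbssStatusCode:0 gbsortbyint:gbssDownloadDurationMS`.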
Boolean Queries |
Example Query | Description |
Note: boolean operators must be in UPPER CASE. |
cat AND dog | Search results have the word cat AND the word dog in them. |
cat OR dog | Search results have the word cat OR the word dog in them, but preference is given to results that have both words. |
cat dog OR pig | Search results have the two words cat and dog OR search results have the word pig, but preference is given to results that have all three words. This illustrates how the individual words of one operand are all required for that operand to be true. |
"cat dog" OR pig | Search results have the phrase "cat dog" in them OR they have the word pig, but preference is given to results that have both. |
title:"cat dog" OR pig | Search results have the phrase "cat dog" in their title OR they have the word pig, but preference is given to results that have both. |
cat OR dog OR pig | Search results need only have one word, cat or dog or pig, but preference is given to results that have the most of the words. |
cat OR dog AND pig | Search results have dog and pig, but they may or may not have cat. Preference is given to results that have all three. To evaluate expressions with more than two operands, as in this case where we have three, you can divide the expression up into sub-expressions that consist of only one operator each. In this case we would have the following two sub-expressions: cat OR dog and dog AND pig. Then, for the original expression to be true, at least one of the sub-expressions that have an OR operator must be true, and, in addition, all of the sub-expressions that have AND operators must be true. Using this logic you can evaluate expressions with more than one boolean operator. |
cat AND NOT dog | Search results have cat but do not have dog. |
cat AND NOT (dog OR pig) | Search results have cat but do not have dog and do not have pig. When evaluating a boolean expression that contains ()'s you can evaluate the sub-expression in the ()'s first. So if a document has dog or it has pig or it has both, then the expression (dog OR pig) would be true. So you could, in this case, substitute true for that expression to get the following: cat AND NOT (true) = cat AND false = false. Does anyone actually read this far? |
(cat OR dog) AND NOT (cat AND dog) | Search results have cat or dog but not both. |
left-operand OPERATOR right-operand | This is the general format of a boolean expression. The possible operators are: OR and AND. The operands can themselves be boolean expressions and can be optionally enclosed in parentheses. A NOT operator can optionally precede the left or the right operand. |
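The sub-expression rule described for cat OR dog AND pig can be written out directly. The sketch below is a hypothetical illustration of that rule only, not Gigablast's evaluator: it handles flat expressions made of words joined by AND and OR, and it does not support NOT, parentheses, phrases or field operators. It assumes the document is given as a set of lowercase words.

```python
def evaluate_flat_boolean(expression, document_words):
    """Apply the adjacent sub-expression rule to a flat AND/OR expression."""
    operands, operators, current = [], [], []
    for token in expression.split():
        if token in ("AND", "OR"):   # operators must be UPPER CASE
            operands.append(current)
            operators.append(token)
            current = []
        else:
            current.append(token.lower())
    operands.append(current)

    # An operand is true only if the document has every one of its words.
    truth = [all(w in document_words for w in words) for words in operands]
    if not operators:
        return truth[0]

    # Each operator forms one sub-expression with its two neighbours,
    # e.g. "cat OR dog AND pig" gives (cat OR dog) and (dog AND pig).
    and_results, or_results = [], []
    for i, op in enumerate(operators):
        if op == "AND":
            and_results.append(truth[i] and truth[i + 1])
        else:
            or_results.append(truth[i] or truth[i + 1])

    # All AND sub-expressions must hold, and at least one OR sub-expression
    # must hold when any are present.
    return all(and_results) and (not or_results or any(or_results))
```

For example, `evaluate_flat_boolean("cat OR dog AND pig", {"dog", "pig"})` is true, while the same expression against `{"cat", "pig"}` is false because the sub-expression dog AND pig fails.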
|