Gigablast - api









NOTE: All APIs support both the GET and POST methods. If the size of your request is more than 2K you should use POST.

NOTE: All APIs support both http and https protocols.
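The two notes above can be sketched as a small request builder that switches from GET to POST once the encoded parameters grow past 2K. This is only an illustration: the host `gigablast.example`, the collection name `main`, the reading of "2K" as 2048 bytes, and the helper name `build_request` are assumptions, not part of the API.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: "2K" is interpreted as 2048 bytes of URL-encoded parameters.
MAX_GET_BYTES = 2048

def build_request(base_url, path, params):
    """Return a urllib Request: GET for small requests, POST for large ones."""
    encoded = urlencode(params).encode("utf-8")
    if len(encoded) > MAX_GET_BYTES:
        # Large request: send the parameters as form data in the POST body.
        return Request(base_url + path, data=encoded, method="POST")
    return Request(base_url + path + "?" + encoded.decode("utf-8"), method="GET")

# Hypothetical host and collection, used only for demonstration.
small = build_request("https://gigablast.example", "/search", {"q": "test", "c": "main"})
large = build_request("https://gigablast.example", "/search", {"q": "x" * 3000, "c": "main"})
print(small.get_method(), small.full_url)
print(large.get_method(), len(large.data))
```

The request objects are only constructed here, not sent; passing one to `urllib.request.urlopen` would perform the call.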

API by pages
/search - search results page

Input
# | Parm | Type | Title | Default Value | Description
1 | format | STRING | output format | html | Display output in this format. Can be html, json or xml.
2 | showinput | BOOL (0 or 1) | show input and settings | 1 | Display possible input and the values of all settings on this page.
3 | fast | STRING | fast results | 0 | Use &fast=1 to obtain search results from the much faster Gigablast index, although the results are not searched as thoroughly.
4 | q | STRING | query | | The query to perform. See help. See the query operators below for more info. REQUIRED
5 | c | STRING | collection | | Search this collection. Use multiple collection names separated by whitespace to search multiple collections at once. REQUIRED
6 | n | INT32 | number of results per query | 10 | The number of results returned per page.
7 | s | INT32 | first result num | 0 | Start displaying at search result #X. Starts at 0.
8 | showerrors | BOOL (0 or 1) | show errors | 0 | Show errors from generating search result summaries rather than just hiding the docid. Useful for debugging.
9 | sc | BOOL (0 or 1) | site cluster | 0 | Should search results be site clustered? This limits each site to appearing at most twice in the search results. Sites are subdomains for the most part, like abc.xyz.com.
10 | hacr | BOOL (0 or 1) | hide all clustered results | 0 | Only display at most one result per site.
11 | dr | BOOL (0 or 1) | dedup results | 0 | Should duplicate search results be removed? This is based on a content hash of the entire document, so documents must be exactly the same for the most part.
12 | pss | INT32 | percent similar dedup summary | | If a document's summary (and title) are this percent similar to a document summary above it, then remove it from the search results. 100 means only remove if exactly the same. 0 means no summary deduping. You must also supply dr=1 for this to work.
13 | ddu | BOOL (0 or 1) | dedup URLs | 0 | Should we dedup URLs with case insensitivity? This is mainly to correct duplicate wiki pages.
14 | spell | BOOL (0 or 1) | do spell checking | 1 | If enabled while using the XML feed, when Gigablast finds a spelling recommendation it will be included in the XML tag. Default is 0 if using an XML feed, 1 otherwise. Will be available again soon.
15 | stream | CHAR | stream search results | 0 | Stream search results back on the socket as they arrive. Useful when thousands or millions of search results are requested, and required in that case because otherwise Gigablast could run out of memory. Only supported for JSON and XML formats, not HTML.
16 | secsback | INT32 | seconds back | 0 | Limit results to pages spidered this many seconds ago. Use 0 to disable.
17 | sortby | CHAR | sort by | 0 | Use 0 to sort results by relevance, 1 to sort by most recent spider date down, and 2 to sort by oldest spidered results first.
18 | filetype | STRING | filetype | | Restrict results to this filetype. Supported filetypes are pdf, doc, html, xml, json, xls.
19 | scores | BOOL (0 or 1) | get scoring info | | Get scoring information for each result so you can see how each result is scored. You must explicitly request this using &scores=1 for the XML feed because it is not included by default.
20 | qe | BOOL (0 or 1) | do query expansion | | If enabled, query expansion will expand your query to include the various forms and synonyms of the query terms.
21 | uip | STRING | user ip | | The IP address of the searcher. It can be passed along for use by the autoban technology, which bans abusive IPs.
22 | nf | INT32 | max number of facets to return | 50 | Max number of facets to return.
23 | qlang | STRING | sort language preference | | Default language to use for ranking results. Value should be any language abbreviation, for example "en" for English. Use xx to give ranking boosts to no language in particular. See the language abbreviations at the bottom of the url filters page.
24 | langw | FLOAT32 | language weight | | Use this to override the default language weight for this collection. The default language weight can be set in the search controls and is usually something like 20.0, which means that a result's score is multiplied by 20 if it is from the same language as the query or its language is unknown.
25 | tml | INT32 | max title len | | What is the maximum number of characters allowed in titles displayed in the search results?
26 | ns | INT32 | number of summary excerpts | | How many summary excerpts to display per search result?
27 | sw | INT32 | max summary line width | | <br> tags are inserted to keep the number of chars in the summary per line at or below this width. Also affects the title. Strings without spaces that exceed this width are not split. Has no effect on the xml or json feed; only works on html.
28 | smxcpl | INT32 | max summary excerpt length | | What is the maximum number of characters allowed per summary excerpt?
29 | dsrt | INT32 | results to scan for gigabits generation | | How many search results should we scan for gigabit (related topics) generation? Set this to zero to disable gigabits.
30 | ipr | BOOL (0 or 1) | ip restriction for gigabits | | Should Gigablast only get one document per IP domain and per domain for gigabits (related topics) generation?
31 | nrt | INT32 | number of gigabits to show | 11 | What is the number of gigabits (related topics) displayed per query? Set to 0 to save a little CPU time.
32 | mts | INT32 | min topics score | | Gigabits (related topics) with scores below this will be excluded. Scores range from 0% to over 100%.
33 | mdc | INT32 | min gigabit doc count by default | 2 | How many documents must contain the gigabit (related topic) in order for it to be displayed.
34 | dsp | INT32 | dedup doc percent for gigabits (related topics) | 80 | If a document is this percent similar to another document with a higher score, then it will not contribute to the gigabit generation.
35 | mwpt | INT32 | max words per gigabit (related topic) by default | 6 | Maximum number of words a gigabit (related topic) can have. Affects xml feeds, too.
36 | showimages | BOOL (0 or 1) | show images | 1 | Should we return or show the thumbnail images in the search results?
37 | usecache | CHAR | use cache | -1 | Use 0 if Gigablast should not read or write from any caches at any level.
38 | rcache | BOOL (0 or 1) | read from cache | 1 | Should we read search results from the cache? Set to false to fix dmoz bug.
39 | wcache | CHAR | write to cache | -1 | Use 0 if Gigablast should not write to any caches at any level.
40 | minserpdocid | INT64 | max serp docid | 0 | Start displaying results after this score/docid pair. Used by the widget to append results to the end when the index is volatile.
41 | maxserpscore | FLOAT64 | max serp score | 0 | Start displaying results after this score/docid pair. Used by the widget to append results to the end when the index is volatile.
42 | link | STRING | restrict search to pages that link to this url | | The url which the pages must link to.
43 | sites | STRING | restrict results to these sites | | Returned results will have URLs from this space-separated list of sites. Can have up to 200 sites. A site can include sub folders. This allows you to build a Custom Topic Search Engine.
44 | ff | BOOL (0 or 1) | family filter | 0 | Remove objectionable results if this is enabled.
45 | qh | BOOL (0 or 1) | highlight query terms in summaries | 1 | Use to disable or enable highlighting of the query terms in the summaries.
46 | hq | STRING | cached page highlight query | | Highlight the terms in this query instead.
47 | bq | INT32 | boolean status | 2 | Can be 0 or 1 or 2. 0 means the query is NOT boolean, 1 means the query is boolean and 2 means to auto-detect.
48 | dt | STRING | meta tags to display | | A space-separated string of meta tag names. Do not forget to url-encode the spaces to +'s or %20's. Gigablast will extract the contents of these specified meta tags out of the pages listed in the search results and display that content after each summary. For example, &dt=description will display the meta description of each search result. &dt=description:32+keywords:64 will display the meta description and meta keywords of each search result and limit the fields to 32 and 64 characters respectively. When used in an XML feed the <display name="meta_tag_name">meta_tag_content</> XML tag will be used to convey each requested meta tag's content.
49 | niceness | INT32 | niceness | 0 | Can be 0 or 1. 0 is usually a faster, high-priority query, 1 is a slower, lower-priority query.
50 | debug | BOOL (0 or 1) | debug flag | 0 | Is 1 to log debug information, 0 otherwise.
51 | rdc | BOOL (0 or 1) | return number of docs per topic | 1 | Use 1 if you want Gigablast to return the number of documents in the search results that contained each topic (gigabit).
52 | rd | BOOL (0 or 1) | return docids per topic | 0 | Use 1 if you want Gigablast to return the list of docIds from the search results that contained each topic (gigabit).
53 | debuggigabits | BOOL (0 or 1) | debug gigabits flag | 0 | Is 1 to log gigabits debug information, 0 otherwise.
54 | dio | BOOL (0 or 1) | return docids only | 0 | Is 1 to return only docids as query results.
55 | admin | BOOL (0 or 1) | admin override | 1 | Admin override.
56 | prepend | STRING | prepend | | Prepend this to the supplied query, followed by a | character.
57 | sb | BOOL (0 or 1) | show banned pages | 0 | Show banned pages.
58 | icc | INT32 | include cached copy of page | 0 | Will cause a cached copy of content to be returned instead of summary.
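As an illustration of how the input parameters combine, the sketch below builds a /search URL from q, c, n and s and rejects pss without dr=1, as the pss description requires. The host `gigablast.example`, the collection name `main`, and the helper name `search_url` are hypothetical; mapping a page number to s as page times n is an assumption based on s starting at 0.

```python
from urllib.parse import urlencode

def search_url(base_url, query, collection, page=0, per_page=10, **extra):
    """Build a /search URL. q and c are the parameters marked REQUIRED."""
    if not query or not collection:
        raise ValueError("q and c are required")
    if extra.get("pss") and not extra.get("dr"):
        # Per the pss description, summary deduping only works with dr=1.
        raise ValueError("pss requires dr=1")
    params = {
        "q": query,
        "c": collection,
        "format": "json",
        "n": per_page,          # results per page
        "s": page * per_page,   # first result number, starting at 0
    }
    params.update(extra)
    return base_url + "/search?" + urlencode(params)

# Hypothetical host and collection, for demonstration only.
url = search_url("https://gigablast.example", "solar power", "main",
                 page=2, sc=1, dr=1, pss=90)
print(url)
```

`urlencode` takes care of encoding spaces in the query as `+`, which matches the url-encoding advice given for the dt parameter.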

Example XML Output (&format=xml)
<response>
	<statusCode>0</statusCode>
	<statusMsg>Success</statusMsg>
	<currentTimeUTC>1404513734</currentTimeUTC>
	<responseTimeMS>284</responseTimeMS>
	<docsInCollection>226</docsInCollection>
	<hits>193</hits>
	<moreResultsFollow>1</moreResultsFollow>
	<result>
		<imageBase64>/9j/4AAQSkZJRgABAQAAAQABA...</imageBase64>
		<imageHeight>350</imageHeight>
		<imageWidth>223</imageWidth>
		<origImageHeight>470</origImageHeight>
		<origImageWidth>300</origImageWidth>
		<title><![CDATA[U.S....]]></title>
		<sum>Department of the Interior protects America's natural resources and</sum>
		<url><![CDATA[www.doi.gov]]></url>
		<size>  64k</size>
		<docId>34111603247</docId>
		<site>www.doi.gov</site>
		<spidered>1404512549</spidered>
		<firstIndexedDateUTC>1404512549</firstIndexedDateUTC>
		<contentHash32>2680492249</contentHash32>
		<language>English</language>
	</result>
</response>
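A response shaped like the XML example above could be read with Python's standard library roughly as follows. This is a sketch under the assumption that real responses use the same element names as the example; the `SAMPLE` string is a shortened copy of that example and `parse_results` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example response above.
SAMPLE = """<response>
  <statusCode>0</statusCode>
  <statusMsg>Success</statusMsg>
  <hits>193</hits>
  <moreResultsFollow>1</moreResultsFollow>
  <result>
    <title><![CDATA[U.S....]]></title>
    <url><![CDATA[www.doi.gov]]></url>
    <docId>34111603247</docId>
    <spidered>1404512549</spidered>
  </result>
</response>"""

def parse_results(xml_text):
    """Return a list of dicts, one per <result>, or raise on a non-zero status."""
    root = ET.fromstring(xml_text)
    if root.findtext("statusCode") != "0":
        raise RuntimeError(root.findtext("statusMsg"))
    return [
        {
            "title": r.findtext("title"),   # CDATA is unwrapped by the parser
            "url": r.findtext("url"),
            "docId": int(r.findtext("docId")),
        }
        for r in root.findall("result")
    ]

results = parse_results(SAMPLE)
print(results)
```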

Example JSON Output (&format=json)
{ "response":{

	# This is zero on a successful query. 
	# Otherwise it will be a non-zero number 
	# indicating the error code.
	"statusCode":0,

	# Similar to above, this is "Success" 
	# on a successful query. Otherwise it 
	# will indicate an error message 
	# corresponding to the statusCode above.
	"statusMsg":"Success",

	# This is the current time in UTC in 
	# unix timestamp format (seconds since 
	# the epoch) that the server has when 
	# generating this JSON response.
	"currentTimeUTC":1404588231,

	# This is how long it took in 
	# milliseconds to generate the JSON 
	# response from reception of the request.
	"responseTimeMS":312,

	# This is how many matches were 
	# excluded from the search results 
	# because they were considered 
	# duplicates, banned, had errors 
	# generating the summary, or were from 
	# an over-represented site. To show them 
	# use the &sc &dr &pss &sb and 
	# &showerrors input parameters described 
	# above.
	"numResultsOmitted":3,

	# This is how many shards failed to 
	# return results. Gigablast gets results 
	# from multiple shards (computers) and 
	# merges them to get the final result 
	# set. Sometimes a shard is down or 
	# malfunctioning so it will not 
	# contribute to the results. So if this 
	# number is non-zero then you had such a 
	# shard.
	"numShardsSkipped":0,

	# This is how many shards are ideally 
	# in use by Gigablast to generate search 
	# results.
	"totalShards":159,

	# This is how many total documents are 
	# in the collection being searched.
	"docsInCollection":226,

	# This is how many of those documents 
	# matched the query.
	"hits":193,

	# This is 1 if more search results are 
	# available, otherwise it is 0.
	"moreResultsFollow":1,

	# Start of query-based information.
	"queryInfo":{

		# The entire query that was received, 
		# represented as a single string.
		"fullQuery":"test",

		# The language of the query. This is 
		# the 'preferred' language of the search 
		# results. It reflects the &qlang 
		# input parameter described above. Search 
		# results in this language (or an unknown 
		# language) will receive a large boost. 
		# The boost is multiplicative. The 
		# default boost size can be overridden 
		# using the &langw input parameter 
		# described above. This language 
		# abbreviation here is usually 2 letter, 
		# but can be more, like in the case of 
		# zh-cn, for example.
		"queryLanguageAbbr":"en",

		# The language of the query. Just 
		# like above but the language is spelled 
		# out. It may be multiple words.
		"queryLanguage":"English",

		# List of space-separated words in 
		# the query that were ignored for the 
		# most part because they are common 
		# words in the query language.
		"ignoredWords":"to the",

		# There is a maximum limit placed on 
		# the number of query terms we search on 
		# to keep things fast. This can be 
		# changed in the search controls.
		"queryNumTermsTotal":52,
		"queryNumTermsUsed":20,
		"queryWasTruncated":1,

		# The start of the terms array. Each 
		# query is broken down into a list of 
		# terms. Each term is described here.
		"terms":[

			# The first query term in the JSON 
			# terms array.
			{

			# The term number, starting at 0.
			"termNum":0,

			# The term as a string.
			"termStr":"test",

			# The term frequency. An estimate of 
			# how many pages in the collection 
			# contain the term. Helps us weight terms 
			# by popularity when scoring the results.
			"termFreq":425239458,

			# A 48-bit hash of the term. Used to 
			# represent the term in the index.
			"termHash48":67259736306430,

			# A 64-bit hash of the term.
			"termHash64":9448336835959712000,

			# If the term has a field, like the 
			# term title:cat, then what is the hash 
			# of the field. In this example it would 
			# be the hash of 'title'. But for the 
			# query 'test' there is no field so it is 
			# 0.
			"prefixHash64":0

			},

			# The second query term in the JSON 
			# terms array.
			{

			"termNum":1,
			"termStr":"tested",

			# The language the term is from, in 
			# the case of query expansion on the 
			# original query term. Gigablast tries to 
			# find multiple forms of the word that 
			# have the same essential meaning. It 
			# uses the specified query language 
			# (&qlang), however, if a query term is 
			# from a different language, then that 
			# language will be implied for query 
			# expansion.
			"termLang":"en",

			# The query term that this term is a 
			# form of.
			"synonymOf":"test",

			"termFreq":73338909,
			"termHash48":66292713121321,
			"termHash64":9448336835959712000,
			"prefixHash64":0
			},

			...

		# End of the JSON terms array.
		]

	# End of the queryInfo JSON structure.
	},

	# The start of the gigabits array. 
	# Each gigabit is mined from the content 
	# of the search results. The top N 
	# results are mined, and you can control 
	# N with the &dsrt input parameter 
	# described above.
	"gigabits":[

		# The first gigabit in the array.
		{

		# The gigabit as a string in utf8.
		"term":"Membership",

		# The numeric score of the gigabit.
		"score":240,

		# The popularity ranking of the 
		# gigabit. Out of 10000 random documents, 
		# how many documents contain it?
		"minPop":480,

		# The gigabit in the context of a 
		# document.
		"instance":{

			# A sentence, if it exists, from one 
			# of the search results which also 
			# contains the gigabit and as many 
			# significant query terms as possible. In 
			# UTF-8.
			"sentence":"Get a free Tested Premium Membership here!",

			# The url that contained that 
			# sentence. Always starts with http.
			"url":"http://www.tested.com/",

			# The domain of that url.
			"domain":"tested.com"
		}

		# End of the first gigabit
		},

		...

	# End of the JSON gigabits array.
	],

	# Start of the facets array, if any.
	"facets":[

		# The first facet in the array.
		{
			# The field you are faceting over
			"field":"Company",

			# How many documents in the 
			# collection had this particular field? 
			# 64-bit integer.
			"totalDocsWithField":148553,

			# How many documents in the 
			# collection had this particular field 
			# with the same value as the value line 
			# directly below? This should always be 
			# less than or equal to the 
			# totalDocsWithField count. 64-bit 
			# integer.
			"totalDocsWithFieldAndValue":44184,

			# The value of the field in the case 
			# of this facet. Can be a string or an 
			# integer or a float, depending on the 
			# type described in the gbfacet query 
			# term. i.e. gbfacetstr, gbfacetint or 
			# gbfacetfloat.
			"value":"Widgets, Inc.",

			# Should be the same as 
			# totalDocsWithFieldAndValue, above. 
			# 64-bit integer.
			"docCount":44184

		# End of the first facet in the array.
		}

	# End of the facets array.
	],

	# Start of the JSON array of 
	# individual search results.
	"results":[

		# The first result in the array.
		{

		# The title of the result. In UTF-8.
		"title":"This is the title.",

		# A DMOZ entry. One result can have 
		# multiple DMOZ entries.
		"dmozEntry":{

			# The DMOZ category ID.
			"dmozCatId":374449,

			# The DMOZ direct category ID.
			"directCatId":1,

			# The DMOZ category as a UTF-8 
			# string.
			"dmozCatStr":"Top: Computers: Security: Malicious Software: Viruses: Detection and Removal Tools: Reviews",

			# What title some DMOZ editor gave 
			# to this url.
			"dmozTitle":"The DMOZ Title",

			# What summary some DMOZ editor gave 
			# to this url.
			"dmozSum":"A great web page.",

			# The DMOZ anchor text, if any.
			"dmozAnchor":""

		# End DMOZ entry.
		},

		# The content type of the url. Can be 
		# html, pdf, text, xml, json, doc, xls or 
		# ps.
		"contentType":"html",

		# The summary excerpt of the result. 
		# In UTF-8.
		"sum":"Department of the Interior protects America's natural resources.",

		# The url of the result. If it starts 
		# with http:// then that is omitted. Also 
		# omits the trailing / if the url is 
		# just a domain or subdomain on the root 
		# path.
		"url":"www.doi.gov",

		# The hopcount of the url. The 
		# minimum number of links we would have 
		# to click to get to it from a root url. 
		# If this is 0 that means the url is a 
		# root url, like http://www.root.com/.
		"hopCount":0,

		# The size of the result's content. 
		# Always in kilobytes. k stands for 
		# kilobytes. Could be a floating point 
		# number or an integer.
		"size":"  64k",

		# The exact size of the result's 
		# content in bytes.
		"sizeInBytes":64560,

		# The unique document identifier of 
		# the result. Used for getting the cached 
		# content of the url.
		"docId":34111603247,

		# The site the result comes from. 
		# Usually a subdomain, but can also 
		# include part of the URL path, like, 
		# abc.com/users/brad/. A site is a set of 
		# web pages controlled by the same 
		# entity.
		"site":"www.doi.gov",

		# The time the url was last INDEXED. 
		# If there was an error or the url's 
		# content was unchanged since last 
		# download, then this time will remain 
		# unchanged because the document is not 
		# reindexed in those cases. Time is in 
		# unix timestamp format and is in UTC.
		"spidered":1404512549,

		# The first time the url was 
		# successfully INDEXED. Time is in unix 
		# timestamp format and is in UTC.
		"firstIndexedDateUTC":1404512549,

		# A 32-bit hash of the url's content. 
		# It is used to determine if the content 
		# changes the next time we download it.
		"contentHash32":2680492249,

		# The dominant language that the 
		# url's content is in. The language name 
		# is spelled out in its entirety.
		"language":"English",

		# A convenient abbreviation of the 
		# above language. Most are two 
		# characters, but some, like zh-cn, are 
		# more.
		"langAbbr":"en",

		# If the result has an associated 
		# image then the image thumbnail is 
		# encoded in base64 format here. It is a 
		# jpg image.
		"imageBase64":"/9j/4AAQSkZJR...",

		# If the result has an associated 
		# image then what is its height and width 
		# of the above jpg thumbnail image in 
		# pixels?
		"imageHeight":223,
		"imageWidth":350,

		# If the result has an associated 
		# image then what are the dimensions of 
		# the original image in pixels?
		"origImageHeight":300,
		"origImageWidth":470

		# End of the first result.
		},

		...

	# End of the JSON results array.
	]

# End of the response.
}

}
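The `#` lines in the example above are annotations, so the sketch below assumes that a real JSON feed omits them and wraps everything in a top-level "response" object as shown. It checks statusCode, uses numShardsSkipped and moreResultsFollow as described, and restores the http:// prefix that the url field omits. `SAMPLE`, `absolute_url` and `summarize` are hypothetical names used only for illustration.

```python
import json

# Minimal stand-in for a real response, using field names from the example above.
SAMPLE = json.dumps({"response": {
    "statusCode": 0,
    "statusMsg": "Success",
    "numShardsSkipped": 0,
    "hits": 193,
    "moreResultsFollow": 1,
    "results": [
        {"title": "This is the title.", "url": "www.doi.gov", "docId": 34111603247},
    ],
}})

def absolute_url(url):
    """The url field drops a leading http://, so put it back when no scheme is present."""
    return url if "://" in url else "http://" + url

def summarize(json_text):
    resp = json.loads(json_text)["response"]
    if resp["statusCode"] != 0:
        # statusMsg carries the error message matching the non-zero statusCode.
        raise RuntimeError(resp["statusMsg"])
    return {
        "complete": resp.get("numShardsSkipped", 0) == 0,   # no shard failed to answer
        "has_more": resp.get("moreResultsFollow", 0) == 1,  # another page is available
        "urls": [absolute_url(r["url"]) for r in resp.get("results", [])],
    }

summary = summarize(SAMPLE)
print(summary)
```

When `has_more` is true, the next page could be requested by raising the s input parameter by n.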



/get - gets cached web page

Input
# | Parm | Type | Title | Default Value | Description
1 | format | STRING | output format | html | Display output in this format. Can be html, json or xml.
2 | showinput | BOOL (0 or 1) | show input and settings | 1 | Display possible input and the values of all settings on this page.
3 | d | INT64 | docId | 0 | The docid of the cached page to view. REQUIRED
4 | url | STRING | url | | Instead of specifying a docid, you can get the cached webpage by url as well. REQUIRED
5 | c | STRING | collection | | Get the cached page from this collection. REQUIRED
6 | strip | INT32 | strip | 0 | Is 1 or 2 to strip various tags from the cached content.
7 | ih | BOOL (0 or 1) | include header | 1 | Is 1 to include the Gigablast header at the top of the cached page, 0 to exclude the header.
8 | q | STRING | query | | Highlight this query in the page.
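A /get request could be assembled as in the sketch below. Both d and url are marked REQUIRED, yet the url description says it is used instead of a docid, so this example reads that as "send exactly one of them"; that reading, the host `gigablast.example`, the collection name `main`, and the helper name `cached_page_url` are assumptions.

```python
from urllib.parse import urlencode

def cached_page_url(base_url, collection, doc_id=None, url=None,
                    strip=0, include_header=True, highlight=None):
    """Build a /get URL that identifies the page by docid or by url."""
    if (doc_id is None) == (url is None):
        raise ValueError("pass exactly one of doc_id or url")
    params = {"c": collection, "format": "json"}
    if doc_id is not None:
        params["d"] = doc_id
    else:
        params["url"] = url
    params["strip"] = strip                       # 0, 1 or 2
    params["ih"] = 1 if include_header else 0     # Gigablast header on or off
    if highlight:
        params["q"] = highlight                   # query to highlight in the page
    return base_url + "/get?" + urlencode(params)

# Hypothetical host and collection; the docid is the one from the search example.
get_url = cached_page_url("https://gigablast.example", "main",
                          doc_id=34111603247, include_header=False)
print(get_url)
```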

Example XML Output (&format=xml)
<response>
	<statusCode>0</statusCode>
	<statusMsg>Success</statusMsg>
	<url><![CDATA[http://www.doi.gov/]]></url>
	<docId>34111603247</docId>
	<cachedTimeUTC>1404512549</cachedTimeUTC>
	<cachedTimeStr>Jul 04, 2014 UTC</cachedTimeStr>
	<content><![CDATA[<html><title>Some web page title</title><head>My first web page</head></html>]]></content>
</response>

Example JSON Output (&format=json)
{ "response":{
	"statusCode":0,
	"statusMsg":"Success",
	"url":"http://www.doi.gov/",
	"docId":34111603247,
	"cachedTimeUTC":1404512549,
	"cachedTimeStr":"Jul 04, 2014 UTC",
	"content":"<html><title>Some web page title</title><head>My first web page</head></html>"
}
}
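The cachedTimeUTC field in the example above is a unix timestamp, so it can be turned into a timezone-aware datetime as sketched here. The sketch assumes the top-level "response" wrapper shown in the example; `SAMPLE` is a shortened stand-in for a real response and `cached_at` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone

# Shortened stand-in for the example response above.
SAMPLE = json.dumps({"response": {
    "statusCode": 0,
    "statusMsg": "Success",
    "url": "http://www.doi.gov/",
    "docId": 34111603247,
    "cachedTimeUTC": 1404512549,
    "content": "<html>...</html>",
}})

def cached_at(json_text):
    """Return the cache time of the page as an aware UTC datetime."""
    resp = json.loads(json_text)["response"]
    if resp["statusCode"] != 0:
        raise RuntimeError(resp["statusMsg"])
    return datetime.fromtimestamp(resp["cachedTimeUTC"], tz=timezone.utc)

when = cached_at(SAMPLE)
print(when.isoformat())  # 2014-07-04T22:22:29+00:00
```

The date part agrees with the cachedTimeStr value in the example, "Jul 04, 2014 UTC".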



/admin/status - basic status

Input
# | Parm | Type | Title | Default Value | Description
1 | format | STRING | output format | html | Display output in this format. Can be html, json or xml.
2 | showinput | BOOL (0 or 1) | show input and settings | 1 | Display possible input and the values of all settings on this page.
3 | c | STRING | collection | | Use this collection. REQUIRED



/admin/collectionpasswords - passwords

Input
# | Parm | Type | Title | Default Value | Description
1 | format | STRING | output format | html | Display output in this format. Can be html, json or xml.
2 | showinput | BOOL (0 or 1) | show input and settings | 1 | Display possible input and the values of all settings on this page.
3 | collpwd | STRING | Collection Passwords | | Whitespace separated list of passwords. Any matching password will have administrative access to the controls for just this collection. The master password and IPs are controlled through the master passwords link under the ADVANCED controls tab. The master passwords or IPs have administrative access to all collections.
4 | collips | STRING | Collection IPs | | Whitespace separated list of IPs. Any matching IP will have administrative access to the controls for just this collection.



/admin/hosts - hosts status

Input
# | Parm | Type | Title | Default Value | Description
1 | format | STRING | output format | html | Display output in this format. Can be html, json or xml.
2 | showinput | BOOL (0 or 1) | show input and settings | 1 | Display possible input and the values of all settings on this page.



/admin/master - master controls

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3seBOOL (0 or 1)spidering enabled1Controls all spidering for all collections
Current value: 1
4injenBOOL (0 or 1)injections enabled1Controls injecting for all collections
Current value: 1
5qryenBOOL (0 or 1)querying enabled1Controls querying for all collections
Current value: 1
6rraBOOL (0 or 1)return results even if a shard is down1If you turn this off then Gigablast will return an error message if a shard was down and did not return results for a query. The XML and JSON feed let's you know when a shard is down and will give you the results back any way, but if you would rather have just and error message and no results, then set then set this to 'NO'.
Current value: 1
7maxmem INT64 max mem8000000000Mem available to this process. May be exceeded due to fragmentation.
Current value: 8000000000
8mtsp INT32 max total spiders100What is the maximum number of web pages the spider is allowed to download simultaneously for ALL collections PER HOST? Caution: raising this too high could result in some Out of Memory (OOM) errors. The hard limit is currently 300. Each collection has its own limit in the spider controls that you may have to increase as well.
Current value: 100
9aeBOOL (0 or 1)add url enabled1Can people use the add url interface to add urls to the index?
Current value: 0
10ucpBOOL (0 or 1)use collection passwords0Should collections have individual password settings so different users can administrer different collections? If not the only the master passwords and IPs will be able to administer any collection.
Current value: 0
11acuBOOL (0 or 1)allow cloud users0Can guest users create and administer a collection? Limit: 1 collection per IP address. This is mainly for doing demos on the gigablast.com domain.
Current value: 0
12asf INT32 auto save frequency5Save data in memory to disk after this many minutes have passed without the data having been dumped or saved to disk. Use 0 to disable.
Current value: 5
13ms INT32 max http sockets300Maximum sockets available to serve incoming HTTP requests. Too many outstanding requests will increase query latency. Excess requests will simply have their sockets closed.
Current value: 300
14mss INT32 max https sockets100Maximum sockets available to serve incoming HTTPS requests. Like max http sockets, but for secure sockets.
Current value: 100
15suaSTRINGspider user agentGigablastOpenSource/1.0Identification seen by web servers when the Gigablast spider downloads their web pages. It is polite to insert a contact email address here so webmasters that experience problems from the Gigablast spider have somewhere to vent.
Current value: GigablastOpenSource/1.0 search.dxhub.de
16jsUNARY CMD (set to 1)saveSaves in-memory data for ALL hosts. Does Not exit.
17saveUNARY CMD (set to 1)save & exitSaves the data and exits for ALL hosts.
18rebalanceUNARY CMD (set to 1)rebalance shardsTell all hosts to scan all records in all databases, and move records to the shard they belong to. You only need to run this if Gigablast tells you to, when you are changing hosts.conf to add or remove more nodes/hosts.
19dumpUNARY CMD (set to 1)dump to diskFlushes all records in memory to the disk on all hosts.
20pmergeUNARY CMD (set to 1)tight merge posdbMerges all outstanding posdb (index) files.
21tmergeUNARY CMD (set to 1)tight merge titledbMerges all outstanding titledb (web page cache) files.
22spmergeUNARY CMD (set to 1)tight merge spiderdbMerges all outstanding spiderdb files.
23clrkrnerrUNARY CMD (set to 1)clear kernel error messageClears the kernel error message. You must do this to stop getting email alerts for a kernel ring buffer error alert.
24afgdwdBOOL (0 or 1)ask for gzipped docs when downloading0If this is true, gb will send Accept-Encoding: gzip to web servers when doing http downloads. It does have a tendency to cause out-of-memory errors when you enable this, so until that is fixed better, it's probably a good idea to leave this disabled.
Current value: 0
25srcma INT32 search results cache max age10800How many seconds should we cache a search results page for?
Current value: 10800
26abBOOL (0 or 1)autoban IPs which violate the queries per day quotas0Keep track of ips which do queries, disallow non-customers from hitting us too hard.
Current value: 0
27mhdms INT32 max heartbeat delay in milliseconds0If a heartbeat is delayed this many milliseconds dump a core so we can see where the CPU was. Logs 'db: missed heartbeat by %ld ms'. Use 0 or less to disable.
Current value: 0
28mdch INT32 max delay before logging a callback or handler-1If a call to a message callback or message handler in the udp server takes more than this many milliseconds, then log it. Logs 'udp: Took %ld ms to call callback for msgType=0x%hhx niceness=%d'. Use -1 or less to disable the logging.
Current value: -1
29seaBOOL (0 or 1)send email alerts0Sends emails to admin if a host goes down.
Current value: 0
30dncaBOOL (0 or 1)delay non critical email alerts0Do not send email alerts about dead hosts to anyone except sysadmin@gigablast.com between the times given below unless all the twins of the dead host are also dead. Instead, wait till after if the host is still dead.
Current value: 0
31set INT32 send email timeout62000Send an email after a host has not responded to successive pings for this many milliseconds.
Current value: 62000
32qsrtFLOAT32query success rate threshold0.850000Send email alerts when query success rate goes below this threshold. (percent rate between 0.0 and 1.0)
Current value: 0.850000
33aqpstFLOAT32average query latency threshold2.000000Send email alerts when average query latency goes above this threshold. (in seconds)
Current value: 2.000000
34nqt INT32 number of query times in average300Record this number of query times before calculating average query latency.
Current value: 300
35mcil INT32 max corrupt index lists5If we reach this many corrupt index lists, send an admin email. Set to -1 to disable.
Current value: 5
36mhdt INT32 max hard drive temperature45At what temperature in Celsius should we send an email alert if a hard drive reaches it?
Current value: 45
37errstroneSTRINGerror string 1I/O errorLook for this string in the kernel buffer for sending email alert. Useful for detecting some strange hard drive failures that really slow performance.
Current value: I/O error
38errstrtwoSTRINGerror string 2Look for this string in the kernel buffer for sending email alert. Useful for detecting some strange hard drive failures that really slow performance.
Current value:
39errstrthreeSTRINGerror string 3Look for this string in the kernel buffer for sending email alert. Useful for detecting some strange hard drive failures that really slow performance.
Current value:
40seatoneBOOL (0 or 1)send email alerts to email 10Sends to email address 1 through email server 1.
Current value: 0
41seatonepBOOL (0 or 1)send parm change email alerts to email 10Sends to email address 1 through email server 1 if any parm is changed.
Current value: 0
42esrvoneSTRINGemail server 1127.0.0.1Connects to this IP or hostname directly when sending email 1. Use apt-get install sendmail to install sendmail on that IP or hostname. Add From:10.5 RELAY to /etc/mail/access to allow sendmail to forward email it receives from gigablast if gigablast hosts are on the 10.5.*.* IPs. Then run /etc/init.d/sendmail restart as root to pick up those changes so sendmail will forward Gigablast's email to the email address you give below.
Current value: 127.0.0.1
43eaddroneSTRINGemail address 14081234567@vtext.comSends to this address when sending email 1
Current value: 4081234567@vtext.com
44efaddroneSTRINGfrom email address 1sysadmin@mydomain.comThe from field when sending email 1
Current value: sysadmin@mydomain.com
45seattwoBOOL (0 or 1)send email alerts to email 20Sends to email address 2 through email server 2.
Current value: 0
46seattwopBOOL (0 or 1)send parm change email alerts to email 20Sends to email address 2 through email server 2 if any parm is changed.
Current value: 0
47esrvtwoSTRINGemail server 2mail.mydomain.comConnects to this server directly when sending email 2
Current value: mail.mydomain.com
48eaddrtwoSTRINGemail address 2Sends to this address when sending email 2
Current value:
49efaddrtwoSTRINGfrom email address 2sysadmin@mydomain.comThe from field when sending email 2
Current value: sysadmin@mydomain.com
50seatthreeBOOL (0 or 1)send email alerts to email 30Sends to email address 3 through email server 3.
Current value: 0
51seatthreepBOOL (0 or 1)send parm change email alerts to email 30Sends to email address 3 through email server 3 if any parm is changed.
Current value: 0
52esrvthreeSTRINGemail server 3mail.mydomain.comConnects to this server directly when sending email 3
Current value: mail.mydomain.com
53eaddrthreeSTRINGemail address 3Sends to this address when sending email 3
Current value:
54efaddrthreeSTRINGfrom email address 3sysadmin@mydomain.comThe from field when sending email 3
Current value: sysadmin@mydomain.com
55dpcsp INT64 posdb disk cache size30000000How much file cache size to use in bytes? Posdb is the index.
Current value: 30000000
56dpcst INT64 tagdb disk cache size30000000How much file cache size to use in bytes? Tagdb is consulted at spider time and query time to determine if a url or outlink is banned or what its siterank is, etc.
Current value: 30000000
57dpcsc INT64 clusterdb disk cache size30000000How much file cache size to use in bytes? Gigablast does a lookup in clusterdb for each search result at query time to get its site information for site clustering. If you disable site clustering in the search controls then clusterdb will not be consulted.
Current value: 30000000
58dpcsx INT64 titledb disk cache size30000000How much file cache size to use in bytes? Titledb holds the cached web pages, compressed. Gigablast consults it to generate a summary for a search result, or to see if a url Gigablast is spidering is already in the index.
Current value: 30000000
59dpcsy INT64 spiderdb disk cache size30000000How much file cache size to use in bytes? Spiderdb holds the urls that are queued for spidering.
Current value: 30000000
60pdnsIPdns 08.8.8.8IP address of the primary DNS server. Assumes UDP port 53. REQUIRED FOR SPIDERING! Use Google's public DNS 8.8.8.8 as default.
Current value: 8.8.8.8
61sdnsIPdns 18.8.4.4IP address of the secondary DNS server. Assumes UDP port 53. Will be accessed in conjunction with the primary dns, so make sure this is always up. An ip of 0 means disabled. Google's secondary public DNS is 8.8.4.4.
Current value: 8.8.4.4
62sdnsaIPdns 20.0.0.0All hosts send to these DNSes based on hash of the subdomain to try to split DNS load evenly.
Current value: 0.0.0.0
63sdnsbIPdns 30.0.0.0
Current value: 0.0.0.0
64sdnscIPdns 40.0.0.0
Current value: 0.0.0.0
65sdnsdIPdns 50.0.0.0
Current value: 0.0.0.0
66sdnseIPdns 60.0.0.0
Current value: 0.0.0.0
67sdnsfIPdns 70.0.0.0
Current value: 0.0.0.0
68sdnsgIPdns 80.0.0.0
Current value: 0.0.0.0
69sdnshIPdns 90.0.0.0
Current value: 0.0.0.0
70sdnsiIPdns 100.0.0.0
Current value: 0.0.0.0
71sdnsjIPdns 110.0.0.0
Current value: 0.0.0.0
72sdnskIPdns 120.0.0.0
Current value: 0.0.0.0
73sdnslIPdns 130.0.0.0
Current value: 0.0.0.0
74sdnsmIPdns 140.0.0.0
Current value: 0.0.0.0
75sdnsnIPdns 150.0.0.0
Current value: 0.0.0.0
76utfdBOOL (0 or 1)use threads for disk1If enabled, Gigablast will use threads for disk ops. Now that Gigablast uses pthreads more effectively, leave this enabled for optimal performance in all cases.
Current value: 1
77utfioBOOL (0 or 1)use threads for intersects and merges1If enabled, Gigablast will use threads for these ops. Default is now on in the event you have simultaneous queries so one query does not hold back the other. There seems to be a bug so leave this ON for now.
Current value: 1
78utfscBOOL (0 or 1)use threads for system calls1Gigablast does not make too many system calls so leave this on in case the system call is slow.
Current value: 1
79fwBOOL (0 or 1)flush disk writes0If enabled then all writes will be flushed to disk. If not enabled, then gb uses the Linux disk write cache.
Current value: 0
80vwlBOOL (0 or 1)verify written lists0Ensure lists being written to disk are not corrupt. That title recs appear valid, etc. Helps isolate sources of corruption. Used for debugging.
Current value: 0
81smdt INT32 max spider read threads20Maximum number of threads to use per Gigablast process for accessing the disk for index-building purposes. Keep low to reduce impact on query response time. Increase for fast disks or when preferring build speed over lower query latencies.
Current value: 20
82sdtBOOL (0 or 1)separate disk reads1If enabled then we will not launch a low priority disk read or write while a high priority one is outstanding. Helps improve query response time at the expense of spider performance.
Current value: 1
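
The dns 2 through dns 15 description says hosts pick a DNS server based on a hash of the subdomain so the load is split evenly. A minimal sketch of that idea follows. The function name pick_dns, the use of CRC32 as the hash, and the treatment of all configured servers as one pool are assumptions made for illustration, not Gigablast's actual implementation.

```python
import zlib

# An IP of 0.0.0.0 means the slot is disabled (per the parameter table).
DISABLED = "0.0.0.0"


def pick_dns(subdomain, servers):
    """Choose a DNS server for a subdomain by hashing the subdomain.

    The hash (CRC32) is an assumption; any stable hash would spread
    subdomains across the enabled servers in the same way.
    """
    active = [ip for ip in servers if ip != DISABLED]
    if not active:
        raise ValueError("no DNS servers configured")
    h = zlib.crc32(subdomain.lower().encode("utf-8"))
    return active[h % len(active)]


servers = ["8.8.8.8", "8.8.4.4", "0.0.0.0"]
choice = pick_dns("abc.xyz.com", servers)
print(choice)
```

Because the choice depends only on the subdomain and the list of enabled servers, repeated lookups for the same subdomain go to the same server.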



/admin/search - search controls   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3cSTRINGcollectionUse this collection. REQUIRED
4msrpq INT32 max search results per query100What is the limit to the total number of returned search results per query?
Current value: 100
5langweightFLOAT32language weight20.000000Default language weight if the document matches the query language. Use this to give results that match the specified &qlang higher ranking, or docs whose language is unknown. Can be overridden with &langw in the query url.
Current value: 20.000000
6mqt INT32 max query terms999999Do not allow more than this many query terms. Helps prevent big queries from resource hogging.
Current value: 999999
7spellBOOL (0 or 1)do spell checking by default1If enabled while using the XML feed, when Gigablast finds a spelling recommendation it will be included in the XML tag. Default is 0 if using an XML feed, 1 otherwise.
Current value: 1
8scoresBOOL (0 or 1)get scoring info by default1Get scoring information for each result so you can see how each result is scored. You must explicitly request this using &scores=1 for the XML feed because it is not included by default.
Current value: 1
9qeBOOL (0 or 1)do query expansion by default1If enabled, query expansion will expand your query to include the various forms and synonyms of the query terms.
Current value: 1
10qhBOOL (0 or 1)highlight query terms in summaries by default1Use to disable or enable highlighting of the query terms in the summaries.
Current value: 1
11tml INT32 max title len80What is the maximum number of characters allowed in titles displayed in the search results?
Current value: 80
12scdBOOL (0 or 1)site cluster by default0Should search results be site clustered? This limits each site to appearing at most twice in the search results. Sites are subdomains for the most part, like abc.xyz.com.
Current value: 0
13hacrBOOL (0 or 1)hide all clustered results0Only display at most one result per site.
Current value: 0
14drdBOOL (0 or 1)dedup results by default1Should duplicate search results be removed? This is based on a content hash of the entire document. So documents must be exactly the same for the most part.
Current value: 1
15stgdblBOOL (0 or 1)do tagdb lookups for queries1For each search result a tagdb lookup is made, usually across the network on distributed clusters, to see if the URL's site has been manually banned in tagdb. If you don't manually ban sites then turn this off for extra speed.
Current value: 1
16psds INT32 percent similar dedup summary default value90If document summary (and title) are this percent similar to a document summary above it, then remove it from the search results. 100 means only to remove if exactly the same. 0 means no summary deduping.
Current value: 90
17msld INT32 number of lines to use in summary to dedup4Sets the number of lines to generate for summary deduping. This is to help the deduping process not throw out valid summaries when normally displayed summaries are smaller values. Requires percent similar dedup summary to be non-zero.
Current value: 4
18dduBOOL (0 or 1)dedup URLs by default0Should we dedup URLs with case insensitivity? This is mainly to correct duplicate wiki pages.
Current value: 0
19defqlangSTRINGsort language preference defaultxxDefault language to use for ranking results. Value should be any language abbreviation, for example "en" for English. Use xx to give ranking boosts to no language in particular. See the language abbreviations at the bottom of the url filters page.
Current value: xx
20qcountrySTRINGsort country preference defaultusDefault country to use for ranking results. Value should be any country code abbreviation, for example "us" for United States. This is currently not working.
Current value: de
21sml INT32 max summary len512What is the maximum number of characters displayed in a summary for a search result?
Current value: 512
22smnl INT32 max summary excerpts4What is the maximum number of excerpts displayed in the summary of a search result?
Current value: 4
23smxcpl INT32 max summary excerpt length90What is the maximum number of characters allowed per summary excerpt?
Current value: 90
24smw INT32 max summary line width by default80<br> tags are inserted to keep the number of chars in the summary per line at or below this width. Also affects title. Strings without spaces that exceed this width are not split. Has no effect on xml or json feed, only works on html.
Current value: 80
25clmfs INT32 bytes of doc to scan for summary generation70000Truncating this will miss out on good summaries, but performance will increase.
Current value: 70000
26sfhtSTRINGfront highlight tagFront html tag used for highlighting query terms in the summaries displayed in the search results.
Current value: <b style="color:black;background-color:#ffff66">
27sbhtSTRINGback highlight tagBack html tag used for highlighting query terms in the summaries displayed in the search results.
Current value: </b>
28dsrt INT32 results to scan for gigabits generation by default30How many search results should we scan for gigabit (related topics) generation. Set this to zero to disable gigabits generation by default.
Current value: 30
29iprBOOL (0 or 1)ip restriction for gigabits by default0Should Gigablast only get one document per IP domain and per domain for gigabits (related topics) generation?
Current value: 0
30rotBOOL (0 or 1)remove overlapping topics1Should Gigablast remove overlapping topics (gigabits)?
Current value: 1
31nrt INT32 number of gigabits to show by default11What is the number of related topics (gigabits) displayed per query? Set to 0 to save CPU time.
Current value: 11
32mts INT32 min gigabit score by default5Gigabits (related topics) with scores below this will be excluded. Scores range from 0% to over 100%.
Current value: 5
33mdc INT32 min gigabit doc count by default2How many documents must contain the gigabit (related topic) in order for it to be displayed.
Current value: 2
34dsp INT32 dedup doc percent for gigabits (related topics)80If a document is this percent similar to another document with a higher score, then it will not contribute to the gigabit generation.
Current value: 80
35mwpt INT32 max words per gigabit (related topic) by default6Maximum number of words a gigabit (related topic) can have. Affects xml feeds, too.
Current value: 6
36tmss INT32 gigabit max sample size4096Max chars to sample from each doc for gigabits (related topics).
Current value: 4096
37ddcBOOL (0 or 1)display dmoz categories in results1If enabled, results in dmoz will display their categories on the results page.
Current value: 1
38didcBOOL (0 or 1)display indirect dmoz categories in results0If enabled, results in dmoz will display their indirect categories on the results page.
Current value: 0
39dsclBOOL (0 or 1)display Search Category link to query category of result0If enabled, a link will appear next to each category on each result allowing the user to perform their query on that entire category.
Current value: 0
40udfuBOOL (0 or 1)use dmoz for untitled1Yes to use DMOZ given title when a page is untitled but is in DMOZ.
Current value: 1
41udsmBOOL (0 or 1)show dmoz summaries1Yes to always show DMOZ summaries with search results that are in DMOZ.
Current value: 1
42sacotBOOL (0 or 1)show adult category on top0Yes to display the Adult category in the Top category
Current value: 0
43hpSTRINGhome pageHtml to display for the home page. Leave empty for default home page. Use %N for total number of pages indexed. Use %n for number of pages indexed for the current collection. Use %c to insert the current collection name. Use %q to display the query in a text box. Use %t to display the directory TOP. Example to paste into textbox:
<html><title>My Gigablast Search Engine</title><script> function x(){document.f.q.focus();} </script><body onload="x()"><br><br><center><a href=/><img border=0 width=500 height=122 src=/logo-med.jpg></a><br><br><b>My Search Engine</b><br><br><form method=get action=/search name=f><input type=hidden name=c value="%c"><input name=q type=text size=60 value="">&nbsp;<input type="submit" value="Search"></form><br><center>Searching the <b>%c</b> collection of %n documents.</center><br></body></html>
Current value:
44hhSTRINGhtml headHtml to display before the search results. Leave empty for default. Convenient for changing colors and displaying logos. Use the variable, %q, to represent the query to display in a text box. Use %e to print the url encoded query. Use %S to print sort by date or relevance link. Use %L to display the logo. Use %R to display radio buttons for site search. Use %F to begin the form. and use %H to insert hidden text boxes of parameters like the current search result page number. BOTH %F and %H are necessary for the html head, but do not duplicate them in the html tail. Use %f to display the family filter radio buttons. Example to paste into textbox:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title>My Gigablast Search Results</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body%l> %F<table cellpadding="2" cellspacing="0" border="0"> <tr> <td valign=top>%L</td> <td valign=top> <nobr> <input type="text" name="q" size="60" value="%q"> %D<input type="submit" value="Blast It!" border="0"> </nobr> <br>%f %R </tr> </table> %H
Current value:
45htSTRINGhtml tailHtml to display after the search results. Leave empty for default. Convenient for changing colors and displaying logos. Use the variable, %q, to represent the query to display in a text box. Use %e to print the url encoded query. Use %S to print sort by date or relevance link. Use %L to display the logo. Use %R to display radio buttons for site search. Use %F to begin the form. and use %H to insert hidden text boxes of parameters like the current search result page number. BOTH %F and %H are necessary for the html head, but do not duplicate them in the html tail. Use %f to display the family filter radio buttons. Example to paste into textbox:
<br> <table cellpadding=2 cellspacing=0 border=0> <tr><td></td> <td>%s</td> </tr> </table> Try your search on <a href=http://www.google.com/search?q=%e>google</a> &nbsp; <a href=http://search.yahoo.com/bin/search?p=%e>yahoo</a> &nbsp; <a href=http://search.dmoz.org/cgi-bin/search?search=%e>dmoz</a> &nbsp; </font></body>
Current value:
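
Since every parameter in the table above is passed as a request parameter to /admin/search, a request URL can be assembled with standard URL encoding. The sketch below assumes a Gigablast host at http://localhost:8000 and a collection named main; both are placeholders, and search_controls_url is a hypothetical helper, not part of Gigablast.

```python
from urllib.parse import urlencode


def search_controls_url(base, collection, **parms):
    """Build a /admin/search URL for one collection.

    'c' is the required collection parm and 'format' selects the output
    format; any other keyword is passed through as a parm from the table.
    """
    query = {"c": collection, "format": "json"}
    query.update(parms)
    return base.rstrip("/") + "/admin/search?" + urlencode(query)


# Hypothetical host and collection; msrpq and scd are parms from the table.
url = search_controls_url("http://localhost:8000", "main", msrpq=50, scd=1)
print(url)
```

For requests larger than 2K, the same parameters would be sent in a POST body instead of the query string, as noted at the top of the page.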



/admin/spider - spider controls   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3cSTRINGcollectionUse this collection. REQUIRED
4cseBOOL (0 or 1)spidering enabled1Controls just the spiders for this collection.
Current value: 1
5sitelistSTRINGsite listList of sites to spider, one per line. See example site list below. Gigablast uses the insitelist directive on the url filters page to make sure that the spider only indexes urls that match the site patterns you specify here, other than urls you add individually via the add urls or inject url tools. Limit list to 300MB. If you have a lot of INDIVIDUAL urls to add then consider using the addurl interface.
Current value: https://www.meilimuseum.ch/ https://solar.lowtechmagazine.com/ https://www.woo...
6restartUNARY CMD (set to 1)restart collectionRemove all documents from the collection and re-add seed urls from site list.
7mns INT32 max spiders300What is the maximum number of web pages the spider is allowed to download simultaneously PER HOST for THIS collection? The maximum number of spiders over all collections is controlled in the master controls.
Current value: 300
8sdms INT32 spider delay in milliseconds0Make each spider wait this many milliseconds before getting the ip and downloading the page.
Current value: 0
9obeyRobotsBOOL (0 or 1)obey robots.txt1If this is true Gigablast will respect the robots.txt convention and rel no follow meta tags.
Current value: 1
10obeyRelNoFollowBOOL (0 or 1)obey rel no follow links1If this is true Gigablast will respect the rel no follow link attribute.
Current value: 1
11mrca INT32 max robots.txt cache age86400How many seconds to cache a robots.txt file for. 86400 is 1 day. 0 means Gigablast will not read from the cache at all and will download the robots.txt before every page if robots.txt use is enabled above. However, if this is 0 then Gigablast will still store robots.txt files in the cache.
Current value: 86400
12useproxiesBOOL (0 or 1)always use spider proxies0If this is true Gigablast will ALWAYS use the proxies listed on the proxies page for spidering for this collection.
Current value: 0
13automaticallyuseproxiesBOOL (0 or 1)automatically use spider proxies0Use the spider proxies listed on the proxies page if gb detects that a webserver is throttling the spiders. This way we can learn the webserver's spidering policy so that our spiders can be more polite. If no proxies are listed on the proxies page then this parameter will have no effect.
Current value: 0
14automaticallybackoffBOOL (0 or 1)automatically back off0Set the crawl delay to 5 seconds if gb detects that an IP is throttling or banning gigabot from crawling it. The crawl delay just applies to that IP. Such throttling will be logged.
Current value: 1
15usetimeaxisBOOL (0 or 1)use time axis0If this is true Gigablast will index the same url multiple times if its content varies over time, rather than overwriting the older version in the index. Useful for archive web pages as they change over time.
Current value: 0
16indexwarcsBOOL (0 or 1)index warc or arc files0If this is true Gigablast will index .warc and .arc files by injecting the pages contained in them as if they were spidered with the content in the .warc or .arc file. The spidered time will be taken from the archive file as well.
Current value: 0
17dmt INT32 daily merge time-1Do a tight merge on posdb and titledb at this time every day. This is expressed in MINUTES past midnight UTC. UTC is 5 hours ahead of EST and 7 hours ahead of MST. Leave this as -1 to NOT perform a daily merge. To merge at midnight EST use 60*5=300 and midnight MST use 60*7=420.
Current value: -1
18dmdlSTRINGdaily merge days0Comma separated list of days to merge on. Use 0 for Sunday, 1 for Monday, ... 6 for Saturday. Leaving this parameter empty or without any numbers will make the daily merge happen every day.
Current value: 0
19dttBOOL (0 or 1)turing test enabled0If this is true, users will have to pass a simple Turing test to add a url. This prevents automated url submission.
Current value: 0
20mau INT32 max add urls0Maximum number of urls that can be submitted via the addurl interface, per IP domain, per 24 hour period. A value less than or equal to zero implies no limit.
Current value: 0
21deBOOL (0 or 1)deduping enabled0When enabled, the spider will discard web pages which are identical to other web pages that are already in the index. However, root urls, urls that have no path, are never discarded. It most likely has to hit disk to do these checks so it does cause some slow down. Only use it if you need it.
Current value: 0
22dewBOOL (0 or 1)deduping enabled for www1When enabled, the spider will discard web pages which, when a www is prepended to the page's url, result in a url already in the index.
Current value: 1
23dcepBOOL (0 or 1)detect custom error pages1Detect and do not index pages which have a 200 status code, but are likely to be error pages.
Current value: 1
24usrBOOL (0 or 1)use simplified redirects0If this is true, the spider, when a url redirects to a "simpler" url, will add that simpler url into the spider queue and abandon the spidering of the current url.
Current value: 0
25ucrBOOL (0 or 1)use canonical redirects1If page has a canonical link tag (<link rel="canonical">) on it then treat it as a redirect, add it to spiderdb for spidering and abandon the indexing of the current url.
Current value: 1
26uimsBOOL (0 or 1)use ifModifiedSince0If this is true, the spider, when updating a web page that is already in the index, will not even download the whole page if it hasn't been updated since the last time Gigablast spidered it. This is primarily a bandwidth saving feature. It relies on the remote webserver's returned Last-Modified field being accurate.
Current value: 1
27mlkftm INT32 linkdb min files needed to trigger to merge6Merge is triggered when this many linkdb data files are on disk. Raise this when initially growing an index in order to keep merging down.
Current value: 6
28mtftgm INT32 tagdb min files to merge2Merge is triggered when this many tagdb data files are on disk.
Current value: 2
29mpftm INT32 posdb min files needed to trigger to merge6Merge is triggered when this many posdb data files are on disk. Raise this while doing massive injections and not doing much querying. Then when done injecting keep this low to make queries fast.
Current value: 6
30gltBOOL (0 or 1)enable link voting1If this is true Gigablast will index hyper-link text and use hyper-link structures to boost the quality of indexed documents. You can disable this when doing a ton of injections to keep things fast. Then do a posdb (index) rebuild after re-enabling this when you are done injecting. Or if you simply do not want link voting this will speed up your injections and spidering a bit.
Current value: 1
31csniBOOL (0 or 1)compute inlinks to sites1If this is true Gigablast will compute the number of site inlinks for the sites it indexes. This is a measure of the site's popularity and is used for ranking and sometimes spidering prioritization. It will cache the site information in tagdb. The greater the number of inlinks, the longer the cached time, because the site is considered more stable. If this is NOT true then Gigablast will use the included file, sitelinks.txt, which stores the site inlinks of millions of the most popular sites. This is the fastest way. If you notice a lot of getting link info requests in the sockets table you may want to disable this parm.
Current value: 1
32dlscBOOL (0 or 1)do link spam checking1If this is true, do not allow spammy inlinks to vote. This check is too aggressive for some collections, i.e. it does not allow pages with cgi in their urls to vote.
Current value: 1
33ovpidBOOL (0 or 1)restrict link voting by ip1If this is true Gigablast will only allow one vote per the top 2 significant bytes of the IP address. Otherwise, multiple pages from the same top IP can contribute to the link text and link-based quality ratings of a particular URL. Furthermore, no votes will be accepted from IPs that have the same top 2 significant bytes as the IP of the page being indexed.
Current value: 1
34uvfFLOAT32update link info frequency60.000000How often should Gigablast recompute the link info for a url. Also applies to getting the quality of a site or root url, which is based on the link info. In days. Can use decimals. 0 means to update the link info every time the url's content is re-indexed. If the content is not reindexed because it is unchanged then the link info will not be updated. When getting the link info or quality of the root url from an external cluster, Gigablast will tell the external cluster to recompute it if its age is this or higher.
Current value: 60.000000
35dsdBOOL (0 or 1)do serp detection1If this is enabled the spider will not allow any docs which are determined to be serps.
Current value: 1
36mtdl INT32 max text doc length1048576Gigablast will not download, index or store more than this many bytes of an HTML or text document. XML is NOT considered to be HTML or text, use the rule below to control the maximum length of an XML document. Use -1 for no max.
Current value: 1048576
37modl INT32 max other doc length1048576Gigablast will not download, index or store more than this many bytes of a non-html, non-text document. XML documents will be restricted to this length. Use -1 for no max.
Current value: 1048576
38aftBOOL (0 or 1)apply filter to text pages0If this is false then the filter will not be used on html or text pages.
Current value: 1
39filterSTRINGfilter nameProgram to spawn to filter all HTTP replies the spider receives. Leave blank for none.
Current value: /mnt/nvme/osse.eclipse/tools_jcp/filter.sh
40fto INT32 filter timeout40Kill filter shell after this many seconds. Assume it stalled permanently.
Current value: 40
41mitBOOL (0 or 1)make image thumbnails0Try to find the best image on each page and store it as a thumbnail for presenting in the search results.
Current value: 0
42mtwh INT32 max thumbnail width or height250This is in pixels and limits the size of the thumbnail. Gigablast tries to make at least the width or the height equal to this maximum, but, unless the thumbnail is square, one side will be longer than the other.
Current value: 250
43isrBOOL (0 or 1)index spider status documents0Index a spider status "document" for every url the spider attempts to spider. Search for them using special query operators like type:status or gberrorstr:success or stats:gberrornum to get a histogram. See syntax page for more examples. They will not otherwise show up in the search results.
Current value: 0
44ibBOOL (0 or 1)index body1Index the body of the documents so you can search it. Required for searching that. You will pretty much always want to keep this enabled. Does not apply to JSON documents.
Current value: 1
45apiUrlSTRINGdiffbot api urlSend every spidered url to this url and index the reply in addition to the normal indexing process. Example: by specifying http://api.diffbot.com/v3/analyze?mode=high-precision&token= here you can index the structured JSON replies from diffbot for every url that is spidered. Gigablast will automatically append a &url= to this url before sending it to diffbot.
Current value:
46urlProcessPatternTwoSTRINGdiffbot url process patternOnly send urls that match this simple substring pattern to Diffbot. Separate substrings with two pipe operators, ||. Leave empty for no restrictions.
Current value:
47urlProcessRegExTwoSTRINGdiffbot url process regexOnly send urls that match this regular expression to Diffbot. Leave empty for no restrictions.
Current value:
48pageProcessPatternTwoSTRINGdiffbot page process patternOnly send urls whose content matches this simple substring pattern to Diffbot. Separate substrings with two pipe operators, ||. Leave empty for no restrictions.
Current value:
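
The daily merge time (dmt) parameter above is expressed in minutes past midnight UTC, so a local wall-clock time has to be converted first. A small sketch of that arithmetic follows; the helper name daily_merge_minutes is hypothetical, and the UTC offsets are passed in by hand rather than looked up from a timezone database.

```python
def daily_merge_minutes(hour, minute, utc_offset_hours):
    """Convert a local time of day to minutes past midnight UTC.

    utc_offset_hours is the local zone's offset from UTC, for example
    -5 for EST and -7 for MST. The result wraps around at 24 hours.
    """
    local_minutes = hour * 60 + minute
    utc_minutes = local_minutes - round(utc_offset_hours * 60)
    return utc_minutes % (24 * 60)


# Midnight EST and midnight MST, matching the 60*5 and 60*7 examples above.
est = daily_merge_minutes(0, 0, -5)
mst = daily_merge_minutes(0, 0, -7)
print(est, mst)
```

The resulting value is what would be supplied as dmt; -1 remains the way to disable the daily merge.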



/admin/proxies - proxies   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3proxyipsSTRINGspider proxy ipsList of white space-separated spider proxy IPs. Put in IP:port format. Example 1.2.3.4:80 4.5.6.7:99. You can also use username:password@1.2.3.4:80. If a proxy itself times out when downloading through it, it will be perceived as a normal download timeout and the page will be retried according to the url filters table, so you might want to modify the url filters to retry network errors more aggressively. Search for 'private proxies' on google to find proxy providers. Try to ensure all your proxies are on different class C IPs if possible. That is, the first 3 numbers in the IP addresses are all different.
Current value:
4proxytesturlSTRINGspider proxy test urlhttp://www.gigablast.com/Download this url every minute through each proxy listed above to ensure they are up. Typically you should make this a URL you own so you do not aggravate another webmaster.
Current value: http://www.gigablast.com/
5resetproxytableUNARY CMD (set to 1)reset proxy tableReset the proxy statistics in the table below. Makes all your proxies treated like new again.
6userandagentsBOOL (0 or 1)mix up user agents1Use random user-agents when downloading through a spider proxy listed above to protect gb's anonymity. The User-Agent used is a function of the proxy IP/port and IP of the url being downloaded. That way it is consistent when downloading the same website through the same proxy.
Current value: 1
7proxyAuthSTRINGsquid proxy authorized usersGigablast can also simulate a squid proxy, complete with caching. It will forward your request to the proxies you list above, if any. This list consists of space-separated username:password items. Leave this list empty to disable squid caching behaviour. The default cache size for this is 10MB per shard. Use item *:* to allow anyone access.
Current value:
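
The proxyips format described above (white space-separated IP:port entries, optionally prefixed with username:password@) can be checked before it is submitted. The sketch below parses such a list and reports proxies that share the same first three numbers of the IP address, which the description recommends avoiding. The helper names parse_proxy and shared_class_c are hypothetical and this is not Gigablast's own parser.

```python
from collections import defaultdict


def parse_proxy(entry):
    """Split one 'user:pass@ip:port' or 'ip:port' entry into its parts."""
    creds, _, hostport = entry.rpartition("@")
    ip, _, port = hostport.rpartition(":")
    user, _, password = creds.partition(":")
    return {
        "ip": ip,
        "port": int(port),
        "user": user or None,
        "password": password or None,
    }


def shared_class_c(proxies):
    """Group proxies by the first three numbers of their IP address
    and return only the groups that contain more than one proxy."""
    groups = defaultdict(list)
    for proxy in proxies:
        prefix = ".".join(proxy["ip"].split(".")[:3])
        groups[prefix].append(proxy)
    return {prefix: group for prefix, group in groups.items() if len(group) > 1}


# Example entries taken from the parameter description.
proxy_list = "1.2.3.4:80 4.5.6.7:99 username:password@1.2.3.4:80"
proxies = [parse_proxy(entry) for entry in proxy_list.split()]
clashes = shared_class_c(proxies)
print(sorted(clashes))
```

An empty result from shared_class_c means every proxy is on a different class C network.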



/admin/log - log controls   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3hrBOOL (0 or 1)log http requests1Log GET and POST requests received from the http server?
Current value: 1
4laqBOOL (0 or 1)log autobanned queries1Should we log queries that are autobanned? They can really fill up the log.
Current value: 1
5lqtt INT32 log query time threshold5000If query took this many milliseconds or longer, then log the query and the time it took to process.
Current value: 5000
6lqrBOOL (0 or 1)log query reply0Log query reply in proxy, but only for those queries above the time threshold above.
Current value: 0
7lsuBOOL (0 or 1)log spidered urls1Log status of spidered or injected urls?
Current value: 1
8lncBOOL (0 or 1)log network congestion0Log messages if Gigablast runs out of udp sockets?
Current value: 0
9liBOOL (0 or 1)log informational messages1Log messages not related to an error condition, but meant more to give an idea of the state of the gigablast process. These can be useful when diagnosing problems.
Current value: 1
10llBOOL (0 or 1)log limit breaches0Log it when document not added due to quota breach. Log it when url is too long and it gets truncated.
Current value: 0
11ldaBOOL (0 or 1)log debug admin messages0Log various debug messages.
Current value: 0
12ldbBOOL (0 or 1)log debug build messages0
Current value: 0
13ldbtBOOL (0 or 1)log debug build time messages0
Current value: 0
14lddBOOL (0 or 1)log debug database messages0
Current value: 0
15lddmBOOL (0 or 1)log debug dirty messages0
Current value: 0
16lddiBOOL (0 or 1)log debug disk messages0
Current value: 0
17ldpcBOOL (0 or 1)log debug disk page cache0
Current value: 0
18lddnsBOOL (0 or 1)log debug dns messages0
Current value: 0
19ldhBOOL (0 or 1)log debug http messages0
Current value: 0
20ldiBOOL (0 or 1)log debug image messages0
Current value: 0
21ldlBOOL (0 or 1)log debug loop messages0
Current value: 0
22ldgBOOL (0 or 1)log debug language detection messages0
Current value: 0
23ldliBOOL (0 or 1)log debug link info0
Current value: 0
24ldmBOOL (0 or 1)log debug mem messages0
Current value: 0
25ldmuBOOL (0 or 1)log debug mem usage messages0
Current value: 0
26ldnBOOL (0 or 1)log debug net messages0
Current value: 0
27ldqBOOL (0 or 1)log debug query messages0
Current value: 0
28ldqtaBOOL (0 or 1)log debug quota messages0
Current value: 0
29ldrBOOL (0 or 1)log debug robots messages0
Current value: 0
30ldsBOOL (0 or 1)log debug spider cache messages0
Current value: 0
31ldspBOOL (0 or 1)log debug speller messages0
Current value: 0
32ldsccBOOL (0 or 1)log debug sections messages0
Current value: 0
33ldsiBOOL (0 or 1)log debug seo insert messages0
Current value: 0
34ldseoBOOL (0 or 1)log debug seo messages0
Current value: 0
35ldstBOOL (0 or 1)log debug stats messages0
Current value: 0
36ldsuBOOL (0 or 1)log debug summary messages0
Current value: 0
37ldspidBOOL (0 or 1)log debug spider messages0
Current value: 0
38ldspmthBOOL (0 or 1)log debug msg13 messages0
Current value: 0
39dmthBOOL (0 or 1)disable host0 for msg13 reception hack0
Current value: 0
40ldsprBOOL (0 or 1)log debug spider proxies0
Current value: 0
41ldspuaBOOL (0 or 1)log debug url attempts0
Current value: 0
42ldsdBOOL (0 or 1)log debug spider downloads0
Current value: 0
43ldfbBOOL (0 or 1)log debug facebook0
Current value: 0
44ldtmBOOL (0 or 1)log debug tagdb messages0
Current value: 0
45ldtBOOL (0 or 1)log debug tcp messages0
Current value: 0
46ldtbBOOL (0 or 1)log debug tcp buffer messages0
Current value: 0
47ldthBOOL (0 or 1)log debug thread messages0
Current value: 0
48ldtiBOOL (0 or 1)log debug title messages0
Current value: 0
49ldtimBOOL (0 or 1)log debug timedb messages0
Current value: 0
50ldtoBOOL (0 or 1)log debug topic messages0
Current value: 0
51ldtopdBOOL (0 or 1)log debug topDoc messages0
Current value: 0
52lduBOOL (0 or 1)log debug udp messages0
Current value: 0
53ldunBOOL (0 or 1)log debug unicode messages0
Current value: 0
54ldreBOOL (0 or 1)log debug repair messages0
Current value: 0
55ldpdBOOL (0 or 1)log debug pub date extraction messages0
Current value: 0
56ltbBOOL (0 or 1)log timing messages for build0Log various timing related messages.
Current value: 0
57ltadmBOOL (0 or 1)log timing messages for admin0Log various timing related messages.
Current value: 0
58ltdBOOL (0 or 1)log timing messages for database0
Current value: 0
59ltnBOOL (0 or 1)log timing messages for network layer0
Current value: 0
60ltqBOOL (0 or 1)log timing messages for query0
Current value: 0
61ltspcBOOL (0 or 1)log timing messages for spcache0Log various timing related messages.
Current value: 0
62lttBOOL (0 or 1)log timing messages for related topics0
Current value: 0
63lrBOOL (0 or 1)log reminder messages0Log reminders to the programmer. You do not need this.
Current value: 0
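
The log parms above are plain 0/1 switches, so a request that changes several of them at once is just a query string of parm=value pairs. The following is a minimal Python sketch of building such a query string. The helper name build_log_settings_query is hypothetical, the path of the log page is not assumed here, and whether a given parm takes effect immediately is not stated in the table.

```python
from urllib.parse import urlencode


def build_log_settings_query(enable=(), disable=(), fmt="json"):
    """Return a query string that sets the given log parms to 1 or 0.

    `enable` and `disable` are sequences of parm names taken from the
    table above (for example "ldq" or "laq"). Hypothetical helper.
    """
    params = {"format": fmt}
    for name in enable:
        params[name] = 1
    for name in disable:
        params[name] = 0
    return urlencode(params)


# Turn on query and http debug logging, stop logging autobanned queries.
query = build_log_settings_query(enable=["ldq", "ldh"], disable=["laq"])
print(query)  # format=json&ldq=1&ldh=1&laq=0
```

Append the resulting string to the URL of the log page, or send it as a POST body.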



/admin/masterpasswords - master passwords   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3masterpwdsSTRINGMaster PasswordsWhitespace separated list of passwords. Any matching password will have administrative access to Gigablast and all collections.
4masteripsSTRINGMaster IPsWhitespace separated list of IPs. Any IPs in this list will have administrative access to Gigablast and all collections.



/admin/addcoll - add a new collection   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3addcollUNARY CMD (set to 1)add collectionAdd a new collection with this name. No spaces or unusual characters allowed. Max of 64 characters. REQUIRED
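
As an illustration of the name constraints above, here is a minimal Python sketch that validates a collection name and builds the request URL. The base URL http://localhost:8000 and the helper name addcoll_url are assumptions. The table does not say which characters are disallowed, so the check below conservatively assumes only letters, digits, underscore and hyphen are safe.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumption: adjust to your Gigablast host


def addcoll_url(name, base_url=BASE_URL):
    """Build an /admin/addcoll URL after checking the collection name."""
    if len(name) > 64:
        raise ValueError("collection name exceeds 64 characters")
    # Assumed safe character set; the documentation does not define it.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", name):
        raise ValueError("collection name contains spaces or unusual characters")
    query = urlencode({"addcoll": name, "format": "json"})
    return f"{base_url}/admin/addcoll?{query}"


print(addcoll_url("mycoll"))
# http://localhost:8000/admin/addcoll?addcoll=mycoll&format=json
```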



/admin/delcoll - delete a collection   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3delcollUNARY CMD (set to 1)delete collectionDelete the specified collection. You can specify multiple &delcoll= parms in a single request to delete multiple collections at once. REQUIRED
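
Deleting several collections in one request means repeating the delcoll parm. A minimal Python sketch of that, assuming a host at http://localhost:8000 and a hypothetical helper name delcoll_url:

```python
from urllib.parse import urlencode


def delcoll_url(names, base_url="http://localhost:8000"):
    """Build an /admin/delcoll URL with one delcoll parm per collection."""
    names = list(names)
    if not names:
        raise ValueError("at least one collection name is required")
    # doseq=True expands the list into repeated delcoll=... pairs.
    query = urlencode({"delcoll": names, "format": "json"}, doseq=True)
    return f"{base_url}/admin/delcoll?{query}"


print(delcoll_url(["old1", "old2"]))
# http://localhost:8000/admin/delcoll?delcoll=old1&delcoll=old2&format=json
```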



/admin/clonecoll - clone one collection's settings to another   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3cSTRINGcollectionClone settings INTO this collection. REQUIRED
4clonecollUNARY CMD (set to 1)clone collectionClone collection settings FROM this collection. REQUIRED
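
The direction of the clone is easy to get backwards: c is the destination and clonecoll is the source. A minimal Python sketch that makes this explicit with named arguments (the base URL and the helper name clonecoll_url are assumptions):

```python
from urllib.parse import urlencode


def clonecoll_url(source, destination, base_url="http://localhost:8000"):
    """Build an /admin/clonecoll URL copying settings source -> destination."""
    params = {
        "c": destination,     # settings are cloned INTO this collection
        "clonecoll": source,  # settings are cloned FROM this collection
        "format": "json",
    }
    return f"{base_url}/admin/clonecoll?{urlencode(params)}"


print(clonecoll_url(source="template", destination="newcoll"))
# http://localhost:8000/admin/clonecoll?c=newcoll&clonecoll=template&format=json
```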



/admin/rebuild - rebuild data   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3rmeBOOL (0 or 1)rebuild mode enabled0If enabled, gigablast will rebuild the rdbs as specified by the parameters below. When a particular collection is in rebuild mode, it can not spider or merge titledb files.
Current value: 0
4rctrSTRINGcollection to rebuildName of collection to rebuild. REQUIRED
Current value:
5racBOOL (0 or 1)rebuild ALL collections0If enabled, gigablast will rebuild all collections.
Current value: 0
6rmtu INT32 memory to use for rebuild200000000In bytes.
Current value: 200000000
7mrps INT32 max rebuild injections2Maximum number of outstanding injections for rebuild.
Current value: 2
8rfrBOOL (0 or 1)full rebuild1If enabled, gigablast will reinject the content of all title recs into a secondary rdb system. That will replace the primary rdb system when complete.
Current value: 1
9rfrknsxBOOL (0 or 1)add spiderdb recs of non indexed urls0If enabled, gigablast will add the spiderdb records of unindexed urls when doing the full rebuild or the spiderdb rebuild. Otherwise, only the indexed urls will get spiderdb records in spiderdb. This can be faster because Gigablast does not have to do an IP lookup on every url if its IP address is not in tagdb already.
Current value: 0
10rrliBOOL (0 or 1)recycle link text1If enabled, gigablast will recycle the link text when rebuilding titledb. The siterank, which is determined by the number of inlinks to a site, is stored/cached in tagdb so that is a separate item. If you want to pick up new link text you will want to set this to NO and make sure to rebuild titledb, since that stores the link text.
Current value: 1
11rrtBOOL (0 or 1)rebuild titledb0If enabled, gigablast will rebuild this rdb
Current value: 0
12rriBOOL (0 or 1)rebuild posdb0If enabled, gigablast will rebuild this rdb
Current value: 0
13rrclBOOL (0 or 1)rebuild clusterdb0If enabled, gigablast will rebuild this rdb
Current value: 0
14rrspBOOL (0 or 1)rebuild spiderdb0If enabled, gigablast will rebuild this rdb
Current value: 0
15rrldBOOL (0 or 1)rebuild linkdb0If enabled, gigablast will rebuild this rdb
Current value: 0
16ruruBOOL (0 or 1)rebuild root urls1If disabled, gigablast will skip root urls.
Current value: 1
17runruBOOL (0 or 1)rebuild non-root urls1If disabled, gigablast will skip non-root urls.
Current value: 1
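
To show how the rebuild parms fit together, here is a minimal Python sketch that assembles the parms for rebuilding only selected rdbs of one collection. The helper name rebuild_params is hypothetical. Setting rfr=0 for a selective rebuild is an assumption, since the table does not say how the full rebuild flag interacts with the per-rdb flags.

```python
from urllib.parse import urlencode

# Parm names for the per-rdb rebuild switches, taken from the table above.
REBUILD_FLAGS = {
    "titledb": "rrt",
    "posdb": "rri",
    "clusterdb": "rrcl",
    "spiderdb": "rrsp",
    "linkdb": "rrld",
}


def rebuild_params(collection, rdbs, memory_bytes=200_000_000):
    """Return /admin/rebuild parms that rebuild only the named rdbs."""
    unknown = set(rdbs) - set(REBUILD_FLAGS)
    if unknown:
        raise ValueError(f"unknown rdb(s): {sorted(unknown)}")
    params = {
        "rme": 1,              # rebuild mode enabled
        "rctr": collection,    # collection to rebuild
        "rfr": 0,              # assumption: disable full rebuild for a selective one
        "rmtu": memory_bytes,  # memory to use, in bytes
    }
    for rdb, flag in REBUILD_FLAGS.items():
        params[flag] = 1 if rdb in rdbs else 0
    return params


params = rebuild_params("main", ["posdb", "linkdb"])
print(urlencode(params))
```

Remember that a collection in rebuild mode does not spider or merge titledb files, so set rme back to 0 when the rebuild is done.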



/admin/inject - inject url in the index here   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3cSTRINGcollectionInject into this collection. REQUIRED
4urlSTRINGurlSpecify the URL that will be immediately crawled and indexed in real time while you wait. The browser will return the final index status code. Alternatively, use the add url page to add urls individually or in bulk without having to wait for the pages to be actually indexed in realtime. By default, injected urls take precedence over the "insitelist" expression in the url filters so injected urls need not match the patterns in your site list. You can change that behavior in the url filters if you want. Injected urls will have a hopcount of 0. The injection api is described on the api page. Make up a fake url if you are injecting content that does not have one.

If the url ends in .warc or .arc or .warc.gz or .arc.gz Gigablast will index the contained documents as individual documents, using the appropriate dates and other meta information contained in the containing archive file. REQUIRED
5qtsSTRINGquery to scrapeScrape popular search engines for this query and inject their links. You are not required to supply the url parm if you supply this parm.
6injectlinksBOOL (0 or 1)inject links0Should we inject the links found in the injected content as well?
7spiderlinksBOOL (0 or 1)spider links0Add the outlinks of the injected content into spiderdb for spidering?
8newonlyBOOL (0 or 1)only inject content if new0If the specified url is already in the index then skip the injection.
9deleteurlBOOL (0 or 1)delete from index0Delete the specified url from the index.
10recycleBOOL (0 or 1)recycle content0If the url is already in the index, then do not re-download the content, just use the content that was stored in the cache from last time.
11dedupBOOL (0 or 1)dedup url0Do not index the url if there is already another url in the index with the same content.
12urlipIPurl IP0.0.0.0Use this IP when injecting the document. If the IP is unknown, omit this parm or set it to 0.0.0.0. If provided, it will save an IP lookup.
13hasmimeBOOL (0 or 1)content has mime0If the content of the url is provided below, does it begin with an HTTP mime header?
14delimSTRINGcontent delimiterIf the content of the url is provided below, then it consists of multiple documents separated by this delimiter. Each such item will be injected as an independent document. Some possible delimiters: ======== or <doc>. If you set hasmime above to true then Gigablast will check for a url after the delimiter and use that url as the injected url. Otherwise it will append numbers to the url you provide above.
15contenttypeSTRINGcontent typetext/htmlIf you supply content in the text box below without an HTTP mime header, then you need to enter the content type. Possible values: text/html text/plain text/xml application/json
16charset INT32 content charset106A number representing the charset of the content if provided below and no HTTP mime header is given. Defaults to utf8 which is 106. See iana_charset.h for the numeric values.
17contentSTRINGcontentIf you want to supply the URL's content rather than have Gigablast download it, then enter the content here. Enter MIME header first if "content has mime" is set to true above. Separate MIME from actual content with two returns. At least put a single space in here if you want to inject empty content, otherwise the content will be downloaded from the url. This is because the page injection form always submits the content text area even if it is empty, which should signify that the content should be downloaded.
18metadataSTRINGmetadataJson encoded metadata to be indexed with the document.
19sectionsBOOL (0 or 1)get sectiondb voting info0Return section information of injected content for the injected subdomain.
20diffbotreplySTRINGdiffbot replyUsed exclusively by diffbot. Do not use.
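
Injected content can easily exceed 2K, so POST is the natural choice here. The following minimal Python sketch builds (but does not send) a POST request that injects several documents in one call using the delim parm. The base URL, the helper name build_inject_request, and joining the documents with the bare delimiter are assumptions; the exact whitespace Gigablast expects around the delimiter is not specified above.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_inject_request(collection, url, documents, delim="========",
                         base_url="http://localhost:8000"):
    """Build a POST request for /admin/inject carrying several documents."""
    fields = {
        "c": collection,
        "url": url,                     # numbers are appended per document
        "hasmime": 0,                   # content carries no HTTP mime header
        "delim": delim,
        "contenttype": "text/html",
        "charset": 106,                 # utf8
        "content": delim.join(documents),
        "format": "json",
    }
    body = urlencode(fields).encode("utf-8")
    return Request(f"{base_url}/admin/inject", data=body, method="POST")


req = build_inject_request(
    "main", "http://example.com/batch", ["<p>one</p>", "<p>two</p>"]
)
print(req.get_method(), req.full_url)
print(parse_qs(req.data.decode("utf-8"))["content"][0])
```

Pass the request to urllib.request.urlopen to actually perform the injection against a running instance.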



/admin/addurl - add url page for admin   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3cSTRINGcollectionAdd urls into this collection. REQUIRED
4urlsSTRINGurls to addList of urls to index. One per line or space separated. If your url does not index as you expect you can check its spider history by doing a url: search on it. Added urls will have a hopcount of 0. Added urls will match the isaddurl directive on the url filters page. The add url api is described on the api page. REQUIRED
5stripBOOL (0 or 1)strip sessionids1Strip added urls of their session ids.
6spiderlinksBOOL (0 or 1)harvest links1Harvest links of added urls so we can spider them?
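
Because a url list can be short or very long, a client may want to pick GET or POST based on the request size, following the 2K guidance at the top of this page. A minimal Python sketch, assuming "2K" means 2048 bytes of encoded parms and using a hypothetical helper name and base URL:

```python
from urllib.parse import urlencode
from urllib.request import Request

MAX_GET_BYTES = 2048  # assumption: interpretation of the 2K guidance


def build_addurl_request(collection, urls, strip=True, harvest_links=True,
                         base_url="http://localhost:8000"):
    """Build a GET or POST request for /admin/addurl depending on size."""
    query = urlencode({
        "c": collection,
        "urls": "\n".join(urls),          # one url per line
        "strip": int(strip),              # strip session ids
        "spiderlinks": int(harvest_links),
        "format": "json",
    })
    endpoint = f"{base_url}/admin/addurl"
    if len(query) <= MAX_GET_BYTES:
        return Request(f"{endpoint}?{query}", method="GET")
    # urlencode output is pure ASCII, so this encoding cannot fail.
    return Request(endpoint, data=query.encode("ascii"), method="POST")


small = build_addurl_request("main", ["http://example.com/"])
big = build_addurl_request(
    "main", [f"http://example.com/page{i}" for i in range(200)]
)
print(small.get_method(), big.get_method())  # GET POST
```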



/admin/reindex - query delete/reindex   [ show parms in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.
3cSTRINGcollectionQuery reindex in this collection. REQUIRED
4qSTRINGquery to reindex or deleteWe either reindex or delete the search results of this query. Reindexing them will redownload them and possibly update the siterank, which is based on the number of links to the site. This will add the url requests to the spider queue so ensure your spiders are enabled. REQUIRED
5srn INT32 start result number0Starting with this result #. Starts at 0.
6ern INT32 end result number99999999Ending with this result #. 0 is the first result #.
7qlangSTRINGquery languageenThe language the query is in. Used to rank results. Just use xx to indicate no language in particular. But you should use the same qlang value you used for doing the query if you want consistency.
8qrecycleBOOL (0 or 1)recycle content0If you check this box then Gigablast will not re-download the content, but use the content that was stored in the cache from last time. Useful for rebuilding the index to pick up new inlink text or fresher sitenuminlinks counts which influence ranking.
9forcedelBOOL (0 or 1)FORCE DELETE0Check this checkbox to delete the results, not just reindex them.
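
Since forcedel turns a reindex into a deletion, it can help to make that choice explicit in client code. A minimal Python sketch that assembles the parms for this page, with a hypothetical helper name reindex_params and defaults copied from the table above:

```python
from urllib.parse import urlencode


def reindex_params(collection, query, start=0, end=99999999, qlang="en",
                   recycle=False, force_delete=False):
    """Return /admin/reindex parms for reindexing or deleting query results."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    return {
        "c": collection,
        "q": query,
        "srn": start,                   # first result number, starts at 0
        "ern": end,                     # last result number
        "qlang": qlang,                 # use xx for no particular language
        "qrecycle": int(recycle),       # reuse cached content if 1
        "forcedel": int(force_delete),  # delete instead of reindex if 1
        "format": "json",
    }


# Reindex only the first ten results of a query, reusing cached content.
params = reindex_params("main", "old news", end=9, recycle=True)
print(urlencode(params))
# c=main&q=old+news&srn=0&ern=9&qlang=en&qrecycle=1&forcedel=0&format=json
```

The reindex only proceeds if spiders are enabled, because the urls are added to the spider queue.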



/admin/stats - general statistics   [ show parms in xml or json ]   [ show status in xml or json ]

Input
#ParmTypeTitleDefault ValueDescription
1 formatSTRINGoutput formathtmlDisplay output in this format. Can be html, json or xml.
2 showinputBOOL (0 or 1)show input and settings1Display possible input and the values of all settings on this page.