Forum Moderators: bakedjake
cache facility
"Gigablast uses its cached web pages to generate its dictionary instead of the query logs. When a word or phrase is not found in the dictionary, Gigablast replaces it with the closest match in the dictionary. If multiple words or phrases are equally close, then Gigablast resorts to a popularity ranking."
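A rough sketch of how the behaviour in that quote could work: build the dictionary from page text, pick the closest entry, and break ties by popularity. Everything here is an assumption for illustration, not Gigablast's actual code. In particular, "closest" is taken to mean plain Levenshtein edit distance and "popularity" to mean how often a word occurs in the cached pages. The names build_dictionary and suggest are made up.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def build_dictionary(pages):
    """Count word occurrences across page texts (stand-in for cached pages)."""
    counts = Counter()
    for text in pages:
        counts.update(text.lower().split())
    return counts


def suggest(word, dictionary):
    """Return the word itself if known, else the closest dictionary entry.

    Ties on distance go to the more frequent word (assumed popularity measure).
    Raises ValueError if the dictionary is empty.
    """
    word = word.lower()
    if word in dictionary:
        return word
    return min(dictionary,
               key=lambda w: (edit_distance(word, w), -dictionary[w]))


if __name__ == "__main__":
    pages = ["the cat sat on the mat", "the cat ran", "a bat flew"]
    d = build_dictionary(pages)
    print(suggest("zat", d))  # cat, sat, mat and bat all tie on distance
```

Scanning the whole dictionary per query is fine for a toy like this. A real engine would need an index over candidates, which the quote says nothing about.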
And
additional file type indexing
"... PostScript (.ps), PowerPoint (.ppt), Excel SpreadSheet (.xls) and Microsoft Word (.doc) support in addition to the PDF support. Woo-hoo."
Building up quite a list
Rich
I am planning on purchasing the hardware required for achieving a 5 billion document index within the next 12 months.
source: Matt Wells [gigablast.com]
March last year he wrote:
my current setup only goes to about 200-250 million
Message 39 in the thread GigaBlast Part 3 [webmasterworld.com]
It mapped each letter to its most likely typo, and whether the typo was a substitution, deletion or insertion of letters, i.e. some letters are used in misspelled words more often than others and they usually follow a pattern. Will have to look that one up again...
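Since the original source isn't named here, this is only a guess at what such a letter-to-typo table might look like. It assumes you have a list of (correct, misspelled) pairs that differ by exactly one edit, and it tallies per letter which kind of slip shows up most. The names single_edit, build_typo_table and most_likely_typo are hypothetical, and the sample pairs are invented.

```python
from collections import Counter, defaultdict


def single_edit(correct, typo):
    """Classify a one-edit difference between two words.

    Returns (kind, intended_letter, typed_letter), or None when the words
    are identical or differ by more than one edit.
    """
    if correct == typo:
        return None
    i = 0  # length of the common prefix
    while i < min(len(correct), len(typo)) and correct[i] == typo[i]:
        i += 1
    if len(correct) == len(typo):
        if correct[i + 1:] == typo[i + 1:]:
            return ("substitution", correct[i], typo[i])
    elif len(correct) == len(typo) + 1:
        if correct[i + 1:] == typo[i:]:
            return ("deletion", correct[i], "")
    elif len(typo) == len(correct) + 1:
        if typo[i + 1:] == correct[i:]:
            return ("insertion", "", typo[i])
    return None


def build_typo_table(pairs):
    """Map each letter to a Counter of (kind, typed_letter) observations."""
    table = defaultdict(Counter)
    for correct, typo in pairs:
        edit = single_edit(correct, typo)
        if edit is None:
            continue  # skip pairs that are not exactly one edit apart
        kind, intended, typed = edit
        letter = intended or typed  # insertions are keyed by the extra letter
        table[letter][(kind, typed)] += 1
    return table


def most_likely_typo(table, letter):
    """Most frequent (kind, typed_letter) seen for a letter, or None."""
    if letter not in table:
        return None
    return table[letter].most_common(1)[0][0]


if __name__ == "__main__":
    pairs = [("cat", "cst"), ("hat", "hst"), ("map", "mqp"),
             ("letter", "leter"), ("the", "thee")]
    table = build_typo_table(pairs)
    print(most_likely_typo(table, "a"))
```

Keying insertions by the extra letter is just one choice. You could equally key them by the letter that comes before it.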
I suppose "cracking" a spelling algo is just as worthy as cracking an SE algo, a la typo domains, misspelled words etc...
wonders if reverse engineering a spelling algo is worth it...