|Gardi said Teoma's ideal index size will be between 350 million and 500 million pages. |
I would be interested to know why the plan has changed.
I'm not sure about the veracity of his statement. For instance, if you do a search for "diario de hoy el salvador", Teoma correctly serves up the URL for the Salvadoran newspaper of that name, with a snippet from the website.
Paul Gardi says that they concentrate on English, but what they mean is that they concentrate on algorithmically processing English.
It is obvious that they are indexing non-English websites and that these websites are a part of the 1.5 billion within their index.
I feel that it may be a stretch to say that if you remove the non-English pages from Google you will have an index close to Teoma's, as Teoma has an unknown but probably significant number of non-English pages in its own database.
If you remove the non-English pages from Teoma, you'll probably be left with the aforementioned "350 million and 500 million pages."
Hey, does anybody else have a comment about this or are you all shaking in your boots speculating about Google?
There are other search engines out there. Let's hear your input on this.
In my opinion the current web is still too small for any search engine to show decent results for multi-word queries, even in English.
Who has not searched for certain things and gotten no results?
The inventory is simply not big enough yet.
from NFFC's Register article: [theregister.co.uk]
Paul Gardi 29/03/2002 :
|Gardi said Teoma's ideal index size will be between 350 million and 500 million pages. "There are not 2 billion useful pages on the web," he said. |
Sure, with 500 million pages, you can probably satisfy 97% of the queries in English with relevant results. However, I doubt their secret sauce is so delicate that it can establish that I, as a searcher, would find those 1.5 billion other pages useless. The 1.5 billion almost never turn up in search results at Google anyway. They are probably buried by a lack of inbound links, yet they might provide me with a page on an eight-word search query for something obscure. My innocent self will not believe that 75% of the web is either useless, duplicate, or spam.
|Currently, Teoma is aware of about 900 million URLs, crawls about 400 million, but cuts out duplicates and spam to end up with a dataset of about 200 million fully indexed pages |
I wonder what, besides robots.txt, makes Teoma decide not to crawl those 500 million uncrawled URLs (56%).
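For anyone who wants to check the percentages, here is a rough sketch of the arithmetic using only the figures quoted above. The variable names are mine, and the 75% check assumes a 2 billion page index set against a 500 million page one, as in the Gardi quote.

```python
# Figures quoted from the Register article, in millions of pages.
known_urls = 900   # URLs Teoma is "aware of"
crawled = 400      # URLs it actually crawls
indexed = 200      # pages left after cutting duplicates and spam

uncrawled = known_urls - crawled
uncrawled_share = uncrawled / known_urls        # share of known URLs never crawled
dropped_share = (crawled - indexed) / crawled   # share of crawled pages discarded
indexed_share = indexed / known_urls            # share of known URLs fully indexed

# Assumption: 2,000 million pages total versus a 500 million page index.
excluded_share = (2000 - 500) / 2000

print(f"uncrawled: {uncrawled}M ({uncrawled_share:.0%} of known URLs)")
print(f"dropped after crawl: {dropped_share:.0%} of crawled pages")
print(f"fully indexed: {indexed_share:.0%} of known URLs")
print(f"left out of a 500M index: {excluded_share:.0%} of 2 billion")
```

So the 56% and 75% figures hold up, and only about one in five of the URLs Teoma knows about ends up fully indexed.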
Size matters but it is also important what you do with it :) For a search with Ask Jeeves on aclientskeyphrasehere I get 5 sponsored search results: 2 are eBay, one is an eBay affiliate, and two are dealers who don't sell the products. As for the listings that follow, 7 are in the Google top 10 (but in a different order) and the other 3 are from the Google 50-150 range. Some people must use it though, because it represents 0.00001% of this client's referrer logs.
I wonder how many they would have if you eliminated the duplicate content. They are listing some of our domains that are parked and pointed to our main site.
|"Comparatively speaking, I would argue that we are very close to Google's size in English," Gardi said. |
"...and since the web is growing so slowly, it doesn't matter that we are months behind Google in adding sites."