Forum Moderators: open
He adds that Teoma is focused mainly on English language content at the moment -- so the perceived smaller size of Teoma may not be an issue for English speakers. Subtract non-English language pages from Teoma's competitors, and the size differences may be much less. "Comparatively speaking, I would argue that we are very close to Google's size in English," Gardi said.
Gardi said Teoma's ideal index size will be between 350 million and 500 million pages.
[theregister.co.uk...]
I would be interested as to why the plan has changed.
Paul Gardi says that they concentrate on English, but what he means is that they concentrate on algorithmically processing English.
It is obvious that they are indexing non-English websites and that these websites are a part of the 1.5 billion within their index.
I feel it may be a stretch to say that if you remove the non-English pages from Google you will have an index close to Teoma's, as Teoma has an unknown but probably significant number of non-English pages in its own database.
If you remove the non-English pages from Teoma, you'll probably be left with the aforementioned "350 million and 500 million pages."
How many people haven't searched for something and gotten no results?
The inventory is simply not big enough yet.
from NFFC's Register article: [theregister.co.uk]
Paul Gardi 29/03/2002 :
"Gardi said Teoma's ideal index size will be between 350 million and 500 million pages. 'There are not 2 billion useful pages on the web,' he said."
Sure, with 500 million pages, you can probably satisfy 97% of the queries in English with relevant results. I doubt, however, that their secret sauce is so refined that it can establish that I as a searcher find those 1.5 billion other pages useless. The 1.5 billion hardly ever turn up in search results at Google anyway. They are probably buried by a lack of inbound links, yet one of them might provide me with a page for an eight-word search query on something obscure. My innocent self will not believe that 75% of the web is either useless, duplicate, or spam.
Currently, Teoma is aware of about 900 million URLs, crawls about 400 million, but cuts out duplicates and spam to end up with a dataset of about 200 million fully indexed pages.
I wonder what, besides robots.txt, makes Teoma decide not to crawl those 500 million uncrawled URLs (56%).
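For anyone checking the math, here's a quick sketch of that crawl funnel using the approximate figures quoted above (900 million known, 400 million crawled, 200 million indexed -- all rounded numbers from the post, not official stats):

```python
# Teoma's crawl funnel, using the rough figures quoted above
known = 900_000_000    # URLs Teoma is aware of
crawled = 400_000_000  # URLs actually crawled
indexed = 200_000_000  # pages left after removing duplicates and spam

uncrawled = known - crawled
print(f"uncrawled: {uncrawled:,} ({uncrawled / known:.0%} of known URLs)")
print(f"indexed:   {indexed:,} ({indexed / crawled:.0%} of crawled pages)")
```

So roughly 56% of the URLs Teoma knows about never get crawled, and only half of the crawled pages survive the duplicate/spam filter.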