
What is a billion? (3.3 billion pages in Google, 2004)

Billion = 1000...00?

Bambarbia

6:40 pm on May 9, 2007 (gmt 0)



I am still trying to understand what a billion is:

The supplemental collection of pages has been collected from the web just like the 3.3 billion pages in Google's main index.

Is it 3,300,000,000 (USA), or 3,300,000,000,000 (UK)?

Thanks

tedster

6:41 pm on May 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No doubt it's the US meaning of "billion" (10^9).

Bambarbia

7:20 pm on May 9, 2007 (gmt 0)



10,000,000,000 is too small for BigDaddy!

Delay between subsequent fetches: 2.5 seconds

Pages per day fetched by a single (and very lazy) thread: 86,400 / 2.5 = 34,560

Pages per day crawled by a single very cheap process(or) with 100 threads: 34,560 * 100 = 3,456,000

Days to crawl the whole web (10,000,000,000 pages) with a single process: 10,000,000,000 / 3,456,000 ≈ 2,894

Computers needed for whole-web monthly recrawling:
2,894 / 30 ≈ 96
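
A minimal Python sketch of the same back-of-the-envelope numbers (the 10-billion-page web size is the assumption from the line above, not an established figure):

# Back-of-the-envelope crawl throughput (all figures from the post above)
SECONDS_PER_DAY = 86_400
FETCH_DELAY = 2.5                 # seconds between subsequent fetches per thread
THREADS = 100                     # threads per process
WEB_SIZE = 10_000_000_000         # assumed size of the web, in pages

pages_per_thread_day = SECONDS_PER_DAY / FETCH_DELAY      # 34,560
pages_per_process_day = pages_per_thread_day * THREADS    # 3,456,000
days_for_one_process = WEB_SIZE / pages_per_process_day   # ~2,894
machines_for_monthly_recrawl = days_for_one_process / 30  # ~96

print(round(days_for_one_process), round(machines_for_monthly_recrawl))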

Google has at least 10,000 crawlers (hardware, CPUs)...

100 threads with a 2.5-second delay can run concurrently on a single CPU, handling crawl + outlink extraction + new crawl + ... + indexing; replication also takes time, so my calculations may be off by some percentage...

However, the difference between 96 and 10,000 is huge, isn't it?

It could be 100 crawlers, 100 parsers, 100 indexers, 100 PageRank calculators, and 100 replicators, and it could still be fewer than 1,000 machines in total. Is BigDaddy really big? 100 million websites, 1,000 pages per site (on average)... 100 billion pages...
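
The same arithmetic for this 100-billion-page scenario, as a sketch that reuses the per-process throughput from the earlier calculation:

# 100 million sites x 1,000 pages/site = 100 billion pages
TOTAL_PAGES = 100_000_000 * 1_000
PAGES_PER_PROCESS_DAY = 3_456_000   # from the earlier calculation

machines = TOTAL_PAGES / (PAGES_PER_PROCESS_DAY * 30)  # machines for a monthly recrawl
print(round(machines))                                 # ~965, still under 1,000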

g1smd

12:59 pm on May 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I suspect that a single process pulls data from several different sites simultaneously... and there are hundreds of Googlebots scouring the web.

Bambarbia

6:44 pm on May 10, 2007 (gmt 0)



I agree.
A single process can grab 1,000 pages per second from 2,500 different sites concurrently (which gives us a "virtually lazy thread" per site, grabbing pages from one site only, with a delay of 2.5 seconds).
Again, by the same simple (or similar) calculation, we could build an index of the same size with only 100 "average" computers... within just a month.

(A multithreaded process with 2,500 lazy threads beats a single-threaded one simply because there is always network delay, such as 1.5-2 seconds of response time; let's skip the technical jargon... the calculations remain the same.)
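
A minimal sketch of this "many lazy threads" idea in Python asyncio (the hosts are hypothetical, and the fetch is simulated with a sleep standing in for the 1.5-2 seconds of network latency mentioned above):

import asyncio

FETCH_DELAY = 2.5   # politeness delay between fetches to the same host
NET_LATENCY = 1.5   # simulated response time; overlaps freely across hosts

async def crawl_host(host, pages):
    # One "virtually lazy thread" per host: one fetch every 2.5 seconds.
    for page in pages:
        await asyncio.sleep(NET_LATENCY)    # stand-in for the real HTTP fetch
        print(f"fetched {host}/{page}")
        await asyncio.sleep(FETCH_DELAY - NET_LATENCY)  # wait out the rest of the delay

async def main():
    # 10 hosts here for the demo; scale to 2,500 for ~1,000 pages/second overall
    hosts = [f"site{i}.example" for i in range(10)]
    await asyncio.gather(*(crawl_host(h, ["a.html", "b.html"]) for h in hosts))

asyncio.run(main())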