
Crawl Bandwidth Costs?


Brett_Tabke

1:54 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



<random thought>

I wonder how much each crawl costs Google. Half a million? More?

</random thought>

AthlonInside

1:57 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Maybe it's free; they exchange their search with AOL or Earthlink. LOL

Brett_Tabke

1:58 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Google uses Exodus, like most of the engines.

IanTurner

2:52 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Hmm, $0.5 million. That would be enough to buy a 155Mb pipe for a year.

IanTurner

2:59 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Hmm, I reckon it would take about 20 days to index the entire 3 billion web pages on a 155Mb pipe. So with repeat requests you are probably looking at 3 or 4 155Mb pipes, which puts it at about $2 million per year.
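Ian's back-of-envelope figure checks out. A quick sanity check, assuming an average page size of roughly 10 KB and full pipe utilization (neither number is stated in the thread):

```python
# Sanity check of the "20 days on a 155Mb pipe" estimate.
# Assumptions (not from the thread): ~10 KB average page, pipe fully utilized.
PAGES = 3_000_000_000            # 3 billion pages
AVG_PAGE_BYTES = 10 * 1024       # assumed average page size
PIPE_BITS_PER_SEC = 155_000_000  # 155 Mb/s pipe

total_bits = PAGES * AVG_PAGE_BYTES * 8
seconds = total_bits / PIPE_BITS_PER_SEC
days = seconds / 86_400
print(f"{days:.0f} days for one full crawl")  # -> 18 days for one full crawl
```

With a 10 KB average page the crawl lands at about 18 days, close to Ian's estimate; a larger average page size pushes it past 20.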

yetanotheruser

3:17 pm on Mar 28, 2003 (gmt 0)

10+ Year Member



I know how much it costs me! ;) Must cost you quite a bit too Brett... :)

I take it then the bots aren't distributed?

lazerzubb

3:23 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah, I've wondered about that too. I wonder if they use one fat line for each datacenter, or a lot of 155Mb lines. Not sure what this is for either, but maybe it's a clue: [speed-measure.google.com...] :) Maybe an OC-48 line?

Old thread regarding same subject [webmasterworld.com]

Also C&W Network map [www1.cw.com]

[edited by: lazerzubb at 3:40 pm (utc) on Mar. 28, 2003]

Brett_Tabke

3:27 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



When you consider Freshbot activity, the fact they often do two crawls a month, and that they may crawl 5-7 billion pages each crawl, I think you are under by about 5-10 fold, Ian.

borisbaloney

3:45 pm on Mar 28, 2003 (gmt 0)

10+ Year Member



What about the total costs to webmasters? Google gets bulk discounts, but assuming it costs about 1/5 of a cent per page per year in extra hosting costs to serve the Google spiders, that means a cool 60 million bucks all up.

Like with spam, a little on the receiver's end really adds up when you look at the community as a whole.

Acknowledged, my estimates are only estimates, and people want their pages spidered.
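The arithmetic behind the $60 million figure works out if the per-page cost applies to each spider fetch and each page is fetched roughly ten times a year (the crawl multiplier is an assumption, consistent with the repeat-crawl activity discussed above):

```python
# Rough community-wide cost of serving googlebot, per the figures above.
# The fetches-per-year multiplier is an assumption used to reconcile
# "1/5 of a cent per page" with the $60M total.
COST_PER_FETCH = 0.002            # 1/5 of a cent per page served
PAGES = 3_000_000_000             # 3 billion pages in the index
FETCHES_PER_PAGE_PER_YEAR = 10    # assumed: repeat deep and fresh crawls

total = PAGES * FETCHES_PER_PAGE_PER_YEAR * COST_PER_FETCH
print(f"${total / 1e6:.0f} million per year")  # -> $60 million per year
```

A single crawl of 3 billion pages at that rate would come to only $6 million, so the $60M figure implicitly assumes the kind of repeated crawling Brett describes.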

yetanotheruser

4:07 pm on Mar 28, 2003 (gmt 0)

10+ Year Member



Surely a big chunk of the 5-7 billion pages are just 304s. GB isn't greedy enough to download the whole internet, right? They must use their cache whenever they can..?

Beachboy

4:24 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've always believed the cost of a crawl is a factor in why other engines don't crawl as often or as thoroughly as they should. Inktomi and AV come to mind.

AhmedF

4:32 pm on Mar 28, 2003 (gmt 0)

10+ Year Member



If I remember right, Slashdot had some info on this. They had gigabit connections BETWEEN their datacenters in the US, and I think they had multiple T3s externally.

IanTurner

4:37 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I hadn't factored in Freshbot, I was looking at the cost of a deepcrawl.

(I would guess that Freshbot doesn't even double the overall cost, as the majority of crawled pages will be below PageRank 4 and beneath Freshbot's notice.)

jomaxx

5:07 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Re 304s: I just took a look at my logs for yesterday. The overall result (mostly Freshbot activity) was 6,000 page requests, and only about 5% returned 304s. All but a handful of my pages would have been unchanged for at least a week, so IMO Google could be much more aggressive in using its cache.

yetanotheruser

6:21 pm on Mar 28, 2003 (gmt 0)

10+ Year Member



jomaxx: surely whether a 304 is given is up to your server. If your server is returning 200s, GB can't be blamed for downloading the file. Or have I missed the point?

Thing is, much of the 3B pages is content that doesn't change too often, I think, which would mean Google only has to check for changes.

jomaxx

7:03 pm on Mar 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not an expert on the HTTP protocol, but I believe the decision whether to send a 304 is essentially based on the contents of the If-Modified-Since header in the user's request; i.e. if the request doesn't carry that header, the server has no choice but to return the full page.

There is SOME caching going on, so my interpretation of what's happening is that Google is probably not merging the caches from all its different data centers.
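The mechanics jomaxx describes can be sketched server-side. A minimal illustration (the function name is illustrative, not from any real server), assuming dates in the standard HTTP-date format:

```python
# Sketch of the server-side If-Modified-Since check described above.
# If the client sent no If-Modified-Since header, the server has no
# choice but to return the full page (200); otherwise it can answer
# 304 Not Modified when the resource hasn't changed since that date.
from email.utils import parsedate_to_datetime

def response_status(if_modified_since, last_modified):
    """Both arguments are HTTP-date strings; if_modified_since may be None."""
    if if_modified_since is None:
        return 200  # no conditional header in the request: send the full page
    ims = parsedate_to_datetime(if_modified_since)
    lm = parsedate_to_datetime(last_modified)
    return 304 if lm <= ims else 200

print(response_status(None, "Fri, 28 Mar 2003 12:00:00 GMT"))   # 200
print(response_status("Fri, 28 Mar 2003 12:00:00 GMT",
                      "Thu, 27 Mar 2003 12:00:00 GMT"))          # 304
```

So a crawler that remembers when it last fetched a page and sends that date back gets a cheap 304 for unchanged pages, which is exactly the caching behavior being debated here.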

yetanotheruser

7:29 pm on Mar 28, 2003 (gmt 0)

10+ Year Member



That seems to make sense. I'm certainly not an HTTP expert, so I'll take your word for it :)

Having had a look at Freshbot's efforts today, though: he's been getting 304s all day, so I would assume that's just the headers and not the whole page?