Even though you decided to end the discussion about the number of machines, I'll take the liberty of adding something to it. You are not allowed to reply. ;-)
In 2002, c't magazine (a well-known German computer magazine) published an interview with Google Fellow Urs Hölzle. The article is in German and available pay-per-click only, but I'll sticky it to you if you like. He said that they had 10k machines at that point. The blog article you mention references an IEEE article from 2003 by three G employees in which the number of machines is said to be 15k. 10k in 2002, 15k in 2003 - let's say they are now at 20k.
>4 bln pages * 40KB * 8 bits / 86,400 secs in day = 15 GigaBit / second
Yes, 16 Gbit/s is not reasonable. In the above-mentioned interview, Hölzle said that the index turnaround is 28 days, with 3 billion URLs in the index at that time. Feeding 3 billion URLs into your calculation gives about 12 Gbit/s if everything is fetched in a single day. Spread over 28 days and 5 DCs, that comes to about 90 Mbit/s per DC, a more reasonable number. We also need to take into account that the DCs must replicate the index among each other and that there is query traffic. I admit that your and Brett's theory is more likely than mine.
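For anyone who wants to replay the arithmetic above, here is a minimal sketch. It assumes 40KB means 40,000 bytes per page and that the load is split evenly across data centers; the function name and parameters are made up for illustration. The unrounded results come out a little below the rounded figures quoted in the post.

```python
SECONDS_PER_DAY = 86_400

def crawl_bandwidth_bps(pages, avg_page_bytes, crawl_days=1, data_centers=1):
    """Sustained bits per second needed to fetch `pages` pages of
    `avg_page_bytes` each within `crawl_days`, split evenly across
    `data_centers` (the even split is an assumption)."""
    total_bits = pages * avg_page_bytes * 8
    return total_bits / (crawl_days * SECONDS_PER_DAY) / data_centers

# Quoted estimate: 4 billion pages of 40 KB fetched in a single day.
one_day_4bn = crawl_bandwidth_bps(4_000_000_000, 40_000)

# Interview figures: 3 billion URLs, 28-day turnaround, 5 data centers.
one_day_3bn = crawl_bandwidth_bps(3_000_000_000, 40_000)
per_dc_28_days = crawl_bandwidth_bps(3_000_000_000, 40_000,
                                     crawl_days=28, data_centers=5)

print(f"4 bln pages in one day: {one_day_4bn / 1e9:.1f} Gbit/s")
print(f"3 bln pages in one day: {one_day_3bn / 1e9:.1f} Gbit/s")
print(f"3 bln pages, 28 days, 5 DCs: {per_dc_28_days / 1e6:.1f} Mbit/s per DC")
```

Replication between DCs and query traffic are not modelled here, so the per-DC figure is only a lower bound on what each site would need.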
Still, I get the majority of my offsite referrers from Google, either normal or image search.
Just to contradict what y'all are saying...
Muskie
Yet all I see is MSN Bot and a variety of browsers in my top agents list... Maybe Google Bot visited a little this month, as Web Analysiser only shows the top ten agents, but the fact remains that for my website, a sample size of one, GoogleBot is not "running hard" and in fact month in month out, MSN Bot has made it a point to completely crawl my site.
Muskie
My site has been mostly stagnant since I've gone back to school. However, if Google was re-indexing and deep crawling sites, surely it would crawl mine, given that I even manage to be number 1 for my own minuscule and unimportant keyword?
Lots of folks can rank #1 very quickly (nearly overnight) for uncompetitive keywords; I've done it many times. Not sure of your point there. Do you think that just because you rank #1 for a keyword, your site is important to Google? I've never gotten that impression from what I've read.
Boxes are all fixed low cost; however, bandwidth is still relatively expensive, especially when you require THAT MUCH - 15 Gigabits sustained over a period of time - this is a very high load and it's not cheap. Google's normal internal SLA is to crawl it all over a few months, so the costs of doing so are far lower - they only need 166 Mbits to achieve the same results (15,000/90).
You mean to tell me that Google would spend nearly a half billion dollars on 100,000 servers but go on the cheap with bandwidth?
OC768 - 39.8 Gbps and all this stuff has burst speeds. They don't even need a pipe that big with multiple centers...
I'm also not sure why you think that hosting 100,000 servers is all "fixed low cost." If you're right, their electric bills alone are in the $5 million per month range (the servers plus cooling), let alone the cost to maintain all those machines.
>OC768 - 39.8 Gbps and all this stuff has burst speeds. They don't even need a pipe that big with multiple centers...
BillyS - the argument is not about what pipes they have and how much they cost. Empirical and historical evidence suggests that indexers are not the bottleneck, and it's the crawlers that take longer to execute. The exact speed is irrelevant, as I was trying to answer the original question, which, if you don't mind me quoting, is as follows:
>Because no index needs to be built up, the crawl is much faster than usual.
If crawling is the bottleneck (which I think is the case), then the original question is answered as "no, not running indexers won't make any difference to crawling speed, provided there are enough URLs to crawl". This is all I was arguing about; I went into the possible specifics of what infrastructure they use to show what the most likely bottleneck is.
And believe it or not - even with billions of dollars there is always a bottleneck, simply because one subsystem runs faster than the other for whatever reason.
>My site has been mostly stagnant since I've gone back to school
Hey Muskie, when was the last time you updated your site? If you give googlebot something to chew on, she will return time after time. msnbot and yahoo are also spidering very hard and have been for quite some time.
>GoogleBot is not "running hard" and in fact month in month out, MSN Bot has made it a point to completely crawl my site
Believe me, googlebot is running very fast indeed. I have not seen this type of spidering since the google dance ended and the update was spread out over the month as a continuous update instead of the last 7 - 10 days of each month. It's too early to tell what the outcome will be, but I am sure we will all find out in the very near future.
In my case Google spidered around 25,000 pages out of total 50,000, however our Google index has increased from about 7,000 to about 8,0000. Roughly...
When you say that googlebot visited 25000 pages, do you mean it visited your site 25000 times, or do you know that it visited 25000 separate pages?
I also have a site where google "loses" a percentage of indexed pages every month but despite the fact that it visits my site >50000 times a month I am not convinced that it visits every individual page.
I am going to analyse my logs in more detail.
Actually, it visited probably, like, 100 times in the span of 2 weeks. It produced 25K hits to all pages. I would say, judging by the logs, about 15-20% of the hits are duplicates (i.e. hitting robots.txt 200 times per day, or index.html about 20 times), with the rest of the hits being unique.
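If it helps with the log analysis, here is a rough sketch of how the hits vs. unique pages split could be counted. It assumes Apache-style combined log lines; the sample lines, the `summarize_bot_hits` name and the regular expression are illustrative only and would need adjusting to your own log format.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user agent fields of a
# combined-format log line (an assumed format, not taken from the thread).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def summarize_bot_hits(lines, agent_substring="Googlebot"):
    """Count total hits, distinct paths and repeat hits for one crawler."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and agent_substring in match.group("agent"):
            hits[match.group("path")] += 1
    total = sum(hits.values())
    return {
        "total_hits": total,
        "unique_pages": len(hits),
        "duplicate_hits": total - len(hits),
        "most_requested": hits.most_common(3),
    }

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
MSNBOT_UA = "msnbot/0.3 (+http://search.msn.com/msnbot.htm)"

# Made-up sample lines, for illustration only.
SAMPLE = [
    f'66.249.66.1 - - [01/Jan/2005:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [01/Jan/2005:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [01/Jan/2005:11:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.65.7 - - [01/Jan/2005:11:00:09 +0000] "GET /page2.html HTTP/1.1" 200 4096 "-" "{GOOGLEBOT_UA}"',
    f'207.46.98.5 - - [01/Jan/2005:11:05:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "{MSNBOT_UA}"',
]

print(summarize_bot_hits(SAMPLE))
```

Feeding it the lines of a real access log (for example `open("access.log")`, a placeholder file name) would show whether 25K hits really means anything close to 25K separate pages.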
>Our crawler analyzes the content of webpages in our index to determine the search queries for which they're most relevant.
I'm guessing that says something about the relationship betwixt crawler and indexer, although I don't know what.
I still believe they are having problems and are frantically trying to sort things out. I'm having a hard time believing there is any other reason for two deep crawls without a major update.
From recent posts in that thread it appears that the redirect/hijack problem is being fixed. The root of the problem was page contents being "credited" to redirected URLs: in that case, the index may not have contained the data necessary to re-link hijacked content with its proper URL, necessitating a complete crawl to rebuild the index.
I've been monitoring this issue from yesterday evening.
It doesn't look like Google or MSN is crawling this way.
Some other network is trying to flood websites under the names of Google and MSN.
The reason I'm saying this is the pattern in which the hits are generated, which is similar for both user agents.
Google never uses the user agent below:
Google User Agent:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
IP range with subnet mask:
66.249.66.0 255.255.255.0
66.249.65.0 255.255.255.0
Today some of my websites are being pounded with hits by
MSN User Agent:
msnbot/0.3 (+http://search.msn.com/msnbot.htm)
207.46.98.0
Have any network admins seen the same pattern?
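For admins who want to compare the source addresses in their own logs against the two ranges with subnet masks listed above, a small sketch using Python's standard `ipaddress` module might look like this. The ranges are only the ones quoted in this post, not an authoritative list of crawler addresses, and the function name is made up for illustration.

```python
import ipaddress

# The two ranges quoted in the post, written as network/netmask.
# This is not a complete or authoritative list of crawler addresses.
LISTED_NETWORKS = [
    ipaddress.ip_network("66.249.66.0/255.255.255.0"),
    ipaddress.ip_network("66.249.65.0/255.255.255.0"),
]

def in_listed_ranges(ip, networks=LISTED_NETWORKS):
    """Return True if `ip` falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)

# Example addresses chosen for illustration.
for candidate in ("66.249.66.17", "66.249.65.200", "66.249.64.5"):
    print(candidate, in_listed_ranges(candidate))
```

A hit that carries one of the user agents above but comes from an address outside the ranges you expect would be worth a closer look.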