|Anyone read a good book recently? |
Good point - I quit my full time job to do just that (among other things) :)
Although you decided to end the discussion about the number of machines, I'll take the liberty of adding something to it. You are not allowed to reply. ;-)
In 2002 c't magazine (a well-known German computer magazine) published an interview with Google fellow Urs Hölzle. The article is in German and available pay-per-click only, but I'll sticky it to you if you like. He said that they had 10k machines at that point. The blog article you mention references an IEEE article from 2003 by three G employees in which the number of machines is said to be 15k. 10k in 2002, 15k in 2003 - let's say they are now at 20k.
>4 bln pages * 40KB * 8 bits / 86,400 secs in day = 15 GigaBit / second
Yes, 16 Gbit/s is not reasonable. In the above-mentioned interview, Hölzle said that the index turnaround was 28 days, with 3 billion URLs in the index at that time. Feeding 3 billion URLs into your calculation, one gets about 12 Gbit/s. Spread this over 28 days and 5 DCs, and one gets about 90 Mbit/s, a more reasonable number. We also need to take into account that the DCs must replicate the index among each other and that there is query traffic. I admit that your and Brett's theory is more likely than mine.
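For anyone who wants to check the arithmetic, here is a rough Python sketch of the calculation. The 40 KB average page size, the 28-day cycle and the 5 data centers are the figures from this thread; treating 1 KB as 1,000 bytes and splitting the load evenly across data centers are my own assumptions, and the function name is made up for illustration. The exact results come out a little lower than the rounded numbers quoted above.

```python
SECONDS_PER_DAY = 86_400


def crawl_bandwidth_bps(pages, avg_page_bytes=40_000, days=1, data_centers=1):
    """Average bits per second needed to fetch `pages` once within `days`,
    assuming the load is split evenly across `data_centers` (an assumption)."""
    total_bits = pages * avg_page_bytes * 8
    return total_bits / (days * SECONDS_PER_DAY) / data_centers


# Everything fetched in a single day, from a single location.
one_day_4bn = crawl_bandwidth_bps(4_000_000_000)
one_day_3bn = crawl_bandwidth_bps(3_000_000_000)

# The same 3 billion pages spread over 28 days and 5 data centers.
spread_3bn = crawl_bandwidth_bps(3_000_000_000, days=28, data_centers=5)

print(f"4 bn pages in 1 day:          {one_day_4bn / 1e9:.1f} Gbit/s")  # 14.8
print(f"3 bn pages in 1 day:          {one_day_3bn / 1e9:.1f} Gbit/s")  # 11.1
print(f"3 bn pages, 28 days, 5 DCs:   {spread_3bn / 1e6:.1f} Mbit/s")   # 79.4
```

Replication between DCs and query traffic are not included, so this is only a lower bound on what the pipes would have to carry.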
This month in my little cul de sac off the internet super highway I have not seen much of the google bot. I have however seen the MSNBot crawl every page in my site every month for quite a while.
Still I get the majority of my offsite referrers from Google either normal or image search.
Just to contradict what y'all are saying...
|Just to contradict what y'all are saying... |
If google never visits you, then how are your pages in its index? Obviously it has all your pages and no need to revisit.
My site has been mostly stagnant since I've gone back to school. However, if Google was re-indexing and deep crawling sites, surely it would crawl mine, being that I even manage to be number 1 for my own minuscule and unimportant keyword?
Yet all I see is MSN Bot and a variety of browsers in my top agents list... Maybe Google Bot visited a little this month as Web Analysiser only shows the top ten agents, but the fact remains for my website, sample size of one, GoogleBot is not "running hard" and in fact month in month out, MSN Bot has made it a point to completely crawl my site.
|My site has been mostly stagnant since I've gone back to school. However, if Google was re-indexing and deep crawling sites, surely it would crawl mine, being that I even manage to be number 1 for my own minuscule and unimportant keyword? |
Lots of folks can rank #1 very quickly (nearly overnight) for uncompetitive keywords, I've done it many times. Not sure of your point there. Do you think that just because you rank #1 for a keyword that your site is important to Google? I've never gotten that impression from what I've read.
|Boxes are all fixed low cost, however bandwidth is still relatively expensive, especially when you require THAT MUCH - 15 Gigabits sustained over a period of time - this is very high load and it's not cheap. Google's normal internal SLA is to crawl it all over a few months, so costs of doing so are far lower - they only need 166 Mbits to achieve the same results (15,000/90). |
You mean to tell me that Google would spend nearly a half billion dollars on 100,000 servers but go on the cheap with bandwidth?
OC768 - 39.8 Gbps and all this stuff has burst speeds. They don't even need a pipe that big with multiple centers...
I'm also not sure why you think that hosting 100,000 servers is all "fixed low cost." If you're right, their electric bills alone are in the $5 million per month range (the servers plus cooling), let alone the cost to maintain all those machines.
|OC768 - 39.8 Gbps and all this stuff has burst speeds. They don't even need a pipe that big with multiple centers... |
BillyS - the argument is not about what pipes they have and how much they cost - empirical and historical evidence suggests that indexers are not bottlenecks, and it's the crawlers that take longer to execute. The exact speed is irrelevant, as I was trying to answer the original question, which, if you don't mind me quoting, is as follows:
|Because no index needs to be built up, the crawl is much faster than usual. |
If crawling is the bottleneck (which I think is the case), then this original question is answered as "no, not running indexers won't make any difference to crawling speed, provided there are enough URLs to crawl". This is all I was arguing about; I went into possible specifics of what infrastructure they use to show what the most likely bottleneck is.
And believe it or not - even with billions of dollars there is always a bottleneck, simply because one subsystem runs faster than another for whatever reason.
still running hard...days after a major update. GBot is on crack.
All this is just speculation.
I have a #1 website whose cache is from August 12th, and another one in the top 15 with a cache from September 6th.
Well now the feeding frenzy is over and the ladies have put their handbags down :)
|My site has been mostly stagnant since I've gone back to school |
Hey Muskie, when was the last time you updated your site? If you give googlebot something to chew on, she will return time after time. MSNBot and Yahoo are also spidering very hard and have been for quite some time.
|GoogleBot is not "running hard" and in fact month in month out, MSN Bot has made it a point to completely crawl my site |
Believe me, googlebot is running very fast indeed. I have not seen this type of spidering since the google dance ended and the update was spread out over the month as a continuous update instead of the last 7 - 10 days of each month. It's too early to tell what the outcome will be, but I am sure we will all find out in the very near future.
|Well now the feeding frenzy is over and the ladies have put their handbags down |
LOL. Sorry 'bout that ... ;)
For all those people who say that Google ignores the newbies, my new site has had 4,000 of its pages crawled since Gbot started going nuts.
Aside from an increase in spidering, has anybody seen any meaningful increase in the number of pages shown in their Google index (i.e. site:www.domain.com), especially for partially indexed sites?
In my case Google spidered around 25,000 pages out of a total of 50,000; however, our Google index has only increased from about 7,000 to about 8,000. Roughly...
I'm seeing small increases like you sasha, but nothing sizeable.
When you say that googlebot visited 25000 pages do you mean it visited your site 25000 times or do you know that it visited 25000 separate pages?
I also have a site where google "loses" a percentage of indexed pages every month but despite the fact that it visits my site >50000 times a month I am not convinced that it visits every individual page.
I am going to analyse my logs in more detail.
> When you say that googlebot visited 25000 pages do you mean it visited your site 25000 times or do you know that it visited 25000 separate pages?
Actually, it visited probably, like, 100 times in the span of 2 weeks. It produced 25K hits to all pages. I would say judging by the logs about 15-20% of the hits are duplicate (ie. hitting robots.txt 200 times per day, or index.html about 20 times), with the rest of the hits being unique.
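If anyone wants to separate "hits" from "unique pages" in their own logs, something like the following Python sketch would do it. It assumes the Apache combined log format and simply matches on the user-agent string; the function name, the sample lines and the `ExampleBot` agent are invented for illustration and are not taken from anybody's real logs.

```python
import re
from collections import Counter

# Apache "combined" log format (an assumption about how your server logs).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_hit_summary(lines, agent_substring="Googlebot"):
    """Count total hits versus distinct paths requested by one user agent."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and agent_substring in m.group("agent"):
            hits[m.group("path")] += 1
    total = sum(hits.values())
    unique = len(hits)
    return {
        "total_hits": total,
        "unique_pages": unique,
        "repeat_hits": total - unique,  # hits beyond the first per page
        "most_requested": hits.most_common(3),
    }


# Invented sample data using a documentation-only IP address.
GBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def _line(path, agent=GBOT):
    return (f'192.0.2.10 - - [28/Sep/2004:15:48:24 +0100] '
            f'"GET {path} HTTP/1.1" 200 335 "-" "{agent}"')


SAMPLE = [
    _line("/robots.txt"),
    _line("/robots.txt"),
    _line("/index.html"),
    _line("/page1.html"),
    _line("/index.html", agent="ExampleBot/1.0"),  # hypothetical other bot
]

summary = bot_hit_summary(SAMPLE)
print(summary)
```

To run it on a real log, pass the open file instead of `SAMPLE`, e.g. `bot_hit_summary(open("access.log"))`. Note that a user-agent match alone says nothing about whether the requests really came from Google.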
Bots on one site this month
MSN 88,000+ pages
Jeeves 88,000+ pages
Google 15,000 pages
Google comes in third place.
Strange thing is that another site that lost 90% of its traffic from google gets crawled like crazy - every day - Go figure!
I mentioned this in another thread but it seems to belong here. In a response to an email I sent regarding sandboxing, Google replied:
|Our crawler analyzes the content of webpages in our index to determine the search queries for which they're most relevant. |
I'm guessing that says something about the relationship betwixt crawler and indexer, although I don't know what.
In my case googlebot feels the need to visit each of my pages at least 10 times per month - at this rate it would need to visit my site >150,000 times and not the 50,000 it does at present.
I am beginning to wonder whether there may be a connection between this and the "missing pages" syndrome.
In the last 24 hours Googlebot reindexed my entire site for the second time this month and that doesn't include the daily hits in between. I'm still not seeing any major shifts on much of anything.
I still believe they are having problems and are frantically trying to sort things out. I'm having a hard time believing there is any other reason for two deep crawls without a major update.
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
don't think that one is your friend!
What do you think is "unfriendly" about this bot?
If this activity is indeed an emergency rebuild of Google's index, I suspect it's related to the meta-refresh page hijacking problem being discussed ([webmasterworld.com])
From recent posts in that thread it appears that the redirect/hijack problem is being fixed. The root of the problem was page contents being "credited" to redirected URLs: in that case, the index may not have contained the data necessary to re-link hijacked content with its proper URL, necessitating a complete crawl to rebuild the index.
It's just not the bot you want to see if you do aff marketing.... just a feeling and a few tests ;)
|Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |
I've seen this in my logs too. The only note I've seen on it was a confirmation that it was Google's; I think they started using it around a month ago.
I've been monitoring this issue since yesterday evening.
It doesn't look like Google or MSN is crawling this way.
Some other network is trying to flood websites under the names of Google and MSN.
The reason I'm saying this is the pattern in which the hits are generated, which is similar for both user agents.
Google Never uses the Below User Agent:
Google User Agent:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Ip Range with Subnet:
Today some of my websites are pounded with hits by
MSN User Agent:
Have any network admins seen the same pattern?
Belong to Google
Belongs to MSN
Yes, it is. But the request pattern for both agents is similar, which suggests that someone could be masquerading as MSN or Google.
I have seen both googlebot and msnbot hitting my site pretty hard. From what I see they are real bots from each company based on IP address ranges. Is it possible to spoof an IP like that?
I'm also getting pounded by this 'googlebot'.
An example line:
188.8.131.52 - - [28/Sep/2004:15:48:24 +0100] "GET /tps_page.html HTTP/1.1" 404 335 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
The IP does appear to belong to Google, but whereas googlebot normally spreads its load across several bot machines, these are all coming from one IP address.
Also they are requesting docs which do not and have never existed on my site. They do appear to be docs which exist on sites I link to. It's as though the bot has followed links but has not changed the server part of the link.
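For anyone who wants to check their own logs for the same two symptoms (everything coming from one IP, and requests for pages that never existed), here is a rough Python sketch. It assumes the Apache combined log format shown in the example line above; the function name and the sample lines are invented for illustration, and the IPs are documentation-only addresses.

```python
import re
from collections import Counter

# Apache "combined" log format, matching the example line above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_source_report(lines, agent_substring="Googlebot"):
    """Per-IP hit counts and 404 paths for requests claiming a given agent."""
    per_ip = Counter()
    missing = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m or agent_substring not in m.group("agent"):
            continue
        per_ip[m.group("ip")] += 1
        if m.group("status") == "404":
            missing[m.group("path")] += 1
    total = sum(per_ip.values())
    top_ip, top_count = per_ip.most_common(1)[0] if per_ip else (None, 0)
    return {
        "total_hits": total,
        "distinct_ips": len(per_ip),
        "top_ip": top_ip,
        "top_ip_share": top_count / total if total else 0.0,
        "not_found_paths": sorted(missing),
    }


# Invented sample data, not real log lines.
GBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def _line(ip, path, status):
    return (f'{ip} - - [28/Sep/2004:15:48:24 +0100] '
            f'"GET {path} HTTP/1.1" {status} 335 "-" "{GBOT}"')


SAMPLE = [
    _line("192.0.2.1", "/tps_page.html", 404),
    _line("192.0.2.1", "/other.html", 404),
    _line("192.0.2.1", "/index.html", 200),
    _line("192.0.2.2", "/index.html", 200),
]

report = bot_source_report(SAMPLE)
print(report)
```

A high `top_ip_share` together with a long `not_found_paths` list would match what I described, but it only describes the traffic; it doesn't prove who is behind it.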