Gbot running hard

     
9:04 am on Sep 23, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 7, 2003
posts:1179
votes: 0


Googlebot is requesting between 2 and 5 pages a second; I've not seen this type of spidering for a long time.

9:17 pm on Sept 25, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


Anyone read a good book recently?

Good point - I quit my full time job to do just that (among other things) :)

10:48 pm on Sept 25, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 8, 2003
posts:548
votes: 0


Lord Majestic,

Although you decided to end the discussion about the number of machines, I'll take the liberty of adding something to it. You are not allowed to reply. ;-)

In 2002 c't magazine (a well-known German computer magazine) published an interview with Google fellow Urs Hölzle. The article is in German and available pay-per-click only, but I'll sticky it to you if you like. He said that they had 10k machines at that point. The blog article you mention references an IEEE article from 2003 by three G employees in which the number of machines is said to be 15k. 10k in 2002, 15k in 2003 - let's say they are now at 20k.

>4 bln pages * 40KB * 8 bits / 86,400 secs in day = 15 GigaBit / second

Yes, 16 Gbit/s is not reasonable. In the above-mentioned interview, Hölzle said that the index turn-around is 28 days, with 3 billion URLs in the index at that time. Feeding 3 billion URLs into your calculation, one gets 12 Gbit/s. Divide this by 28 days and 5 DCs, and one gets 90 Mbit/s. A more reasonable number. We also need to take into account that the DCs must replicate the index among each other and that there is query traffic. I admit that your and Brett's theory is more likely than mine.
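
For anyone who wants to recheck the arithmetic, here is a rough sketch of the quoted formula. It assumes decimal kilobytes (1 KB = 1000 bytes) and a perfectly even crawl rate; the function name and the KB definition are my own choices, not anything Google has published. With those assumptions the results come out slightly below the rounded figures above.

```python
def crawl_bandwidth_bits_per_sec(pages, avg_page_bytes, crawl_days, datacenters=1):
    """Average sustained bandwidth (bits/s) needed to fetch `pages` pages of
    `avg_page_bytes` bytes each in `crawl_days` days, split evenly across
    `datacenters` data centers."""
    total_bits = pages * avg_page_bytes * 8
    seconds = crawl_days * 86_400  # seconds in a day
    return total_bits / seconds / datacenters


KB = 1000  # assumption: decimal kilobytes

# 4 billion pages of 40 KB fetched in a single day
one_day_4bn = crawl_bandwidth_bits_per_sec(4e9, 40 * KB, crawl_days=1)

# 3 billion pages of 40 KB fetched in a single day
one_day_3bn = crawl_bandwidth_bits_per_sec(3e9, 40 * KB, crawl_days=1)

# 3 billion pages spread over a 28-day cycle and 5 data centers
per_dc = crawl_bandwidth_bits_per_sec(3e9, 40 * KB, crawl_days=28, datacenters=5)

print(f"4 bln pages in one day:  {one_day_4bn / 1e9:.1f} Gbit/s")
print(f"3 bln pages in one day:  {one_day_3bn / 1e9:.1f} Gbit/s")
print(f"3 bln pages, 28 days, 5 DCs: {per_dc / 1e6:.0f} Mbit/s per DC")
```

Note that everything here is in bits per second, matching the "* 8 bits" in the quoted formula.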

5:17 am on Sept 26, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:May 30, 2002
posts:56
votes: 0


This month in my little cul de sac off the internet super highway I have not seen much of the google bot. I have however seen the MSNBot crawl every page in my site every month for quite a while.

Still I get the majority of my offsite referrers from Google either normal or image search.

Just to contradict what y'all are saying...

Muskie

11:23 am on Sept 26, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 1, 2004
posts:3181
votes: 0


Just to contradict what y'all are saying...

If google never visits you, then how are your pages in its index? Obviously it has all your pages and no need to revisit.

3:14 pm on Sept 26, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:May 30, 2002
posts:56
votes: 0


My site has been mostly stagnant since I've gone back to school. However, if Google was re-indexing and deep crawling sites, surely it would crawl mine, being that I even manage to be number 1 for my own minuscule and unimportant keyword?

Yet all I see is MSN Bot and a variety of browsers in my top agents list... Maybe Google Bot visited a little this month as Web Analysiser only shows the top ten agents, but the fact remains for my website, sample size of one, GoogleBot is not "running hard" and in fact month in month out, MSN Bot has made it a point to completely crawl my site.

Muskie

3:49 pm on Sept 26, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 1, 2004
posts:3181
votes: 0


My site has been mostly stagnant since I've gone back to school. However, if Google was re-indexing and deep crawling sites, surely it would crawl mine, being that I even manage to be number 1 for my own minuscule and unimportant keyword?

Lots of folks can rank #1 very quickly (nearly overnight) for uncompetitive keywords; I've done it many times. Not sure of your point there. Do you think that just because you rank #1 for a keyword that your site is important to Google? I've never gotten that impression from what I've read.

4:12 pm on Sept 26, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 1, 2004
posts:3181
votes: 0


Boxes are all fixed low cost; however, bandwidth is still relatively expensive, especially when you require THAT MUCH - 15 Gigabits sustained over a period of time - this is a very high load and it's not cheap. Google's normal internal SLA is to crawl it all over a few months, so the costs of doing so are far lower - they only need 166 Mbits to achieve the same results (15,000/90).

You mean to tell me that Google would spend nearly a half billion dollars on 100,000 servers but go on the cheap with bandwidth?

OC768 - 39.8 Gbps and all this stuff has burst speeds. They don't even need a pipe that big with multiple centers...

I'm also not sure why you think that hosting 100,000 servers is all "fixed low cost." If you're right, their electric bills alone are in the $5 million per month range (the servers plus cooling), let alone the cost to maintain all those machines.

4:18 pm on Sept 26, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


OC768 - 39.8 Gbps and all this stuff has burst speeds. They don't even need a pipe that big with multiple centers...

BillyS - the argument is not about what pipes they have and how much they cost - empirical and historical evidence suggests that indexers are not the bottleneck, and it's the crawlers that take longer to execute. The exact speed is irrelevant, as I was trying to answer the original question, which, if you don't mind me quoting, is as follows:

Because no index needs to be built up, the crawl is much faster than usual.

If crawling is the bottleneck (which I think is the case) then this original question is answered as "no, not running indexers won't make any difference to crawling speed provided there are enough URLs to crawl". This is all I was arguing about; I went into the possible specifics of what infrastructure they use to show what the most likely bottleneck is.

And believe it or not - even with billions of dollars there is always a bottleneck, simply because one subsystem runs faster than another for whatever reason.

5:47 pm on Sept 26, 2004 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


still running hard...days after a major update. GBot is on crack.
8:37 pm on Sept 26, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Feb 17, 2004
posts:597
votes: 0


All this is just speculation.
I have a #1 website whose cache is from August 12th, and another one in the top 15 with a cache from September 6th.

8:51 pm on Sept 26, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 7, 2003
posts:1179
votes: 0


Well now the feeding frenzy is over and the ladies have put their handbags down :)

My site has been mostly stagnant since I've gone back to school

Hey Muskie, when was the last time you updated your site? If you give Googlebot something to chew on she will return time after time. msnbot and Yahoo are also spidering very hard and have been for quite some time.

GoogleBot is not "running hard" and in fact month in month out, MSN Bot has made it a point to completely crawl my site

Believe me, Googlebot is running very fast indeed. I've not seen this type of spidering since the Google dance ended and the update was spread out over the month as a continuous update instead of the last 7 - 10 days of each month. It's too early to tell what the outcome will be, but I am sure we will all find out in the very near future.

6:56 pm on Sept 27, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 8, 2003
posts:548
votes: 0


Well now the feeding frenzy is over and the ladies have put their handbags down

LOL. Sorry 'bout that ... ;)

9:24 pm on Sept 27, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Apr 25, 2002
posts:470
votes: 0


For all those people who say that Google ignores the newbies, my new site has had 4,000 of its pages crawled since Gbot started going nuts.

4:20 am on Sept 28, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 25, 2004
posts:81
votes: 0


Aside from an increase in spidering, has anybody seen any meaningful increase in the number of pages shown on their Google index: ie, site:www.domain.com, especially for partially indexed sites?

In my case Google spidered around 25,000 pages out of a total of 50,000; however, our Google index has only increased from about 7,000 to about 8,000. Roughly...

4:26 am on Sept 28, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 28, 2001
posts:1380
votes: 0


I'm seeing small increases like you sasha, but nothing sizeable.
6:56 am on Sept 28, 2004 (gmt 0)

New User

10+ Year Member

joined:Sept 2, 2004
posts:3
votes: 0


Hi Sasha

When you say that googlebot visited 25000 pages do you mean it visited your site 25000 times or do you know that it visited 25000 separate pages?

I also have a site where google "loses" a percentage of indexed pages every month but despite the fact that it visits my site >50000 times a month I am not convinced that it visits every individual page.

I am going to analyse my logs in more detail.

7:11 am on Sept 28, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 25, 2004
posts:81
votes: 0


> When you say that googlebot visited 25000 pages do you mean it visited your site 25000 times or do you know that it visited 25000 separate pages?

Actually, it visited probably, like, 100 times in the span of 2 weeks. It produced 25K hits to all pages. I would say judging by the logs about 15-20% of the hits are duplicate (ie. hitting robots.txt 200 times per day, or index.html about 20 times), with the rest of the hits being unique.
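
If anyone wants to separate total bot hits from unique pages in their own logs, something along these lines should work. It is only a sketch: it assumes Apache-style Combined Log Format and matches the bot by a substring of the user-agent. The regex, the `bot_hits` helper and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Combined Log Format:
# ip ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_hits(lines, agent_substring="Googlebot"):
    """Return a Counter mapping requested path -> number of hits for log
    lines whose user-agent contains `agent_substring` (case-insensitive)."""
    hits = Counter()
    needle = agent_substring.lower()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and needle in m.group("agent").lower():
            hits[m.group("path")] += 1
    return hits


# Invented sample lines, for illustration only
sample = [
    '66.249.66.1 - - [28/Sep/2004:07:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '66.249.66.1 - - [28/Sep/2004:07:00:05 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '66.249.66.1 - - [28/Sep/2004:09:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '207.46.98.5 - - [28/Sep/2004:09:10:00 +0000] "GET /page1.html HTTP/1.0" 200 2048 "-" "msnbot/0.3 (+http://search.msn.com/msnbot.htm)"',
]

hits = bot_hits(sample)
total = sum(hits.values())   # every request the bot made
unique = len(hits)           # distinct pages it requested
print(f"total hits: {total}, unique pages: {unique}, repeats: {total - unique}")
```

To run it on a real log, pass an open file instead of `sample`, e.g. `bot_hits(open("access.log"))`, where the filename is whatever your server uses.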

7:29 am on Sept 28, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 20, 2003
posts:408
votes: 0


Bots on one site this month

MSN 88,000+ pages
Jeeves 88,000+ pages
Google 15,000 pages

Google comes in third place.

Strange thing is that another site that lost 90% of its traffic from google gets crawled like crazy - every day - Go figure!

7:45 am on Sept 28, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member powdork is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 13, 2002
posts:3346
votes: 0


I mentioned this in another thread but it seems to belong here. In a response to an email I sent regarding sandboxing, Google replied:
Our crawler analyzes the content of webpages in our index to determine the search queries for which they're most relevant.
I'm guessing that says something about the relationship betwixt crawler and indexer, although I don't know what.
8:01 am on Sept 28, 2004 (gmt 0)

New User

10+ Year Member

joined:Sept 2, 2004
posts:3
votes: 0


In my case googlebot feels the need to visit each of my pages at least 10 times per month - at this rate it would need to visit my site >150,000 times and not the 50,000 it does at present.

I am beginning to wonder whether there may be a connection between this and the "missing pages" syndrome.

11:23 am on Sept 28, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 19, 2000
posts:2501
votes: 27


In the last 24 hours Googlebot reindexed my entire site for the second time this month and that doesn't include the daily hits in between. I'm still not seeing any major shifts on much of anything.

I still believe they are having problems and are frantically trying to sort things out. I'm having a hard time believing there is any other reason for two deep crawls without a major update.

11:28 am on Sept 28, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 5, 2001
posts:2466
votes: 0


Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

don't think that one is your friend!

11:38 am on Sept 28, 2004 (gmt 0)

New User

10+ Year Member

joined:Sept 2, 2004
posts:3
votes: 0


Hi DaveN

What do you think is "unfriendly" about this bot?

11:50 am on Sept 28, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 28, 2004
posts:112
votes: 0


If this activity is indeed an emergency rebuild of Google's index, I suspect it's related to the meta-refresh page hijacking problem being discussed ([webmasterworld.com])

From recent posts in that thread it appears that the redirect/hijack problem is being fixed. The root of the problem was page contents being "credited" to redirected URLs: in that case, the index may not have contained the data necessary to re-link hijacked content with its proper URL, necessitating a complete crawl to rebuild the index.

11:51 am on Sept 28, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 5, 2001
posts:2466
votes: 0


It's just not the bot you want to see if you do affiliate marketing.... just a feeling and a few tests ;)

11:52 am on Sept 28, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 1, 2004
posts:3181
votes: 0


Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

I've seen this in my logs too. The only note I've seen on it was a confirmation that it was Google's; I think they started using it around a month ago.

1:50 pm on Sept 28, 2004 (gmt 0)

New User

10+ Year Member

joined:Apr 23, 2003
posts:5
votes: 0


Hi,

I've been monitoring this issue since yesterday evening.

It doesn't look like Google or MSN is crawling this way.

Some other network is trying to flood websites under the names of Google and MSN.

The reason I'm saying this is that the pattern in which the hits are generated is similar for both user agents.

Google Never uses the Below User Agent:

Google User Agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Ip Range with Subnet:

66.249.66.0 255.255.255.0
66.249.65.0 255.255.255.0

Today some of my websites are pounded with hits by

MSN User Agent:
msnbot/0.3 (+http://search.msn.com/msnbot.htm)

207.46.98.0

Have any network admins found the same pattern?

1:57 pm on Sept 28, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 23, 2003
posts:77
votes: 0


sri_gan

66.249.66.0
66.249.65.0

Belong to Google

207.46.98.0

Belongs to MSN

Tony

2:09 pm on Sept 28, 2004 (gmt 0)

New User

10+ Year Member

joined:Apr 23, 2003
posts:5
votes: 0


Yes, they do. But the request pattern for both agents is similar, which raises the possibility that someone is masquerading as MSN or Google.

2:35 pm on Sept 28, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:June 22, 2004
posts:67
votes: 0


I have seen both Googlebot and msnbot hitting my site pretty hard. From what I see they are real bots from each company, based on IP address ranges. Is it possible to spoof an IP like that?
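
For checking a logged address against the ranges posted earlier in this thread, here is a small sketch using Python's standard `ipaddress` module. The ranges are only the ones quoted above, not an official or current list, and the /24 netmask on the MSN range is my assumption since the post gave only the address. `owner_of` is a made-up helper name.

```python
import ipaddress

# Ranges as posted in this thread (address/netmask). Not an official list.
POSTED_RANGES = {
    "Google": ["66.249.66.0/255.255.255.0", "66.249.65.0/255.255.255.0"],
    # Assumption: the post listed 207.46.98.0 without a netmask; /24 is a guess.
    "MSN": ["207.46.98.0/255.255.255.0"],
}

NETWORKS = {
    owner: [ipaddress.ip_network(r) for r in ranges]
    for owner, ranges in POSTED_RANGES.items()
}


def owner_of(ip):
    """Return the name whose posted range contains `ip`, or None if no
    posted range matches."""
    addr = ipaddress.ip_address(ip)
    for owner, nets in NETWORKS.items():
        if any(addr in net for net in nets):
            return owner
    return None


for ip in ("66.249.66.17", "207.46.98.5", "10.0.0.1"):
    print(ip, "->", owner_of(ip))
```

A match only says the address falls inside a posted range; comparing it with the user-agent string in the same log line shows whether the two at least agree.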