
Gbot running hard

     
9:04 am on Sep 23, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 7, 2003
posts:1179
votes: 0


Googlebot is requesting between 2 and 5 pages a second - I haven't seen this type of spidering for a long time.
11:36 am on Sept 24, 2004 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38061
votes: 13


yep - looks like "panic"-based spidering.
11:44 am on Sept 24, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Feb 24, 2004
posts:639
votes: 0


Hi Brett,

What do you mean by "panic" type spidering?

Cheers
BTS

11:50 am on Sept 24, 2004 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38061
votes: 13


as if an index needs to be rebuilt from the ground up in a short time period (i.e. the old index didn't work).
11:57 am on Sept 24, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 8, 2003
posts:548
votes: 0


Maybe it's part of the PR calculation for the next update. The crawl is done for two reasons: A) to check whether the page exists, and B) to get all the links on that page. Because no index needs to be built up, the crawl is much faster than usual.
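
To make that concrete, here's a minimal sketch (Python, purely illustrative - not Google's actual code) of what such a link-only pass might look like: fetch each page just to confirm it exists and to collect its outlinks, with no indexing work at all. The names LinkCollector and link_pass are invented for the example.

```python
# Illustrative link-only pass: confirm the page exists and harvest its
# outlinks, doing no indexing work. Names here are invented for the sketch.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href=...> on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def link_pass(url):
    """Return (exists, outlinks): the two facts described above."""
    try:
        with urlopen(url, timeout=10) as resp:
            parser = LinkCollector(url)
            parser.feed(resp.read().decode("utf-8", errors="replace"))
            return True, parser.links
    except OSError:   # covers HTTPError/URLError: page "doesn't exist"
        return False, []
```
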
2:08 pm on Sept 24, 2004 (gmt 0)

New User

10+ Year Member

joined:Mar 16, 2004
posts:32
votes: 0


I dunno - but I've had 3 deep crawls in a week - I mean DEEP.
2:35 pm on Sept 24, 2004 (gmt 0)

New User

10+ Year Member

joined:Aug 28, 2003
posts:35
votes: 0


FYI, I too had deep crawls on Monday, Wednesday, and Thursday. With G downloading a Gig from my site, the spikes are easy to see on my traffic graphs ;-)

-Nuttzy

2:37 pm on Sept 24, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


Because no index needs to be built up, the crawl is much faster than usual.

Indexing is not done in between crawling pages; it's done by other machines running in parallel that accept complete "data barrels" from the crawler.
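
For illustration, here is a toy sketch of that kind of decoupling, assuming a simple producer/consumer setup (BARREL_SIZE, the queue bound and all names are invented for the example, not drawn from Google's code): the crawler fills "barrels" of fetched pages and hands them off through a queue, so neither side waits on the other page by page.

```python
# Toy producer/consumer version of the crawl/index split described above.
# BARREL_SIZE, the queue bound and all names are invented for the sketch.
import queue

BARREL_SIZE = 100                    # fetched pages per hand-off
barrels = queue.Queue(maxsize=10)    # crawler blocks only if indexing lags far behind

def crawler(fetch, url_stream):
    """Fill barrels with fetched pages; never waits on indexing of a single page."""
    barrel = []
    for url in url_stream:
        barrel.append((url, fetch(url)))
        if len(barrel) == BARREL_SIZE:
            barrels.put(barrel)      # hand off a complete barrel
            barrel = []
    if barrel:
        barrels.put(barrel)
    barrels.put(None)                # sentinel: crawl finished

def indexer(index_pages):
    """Consume complete barrels while the crawler keeps fetching."""
    while True:
        barrel = barrels.get()
        if barrel is None:
            return
        index_pages(barrel)

# Run each in its own thread, e.g.:
#   threading.Thread(target=crawler, args=(my_fetch, my_urls)).start()
#   threading.Thread(target=indexer, args=(my_index_fn,)).start()
```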

2:52 pm on Sept 24, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 8, 2003
posts:548
votes: 0


Lord Majestic,

I knew that. But there must be some degree of coupling between the crawler and the indexer. If the indexer, or in this case the PR-calculator, is fast, it will process data at a higher rate and the crawlers can deliver data at a higher rate.

I'm not saying that my hypothesis is necessarily correct, it's just that what you say does not prove me wrong.

3:01 pm on Sept 24, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


I'm not saying that my hypothesis is necessarily correct, it's just that what you say does not prove me wrong.

I am positive that the crawler and indexer are independent systems (as in Google's original design) that, while providing each other with data, run in parallel - as opposed to a page being crawled, then indexed, then moving on to the next page.

Fundamentally, the indexer depends on data crawled by the crawler. On the other hand, the crawler depends on additional URLs that the indexer gets by analysing crawled data. I can see two distinct scenarios:

a) The indexer is not running (as you suggest). In this case the crawler would run out of URLs to do and stop at some point, and I see no reason to run it faster than normal if indexing is not being performed - why indeed?

b) The indexer is running and adds new URLs for the crawler to do. In this case the crawler would have to work its proverbial arse off (which seems to be what is happening in our case) to cope with the indexer's demand for data, and with all the computing power they have, the bottleneck is bound to be in the "bandwidth" area, i.e. the crawler - so this scenario, in my view, is the most likely at the moment.

I therefore conclude it is far more likely that the index had to be rebuilt quickly, so they pulled out all the stops and started crawling like mad with the indexer running at full power too - contrary to your original suggestion and in line with what Brett is suggesting.

Perhaps someone with (or without) better knowledge of search engine design would like to correct me ;)

3:21 pm on Sept 24, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 8, 2003
posts:548
votes: 0


>On the other hand the crawler depends on additional URLs that indexer gets by analysing crawled data.

If that's true, there is not much left for the crawler to do, is there? No, the crawler filters new URLs out on its own. Also, it is fed existing URLs from the database, not from the indexer.

>and with all that computing power they have the bottleneck is bound to be in "bandwidth" area

Word-based indexers are computationally expensive. That's what G needs the computing power for. You think they buy all these machines and then force them to idle by putting stops in the code? No way! The indexer is the bottleneck, not the crawlers.
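
To make "word-based indexer" concrete, here's a toy inverted-index builder (a sketch only; real indexers add compression, stemming, ranking data and far more). Every word of every page has to be tokenised and appended to a posting list, which is where the CPU cost comes from.

```python
# Toy word-based indexer: every word of every page is tokenised and
# appended to a posting list. Real indexers do far more work than this.
import re
from collections import defaultdict

def build_inverted_index(pages):
    """pages: iterable of (doc_id, text) -> {word: [(doc_id, position), ...]}"""
    index = defaultdict(list)
    for doc_id, text in pages:
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].append((doc_id, pos))
    return index

idx = build_inverted_index([(1, "Gbot running hard"),
                            (2, "running the crawl hard")])
assert idx["hard"] == [(1, 2), (2, 3)]   # word -> where it occurs
```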

3:26 pm on Sept 24, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


No, the crawler filters new URL's out on its own. Also it is fed existing URLs from the database and not the indexer.

Only amateur crawlers extract URLs from a crawled page and go on to crawl those directly - you need to ensure that you don't crawl the same URLs all over again, otherwise crawlers running in parallel on many machines might totally kill a website because they happen to crawl the same pages. It is the indexer's job to analyse pages and produce nice work units with clean, deduplicated URLs for the crawlers to take, so that you don't waste bandwidth getting the same stuff over and over again (and pissing off webmasters).
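
As an illustration of that dedup step, here's a hedged sketch (all names invented; a real system would shard the "seen" set across many machines): extracted URLs are normalised and checked against the set before being handed back to the crawlers as a work unit.

```python
# Sketch of the dedup step: normalise extracted URLs and drop any already
# seen before handing a work unit back to the crawlers.
from urllib.parse import urlsplit, urlunsplit

seen = set()

def normalise(url):
    """Canonicalise so trivial variants collapse to one entry."""
    p = urlsplit(url)
    return urlunsplit((p.scheme.lower(), p.netloc.lower(),
                       p.path or "/", p.query, ""))   # drops the #fragment

def work_unit(extracted_urls):
    """Yield only never-before-seen URLs as the next crawler batch."""
    for url in extracted_urls:
        canon = normalise(url)
        if canon not in seen:
            seen.add(canon)
            yield canon
```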

Word-based indexers are computationally expensive. That's what G needs the computing power for. You think they buy all these machines and then force them to idle by putting stops in the code?

They do have lots of machines, primarily not for the CPUs but to have lots of RAM to keep the whole index in memory. They therefore have a LOT of CPU power - 100k machines by some estimates. If you spread all 4 bln URLs across those boxes, there is almost nothing for each to do: 40k pages each is nothing, and a good indexer gets through that in a matter of minutes.

The indexer is the bottleneck, not the crawlers.

I stand by the crawler being the main bottleneck - not least because websites have a natural limit to how fast they serve pages. You might have a 100Mbit connection, but if a site gives you data at a rate of 10kbit, you will have to wait for it - hence the bottleneck.

From my point of view, everything supports Brett's original answer that Google is, for some reason, rebuilding the index ASAP.

4:16 pm on Sept 24, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 8, 2003
posts:548
votes: 0


>Only amateur crawlers extract URLs from crawled page to continue crawling these

That's not what I said. I said that the crawlers extract the URLs themselves. Obviously, there needs to be a central queue of URLs that all crawlers feed into and draw from, such that two crawlers do not crawl the same page at the same time.

>They therefore have a LOT of CPU power - 100k machines by some estimates

That's not what I heard. It's more like 10k machines.

>lots of RAM to keep the whole index in memory

Pure speculation.

>but if a site gives you data at a rate of 10kbit, you will have to wait for it - hence the bottleneck.

Hence no bottleneck, because one crawler crawls multiple URLs in parallel, or one crawler machine runs multiple crawler processes in parallel. When crawling, my FTP search engine has 100 connections to different sites open at all times on a 2Mbit line, and I can't index data as fast as it comes in.
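
A minimal sketch of that point (Python, illustrative only; MAX_CONNECTIONS and fetch are invented names): with many fetches in flight at once, a single slow 10kbit site ties up one worker rather than stalling the whole crawl.

```python
# Many fetches in flight at once: a slow site ties up one worker, not the
# whole crawl. MAX_CONNECTIONS and fetch() are invented for the sketch.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

MAX_CONNECTIONS = 100    # connections kept open in parallel

def fetch(url):
    try:
        with urlopen(url, timeout=30) as resp:
            return url, resp.read()
    except OSError:
        return url, None

def crawl_parallel(urls):
    with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS) as pool:
        return list(pool.map(fetch, urls))
```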

4:23 pm on Sept 24, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


That's not what I said. I said that the crawlers extract the URLs themselves.

This is not how Google works. Here is a diagram of their original design, which I doubt has changed that much, as it still makes sense now:
[www-db.stanford.edu...]

Notice that the Crawler is fed by the URL server, and crawled pages have to go through the Indexer before new URLs are passed back to the crawler. If you have a more up-to-date scheme of how Google works, then please share it (as long as it's legal, of course).

That's not what I heard. It's more like 10k machines.

That's outdated info - even if it were 10k, 4 bln URLs would only require processing 400k URLs on each of the boxes, and that is nothing in terms of CPU requirements.

>lots of RAM to keep the whole index in memory

Pure speculation.

It's not speculation, it's fact - a design feature of their original (post-Stanford BackRub) system was to keep the whole index in memory, which is why they went for 2GB of RAM in each of the boxes. I suggest you check the references on the matter, and if you find something that says otherwise, please share it with me.

I can't index data as fast as it comes in.

Well, the reasons why you can't index data as fast as you crawl it are beyond the scope of this conversation. I'd speculate that your indexers are just too slow, hence for you the bottleneck is not bandwidth, unlike Google's situation ;)

5:22 pm on Sept 24, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:July 24, 2004
posts:95
votes: 0


Hi all,
Has the Googlebot IP changed too?
I am getting two new IP ranges crawling me. A DNS query on one just says proxy.google.com, and the other has no reverse DNS, but doing a whois I get Googlebot.
The IPs are:
216.239.37.5
216.239.39.5
216.239.51.5
and more starting with 216.239.

And:
66.249.65.12
66.249.65.51
66.249.65.6

It used to crawl me from 64.68. Why has the IP changed? Any ideas? Anyone?
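
For anyone wanting to check whether a new IP really is Googlebot, one common approach (a sketch, not an official procedure) is a forward-confirmed reverse DNS lookup: reverse-resolve the IP, check the hostname is under google.com or googlebot.com, then resolve that hostname forward and confirm it maps back to the same IP.

```python
# Forward-confirmed reverse DNS check for a crawler IP (a sketch, not an
# official procedure): reverse-resolve, verify the domain, then confirm
# the hostname resolves back to the same IP.
import socket

def is_google_crawler(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse lookup
    except OSError:
        return False                                    # no reverse DNS at all
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]   # forward-confirm
    except OSError:
        return False

print(is_google_crawler("66.249.65.12"))
```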

5:29 pm on Sept 24, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 1, 2004
posts:3181
votes: 0


I also saw hard crawling early Thursday - hundreds of pages over 30 minutes (2 sessions of 15 minutes). Very unusual pattern for my site and googlebot.
5:33 pm on Sept 24, 2004 (gmt 0)

New User

10+ Year Member

joined:Sept 28, 2003
posts:3
votes: 0


My site has nearly 200,000 pages, yet only 4,109 pages got crawled yesterday... Is this partial crawling? Do you think Googlebot will return to crawl the rest of the pages? My site is PR6. Does anybody have similar experiences, that is, partial crawling?
5:41 pm on Sept 24, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 3, 2002
posts:894
votes: 0


I have been getting nothing but partial crawls for the past month. That might be down to some other problems I have been having, though.

As for the IPs of Googlebot, I too noticed the change. I have been getting hit by 66.249.65.x

5:58 pm on Sept 24, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:July 24, 2004
posts:95
votes: 0


Webdude,
Do you know of any reason why the IPs have changed?
6:06 pm on Sept 24, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 3, 2002
posts:894
votes: 0


Not that I know of. It's Googlebot that has changed, I'm sure. Nothing I did.

Oh No! Maybe a crawl from this IP is an indication I am going to be dropped!..... Just Kidding :-)

6:17 pm on Sept 24, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 14, 2004
posts:62
votes: 0


OK, so then, for my elementary understanding of this situation, what impact does that have on the SERPs?
6:22 pm on Sept 24, 2004 (gmt 0)

New User

10+ Year Member

joined:Sept 28, 2003
posts:3
votes: 0


IPs that crawled my CLEAN WEBSITE (no links to affiliate programs, no external links):
66.249.65.8
66.249.64.146
66.249.65.73
66.249.65.51

IPs that crawled my product website:
66.249.64.195
66.249.64.47
66.249.65.73

Same block of IPs on both categories.
So I believe it has nothing to do with removing our pages. Maybe Google is trying out different things.

6:26 pm on Sept 24, 2004 (gmt 0)

Full Member

10+ Year Member

joined:Dec 20, 2003
posts:268
votes: 0


I expect a lot of cloaking and redirect sites will be dropped soon as a result of these new bot IPs and this crawl. It's what I had in mind in the post about hijacks when I said I think Google is onto it. They have been asking for file paths and filenames with extensions I have never used before. I am hopeful, anyway.
6:37 pm on Sept 24, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 3, 2002
posts:894
votes: 0


I expect a lot of cloaking and redirect sites will be dropped soon as a result of these new bot IPs and this crawl. It's what I had in mind in the post about hijacks when I said I think Google is onto it. They have been asking for file paths and filenames with extensions I have never used before. I am hopeful, anyway.

Man oh man, I hope so. As you know, I have been just a little involved in that thread.

6:43 pm on Sept 24, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 11, 2003
posts:427
votes: 0


I saw some heavy spidering earlier this week on new pages. I wouldn't go so far as to say they are rebuilding an old index if they're going after new content. Gbot has significantly increased its activity over the past few days. A new site I released a month ago was hit 50k times in that period. I wish I knew why.
6:49 pm on Sept 24, 2004 (gmt 0)

Full Member

10+ Year Member

joined:Feb 17, 2003
posts:214
votes: 0


johnnyb, when you posted the two series of IP addresses for Googlebot, 66.249.64.X and 66.249.65.X, I looked through my logs, and this is what I found.

The 66.249.64.X series was requesting pages that were fully indexed, i.e. they have a page title and description.

The 66.249.65.X series was requesting pages that were only partially indexed, i.e. they did not have a title or description when listed by Google using site:

In my case, the 66.249.65.X requests were for pages that exist on my server but that I am trying to get Googlebot to stop indexing. I have no links pointing to those pages, but Google knows about them and keeps on requesting them.

Not sure what all this means but that is what I am seeing regarding those two IP series.
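
For anyone who wants to run the same check, here is a hedged sketch of that kind of log analysis (it assumes a common/combined-format access log; the filename access.log is illustrative): it splits Googlebot hits into the 66.249.64.x and 66.249.65.x series and counts the distinct URLs each series requested.

```python
# Split Googlebot hits by IP series and count distinct URLs per series.
# Assumes a common/combined-format log; "access.log" is illustrative.
import re
from collections import defaultdict

LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)')

hits = defaultdict(set)
with open("access.log") as log:
    for line in log:
        m = LINE.match(line)
        if m and m.group(1).startswith(("66.249.64.", "66.249.65.")):
            series = m.group(1).rsplit(".", 1)[0] + ".x"   # e.g. "66.249.65.x"
            hits[series].add(m.group(2))

for series, paths in sorted(hits.items()):
    print(series, "requested", len(paths), "distinct URLs")
```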

6:57 pm on Sept 24, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 3, 2002
posts:894
votes: 0


I see Google as having acquired the following blocks...

66.249.64.x - 66.249.79.x

Shows a registration date of 03/05/04

That's quite a hunk of IPs.

7:04 pm on Sept 24, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 15, 2003
posts:2408
votes: 5


uhm... a new datacenter perhaps? pure speculation of course...
7:16 pm on Sept 24, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:July 24, 2004
posts:95
votes: 0


And my robot detector had to be set up to recognise these bots. I am preparing for its once-every-13-days full scan (I don't know why G chose that number for me).
I wonder if it will be affected, as it's already the 14th day.
7:38 pm on Sept 24, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:July 24, 2004
posts:95
votes: 0


Just took the time to analyse the stats.
What I found was that once all the pages were crawled by the 66.249.65 bot, a few pages were immediately crawled by the 216.239.37 bot, and the pages crawled by this second bot started appearing in the SERPs within hours (six to ten).
What's going on?

[edited by: boredguru at 7:43 pm (utc) on Sep. 24, 2004]
