Forum Moderators: open
I knew that. But there must be some degree of coupling between the crawler and the indexer. If the indexer, or in this case the PR-calculator, is fast, it will process data at a higher rate and the crawlers can deliver data at a higher rate.
I'm not saying that my hypothesis is necessarily correct, it's just that what you say does not prove me wrong.
I am positive that the crawler and indexer are independent systems (as in Google's original design) that, while providing each other with data, run in parallel, as opposed to a page being crawled, then indexed, before moving on to the next page.
Fundamentally, the indexer depends on data crawled by the crawler. On the other hand, the crawler depends on the additional URLs the indexer extracts by analysing crawled data. I can see two distinct scenarios:
a) The indexer is not running (as you suggest). In this case the crawler would eventually run out of URLs and stop, and I see no reason to run it faster than normal if no indexing is being performed.
b) The indexer is running and adds new URLs for the crawler to fetch. In this case the crawler has to work its proverbial arse off (which seems to be what is happening in our case) to cope with the indexer's demand for data. With all that computing power they have, the bottleneck is bound to be in the "bandwidth" area, i.e. the crawler, so in my view this scenario is the most likely at the moment.
I therefore conclude it is far more likely that the index had to be rebuilt quickly, so they pulled out all the stops and started crawling like mad with the indexer running at full power too, contrary to your original suggestion and in line with what Brett is suggesting.
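Scenario (b) can be sketched as a producer/consumer pipeline. This is purely illustrative (the queue size, delays, and page count are invented for the example, not anything Google actually uses): whichever stage is slower dictates overall throughput, because the faster stage just ends up blocked on the shared queue.

```python
import queue
import threading
import time

# Hypothetical sketch: a crawler stage feeds an indexer stage through a
# bounded queue. The slower stage sets the pipeline's throughput; the
# faster stage simply blocks waiting on the queue.
def run_pipeline(crawl_delay, index_delay, n_pages=20):
    q = queue.Queue(maxsize=5)   # bounded buffer couples the two stages
    indexed = []

    def crawler():
        for i in range(n_pages):
            time.sleep(crawl_delay)   # simulated fetch time per page
            q.put(i)                  # blocks when the indexer falls behind
        q.put(None)                   # sentinel: no more pages

    def indexer():
        while (page := q.get()) is not None:
            time.sleep(index_delay)   # simulated indexing time per page
            indexed.append(page)

    threads = [threading.Thread(target=crawler),
               threading.Thread(target=indexer)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start, len(indexed)

# A slow indexer stalls the crawler (the opposing view); a fast indexer
# leaves crawling, i.e. bandwidth, as the bottleneck (scenario b).
elapsed, n_indexed = run_pipeline(crawl_delay=0.001, index_delay=0.002)
```

The bounded queue is what provides the coupling discussed above: neither side can run arbitrarily far ahead of the other.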
Perhaps someone with (or without) better knowledge of search engine design would like to correct me ;)
If that's true, there is not much left to do for the crawler, is there?
No, the crawler filters new URLs out on its own. Also, it is fed existing URLs from the database, not by the indexer.
>and with all that computing power they have the bottleneck is bound to be in "bandwidth" area
Word-based indexers are computationally expensive. That's what the G needs the computing power for. You think they buy all these machines and then force them to idle by putting stops into the code? No way! The indexer is the bottleneck, not the crawlers.
No, the crawler filters new URLs out on its own. Also, it is fed existing URLs from the database, not by the indexer.
Only amateur crawlers extract URLs from a crawled page and go on to crawl them directly. You need to ensure that you don't crawl the same URLs all over again, otherwise crawlers running in parallel on many machines might totally kill a website because they happen to hit the same pages. It is the indexer's job to analyse pages and produce nice work units with clean, deduplicated URLs for the crawlers to pick up, so that you don't waste bandwidth fetching the same stuff over and over again (and pissing off webmasters).
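The work-unit idea above can be sketched as a tiny URL frontier. This is a hypothetical illustration, not Google's actual code: the `Frontier` class, its normalization rules, and the batch size are all invented for the example.

```python
from urllib.parse import urlsplit, urlunsplit

class Frontier:
    """Central queue that hands out only never-seen URLs to crawlers."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def _normalize(self, url):
        # Minimal canonicalization: drop the fragment, lowercase the host,
        # so trivially different spellings of one URL collapse to one key.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", parts.query, ""))

    def add(self, url):
        key = self._normalize(url)
        if key not in self.seen:       # dedup: each URL is queued once, ever
            self.seen.add(key)
            self.queue.append(key)

    def next_work_unit(self, size=2):
        # Hand a batch of unseen URLs to one crawler process.
        unit, self.queue = self.queue[:size], self.queue[size:]
        return unit

f = Frontier()
for u in ["http://Example.com/a#x", "http://example.com/a", "http://example.com/b"]:
    f.add(u)
# Only two distinct URLs survive normalization and dedup.
```

With a shared frontier like this, parallel crawlers on many machines never fetch the same page twice, which is exactly the bandwidth-saving point made above.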
Word-based indexers are computationally expensive. That's what the G needs the computing power for. You think they buy all these machines and then force them to idle by putting stops into the code?
They do have lots of machines, primarily not for the CPUs but for having lots of RAM to keep the whole index in memory. They therefore have a LOT of CPU power, 100k machines by some estimates. If you spread all 4 bln URLs across those boxes, there is almost nothing for each of them to do: 40k pages each is nothing, and a good indexer gets through that in a matter of minutes.
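As a quick sanity check of that arithmetic (both the 100k-machine count and the 4 bln URL total are this thread's estimates, not confirmed figures):

```python
# Back-of-the-envelope: URLs to index per machine, using the thread's
# estimated figures (neither number is an official one).
total_urls  = 4_000_000_000   # ~4 bln URLs in the index
machines    = 100_000         # one estimate floated above
per_machine = total_urls // machines
print(per_machine)            # 40000 pages per box
```

Even at the lower 10k-machine estimate mentioned later, the figure only rises to 400k pages per box.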
The indexer is the bottleneck, not the crawlers.
I stand by the claim that the crawler is the main bottleneck, not least because websites have a natural limit on how fast they serve pages. You might have a 100Mbit connection, but if a site gives you data at a rate of 10kbit then you have to wait until you get the data, hence the bottleneck.
Everything, from my point of view, supports Brett's original answer that Google is for some reason rebuilding the index ASAP.
That's not what I said. I said that the crawlers extract the URLs themselves. Obviously, there needs to be a central queue of URLs that all crawlers feed into and draw from, so that two crawlers do not crawl the same page at the same time.
>They therefore have a LOT of CPU power, 100k machines some estimate
That's not what I heard. It's rather 10k machines.
>lots of RAM to keep all index in memory
Pure speculation.
>but if site give you data at rate of 10kbit then you will have to wait till you get data - hence bottleneck.
Hence no bottleneck, because one crawler crawls multiple URLs in parallel, or one crawler machine runs multiple crawler processes in parallel. When crawling, my FTP search engine has 100 connections to different sites open at all times on a 2Mbit line, and I can't index the data as fast as it comes in.
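Putting rough numbers on that parallelism point (the 10kbit-per-site and 100Mbit figures come from earlier in the thread and are illustrative, not measured):

```python
# Many slow connections in parallel can still fill a fast pipe.
per_site_kbit = 10            # one slow site, per the earlier example
line_kbit     = 100 * 1000    # a 100Mbit crawler connection, in kbit
parallel_fetches = line_kbit // per_site_kbit
print(parallel_fetches)       # 10000 slow sites in parallel saturate the line
```

So a single slow site caps one connection, not the crawler as a whole; the crawler's line only becomes the bottleneck once it is actually saturated.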
That's not what I said. I said that the crawlers extract the URLs themselves.
This is not how Google works. Here is a diagram of their original design, which I doubt has changed that much, as it still makes sense now:
[www-db.stanford.edu...]
Notice that the crawler is fed by the URL server, and crawled pages have to go through the indexer before new URLs are passed back to the crawler. If you have a more up-to-date scheme of how Google works, then please share it (as long as it's legal, of course).
That's not what I heard. It's rather 10k machines.
That's outdated info. Even if it were 10k, then 4 bln URLs only require processing 400k URLs on each box, and that's nothing in terms of CPU requirements.
>lots of RAM to keep all index in memory
>Pure speculation.
It's not speculation, it's a fact. Their original (post-Stanford BackRub) design feature was to keep the whole index in memory, which is why they went for 2GB of RAM in each box. I suggest you check the references on the matter, and if you find something that says otherwise, please share it with me.
I can't index data as fast as it comes in.
Well, the reasons why you can't index data as fast as you crawl it are beyond the scope of this conversation. I'd speculate that your indexers are just too slow, hence for you the bottleneck is not bandwidth, unlike Google's situation ;)
And
66.249.65.12
66.249.65.51
66.249.65.6
Before, it used to crawl me from the 64.68. IP range. Why has the IP changed? Any ideas? Anyone?
IPs crawled on the product website:
66.249.64.195
66.249.64.47
66.249.65.73
Same block of IPs in both categories.
So, I believe it has nothing to do with our pages being removed. Maybe Google is trying out different things.
I expect a lot of cloaking and redirect sites will be dropped soon from these new bot IPs and this crawl. It's what I had in mind in the post about hijacks when I said I think Google is on it. They have been asking for file paths and filenames with extensions I have never used before. I am hopeful, anyway.
Man oh man, I hope so. As you know, I have been just a little involved in that thread.
The 66.249.64.X series was requesting pages that were fully indexed i.e., they have a page title and description.
The 66.249.65.X series was requesting pages that were only partially indexed, i.e., they did not have a title or description when listed by Google using site:
In my case, the 66.249.65.X requests were for pages that exist on my server but that I am trying to get Googlebot to stop indexing. I have no links pointing to those pages, but Google knows about them and keeps requesting them.
Not sure what all this means but that is what I am seeing regarding those two IP series.
[edited by: boredguru at 7:43 pm (utc) on Sep. 24, 2004]