Parallel crawling?

googlebot patterns

Spica

1:45 pm on Feb 11, 2003 (gmt 0)

10+ Year Member



Hi. This is my first post. I have been reading this forum for a few months, and want to thank you all for what I have learned here, and for the many (often very entertaining) discussion threads.

Sorry if this has been discussed before, but I could not find detailed information on this topic. My question is about how googlebot and other spiders decide the order in which they grab the pages on a site. How do they decide which page to grab first? This month, I have been looking at my raw logs to try to understand whether there is any pattern in the order in which pages are picked up, but it seems mostly random. I have a small PR5 site with only about 200 pages. However, I find the latest entries interesting: the same page is requested twice, by different crawlers, within seconds. They look like the following:

crawler12.googlebot.com - - [11/Feb/2003:01:17:40 -0500] "GET /page4.htm
crawler10.googlebot.com - - [11/Feb/2003:01:17:40 -0500] "GET /page4.htm

crawler11.googlebot.com - - [11/Feb/2003:05:09:44 -0500] "GET /subdir1/page13.htm
crawler10.googlebot.com - - [11/Feb/2003:05:09:46 -0500] "GET /subdir1/page13.htm

crawler12.googlebot.com - - [11/Feb/2003:06:27:09 -0500] "GET /subdir1/page16.htm
crawler11.googlebot.com - - [11/Feb/2003:06:27:29 -0500] "GET /subdir1/page16.htm

Parallel crawling? Deliberate clean up of missing/incompletely indexed pages?
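For anyone who wants to check their own logs for the same pattern, here is a rough sketch in Python. It assumes log lines that start like the ones above (host, two dashes, bracketed timestamp, quoted request), and the function name `repeated_fetches` and the 60-second window are arbitrary choices of mine for illustration, not anything Google documents.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the start of a Common Log Format line: host, timestamp, method, path.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')

# The six entries quoted in the post above.
SAMPLE = """\
crawler12.googlebot.com - - [11/Feb/2003:01:17:40 -0500] "GET /page4.htm
crawler10.googlebot.com - - [11/Feb/2003:01:17:40 -0500] "GET /page4.htm
crawler11.googlebot.com - - [11/Feb/2003:05:09:44 -0500] "GET /subdir1/page13.htm
crawler10.googlebot.com - - [11/Feb/2003:05:09:46 -0500] "GET /subdir1/page13.htm
crawler12.googlebot.com - - [11/Feb/2003:06:27:09 -0500] "GET /subdir1/page16.htm
crawler11.googlebot.com - - [11/Feb/2003:06:27:29 -0500] "GET /subdir1/page16.htm
"""

def parse(line):
    """Return (host, timestamp, path), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    host, stamp, _method, path = m.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return host, when, path

def repeated_fetches(lines, window_seconds=60):
    """List (path, first_host, second_host, gap_seconds) for consecutive
    googlebot fetches of the same path that are at most window_seconds apart."""
    by_path = defaultdict(list)
    for line in lines:
        rec = parse(line)
        if rec and rec[0].endswith(".googlebot.com"):
            host, when, path = rec
            by_path[path].append((when, host))
    hits = []
    for path, fetches in by_path.items():
        fetches.sort()  # order by time, then host name
        for (t1, h1), (t2, h2) in zip(fetches, fetches[1:]):
            gap = (t2 - t1).total_seconds()
            if gap <= window_seconds:
                hits.append((path, h1, h2, gap))
    return hits

if __name__ == "__main__":
    for hit in repeated_fetches(SAMPLE.splitlines()):
        print(hit)
```

Matching on the reverse-DNS name only works if the server logs host names; with raw IP addresses you would have to filter on the user agent or known address ranges instead.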

yetanotheruser

2:43 pm on Feb 11, 2003 (gmt 0)

10+ Year Member



Welcome, Spica. I'm new too :)

Very interesting behaviour, but I've personally never seen this. I would hazard a guess that it's a glitch, but I wouldn't swear to it.

I have been watching GB intently over the last couple of days as it goes through one of our sites. From a quick look, it has eaten 27k pages over the past few days, and only 28.2k since the log started.

Now, I know it took 1.2k pages before the last index, and 28.2k minus 1.2k is 27k, which would show that it has not looked at any pages more than once this time through. (In fact it has, but just /robots.txt and /index.html a couple of extra times.)

Wonder whether anyone else has seen similar behaviour to you?

ATB, :)

bull

2:49 pm on Feb 11, 2003 (gmt 0)

10+ Year Member



Even more interesting, here the same crawler fetches the same page twice within half a minute:

crawler12.googlebot.com - - [11/Feb/2003:08:14:40 +0100] "GET /robots.txt HTTP/1.0" 200 64 www.me.net "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" "-"
crawler12.googlebot.com - - [11/Feb/2003:08:14:43 +0100] "GET /dir/page1.html HTTP/1.0" 200 2962 www.me.net "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" "-"
crawler12.googlebot.com - - [11/Feb/2003:08:15:06 +0100] "GET /dir/page1.html HTTP/1.0" 200 2962 www.me.net "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" "-"

Normally I would expect a 304 to be returned the second time. Maybe it was two different machines with the same rDNS.
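One caveat on the 304: a server can only answer 304 Not Modified when the request is conditional, for example when it carries an If-Modified-Since header. The log above does not show request headers, so whether Googlebot sent one here is unknown. A minimal sketch of the decision in Python, where the function name `status_for_get` is made up for illustration and real servers apply more rules than this:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def status_for_get(last_modified, if_modified_since=None):
    """Pick 200 or 304 for a GET request.

    last_modified     -- timezone-aware datetime of the resource's last change
    if_modified_since -- raw If-Modified-Since header value, or None if absent
    """
    if if_modified_since is None:
        return 200  # unconditional request: always send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the header
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)  # treat as GMT
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # not modified since the client's copy
    return 200

if __name__ == "__main__":
    page_changed = datetime(2003, 2, 10, 12, 0, 0, tzinfo=timezone.utc)
    print(status_for_get(page_changed))
    print(status_for_get(page_changed, "Tue, 11 Feb 2003 07:14:43 GMT"))
```

Under this logic, two plain GETs with no If-Modified-Since header would both get 200 with the same byte count, which is at least consistent with the two 200 / 2962 entries above.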

jan

yetanotheruser

2:54 pm on Feb 11, 2003 (gmt 0)

10+ Year Member



I don't think I've ever seen GB get a 304. Maybe GB is running under an HTTP version that doesn't get 304s? Does HTTP/1.0?

Brett_Tabke

2:59 pm on Feb 11, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I've seen that a couple of times, Spica. The only thing I could think of was that two different spiders got the same job list.

bull

6:51 pm on Feb 11, 2003 (gmt 0)

10+ Year Member



I don't think I've ever seen GB get a 304. Maybe GB is running under an HTTP version that doesn't get 304s? Does HTTP/1.0?

216.239.46.88 - - [06/Feb/2003:16:14:05 +0100] "GET /a/b.html HTTP/1.0" 304 ....
That was the last time I saw it get a 304, so it seems HTTP/1.0 does have 304.

Also:
[webmasterworld.com...]

regards,
jan