Sorry if this has been discussed before, but I could not find detailed information on this topic. My question relates to the method Googlebot and other spiders use to grab the various pages on a site. How do they decide which page to grab first? This month I have been looking at my raw logs to see whether there is any pattern in the order in which pages are picked up, but it seems mostly random. I have a small PR5 site with only about 200 pages. However, I find the latest entries interesting. They look like this:
crawler12.googlebot.com - - [11/Feb/2003:01:17:40 -0500] "GET /page4.htm
crawler10.googlebot.com - - [11/Feb/2003:01:17:40 -0500] "GET /page4.htm
crawler11.googlebot.com - - [11/Feb/2003:05:09:44 -0500] "GET /subdir1/page13.htm
crawler10.googlebot.com - - [11/Feb/2003:05:09:46 -0500] "GET /subdir1/page13.htm
crawler12.googlebot.com - - [11/Feb/2003:06:27:09 -0500] "GET /subdir1/page16.htm
crawler11.googlebot.com - - [11/Feb/2003:06:27:29 -0500] "GET /subdir1/page16.htm
Parallel crawling? Deliberate clean up of missing/incompletely indexed pages?
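For anyone who wants to check their own logs for the same pattern, here is a rough sketch that flags the same URL being fetched twice within a short window. It assumes Common Log Format style lines like the ones above; the function names and the 60-second window are my own choices for illustration, not anything Googlebot documents.

```python
import re
from collections import defaultdict
from datetime import datetime

# host ident user [timestamp] "METHOD path ...  (the closing quote may be missing)
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')


def parse_line(line):
    """Return (host, timestamp, method, path), or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    host, stamp, method, path = m.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return host, when, method, path


def find_repeat_fetches(lines, window_seconds=60):
    """List (path, first_host, second_host, gap_seconds) for consecutive
    GETs of the same path that are at most window_seconds apart."""
    by_path = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        host, when, method, path = parsed
        if method == "GET":
            by_path[path].append((when, host))

    repeats = []
    for path, hits in by_path.items():
        hits.sort(key=lambda hit: hit[0])  # order by time; ties keep log order
        for (t1, h1), (t2, h2) in zip(hits, hits[1:]):
            gap = (t2 - t1).total_seconds()
            if gap <= window_seconds:
                repeats.append((path, h1, h2, gap))
    return repeats


# The six entries quoted in this post, exactly as truncated above.
SAMPLE = [
    'crawler12.googlebot.com - - [11/Feb/2003:01:17:40 -0500] "GET /page4.htm',
    'crawler10.googlebot.com - - [11/Feb/2003:01:17:40 -0500] "GET /page4.htm',
    'crawler11.googlebot.com - - [11/Feb/2003:05:09:44 -0500] "GET /subdir1/page13.htm',
    'crawler10.googlebot.com - - [11/Feb/2003:05:09:46 -0500] "GET /subdir1/page13.htm',
    'crawler12.googlebot.com - - [11/Feb/2003:06:27:09 -0500] "GET /subdir1/page16.htm',
    'crawler11.googlebot.com - - [11/Feb/2003:06:27:29 -0500] "GET /subdir1/page16.htm',
]

if __name__ == "__main__":
    for path, first, second, gap in find_repeat_fetches(SAMPLE):
        print(f"{path}: {first} then {second}, {gap:.0f}s apart")
```

On the six entries above it reports all three pages as repeat fetches, each by two different crawler hosts. To run it over a real log, pass the open file object instead of SAMPLE.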
Very interesting behavior, but I've personally never seen this. I would hazard a guess that it's a glitch, but I wouldn't swear to it.
Have been watching GB intently over the last couple of days as it goes through one of our sites. From a quick look, though, it has eaten 27k pages over the past few days, and only 28.2k since the log started.
Now, I know it took 1.2k pages before the last index, which would suggest that it has not looked at any page more than once this time through (in fact it has, but only /robots.txt and /index.html, a couple of extra times).
Wonder whether anyone else has seen behaviour similar to yours?
ATB, :)
crawler12.googlebot.com - - [11/Feb/2003:08:14:40 +0100] "GET /robots.txt HTTP/1.0" 200 64 www.me.net "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" "-"
crawler12.googlebot.com - - [11/Feb/2003:08:14:43 +0100] "GET /dir/page1.html HTTP/1.0" 200 2962 www.me.net "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" "-"
crawler12.googlebot.com - - [11/Feb/2003:08:15:06 +0100] "GET /dir/page1.html HTTP/1.0" 200 2962 www.me.net "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" "-"
Normally a 304 should be returned the second time (assuming the second request was a conditional one, sent with If-Modified-Since). Maybe it was two different machines with the same rDNS.
jan
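To illustrate the 200 versus 304 question for the two fetches of /dir/page1.html above: a server only answers 304 when the request carries an If-Modified-Since header and the page has not changed since that date. The log lines here do not record request headers, so there is no way to tell from them whether the second request was conditional. Below is a minimal sketch of the server-side decision, not how any particular web server implements it; response_status and the dates in the demo are made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def response_status(request_headers, last_modified):
    """Return 304 only for a conditional GET on an unchanged page, else 200.

    request_headers: dict of header name to value (hypothetical shape).
    last_modified: timezone-aware datetime, whole seconds, of the page's last change.
    """
    raw = request_headers.get("If-Modified-Since")
    if raw is None:
        return 200  # unconditional request: always send the full page
    try:
        since = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the header
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    return 304 if last_modified <= since else 200


if __name__ == "__main__":
    # Made-up dates, for illustration only.
    page_changed = datetime(2003, 2, 1, 12, 0, 0, tzinfo=timezone.utc)
    earlier_fetch = datetime(2003, 2, 11, 7, 14, 43, tzinfo=timezone.utc)
    header = format_datetime(earlier_fetch, usegmt=True)

    print(response_status({}, page_changed))                             # no header
    print(response_status({"If-Modified-Since": header}, page_changed))  # conditional
```

So two plain 200s in a row are consistent with two unconditional requests, whether from one machine or two.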
I don't think I've ever seen GB get a 304. Maybe GB is using an HTTP version that doesn't support 304s? Does HTTP/1.0?
216.239.46.88 - - [06/Feb/2003:16:14:05 +0100] "GET /a/b.html HTTP/1.0" 304 ....
That was the last time I saw it get a 304, so it seems HTTP/1.0 does have 304.
Also:
[webmasterworld.com...]
regards,
jan