

Googlebot spidering same page up to 5 times in succession

     
9:47 am on Feb 24, 2012 (gmt 0)

10+ Year Member



Over the last few months Googlebot has found it necessary to spider an increasing number of pages on my sites more than once in succession.

When it first began, this would only happen for a few hours on the odd day, with pages fetched twice in succession. It has now reached the point where it happens all the time, and yesterday I was dismayed to see it visit the same page 4 times in succession, and today 5!

It uses the same if-modified-since date for each request to a page. I only change this date when the page content is updated. Some pages even have 301 redirects and Googlebot will still visit these 3 times in succession.

I did raise this in Google's help forums and found one other person experiencing the same problem, but no answers.

Has anyone else experienced this and do you have any idea what purpose it may serve?
4:45 pm on Feb 24, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Are the accesses all from the same IP or from different IPs?

Do they all use the same UA or are they different?
5:10 pm on Feb 24, 2012 (gmt 0)

10+ Year Member



Usually a couple of IP addresses, and the majority are Googlebot 2.1, although I have occasionally seen the iPhone UA spider in this manner.
5:29 pm on Feb 24, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



I've seen that on pages that return a 500 server error. You may not see the error when you browse your site normally, but the bot often supplies its own odd combinations of query string parameters that only Google knows the origin of, and that the site's developers never expected. Sometimes it comes unnervingly close to penetration testing, even though all the IPs are legitimately Google's (unless they're spoofed, of course).
What are your server's response codes, anyway? Have you tried playing with the Expires header and setting it to some (reasonable) time in the future? Or returning 304 Not Modified if the page did not in fact change in the seconds between bot visits?
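The conditional-GET logic being discussed here can be sketched out. This is a minimal illustration, not anything from the poster's site: given the If-Modified-Since date the bot sends and the page's actual last-modified date, the server should answer 304 when nothing has changed and 200 otherwise.

```python
from email.utils import parsedate_to_datetime

def conditional_get_status(if_modified_since, last_modified):
    """Decide between 200 and 304 for a conditional GET.

    Both arguments are HTTP-date strings, e.g.
    'Sat, 25 Feb 2012 10:00:00 GMT'. Returns the status code the
    server should send: 304 if the page has not changed since the
    date the client supplied, else 200 (serve the full page).
    """
    if if_modified_since is None:
        return 200  # unconditional request: always serve the page
    try:
        client_date = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # malformed header: ignore it and serve the page
    page_date = parsedate_to_datetime(last_modified)
    return 304 if page_date <= client_date else 200
```

A 304 carries no body, so even if Googlebot does re-request the same unchanged page five times, the cost of each repeat is a few hundred bytes rather than the whole page.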
7:12 pm on Feb 24, 2012 (gmt 0)

10+ Year Member



These pages don't change very often, so the bot gets a 304 Not Modified unless I've updated the page's database record.

They are set to expire in 30 days.

I've seen Googlebot-Image add querystring parameters, but never Googlebot.

You may have something with the 500 error though. I see a small number of pages in WMT indicating this and also connection reset. But I can't reproduce any error when I check these pages!

Is there any way I can test for this? Perhaps the server logs may reveal something.
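The server logs will indeed reveal this, if they are in the standard Apache "combined" format. As a sketch (assuming that log format, and nothing specific to the poster's server), the following tallies response codes per URL for Googlebot requests only, so repeated fetches and 5xx responses stand out:

```python
import re
from collections import Counter

# Apache "combined" format: IP ident user [date] "REQUEST" STATUS SIZE "REFERER" "UA"
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "([A-Z]+) (\S+)[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)

def googlebot_errors(lines):
    """Count response codes per URL for requests whose UA mentions Googlebot.

    Returns {url: Counter({status_code: hits})}. Lines that don't match
    the combined log format are skipped.
    """
    seen = {}
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, method, url, status, ua = m.groups()
        if "Googlebot" not in ua:
            continue
        seen.setdefault(url, Counter())[int(status)] += 1
    return seen
```

Feed it the raw access log and look for URLs where the same status (or a 500) appears several times in a short window; that pattern would line up with the triple and quintuple visits you're seeing.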
7:14 pm on Feb 24, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Does Xenu LinkSleuth find anything?
12:29 pm on Feb 25, 2012 (gmt 0)

10+ Year Member



I haven't used LinkSleuth before, so I'm not sure what to look for in the results, but apart from some unused missing icons linked from the style sheet, all the HTML pages loaded OK.
1:07 am on Feb 26, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



If it's an index page, make sure all your links are in the same form. No matter how often they get redirected from
blahblah/index.html
to
blahblah/
search engines will keep checking both, especially if you reinforce it by having internal links that use both forms. You may or may not find them in the Duplicate Titles section of WMT.
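The usual fix for that duplication, assuming Apache with mod_rewrite (a sketch, adapt to your own setup), is a single 301 that collapses every request for index.html onto the bare folder URL. The THE_REQUEST condition matters: it ensures the rule only fires on what the client actually asked for, so the internal DirectoryIndex lookup of index.html doesn't cause a redirect loop.

```apache
# Collapse /anything/index.html onto /anything/ with a single 301,
# so there is only one canonical form for search engines to crawl.
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?\ ]*)index\.html[?\ ]
RewriteRule ^(.*)index\.html$ /$1 [R=301,L]
```

With that in place, fix the internal links to use the bare-folder form as well, so the redirect rarely has to fire.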

There's a whole army of googlebots and they don't seem to talk to each other much. You have to picture them having a meeting every few days to get up to speed on robots.txt and the like. But they don't communicate on the fly.
11:17 am on Feb 26, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



You may have something with the 500 error though.

This is strange. Where are you seeing the googlebot accesses? Isn't it from the raw server logs on your site? Because they will show the server response and the real IPs.
11:40 am on Feb 26, 2012 (gmt 0)

10+ Year Member



What I'm seeing is a page being spidered usually three times but up to five, normally in succession, with a delay of typically 3 minutes.

This pattern will repeat for probably 3 urls, then it might spider between 1 and 10 pages just once, then go back to the triple spidering.

If this was an issue with the server, I would expect the triple spidering to continue for longer. The 2 sites in question are on a dedicated server that gets a pitiful level of traffic.

The site is responsive and pages load in under 0.25 seconds.
11:48 am on Feb 26, 2012 (gmt 0)

10+ Year Member



@enigma1 I have my own stats script which is where I'm seeing this behaviour.

I'll contact my hosting support to get access to the server logs if it contains the server response.

I'm only seeing a handful of 500 errors in WMT, significantly less than the number of pages that are triple spidered.
12:17 pm on Feb 26, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Sorry acee, but just to make double sure: These are from recognized g### ranges, right? Not, say, for example, somewhere in the Ukraine? (It was the "three or five" that made me want to double-check.)
5:15 pm on Feb 26, 2012 (gmt 0)

10+ Year Member



66.249.72.241/66.249.66.5 etc
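Those are in Google's published ranges, but the definitive test Google itself recommends is reverse DNS followed by a forward confirmation: the PTR record for the IP must end in googlebot.com (or google.com), and resolving that hostname must give back the original IP. A sketch in Python (the function names are my own; the lookup calls need network access):

```python
import socket

def hostname_is_google(host):
    """Pure string check on the reverse-DNS name, testable offline.

    Uses a dot-anchored suffix match so that a spoofed name like
    'x.googlebot.com.attacker.net' is rejected.
    """
    return host.endswith(".googlebot.com") or host.endswith(".google.com")

def verify_googlebot_ip(ip):
    """Reverse-DNS the IP, check the name, then forward-confirm it.

    Returns True only if the PTR hostname is Google's AND that
    hostname resolves back to the original IP.
    """
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False  # no reverse DNS at all
    if not hostname_is_google(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

For 66.249.66.5 the PTR comes back as something like crawl-66-249-66-5.googlebot.com, which passes both checks; a spoofed UA from a random IP fails at the first step.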
5:58 pm on Feb 26, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Google Webmaster Tools won't list all errors; in some cases it won't show any errors at all. If you have access to your server logs from the host, locate the error responses, i.e. the 5xx codes, and go on from there: isolate the errors and fix them. Code 503 means Service Unavailable, for instance, so Google will retry the page.

I am not sure how you ended up with these error server responses though. If it was like this from the beginning you wouldn't have any pages indexed.
11:41 am on Sep 6, 2012 (gmt 0)

10+ Year Member



This continues to be a problem and Google's WMT sends me a regular reminder!

Now that WMT actually lists the nature of the errors, I can see it's a connection reset issue, although when I retrieve the problem pages using Fetch as Googlebot, it mostly returns them without error and only occasionally reports 'unreachable page'.

What's likely to cause an intermittent connection reset?
 
