Googlebot spidering same page up to 5 times in succession
acee
msg:4421270 - 9:47 am on Feb 24, 2012 (gmt 0)

Over the last few months Googlebot has found it necessary to spider an increasing number of pages on my sites more than once in succession.

When it first began doing this, it would just happen for a few hours on the odd day, taking pages twice in succession. It has now reached the point where this takes place all the time, and yesterday I was dismayed to see it visit the same page 4 times in succession, and today 5!

It sends the same If-Modified-Since date with each request to a page; I only change this date when the page content is updated. Some pages even have 301 redirects, and Googlebot will still visit these 3 times in succession.
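If anyone wants to reproduce the check from outside, here is a rough sketch (Python, standard library only; the URL and date are placeholders) that repeats a conditional GET with the same If-Modified-Since header and prints the status code the server returns each time. If the server is behaving, an unchanged page should come back as 304 on every attempt:

    # Rough sketch: repeat a conditional GET the way the bot does and print the
    # status each time. The URL and If-Modified-Since date are placeholders.
    import time
    import urllib.error
    import urllib.request

    URL = "http://www.example.com/widget-page.html"   # placeholder URL
    IMS = "Fri, 24 Feb 2012 09:00:00 GMT"             # placeholder date

    for attempt in range(1, 4):
        req = urllib.request.Request(URL, headers={
            "If-Modified-Since": IMS,
            "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        })
        try:
            with urllib.request.urlopen(req) as resp:
                print(attempt, resp.status)           # 200 means the full page was sent
        except urllib.error.HTTPError as err:
            print(attempt, err.code)                  # 304 (not modified) and 5xx land here
        time.sleep(2)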

I did raise this in Google's help forums and also found another guy experiencing the same problem, but found no answers.

Has anyone else experienced this and do you have any idea what purpose it may serve?

 

g1smd
msg:4421394 - 4:45 pm on Feb 24, 2012 (gmt 0)

Are the accesses all from the same IP or from different IPs?

Do they all use the same UA or are they different?

acee
msg:4421410 - 5:10 pm on Feb 24, 2012 (gmt 0)

Usually a couple of IP addresses, and the majority are Googlebot 2.1, although I have seen the iPhone UA spider in this manner occasionally.

1script
msg:4421414 - 5:29 pm on Feb 24, 2012 (gmt 0)

I've seen that on pages that return a 500 server error. You may not see the error when you browse your site normally, but the bot often supplies its own weird combination of query string parameters that only Google knows the origin of and that the site's developers never expected. Sometimes it gets unnervingly close to penetration testing, even though all the IPs are legitimately Google's (unless they are spoofed, of course).

What are your server's response codes, anyway? Have you tried playing with the Expires header and setting it to some (reasonable) time in the future? Or returning 304 Not Modified if the page did not in fact get modified in the seconds between the bot visits?
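To make the Expires/304 idea concrete, here is a toy sketch (Python standard library; the page body and timestamps are made up) of a handler that answers If-Modified-Since with 304 Not Modified when nothing has changed and sets an Expires header about 30 days out. A real site would do this in its CMS or server configuration rather than in a toy handler, but it shows the shape of it:

    # Toy sketch of the suggestion above: honour If-Modified-Since with a 304
    # and set Expires ~30 days ahead. Page body and mtime are placeholders.
    import time
    from email.utils import formatdate, parsedate_to_datetime
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE_BODY = b"<html><body>widget page</body></html>"   # placeholder content
    PAGE_MTIME = time.time() - 86400                        # pretend it changed a day ago

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            ims = self.headers.get("If-Modified-Since")
            if ims:
                try:
                    if parsedate_to_datetime(ims).timestamp() >= PAGE_MTIME:
                        self.send_response(304)             # unchanged: send no body
                        self.end_headers()
                        return
                except (TypeError, ValueError):
                    pass                                    # unparsable date: send the page
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Last-Modified", formatdate(PAGE_MTIME, usegmt=True))
            self.send_header("Expires", formatdate(time.time() + 30 * 86400, usegmt=True))
            self.send_header("Content-Length", str(len(PAGE_BODY)))
            self.end_headers()
            self.wfile.write(PAGE_BODY)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), Handler).serve_forever()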

acee
msg:4421455 - 7:12 pm on Feb 24, 2012 (gmt 0)

These pages don't change very often, so the bot gets a 304 Not Modified unless I've updated the page's database record.

They are set to expire in 30 days.

I've seen Googlebot-Image add querystring parameters, but never Googlebot.

You may have something with the 500 error though. I see a small number of pages in WMT flagged with this, and also with connection resets, but I can't reproduce any error when I check these pages!

Is there any way I can test for this? Perhaps the server logs may reveal something.
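The access log would be the place to start: something like this rough sketch (assuming the common combined log format and a made-up log path) prints every Googlebot request that got a 5xx response, which should show whether the bot is hitting errors I never see in a browser:

    # Rough sketch: pull Googlebot requests with 5xx status codes out of an
    # access log in combined format. The log path and format are assumptions.
    import re

    LOG_PATH = "/var/log/apache2/access.log"          # placeholder path
    # combined format: ip - - [time] "request" status bytes "referer" "user-agent"
    LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) \S+ "[^"]*" "([^"]*)"')

    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for raw in log:
            m = LINE.match(raw)
            if not m:
                continue
            ip, when, request, status, agent = m.groups()
            if "Googlebot" in agent and status.startswith("5"):
                print(when, ip, status, request)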

g1smd
msg:4421456 - 7:14 pm on Feb 24, 2012 (gmt 0)

Does Xenu LinkSleuth find anything?

acee
msg:4421672 - 12:29 pm on Feb 25, 2012 (gmt 0)

I haven't used LinkSleuth before, so I'm not sure what I'm looking for in the results, but with the exception of some unused, missing icons linked to in the style sheet, all the HTML pages loaded OK.

lucy24
msg:4421827 - 1:07 am on Feb 26, 2012 (gmt 0)

If it's an index page, make sure all your links are in the same form. No matter how often they get redirected from
blahblah/index.html
to
blahblah/
search engines will keep checking both -- especially if you reinforce it by having internal links that use both forms. You may or may not find them in the Duplicate Titles section of WMT.
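If you can get at the pages on disk (or a dump of the rendered HTML), a rough sketch like this will list every link that still points at index.html so they can all be brought into one form; the document root below is a placeholder:

    # Rough sketch: find links that still point at .../index.html so they can
    # be normalised to the bare-directory form. The site root is a placeholder.
    import re
    from pathlib import Path

    SITE_ROOT = Path("/var/www/example-site")         # placeholder document root
    HREF = re.compile(r'href="([^"]*index\.html?)"', re.IGNORECASE)

    for page in sorted(SITE_ROOT.rglob("*.htm*")):
        text = page.read_text(encoding="utf-8", errors="replace")
        for link in HREF.findall(text):
            print(f"{page}: {link}")                  # rewrite these to the bare-directory form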

There's a whole army of googlebots and they don't seem to talk to each other much. You have to picture them having a meeting every few days to get up to speed on robots.txt and the like. But they don't communicate on the fly.

enigma1
msg:4421912 - 11:17 am on Feb 26, 2012 (gmt 0)

You may have something with the 500 error though.

This is strange. Where are you seeing the Googlebot accesses? Isn't it from the raw server logs on your site? They will show the server response and the real IPs.

acee
msg:4421914 - 11:40 am on Feb 26, 2012 (gmt 0)

What I'm seeing is a page being spidered usually three times, but up to five, normally in succession, with a delay of typically 3 minutes between requests.

This pattern will repeat for probably 3 URLs, then it might spider between 1 and 10 pages just once, then go back to the triple spidering.

If this were an issue with the server, I would expect the triple spidering to continue for longer. The 2 sites in question are on a dedicated server that gets a pitiful level of traffic.

The site is responsive and pages load in under 0.25 seconds.
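For anyone who wants to see the same pattern in the raw access log rather than a stats script, a rough sketch along these lines (same assumed combined log format, placeholder log path) flags any URL that Googlebot fetched three or more times inside a 15-minute window:

    # Rough sketch: flag URLs fetched by Googlebot 3+ times within 15 minutes.
    # Assumes the combined log format; the log path is a placeholder.
    import re
    from collections import defaultdict
    from datetime import datetime, timedelta

    LOG_PATH = "/var/log/apache2/access.log"          # placeholder path
    LINE = re.compile(r'\[([^\]]+)\] "(?:GET|HEAD) (\S+) [^"]*" \d{3} \S+ "[^"]*" "([^"]*)"')
    WINDOW = timedelta(minutes=15)

    fetches = defaultdict(list)                       # URL -> list of request times
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for raw in log:
            m = LINE.search(raw)
            if not m or "Googlebot" not in m.group(3):
                continue
            when = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
            fetches[m.group(2)].append(when)

    for url, times in fetches.items():
        times.sort()
        for i in range(len(times) - 2):
            if times[i + 2] - times[i] <= WINDOW:     # 3+ fetches inside the window
                print(url, [t.strftime("%d/%b %H:%M:%S") for t in times])
                break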

acee
msg:4421916 - 11:48 am on Feb 26, 2012 (gmt 0)

@enigma1 I have my own stats script, which is where I'm seeing this behaviour.

I'll contact my hosting support to get access to the server logs, if they contain the server response codes.

I'm only seeing a handful of 500 errors in WMT, significantly fewer than the number of pages that are triple spidered.

lucy24
msg:4421920 - 12:17 pm on Feb 26, 2012 (gmt 0)

Sorry acee, but just to make double sure: these are from recognized g### ranges, right? Not, say, somewhere in the Ukraine? (It was the "three or five" that made me want to double-check.)

acee
msg:4421961 - 5:15 pm on Feb 26, 2012 (gmt 0)

66.249.72.241, 66.249.66.5, etc.
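For the record, the usual way to confirm those are genuine is the reverse-then-forward DNS check Google recommends: reverse-resolve the IP, check the host name ends in googlebot.com or google.com, then resolve that host forward and make sure it gives the same IP back. A minimal sketch:

    # Minimal sketch of the reverse/forward DNS check for Googlebot IPs.
    import socket

    def is_real_googlebot(ip):
        try:
            host = socket.gethostbyaddr(ip)[0]        # reverse DNS lookup
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            return ip in socket.gethostbyname_ex(host)[2]   # forward lookup must match
        except socket.gaierror:
            return False

    for ip in ("66.249.72.241", "66.249.66.5"):
        print(ip, is_real_googlebot(ip))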

enigma1
msg:4421975 - 5:58 pm on Feb 26, 2012 (gmt 0)

Google Webmaster Tools won't list all errors; in some cases it won't show any errors at all. If you have access to your server logs from the host, locate the error responses, like those 5xx codes, and go on from there - isolate the errors and fix them. Code 503 means Service Unavailable, for instance, so Google will retry the page.

I'm not sure how you ended up with these server error responses, though. If it had been like this from the beginning, you wouldn't have any pages indexed.

acee
msg:4491869 - 11:41 am on Sep 6, 2012 (gmt 0)

This continues to be a problem and Google's WMT sends me a regular reminder!

Now that WMT actually lists the nature of the errors, I can see that it's a connection reset issue. However, when I fetch the pages it has a problem with using Fetch as Googlebot, it mostly returns them without error, and only occasionally returns 'unreachable page'.

What's likely to cause an intermittent connection reset?
