Forum Moderators: Robert Charlton & goodroi

Googlebot hitting same page hundreds of times a day

     
2:49 pm on Apr 15, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 24, 2003
posts: 741
votes: 74


Googlebot (from IP 66.102.8.139) keeps hitting the same page on my site many hundreds of times a day; the requests are sometimes only seconds apart. This has been going on for two weeks now. The page content is not something I would think is wildly popular, but Googlebot seems to have its needle stuck. The URL is redirected and producing a 200 OK. Should I block this ardent admirer of my page?
10:24 pm on Apr 15, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4343
votes: 292


Google owns that IP; it is part of Google Cloud. But unless they started today, they do not crawl from that range. I'd block abusive traffic. Check your logs and see whether it ever requests robots.txt.

Since Google is one of the compliant robots that is happy to share the ranges it crawls from, and that IP is not on the list, I would guess someone is impersonating the Googlebot UA. From the kind of traffic you describe, it is not doing this for your benefit imho.
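
One way to make that log check easier is to send the suspect address to its own log file. This is only a sketch: it assumes you can edit the main server or virtual host config (CustomLog is not allowed in .htaccess), that the standard "combined" log format is defined, and the log file name is made up for illustration.

```apache
# Server or <VirtualHost> context only; not valid in .htaccess.
# Tag every request coming from the address reported in this thread.
SetEnvIf Remote_Addr "^66\.102\.8\.139$" suspect_ip

# Write tagged requests to a separate file (hypothetical path),
# using the stock "combined" LogFormat nickname.
CustomLog "logs/suspect_ip.log" combined env=suspect_ip
```

After a day or so, a quick search of that file for "robots.txt" would show whether the client ever asks for it.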
11:03 pm on Apr 15, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:3605
votes: 344


66.102.8.139

That's a Google-owned proxy. Rather than blocking the IP, it might be better if you can block the UA.
11:29 pm on Apr 15, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


"it might be better if you can block the UA."
Well, I wouldn't advise blocking the UA :)

As not2easy & aristotle stated above, you are being scraped by a poser using one of Google's many proxy ranges. Anybody can use these ranges.

The real Googlebot will only come from the designated crawl range:
crawl-66-249-66-1.googlebot.com
66.249.64.0 - 66.249.95.255
66.249.64.0/19

source: [webmasters.googleblog.com...]

However, there are a few other valid Google UAs that include the token "Googlebot", such as Googlebot-Image, Googlebot-News, Googlebot-Video, etc.
They will come from these other Google ranges:
74.125.0.0 - 74.125.255.255
74.125.0.0/16
173.194.0.0 - 173.194.255.255
173.194.0.0/16

source: [support.google.com...]

This htaccess code will block any fake Googlebot UA while still allowing the other valid Googlebot UAs:

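# Assumes "RewriteEngine On" is already set earlier in the file.
# Match any request whose UA claims to be Googlebot...
RewriteCond %{HTTP_USER_AGENT} Googlebot
# ...but whose IP is outside the Google ranges listed above.
# Note: [6-9][0-9] matches 60-99, slightly wider than 66.249.64.0/19 (64-95).
RewriteCond %{REMOTE_ADDR} !^66\.249\.[6-9][0-9]\.
RewriteCond %{REMOTE_ADDR} !^74\.125\.
RewriteCond %{REMOTE_ADDR} !^173\.194\.
# "^" matches every URL; "-" means no substitution; [F] returns 403 Forbidden.
RewriteRule ^ - [F]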
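
For anyone on Apache 2.4 or later, the same idea could be written against the CIDR blocks themselves rather than regex approximations of them. Treat this as a rough sketch, not tested on every setup: it assumes mod_rewrite is enabled, and it takes the three ranges exactly as posted above, which may no longer be current.

```apache
RewriteEngine On

# Request claims to be some flavor of Googlebot...
RewriteCond %{HTTP_USER_AGENT} Googlebot

# ...but the client address is in none of the posted ranges.
# "expr" switches RewriteCond to Apache 2.4 expression syntax,
# where -ipmatch compares an address against a CIDR block.
RewriteCond expr "! %{REMOTE_ADDR} -ipmatch '66.249.64.0/19'"
RewriteCond expr "! %{REMOTE_ADDR} -ipmatch '74.125.0.0/16'"
RewriteCond expr "! %{REMOTE_ADDR} -ipmatch '173.194.0.0/16'"

# All conditions are ANDed; matching requests get 403 Forbidden.
RewriteRule ^ - [F]
```

The CIDR form covers exactly 66.249.64.x through 66.249.95.x, so it avoids the few extra third-octet values that a two-digit character class lets through.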
1:43 am on Apr 16, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15698
votes: 810


RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REMOTE_ADDR} !^66\.249\.[6-9][0-9]\.
The Googlebot really shouldn't be crawling from the .80-.95 half of that range, just the .64-.79 half. The upper half is used for assorted other Googloid functions--including things like Translate that people have varying opinions on--but not the crawler as such.
1:48 am on Apr 16, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 2612
votes: 763


There are Google services such as Google Web Light and Google Translate, and even PageSpeed Insights, that use these proxies.

Does the log explicitly say Googlebot, or are you assuming that it is Googlebot because the IP is owned by Google?
1:59 am on Apr 16, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


"The upper half is used for assorted other Googloid functions--including things like Translate that people have varying opinions on--but not the crawler as such."
Googlebot uses that range. Yes, it has historically crawled from the lower part, but the entire range is the official crawl range (see the source link).

As far as allowing the proxies, I personally don't block any of them. I control who uses them with other filters.
2:06 pm on Apr 16, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 24, 2003
posts: 741
votes: 74


@NickMNS - No, the log says "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)". So it appears to be related to Google Images. I see that the same image is being hit from 66.102.8.205 and 66.102.9.45 over and over as well. I have that image 301 redirected to a page now. Perhaps redirecting the image to the page is causing Google's image bot some indigestion?
2:42 pm on Apr 16, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 2612
votes: 763


Did you check the GSC Search Analytics report, filtering by search type = image, to see whether there was a spike in image search impressions?
2:51 pm on Apr 16, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 24, 2003
posts: 741
votes: 74


@NickMNS No such spike, and that image is not in any of the search terms either. It is also not showing in page impressions; remember, the image is being redirected to the page now.
4:22 pm on Apr 16, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:3605
votes: 344


"Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)"


Just at a glance, that UA looks like it would be easy to block.
9:06 pm on Apr 16, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


aristotle - why is your first reaction always to block?

ichthyous - so it is *not* Googlebot as you reported. OK. A little reading at the source links I posted above will tell you what GoogleImageProxy is.

This is the bot that retrieves images for many of Google's resources including Google Places, Google Plus, Google Search, etc.

IMO you do not want to block by UA. This is a "good" bot. It enhances your web presence by including an image when someone posts a link to your site or in mobile search when someone searches for a nearby resource.
9:33 pm on Apr 16, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:3605
votes: 344


"aristotle - why is your first reaction always to block?"

I would bet that you do a lot more blocking than I do.

At any rate, I haven't seen an explanation for why the same image is being fetched hundreds of times per day. So there's still room for doubt as to what's happening in this case.
10:58 pm on Apr 16, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15698
votes: 810


"Googlebot uses that range."

I got curious and checked (global search running in the background while busy with other stuff). In June-July last year, the DoCoMo user-agent (one of the Googlebot-Mobile family) did considerable crawling from 66.249.91-92. Other than that, nothing.

Huh.

aristotle, in the specific case of an image you may find it more useful to rewrite to a single-pixel gif instead of blocking. For most setups it comes out to less server overhead, if that's your main objective.
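
A minimal sketch of that single-pixel rewrite follows, for a root .htaccess. Both file paths are invented for illustration (the thread never names the image), and the 1x1 GIF is assumed to be a file you have already uploaded. The UA test uses the "GoogleImageProxy" string from the log line quoted earlier.

```apache
# Hypothetical paths, relative to the site root:
#   images/photo.jpg  - the image being fetched over and over
#   images/pixel.gif  - a 1x1 GIF you have placed there yourself
RewriteEngine On

# Only touch requests that identify themselves as the image proxy.
RewriteCond %{HTTP_USER_AGENT} GoogleImageProxy

# Internally serve the tiny GIF instead of the real file.
# No redirect is sent, so the client sees a normal 200 response.
RewriteRule ^images/photo\.jpg$ /images/pixel.gif [L]
```

If the same image already has a redirect rule, this one would presumably need to sit above it so it gets the first chance to match.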
3:14 pm on Apr 17, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 24, 2003
posts: 741
votes: 74


@keyplyr The link you posted didn't specifically mention this bot, but I did some research and it seems that the various subdomains at ggpht.com are image storage for the old Picasa and other user-generated content, so I'm not sure what is going on there or why this bot is so fixated on that one image...
4:11 pm on Apr 27, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 24, 2003
posts: 741
votes: 74


I ended up blocking the whole set of addresses (see below). I am loath to block Google, but this bot hit the same page over 22,000 times so far this month before I blocked it. I have no idea what is hitting my site from Google so many times, but I can't have my resources sucked up.

Order Allow,Deny
Deny from 66.102.8.137
Deny from 66.102.8.139
Deny from 66.102.8.141
Deny from 66.102.8.201
Deny from 66.102.8.203
Deny from 66.102.8.205
Allow from all
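
The Order/Deny/Allow directives above are Apache 2.2 style; on 2.4 they only work through mod_access_compat. A rough 2.4 equivalent, assuming mod_authz_core and mod_authz_host are loaded and that .htaccess is allowed to override AuthConfig, might look like this. It lists the same six addresses and nothing else.

```apache
# Allow everyone except the six listed addresses.
# Inside <RequireAll>, every line must pass, so a single
# matching "Require not ip" is enough to refuse the request.
<RequireAll>
    Require all granted
    Require not ip 66.102.8.137
    Require not ip 66.102.8.139
    Require not ip 66.102.8.141
    Require not ip 66.102.8.201
    Require not ip 66.102.8.203
    Require not ip 66.102.8.205
</RequireAll>
```

Mixing the old and new styles in one file can give surprising results, so it is probably safest to keep only one of the two blocks.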