66.249.16.*** - - [17/Dec/2009:19:43:03 -0600] "GET / HTTP/1.1" 403 666 "http://whois.do*aint**ls.com/example.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
*All* of the additional common HTTP request headers sent with that request:
Connection="keep-alive"
Accept="text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"
Accept-Encoding="gzip,deflate"
Accept-Language="en,en-us;q=0.5"
Accept-Charset="UTF-8,*"
Via="1.1 www.do*aint**ls.com"
X-Forwarded-For="66.249.67.72"
Note that the requesting address *is not* Google. The X-Forwarded-For IP address *is* the legitimate crawl-66-249-67-72.googlebot.com host.
Various aspects of the request headers are 'wrong' for Googlebot, as a result of its requests passing through a proxy.
So Googlebot is being given a proxied path through to my site (example.com) from that site, for purposes unknown.
It's now met with:
# Redirect Googlebot crawling through xyz corp's proxy
RewriteCond %{REMOTE_ADDR} ^66\.249\.([012][0-9]|3[012])\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/[0-9.]+;\ \+http://www\.google\.com/bot\.html\)$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
# Return 403 to any other request from the same address range
RewriteCond %{REMOTE_ADDR} ^66\.249\.([012][0-9]|3[012])\.
RewriteRule ^ - [E=Why_Deny:crawler-proxy-provider,F]
There are admittedly more 'complete' solutions to this problem, but some of them could easily be bypassed. Any enhancements should take into account "who is providing the header you're testing?" and err on the side of caution.
This code assumes that your existing code will fully-authenticate the Googlebot request subsequent to the 301 redirect.
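That 'full authentication' code isn't shown here; in practice it usually means a forward-confirmed reverse DNS check on the connecting address, since a user-agent string or an X-Forwarded-For header can be supplied by anyone. A minimal sketch of that check (illustrative only; the function name and the accepted host suffixes are examples, not part of the code above):

# Illustrative sketch: forward-confirmed reverse DNS check on the connecting IP.
# The function name and accepted suffixes are examples, not the poster's code.
import socket

def is_real_googlebot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]  # e.g. crawl-66-249-67-72.googlebot.com
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm: the claimed hostname must resolve back to the same IP.
        return ip in {addr[4][0] for addr in socket.getaddrinfo(host, None)}
    except socket.gaierror:
        return False

# Example: the X-Forwarded-For address from the log line above.
print(is_real_googlebot("66.249.67.72"))

Anything that only *claims* to be Googlebot in its user-agent or X-Forwarded-For header fails that test.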
Use this example code at your own risk. It's been tested for all of 20 minutes, and is posted only as a suggestion to help avoid duplicate-content problems potentially created by a third party.
Jim
deny from 66.249.16.0/23
Is this evidence that the mod_access method, at least in a proxy situation as described above, could be damaging?
Déjà vu ( [webmasterworld.com...] )
In a way, it 'steals back' the request from their attempt to serve our content under their URL.
I think you'll be OK with a simple 403, but frankly, this code is a 'slap back' at them for allowing proxy-crawling.
Jim