Stinky crawler proxy

I'll offer the following data and code without further comment. "example.com" is my domain.

66.249.16.*** - - [17/Dec/2009:19:43:03 -0600] "GET / HTTP/1.1" 403 666 "http://whois.do*aint**ls.com/example.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

*All* additional common HTTP request headers:
Connection="keep-alive"
Accept="text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"
Accept-Encoding="gzip,deflate"
Accept-Language="en,en-us;q=0.5"
Accept-Charset="UTF-8,*"
Via="1.1 www.do*aint**ls.com"
X-Forwarded-For="66.249.67.72"

Note that the requesting address *is not* Google's. The X-Forwarded-For IP address *is* the legitimate crawl-66-249-67-72.googlebot.com.

Various aspects of the request headers are 'wrong' for Googlebot, as a result of its requests passing through a proxy.
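As a rough illustration of what 'wrong' means here, the sketch below flags a request that claims to be Googlebot yet carries the Via or X-Forwarded-For headers seen in the log above. The function name and the heuristic itself are assumptions made for illustration, not part of the original post. Because all of these headers are supplied by the client, a match is only a hint and never proof.

```python
def looks_like_proxied_googlebot(headers):
    """Return True if a request claims to be Googlebot but carries proxy headers.

    `headers` is a plain dict mapping request header names to values.
    Hypothetical helper: the heuristic is an assumption, not the author's code.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    claims_googlebot = "googlebot" in lowered.get("user-agent", "").lower()
    # Via and X-Forwarded-For are normally added by an intermediary proxy.
    has_proxy_headers = "via" in lowered or "x-forwarded-for" in lowered
    return claims_googlebot and has_proxy_headers
```

A direct Googlebot request would carry neither header, so it would not be flagged by this check.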

So Googlebot's been given a proxy-through-put to my site (example.com) from that site for purposes unknown.

It's now met with:


# Redirect Googlebot crawling through xyz corp's proxy
RewriteCond %{REMOTE_ADDR} ^66\.249\.([012][0-9]|3[012])\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/[0-9.]+;\ \+http://www\.google\.com/bot\.html\)$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
#
RewriteCond %{REMOTE_ADDR} ^66\.249\.([012][0-9]|3[012])\.
RewriteRule ^ - [E=Why_Deny:crawler-proxy-provider,F]
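To sanity-check the two patterns outside Apache, here is a minimal Python sketch that mirrors the rule logic with solid pipes. The function name `decide`, its return labels, and the sample address 66.249.16.1 are hypothetical (the real address is masked in the log above), and Python's `re` module is only an approximation of Apache's regex engine.

```python
import re

# Same patterns as the RewriteCond lines, written with solid pipes and plain spaces.
ADDR_RE = re.compile(r"^66\.249\.([012][0-9]|3[012])\.")
UA_RE = re.compile(
    r"^Mozilla/5\.0 \(compatible; Googlebot/[0-9.]+; "
    r"\+http://www\.google\.com/bot\.html\)$"
)

def decide(remote_addr, user_agent):
    """Mirror the two rules: return 'redirect', 'forbid', or 'pass'."""
    if ADDR_RE.search(remote_addr):
        # First rule: Googlebot's user-agent arriving from the proxy range gets a 301.
        if UA_RE.search(user_agent):
            return "redirect"
        # Second rule: anything else from that range gets a 403.
        return "forbid"
    # Addresses outside the range are left to the rest of the configuration.
    return "pass"
```

With these patterns, `decide("66.249.16.1", ua)` yields "redirect" for the Googlebot user-agent string from the log, while a request from 66.249.67.72 falls outside the range and passes through to whatever checks come next.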

There are admittedly more 'complete' solutions to this problem, but some would allow the solution to be easily bypassed. Any enhancements should take into account "who is providing the header you're testing?" and err on the side of caution.

This code assumes that your existing code will fully authenticate the Googlebot request subsequent to the 301 redirect.
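The post does not show that existing authentication code. One common approach is a forward-confirmed reverse DNS lookup on the connecting address, sketched below. The function name and the injectable `reverse_lookup` / `forward_lookup` parameters are assumptions added so the logic can be exercised without network access. By default they fall back to the standard `socket` calls.

```python
import socket

def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a connecting IP address.

    Hypothetical helper: one way to authenticate Googlebot, not the author's code.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
    except OSError:
        # No PTR record or lookup failure: treat as unverified.
        return False
    # The reverse name must sit under one of Google's crawler domains.
    if not host.lower().rstrip(".").endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # The forward lookup of that name must map back to the same address.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

The address passed in should be the connection's remote address, not the X-Forwarded-For value, since the latter is supplied by whoever sent the request.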

Use this example code at your own risk. It's been tested for all of 20 minutes, and is posted only as a suggestion to help avoid duplicate-content problems potentially created by a third party.

Jim
