Stinky crawler proxy

Googlebot fed through a proxy

         

jdMorgan

5:17 am on Dec 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'll offer the following data and code without further comment. "example.com" stands in for my domain.

66.249.16.*** - - [17/Dec/2009:19:43:03 -0600] "GET / HTTP/1.1" 403 666 "http://whois.do*aint**ls.com/example.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

*All* additional common HTTP request headers:
Connection="keep-alive"
Accept="text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"
Accept-Encoding="gzip,deflate"
Accept-Language="en,en-us;q=0.5"
Accept-Charset="UTF-8,*"
Via="1.1 www.do*aint**ls.com"
X-Forwarded-For="66.249.67.72"

Note that the requesting address *is not* Google. The X-Forwarded-For IP address, however, *is* the legitimate crawl-66-249-67-72.googlebot.com.

Various aspects of the request headers are 'wrong' for Googlebot, as a result of its requests passing through a proxy.

So Googlebot has been handed a proxied route to my site (example.com) through that site, for purposes unknown.

It's now met with:


# Redirect Googlebot crawling through xyz corp's proxy
RewriteCond %{REMOTE_ADDR} ^66\.249\.([012][0-9]|3[012])\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/[0-9.]+;\ \+http://www\.google\.com/bot\.html\)$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
#
# Deny everything else arriving from the same address range
RewriteCond %{REMOTE_ADDR} ^66\.249\.([012][0-9]|3[012])\.
RewriteRule ^ - [E=Why_Deny:crawler-proxy-provider,F]

Make sure the pipes in both REMOTE_ADDR patterns are solid pipe characters ("|") before use; posting on this forum converts them to broken pipes ("¦").
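
To see which requests the two rules actually catch, the patterns can be tried outside Apache. What follows is only a rough Python sketch, not part of the posted rules: the `classify` function and the sample addresses are made up for illustration, and the masked last octet from the log line is replaced with an arbitrary value.

```python
import re

# Same patterns as the two RewriteCond lines, written with solid pipes.
# Apache's escaped spaces ("\ ") become plain spaces here.
PROXY_ADDR = re.compile(r"^66\.249\.([012][0-9]|3[012])\.")
GOOGLEBOT_UA = re.compile(
    r"^Mozilla/5\.0 \(compatible; Googlebot/[0-9.]+; "
    r"\+http://www\.google\.com/bot\.html\)$"
)


def classify(remote_addr: str, user_agent: str) -> str:
    """Mimic the two rules: return 'redirect', 'forbid', or 'pass'."""
    if PROXY_ADDR.search(remote_addr):
        if GOOGLEBOT_UA.search(user_agent):
            return "redirect"  # first rule: 301 to the canonical URL
        return "forbid"        # second rule: 403
    return "pass"              # neither rule applies


UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(classify("66.249.16.1", UA))             # proxy range, Googlebot UA
print(classify("66.249.16.1", "SomeBrowser"))  # proxy range, any other UA
print(classify("66.249.67.72", UA))            # address from X-Forwarded-For
```

As written, the address pattern only matches a two-digit third octet, in practice 10 through 32, so a direct request from 66.249.67.72 is untouched by either rule.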

There are admittedly more 'complete' solutions to this problem, but some would allow the solution to be easily bypassed. Any enhancements should take into account "who is providing the header you're testing?" and err on the side of caution.

This code assumes that your existing code will fully authenticate the Googlebot request that follows the 301 redirect.
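
For anyone wondering what that authentication step might look like: the usual approach is a reverse DNS lookup on the connecting address, followed by a forward lookup on the host name it returns. Below is a minimal Python sketch of that idea, assuming it runs in application code rather than in the rewrite rules. `is_real_googlebot`, the resolver parameters, and the suffix list are illustrative assumptions, and the resolvers are injectable so the logic can be tried without network access.

```python
import socket
from typing import Callable, List

# Assumed list of acceptable host-name suffixes; adjust to your own policy.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def dns_reverse(ip: str) -> str:
    """PTR lookup: IP address -> host name (needs network access)."""
    return socket.gethostbyaddr(ip)[0]


def dns_forward(host: str) -> List[str]:
    """Forward lookup: host name -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_real_googlebot(
    remote_addr: str,
    reverse: Callable[[str], str] = dns_reverse,
    forward: Callable[[str], List[str]] = dns_forward,
) -> bool:
    """Forward-confirmed reverse DNS check on the connecting address."""
    try:
        host = reverse(remote_addr).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The host name must resolve back to the same address.
        return remote_addr in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False


# Stand-in resolvers so the example runs offline (hypothetical data,
# apart from the host name quoted in the post).
FAKE_PTR = {"66.249.67.72": "crawl-66-249-67-72.googlebot.com"}
FAKE_A = {"crawl-66-249-67-72.googlebot.com": ["66.249.67.72"]}


def fake_reverse(ip: str) -> str:
    if ip not in FAKE_PTR:
        raise OSError("no PTR record")
    return FAKE_PTR[ip]


def fake_forward(host: str) -> List[str]:
    return FAKE_A.get(host, [])


print(is_real_googlebot("66.249.67.72", fake_reverse, fake_forward))
print(is_real_googlebot("66.249.16.1", fake_reverse, fake_forward))
```

The check is made against the connecting address only. The X-Forwarded-For value is supplied by whoever sits in between, so it is deliberately never consulted.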

Use this example code at your own risk. It's been tested for all of 20 minutes, and is posted only as a suggestion to help avoid duplicate-content problems potentially created by a third party.

Jim

keyplyr

7:02 am on Dec 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Jim, I've been blocking the offending company this way:

deny from 66.249.16.0/23

Is this evidence that the mod_access method, at least in a proxy situation as described above, could be damaging?

Déjà vu ( [webmasterworld.com...] )
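
As a quick check of what that deny line covers, here is a small Python sketch using the standard `ipaddress` module. The `is_blocked` helper is hypothetical, and the sample addresses other than 66.249.67.72 are made up, since the log entry masks the last octet.

```python
import ipaddress

# The range from the "deny from" line above.
BLOCKED = ipaddress.ip_network("66.249.16.0/23")


def is_blocked(addr: str) -> bool:
    """True if addr falls inside the denied /23."""
    return ipaddress.ip_address(addr) in BLOCKED


print(BLOCKED[0], "-", BLOCKED[-1])   # first and last address in the range
print(is_blocked("66.249.16.1"))      # an address in the proxy's range
print(is_blocked("66.249.17.200"))    # the second half of the /23
print(is_blocked("66.249.67.72"))     # the Googlebot address from the headers
```

The /23 spans the 66.249.16.x and 66.249.17.x addresses only, so the crawler's own address from the X-Forwarded-For header falls outside it.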

jdMorgan

4:57 pm on Dec 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The code above tells Google, "That 'foreign' URL you're using to crawl my site is wrong -- Use this canonical URL."

In a way, it 'steals back' the attempt to serve our content with their URL.

I think you'll be OK with a simple 403, but frankly, this code is a 'slap back' at them for allowing proxy-crawling.

Jim