Forum Moderators: Robert Charlton & goodroi
The solution is simple and effective for Googlebot, and most likely also for Yahoo's Slurp and MSNbot. It relies only on G, Y, or M having properly set up DNS entries for the crawling IPs. It's a two-step process: do a reverse DNS lookup on the IP, then a forward DNS lookup on the hostname that comes back.
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
This is clean, simple, and brilliant. I have implemented the reverse IP lookup, but never followed up with the forward lookup, which is key. By doing both you guard against someone setting up a fake reverse DNS entry for their own IP, which is very simple to do.
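The two lookups described above can be sketched in PHP roughly as follows. This is a minimal sketch, not anyone's production code: the function name `fcrdns_matches` and its two optional resolver parameters are hypothetical, added only so the logic can be exercised without touching real DNS. By default it falls back to PHP's built-in `gethostbyaddr()` and `gethostbynamel()`.

```php
<?php
// Forward-confirmed reverse DNS: a sketch under the assumptions stated above.
// $suffix is the hostname ending we expect, e.g. ".googlebot.com".
// $reverse and $forward are hypothetical hooks for swapping out the DNS lookups.
function fcrdns_matches(string $ip, string $suffix, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse (PTR) lookup. gethostbyaddr() returns the unmodified IP
    // when no PTR record exists, and false on malformed input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }
    $host = strtolower(rtrim($host, '.'));

    // The PTR name must end with the expected suffix.
    if (substr($host, -strlen($suffix)) !== $suffix) {
        return false;
    }

    // Step 2: forward lookup. The name must resolve back to the same IP,
    // otherwise the PTR record was set by someone who does not own the name.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}
```

Called as `fcrdns_matches('66.249.66.1', '.googlebot.com')` it performs live lookups. Passing two closures instead lets you check the failure case offline: an IP whose PTR record claims a googlebot.com name, but which that name does not resolve back to, is rejected at step 2.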
Finally, my quality content is one step closer to staying mine.
< The full technique is outlined here:
[googlewebmastercentral.blogspot.com...] >
[edited by: tedster at 7:25 pm (utc) on July 5, 2007]
First question: OK, you can detect and block scrapers who pretend to be Googlebot. What do you do with scrapers that don't? IOW: Why would you care whether a scraper pretended to be Googlebot unless you were cloaking?
Detecting scrapers that don't pretend to be Google is a whole different problem and one that takes a LOT of different approaches to eliminate.
Brett and I posted a bunch of methods here:
[webmasterworld.com...]
Also, come to PubCon Vegas as there's a session on this topic in November where you can learn a lot more.
Second question: Now what exactly is a cloaked proxy hijacker?
It's a proxy site that cloaks a list of rot13-encoded URLs to Google, Yahoo and MSN to get them to crawl YOUR SITE via the proxy server. The URL Google indexes then points to the proxy server, so the proxy ranks for your pages.
Here's a sample of a page that the PHP and CGI proxy servers managed to hijack in Google:
The following is paraphrased to avoid specifics ;)
This is YOUR PAGE TITLE
This is YOUR PAGE CONTENT snippet as indexed by Google via a proxy server.
www.someproxysite.com/cgi-bin/garbage.cgi/somejunk/gibberish.url - 3k -
- Cached - Similar pages
See what happens?
Someone clicks that link, the proxy server loads your page, strips your ads and inserts THEIR ads with your content. Google is actually being used as a SCRAPER as the proxy server does nothing but give Google a cloaked list of rot13 encoded domains to crawl thru and VOILA! your page is hijacked.
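One way to act on this, sketched below: when the crawl arrives through a proxy, the request reaches your server from the proxy's IP, so it fails the reverse/forward DNS check even though it carries a crawler's user agent. This assumes the proxy forwards the crawler's User-Agent header unchanged, which may not hold for every proxy script. The function names `claims_to_be_crawler` and `should_block_request` are hypothetical, and the verifier is passed in as a callable so the sketch does not depend on any particular DNS routine.

```php
<?php
// Sketch: refuse requests that claim to be a search engine crawler but come
// from an IP that fails verification. All names here are hypothetical.
function claims_to_be_crawler(string $userAgent): bool
{
    // Substrings for the crawlers discussed in this thread.
    foreach (['googlebot', 'slurp', 'msnbot'] as $token) {
        if (stripos($userAgent, $token) !== false) {
            return true;
        }
    }
    return false;
}

// $verifier is any callable that takes an IP string and returns true
// only when that IP belongs to a verified crawler.
function should_block_request(string $userAgent, string $ip, callable $verifier): bool
{
    return claims_to_be_crawler($userAgent) && !$verifier($ip);
}
```

In a page handler this might be called with `$_SERVER['HTTP_USER_AGENT']` and `$_SERVER['REMOTE_ADDR']`, passing `'is_this_a_valid_web_crawler'` from this thread as the verifier, and answering with a 403 when it returns true. Ordinary browsers are never sent through the DNS check, so only requests that claim to be a crawler pay for the lookups.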
function is_this_a_real_msnbot($remote_host_ip) {
    // Expected PTR name: "livebot-" + the IP with dots replaced by dashes + ".search.live.com"
    $the_host_should_be = "livebot-" . str_replace(".", "-", $remote_host_ip) . ".search.live.com";
    $host = gethostbyaddr($remote_host_ip); // Reverse DNS lookup, done once
    if ($host !== $the_host_should_be) { return FALSE; }
    $real_ips = gethostbynamel($host); // Forward-confirmed reverse DNS; FALSE if the name does not resolve
    return is_array($real_ips) && in_array($remote_host_ip, $real_ips, TRUE);
}
function is_this_a_real_YahooSlurp($remote_host_ip) {
    $the_host_should_be = ".crawl.yahoo.net";
    $host = gethostbyaddr($remote_host_ip); // Reverse DNS lookup, done once
    if (!is_string($host) || substr($host, -strlen($the_host_should_be)) !== $the_host_should_be) { return FALSE; }
    $real_ips = gethostbynamel($host); // Forward-confirmed reverse DNS; FALSE if the name does not resolve
    return is_array($real_ips) && in_array($remote_host_ip, $real_ips, TRUE);
}
function is_this_a_real_GoogleBot($remote_host_ip) {
    $the_host_should_be = ".googlebot.com";
    $host = gethostbyaddr($remote_host_ip); // Reverse DNS lookup, done once
    if (!is_string($host) || substr($host, -strlen($the_host_should_be)) !== $the_host_should_be) { return FALSE; }
    $real_ips = gethostbynamel($host); // Forward-confirmed reverse DNS; FALSE if the name does not resolve
    return is_array($real_ips) && in_array($remote_host_ip, $real_ips, TRUE);
}
function is_this_a_real_Alexa_ia_archiver($remote_host_ip) {
    $the_host_should_be = ".alexa.com";
    $host = gethostbyaddr($remote_host_ip); // Reverse DNS lookup, done once
    if (!is_string($host) || substr($host, -strlen($the_host_should_be)) !== $the_host_should_be) { return FALSE; }
    $real_ips = gethostbynamel($host); // Forward-confirmed reverse DNS; FALSE if the name does not resolve
    return is_array($real_ips) && in_array($remote_host_ip, $real_ips, TRUE);
}
function is_this_a_real_ArchiveORG_ia_archiver($remote_host_ip) {
    $the_host_should_be = ".archive.org";
    $host = gethostbyaddr($remote_host_ip); // Reverse DNS lookup, done once
    if (!is_string($host) || substr($host, -strlen($the_host_should_be)) !== $the_host_should_be) { return FALSE; }
    $real_ips = gethostbynamel($host); // Forward-confirmed reverse DNS; FALSE if the name does not resolve
    return is_array($real_ips) && in_array($remote_host_ip, $real_ips, TRUE);
}
function is_this_a_valid_web_crawler($remote_host_ip) { // Returns TRUE as soon as one check passes, since each check costs DNS lookups.
    return is_this_a_real_msnbot($remote_host_ip)
        || is_this_a_real_YahooSlurp($remote_host_ip) // Slurp was defined above but left out of the original chain
        || is_this_a_real_GoogleBot($remote_host_ip)
        || is_this_a_real_Alexa_ia_archiver($remote_host_ip)
        || is_this_a_real_ArchiveORG_ia_archiver($remote_host_ip);
}
[edited by: tedster at 5:49 am (utc) on Feb. 25, 2008]
[edit reason] remove specifics [/edit]
What if Google adds more IP address space? Or maybe they get creative (they're known for that) and have Googlebot crawl from outside their normal IP address space? There's no reason why they wouldn't do this. I'm not aware of Google saying, "We're only going to crawl from 66.249.64.0 through 66.249.95.255." Unless I'm missing something, they've only assured us that Googlebot will reverse (PTR) back to "*.googlebot.com".