How to Verify Googlebot and Avoid Rogue Spiders - Google Search and SEO forum at WebmasterWorld - WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

How to Verify Googlebot and Avoid Rogue Spiders

jcoronella

1:57 am on Sep 22, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Matt [mattcutts.com] recently made a good recommendation over on his blog about how to "authenticate" Googlebot - that is, see if a given spider really is Googlebot, or if it is someone like me pretending to be Googlebot to find your cloaked pages.

The solution is simple and effective for Googlebot, and most likely also for Yahoo's Slurp and MSNbot. It relies only on G, Y, or M having properly set up DNS entries for the crawling IPs. It's a two-step process: do a reverse DNS lookup on the IP, then a forward DNS lookup on the host name that comes back.

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

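This is clean, simple, and brilliant. I have implemented the reverse IP lookup, but never followed up with the forward lookup, which is key. Anyone who controls the reverse DNS for their own IP can publish a bogus PTR entry claiming to be googlebot.com, which is very simple to do, but they cannot make Google's forward DNS point back at their IP. By doing both lookups you catch that.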
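
As a rough illustration of the two steps, here is a minimal PHP sketch. The function name is_verified_crawler and its injectable $reverse and $forward parameters are assumptions made for illustration (they default to PHP's gethostbyaddr() and gethostbynamel()); this is not code from Matt's post.

```php
<?php
// Forward-confirmed reverse DNS check (illustrative sketch only).
// $reverse and $forward are injectable so the logic can be exercised
// without live DNS; by default they are gethostbyaddr() and gethostbynamel().
function is_verified_crawler(string $ip, string $suffix,
                             ?callable $reverse = null,
                             ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse lookup. gethostbyaddr() returns the unmodified IP
    // when no PTR record exists, and false on malformed input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }

    // The PTR name must end with the expected suffix, e.g. ".googlebot.com".
    $host = strtolower(rtrim($host, '.'));
    if (substr($host, -strlen($suffix)) !== $suffix) {
        return false;
    }

    // Step 2: forward lookup. The name must resolve back to the same IP.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}
```

On a live server the call would presumably look like is_verified_crawler($_SERVER['REMOTE_ADDR'], '.googlebot.com'). Both lookups hit DNS, so caching the result per IP is probably worth considering.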

Finally, my quality content is one step closer to staying mine.

< The full technique is outlined here:
[googlewebmastercentral.blogspot.com...] >

[edited by: tedster at 7:25 pm (utc) on July 5, 2007]

incrediBILL

3:08 am on Sep 25, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

First question: OK, you can detect and block scrapers who pretend to be Googlebot. What do you do with scrapers that don't? In other words: why would you care whether a scraper pretended to be Googlebot unless you were cloaking?

Detecting scrapers that don't pretend to be Google is a whole different problem and one that takes a LOT of different approaches to eliminate.

Brett and I posted a bunch of methods here:
[webmasterworld.com...]

Also, come to PubCon Vegas as there's a session on this topic in November where you can learn a lot more.

Second question: Now what exactly is a cloaked proxy hijacker?

It's a proxy site that cloaks a list of rot13-encoded URLs to Google, Yahoo and MSN to get them to crawl YOUR SITE, but via the proxy server. Because the URL Google indexes points to the proxy server, the proxy ranks for your pages.

Here's a sample of a page that the PHP and CGI proxy servers managed to hijack in Google:

The following is paraphrased to avoid specifics ;)

This is YOUR PAGE TITLE
This is YOUR PAGE CONTENT snippet as indexed by Google via a proxy server.
www.someproxysite.com/cgi-bin/garbage.cgi/somejunk/gibberish.url - 3k -
- Cached - Similar pages

See what happens?

Someone clicks that link, the proxy server loads your page, strips your ads and inserts THEIR ads with your content. Google is actually being used as a SCRAPER as the proxy server does nothing but give Google a cloaked list of rot13 encoded domains to crawl thru and VOILA! your page is hijacked.

Vienix

11:14 am on Sep 25, 2006 (gmt 0)

10+ Year Member

Proxy scrapers, horrible....

Take <domain removed>, 7000 pages listed in Google.... just a few are their own...

[edited by: tedster at 3:45 pm (utc) on Sep. 25, 2006]

kapow

2:30 pm on Sep 25, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

This is one excellent thread! I only wish I understood it all :)
With official advice from Google and excellent notes from incrediBILL etc., this thread lifts the lid a little bit on an ugly and growing web plague that most people just don't know about because the technicals don't fit into soundbites.

gmillikan

5:33 am on Feb 25, 2008 (gmt 0)

10+ Year Member

Calling the function below like this, "is_this_a_valid_web_crawler('123.***.***.***');", will tell you whether the IP address hitting your web pages really belongs to one of these crawlers. Hope this is helpful to the community.

function is_this_a_real_msnbot($remote_host_ip) {
    // Expected PTR name: livebot-<IP with dashes>.search.live.com
    $the_host_should_be = "livebot-" . str_replace(".", "-", $remote_host_ip) . ".search.live.com";
    $host = gethostbyaddr($remote_host_ip); // reverse DNS (PTR) lookup
    if (is_string($host) && $host === $the_host_should_be) {
        // Forward-confirmed reverse DNS: the name must resolve back to the same IP.
        $ips = gethostbynamel($host);
        return is_array($ips) && in_array($remote_host_ip, $ips, true);
    }
    return FALSE;
}

function is_this_a_real_YahooSlurp($remote_host_ip) {
    $the_host_should_be = ".crawl.yahoo.net";
    $host = gethostbyaddr($remote_host_ip); // reverse DNS (PTR) lookup
    if (is_string($host) && substr($host, -strlen($the_host_should_be)) === $the_host_should_be) {
        // Forward-confirmed reverse DNS: the name must resolve back to the same IP.
        $ips = gethostbynamel($host);
        return is_array($ips) && in_array($remote_host_ip, $ips, true);
    }
    return FALSE;
}

function is_this_a_real_GoogleBot($remote_host_ip) {
    $the_host_should_be = ".googlebot.com";
    $host = gethostbyaddr($remote_host_ip); // reverse DNS (PTR) lookup
    if (is_string($host) && substr($host, -strlen($the_host_should_be)) === $the_host_should_be) {
        // Forward-confirmed reverse DNS: the name must resolve back to the same IP.
        $ips = gethostbynamel($host);
        return is_array($ips) && in_array($remote_host_ip, $ips, true);
    }
    return FALSE;
}

function is_this_a_real_Alexa_ia_archiver($remote_host_ip) {
    $the_host_should_be = ".alexa.com";
    $host = gethostbyaddr($remote_host_ip); // reverse DNS (PTR) lookup
    if (is_string($host) && substr($host, -strlen($the_host_should_be)) === $the_host_should_be) {
        // Forward-confirmed reverse DNS: the name must resolve back to the same IP.
        $ips = gethostbynamel($host);
        return is_array($ips) && in_array($remote_host_ip, $ips, true);
    }
    return FALSE;
}

function is_this_a_real_ArchiveORG_ia_archiver($remote_host_ip) {
    $the_host_should_be = ".archive.org";
    $host = gethostbyaddr($remote_host_ip); // reverse DNS (PTR) lookup
    if (is_string($host) && substr($host, -strlen($the_host_should_be)) === $the_host_should_be) {
        // Forward-confirmed reverse DNS: the name must resolve back to the same IP.
        $ips = gethostbynamel($host);
        return is_array($ips) && in_array($remote_host_ip, $ips, true);
    }
    return FALSE;
}

function is_this_a_valid_web_crawler($remote_host_ip) {
    // Returns TRUE as soon as any one check confirms the IP belongs to a valid web crawler.
    return is_this_a_real_msnbot($remote_host_ip)
        || is_this_a_real_YahooSlurp($remote_host_ip)
        || is_this_a_real_GoogleBot($remote_host_ip)
        || is_this_a_real_Alexa_ia_archiver($remote_host_ip)
        || is_this_a_real_ArchiveORG_ia_archiver($remote_host_ip);
}
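
Since every call does DNS lookups, it probably makes sense to run the check only for requests whose User-Agent claims to be a crawler. A minimal sketch of that gate follows; the helper names claims_to_be_crawler and is_fake_crawler and the user-agent pattern are hypothetical and only for illustration.

```php
<?php
// Hypothetical helper: does the User-Agent claim to be one of the crawlers
// that the functions in this post know how to verify?
function claims_to_be_crawler(string $userAgent): bool
{
    return preg_match('/googlebot|msnbot|slurp|ia_archiver/i', $userAgent) === 1;
}

// Returns true when a request claims to be a crawler but fails verification.
// $verify is any callable taking an IP string and returning a bool, for
// example 'is_this_a_valid_web_crawler'.
function is_fake_crawler(string $userAgent, string $ip, callable $verify): bool
{
    if (!claims_to_be_crawler($userAgent)) {
        return false; // ordinary visitor: no DNS lookups needed
    }
    return !$verify($ip);
}
```

In a page script this could be called as is_fake_crawler($_SERVER['HTTP_USER_AGENT'] ?? '', $_SERVER['REMOTE_ADDR'], 'is_this_a_valid_web_crawler'), sending a 403 when it returns true.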

[edited by: tedster at 5:49 am (utc) on Feb. 25, 2008]
[edit reason] remove specifics [/edit]

digitalv

6:18 pm on Feb 26, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Wouldn't it be easier just to block anyone with "google" or "googlebot" in the user agent that didn't come from 66.249.X.X ?

gmillikan

10:49 pm on May 2, 2008 (gmt 0)

10+ Year Member

Sure, whitelisting Google's IP address space may work for a while but...

What if Google adds more IP address space? Or maybe they get creative (they're known for that) and have Googlebot crawl from outside their normal IP address space? There's no reason why they wouldn't do this. I'm not aware of Google saying, "We're only going to crawl from 66.249.64.0 through 66.249.95.255." Unless I'm missing something, they've only assured that Googlebot will reverse (PTR) back to "*.googlebot.com"
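
For comparison, the whitelist approach might be sketched like this in PHP. The function name ip_in_range is an assumption for illustration, and the bounds are simply the range quoted above (equivalent to 66.249.64.0/19), not an official or complete list of Googlebot addresses. That hard-coded range is exactly the part that can go stale.

```php
<?php
// Inclusive IPv4 range test (illustrative sketch only).
function ip_in_range(string $ip, string $start, string $end): bool
{
    $n  = ip2long($ip);
    $lo = ip2long($start);
    $hi = ip2long($end);
    if ($n === false || $lo === false || $hi === false) {
        return false; // malformed or non-IPv4 input
    }
    return $n >= $lo && $n <= $hi;
}

// Range taken from the discussion above; an assumption, not a guarantee.
$looks_like_google = ip_in_range('66.249.66.1', '66.249.64.0', '66.249.95.255');
```

A crawl from any address outside those bounds would be rejected by this test even if its PTR record and forward lookup both check out, which is why the DNS round trip is the safer test.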

This 36 message thread spans 2 pages: 36
