Verifying ia_archiver & Yahoo Bot

via RDNS

Hobbs

4:48 pm on Apr 5, 2008 (gmt 0)

How do you verify that the bot visiting you is really ia_archiver or Yahoo Bot?

The following was courtesy of incrediBill in another thread and it works great with Google and MSN bots:

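<?php
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {
    $ip = $_SERVER['REMOTE_ADDR'];
    $hostname = gethostbyaddr($ip); // reverse lookup: IP -> hostname
    if (!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname)) {
        // Hostname is not in either crawler's domain: reject
        $block = TRUE;
        header("HTTP/1.0 403 Forbidden");
        exit;
    } else {
        // Forward lookup: the hostname must resolve back to the same IP
        $real_ip = gethostbyname($hostname);
        if ($ip != $real_ip) {
            $block = TRUE;
            header("HTTP/1.0 403 Forbidden");
            exit;
        } else {
            $block = FALSE;
        }
    }
}
?>

(If any || in this thread shows up as ŚŚ, the forum has mangled the pipe character; fix it before running the code.)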

Does anyone have a version of the above that can do RDNS for all 4 bots: Yahoo, ia_archiver, MSN/Live & Google?

Hobbs

12:35 pm on Apr 6, 2008 (gmt 0)

OK, since no one is home I'll hack away at it. Just let me know if the below looks OK:

To add Yahoo!:

Line 3:

if(stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {

becomes

if(stristr($ua, 'yahoo') || stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {

And Line 5:

if(!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname)) {

becomes

if(!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname) && !preg_match("/\.yahoo\.com$/", $hostname)) {

As for Alexa's ia_archiver, I have no idea. As I remember, it resolves to its ISP, Global Crossing I think.

incrediBILL

10:22 pm on Apr 6, 2008 (gmt 0)

For Yahoo I'd match "slurp" in the user agent and look for "crawl.yahoo.net" in the rDNS result.

"Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
74.6.21.112 -> lj512812.crawl.yahoo.net

ia_archiver is a tough one. I have it blocked because I don't approve of my sites being archived, and because snoopy SEOs, scrapers, and lawyers all use the archives. It's just a bad idea, IMO.

Here's a couple of examples of their UAs and rdns:

"ia_archiver"
64.213.203.146 -> blx203-crawl146.alexa.com
64.208.172.181 -> no rdns returned

"ia_archiver-web.archive.org"
207.241.228.202 -> ia311407.us.archive.org

"Mozilla/5.0 (compatible;archive.org_bot/heritrix-1.9.0-200608171144 +http://pandora.nla.gov.au/crawl.html)"
207.241.233.35 -> no rdns returned

"Mozilla/5.0 (compatible; archive.org_bot/1.10.0 +http://www.loc.gov/minerva/crawl.html)"
207.241.232.191 -> no rdns returned

some other stuff too...

So as you can see, the bots from archive.org / ia_archiver are quite a challenge overall. The most I would validate and let in is "ia_archiver" with an rDNS of ".alexa.com". That will probably bounce a few valid instances of the bot, as the reverse DNS seems spotty, so beware.

if(stristr($ua, 'slurp') || stristr($ua, 'msnbot') || stristr($ua, 'googlebot') || stristr($ua, 'ia_archiver')) {

if(!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname) && !preg_match("/\.crawl\.yahoo\.net$/", $hostname) && !preg_match("/\.alexa\.com$/", $hostname)) {
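
One thing to watch in the combined test: any of the four user agents passes if the hostname matches any of the four domains, so a request claiming to be Googlebot from an alexa.com host would get through the first check. Below is a minimal sketch of pairing each bot with its own domain. The names BOT_SUFFIXES, expected_suffix() and host_has_suffix() are made up for illustration, and the Slurp suffix uses the crawl.yahoo.net host from the rDNS example above. It only covers the string matching; the DNS lookups stay as in the script.

```php
<?php
// Hypothetical lookup table: user-agent token => hostname suffix quoted in this thread.
const BOT_SUFFIXES = [
    'googlebot'   => '.googlebot.com',
    'msnbot'      => 'search.live.com',
    'slurp'       => '.crawl.yahoo.net',
    'ia_archiver' => '.alexa.com',
];

// Return the suffix the rDNS hostname should end with,
// or null if the user agent names none of the listed bots.
function expected_suffix(string $ua): ?string
{
    foreach (BOT_SUFFIXES as $token => $suffix) {
        if (stripos($ua, $token) !== false) {
            return $suffix;
        }
    }
    return null;
}

// True if $hostname ends with $suffix (case-insensitive, trailing dot ignored).
function host_has_suffix(string $hostname, string $suffix): bool
{
    $hostname = strtolower(rtrim($hostname, '.'));
    $len = strlen($suffix);
    return $len > 0 && substr($hostname, -$len) === $suffix;
}
```

In the script, the long preg_match line could then become a single test: block when expected_suffix($ua) is not null and host_has_suffix($hostname, ...) is false for that suffix.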

Hobbs

9:59 am on Apr 7, 2008 (gmt 0)

Thanks Bill!

Here's the final version for the record:
(If the || operators in line 3 show up as ŚŚ, fix them, unless you like seeing a "parse error, unexpected T_STRING". Yes, I forgot to do it :-) )

<?php
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stristr($ua, 'slurp') || stristr($ua, 'msnbot') || stristr($ua, 'googlebot') || stristr($ua, 'ia_archiver')) {
    $ip = $_SERVER['REMOTE_ADDR'];
    $hostname = gethostbyaddr($ip); // reverse lookup: IP -> hostname
    if (!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname) && !preg_match("/\.crawl\.yahoo\.net$/", $hostname) && !preg_match("/\.alexa\.com$/", $hostname)) {
        // Hostname is not in any of the crawlers' domains: reject
        $block = TRUE;
        header("HTTP/1.0 403 Forbidden");
        exit;
    } else {
        // Forward lookup: the hostname must resolve back to the same IP
        $real_ip = gethostbyname($hostname);
        if ($ip != $real_ip) {
            $block = TRUE;
            header("HTTP/1.0 403 Forbidden");
            exit;
        } else {
            $block = FALSE;
        }
    }
}
?>

I only allow Alexa because direct advertisers seem to value it. Otherwise I have nocache on my pages, so the Wayback Machine is not keeping a copy.

ia_archiver right now looks like it's being run from a hotel! 209.234.171.zz
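
Two details of the script above may be worth tightening. gethostbyaddr() hands back the IP unchanged when there is no PTR record, and gethostbyname() returns only a single IPv4 address, so a crawler host with several A records could fail the comparison. Here is a rough sketch that uses gethostbynamel() instead. fcrdns_ok() and its injectable $reverse / $forward parameters are hypothetical names, added only so the logic can be tried without live DNS.

```php
<?php
// Forward-confirmed reverse DNS check (sketch).
// $suffixes: hostname endings to accept, e.g. ['.alexa.com'].
// $reverse / $forward: optional stand-ins for gethostbyaddr() / gethostbynamel().
function fcrdns_ok(string $ip, array $suffixes, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    $hostname = $reverse($ip);
    // gethostbyaddr() returns the unmodified IP when the lookup fails
    // and false on malformed input; treat both as "not verified".
    if ($hostname === false || $hostname === $ip) {
        return false;
    }

    // The hostname must end with one of the accepted suffixes.
    $hostname = strtolower($hostname);
    $matched = false;
    foreach ($suffixes as $suffix) {
        if (substr($hostname, -strlen($suffix)) === $suffix) {
            $matched = true;
            break;
        }
    }
    if (!$matched) {
        return false;
    }

    // gethostbynamel() returns an array of IPv4 addresses, or false on failure.
    $ips = $forward($hostname);
    return is_array($ips) && in_array($ip, $ips, true);
}
```

Called as fcrdns_ok($_SERVER['REMOTE_ADDR'], array('.alexa.com')), it does the live lookups; a false result would map to the 403 branch in the script. Like the original, it only handles IPv4.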