The following is courtesy of incrediBill from another thread, and it works great with the Google and MSN bots:
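<?php
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {
    $ip = $_SERVER['REMOTE_ADDR'];
    $hostname = gethostbyaddr($ip); // reverse lookup: IP -> hostname
    if (!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname)) {
        // Hostname is not in one of the crawler domains: treat as a fake bot.
        $block = TRUE;
        header("HTTP/1.0 403 Forbidden");
        exit;
    } else {
        // Forward lookup: the hostname must resolve back to the same IP.
        $real_ip = gethostbyname($hostname);
        if ($ip != $real_ip) {
            $block = TRUE;
            header("HTTP/1.0 403 Forbidden");
            exit;
        } else {
            $block = FALSE;
        }
    }
}
?>
(If the || operators show up as ¦¦ when you copy this, change them back to real pipe characters.)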
Does anyone have a version of the above that can do RDNS for all 4 bots: Yahoo, ia_archiver, MSN/Live & Google?
To add Yahoo!:
Line 3:
if(stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {
becomes
if(stristr($ua, 'yahoo') || stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {
And line 6:
if(!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname)) {
becomes
if(!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname) && !preg_match("/\.yahoo\.net$/", $hostname)) {
As for Alexa's ia_archiver, I have no idea; as I remember, it resolves to its ISP, Global Crossing, I think.
"Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
74.6.21.112 -> lj512812.crawl.yahoo.net
ia_archiver is a tough one. I have it blocked because I don't approve of the archiving of my sites, and because snoopy SEOs, scrapers, and lawyers all use the archives. It's just a bad idea IMO.
Here are a couple of examples of their UAs and rDNS:
"ia_archiver"
64.213.203.146 -> blx203-crawl146.alexa.com
64.208.172.181 -> no rdns returned
"ia_archiver-web.archive.org"
207.241.228.202 -> ia311407.us.archive.org
"Mozilla/5.0 (compatible;archive.org_bot/heritrix-1.9.0-200608171144 +http://pandora.nla.gov.au/crawl.html)"
207.241.233.35 -> no rdns returned
"Mozilla/5.0 (compatible; archive.org_bot/1.10.0 +http://www.loc.gov/minerva/crawl.html)"
207.241.232.191 -> no rdns returned
some other stuff too...
So as you can see, the bots from archive.org / ia_archiver are quite a challenge overall. The most I would validate to let in would be "ia_archiver" with rDNS ending in ".alexa.com", which will probably bounce a few valid instances of the bot as the reverse DNS seems spotty, so beware.
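To make the effect of that rule concrete, here is a small sketch that runs the rDNS samples listed above against the ".alexa.com" suffix. The helper name host_ends_with() is made up for illustration, and null stands in for "no rdns returned"; in live code gethostbyaddr() hands back the IP itself when there is no record, which fails the suffix test the same way.

```php
<?php
// Hypothetical helper (not from the thread): true when $hostname ends in $suffix.
function host_ends_with($hostname, $suffix) {
    if (!is_string($hostname) || $hostname === '') {
        return false; // no reverse DNS result to compare
    }
    $hostname = strtolower($hostname);
    $suffix = strtolower($suffix);
    return substr($hostname, -strlen($suffix)) === $suffix;
}

// rDNS results copied from the examples above; null means "no rdns returned".
$samples = array(
    '64.213.203.146'  => 'blx203-crawl146.alexa.com',
    '64.208.172.181'  => null,
    '207.241.228.202' => 'ia311407.us.archive.org',
    '207.241.233.35'  => null,
    '207.241.232.191' => null,
);

// Collect the IPs that would pass an ".alexa.com" hostname check.
$allowed = array();
foreach ($samples as $ip => $hostname) {
    if (host_ends_with($hostname, '.alexa.com')) {
        $allowed[] = $ip;
    }
}
```

Only 64.213.203.146 ends up in $allowed, so the ia_archiver hit from 64.208.172.181 would be bounced even though it appears in the samples above, which is the spotty-rDNS risk already mentioned.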
if(stristr($ua, 'slurp') || stristr($ua, 'msnbot') || stristr($ua, 'googlebot') || stristr($ua, 'ia_archiver')) {
if(!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname) && !preg_match("/\.crawl\.yahoo\.net$/", $hostname) && !preg_match("/\.alexa\.com$/", $hostname)) {
Here's the final version for the record:
(If the || operators in line 3 show up as ¦¦ when you copy this, change them back unless you like seeing a "parse error, unexpected T_STRING" - yes, I forgot to do it :-) )
<?php
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stristr($ua, 'slurp') || stristr($ua, 'msnbot') || stristr($ua, 'googlebot') || stristr($ua, 'ia_archiver')) {
    $ip = $_SERVER['REMOTE_ADDR'];
    $hostname = gethostbyaddr($ip); // reverse lookup: IP -> hostname
    if (!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname) && !preg_match("/\.crawl\.yahoo\.net$/", $hostname) && !preg_match("/\.alexa\.com$/", $hostname)) {
        // Hostname is not in one of the crawler domains: treat as a fake bot.
        $block = TRUE;
        header("HTTP/1.0 403 Forbidden");
        exit;
    } else {
        // Forward lookup: the hostname must resolve back to the same IP.
        $real_ip = gethostbyname($hostname);
        if ($ip != $real_ip) {
            $block = TRUE;
            header("HTTP/1.0 403 Forbidden");
            exit;
        } else {
            $block = FALSE;
        }
    }
}
?>
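One thing the combined check allows is any listed hostname for any listed UA, so a "googlebot" UA coming from an alexa.com host would pass. A rough, table-driven sketch that ties each UA token to its own hostname suffix might look like the following. The function name crawler_is_verified() and the $reverse / $forward parameters are my own invention (they exist so the lookups can be swapped out for testing), and the suffixes are the ones quoted in this thread, with Yahoo taken from the crawl.yahoo.net rDNS example.

```php
<?php
// Sketch only: crawler_is_verified() and its parameters are hypothetical names.
// Returns true (verified crawler), false (fake crawler), or null (not a crawler UA).
function crawler_is_verified($ua, $ip, $reverse = 'gethostbyaddr', $forward = 'gethostbyname') {
    // UA token => hostname suffixes accepted for that token.
    $crawlers = array(
        'googlebot'   => array('.googlebot.com'),
        'msnbot'      => array('search.live.com'),
        'slurp'       => array('.crawl.yahoo.net'),
        'ia_archiver' => array('.alexa.com'),
    );
    foreach ($crawlers as $token => $suffixes) {
        if (stripos($ua, $token) === false) {
            continue; // UA does not claim to be this crawler
        }
        // Reverse lookup; gethostbyaddr() returns the IP unchanged on failure.
        $hostname = call_user_func($reverse, $ip);
        if (!is_string($hostname) || $hostname === $ip) {
            return false;
        }
        foreach ($suffixes as $suffix) {
            if (substr(strtolower($hostname), -strlen($suffix)) === $suffix) {
                // Forward-confirm: the hostname must resolve back to the same IP.
                return call_user_func($forward, $hostname) === $ip;
            }
        }
        return false; // hostname is outside this crawler's domain
    }
    return null;
}

// Possible use at the top of a page:
// if (crawler_is_verified($ua, $_SERVER['REMOTE_ADDR']) === false) {
//     header("HTTP/1.0 403 Forbidden");
//     exit;
// }
```

Ordinary browsers get null and are left alone; only a UA that claims to be one of the four bots and fails either lookup gets the 403.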
I only allow Alexa because direct advertisers seem to value it; otherwise I have nocache on my pages so the archiver (Wayback Machine) does not keep a copy.
ia_archiver right now looks like it's being run from a hotel! 209.234.171.zz