How-to: robots.txt white-listing with PHP/Apache

encyclo

8:01 pm on Jan 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ever since WebmasterWorld moved to a scripted robots.txt [webmasterworld.com] to help curtail spidering by certain non-useful bots I've been wanting to try out a similar solution. Here's how I did it.

The idea

The idea is to build a white-list of known bots that you want to allow on your site, and serve a disallow to all others. The standard robots.txt syntax is an exclusion standard, but here we are making it an inclusion standard - only including those bots we specify.

What this will not do

This will not block scrapers, rogue bots, or any bot which ignores robots.txt. The aim is only to block the sometimes-heavy crawling from "official" bots such as copyright-infringement detectors run by commercial companies which use your bandwidth and then sell you the results, obscure research projects, or various lower-tier Illyrian search engine startups.

The script

First a note: the original script I wrote was reviewed and so comprehensively changed by coopster [webmasterworld.com] that I claim no merit whatsoever. Knowing my generally terrible PHP skills, this is a very good thing. Anything that works is his, any bugs are mine. :)

<?php
// robots.txt whitelisting script
//
// First add the UA string for each bot you want to allow into this array
$bots = array(
    'googlebot',
    'slurp',
    'msnbot',
    'jeeves'
);
// Now list the directories or files disallowed even for the above bots
$disallow = array(
    '/cgi-bin/',
    '/images/',
    '/contact.html'
);
// Nothing to change after this point
// Build a case-insensitive pattern matching any of the whitelisted UAs
$pattern = '/' . implode('|', $bots) . '/i';
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (preg_match($pattern, $ua)) {
    // Known good bot: serve the normal disallow list
    $text = "Disallow: " . implode("\nDisallow: ", $disallow) . "\n";
} else {
    // Everyone else: disallow the whole site
    $text = "Disallow: /\n";
}
header('Content-Type: text/plain');
print "User-agent: *\n" . $text;
exit;
?>

How to use the script

There are two arrays in which you put the required values. The first is

$bots
- which is the list of "good" bots you want to allow on your site. First you need to know the "user agent" string for each bot. In the case of Googlebot, the user agent string looks like:

Googlebot/2.1 (+http://www.google.com/bot.html)

or:

Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)

In both cases there is an identifying string "googlebot" (the above script is case-insensitive just in case), so the first entry in the array is that string. Same goes for the others in what is a very conservative list: Google, Yahoo (Slurp), MSNBot and Ask Jeeves.

$bots = array(
    'googlebot',
    'slurp',
    'msnbot',
    'jeeves'
);

To add a new bot, add a new entry. For AdSense, you need to add "mediapartners" to the list:

$bots = array(
    'googlebot',
    'slurp',
    'msnbot',
    'jeeves',
    'mediapartners'
);

The second array is the

$disallow
array: this is where you list the directories and files you want to disallow even for the white-listed bots:

$disallow = array(
    '/cgi-bin/',
    '/images/',
    '/contact.html'
);

In the above example, the directories /cgi-bin/ and /images/ as well as the file in the root directory /contact.html are always disallowed. Add, delete, adjust as required.

Implementation

Once you have saved the above script as

robots.txt
then you need to have it parsed as PHP. In your root-level .htaccess file, add the following:

<Files robots.txt>
ForceType application/x-httpd-php
</Files>

If you cannot use a .htaccess file or are not running Apache, then this script is not for you - unless you can find another way of either getting a text file parsed as PHP, or a method of transparently redirecting calls to robots.txt to a script. Sorry!
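For the redirect approach, if you have mod_rewrite available, a rule like this in your root-level .htaccess should work (an untested sketch - it assumes you save the script as robots.php instead of robots.txt):

RewriteEngine On
RewriteRule ^robots\.txt$ /robots.php [L]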

Also, be careful if you have subdomains whose document roots live inside your public_html directory (the main domain's root) - the above rule will mean that all the subdomains' robots.txt files are parsed as PHP too, and if they are plain text files rather than scripts, the incorrect mime type will break them.
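One possible workaround (untested, and assuming Apache 2) is to revert the directive in each subdomain directory's own .htaccess:

<Files robots.txt>
ForceType None
</Files>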

How to test

Grab a copy of Firefox plus the User Agent Switcher extension, or a browser with customizable UA spoofing built in, such as Konqueror or Safari. Visit your robots.txt and test with each of your chosen bot user agents; the robots.txt should show something like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /contact.html

Then check with the default browser UA and you should see:

User-agent: *
Disallow: /

You should check the file type (mime type) for the served page: in Firefox, press Ctrl+I and "Type" should be listed as

text/plain
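If you would rather test from a script, here is a quick sketch in PHP itself (it needs allow_url_fopen enabled, and www.example.com is a placeholder for your own domain):

<?php
// Fetch robots.txt while spoofing a whitelisted bot's user agent
$context = stream_context_create(array('http' => array(
    'header' => "User-Agent: Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)\r\n"
)));
print file_get_contents('http://www.example.com/robots.txt', false, $context);
?>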

Is this cloaking? But I will be banned!

Yes, this is cloaking. Not real cloaking (which involves checking IP addresses), but transparent, obvious user-agent cloaking. Anyone can see what you're doing - and really you don't have very much to hide. Officially no search engine condones cloaking of any kind; however, as you are not cloaking your content (i.e. doing anything to affect your search results), you are unlikely to have any problems. Obviously I cannot offer any guarantee. :)

So there you go - I hope someone finds it useful!

Lord Majestic

2:00 pm on Jan 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I was very reluctant to make this post - I wrote it a few times and deleted it. Normally I would just avoid posting, but the topic is too important for me to ignore, so here we go.

I'd love to see a bit more text with the following header:

Long term implications for the Web

What if every site blocks all bots apart from the Big 3? Even bigger post-Google-update threads spanning tens of thousands of posts? What would have happened if Googlebot had been banned back when they were nobody - would you still be happy with the way AltaVista and Yahoo acted? They sure would not have had any reason to change!

IMO WebmasterWorld influences very influential people and sets standards - it should act in a less selfish manner: so much fuss about bad bots consuming traffic, yet the solution does not affect bad bots - only the decent guys who cared to implement robots.txt support in the first place. At the very least you could have allowed bots that honour Crawl-delay to crawl the site at a reasonable speed; surely 1 request per 30-60 seconds will not overburden the server and won't cost much? If you did that, you would at least have encouraged bot owners to implement Crawl-delay - the kind of move that helps improve the Web.
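Something along those lines, reusing encyclo's arrays, would be easy enough (an untested sketch - the 30-second delay is just an example value):

<?php
// Variant of the script above: non-whitelisted bots get a Crawl-delay
// and the same disallow list, instead of a blanket "Disallow: /"
$bots = array('googlebot', 'slurp', 'msnbot', 'jeeves');
$disallow = array('/cgi-bin/', '/images/', '/contact.html');
$pattern = '/' . implode('|', $bots) . '/i';
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$text = "Disallow: " . implode("\nDisallow: ", $disallow) . "\n";
if (!preg_match($pattern, $ua)) {
    // Unknown bot: ask for no more than one request every 30 seconds
    $text = "Crawl-delay: 30\n" . $text;
}
header('Content-Type: text/plain');
print "User-agent: *\n" . $text;
exit;
?>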

Before anyone implements this, consider whether you want to act in a way that helps the existing oligopoly of the Big 3. Do you want to depend on 3 search engines, or would you rather have more diversified traffic streams? If you ban all good bots (bad bots can still roam free!) then you vote with your robots.txt for further monopolisation of the search market.

This is my first (in this thread) and last post on this topic; if the moderators feel that the long-term implications of "simple" decisions like this are not relevant, then feel free to delete this post.