How-to: robots.txt white-listing with PHP/Apache

encyclo

8:01 pm on Jan 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ever since WebmasterWorld moved to a scripted robots.txt [webmasterworld.com] to help curtail spidering by certain non-useful bots I've been wanting to try out a similar solution. Here's how I did it.

The idea

The idea is to build a white-list of known bots that you want to allow on your site, and serve a disallow to all others. The standard robots.txt syntax is an exclusion standard, but here we are making it an inclusion standard - only including those bots we specify.

What this will not do

This will not block scrapers, rogue bots, or any bot which ignores robots.txt. The aim is only to block the sometimes-heavy crawling from "official" bots such as copyright-infringement detectors run by commercial companies which use your bandwidth and then sell you the results, obscure research projects, or various lower-tier Illyrian search engine startups.

The script

First a note: the original script I wrote was reviewed and so comprehensively changed by coopster [webmasterworld.com] that I claim no merit whatsoever. Knowing my generally terrible PHP skills, this is a very good thing. Anything that works is his, any bugs are mine. :)

<?php
// robots.txt whitelisting script
//
// First add the UA string for each bot you want to allow into this array
$bots = array(
    'googlebot',
    'slurp',
    'msnbot',
    'jeeves'
);
// Now list the directories or files disallowed even for the above bots
$disallow = array(
    '/cgi-bin/',
    '/images/',
    '/contact.html'
);
// Nothing to change after this point
// Build a case-insensitive pattern matching any of the whitelisted UAs
$pattern = '/' . implode('|', $bots) . '/i';
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (preg_match($pattern, $ua)) {
    // Known good bot: serve the normal disallow list
    $text = "Disallow: " . implode("\nDisallow: ", $disallow) . "\n";
} else {
    // Everyone else: disallow the whole site
    $text = "Disallow: /\n";
}
header('Content-Type: text/plain');
print "User-agent: *\n" . $text;
exit;
?>

How to use the script

There are two arrays in which you put the required values. The first is

$bots
- which is the list of "good" bots you want to allow on your site. First you need to know the "user agent" string for each bot. In the case of Googlebot, the user agent string looks like:

Googlebot/2.1 (+http://www.google.com/bot.html)

or:

Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)

In both cases there is an identifying string "googlebot" (the above script is case-insensitive just in case), so the first entry in the array is that string. Same goes for the others in what is a very conservative list: Google, Yahoo (Slurp), MSNBot and Ask Jeeves.

$bots = array(
    'googlebot',
    'slurp',
    'msnbot',
    'jeeves'
);

To add a new bot, add a new entry. For AdSense, you need to add "mediapartners" to the list:

$bots = array(
    'googlebot',
    'slurp',
    'msnbot',
    'jeeves',
    'mediapartners'
);

The second array is the

$disallow
array: this is where you list the directories and files you want to disallow even for the white-listed bots:

$disallow = array(
    '/cgi-bin/',
    '/images/',
    '/contact.html'
);

In the above example, the directories /cgi-bin/ and /images/ as well as the file in the root directory /contact.html are always disallowed. Add, delete, adjust as required.

Implementation

Once you have saved the above script as

robots.txt
then you need to have it parsed as PHP. In your root-level .htaccess file, add the following:

<Files robots.txt>
ForceType application/x-httpd-php
</Files>

If you cannot use a .htaccess file or are not running Apache, then this script is not for you - unless you can find another way of either getting a text file parsed as PHP, or a method of transparently redirecting calls to robots.txt to a script. Sorry!
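For the redirect approach, if you have mod_rewrite available, a rule like this in your root-level .htaccess should work (an untested sketch - it assumes you save the script as robots.php instead of robots.txt):

RewriteEngine On
RewriteRule ^robots\.txt$ /robots.php [L]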

Also, be careful if you have subdomains whose document roots live inside your public_html directory (the main domain's root) - the above rule will mean that all the subdomains' robots.txt files are parsed as PHP too, and if they are plain text files rather than scripts, the incorrect mime type will break them.
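One possible workaround (untested, and assuming Apache 2) is to revert the directive in each subdomain directory's own .htaccess:

<Files robots.txt>
ForceType None
</Files>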

How to test

Grab a copy of Firefox plus the User Agent Switcher extension, or a browser with customizable UA spoofing built in, such as Konqueror or Safari. Visit your robots.txt and test with each of your chosen bot user agents; the robots.txt should show something like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /contact.html

Then check with the default browser UA and you should see:

User-agent: *
Disallow: /

You should check the file type (mime type) for the served page: in Firefox, press Ctrl+I and "Type" should be listed as

text/plain
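If you would rather test from a script, here is a quick sketch in PHP itself (it needs allow_url_fopen enabled, and www.example.com is a placeholder for your own domain):

<?php
// Fetch robots.txt while spoofing a whitelisted bot's user agent
$context = stream_context_create(array('http' => array(
    'header' => "User-Agent: Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)\r\n"
)));
print file_get_contents('http://www.example.com/robots.txt', false, $context);
?>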

Is this cloaking? But I will be banned!

Yes, this is cloaking. Not real cloaking (which involves checking IP addresses), but transparent, obvious user-agent cloaking. Anyone can see what you're doing - and really you don't have very much to hide. Officially no search engine condones cloaking of any kind; however, as you are not cloaking your content (i.e. doing anything to affect your search results), you are unlikely to have any problems. Obviously I cannot offer any guarantee. :)

So there you go - I hope someone finds it useful!

Lord Majestic

2:00 pm on Jan 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I was very reluctant to make this post - I wrote it a few times and deleted it. Normally I would just avoid posting, but the topic is too important for me to ignore, so here we go.

I'd love to see a bit more text with the following header:

Long term implications for the Web

What if every site blocks all bots apart from the Big 3? Even bigger post-Google-update threads spanning tens of thousands of posts? What would have happened if Googlebot had been banned back when they were nobody - would you still be happy with the way AltaVista and Yahoo acted? They sure would not have had any reason to change!

IMO WebmasterWorld influences very influential people and sets standards - it should act in a less selfish manner: so much fuss about bad bots consuming traffic, yet the solution does not affect bad bots - only the decent guys who cared to implement robots.txt support in the first place. At the very least you could have allowed bots that honour Crawl-delay to crawl the site at a reasonable speed; surely 1 request per 30-60 seconds will not overburden the server and won't cost much? If you did that, you would at least have encouraged bot owners to implement Crawl-delay - the kind of move that helps improve the Web.
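Something along those lines, reusing encyclo's arrays, would be easy enough (an untested sketch - the 30-second delay is just an example value):

<?php
// Variant of the script above: non-whitelisted bots get a Crawl-delay
// and the same disallow list, instead of a blanket "Disallow: /"
$bots = array('googlebot', 'slurp', 'msnbot', 'jeeves');
$disallow = array('/cgi-bin/', '/images/', '/contact.html');
$pattern = '/' . implode('|', $bots) . '/i';
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$text = "Disallow: " . implode("\nDisallow: ", $disallow) . "\n";
if (!preg_match($pattern, $ua)) {
    // Unknown bot: ask for no more than one request every 30 seconds
    $text = "Crawl-delay: 30\n" . $text;
}
header('Content-Type: text/plain');
print "User-agent: *\n" . $text;
exit;
?>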

Before anyone implements this, consider whether you want to act in a way that helps the existing oligopoly of the Big 3. Do you want to depend on 3 search engines, or would you rather have more diversified traffic streams? If you ban all good bots (bad bots can still roam free!) then you vote with your robots.txt for further monopolisation of the search market.

This is my first (in this thread) and last post on this topic; if the moderators feel that the long-term implications of "simple" decisions like this are not relevant, then feel free to delete this post.