Forum Moderators: phranque
I do not have the ability to execute PHP in txt files nor do I have access to httpd.conf to allow this. So...
1.) How do we detect 'Gecko'?
2.) How do we forbid 'Gecko' from accessing robots.txt?
I've been searching and this is my current best guess though it generates a server error (Apache 1.3.39).
- John
RewriteCond %{HTTP_USER_AGENT}!Gecko [NC]
RewriteRule!^(robots\.txt) - [F]
[webmasterworld.com...]
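(For what it's worth, the server error is most likely just the missing space between %{HTTP_USER_AGENT} and the pattern. There is also a logic problem: the double negation forbids everything *except* robots.txt for *non*-Gecko agents. A corrected sketch that matches Gecko positively would be:)

```apache
# Sketch: forbid Gecko user agents from fetching robots.txt.
# Note the space after %{HTTP_USER_AGENT} -- omitting it is a syntax error.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Gecko [NC]
RewriteRule ^robots\.txt$ - [F]
```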
Just add the following to your robots.txt (which reduces the count drastically, but doesn't eliminate it):
User-agent: Fasterfox
Disallow: /
SetEnvIf User-Agent "Gecko" Gecko
<Files /error/error-403.php>
order allow,deny
allow from all
</Files>
deny from env=Gecko
I'm not sure how to target only the robots.txt file specifically. I'm looking through my .htaccess file and code online for anything that resembles targeting a single file.
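One way to target only robots.txt is a <Files> container with the bare file name rather than a path -- a sketch along those lines (the env-var name is just illustrative):

```apache
# Sketch: flag Gecko user agents, then deny them access to robots.txt only
SetEnvIf User-Agent "Gecko" gecko_ua
<Files "robots.txt">
    order allow,deny
    allow from all
    deny from env=gecko_ua
</Files>
```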
- John
This looks even better if I can figure it out... simply NOT logging Gecko user agents requesting robots.txt. That would be acceptable as well.
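If you can get at the server config (CustomLog can't go in .htaccess), one sketch for skipping those log entries -- the log path and variable name are illustrative:

```apache
# Sketch -- requires server-config access, since CustomLog
# cannot be placed in .htaccess.
<Files "robots.txt">
    # flag Gecko requests for this one file
    SetEnvIf User-Agent "Gecko" dontlog
</Files>
# log everything except the flagged requests
CustomLog logs/access_log combined env=!dontlog
```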
- John
Step One
.htaccess
First you must allow PHP to execute on files with the txt extension...
AddType application/x-httpd-php .txt
Do not use this, however (with it set specifically to PHP 5), as I encountered problems on Apache 1.3.39 running PHP 5.2.4...
AddType application/x-httpd-php5 .txt
Step Two
robots.txt
You must ensure that the media (MIME) type stays the same as it was; otherwise it will be changed (and Firefox will ask you to save the file instead of simply displaying it). Since I'm using PHP, you must insert the following code at the very top of the file, without any whitespace before it...
<?php
header("Content-type: text/plain");
?>
Step Three
robots.txt
Now it's time to have PHP answer Gecko requests for the file with HTTP code 403 (Forbidden)...
<?php
header("Content-type: text/plain");
$useragent = $_SERVER['HTTP_USER_AGENT'];
if (preg_match("/Gecko|MSIE/", $useragent)) {
    header('HTTP/1.0 403 Forbidden');
    die('Error 403: This file is forbidden for browser access.');
}
?>
You can remove the die() call if you wish to still display the contents of the file (such as if you are manually checking it online yourself).
- John
Edited Part...
Confirm Fix
To test this use Chris Pederick's Web Developer Toolbar, click the Information menu, and at the very bottom click on "View Response Headers" while visiting your robots.txt file.
[edited by: JAB_Creations at 11:51 pm (utc) on Jan. 29, 2008]
# mod_rewrite setup - Use as needed
Options +FollowSymLinks -MultiViews
RewriteEngine on
#
# Forbid Gecko access to robots.txt
RewriteCond %{HTTP_USER_AGENT} Gecko/ [NC]
RewriteRule ^robots\.txt$ - [F]
#
# -- OR -- serve Gecko a minimal robots.txt instead
RewriteCond %{HTTP_USER_AGENT} Gecko/ [NC]
RewriteRule ^robots\.txt$ /tiny-robots.txt [L]
User-agent: *
Disallow: /
- John
RewriteEngine on
RewriteRule ^robots\.txt$ robots.php [L]
and in robots.php:
<?php
header('Content-Type: text/plain');
...
readfile('robots.txt');
If you want your robots.php to behave like a normal robots.txt, I recommend adding a "Last-Modified: " header (based on the date of robots.txt, for example), and some code to handle "If-Modified-Since:" HTTP requests (and reply with 304 if needed).
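A minimal sketch of that robots.php idea (the helper name is mine, not from the thread):

```php
<?php
// Sketch: a robots.php that sends Last-Modified and answers
// If-Modified-Since requests with 304 when the client's copy is current.

// True when the client's cached copy is at least as new as the file.
function is_not_modified($if_modified_since, $mtime) {
    if ($if_modified_since === null) {
        return false;                       // no conditional header sent
    }
    $since = strtotime($if_modified_since); // parse the HTTP-date
    return $since !== false && $since >= $mtime;
}

$file = 'robots.txt';                       // the real file, served through PHP
if (is_file($file)) {
    $mtime = filemtime($file);
    $ims   = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
           ? $_SERVER['HTTP_IF_MODIFIED_SINCE'] : null;

    header('Content-Type: text/plain');
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');

    if (is_not_modified($ims, $mtime)) {
        header('HTTP/1.0 304 Not Modified'); // client copy is current
    } else {
        readfile($file);                     // send the full file
    }
}
```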
- John
If you have the server under your control, you still have the option of putting the configuration in the virtual host block.
The problem with parsing .txt files as PHP is that they are all sent through the PHP parser instead of going out directly to the browser. You also lose conditional-request handling ("Last-Modified" and "If-Modified-Since") unless you re-implement it.
[edited by: Achernar at 6:57 pm (utc) on Jan. 31, 2008]
This is "Web 1990's a la GeoCities" and is horribly inefficient and just plain "wrong" from a search engine perspective. A 404 error page is and should be a dead-end error -- It may have links to the home page and to the site map and search page, but it should never redirect anywhere; redirecting from a 404 page is a very efficient way to tank your search rankings.
Code in .htaccess is comparable to code in PHP -- If anything, the .htaccess code is faster because the Apache module parsers are simpler than PHP's.
Code in .htaccess is interpreted for each and every HTTP request, while the same code in httpd.conf or conf.d (server config files) is compiled into executable form once at server re-start, and subsequently executed as native code for each HTTP request.
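For anyone who does have server access, the same robots.txt rule moved into a vhost might look like this sketch -- note that in server/vhost context the pattern matches the URL-path with its leading slash (ServerName and paths are placeholders):

```apache
# Sketch: the Gecko/robots.txt rule, placed in the virtual host
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example

    RewriteEngine on
    # leading slash required in server/vhost context
    RewriteCond %{HTTP_USER_AGENT} Gecko/ [NC]
    RewriteRule ^/robots\.txt$ - [F]
</VirtualHost>
```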
A variant of the php robots.txt file is to have multiple, simple robots.txt files, and steer various robots to each text file:
# Top 'bots get the top-robots.txt file
RewriteCond %{HTTP_USER_AGENT} Googlebot/|Yahoo!\ Slurp/|msnbot/|Teoma
RewriteRule ^robots\.txt$ /top-robots.txt [L]
#
# Mobile 'bots get the mobile version
RewriteCond %{HTTP_USER_AGENT} Googlebot-Mobile|YahooSeeker/M1A1-R2D2|MSNBOT_Mobile
RewriteRule ^robots\.txt$ /mobile-robots.txt [L]
#
# (All others fall through and get the standard robots.txt)
#
I did not mean a universal 301 redirect on the 404 page, but rather only moving redirects from .htaccess to the 404 page. If there isn't a matching redirect, it would simply end and the page would be served as a 404. My question about how Apache and PHP compare in added server load was only meant in a general sense. Thanks for your answers!
- John
If I'm understanding what you propose, it's a terrible solution, and deadly to your rankings. Do not do this.
(There is a long history of this method, which dates from the days of GeoCities, when it was the only way to redirect to a dynamic script on their intentionally-limited-user-capabilities server configurations. That is why you still see it suggested on the Web, but most of those posts are pre-SEO, and now bad advice.)
Jim