Firefox Extension requests robots.txt: makes spider detection hard!

Statistics script only logs successful hits, can we 403 'Gecko'?


JAB Creations

10:38 pm on Jan 27, 2008 (gmt 0)

There is an extension for Firefox and other Gecko-based browsers that requests robots.txt. I don't remember the name of the extension offhand, but I remember actively trying, unsuccessfully, to contact its creator. This is very obnoxious because it makes it difficult to detect new spiders: I have to manually delete the line from the access log, delete the script's log, run the script again, and repeat for every single request!

I do not have the ability to execute PHP in txt files, nor do I have access to httpd.conf to allow it. So...

1.) How do we detect 'Gecko'?
2.) How do we forbid 'Gecko' from accessing robots.txt?

I've been searching, and this is my current best guess, though it generates a server error (Apache 1.3.39).

- John

RewriteCond %{HTTP_USER_AGENT} !Gecko [NC]
RewriteRule !^(robots\.txt) - [F]

wilderness

11:34 pm on Jan 27, 2008 (gmt 0)

Here's an old thread

[webmasterworld.com...]

Just add the following to your robots.txt (which drastically reduces the count, but doesn't eliminate it):

User-agent: Fasterfox
Disallow: /

JAB Creations

4:44 am on Jan 28, 2008 (gmt 0)

That fails to help, since Firefox will still receive the robots.txt file and Apache will log it as code 200! Unless there is another method of doing this; again, I can't adjust Apache on the live server to execute PHP in files with the txt extension.

- John

JAB Creations

5:05 am on Jan 28, 2008 (gmt 0)

This works well...though a little too well...
SetEnvIf User-Agent "Gecko" Gecko
# Exempt the 403 error page so blocked visitors can still see it
# (<Files> matches a filename only, not a path)
<Files error-403.php>
order allow,deny
allow from all
</Files>
# Outside any <Files> container, this denies the whole site to Gecko
deny from env=Gecko

I'm not sure how to target only the robots.txt file specifically. I'm looking through my .htaccess file, and through code online, for anything that resembles targeting a single file.
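
My best guess so far, though I haven't tested it yet, is that moving the deny inside a <Files> section scoped to robots.txt would limit it to just that file:

SetEnvIf User-Agent "Gecko" Gecko
# Scope the deny to robots.txt only; everything else stays open
<Files robots.txt>
order allow,deny
allow from all
deny from env=Gecko
</Files>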

- John

JAB Creations

5:13 am on Jan 28, 2008 (gmt 0)

Look what I found! :)
[httpd.apache.org...]

This looks even better if I can figure it out...simply not logging Gecko user agents that request robots.txt. That would be acceptable as well.
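
If I'm reading the docs right, something like this would skip logging those hits, though CustomLog's env= condition only works in the server/vhost config, not .htaccess, so it may be out of reach on my host (untested, and the log path is just an example):

# Tag Gecko requests for robots.txt, then leave them out of the log
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} Gecko
RewriteRule ^/robots\.txt$ - [E=dont_log:1]
CustomLog /var/log/apache/access_log combined env=!dont_log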

- John

JAB Creations

11:09 pm on Jan 29, 2008 (gmt 0)

I still haven't figured this out, if anyone cares to join in?

- John

JAB Creations

11:45 pm on Jan 29, 2008 (gmt 0)

Working Answer: I was unaware that PHP execution can be enabled by modifying .htaccess. So there are three simple steps to do this correctly...

Step One
.htaccess
First you must allow PHP to execute on files with the txt extension...

AddType application/x-httpd-php .txt

Do not use the following, however (with it set specifically to PHP5, I encountered problems on Apache 1.3.39 running PHP 5.2.4)...

AddType application/x-httpd-php5 .txt

Step Two
robots.txt
You must ensure that the media type (MIME type) is still the same as it was. Otherwise it will be changed (and Firefox will ask you to save the file instead of simply displaying it). Since I'm using PHP, you must insert the following code at the very top of the file, without any whitespace before it...

<?php
header("Content-type: text/plain");
?>

Step Three
robots.txt
Now it's time to have PHP answer Gecko requests for the file with HTTP code 403 (Forbidden)...

<?php
header("Content-type: text/plain");
$useragent = $_SERVER['HTTP_USER_AGENT'];
// Browsers identifying as Gecko or MSIE get a 403 instead of the rules
if (preg_match("/Gecko|MSIE/", $useragent)) {
header('HTTP/1.0 403 Forbidden');
die('Error 403: This file is forbidden for browser access.');
}
?>

You can remove the die call if you wish to still display the contents of the file (such as if you are manually checking it online yourself).
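
In other words, without the die the check shrinks to just setting the status before the rules (the rule lines after the closing tag are placeholders):

<?php
header("Content-type: text/plain");
// Send the 403 status but keep serving the file's contents below
if (preg_match("/Gecko|MSIE/", $_SERVER['HTTP_USER_AGENT'])) {
header('HTTP/1.0 403 Forbidden');
}
?>
User-agent: *
Disallow: /example/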

- John

Edited Part...
Confirm Fix
To test this, use Chris Pederick's Web Developer Toolbar: click the Information menu and, at the very bottom, click "View Response Headers" while visiting your robots.txt file.

[edited by: JAB_Creations at 11:51 pm (utc) on Jan. 29, 2008]

jdMorgan

12:31 am on Jan 30, 2008 (gmt 0)

Plain .htaccess solution:


# mod_rewrite setup - Use as needed
Options +FollowSymLinks -MultiViews
RewriteEngine on
#
# Forbid Gecko access to robots.txt
RewriteCond %{HTTP_USER_AGENT} Gecko/ [NC]
RewriteRule ^robots\.txt - [F]

Alternate code to rewrite these requests to a smaller robots.txt file:

RewriteCond %{HTTP_USER_AGENT} Gecko/ [NC]
RewriteRule ^robots\.txt /tiny-robots.txt [L]

And in tiny-robots.txt:

User-agent: *
Disallow: /


Jim

JAB Creations

1:36 am on Jan 30, 2008 (gmt 0)

Thanks Jim, your alternative version works great! Since I first started working on this earlier today, I've decided to attempt to use PHP to automate the detection of yet-unknown spiders, so I will most likely use an array at the least, if not a MySQL table, to deal with the various spiders. I figure it would be easier to update and would reduce overall server load (I imagine .htaccess is parsed for every request, whereas this PHP script runs only when robots.txt is actually requested). I'll be posting in the PHP forums if you want to follow. :) Thanks again!
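
The rough idea I have in mind, before I flesh it out over there (the known-agent list and the log filename are just placeholders):

<?php
header("Content-type: text/plain");
// Agents I already know about; anything else gets logged for review
$known = array('Googlebot', 'Slurp', 'msnbot', 'Teoma', 'Gecko', 'MSIE');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$recognized = false;
foreach ($known as $name) {
    if (stripos($ua, $name) !== false) {$recognized = true; break;}
}
if (!$recognized) {
    // Append timestamp and agent string to a separate log
    error_log(date('Y-m-d H:i:s') . ' ' . $ua . "\n", 3, 'unknown-agents.log');
}
?>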

- John

Achernar

2:31 pm on Jan 31, 2008 (gmt 0)

Either use .htaccess to filter clients, or PHP if more fine-tuning is necessary.
I use this method:

RewriteEngine on
RewriteRule ^robots\.txt$ robots.php

and in robots.php:

<?php
header('Content-Type: text/plain');
...
readfile('robots.txt');

If you want your robots.php to behave like a normal robots.txt, I recommend adding a "Last-Modified:" header (based on the modification date of robots.txt, for example) and some code to handle "If-Modified-Since:" requests (replying with 304 when appropriate).
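
Roughly like this (a sketch; error handling omitted):

<?php
header('Content-Type: text/plain');
$file = 'robots.txt';
$mtime = filemtime($file);
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
// Reply 304 when the client's cached copy is still current
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
    strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $mtime) {
    header('HTTP/1.0 304 Not Modified');
    exit;
}
readfile($file);
?>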

JAB Creations

6:45 pm on Jan 31, 2008 (gmt 0)

Wouldn't it be better to offload server-side instructions intended for a single file into that file itself? All I needed to do was allow PHP to execute in files with the txt extension. I figure it would be better to reserve the more universal stuff for the .htaccess file, because mine is a little large. For example, I have numerous 301s set up, and the thought crossed my mind to set up a PHP 301 script based off the 404 page, so that the .htaccess file wouldn't have to be parsed (as much) on every request. Or are 200KB+ .htaccess files not a big deal in terms of CPU load?

- John

Achernar

6:54 pm on Jan 31, 2008 (gmt 0)

.htaccess is only parsed on startup (I think), and when it has been modified.

If you have the server under your control, you still have the option of putting the configuration in the virtual host block.

The problem with parsing .txt files as PHP is that they are all sent to the PHP parser instead of going out directly to the browser. Plus, you no longer get the automatic "Last-Modified" and "If-Modified-Since" handling.

[edited by: Achernar at 6:57 pm (utc) on Jan. 31, 2008]

jdMorgan

4:40 am on Feb 1, 2008 (gmt 0)

> 301 script based off of the 404 page

This is "Web 1990's a la GeoCities" and is horribly inefficient and just plain "wrong" from a search engine perspective. A 404 error page is and should be a dead-end error -- It may have links to the home page and to the site map and search page, but it should never redirect anywhere; This is a very efficient way to tank your search rankings.

Code in .htaccess is comparable to code in PHP -- if anything, the .htaccess code is faster, because the Apache module parsers are simpler than PHP's.

Code in .htaccess is interpreted for each and every HTTP request, while the same code in httpd.conf or conf.d (server config files) is compiled into executable form once at server re-start, and subsequently executed as native code for each HTTP request.

A variant of the php robots.txt file is to have multiple, simple robots.txt files, and steer various robots to each text file:


# Top 'bots get the top-robots.txt file
RewriteCond %{HTTP_USER_AGENT} Googlebot/|Yahoo!\ Slurp/|msnbot/|Teoma
RewriteRule ^robots\.txt$ /top-robots.txt [L]
#
# Mobile 'bots get the mobile version
RewriteCond %{HTTP_USER_AGENT} Googlebot-Mobile|YahooSeeker/M1A1-R2D2|MSNBOT_Mobile
RewriteRule ^robots\.txt$ /mobile-robots.txt [L]
#
# (All others fall through and get the standard robots.txt)
#

Jim

JAB Creations

6:47 am on Feb 1, 2008 (gmt 0)

That makes sense; I had no clue how .htaccess parsing compares to PHP parsing.

I did not mean a universal 301 redirect on the 404 page, but rather only moving redirects from .htaccess to the 404 page. If there isn't a matching redirect, it would simply end and the page would be served as a 404. My question about how Apache and PHP compare in server load was only meant in a vague sense. Thanks for your answers!

- John

jdMorgan

6:22 pm on Feb 1, 2008 (gmt 0)

Moving redirects to the 404 page will result in a witches' brew of server response codes and poison your search results. Even if there's a redirect after the 404 page, the client will see a 404 followed by a redirect.

If I'm understanding what you propose, it's a terrible solution, and deadly to your rankings. Do not do this.

(There is a long history of this method, which dates from the days of GeoCities, when it was the only way to redirect to a dynamic script on their intentionally limited server configurations. That is why you still see it suggested on the Web, but most of those posts are pre-SEO and are now bad advice.)

Jim

JAB Creations

1:01 am on Feb 2, 2008 (gmt 0)

Thanks Jim, it sounds like keeping things the way they are now is the best thing to do.

- John