Forum Moderators: phranque


Testing .htaccess?

How can I tell if I'm blocking spiders properly?


madpenguin

3:43 am on Apr 26, 2004 (gmt 0)

10+ Year Member



Greetings everyone!
I run a high-traffic site and currently have a rather large .htaccess file in an attempt to stop evil spiders from crawling my pages, but I suspect it isn't working the way it should. Is there a way I can test it? Better yet, I'll share my spider list with everyone if someone might lend their insight and expertise... it's a pretty comprehensive listing, I think. Lotsa bots :-)

Any thoughts?

[edited by: jdMorgan at 3:52 am (utc) on April 26, 2004]
[edit reason] Removed specifics per TOS [/edit]

jdMorgan

4:02 am on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



madpenguin,

Welcome to WebmasterWorld [webmasterworld.com]!

We've had an ongoing discussion -- complete with huge 'bot lists -- in this 3-part thread [webmasterworld.com], which you might find interesting. That's the place to post your list -- or maybe just your unique discoveries, to keep the size down.

You might want to try using wannabrowser to check your .htaccess code. If you're having problems with a ban list implemented using mod_rewrite, the two most common problems are missing [OR] flags on RewriteConds, and incorrect user-agent string anchoring.

Also, this thread [webmasterworld.com] and its predecessors describe a useful script developed to make your access-control job easier by automating part of the process.
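For a rough idea of what that kind of automation looks like (a hypothetical sketch, not the script from that thread -- the bot names here are just placeholders), a small shell script can turn a plain list of user-agent patterns into a mod_rewrite block, handling the [OR]-flag rule automatically so the last RewriteCond never carries a stray [OR]:

```shell
#!/bin/sh
# Hypothetical sketch: generate a mod_rewrite ban list from bot-name
# patterns. Every RewriteCond gets [NC,OR] except the last, which must
# get [NC] alone -- a trailing [OR] before the RewriteRule breaks it.
set -eu

bots="webvac emailsiphon webzip"

# Find the final pattern so we know where to drop the [OR] flag.
last=""
for bot in $bots; do last=$bot; done

config="RewriteEngine On"
for bot in $bots; do
  if [ "$bot" = "$last" ]; then
    flags="[NC]"
  else
    flags="[NC,OR]"
  fi
  config="$config
RewriteCond %{HTTP_USER_AGENT} $bot $flags"
done
config="$config
RewriteRule .* - [F]"

printf '%s\n' "$config"
```

Paste the output into .htaccess (or redirect it there); regenerating the block from the list means the flag bookkeeping can't drift as the list grows.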

Jim

madpenguin

5:08 am on Apr 26, 2004 (gmt 0)

10+ Year Member



Thanks jdMorgan!
Some of this is way over my head, but I will start my reading now :) In the "close to perfect" thread I'll dump my list for others to use if they like. It's a long list, but I'm sure someone will find it useful. In the meantime, I've condensed my .htaccess to show the relevant parts for review. Is this the proper way to go about it? I tried wannabrowser and it seems to load pages on our site just fine without being redirected at all. In this example I've slashed my list of bots down to just ^webvac for ease of reading :)

Is there anything wrong with this syntax at all that would still allow bots to get to our site?

Options +FollowSymlinks
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^webvac [NC,OR]
RewriteRule ^.*$ http://www.example.org/404.shtml [L,R]
<Files 403.shtml>
order allow,deny
allow from all
</Files>

[edited by: jdMorgan at 1:33 pm (utc) on April 26, 2004]
[edit reason] removed specifics per TOS [/edit]

jdMorgan

2:00 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



  • You will need to add lines such as the first RewriteCond below to allow access to your 403-Forbidden custom error page and all the files that it includes (I assume it contains includes because of the .shtml file extension).
  • The final RewriteCond in your list -- the one immediately preceding the RewriteRule -- *must not* have an [OR] flag; a trailing [OR] will "break" the rule.
  • Examine your raw access log for a request from webvac. If the raw user-agent string does not start with "webvac" -- e.g. if it starts with "Mozilla/4.0 (compatible; webvac)" -- then your start-anchored pattern will never match. Patterns should be anchored where possible, but take care to get the anchoring right.
  • The usual approach is to use the [F] flag on the RewriteRule to immediately return a 403-Forbidden response, rather than attempting to redirect the user-agent. (It's unlikely that bad 'bots will follow an external 302 redirect as you specified, and I don't understand the logic of sending them to your 404-Not Found custom error page, either.)

    Options +FollowSymlinks
    RewriteEngine On
    RewriteCond %{REQUEST_URI} !^/403\.shtml$
    RewriteCond %{HTTP_USER_AGENT} ^webvac [NC]
    RewriteRule .* - [F]
    <Files 403.shtml>
    order allow,deny
    allow from all
    </Files>
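The anchoring point is easy to check outside Apache: grep -E understands the same extended-regex syntax that mod_rewrite patterns use, so you can test a pattern against a logged user-agent string on the command line. A quick sketch (the UA string here is made up for illustration; -i mirrors the [NC] flag):

```shell
# Check whether a mod_rewrite-style pattern would match a given
# user-agent string, using grep -E (same extended-regex syntax).
ua="Mozilla/4.0 (compatible; webvac)"

# Start-anchored pattern, as in the original ruleset: no match.
anchored=$(printf '%s\n' "$ua" | grep -Eic '^webvac' || true)

# Unanchored pattern: matches "webvac" anywhere in the string.
unanchored=$(printf '%s\n' "$ua" | grep -Eic 'webvac' || true)

echo "anchored matches:   $anchored"      # 0 -- this bot slips through
echo "unanchored matches: $unanchored"    # 1 -- this bot is caught
```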

I'm also not sure what your intent is with the "allow from all" <Files> container. It contains mod_access directives, which are processed separately from those of mod_rewrite, so the two don't normally interoperate; they can coexist, but mod_rewrite can still block access to a file even if it is Allowed by mod_access.

Jim

bcolflesh

2:02 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For user-agent and referrer rule testing, try:

[wannabrowser.com...]

<edit> duh, just saw it in a previous post - sorry!</edit>