Forum Moderators: phranque

Message Too Old, No Replies

Bot redirection with custom 403

works as expected for everything except domain root

         

bubster119

11:03 am on Feb 19, 2008 (gmt 0)

10+ Year Member



I've been working on an htaccess file to block bad bots and redirect them to my custom 403 while excluding the robots.txt file.

I've checked it using wannabrowser as a blacklisted bot and it works as expected in most cases.

If you try to access robots.txt you get an HTTP 200 and the contents of the file.

If you try to access a page below the root, e.g. http://www.example.com/page.htm, you get an HTTP 403 and the custom error page. However...

If you try to access the site root http://www.example.com you get the desired HTTP 403, but instead of the custom error page it returns an Apache HTTP Server Test Page.

Below is the htaccess file - Any help is greatly appreciated.

<.htaccess file start>

# CUSTOM ERROR PAGES
ErrorDocument 400 /error/400.htm
ErrorDocument 401 /error/401.htm
ErrorDocument 403 /error/403.htm
ErrorDocument 404 /error/404.htm
ErrorDocument 500 /error/500.htm

# EXTENSION-FREE URIs
Options +MultiViews
Options +FollowSymLinks

# 301 REDIRECT WWW/NON-WWW CANONICAL ISSUE
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# BLOCK BAD BOTS
RewriteCond %{HTTP_USER_AGENT} SurveyBot/2.3 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget
RewriteRule !^(error/403\.htm|robots\.txt)$ - [F,L]

<.htaccess file end>

Bubster

wilderness

5:57 pm on Feb 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This may help:
[webmasterworld.com...]
or
Google
[google.com...]

bubster119

5:58 pm on Feb 19, 2008 (gmt 0)

10+ Year Member



Having spent half the day working through this I still haven't arrived at a solution, but it has set me thinking about this in an alternative way.

I'm wondering whether it really makes any difference what content is returned to the blocked bots, so long as they receive a valid HTTP 403 response?

I've checked the log for wannabrowser and I get the following for a request of the main site root (www.example.com) from a bad bot:

HTTP/1.1 403 Forbidden
Date: Tue, 19 Feb 2008 17:51:53 GMT
Server: Apache
Vary: Host
Accept-Ranges: bytes
Content-Length: 5044
Connection: close
Content-Type: text/html

Does it really matter if my custom 403 error file isn't returned to the blocked bots, or am I missing something fundamental?

Thanks again for any assistance.

bubster119

6:14 pm on Feb 19, 2008 (gmt 0)

10+ Year Member



Thanks for the reply wilderness. Just to clarify, it seems to be working correctly for all bad-bot requests to pages and directories below the root:

http://www.example.com/page1.htm
http://www.example.com/page2.htm
http://www.example.com/directory1/
http://www.example.com/directory2/ etc.

and returns a 403 with my custom error file - as desired.

It is only when the request is to the domain root http://www.example.com/ that the 403 is returned without the custom error page, which is replaced by the standard Apache test page.

I'm wondering if this could be a server configuration issue?

wilderness

6:55 pm on Feb 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm wondering if this could be a server configuration issue?

Bots are using a variety of tactics today that were not formerly used.
One of these is an incomplete URL, which serves the same purpose as a ping and returns a 301 (which counts as actual access), even though the bot or its IP range may be denied.

I have no clue whether that is what you're seeing, but I'm noticing this in my visitor logs more and more frequently.

jdMorgan

4:03 am on Feb 20, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



First thing I'd do is disable MultiViews, unless you're actually using content-negotiation. If the problem persists, then you've got a server config problem -- Nothing's fatally wrong with your code.

The only quibble I'd have with it is: why bother redirecting non-canonical requests if the user-agent is a bad bot? Consider reversing the order of the two rules.
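Applied to the posted file, the reordering might look something like this (a sketch based on the rules above; adjust hostnames and paths to suit):

```apache
# Turn content negotiation off unless it is actually needed
Options -MultiViews +FollowSymLinks

RewriteEngine On
RewriteBase /

# BLOCK BAD BOTS -- checked first, so a blocked agent never sees the 301
RewriteCond %{HTTP_USER_AGENT} SurveyBot/2\.3 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget
RewriteRule !^(error/403\.htm|robots\.txt)$ - [F,L]

# 301 REDIRECT NON-WWW TO WWW -- only reached by agents that pass the block
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

With the blocking rule first, a bad bot requesting http://example.com/ gets the 403 directly instead of a 301 followed by a second blocked request.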

Jim

bubster119

5:49 pm on Feb 20, 2008 (gmt 0)

10+ Year Member



Thanks for the reply Jim, much appreciated.

I've spoken to my tech support and apparently the Apache test page is only served when no index page is available (mine exists), so I'm not sure why requests for the root are hitting it.

I don't have access to the config files so I may need to take another approach - oh, and disabling MultiViews doesn't seem to make any difference.

Does anyone know the benefits of redirecting to a custom error page rather than just using an [F] or [G] flag, which I presume stops the bot dead?

Cheers

bubster119

6:15 pm on Feb 20, 2008 (gmt 0)

10+ Year Member



Additionally, I've noticed that if the request is for www.example.com/index.htm it actually works as it should. I'm not sure if this indicates that something is amiss with the DirectoryIndex setting in the main conf file.

jdMorgan

7:04 pm on Feb 20, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can use the DirectoryIndex directive in your .htaccess file as well, if index.htm is not defined as a directory index in the server config.
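In an .htaccess file that might look like the following (assuming index.htm is the intended index document; the index.html fallback is just illustrative):

```apache
# Look for index.htm first, then index.html, when a directory is requested
DirectoryIndex index.htm index.html
```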

Jim

bubster119

10:03 am on Feb 22, 2008 (gmt 0)

10+ Year Member



Thanks Jim

The DirectoryIndex changes didn't make any difference.

I think I've managed to sort it, though - I kept my code exactly the same apart from changing the [F] flag to a [G] flag, and everything works fine.
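For anyone following along, the change amounts to swapping the flag on the final rule (a sketch; note that [G] returns 410 Gone rather than 403 Forbidden, implies [L], and would use an ErrorDocument 410 rather than the 403 one):

```apache
# BLOCK BAD BOTS -- [G] answers with 410 Gone instead of 403 Forbidden
RewriteCond %{HTTP_USER_AGENT} SurveyBot/2.3 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget
RewriteRule !^(error/403\.htm|robots\.txt)$ - [G]
```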

Thanks for your help.

Bub