
Excluding bad bots via .htaccess: why do I get HTTP 500 errors?

     
11:41 am on Sep 23, 2013 (gmt 0)

Junior Member from IT 

5+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0


Hi, please see [webmasterworld.com...] for details about the origin of, and the reasoning behind, this .htaccess file.

I'm blocking bad and useless bots using:

RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} 360Spider [OR]
RewriteCond %{HTTP_USER_AGENT} A(?:ccess|ppid) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} C(?:apture|lient|opy|rawl|url) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} D(?:ata|evSoft|o(?:main|wnload)) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} E(?:ngine|zooms) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} f(?:etch|ilter) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} genieo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Ja(?:karta|va) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Li(?:brary|nk|bww) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Pr(?:oxy|ublish) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} robot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} s(?:craper|istrix|pider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} W(?:get|(?:in(32|Http))) [NC]
RewriteRule .? - [F]


Complete .htaccess file:

#ban bots, the whole china and #*$!
Order Allow,Deny
allow from all
deny from 1.0.1.0/24
deny from 1.0.2.0/23
[...]

AddDefaultCharset UTF-8

RewriteEngine on

#inherit from root htaccess and append at last, necessary in root too
RewriteOptions inherit

#block bad bots
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} 360Spider [OR]
RewriteCond %{HTTP_USER_AGENT} A(?:ccess|ppid) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} C(?:apture|lient|opy|rawl|url) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} D(?:ata|evSoft|o(?:main|wnload)) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} E(?:ngine|zooms) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} f(?:etch|ilter) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} genieo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Ja(?:karta|va) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Li(?:brary|nk|bww) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Pr(?:oxy|ublish) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} robot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} s(?:craper|istrix|pider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} W(?:get|(?:in(32|Http))) [NC]
RewriteRule .? - [F]

#include caching for images
<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType image/gif "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/x-icon "access plus 360 days"
ExpiresByType text/css "access plus 1 day"
ExpiresByType text/html "access plus 1 week"
ExpiresByType text/javascript "access plus 1 week"
ExpiresByType text/x-javascript "access plus 1 week"
ExpiresByType application/javascript "access plus 1 week"
ExpiresByType application/x-javascript "access plus 1 week"
ExpiresByType application/x-shockwave-flash "access plus 1 week"
ExpiresByType font/truetype "access plus 1 month"
ExpiresByType font/opentype "access plus 1 month"
ExpiresByType application/x-font-otf "access plus 1 month"
</IfModule>

RewriteCond %{HTTP_HOST} ^nix.foo.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.nix.foo.com$
RewriteRule ^(.*)$ "http\:\/\/www\.foo\.com\/nix\.php" [R=301,L]

RewriteCond %{HTTP_HOST} ^gallery.foo.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.gallery.foo.com$
RewriteRule ^(.*)$ "http\:\/\/www\.foo\.com\/gallery\.php" [R=301,L]

RewriteCond %{HTTP_HOST} ^blog.foo.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.blog.foo.com$
RewriteRule ^(.*)$ "http\:\/\/www\.foo\.com\/blog" [R=301,L]

RewriteCond %{HTTP_HOST} ^id.foo.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.id.foo.com$
RewriteRule ^/?$ "http\:\/\/foo\.myopenid\.com\/" [R=301,L]

redirect 301 /map.php http://www.foo.com/maps/map.php

RedirectMatch 301 ^/(map(?!pa_area51\.)[^/.]+\.php)$ http://www.foo.com/maps/$1

Options +FollowSymLinks
RewriteCond %{HTTP_HOST} !^www\.
RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]


It worked fine (HTTP 403) until I switched from a LiteSpeed web server host to an Apache one. Both are shared hosting services. Now I get:

Forbidden

You don't have permission to access /robots.txt on this server.

Additionally, a 500 Internal Server Error error was encountered while trying to use an ErrorDocument to handle the request.


Here's a sample from access log:

208.115.111.68 - - [22/Sep/2013:17:56:48 +0200] "GET /robots.txt HTTP/1.1" 500 576 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)"


Any hints on that HTTP 500 error? Thanks in advance.
3:25 pm on Sept 23, 2013 (gmt 0)

Junior Member from IT 

5+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0


I forgot to include the error log message:
Request exceeded the limit of 10 internal redirects due to probable configuration error. Use 'LimitInternalRecursion' to increase the limit if necessary. Use 'LogLevel debug' to get a backtrace
4:30 pm on Sept 23, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13740
votes: 459


ExpiresByType

Why don't you set an ExpiresDefault, and then just list the exceptions?
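
Roughly like this, for instance (the lifetimes here are just carried over from your own list; adjust to taste):

<IfModule mod_expires.c>
ExpiresActive On
# one default lifetime for everything...
ExpiresDefault "access plus 1 month"
# ...then list only the exceptions
ExpiresByType text/css "access plus 1 day"
ExpiresByType text/html "access plus 1 week"
ExpiresByType application/javascript "access plus 1 week"
</IfModule>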

Crystal ball says that you forgot to include these lines, placed before all other RewriteRules:

RewriteRule ^robots\.txt - [L]
(not needed if you constrain all your access-control RewriteRules to specified extensions, which can save time; there's a sketch of that at the end of this post)

and most crucially
RewriteRule ^my-403-page\.html - [L]

Also (assuming you have some lockouts using mod_authz-whatever)
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>
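
To illustrate the extensions idea: if the blocking rule only ever fires for page requests, then robots.txt (and any error document with a different extension) is never caught in the first place. A rough sketch, with purely illustrative extensions:

RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} 360Spider [OR]
# ... the rest of your user-agent conditions ...
RewriteCond %{HTTP_USER_AGENT} W(?:get|in(?:32|Http)) [NC]
# forbid page requests only; robots.txt (.txt) can never match this pattern
RewriteRule \.(php|html?)$ - [F]

Note that a pattern like that no longer catches requests for bare directory URLs such as /, so whether it is worth it depends on your setup.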
5:38 pm on Sept 23, 2013 (gmt 0)

Junior Member from IT 

5+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0


Thanks for your tips, especially the one on ExpiresDefault.
However, while granting robots.txt access to malicious bots should prevent these particular errors, it doesn't explain why I see all those HTTP 500s instead of just HTTP 403s, as happened on the old web hosting.
I can't find any infinite redirect loops, so I really don't know what could be causing those HTTP 500 errors.
8:29 pm on Sept 23, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13740
votes: 459


I can't find any redirect infinite loops

OK, let's go to the long version.

The infinite loops are created by your own 403. It works like this:

--bad robot makes request
--server consults RewriteRule and says Nuh-uh, can't have that, and tries to send back 403 header accompanied by custom 403 page
--server makes internal request for 403 page, still attached to original requesting IP, UA and so on
--server consults RewriteRule and says Nuh-uh, can't have that, and tries to send back 403 header accompanied by custom 403 page
--server makes internal request for 403 page, still attached to original requesting IP, UA and so on
--server ...

et cetera.

See how that works? That's why you need to code an exemption for your 403 page. Make a separate exemption for each mod that issues a 403: most likely one for mod_rewrite and another for mod_authz-thingummy. So along with the RewriteRule quoted above, you should also have a

<Files "my-custom-403-page.html">
Order Allow,Deny
Allow from all
</Files>

Some hosts have a built-in error document, for example
/forbidden.html
The server will look in your root for a document by this name. And the config file includes a <Files> or similar section allowing everyone access to the document.

Another host may not have this built-in setup, or may use a different name by default. You're always safe adding your own rules.
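
Putting the pieces together, one safe arrangement looks roughly like this (my-custom-403-page.html is only a placeholder name; the exemptions have to sit before the blocking rules):

ErrorDocument 403 /my-custom-403-page.html

# let the error document and robots.txt through mod_rewrite...
RewriteRule ^my-custom-403-page\.html$ - [L]
RewriteRule ^robots\.txt$ - [L]

# ...and let the error document through mod_authz as well
<Files "my-custom-403-page.html">
Order Allow,Deny
Allow from all
</Files>

# the bot-blocking RewriteCond/RewriteRule block goes below this point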
8:30 pm on Sept 23, 2013 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10858
votes: 67


you are forbidding access by the Ezooms U-A, so that explains the 403.

you'll have to look for the "ErrorDocument 403 ..." directive in your server config to find the source of your 500 error.

you should also change all mod_alias directives (Redirect/RedirectMatch) to mod_rewrite directives since they don't mix well.

your RewriteRules need work - no quoting or (backslash) escaping of the substitution URLs is required.
also, in .htaccess the URL-path that the pattern is matched against never starts with a leading slash.
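
for example, the nix.foo.com block above could simply be written as something like this (the two host conditions merged into one, dots escaped, no quotes or backslashes in the target):

RewriteCond %{HTTP_HOST} ^(www\.)?nix\.foo\.com$ [NC]
RewriteRule ^(.*)$ http://www.foo.com/nix.php [R=301,L]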

[edit]typo[/edit]

[edited by: phranque at 12:55 am (utc) on Sep 24, 2013]

9:26 pm on Sept 23, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Having used at least one RewriteRule, you must use zero Redirect and RedirectMatch rules.

Convert all Redirect and RedirectMatch rules to use RewriteRule syntax.
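
For example, the two mod_alias lines in the posted file could be rewritten along these lines (same patterns, minus the leading slash since this is .htaccess):

RewriteRule ^map\.php$ http://www.foo.com/maps/map.php [R=301,L]
RewriteRule ^(map(?!pa_area51\.)[^/.]+\.php)$ http://www.foo.com/maps/$1 [R=301,L]

The negative lookahead carries over unchanged, as RewriteRule patterns are PCRE too.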
5:50 pm on Sept 24, 2013 (gmt 0)

Junior Member from IT 

5+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0


Thank you so much, now everything's clear.