Home / Forums Index / Code, Content, and Presentation / Apache Web Server
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
Excluding bad bots via .htaccess: why do I get HTTP 500 errors?
flapane
11:41 am on Sep 23, 2013 (gmt 0)

Hi, please check [webmasterworld.com...] if you want to see details about the origin and the reasons behind this .htaccess file.

I'm blocking bad and useless bots using:

RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} 360Spider [OR]
RewriteCond %{HTTP_USER_AGENT} A(?:ccess|ppid) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} C(?:apture|lient|opy|rawl|url) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} D(?:ata|evSoft|o(?:main|wnload)) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} E(?:ngine|zooms) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} f(?:etch|ilter) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} genieo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Ja(?:karta|va) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Li(?:brary|nk|bww) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Pr(?:oxy|ublish) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} robot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} s(?:craper|istrix|pider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} W(?:get|(?:in(32|Http))) [NC]
RewriteRule .? - [F]


Complete .htaccess file:

#ban bots, the whole china and #*$!
Order Allow,Deny
allow from all
deny from 1.0.1.0/24
deny from 1.0.2.0/23
[...]

AddDefaultCharset UTF-8

RewriteEngine on

#inherit from root htaccess and append at last, necessary in root too
RewriteOptions inherit

#block bad bots
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} 360Spider [OR]
RewriteCond %{HTTP_USER_AGENT} A(?:ccess|ppid) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} C(?:apture|lient|opy|rawl|url) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} D(?:ata|evSoft|o(?:main|wnload)) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} E(?:ngine|zooms) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} f(?:etch|ilter) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} genieo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Ja(?:karta|va) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Li(?:brary|nk|bww) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Pr(?:oxy|ublish) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} robot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} s(?:craper|istrix|pider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} W(?:get|(?:in(32|Http))) [NC]
RewriteRule .? - [F]

#include caching for images
<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType image/gif "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/x-icon "access plus 360 days"
ExpiresByType text/css "access plus 1 day"
ExpiresByType text/html "access plus 1 week"
ExpiresByType text/javascript "access plus 1 week"
ExpiresByType text/x-javascript "access plus 1 week"
ExpiresByType application/javascript "access plus 1 week"
ExpiresByType application/x-javascript "access plus 1 week"
ExpiresByType application/x-shockwave-flash "access plus 1 week"
ExpiresByType font/truetype "access plus 1 month"
ExpiresByType font/opentype "access plus 1 month"
ExpiresByType application/x-font-otf "access plus 1 month"
</IfModule>

RewriteCond %{HTTP_HOST} ^nix.foo.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.nix.foo.com$
RewriteRule ^(.*)$ "http\:\/\/www\.foo\.com\/nix\.php" [R=301,L]

RewriteCond %{HTTP_HOST} ^gallery.foo.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.gallery.foo.com$
RewriteRule ^(.*)$ "http\:\/\/www\.foo\.com\/gallery\.php" [R=301,L]

RewriteCond %{HTTP_HOST} ^blog.foo.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.blog.foo.com$
RewriteRule ^(.*)$ "http\:\/\/www\.foo\.com\/blog" [R=301,L]

RewriteCond %{HTTP_HOST} ^id.foo.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.id.foo.com$
RewriteRule ^/?$ "http\:\/\/foo\.myopenid\.com\/" [R=301,L]

redirect 301 /map.php http://www.foo.com/maps/map.php

RedirectMatch 301 ^/(map(?!pa_area51\.)[^/.]+\.php)$ http://www.foo.com/maps/$1

Options +FollowSymLinks
RewriteCond %{HTTP_HOST} !^www\.
RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]


It worked fine (returning HTTP 403) until I switched from a LiteSpeed web hosting service to an Apache one. They're both shared hosting services. Now I get:

Forbidden

You don't have permission to access /robots.txt on this server.

Additionally, a 500 Internal Server Error error was encountered while trying to use an ErrorDocument to handle the request.


Here's a sample from access log:

208.115.111.68 - - [22/Sep/2013:17:56:48 +0200] "GET /robots.txt HTTP/1.1" 500 576 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)"

Any hints on that HTTP 500 error? Thanks in advance.

 

flapane
3:25 pm on Sep 23, 2013 (gmt 0)

I forgot the error log messages:
Request exceeded the limit of 10 internal redirects due to probable configuration error. Use 'LimitInternalRecursion' to increase the limit if necessary. Use 'LogLevel debug' to get a backtrace

lucy24
4:30 pm on Sep 23, 2013 (gmt 0)

ExpiresByType

Why don't you set an ExpiresDefault, and then just list the exceptions?
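
For example (a sketch with illustrative lifetimes, not a recommendation), the long ExpiresByType list could shrink to a fallback plus exceptions:

```apache
<IfModule mod_expires.c>
ExpiresActive On
# site-wide fallback
ExpiresDefault "access plus 1 week"
# only the types that differ from the fallback
ExpiresByType image/gif "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/x-icon "access plus 360 days"
ExpiresByType text/css "access plus 1 day"
</IfModule>
```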

Crystal ball says that you forgot to include these lines, placed before all other RewriteRules:

RewriteRule ^robots\.txt - [L]
(not needed if you constrain all your access-control RewriteRules to specified extensions, which can save time)

and most crucially
RewriteRule ^my-403-page\.html - [L]

Also (assuming you have some lockouts using mod_authz-whatever)
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>

flapane
5:38 pm on Sep 23, 2013 (gmt 0)

Thanks for your tips, especially the one on ExpiresDefault.
However, granting robots.txt access to malicious bots will prevent such errors, but it won't explain why I see all those HTTP 500s instead of just HTTP 403s, as happened on the old web hosting.
I can't find any infinite redirect loops, so I really don't know what the cause of those HTTP 500 errors could be.

lucy24
8:29 pm on Sep 23, 2013 (gmt 0)

I can't find any redirect infinite loops

OK, let's go to the long version.

The infinite loops are created by your own 403. It works like this:

--bad robot makes request
--server consults RewriteRule and says Nuh-uh, can't have that, and tries to send back 403 header accompanied by custom 403 page
--server makes internal request for 403 page, still attached to original requesting IP, UA and so on
--server consults RewriteRule and says Nuh-uh, can't have that, and tries to send back 403 header accompanied by custom 403 page
--server makes internal request for 403 page, still attached to original requesting IP, UA and so on
--server ...

et cetera.

See how that works? That's why you need to code an exemption for your 403 page. Make a separate one for each mod that issues a 403-- most likely one for mod_rewrite and another for mod_authz-thingummy. So along with the RewriteRule quoted above, you should also have a

<Files "my-custom-403-page.html">
Order Allow,Deny
Allow from all
</Files>

Some hosts have a built-in error document, for example
/forbidden.html
The server will look in your root for a document by this name. And the config file includes a <Files> or similar section allowing everyone access to the document.

Another host may not have this built-in setup, or may use a different name by default. You're always safe adding your own rules.
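
Pulling these pieces together, a minimal sketch (the 403 page name is hypothetical; use whatever your ErrorDocument actually points at, and put the exemptions before all blocking RewriteRules):

```apache
# let the server reach robots.txt and the custom 403 page unconditionally
RewriteRule ^robots\.txt$ - [L]
RewriteRule ^my-custom-403-page\.html$ - [L]

ErrorDocument 403 /my-custom-403-page.html

# exempt the same files from any Allow/Deny lockouts
<Files "my-custom-403-page.html">
Order Allow,Deny
Allow from all
</Files>
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>
```

With those exemptions in place, the internal subrequest for the 403 page is no longer itself forbidden, so the recursion stops.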

phranque
8:30 pm on Sep 23, 2013 (gmt 0)

you are forbidding access by the Ezooms U-A so that explains the 403.

you'll have to look for the "ErrorDocument 403 ..." directive in your server config to find the source of your 500 error.

you should also change all mod_alias directives (Redirect/RedirectMatch) to mod_rewrite directives since they don't mix well.

your RewriteRules need work: no quoting or backslash-escaping of the substitution URLs is required.
also, in .htaccess the matched URL path never starts with a leading slash, since the per-directory prefix is stripped before the pattern is tested.
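
As a sketch, the two mod_alias lines from the posted file could be converted like this (note the patterns have no leading slash, and the targets have no quoting or escaping):

```apache
# was: redirect 301 /map.php http://www.foo.com/maps/map.php
RewriteRule ^map\.php$ http://www.foo.com/maps/map.php [R=301,L]

# was: RedirectMatch 301 ^/(map(?!pa_area51\.)[^/.]+\.php)$ http://www.foo.com/maps/$1
RewriteRule ^(map(?!pa_area51\.)[^/.]+\.php)$ http://www.foo.com/maps/$1 [R=301,L]
```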


g1smd
9:26 pm on Sep 23, 2013 (gmt 0)

Having used at least one RewriteRule you must use zero Redirect and RedirectMatch rules.

Convert all Redirect and RedirectMatch rules to use RewriteRule syntax.

flapane
5:50 pm on Sep 24, 2013 (gmt 0)

Thank you so much, now everything's clear.
