
Forum Moderators: Ocean10000 & incrediBILL & phranque


Reducing load on webserver (malicious bots): am I missing something?

htaccess bots spam crawlers

10:12 pm on Sep 8, 2013 (gmt 0)

Junior Member from IT 

5+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0

I'm currently dealing with a sudden server load spike (shared hosting server, resource limits managed by CloudLinux).
I know the code on the website better than anyone else: whatever can be cached is cached, whatever can be losslessly optimized is optimized, and I haven't changed the code recently, so the code itself can't be responsible for the high load. Traffic/hits/pageviews are constant. What I did notice is that my daily access logs were full of thousands of requests from malicious bots: Chinese (or sometimes Russian) IPs trying to locate login-admin-manage.php pages, opening hundreds of pages in a few seconds, and so on.
Yesterday I added a few rules to all the .htaccess files, and it seems that this managed to reduce the load (cPanel didn't notice any resource limit warnings in the last 24 hours).
Here are the details:
- A few malicious bots are supposed to respect robots.txt. Actually they don't, but I added them just in case...
- Chinese IP ranges banned using ip2location.com list. I don't get any valuable traffic from there but a few users a year. However, most troubles are coming from there.
- HTTP_USER_AGENT checks that return a 403 to malicious bots
- Disable hotlinking but display the images if HTTP_REFERER is google (I get some interesting traffic from images) or a few other "good" referers.
- Admin pages renamed

Here are the files. They seem to work, but I'd appreciate it if somebody would take a couple of minutes to double-check them and point out anything wrong. If nobody does, these lines may still be useful to other users dealing with sudden high server loads.
As a side note, a .htaccess file with such a huge IP range is over 100kB. That's still far less than the bandwidth wasted by the malicious attempts, though. I wonder whether such a huge list may itself cause high server load on a shared web host.
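On the size concern: Apache's Deny from accepts CIDR notation, so if the ip2location list can be exported as CIDR blocks (an assumption; check the export options on their site), the file can shrink considerably, since one Deny line can carry many ranges. A minimal sketch with made-up ranges:

```apache
# Hypothetical CIDR blocks for illustration only; use the real exported ranges
Order Allow,Deny
Allow from all
Deny from 1.80.0.0/13 27.184.0.0/13 36.96.0.0/11
```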

Root .htaccess
#ban bots, the whole china and stuff
Order Allow,Deny
allow from all
deny from CHINA ip ranges

AddDefaultCharset UTF-8

<IfModule mod_headers.c>
<FilesMatch "\.(js|css)$">
Header append Vary Accept-Encoding
Header set Cache-Control "private"
</FilesMatch>
</IfModule>

RewriteEngine on

#inherit from root htaccess and append at last, necessary in root too
RewriteOptions inherit

#block bad bots
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} 360Spider [OR]
RewriteCond %{HTTP_USER_AGENT} (Access|appid) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Capture|Client|Copy|crawl|curl) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Data|devSoft|Domain|download) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Engine|Ezooms|fetch|filter|genieo) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Jakarta|Java|Library|link|libww) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|nutch|Preview|Proxy|Publish) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (robot|scraper|sistrix|spider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Wget|Win32|WinHttp) [NC]
RewriteRule .* - [F]

#include caching for images
<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType image/gif "access plus 1 week"
ExpiresByType image/jpeg "access plus 1 week"
ExpiresByType image/png "access plus 1 week"
ExpiresByType image/x-icon "access plus 360 days"
ExpiresByType text/css "access plus 1 day"
ExpiresByType text/html "access plus 1 week"
ExpiresByType text/javascript "access plus 1 week"
ExpiresByType text/x-javascript "access plus 1 week"
ExpiresByType application/javascript "access plus 1 week"
ExpiresByType application/x-javascript "access plus 1 week"
ExpiresByType application/x-shockwave-flash "access plus 1 week"
ExpiresByType font/truetype "access plus 1 month"
ExpiresByType font/opentype "access plus 1 month"
ExpiresByType application/x-font-otf "access plus 1 month"
</IfModule>

RewriteCond %{HTTP_HOST} ^nix\.foo\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.nix\.foo\.com$
RewriteRule ^(.*)$ "http\:\/\/www\.foo\.com\/nix\.php" [R=301,L]

Options +FollowSymLinks
RewriteCond %{HTTP_HOST} !^www\.
RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]

Images folder .htaccess
RewriteEngine On

#inherit from root htaccess and append at last
RewriteOptions inherit

#disable hotlinking but allow image bots and users from good search engines
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?foo\.com [NC]
RewriteCond %{HTTP_REFERER} !google\. [NC]
RewriteCond %{HTTP_REFERER} !images.google\. [NC]
RewriteCond %{HTTP_REFERER} !yahoo\. [NC]
RewriteCond %{HTTP_REFERER} !bing\. [NC]
RewriteCond %{HTTP_REFERER} !msn\. [NC]
RewriteCond %{HTTP_REFERER} !ask\. [NC]
RewriteCond %{HTTP_REFERER} !arianna\. [NC]
RewriteCond %{HTTP_REFERER} !yandex\. [NC]
RewriteCond %{HTTP_REFERER} !delta-search\.com [NC]
RewriteCond %{HTTP_REFERER} !search\.findeer\.com [NC]
RewriteCond %{HTTP_REFERER} !search\?q=cache [NC]
RewriteCond %{HTTP_REFERER} !search\/cache [NC]
RewriteCond %{HTTP_REFERER} !cache [NC]
RewriteRule \.(jpg|jpeg|png|gif)$ - [NC,F,L]

Options -Indexes
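As an alternative to the [F] response above, hotlinked requests can be rewritten to a small local placeholder image, which avoids broken-image icons for borderline referers. A sketch only: /images/blocked.png is a hypothetical file you would create yourself, and the extra condition prevents the placeholder from being rewritten in a loop.

```apache
# Avoid rewriting the placeholder itself (would loop otherwise)
RewriteCond %{REQUEST_URI} !/blocked\.png$
# Keep the same HTTP_REFERER conditions as above, then:
RewriteRule \.(jpg|jpeg|png|gif)$ /images/blocked.png [NC,L]
```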

Robots.txt, just in case
User-Agent: *
Allow: /
Disallow: /lastfm/

# Block due to SEO or pseudo-SEO which is not useful to me.
User-agent: AhrefsBot
Disallow: /

User-agent: Ezooms
Disallow: /

User-agent: MJ12bot
Disallow: /

Sitemap: http://www.foo.com/sitemap.xml
Sitemap: http://www.foo.com/bar/sitemap_blog.xml.gz
10:51 pm on Sept 8, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
votes: 390

There's only so much you can do at the "Deny from..." level. Blocking a robot from a page will prevent it from knowing about any non-page files attached to the page, such as stylesheets or images. But it's rare for a robot other than a search engine to request those files anyway.

Malicious robots will ask for nonexistent pages such as the 40 most likely wp-admin filenames, and there's not a thing you can do to stop them.

It's more emotionally gratifying to hit them with a 403, but if you don't say anything they'll get a 404. I don't know if one response or the other is significantly more work for the server: reading and responding to a line in htaccess vs. checking whether a particular file exists. The response itself will probably come out about the same size; it's choosing a response that makes the difference.

It could be argued that redirecting to 127.0.0.1 (or similar, such as redirecting to the asker's own IP) is less work for the server, because all they're sending back is the redirect header. When you return a 403 or 404 the server also has to send back the appropriate error document, even if the robot never bothers to look at it.

Some robots seem to go away faster if you pick one response over another: for example some Ukrainians pack up and leave almost right away when you do the 127.xx redirect. But unless your server is absolutely getting hammered, the mere act of studying your malicious robots and customizing a response is already more work than they deserve ;)
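The loopback redirect described above could be sketched like this (untested; substitute your own bot patterns for the hypothetical ones here):

```apache
# Send matching bad bots back to their own machine instead of serving a 403 body
RewriteCond %{HTTP_USER_AGENT} (scraper|Wget|WinHttp) [NC]
RewriteRule .? http://127.0.0.1/ [R=301,L]
```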

!images.google\. [NC]

Be careful with the [NC] flag. A bad robot is a bad robot no matter how it cases itself. But a good robot only has one correct casing.

The [F] flag implies [L]. And psst!
"http\:\/\/www\.foo\.com\/nix\.php"
can be reduced to
http://www.foo.com/nix.php
You don't need to escape / slashes in mod_rewrite. (Also not in mod_alias and mod_setenvif. Basically, slashes only need to be escaped if the language itself uses /blahblah/ to delimit a Regular Expression, as in javascript and a handful of less common Apache mods.)
11:09 pm on Sept 8, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:July 19, 2013
votes: 0

RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} 360Spider [OR]
RewriteCond %{HTTP_USER_AGENT} (Access|appid) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Capture|Client|Copy|crawl|curl) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Data|devSoft|Domain|download) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Engine|Ezooms|fetch|filter|genieo) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Jakarta|Java|Library|link|libww) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|nutch|Preview|Proxy|Publish) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (robot|scraper|sistrix|spider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Wget|Win32|WinHttp) [NC]
RewriteRule .* - [F]

Quite a few of the above can be edited to break sooner on a non-match.

RewriteCond %{HTTP_USER_AGENT} 360Spider [OR]
RewriteCond %{HTTP_USER_AGENT} A(?:ccess|ppid) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} C(?:apture|lient|opy|rawl|url) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} D(?:ata|evSoft|o(?:main|wnload)) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} E(?:ngine|zooms) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} f(?:etch|ilter) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} genieo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Ja(?:karta|va) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Li(?:brary|nk|bww) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} P(?:r(?:eview|oxy)|ublish) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} robot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} s(?:craper|istrix|pider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} W(?:get|in(?:32|Http)) [NC]
RewriteRule .? - [F]

?: = Non-capturing grouping
.? = A blip more efficient than .* when we're going to check everything requested.
8:54 am on Sept 9, 2013 (gmt 0)

Junior Member from IT 

5+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0

Thanks, I really appreciate your suggestions. Unfortunately, most of the sudden high load was caused by malicious Chinese traffic requesting REAL pages (not only admin pages, but mostly pages a real user would request, especially pages with a photo gallery on them). Redirecting to localhost seems a really good idea. Over the next few days I need to check whether a 120kB .htaccess file causes slowdowns; if it does, I may choose a localhost redirect rather than a 403.
You're right on the [NC] flag. The referers in that list will always use lower case letters.

Thank you, I need to improve my regex skills. This helps save a few resources, which is always positive.
2:48 pm on Sept 9, 2013 (gmt 0)

Junior Member from IT 

5+ Year Member

joined:Mar 17, 2009
posts: 60
votes: 0

Just a note on blocking user agents containing "preview": I've just discovered that there's a new Bing bot called BingPreview/xxx ( [bing.com...] ) which you may or may not want to allow on your pages.
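If you want to keep blocking generic "Preview" agents while letting BingPreview through, one approach (a sketch, not tested against Bing's actual UA string) is a negative condition ahead of the pattern:

```apache
# Exempt BingPreview, then block anything else containing "Preview"
RewriteCond %{HTTP_USER_AGENT} !BingPreview
RewriteCond %{HTTP_USER_AGENT} Preview [NC]
RewriteRule .? - [F]
```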
