Forum Moderators: phranque


htaccess Performance Killers

Help needed with htaccess Performance Killers

EastTexas

10:20 pm on Jan 12, 2014 (gmt 0)

10+ Year Member



I need help with my htaccess performance killers - in some cases I'm blocking innocent, good bots.

I know enough to be dangerous 8)

This works like a charm:

# libww Blocks W3C-checklink/4.81 libwww-perl/5.836

# Bot Blocker [perishablepress.com...]

<IfModule mod_setenvif.c>
SetEnvIfNoCase User-Agent !^(127\.0\.0\.0|localhost) keep_out
# SetEnvIfNoCase User-Agent (de | IE Trident/4.0 BUG) keep_out

SetEnvIfNoCase User-Agent (ar-sa|ee-es|fr|ru|zh-CN) keep_out

SetEnvIfNoCase User-Agent (360spider|80legs|a6-indexer|aboundex|access|ahrefs|appid|archiver|atwatch|auto|babya|baidu|bandit|\
blog|bullseye|classbot|capture|catalog|cfnetwork|chinaclaw|clip|clshttp|client|\
collector|commerce|control|copier|copy|copyscape|copubbot|copyrightcheck|cr4nk|\
craftbot|crawler|curl|darwin|data|deepnet|devsoft|disco|domain|dotbot|download|\
ecatch|elefent|email|emailsiphon|emailwolf|engine|enhancer|exabot|extract|extractor) keep_out

SetEnvIfNoCase User-Agent (ezooms|fetch|flash|filter|flip|free|genieo|getright|go.?is|go!zilla|grab|grabber|\
grapeshot|harvest|httpclient|httrack|ichiro|indy|ipod|jakarta|java|larbin|leacher|\
library|libww|linkdexbot|loader|mail.ru|majestic|master|meanpathbot|missigua|mj12bot|\
moget|mojeekbot|mot-mpx220|mutant|myie2|naver|netants|netscape|netseer|news|newt|\
niki-bot|nikto|ninja|miner|nutch|offbyone|offline|pages|pecl|phantomjs|piranha|pix|proxy) keep_out

SetEnvIfNoCase User-Agent (publish|python|quester|reaper|regbot|rma|sauger|scan|scout|scraper|sistrix|\
siteexplorer|snippets|sogou|spbot|spider|sqworm|stripper|sucker|super|teleport|\
urllib|vampire|voila|webpictures|webspider|webster|wells|wget|whack|win32|winhttp|\
wotbox|widow|wisenutbot|wotbox|wwwoffle|xaldon|y!oasis|yandex|yisou|youdao|yrspider|\
yx|zeus|zip|zoom|zyborg) keep_out

<Limit GET POST PUT>
Order Allow,Deny
Allow from all
Deny from env=keep_out
</Limit>
</IfModule>


This section does not always work, for some odd reason:

*NOTE: I'm having problems with old Firefox UAs trying to scrape &/or hack

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} ^"". [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Comodo\ Spider\." [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} "Firefox/[1-5]\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Firefox/3.6a1pre [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Firefox/4.0b8pre [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/10\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/11\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/12\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/13\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/14\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/15\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/16\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/17\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/19\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/21\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/22\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Media\ Center\ PC\ 5.0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^monzilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SEO\ ROBOT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Windows\ 2000 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Win\ 9x [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Windows\ 98 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Windows\ 95 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Windows\ 3.1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Windows\ 3.11 [NC,OR]

RewriteCond %{QUERY_STRING} (\<|%3C).*iframe.*(\>|%3E) [NC]

RewriteRule ^.* - [F,L]

# Do not put an [OR] at the end of the list - or a [NC,OR] - or you will ban yourself from your site - and everybody else too!


I find that in most cases blocking domains works better for me than an endless list of IP's.

Sample:

# Ukraine (UA)
deny from .ua
deny from 5.248.0.0/16
deny from 31.41.88.0/21
deny from 134.249.0.0/16
deny from 195.242.218.0/23
deny from colocall.net
deny from ddlmega.net
deny from freenet.com.ua
deny from galahost.net
deny from goldentele.com
deny from hostenko.com
deny from imena.ua
deny from ip.net.ua
deny from isphost.com.ua
deny from kiev.ua
deny from kyivstar.net
deny from kyivstar.net.ua
deny from lviv.ua
deny from lan.ua
deny from lan.com.ua
deny from layer6.net
deny from mirohost.net
deny from pautina.ua
deny from shiksabd.com
deny from sovam.net.ua
deny from sovamua.com
deny from steephost.com
deny from steephost.net
deny from sunnet.com.ua
deny from svitonline.com
deny from synapse.net.ua
deny from tapochek.net
deny from thefds.net
deny from triolan.net
deny from uadomen.com
deny from ukrnames.com
deny from ukrservers.com
deny from uaservers.net
deny from ukr.net
deny from ukrindex.com
deny from ukrindex.net
deny from ukrindex.ua
deny from ukrtel.net
deny from ukrtelecom.net
deny from ukrtelecom.com
deny from ukrtelecom.ua
deny from united.net.ua
deny from volia.com
deny from volia.net
deny from xeonn.org


<IfModule mod_setenvif.c>

# Mozilla prior to 4.0
BrowserMatchNoCase ^Mozilla/[0-3] legacy=mozilla

# MSIE prior to 8.0
BrowserMatchNoCase MSIE\D+[0-7]\.[\d.]* legacy=msie

# Firefox prior to 9.0
BrowserMatchNoCase Firefox\D+[0-8]\.[\d.]* legacy=firefox

# Chrome prior to 9.0
BrowserMatchNoCase Chrome\D+[0-8]\.[\d.]* legacy=chrome

# Safari (inc Mobile) prior to 534
BrowserMatchNoCase Safari\D+(?:[0-4]+|\d?53[0-3]\.[\d.]*) legacy=safari

# NOTE: has an Opera Mobile bug, so it stays commented out
# Opera prior to 9.80
# BrowserMatchNoCase Opera\D+(?:[0-8][\d.]*|9\.[0-7]) legacy=opera

# Seamonkey prior to 2.6
BrowserMatchNoCase SeaMonkey\D+(?:[01]|2\.[0-5]) legacy=seamonkey

<Limit GET POST PUT>
Order Allow,Deny
Allow from all
Deny from env=legacy
</Limit>
</IfModule>


Anyone else having problems with digitalocean.com sending a ton of bots? I have blocked them for the most part.

[edited by: incrediBILL at 11:46 pm (utc) on Jan 12, 2014]
[edit reason] line breaks [/edit]

lucy24

11:42 pm on Jan 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<tangent>
# libww Blocks W3C-checklink/4.81 libwww-perl/5.836

I had the identical problem, and got tired of commenting-out the line every time I ran the link checker. Solution:

SetEnvIf Remote_Addr ^128\.30\.52 !keep_out

</tangent>

I suspect, but don't know for sure, that putting each of those SetEnvIf directives on a separate line is more efficient. The question is whether the time spent compiling the Regular Expression-- which has to be done on every single request in htaccess --is more or less than the time saved by not searching the UA string for additional hits after there has been a match. Putting it all on separate lines is definitely easier to read.
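
For illustration, the split-out version would look something like this (just a few tokens from the list above; whether it is actually faster is exactly the open question):

# one token per directive: each tiny pattern is trivial to compile,
# but Apache still evaluates every directive on every request
SetEnvIfNoCase User-Agent 360spider keep_out
SetEnvIfNoCase User-Agent 80legs keep_out
SetEnvIfNoCase User-Agent ahrefs keep_out
SetEnvIfNoCase User-Agent httrack keep_out
SetEnvIfNoCase User-Agent wget keep_out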

mod_setenvif has a useful shorthand:
BrowserMatch = SetEnvIf User-Agent
BrowserMatchNoCase = SetEnvIfNoCase User-Agent

In fact this is used later in the same htaccess, confirming the impression that this is a cut-and-paste job from various sources.

RewriteCond %{HTTP_USER_AGENT} "Firefox/10\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/11\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/12\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/13\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/14\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/15\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/16\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/17\." [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Firefox/19\." [NC,OR]

What the bleep? Why not simply Firefox/1[0-79]? And why isn't this with all the SetEnvIf or BrowserMatch directives? mod_rewrite is your heavy artillery. Save it for things that can't be handled any other way. (This group is even smaller in Apache 2.4 with the new <If> construction than in Apache 2.2.)
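
For the curious, both alternatives in sketch form -- untested, and the second assumes Apache 2.4:

# mod_setenvif, one line for the whole legacy-Firefox list
# (1[0-79] = 10-17 or 19; 2[12] covers the 21 and 22 also listed)
BrowserMatchNoCase Firefox/(1[0-79]|2[12])\. keep_out

# Apache 2.4 only: the <If> construction, no mod_rewrite at all
<If "%{HTTP_USER_AGENT} =~ m#Firefox/(1[0-79]|2[12])\.#">
Require all denied
</If>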

deny from colocall.net

Do you really want to do this? The moment you use a single non-IP argument in mod_authz-thingummy, your raw logs become unreadable because the leading 12.34.56.78 is now expressed as a resolved name. Everywhere, not just on the affected request. And the server has to look it up somewhere. (Possibly just in its own database compiled earlier in the day, but still.)
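
If the lookups (and the unreadable logs) put you off, you can keep the Deny list purely numeric -- a sketch reusing CIDR ranges already posted above:

# IP/CIDR arguments are compared against the raw address:
# no reverse DNS, and raw logs keep their numeric IPs
deny from 5.248.0.0/16
deny from 31.41.88.0/21
deny from 134.249.0.0/16
deny from 195.242.218.0/23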

<IfModule mod_setenvif.c>

Get rid of all IfModule envelopes. They're a mark of boilerplate htaccess; once you're on your own site, you either have the module or you don't. In the case of mod_rewrite or mod_setenvif, the possibility that you don't have the module (or don't have use of it in htaccess) simply doesn't bear thinking about ;)

Do you eventually do something with all those "legacy" environmental variables? If their sole use is to trigger a "Deny from" directive, there's no reason to set them to a value.
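
In other words, this is enough (a minimal sketch, assuming the variables are only ever tested with "Deny from env="):

# no value needed: "Deny from env=legacy" only asks whether
# the variable exists, not what it holds
BrowserMatchNoCase ^Mozilla/[0-3] legacy
BrowserMatchNoCase MSIE\D+[0-7]\.[\d.]* legacy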


Uhm. What was the question again?

Edit:
htaccess Performance Killers

If you are concerned purely about the size of an htaccess file, don't be. The two concerns are

#1 what a directive does-- which is generally the same in htaccess as in config, with some crucial exceptions like the act of compiling a Regular Expression

#0 (really) the mere existence of .htaccess files

If your htaccess is in the multi-megabyte range there might be cause for concern. But other than that, the main performance hit comes from the server having to look for htaccess files, as happens whenever AllowOverride is set to anything other than None.
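
If you also control the server config, that one is fixed outside htaccess entirely -- a sketch for httpd.conf, with a placeholder path:

# with AllowOverride None, Apache stops looking for .htaccess
# in this directory and everything below it
<Directory "/var/www/example">
AllowOverride None
</Directory>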

incrediBILL

11:55 pm on Jan 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Black lists are a waste of time.

Also, if you want to get rid of performance killers, move it into PHP where you can use MySQL, or move it out of .htaccess in Apache and use DBM files with RewriteMaps. Either way is faster than big long linear lists of junk in Apache.
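
A rough sketch of the RewriteMap route -- this lives in the server or vhost config, not .htaccess, and the file names are made up for illustration:

# server/vhost config only; RewriteMap is not allowed in .htaccess
# build the binary map with: httxt2dbm -i blocked.txt -o blocked.map
RewriteEngine On
RewriteMap blocked dbm:/etc/apache2/maps/blocked.map

# keyed lookup on the client IP: one hash probe instead of a long
# linear list of conditions; "0" is the default on a miss
RewriteCond ${blocked:%{REMOTE_ADDR}|0} !=0
RewriteRule ^ - [F]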

I find that in most cases blocking domains works better for me than an endless list of IP's.


Except then you have a never-ending list of domains. Blacklisting is playing whack-a-mole when it comes to bot blocking - a new mole always pops up, and it's a big old time suck. Instead, set up a reverse list of user agents to allow, like Googlebot, Bingbot, and a few browser user agents, and allow just those; you'll have a really short and fast list.
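
A bare-bones sketch of the whitelist idea in mod_setenvif terms -- the allow list here is illustrative only, and of course anything can fake a browser UA:

# nothing gets in unless the UA matches a short allow list
BrowserMatchNoCase googlebot let_in
BrowserMatchNoCase bingbot let_in
BrowserMatchNoCase (firefox|chrome|safari|opera|msie) let_in

Order Deny,Allow
Deny from all
Allow from env=let_in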

If you don't know how to set up browser whitelisting, one method is to use browscap.ini for PHP (http://tempdownloads.browserscap.com/) and just look up the user agent to see whether it's flagged as a bot, allowing only those identified as browsers. This could be done in PHP, or called from Apache using RewriteMap's external rewriting program to include a script, but that's not available in .htaccess, only the conf file.

To avoid the big list of IP addresses in .htaccess, get the free GeoLite database by MaxMind and make a simple API call with the visitor IP address; it returns a country code, and if the country code matches "UA" you block it. It's simple to implement in PHP, and can be used to globally protect all PHP and HTML files, images, etc. if you want.
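
If your host happens to have MaxMind's mod_geoip module loaded (an assumption -- the PHP route needs no module), the same country test can stay in Apache, along the lines of the module's documented usage:

# needs mod_geoip plus the GeoLite country database
GeoIPEnable On
GeoIPDBFile /usr/share/GeoIP/GeoIP.dat

# the module exposes the country code as an environment variable
SetEnvIf GEOIP_COUNTRY_CODE UA keep_out
Order Allow,Deny
Allow from all
Deny from env=keep_out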

Anyway, good luck with it.

EastTexas

3:22 am on Jan 13, 2014 (gmt 0)

10+ Year Member



I'm a Front-end designer beginning to play in the back-end a little.

I'm still trying to wrap my head around this concept:

Why is blocking a domain slower than a list of IP's?
deny from kyivstar.net

Just for the record - all this ISP does is send hackers & spammers to my small site!

lucy24

4:13 am on Jan 13, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Why is blocking a domain slower than a list of IP's?

The IP is right there in the request. In fact it's the very first thing that comes through. A hostname has to be looked up; the requester (whether browser or robot) doesn't say "Hi! I live at Hetzner and I'd like to come in."

Imagine yourself as a human checking the various things: Does the IP start in 12.34? Does the user-agent contain "Bork-Edition"? Is the referer www.robotstxt.org? Does the request come from Telus? Each of those things takes a certain amount of time to find out. Some take longer than others. For the server, the total time involved is less by many orders of magnitude-- but proportionally some things still take longer than others.

EastTexas

4:41 am on Jan 13, 2014 (gmt 0)

10+ Year Member



Thanks everyone for the answers, I have much work ahead of me 8)

Any recommended sites that will help me fix my Frankenstein monster?

No torches please ;}

lucy24

7:24 am on Jan 13, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You want a site that's better than this one? ;)

:: pause here for shoe to drop as I realize the question I've just been answering in another thread is also from you, which must convey a very distorted idea about how this forum operates ::

If you sneak over to That Other Forum, they're more likely to write your code for you-- a practice WebmasterWorld frowns on-- and there's a better than 50% chance it will be correct. But really all you need is a text editor.

:: wandering off to compose boilerplate on how to clean up an htaccess file, mainly because past experience* suggests that if I put a lot of time into this, all questions on the subject will dry up within a few months ::


* I've got some perfectly lovely boilerplate on the redirect-to-rewrite two-step, and another one about query strings, and I haven't had to deploy either one in ages!

EastTexas

7:57 am on Jan 13, 2014 (gmt 0)

10+ Year Member



I don't want someone to write the code for me.
All I need is a good example, like jQuery has.

I am creative, but I also like to hand code using CoffeeCup.
I went to it out of self defense because Dreamweaver kept messing with my code.

lucy24

8:41 am on Jan 13, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, here is my boilerplate anyway. The cat (15 pounds) is parked in my lap, so I am stuck at the computer.

First draft. If g1smd or someone like him says I've blundered-- probably something trivial like omitting a "not" --pay attention.

Cleaning up an htaccess file

Step 1: Organize. Collect all the directives for each module in one place. The server doesn't care, but you-- and anyone who comes along after you-- will appreciate it.

Tip: Use a text editor with a "Find All" window to pull up all lines beginning with the element "Rewrite..." That takes care of mod_rewrite; dump them all at the end for now.

Step 2: Get rid of all <IfModule> envelopes. Not their contents, just the envelopes themselves. These envelopes are hallmarks of mass-produced htaccess files that have to work anywhere, on any server. You are now on your own site. Any given mod is either available to you or it isn't.

Step 3: Sort by module. The server doesn't care what order the directives are listed in, or even if rules from different modules are all garbled together. Each module works separately, seeing only its own directives. But humans need to be able to find things.

For most people it will be most practical to group one-liners at the beginning:

Options -Indexes


is a good start. If your htaccess file contains only one line, that's probably it. Other quick directives are ones starting with words like AddCharset or Expires. Then list your error documents.
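
So the top of a tidied file might open something like this (directives picked for illustration; substitute your own):

# -- one-liners --
Options -Indexes
AddDefaultCharset utf-8

# -- error documents --
ErrorDocument 403 /forbidden.html
ErrorDocument 404 /missing.html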

If you have any very short Files or FilesMatch envelopes, put them near the top too. For example:
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>

<FilesMatch "\.(css|js)">
Header set X-Robots-Tag "noindex"
</FilesMatch>


Be sure to have an "Allow from all" envelope for your custom 403 page. If you are on shared hosting and they provide default error-document names such as "forbidden.html", this has probably already been done in the config file. But it does no harm to repeat it.
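
For a custom 403 page called forbidden.html, the envelope is just (use your own file name):

# let everyone fetch the 403 page itself; otherwise a blocked
# visitor gets a 403 on the 403 and sees the bare server error
<Files "forbidden.html">
Order Allow,Deny
Allow from all
</Files>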

Step 4: Consolidate redirects.

Step 4a: Get rid of mod_alias. If your htaccess file contains any mod_rewrite directives, it can't use mod_alias (Redirect... by that name), or things may happen in the wrong order. For large-scale updating, use these Regular Expressions, changing \1 to $1 if that's what your text editor uses. Each of these can safely be run as an unsupervised global replace.

# change . to \. in pattern
^(Redirect \d\d\d \S+?[^\\])\.
TO
\1\\.

# now change Redirect to Rewrite
^Redirect(?:Match)? 301 /(.+)
TO
RewriteRule \1 [R=301,L]

# and if needed
^Redirect(?:Match)? 410 /(.+)
TO
RewriteRule \1 - [G]

^Redirect(?:Match)? 403 /(.+)
TO
RewriteRule \1 - [F]
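
A worked example of what the two passes do to a single line (paths hypothetical):

# before (mod_alias):
# Redirect 301 /old-page.html http://www.example.com/new-page.html
# after the dot-escaping pass and the Redirect-to-Rewrite pass:
RewriteRule old-page\.html http://www.example.com/new-page.html [R=301,L]
# only the pattern needs its dot escaped -- the target is a plain
# string -- and htaccess patterns drop the leading slash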


Step 4b: Sort your RewriteRules. At the beginning is the single line

RewriteEngine on


A RewriteBase is almost never needed; get rid of any lines that mention it. Instead, make sure every target begins with either protocol-plus-domain or a slash / for the root.

Sort RewriteRules twice.

First group them by severity. Access-control rules (flag [F]) go first. Then any 410s (flag [G]). Not all sites will have these. Then external redirects (flag [R=301,L] unless there is a specific reason to say something different). Then simple rewrite (flag [L] alone). Finally, there may be a few rules without [L] flag, such as cookies or environmental variables.

Function overrides flag. If your redirects are so complicated that they've been exiled to a separate .php file, the RewriteRule will have only an [L] flag. But group it with the external redirects. If certain users are forcibly redirected to an "I don't like your face" page, the RewriteRule will have an R flag. But group it with the access-control [F] rules.

Then, within each functional group, list rules from most specific to most general. In most htaccess files, the second-to-last external redirect will take care of "index.html" requests. The very last one will fix the domain name, such as with/without www.
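
Putting 4a and 4b together, the tail of the external-redirect group usually looks something like this, with example.com standing in for your domain:

# redirect direct requests for index.html to the directory
RewriteCond %{THE_REQUEST} \ /+([^?\ ]*/)?index\.html[?\ ]
RewriteRule (^|/)index\.html$ http://www.example.com/%1 [R=301,L]

# canonical hostname: the very last external redirect
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]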

Leave a blank line after each RewriteRule, and put a
# comment

before each ruleset (Rule plus any preceding Conditions). A group of closely related rulesets can share an explanation.

Step 5: Notes on error documents.

Reminder: ErrorDocument directives must not include a domain name, or else everything will turn into a 302 redirect. Start each one with a / representing the root.
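
Side by side, with placeholder file names:

# right: local path, and the original error status is kept
ErrorDocument 404 /missing.html
# wrong: full URL -- Apache sends a 302 redirect to it instead
# ErrorDocument 404 http://www.example.com/missing.html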

Caution: Since each module is an island, any module that can issue a 403 must have its own error-document override. "Allow from all" covers mod_authzzzz. If you have RewriteRules that end in [F], make sure your 403 documents can bypass these rules.
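
The mod_rewrite half of that looks something like this, assuming a 403 page at /forbidden.html (the UA tokens are placeholders):

# exempt the error document itself, or blocked visitors land in
# a 403-on-the-403 and get the server's bare error message
RewriteCond %{REQUEST_URI} !^/forbidden\.html$
RewriteCond %{HTTP_USER_AGENT} (badbot|scraper) [NC]
RewriteRule ^ - [F]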

g1smd

10:00 pm on Jan 13, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A quick proof-read of the above didn't expose any glaring errors. Good job!

EastTexas

11:33 pm on Jan 13, 2014 (gmt 0)

10+ Year Member



It Works! But how does it look?

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} ^.*(360spider|80legs|a6-indexer|aboundex|access|ahrefs|appid|archiver|atwatch|auto|babya|baidu|bandit|blog|\
bullseye|classbot|capture|catalog|cfnetwork|chinaclaw|clip|clshttp|client|collector|\
commerce|control|confusion|copier|copy|copyscape|copubbot|copyrightcheck|cr4nk|\
craftbot|crawler|curl|darwin|data|deepnet|devsoft|disco|domain|dotbot|download|\
ecatch|elefent|email|emailsiphon|emailwolf|engine|enhancer|exabot|extract|extractor|\
eyenetie|ezooms|fetch|flash|filter|flip|free|genieo|getright|go.?is|go!zilla|grab|\
grabber|grapeshot|harvest|httpclient|httrack|ichiro|indy|ipod|jakarta|java|kkman|\
ktxn|larbin|leacher|library|libww|linkdexbot|loader|mail.ru|majestic|master|\
meanpathbot|missigua|mj12bot|moget|mojeekbot|mot-mpx220|mutant|myie2|naver|netants|\
netscape|netseer|news|newt|niki-bot|nikto|ninja|miner|nutch|\
offbyone|offline|pages|pecl|phantomjs|piranha|pix|proxy|publish|python|quester|\
reaper|regbot|rma|sauger|scan|scout|scraper|sistrix|siteexplorer|sitesnagger|snippets|\
sogou|spbot|spider|sqworm|stripper|sucker|super|teleport|urllib|vampire|voila|\
webpictures|webspider|webster|wells|wget|whack|win32|winhttp|wotbox|widow|win98|\
wisenutbot|wotbox|wwwoffle|xaldon|y!oasis|yandex|yisou|youdao|yrspider|yx|zeus|zip|\
zoom|zyborg).*$ [NC]

RewriteRule . - [F,L]


This, though, simply did NOT work, for some odd reason:
RewriteCond %{HTTP_USER_AGENT} "Firefox/1[0-79]" [NC,OR]

[edited by: bill at 5:09 am (utc) on Jan 14, 2014]
[edit reason] fixed side-scroll [/edit]

lucy24

3:14 am on Jan 14, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{HTTP_USER_AGENT} ^.*(360spider|80legs|
<snip>
<snip>
<snip>
|zoom|zyborg).*$ [NC]

RewriteRule . - [F,L]

Simply did NOT work


One of these days I will run up a piece of boilerplate specifically addressing the phrase "does not work" ;)

Server crashed? Everyone including yourself got a 403? Everyone including the sogouspider waltzed on in? All requests got redirected to /cutestuff/cats.html?

Give us a hint.

Incidentally, [F,L] does no harm, but it isn't necessary. Certain flags including [F] and [G] carry an implied [L]. (Note that [R] does not!)

RewriteCond %{HTTP_USER_AGENT} ^.*(blahblah

The formulation
^.*
is never necessary unless you're capturing. Just leave off both the opening anchor and the .* --and similarly at the end with
.*$
Here you're simply checking for "if any of the following are included in the UA string".
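
Side by side, with the first two tokens standing in for the whole list:

# instead of:
# RewriteCond %{HTTP_USER_AGENT} ^.*(360spider|80legs).*$ [NC]
# the unanchored pattern makes the same "contains" test:
RewriteCond %{HTTP_USER_AGENT} (360spider|80legs) [NC]
RewriteRule . - [F]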

email|emailsiphon|emailwolf

The string "email" is contained within the strings "emailsiphon" and "emailwolf". Since there are no word-break anchors, you don't need all three. There may be other examples of the same issue in your list. But I won't know until the next passing moderator comes along and delivers a truckload of hard line breaks. (Which browser inserts line breaks automatically? I checked once to make sure it wasn't just Camino being old-fashioned. None of my browsers will split a word.)

EastTexas

3:38 am on Jan 14, 2014 (gmt 0)

10+ Year Member



Good Catch! Now I have two new fangs: "siphon wolf" ;}