Forum Moderators: phranque


htaccess Questions


lanesharon

7:51 pm on Dec 31, 2011 (gmt 0)

10+ Year Member



Right now, I am waiting for my monthly bandwidth to renew, thanks to 10 hits a second, all day, by one 'so-called' search engine. I need help stopping them. A sample of my htaccess:
Options +FollowSymlinks
RewriteEngine on

<Files .htaccess>
Order allow,deny
Deny from all
</Files>

#Block by IP
RewriteCond %{REMOTE_HOST} 62.212.73.211 [OR]

# Block By User Agent UA
RewriteCond %{HTTP_USER_AGENT} AhrefsBot/2.0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FLR [NC]

# Block By Browser and User Agent
RewriteCond %{HTTP_USER_AGENT} ^Mozilla\.*Indy [NC]

#Block By Country IP
order allow,deny
allow from all

deny from 2.132.0.0/14 31.11.43.0/24


This has been put together from different tutorials on the web, and I am not sure whether it will work correctly (still waiting for my bandwidth to renew).

My questions are:

1. The first Deny, on the .htaccess file itself, is only there to stop people from reading its contents?
2. The [OR] inside the brackets lets me continue and add more RewriteCond statements? If so, would the last entry I have under 'Block By User Agent' stop the processing of the entire .htaccess file, so that the later entries (browser UA, denies) are never read?
3. Can someone explain the different formats for UA (straight up, wildcard, starting with characters)?
4. How do I decipher the UA from my logs? For example:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

I entered the name Baidu and that does not seem to work.

And my last, most important question: when I enter a UA in my .htaccess file, will I still see an entry in my logs for that UA if it tries to access my website?

I really need some help learning, and so much of this is fragmented on the web. There seems to be no thorough primer on how to do this, with complete explanations of why you do things certain ways. There are only examples, with no thorough explanation. So I am asking you to PLEASE help me learn.

lucy24

6:40 am on Jan 1, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Unless you have the worst host in the entire world, you do not need to mention htaccess at all. Nobody, but nobody, is allowed to see your htaccess file. They don't have to read it like robots.txt, and they don't have the option of ignoring it; .htaccess functions silently. What you do need is a Files or FilesMatch directive for robots.txt and for your custom 403 page if you've got one, saying Allow from all.
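For instance, the exceptions might look something like this (an untested sketch; "403.html" stands in for whatever your custom 403 page is actually called):

<Files "robots.txt">
Order allow,deny
Allow from all
</Files>

<Files "403.html">
Order allow,deny
Allow from all
</Files>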

You don't need mod_rewrite to block by IP alone, or by any other single condition. That's what your core Deny from... directives are for. And don't waste time blocking by exact aa.bb.cc.dd address unless there is an especially loathsome robot living splat in the middle of a range of normal attractive humans. (Got a vague idea RoadRunner never met a customer they didn't like. That kind of thing.)
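For example, instead of the single machine in your sample, something like this takes out its whole neighbourhood (the /16 here is purely an illustration, not a measured recommendation for that particular network):

Order allow,deny
Allow from all
Deny from 62.212.0.0/16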

There hasn't been any recent discussion of the overall structure of your htaccess file, as opposed to the arrangement of directives within any one module. Mine goes (from memory, so don't quote me on exact spelling or punctuation, or your server may explode)
____

Options -Indexes

<Files ...>
{various allow-only exceptions here}
</Files>

SetEnvIf blahblah keep_out
{for some extremely simple conditions that don't need mod_rewrite}

Order allow,deny
Deny from env=keep_out
Deny from aa.bb.cc.dd
{all the core directives using IP or environment alone, separate line for each Deny from}

RewriteEngine On

... and from here on you do the more specific or conditional stuff.
____
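Filled in with concrete values, purely as an illustration (keep_out is an arbitrary label, and the IP is the one from your own sample), that skeleton might look like:

Options -Indexes

<Files "robots.txt">
Order allow,deny
Allow from all
</Files>

SetEnvIf User-Agent "FLR" keep_out

Order allow,deny
Allow from all
Deny from env=keep_out
Deny from 62.212.73.211

RewriteEngine On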

How do I decipher the UA from my logs? For example:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
I entered the name Baidu and that does not seem to work.

Baidu alone will work perfectly well if you use it in the right place with the right syntax. "I entered..." does not give an awful lot of information. There are thousands and thousands of UAs, so you don't want to get too specific.
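For instance, something along these lines (untested) ought to do it:

RewriteCond %{HTTP_USER_AGENT} Baidu [NC]
RewriteRule .* - [F]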

And, my last, most important question. When I entered a UA in my htaccess file, will I still see an entry in my logs for that UA if it tries to access my website.

Yes, it will be listed with "403" followed by the approximate size (in bytes) of your "forbidden" document, instead of "200" followed by the size of the file it was aiming for. The exact amount of information in both your access logs and your error logs depends on the host, but you should certainly expect to see 403s. My access logs list all 2xx, 3xx and 4xx, but of the 5xx range only 500; the error logs list all 4xx and 5xx. Your mileage may vary.
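For example, a blocked request in the widespread "combined" log format might look something like this (the IP address, date, and byte count are made up for illustration):

12.34.56.78 - - [01/Jan/2012:12:00:00 +0000] "GET /page.html HTTP/1.1" 403 1234 "-" "FLR-Bot/1.0 (First Life Research LTD, www.firstliferesearch.com, mailto: botadmin@firstliferesearch.com)"

The status code (403) and byte count come right after the quoted request line, and the User Agent is the last quoted string.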

keyplyr

9:54 am on Jan 1, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My advice is to read up on and learn RegEx (regular expressions), which is the notation you will mostly be using in your .htaccess file. Do some internet searches, find a few good RegEx sites, and save them for reference.

The Apache forum (the proper place for this question) and the Webmaster General forum here at WW will help too, and sometimes your host may have code snippets available somewhere on their site.

Be careful about cutting and pasting someone else's code, which may contain mistakes or be too heavily coded for simple tasks.

g1smd

11:25 am on Jan 1, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond and Deny are handled by completely different Apache modules (mod_rewrite and the access-control module, mod_authz_host on Apache 2.2). They are processed separately.

After the list of RewriteCond conditions you need a RewriteRule to actually do something.
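For example, a complete condition-plus-rule pair might read:

RewriteCond %{HTTP_USER_AGENT} ^FLR [NC]
RewriteRule .* - [F]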

lanesharon

5:33 pm on Jan 1, 2012 (gmt 0)

10+ Year Member



I wish I could repost this in the proper forum here, but I get an error that the topic has already been posted. BTW, the rewrite rule is at the end of the rewrites, prior to the denies, and it is:
RewriteRule ^.* - [F,L]

g1smd

5:49 pm on Jan 1, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Make that rule:

RewriteRule .* - [F]

lanesharon

6:50 pm on Jan 1, 2012 (gmt 0)

10+ Year Member



g1smd I appreciate the suggestion, but could you explain the difference so that I can learn? Thank You.

lucy24

8:59 pm on Jan 1, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: detour to see if mods can be persuaded to merge this thread into its apache sibling, since it definitely belongs over there ::

Bookmark this page:

[httpd.apache.org...]

Don't try to read and assimilate every word of it all at once. The parts you need are RewriteRule and RewriteCond.

mod_rewrite works on a "two steps forward, one step back" system. That means: Each request moves through mod_rewrite, looking only at the Rules. If the Rule potentially fits the request-- for example, if it's a request for an html page, and the Rule includes \.html$ --then and only then will mod_rewrite stop and look at the Conditions listed immediately before the Rule. Each Rule has its own Conditions.

If the request also meets the extra conditions-- for example, the referer is such-and-such-- then the Rule will execute. Normally each Rule ends in [L], so that's the last you will see of mod_rewrite. Unless it's a redirect; then the whole thing starts over again from the top.

The flags [F] and [G] (also [P], but don't mess with that one) imply [L], so the extra flag is optional. But it does no harm. It's safest to include [L] in all RewriteRules, unless you have some specific reason to leave it out.
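For instance, a redirect rule carrying an explicit [L] (the domain and filenames here are placeholders):

RewriteRule ^oldpage\.html$ http://www.example.com/newpage.html [R=301,L]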

If your Rule specifies .* --which you'd normally try to avoid, but we can get to that later-- you don't need an opening anchor ^ or a closing anchor $. By default, Regular Expressions are greedy. (Technical term. Really. The opposite is "lazy", also called "reluctant".) That means they start as soon as they can, and go on as long as they can.
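A quick illustration: matched against the string abc.html.html, the greedy pattern ^.*\.html swallows as much as it can and stops only at the final .html, while the lazy version ^.*?\.html gives up as early as it can and stops after the first .html.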

lanesharon

12:20 am on Jan 2, 2012 (gmt 0)

10+ Year Member



So, on the RewriteRule, the ^ (caret/circumflex) is an opening anchor that says 'the rule starts here'. And, the $ is telling me that 'the rule stops here', normally.

In a rewrite conditional expression (RewriteCond), the ^ says that only the very beginning of the string is going to be looked at in any expression.

So, in this user agent string found in my logs:
FLR-Bot/1.0 (First Life Research LTD, www.firstliferesearch.com, mailto: botadmin@firstliferesearch.com)

The entry that I had for the RewriteCond:
RewriteCond %{HTTP_USER_AGENT} ^FLR [NC,OR]

Should have 'activated' the actions of the RewriteRule:
RewriteRule ^.* - [F,L]

Looking at this page for some of this info:
[zytrax.com...]

lucy24

3:54 am on Jan 2, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So, on the RewriteRule, the ^ (caret/circumflex) is an opening anchor that says 'the rule starts here'. And, the $ is telling me that 'the rule stops here', normally.


Aaaack! Not the rule. And not the condition. The pattern-- the part you're matching against something else.

In the Rule, the pattern can only match against the request: the "meat" of the url, ignoring the domain and the query (if any). In Conditions, you can match against other things, like referers or query strings.

Anchors have exactly the same meaning in Rules and Conditions, because they have nothing to do with mod_rewrite. They are part of Regular Expressions syntax. There's a link somewhere on Apache's mod_rewrite page, and also in the Forums Charter. Currently it's the very last thing in the "links" section. Or look up "Regular Expressions" in the search engine of your choice and find a page that you can get along with.

^FLR means "the part you're matching against-- in this case, the complete user-agent string-- must begin with FLR". Without the ^ anchor it would mean "must contain FLR".

^.* means "the part you're matching against-- in this case, the middle of the URL-- must begin with something or nothing". Without the anchor it would mean "must contain something or nothing". See why the anchor doesn't make any difference?

The entry that I had for the RewriteCond:
RewriteCond %{HTTP_USER_AGENT} ^FLR [NC,OR]

Should have 'activated' the actions of the RewriteRule:
RewriteRule ^.* - [F,L]

Ouch! What did it actually do? On my setup, the exact wording

RewriteCond %{HTTP_USER_AGENT} ^FLR [NC,OR]
RewriteRule ^.* - [F,L]

would have led to a 500 error, because you can't have [OR] at the end of your last (or only) Condition. Oops, you asked about that in your first post and I missed it.

The OR, within the brackets, allows me to continue and add more rewrite statements?

Ouch again. The [OR] flag is used with Conditions to mean "either this condition applies, or the next condition applies". The [OR] operator is a last resort when you can't use pipe-separated options in a single line. Your favorite Regular Expressions tutorial will explain about pipes. They look like this:
(php|html?)$
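So, for instance, the three user-agents from your original file could be collapsed into a single condition (untested; the anchors are dropped here for simplicity):

RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|FLR|Indy) [NC]
RewriteRule .* - [F]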

lanesharon

4:48 am on Jan 2, 2012 (gmt 0)

10+ Year Member



Okay, let me take it one line at a time:
The entry that I had for the RewriteCond:
RewriteCond %{HTTP_USER_AGENT} ^FLR [NC]

This is in a string of [OR] lines, so the OR is not a problem. I am used to if/else statements, so I will explain based on that.

Start Process
If the UA has the letters FLR in the very beginning, (case insensitive)
then serve up the 403 Forbidden page (send the HTTP 403 response)
End the rewrite

Having said it this way, I understand all but the ^.* - part of it.

lucy24

9:09 am on Jan 2, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Remember: two steps forward, one step back.

So the process starts here:

RewriteRule .* - [F,L]

mod_rewrite stops and asks "Does the current request fit into this pattern?" Answer: Yes, it does. Since .* means "anything or nothing", all requests match, including null requests (the ones for the top-level index page).

Now that we've established that the rule might apply, mod_rewrite steps back and looks for any conditions attached to this rule. If the conditions are also met, then execute the rule.

The rule itself is:

Given a request of form .* (i.e. any request at all)
do not change it into anything else (that's what the - means)
but kick the request out the door (F = Forbidden, and L = Last rule for good measure).

It is rarely necessary to use .* by itself. This form forces mod_rewrite to stop and evaluate every single request it ever gets: not only the pages that a user asks for, but all associated files such as images, stylesheets, external javascript, includes, et cetera et cetera. It is very rare for a robot to ask for anything but pages-- partly because it doesn't know the other files exist if it hasn't seen the page that asks for them. So you can save a lot of processor time by saying

RewriteRule \.html?$ - [F,L]

or \.php or \.jsp or whatever extension you actually use. (You generally don't need to block requests for nonexistent files, because that's what the 404 is for.) If someone comes in asking for a file ending in .css or .jpg, you can assume they've already got permission to be there. Unless they're hotlinking-- and that's an entirely different routine.

If you do not understand what \.html?$ means, your chosen Regular Expressions reference site will walk you through it.

* * *

Then again, you could dump it on mod_setenvif, in conjunction with the core, and simply say

BrowserMatch FLR keep_out {or bad_bot or whatever label you choose}

and let a collective
Deny from env=keep_out
deal with the whole thing. I do it this way if the user-agent contains some very simple and distinctive word, so all you have to type is three words and you're done. Covario, Clipish, FairShare, that kind of thing.
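Put together, that approach might look like this (again, keep_out is whatever name you choose, and the user-agents are the ones from earlier in this thread):

BrowserMatchNoCase FLR keep_out
BrowserMatchNoCase AhrefsBot keep_out

Order allow,deny
Allow from all
Deny from env=keep_out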

lanesharon

4:19 pm on Jan 2, 2012 (gmt 0)

10+ Year Member



lucy, Thank you so much for your help. Your guidance has helped me to remove the errors I was encountering. I will look into your suggestions. Do you think that this is a good tutorial to teach me that, step by step? - [thesitewizard.com...]

May I ask you to help me, and others, walk through one last step? Can you dissect two log entries for me and tell me the elements in them (i.e., UA, browser, etc.)?
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51


Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)


Also, I was able to find a couple of online regex references, but I wish I could find a printed book, since staring at a screen is not kind to my eyes. Do you know of a good book that I can purchase?

Again, lucy, Thank You So Much

lucy24

9:25 pm on Jan 2, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Retroactive Edit:

For log-reading purposes, "User Agent" and "browser" are the same thing. A browser is what a human uses to visit the Internet; a user-agent is a broader term that includes non-humans. The last set of quotation marks in your logs encloses the entire User Agent, otherwise known as %{HTTP_USER_AGENT}.

* * *

Well, there seems to exist a book called "Regular Expressions for Dummies", though I don't know if it's any good ;) When you're looking for books, they do not have to be published this year (er, I mean 2011). Regular expressions don't change, though they may add dialects ("flavors" is the technical term) to go with new programming languages.

I've had this site

[regular-expressions.info...]

bookmarked for ages. There's other stuff on the site, but by the time I found it, the Reference page was what I needed most. And after about the 20th visit to the page, I printed it out and kept it next to the computer.

:: detour here to mod_rewrite page to see how long it would be if I printed that out ::

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51

means "Chrome is wilfully trying to annoy you by including the exact word 'Safari' in its UA string so if you're sorting by user-agent you have to pull out Chrome first". Oops, that wasn't what you asked. The key phrase here is "Google Web Preview". Unlike visits from normal robots, previews will generally be accompanied by requests for all associated files, just like a human visit. There are Forums threads about Preview. If you are small, Previews are generated on the fly. For bigger sites they are cached. The same seems to apply to the various forms of Google Translate.

To block by User-Agent you have to say (at a minimum)

Web\ Preview


Notice that the space in the middle is escaped with a backslash. I think this is the only RegEx detail that is specific to mod_rewrite: since a space has syntactic meaning in a directive, any literal spaces in a pattern have to be escaped. Unless, haha, you are testing to see whether your custom 500 page works as intended.
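So in context, the whole block might read something like (untested):

RewriteCond %{HTTP_USER_AGENT} Web\ Preview
RewriteRule \.html?$ - [F]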

There will not-- or, ahem, should not-- ever be any spaces in your URL. (We don't need to talk about spaces in query strings just yet.) But there will be lots of spaces in the User Agent. Matter of fact, if there are no spaces in the User-Agent, it's just about got to be a robot.

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

You can do a Forums search on Baidu. This search engine is used in both China and Japan. My own experience is that the one from Japan behaves itself while the Chinese one doesn't. So I don't bother about the User Agent itself, but instead block by IP. This is a matter of personal preference.

When there is a +http et cetera construction in the user-agent string, it means "You can go here for more information". But don't bother with Baidu; it just takes you to a blank page with a title in Kanji or possibly Chinese. A quick look at the page source, followed by a further detour to the referenced javascript file, suggests that they simply forgot to code the page. Unless they're pulling some rewrite hanky-panky of their own, so you think you're getting a static html page while they're really doing php business behind the scenes.