Forum Moderators: phranque
I've been successfully blocking all permutations of Heritrix via --
SetEnvIfNoCase User-Agent
-- and now want to block using mod_rewrite. I've seen the UA in two forms and initially thought I'd do this:
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^heritrix [NC,OR]
(etc.)
For the sake of streamlining, could I just drop the new-line caret instead, re Heritrix or any similar multi-version UA?
RewriteCond %{HTTP_USER_AGENT} heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} twiceler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
(etc.)
These UAs can be used so nastily that I'm reluctant to test on a live site, only to discover too late that no-carat blocks don't work. TIA for your thoughts!
-Annie
P.S.
Which is correct, please, if either, to allow top level access to robots.txt? Or are they basically the same?
RewriteCond %{REQUEST_URI} !^/robots\.txt$
(or)
RewriteCond %{REQUEST_URI} !^robots\.txt$
TIA redux!
[edited by: Pfui at 4:04 pm (utc) on June 18, 2007]
Was the above line spurious? It's missing the name of the variable to set...
> For the sake of streamlining, could I just drop the new-line caret instead, re Heritrix or any similar multi-version UA?
Yes, that's fine. Although it will benefit you to think of and refer to the caret as a "pattern-start-anchor," since it has nothing to do with newlines.
Don't start-anchor a pattern unless you know that the string will always start with that pattern. In case of doubt, when failure to match will have a 'bad result,' leave it out.
The advantage of using start and end anchors is performance and unambiguous matching. When trying to match fixed character-strings, it is faster to match an exact string (both start and end-anchored) or a start-anchored string. Next comes end-anchored-only strings (the match has to be done in reverse), and finally, unanchored strings. These are slowest of all, because the regex matching engine has to look for a "floating" match, leading to a potentially-large number of trial matches.
Things get even worse when the patterns contain regex patterns instead of fixed character strings. Badly-coded patterns containing multiple ambiguous subpatterns like "ab(.*)cd(.*)ef(.*)" can require thousands of matching attempts, and should be avoided if at all possible by using much more specific subpatterns.
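Fixed-string alternation is one way to keep patterns specific while still consolidating conditions. For instance, the separate per-bot conditions earlier in the thread might be collapsed into a single condition (a sketch only; the [F] forbidden response is one common choice of action, and the bot list is just an example):

```apache
# One condition covering several unwanted UAs; as the only
# condition for this rule, it carries no [OR] flag.
RewriteCond %{HTTP_USER_AGENT} (heritrix|twiceler|larbin) [NC]
RewriteRule .* - [F]
```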
[added] URLs "seen" by RewriteCond testing %{REQUEST_URI} will always start with a slash. URLs "seen" by RewriteRule will not start with a slash if the code is in .htaccess, but will start with a slash if the code is in httpd.conf, conf.d, etc. In .htaccess "!^/robots\.txt$" is correct. [/added]
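Putting that robots.txt exemption together with a UA block, a minimal .htaccess sketch might look like the following (heritrix is just a placeholder UA here, and [F] returns 403 Forbidden):

```apache
RewriteEngine On
# Always let robots.txt through: REQUEST_URI starts with a slash
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Block the unwanted crawler everywhere else
RewriteCond %{HTTP_USER_AGENT} heritrix [NC]
RewriteRule .* - [F]
```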
Jim
[edited by: jdMorgan at 4:20 pm (utc) on June 18, 2007]
>> SetEnvIfNoCase User-Agent
> Was the above line spurious? It's missing the name of the variable to set...
Not so much spurious as obfuscated by omission. Here's the original:
SetEnvIfNoCase User-Agent "heritrix" no-way
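For context, the environment variable set there does nothing by itself; it is typically consumed by the access-control directives. An Apache 2.2-era sketch, reusing the no-way variable from the line above:

```apache
SetEnvIfNoCase User-Agent "heritrix" no-way
Order Allow,Deny
Allow from all
Deny from env=no-way
```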
-----
>> For the sake of streamlining, could I just drop the new-line caret instead, re Heritrix or any similar multi-version UA?
> Yes, that's fine. Although it will benefit you to think of and refer to the caret as a "pattern-start-anchor," since it has nothing to do with newlines.
How odd. I've had newline, or more specifically, the start of a newline, in my head for I don't know how long. I must've mentally pattern-matched it with '\n' in Perl, C, etc. Whatta goof. Thanks for the graceful restart! :)
-----
> Don't start-anchor a pattern unless you know that the string will always start with that pattern. In case of doubt, when failure to match will have a 'bad result,' leave it out.
I'm finally remembering that (well, more often than not). But pattern-matching still throws me. (sighs) That's why I wanted to double-check before I left it out and inadvertently let all manner of hooligans in.
-----
> The advantage of using start and end anchors is performance and unambiguous matching. When trying to match fixed character-strings, it is faster to match an exact string (both start and end-anchored) or a start-anchored string. Next comes end-anchored-only strings (the match has to be done in reverse), and finally, unanchored strings. These are slowest of all, because the regex matching engine has to look for a "floating" match, leading to a potentially-large number of trial matches.
Ah. Balancing utility with efficiency. Hmm. I hope efficiency still wins out when using one unanchored string --
RewriteCond %{HTTP_USER_AGENT} heritrix [NC,OR]
-- versus three start-anchored ones:
RewriteCond %{HTTP_USER_AGENT} ^heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^my-heritrix-crawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Heritrix [NC,OR]
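One wiring detail worth noting when a chain like this goes live: the final RewriteCond in an [OR] group should omit the [OR] flag, and the group must be followed by a RewriteRule to take the action. A sketch of the three anchored conditions in deployable form ([F] is one possible response):

```apache
RewriteCond %{HTTP_USER_AGENT} ^heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^my-heritrix-crawler [NC,OR]
# Last condition: no [OR], so the chain terminates here
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Heritrix [NC]
RewriteRule .* - [F]
```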
-----
> Things get even worse when the patterns contain regex patterns instead of fixed character strings. Badly-coded patterns containing multiple ambiguous subpatterns like "ab(.*)cd(.*)ef(.*)" can require thousands of matching attempts, and should be avoided if at all possible by using much more specific subpatterns.
Jim, with the exception of you and Ralf Engelschall, I suspect everyone could benefit from better-coded patterns. :)
> [...] In .htaccess "!^/robots\.txt$" is correct.
Thank you again, Jim! -Annie