mod_rewrite {HTTP_USER_AGENT} ^new-line caret Q

(w/ {REQUEST_URI} P.S.)

         

Pfui

4:02 pm on Jun 18, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month


[Note: The RewriteCond examples are properly 'spaced'; this board's program tends to strip same in some code.]

I've been successfully blocking all permutations of Heritrix via --

SetEnvIfNoCase User-Agent

-- and now want to block using mod_rewrite. I've seen the UA in two forms and initially thought I'd do this:

RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^heritrix [NC,OR]
(etc.)

For the sake of streamlining, could I just drop the new-line caret instead, re Heritrix or any similar multi-version UA?

RewriteCond %{HTTP_USER_AGENT} heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} twiceler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
(etc.)

These UAs can be used so nastily that I'm reluctant to test on a live site, only to discover too late that no-caret blocks don't work. TIA for your thoughts!

-Annie

P.S.
Which is correct, please, if either, to allow top level access to robots.txt? Or are they basically the same?

RewriteCond %{REQUEST_URI} !^/robots\.txt$
(or)
RewriteCond %{REQUEST_URI} !^robots\.txt$

TIA redux!

[edited by: Pfui at 4:04 pm (utc) on June 18, 2007]

jdMorgan

4:17 pm on Jun 18, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> SetEnvIfNoCase User-Agent

Was the above line spurious? It's missing the name of the variable to set...

> For the sake of streamlining, could I just drop the new-line caret instead, re Heritrix or any similar multi-version UA?

Yes, that's fine. Although it will benefit you to think of and refer to the caret as a "pattern-start-anchor," since it has nothing to do with newlines.

Don't start-anchor a pattern unless you know that the string will always start with that pattern. In case of doubt, when failure to match will have a 'bad result,' leave it out.
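
A pattern like this can be checked offline before it goes anywhere near a live site. The following is a minimal sketch in Python, assuming that Python's `re` module behaves like Apache's regex engine for patterns this simple, and that a case-insensitive search stands in for the [NC] flag. The User-Agent strings are made-up examples, not captured traffic.

```python
import re

# Hypothetical User-Agent strings, for illustration only.
SAMPLE_AGENTS = [
    "heritrix/1.12.1",
    "my-heritrix-crawler/0.1",
    "Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.example.org/)",
]


def matches(pattern, agent):
    """Approximate a RewriteCond pattern with [NC]: case-insensitive search."""
    return re.search(pattern, agent, re.IGNORECASE) is not None


# One unanchored pattern versus one start-anchored pattern.
unanchored = [matches(r"heritrix", ua) for ua in SAMPLE_AGENTS]
anchored = [matches(r"^heritrix", ua) for ua in SAMPLE_AGENTS]

print(unanchored)
print(anchored)
```

In this sketch the unanchored pattern catches all three sample strings, while the start-anchored one only catches the string that literally begins with "heritrix".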

The advantage of using start and end anchors is performance and unambiguous matching. When trying to match fixed character-strings, it is faster to match an exact string (both start and end-anchored) or a start-anchored string. Next come end-anchored-only strings (the match has to be done in reverse), and finally, unanchored strings. These are slowest of all, because the regex matching engine has to look for a "floating" match, leading to a potentially large number of trial matches.

Things get even worse when the patterns contain regex patterns instead of fixed character strings. Badly-coded patterns containing multiple ambiguous subpatterns like "ab(.*)cd(.*)ef(.*)" can require thousands of matching attempts, and should be avoided if at all possible by using much more specific subpatterns.
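
To make the ambiguity concrete, here is a small Python sketch comparing the pattern quoted above with a more specific variant. The negated character classes `[^c]*` and `[^e]*` are one possible tightening chosen for illustration (they are stricter than ".*" because they forbid any "c" or "e" inside the segment), and the input string is invented. The sketch shows which text each group captures; it does not measure the number of matching attempts.

```python
import re

# Made-up input containing "cd" twice, so the first (.*) has a choice to make.
sample = "ab1cd2cd3ef4"

# Ambiguous: each greedy .* first grabs everything, then backs off step by step.
greedy = re.search(r"ab(.*)cd(.*)ef(.*)", sample)

# More specific: each group stops at the first character it may not contain.
specific = re.search(r"ab([^c]*)cd([^e]*)ef(.*)", sample)

print(greedy.groups())
print(specific.groups())
```

The greedy version settles on the last "cd" only after backing off from the end of the string, whereas the specific version stops at the first "c" it meets and never has to reconsider.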

[added] URLs "seen" by RewriteCond testing %{REQUEST_URI} will always start with a slash. URLs "seen" by RewriteRule will not start with a slash if the code is in .htaccess, but will start with a slash if the code is in httpd.conf, conf.d, etc. In .htaccess "!^/robots\.txt$" is correct. [/added]
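
The effect of the leading slash can be sketched offline in Python, again assuming `re` is a fair stand-in for the server's regex engine on patterns this simple. The variable names are illustrative only.

```python
import re

# REQUEST_URI always carries the leading slash, per the note above.
request_uri = "/robots.txt"

with_slash = re.search(r"^/robots\.txt$", request_uri) is not None
without_slash = re.search(r"^robots\.txt$", request_uri) is not None

# A leading "!" in a RewriteCond inverts the match result.
cond_with_slash = not with_slash        # False: robots.txt escapes the block
cond_without_slash = not without_slash  # True: robots.txt would be blocked too

print(cond_with_slash, cond_without_slash)
```

Because the slash-less pattern can never match a REQUEST_URI value, its negated condition is true for every request, so it would exempt nothing.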

Jim

[edited by: jdMorgan at 4:20 pm (utc) on June 18, 2007]

Pfui

5:44 am on Jun 19, 2007 (gmt 0)


Howdy, Jim! Long time no type!

>> SetEnvIfNoCase User-Agent
> Was the above line spurious? It's missing the name of the variable to set...

Not so much spurious as obfuscated by omission. Here's the original:

SetEnvIfNoCase User-Agent "heritrix" no-way

-----
>> For the sake of streamlining, could I just drop the new-line caret instead, re Heritrix or any similar multi-version UA?

> Yes, that's fine. Although it will benefit you to think of and refer to the caret as a "pattern-start-anchor," since it has nothing to do with newlines.

How odd. I've had newline, or more specifically, the start of a newline, in my head for I don't know how long. I must've mentally pattern-matched it with '\n' in Perl, C, etc. Whatta goof. Thanks for the graceful restart! :)

-----
> Don't start-anchor a pattern unless you know that the string will always start with that pattern. In case of doubt, when failure to match will have a 'bad result,' leave it out.

I'm finally remembering that (well, more often than not). But pattern-matching still throws me. (sighs) That's why I wanted to double-check before I left it out and inadvertently let all manner of hooligans in.

-----
> The advantage of using start and end anchors is performance and unambiguous matching. When trying to match fixed character-strings, it is faster to match an exact string (both start and end-anchored) or a start-anchored string. Next come end-anchored-only strings (the match has to be done in reverse), and finally, unanchored strings. These are slowest of all, because the regex matching engine has to look for a "floating" match, leading to a potentially large number of trial matches.

Ah. Balancing utility with efficiency. Hmm. I hope it's the latter when using one unanchored string --

RewriteCond %{HTTP_USER_AGENT} heritrix [NC,OR]

-- versus three start-anchored ones:

RewriteCond %{HTTP_USER_AGENT} ^heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^my-heritrix-crawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Heritrix [NC,OR]

-----
> Things get even worse when the patterns contain regex patterns instead of fixed character strings. Badly-coded patterns containing multiple ambiguous subpatterns like "ab(.*)cd(.*)ef(.*)" can require thousands of matching attempts, and should be avoided if at all possible by using much more specific subpatterns.

Jim, with the exception of you and Ralf Engelschall, I suspect everyone could benefit from better-coded patterns :)

> [...] In .htaccess "!^/robots\.txt$" is correct.

Thank you again, Jim! -Annie

jdMorgan

1:31 pm on Jun 19, 2007 (gmt 0)


> But pattern-matching still throws me.

Nothing intimidating about it. Here's the whole concept in four lines:

^foo$ - Match only "foo" exactly
^foo -- Match anything that starts with "foo"
foo$ -- Match anything that ends with "foo"
foo ---- Match anything that contains "foo"

Jim
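
The four cases above can be tried out with a short Python sketch. It assumes Python's `re.search` treats `^` and `$` the same way for single-line strings like these; the helper name `which_match` and the test strings are invented for illustration.

```python
import re

# The four patterns from the list above, in the same order.
PATTERNS = ["^foo$", "^foo", "foo$", "foo"]


def which_match(text):
    """Return the patterns from PATTERNS that match the given text."""
    return [p for p in PATTERNS if re.search(p, text)]


for text in ["foo", "foobar", "barfoo", "xfoox", "bar"]:
    print(text, which_match(text))
```

Only the exact string "foo" satisfies all four patterns; a string with "foo" buried in the middle satisfies just the unanchored one.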