RewriteCond globs?

         

Dan99

7:28 pm on Mar 26, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



When I want to reject a referer "x.com", I put in my .htaccess file

RewriteEngine on
RewriteCond %{HTTP_REFERER} x\.com [NC]
RewriteRule .* - [F]


But I'm surprised to see that this text ALSO rejects max.com, thisandthatx.com and whoknowswhatx.com. That is, it rejects any referer that has the text "x.com" in its URL. That's not spelled out anywhere, is it? I was rejecting referers I didn't want to reject!
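For anyone who wants to reproduce this outside Apache, here is a minimal sketch in Python. It assumes Python's re.search is a fair stand-in for how an unanchored RewriteCond pattern is applied, with re.IGNORECASE standing in for [NC]. The referer values are made up for illustration.

```python
import re

# Same pattern as the RewriteCond above; re.IGNORECASE plays the role of [NC].
pattern = re.compile(r"x\.com", re.IGNORECASE)

# Hypothetical referer values, chosen only for illustration.
referers = [
    "http://x.com/",
    "http://max.com/page.html",
    "http://thisandthatx.com/",
    "http://example.org/",
]

# re.search looks for the pattern anywhere in the string, which is how an
# unanchored pattern behaves: "contains x.com", not "is x.com".
blocked = [r for r in referers if pattern.search(r)]
print(blocked)
```

Only the example.org referer gets through; every referer that contains "x.com" anywhere is caught.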

I guess what I'm supposed to put there is

RewriteCond %{HTTP_REFERER} http://x\.com [NC]


Is that the workaround? Sheesh.

penders

8:31 pm on Mar 26, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Why do you put a backslash before the dot?

Dan99

9:01 pm on Mar 26, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



That's what everyone says to do. All the examples out there say you have to escape the dot. A period matches any character, so to match a literal period, it must be escaped with a backslash.

Actually, in my guess about what I'm supposed to put there, I missed a caret. The examples I see call for

RewriteCond %{HTTP_REFERER} ^http://x\.com [NC]

lucy24

9:46 pm on Mar 26, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's not spelled out anywhere, is it?

Yes, it is. Unless you use anchors, any text match simply means "containing this specified string". In your case, I'd use a word-break \b anchor rather than a full-text ^ anchor:
RewriteCond %{HTTP_REFERER} \bx\.com(/|$) [NC]

This will match subdomain.x.com, www.x.com, x.com and so on. (It will also match new-x.com but this is probably not a problem.) Use a closing anchor to protect against referers that happen to contain the string "x.com" somewhere further along, like
example.com/newx.computers/morestuff.html

This may or may not ever really occur, but let's be safe.
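A quick way to check what this pattern does and doesn't catch is to try it in Python, whose \b and $ behave the same way for these cases. This is only a sketch: the referer strings are invented, and re.IGNORECASE stands in for [NC].

```python
import re

# The word-break pattern from above, with a closing (/|$) anchor.
pattern = re.compile(r"\bx\.com(/|$)", re.IGNORECASE)

# Hypothetical referers for illustration only.
referers = [
    "http://x.com/",
    "http://www.x.com",
    "http://subdomain.x.com/page",
    "http://new-x.com/",                                 # hyphen is not a word character
    "http://max.com/",                                   # "a" before "x": no word break
    "http://example.com/new-x.computers/morestuff.html", # "x.com" followed by "p"
]

# Map each referer to True (would be rejected) or False (let through).
results = {r: bool(pattern.search(r)) for r in referers}
for referer, rejected in results.items():
    print(rejected, referer)
```

The first four are rejected; max.com is saved by \b and the "x.computers" path is saved by the closing anchor.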

All the examples out there say you have to escape the dot.

In a pattern you have to escape certain characters because they have special meaning in a regular expression. In a target you do not need to escape anything, ever. Here the escape is correct, because otherwise-- especially without anchors-- you'd be matching things like
xycom.org
or even
x/computers
as well.
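To see the difference the backslash makes, here is a small Python sketch comparing the escaped and unescaped dot. The URLs are hypothetical, and Python's re is used only as a convenient regex tester.

```python
import re

escaped = re.compile(r"x\.com")   # literal dot
unescaped = re.compile(r"x.com")  # dot matches any single character

# Hypothetical referers for illustration only.
referers = [
    "http://x.com/",
    "http://xycom.org/",
    "http://example.org/x/computers",
]

# For each referer, record (escaped matches?, unescaped matches?).
results = {
    r: (bool(escaped.search(r)), bool(unescaped.search(r)))
    for r in referers
}
for referer, (with_escape, without_escape) in results.items():
    print(with_escape, without_escape, referer)
```

Only the first referer matches the escaped pattern; all three match the unescaped one.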

Dan99

10:01 pm on Mar 26, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thank you. It may be explained deep in the man pages. But in all the examples that are presented for this particular application of RewriteCond, no one ever seems to spell out that you're just matching a specified string. I find that omission to be pretty serious.

That's very interesting about word-break anchors and closing anchors. I need to read up on those, but the example you've given seems to be what I need.

Not sure I understand what's wrong here with a full-text anchor. Doesn't that require an exact match to the full text?

lucy24

11:31 pm on Mar 26, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A full-text anchor matches from the beginning of the string. So
^http etcetera
would work ... except that you then have to allow for (www\.)? and https? and possibly more stuff if there are subdomains. A word-break anchor means "there may or may not be more stuff here, but there definitely isn't another word character" (word characters being alphanumerics and the _ lowline).
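As a rough comparison, here is a Python sketch of the two approaches side by side. The anchored pattern is just one assumed way of spelling out "^http etcetera" with the optional pieces; the referers are invented for illustration.

```python
import re

# One possible full-text-anchored pattern (an assumption, not the only spelling).
anchored = re.compile(r"^https?://(www\.)?x\.com(/|$)", re.IGNORECASE)
# The word-break version.
word_break = re.compile(r"\bx\.com(/|$)", re.IGNORECASE)

# Hypothetical referers for illustration only.
referers = [
    "http://x.com/",
    "https://www.x.com",
    "http://shop.x.com/",  # a subdomain the anchored pattern did not plan for
    "http://max.com/",
]

# For each referer, record (anchored matches?, word-break matches?).
results = {
    r: (bool(anchored.search(r)), bool(word_break.search(r)))
    for r in referers
}
for referer, (a, w) in results.items():
    print(a, w, referer)
```

The anchored pattern misses shop.x.com unless you keep adding alternatives, while the word-break pattern picks it up without any extra work. Neither one touches max.com.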

Unlike most anchors, \b doesn't come in pairs; it's the same at both ends of the pattern. So it's also useful for matching things like browser version numbers:
[1-5]\b
will match 3 but not 34, 2 but not 20 and so on, as long as the digit comes right after something fixed such as the browser name. (On its own it would still find the 4 in 34.) Useful especially with things like Chrome and Firefox that introduce a new version every other week, and who can keep track of the multiples of ten? (I really did once accidentally block an MSIE 10 immediately after it was introduced.)
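Here is a small Python sketch of the version-number case. It assumes the digit class is preceded by a literal "MSIE " so the match can't start in the middle of a number; the user-agent strings are illustrative, not taken from real logs.

```python
import re

loose = re.compile(r"MSIE [1-5]")     # no word break after the digit
strict = re.compile(r"MSIE [1-5]\b")  # digit must end the version number

# Illustrative user-agent strings (assumed for this example).
old_browser = "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
new_browser = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1)"

# The loose pattern sees "MSIE 1" inside "MSIE 10.0" and blocks it by accident.
loose_hits = [bool(loose.search(ua)) for ua in (old_browser, new_browser)]
# The strict pattern requires a word break after the digit, so "10" is safe.
strict_hits = [bool(strict.search(ua)) for ua in (old_browser, new_browser)]

print(loose_hits)
print(strict_hits)
```

With \b in place, MSIE 5.5 is still caught (the "." after the 5 is a word break) but MSIE 10.0 is not.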

penders

12:30 pm on Mar 27, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All great info from lucy24, as usual.

I was just going to try and address a few fundamentals that seem to be missing in the understanding...

A period matches any character, so to match a literal period, it must be escaped with a backslash.


Exactly.

.... I missed a caret.


So, you are aware of anchors as well. Good.

This should give you a big clue... "x\.com" is not an ordinary string, it is "special", it is a "pattern", a "regular expression" (or "regex" for short). And this is as stated in the documentation. In fact the whole of mod_rewrite revolves around the use of regex, as stated on the very first line of the docs, "a rule-based rewriting engine (based on a regular-expression parser)". If you know the basics of regex, which you need to if you are using mod_rewrite, then you will realise why it matches the way it does.

Incidentally, these are the same regex (for the sake of this thread) as you see in all modern languages that support regex: JavaScript, PHP, Python, C#, etc... and used in many text editors when searching a document.

But let's take one step back... even if you've never heard of regex, you already know (or have accepted) that "x\.com" matches against the string "http://x.com" (the HTTP Referer). If it's not immediately obvious why, it should raise the obvious question, "Why?". And once you realise why that matches it should be obvious that it would also match "http://max.com", "<anything>x.com" and consequently "<anything>x.com<anything>".

Dan99

1:19 pm on Mar 27, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thank you. I think the confusion in my mind was what the word "match" means. Simplistically, a bag of apples doesn't "match" a bag of apples and bananas. It just matches a bag of apples. But in the case of mod_rewrite, a "match" is evidently where a regex is a parseable part of a string. It matches *something* in that string. So the issue isn't really what a pattern block is, but how it is applied. The word "match" is, to my dim mind, a poor way of describing what's going on here.

penders

3:17 pm on Mar 27, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think maybe you are confusing "match" with "equal"? (Although "x\.com" is not equal to "http://x.com" either.)

...a bag of apples doesn't "match" a bag of apples and bananas. It just matches a bag of apples.


Although a bag of "fruit" does match a bag of apples and bananas, and a bag of apples. "x\.com" is the "fruit".

I think the confusion is also strengthened by the fact that "x\.com" looks more like an ordinary string of characters. If the "regex" is more cryptic, something like "([0-7]{3,}|[xyz]+)?" then it becomes reasonably clear we are not trying to equal a bunch of literal gobbledygook like this, but instead we are matching the pattern (regex) that this group of characters represents.

Dan99

3:35 pm on Mar 27, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Well, I'd rather not argue words, but "match" is pretty much defined as "equal" or "exact counterpart". Look it up. A better way to describe this is that mod_rewrite is trying to find a regex *within* a string, and not about matching the regex to the string. My point is that mod_rewrite is not looking for an exact counterpart to the string. I understand regex, but I guess I just didn't understand "match". That is, "x.com" is found within the string "<anything>x.com<anything>", but it doesn't match that string. It isn't the exact counterpart of that string.
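For what it's worth, some regex libraries do make this "found within" versus "exact counterpart" distinction explicit. A minimal sketch in Python, assuming re.search corresponds to what mod_rewrite does with an unanchored pattern (the referer is made up):

```python
import re

pattern = re.compile(r"x\.com")
referer = "http://max.com/page.html"  # hypothetical referer

# "Found within": the pattern occurs somewhere in the string.
found_within = pattern.search(referer) is not None

# "Exact counterpart": the pattern must account for the entire string.
exact_counterpart = pattern.fullmatch(referer) is not None

# The bare string "x.com" is the only thing the pattern equals outright.
exact_on_bare = pattern.fullmatch("x.com") is not None

print(found_within, exact_counterpart, exact_on_bare)
```

So the everyday sense of "match" is closer to fullmatch, while the regex sense used in the mod_rewrite docs is closer to search.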

lucy24

8:37 pm on Mar 27, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



...a bag of apples doesn't "match" a bag of apples and bananas. It just matches a bag of apples.

Although a bag of "fruit" does match a bag of apples and bananas, and a bag of apples.

Good image. When I read the "apples and bananas" part I was about to object that a bag of apples doesn't match, because I was picturing a pattern
aaa
and a test string
ababab
No match. But it would match a pattern of
[ab][ab][ab]
or possibly
\w\w\w

The important thing to walk away with is that this has nothing to do with mod_rewrite and everything to do with Regular Expressions. A lot of other Apache mods use RegExes too, so once you've learned them you can apply the knowledge elsewhere. (Files and Redirect: no RegEx. FilesMatch and RedirectMatch: RegEx. And so on. Except of course you won't be using RedirectAnything once you've got RewriteRules ;))

And always escape \ literal spaces. That's specific to Apache; most RegEx engines have rules of their own overlaid on the basics. If you ever see a RewriteRule containing \/ (escaped slashes)-- or, worse, escaped everything, or escaped anything at all in the target-- stop reading, because your source doesn't really understand what they're doing.

Dan99

9:23 pm on Mar 27, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



The important thing to walk away with is that this has nothing to do with mod_rewrite and everything to do with Regular Expressions. A lot of other Apache mods use RegExes too, so once you've learned them you can apply the knowledge elsewhere.

OK, that is indeed important. It's not that Rewrite just happens to use a regex to look for coincidences in a string, but regex everywhere corresponds to coincidences in that string.

If you ever see a RewriteRule containing \/ (escaped slashes)-- or, worse, escaped everything, or escaped anything at all in the target-- stop reading, because your source doesn't really understand what they're doing.

Uh, you lost me there, because the example you gave above ...

RewriteCond %{HTTP_REFERER} \bx\.com(/|$) [NC] 


... had the dot escaped.

phranque

10:39 pm on Mar 27, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you ever see a RewriteRule containing \/ (escaped slashes)-- or, worse, escaped everything, or escaped anything at all in the target-- stop reading, because your source doesn't really understand what they're doing.

Uh, you lost me there, because the example you gave above ...

RewriteCond %{HTTP_REFERER} \bx\.com(/|$) [NC]

... had the dot escaped.

[my emphasis added]
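While a RewriteRule directive specifies a target (as well as a pattern), the RewriteCond directive specifies only a test string and a pattern. The escaped dot in that example is in the pattern, where escaping is correct.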

Dan99

10:48 pm on Mar 27, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Ah, *bonk*, yes that should have been obvious. Thanks.