regex -- in depth


finlander

2:10 am on Nov 29, 2010 (gmt 0)

10+ Year Member



I have a pretty good understanding of basic regular expressions, but after completing some rewrite code and actually examining the RewriteRule, I am totally stumped as to how the guts of it are actually working. I have looked for a breakdown of the syntax on this and other sites and cannot find the answer I am looking for. Would anyone mind giving a brief step-by-step?

This is the rule, followed by my misunderstanding.

RewriteRule ^(.*)$ [www\.example\.com...] [R=301,L]

If the pattern part ^(.*)$ matches the whole original URL request, and the $1 represents that match, then why wouldn't the new URL end up, erroneously, looking like this:

[example.com...]

when this was the original URL request:

http://www.example.com/index.php?main_page=login

How is it that $1 represents only

index.php?main_page=login

finlander

3:13 am on Nov 29, 2010 (gmt 0)

10+ Year Member



I've been looking for the answer and may have found it, but I don't know yet. Is this it:

The pattern ^(.*)$ is only checked against characters of the original URL after .com? So, the string of characters that is captured by the regex -- for use in the reference $1 -- is the string of characters:

/index.php?main_page=login

Then, we do not end up with two slashes, because the slash in front of the $1 is not used?

So, the key to whether this is the answer or not, is whether the regex pattern is only looking for a match after .com. If so, does rewrite know what to do with .org etc? And what about .uk and such for other countries?

finlander

3:54 am on Nov 29, 2010 (gmt 0)

10+ Year Member



Okay, I must be an idiot. It seems that I am thinking about it upside-down or something. Is this the correct way to think about backreferences:

The $1 is a backreference, but it is not simply pulling in static text; rather, a matching is still occurring. The matching that is occurring is 'any string of characters in the full original URL that ...'

This is where I am hung up. I don't know how the backreference knows to only grab

/index.php?main_page=login

instead of grabbing anything else, such as

dex.php?main_page=

finlander

4:35 am on Nov 29, 2010 (gmt 0)

10+ Year Member



I think I have it. Someone please correct me if I am wrong.

from another site:

Accessing URL Parts from a Rewrite Rule

It is important to understand how certain parts of the URL string can be accessed from a rewrite rule.

For an HTTP URL in this form: http(s)://<host>:<port>/<path>?<querystring>

•The <path> is matched against the pattern of the rule.


That would solve my question. Only index.php in this example is tested against the rule pattern and later referenced by $1. Then, substitution only involves that path string plus everything in front of that path string. The query string is actually not being substituted -- it just stays 'as is' without actively being substituted (which is good since it does not need to be substituted). The path also does not 'need' to be substituted, but is used in the rule in order to invoke the substituting of protocol and host.

Does this sound correct?

g1smd

7:59 am on Nov 29, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The (.*) matches only the URL path, not the protocol, hostname, or query string. This may be the originally requested URL path, or the path as rewritten by an earlier rule processed for the same HTTP request.
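As a rough sketch of what that means for the example request in the first post: the rule below is hypothetical, not the original poster's rule (which is truncated above). The target http://www.example.com/$1 and the hostname check are assumptions made for illustration, and the code is assumed to sit in the .htaccess file of the document root.

```
# Hypothetical rule in the document root's .htaccess.
# Request: http://example.com/index.php?main_page=login
#   string tested against the pattern:         index.php
#   $1:                                        index.php
#   query string (never seen by the pattern):  main_page=login
RewriteEngine On
# Only redirect when the request did not already use the www hostname,
# otherwise the rule would keep redirecting to itself.
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
# Redirects to: http://www.example.com/index.php?main_page=login
```

The hostname never enters $1, so it cannot be duplicated in the target, and the query string reappears only because mod_rewrite carries it over unchanged.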

If you need to match the domain name, you need a RewriteCond looking at HTTP_HOST.

If you need to match the query string, you need a RewriteCond looking at QUERY_STRING.

If you need to match the http/https protocol, you need a RewriteCond looking at SERVER_PORT.

The original query string is automatically carried over to the target unless the target specifies a query string of its own, in which case the original is replaced. To re-append the original query string onto the end of the new query string, use the [QSA] flag. To completely clear the query string, add a question mark to the end of the target.
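The four cases can be sketched side by side. The file names and parameters here are made up for illustration, and the rules are alternatives rather than a set to use together, since the first matching rule with [L] would end processing.

```
# Hypothetical request in each case: /old.php?id=7

# 1. No query string in the target: the original is carried over.
#    Result: /new.php?id=7
RewriteRule ^old\.php$ /new.php [R=301,L]

# 2. Target has its own query string: the original is replaced.
#    Result: /new.php?src=old
RewriteRule ^old\.php$ /new.php?src=old [R=301,L]

# 3. [QSA] appends the original after the new query string.
#    Result: /new.php?src=old&id=7
RewriteRule ^old\.php$ /new.php?src=old [QSA,R=301,L]

# 4. Trailing question mark: the query string is cleared.
#    Result: /new.php
RewriteRule ^old\.php$ /new.php? [R=301,L]
```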

Use the [L] flag on EVERY RewriteRule.

One final note: escaping is needed for periods and a few other characters only in the RegEx pattern. It is not needed for anything in the rule target.
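A minimal illustration of the escaping point, using made-up page names: the pattern is a regular expression, so a literal period is escaped there, while the target is plain text and is written as-is.

```
# Pattern side: "\." matches a literal period (an unescaped "." would match any character).
# Target side: periods are ordinary characters and need no backslash.
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]
```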

finlander

8:31 am on Nov 29, 2010 (gmt 0)

10+ Year Member



Thanks again for your explanation. I was being dense and did not realize that the pattern is only checked against the path. That all makes sense now, and the other components do as well.

jdMorgan

2:42 am on Dec 2, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Further, in a .htaccess context or within a <Directory> section in a config file, the path to the current directory is stripped. So if the requested URL is http://example.com/dir1/dir2/dir3/foo.bar and the code is located in /dir1/dir2/.htaccess or in a <Directory /dir1/dir2/> container, then the URL-path examined by RewriteRule will be only "dir3/foo.bar".
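To make the difference concrete, here is a sketch of the same rewrite written for both contexts. The replacement file new.bar is hypothetical; only the patterns matter here.

```
# In /dir1/dir2/.htaccess, for the request /dir1/dir2/dir3/foo.bar,
# the pattern is tested against "dir3/foo.bar" (no leading slash):
RewriteRule ^dir3/foo\.bar$ /dir1/dir2/dir3/new.bar [L]

# In the main server or virtual host config, outside any <Directory>
# section, the pattern sees the full URL-path with its leading slash:
RewriteRule ^/dir1/dir2/dir3/foo\.bar$ /dir1/dir2/dir3/new.bar [L]
```

A pattern written for one context will silently fail to match in the other, which is a common reason a rule stops working after being moved between httpd.conf and .htaccess.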

> For an HTTP URL in this form: http://<host>:<port>/<path>?<querystring>

# Require HTTP (not HTTPS)
RewriteCond %{SERVER_PORT} !=443
# Require specific requested hostname (non-FQDN format and no port number appended)
RewriteCond %{HTTP_HOST} ^my-great-site\.com$
# Require specific query string
RewriteCond %{QUERY_STRING} ^size=small&color=celadon&texture=fuzzy$
# If URL-path and all RewriteConds match, then do a redirect to replace the discontinued color
RewriteRule ^widgets\.html$ http://my-great-site.com/widgets.html?size=small&color=green&texture=fuzzy [R=301,L]

Jim