mod_rewrite regexp problem?

Dealing with Dots.

         

rocknbil

10:38 pm on Dec 28, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Finally cornered an admin to try the approach mentioned in this thread: [webmasterworld.com]. The idea is to direct a request for a member's page, as in /membername, to a script designed to parse out environment variables and print out the page.

It works PERFECTLY unless there is a dot in the member's name:

example.com/john.h.doe

Does anyone see the mistake in our regexp?

# IF requested URL does not resolve to an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
# AND IF requested URL does not resolve to an existing file
RewriteCond %{REQUEST_FILENAME} !-f
# THEN internally rewrite the request to the search script
RewriteRule ^/([^\.]+)$ /cgi/search_script?d=1 [L]

We've tried several other regexps and couldn't get any of them to work. Does anyone have any ideas? Thanks in advance.

jdMorgan

12:13 am on Dec 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The mistake is at a much higher level:

example.com/john.h.doe

According to the rules of Apache and other servers, this is a request for a file named "john" with an extension (filetype) of ".h.doe".

My advice would be to disallow the use of reserved characters such as periods in account names. Use hyphens or underscores or just run it all together.

We are not really 'free to use any characters we want' in URLs; see RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax [faqs.org].

The code you posted explicitly requires that no periods be present in the filename. If you want to allow periods, which may make this rule apply to other types of URLs that you don't intend it to, then you could change your rule to:


RewriteRule ^/(.+)$ /cgi/search_script?d=1 [L]

which applies to all non-blank paths including subdirectories, so efficiency is affected as well.

For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com]. The regular-expressions tutorial might be most helpful in understanding what all the 'hats and dollars' in the above rewrite patterns mean.

The beauty of regular expressions and mod_rewrite is that you can precisely state what URLs (or types or classes of URLs) you do and don't want to rewrite. But you must define what you do and don't want to rewrite in precise terms first.

For example, if this username is always present as the first and only element in the URL, and there will never be a username followed by a subdirectory, but there *are* other subdirectory URLs on the server, then you might use:


RewriteRule ^/([^/]+)$ /cgi/search_script?d=1 [L]

That would apply the Rule and the RewriteCond checks only to URL-paths that don't contain any slashes other than the one leading slash. It is not a good idea to do file-exists and directory-exists checks on all of your URL-paths, as done by the code above. So anything you can do in the rule or in the design of your URL and directory structure to avoid them would be beneficial to server performance.
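Put together with the two conditions from the original post, that rule might look like the sketch below. It assumes the same context the original rules were written for, one where the pattern sees the leading slash (server config rather than .htaccess, where the leading slash is stripped), and the example paths in the comments are made up for illustration.

```apache
RewriteEngine On

# mod_rewrite tests the RewriteRule pattern first; the RewriteConds are
# only evaluated for requests whose URL-path matches that pattern.
#   /johndoe       -> matches
#   /john.h.doe    -> matches (periods are allowed, further slashes are not)
#   /members/john  -> no match, so no file/directory checks are made
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^/([^/]+)$ /cgi/search_script?d=1 [L]
```

Note that this pattern still matches single-segment requests such as /index.html, which is why the two exists-checks are kept here.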

For example, if you required the URL-path to be /user/<username>, then your RewriteRule could match on '/user/' and the RewriteConds would not be processed for any URL-paths that do not start with '/user/', thus avoiding all the overhead of checking the filesystem. Some corporations and ISPs follow the convention of giving their users directory-paths like '~john.h.doe' for just this reason; the '~' clearly identifies the request as one for a user's subdirectory or 'account.' And there's actually an Apache module, mod_userdir, to handle this automatically without using rewrites.
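As a rough sketch of the /user/<username> idea, assuming the rules are in a context where the pattern sees the leading slash and reusing the /cgi/search_script target from the rules posted earlier in this thread (the '/user/' prefix itself is only the hypothetical convention described here, not something the site already uses):

```apache
RewriteEngine On

# Only URL-paths of the form /user/<username> match the pattern, so only
# those requests reach the two filesystem checks. Every other request
# fails the pattern and skips the RewriteConds entirely.
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^/user/([^/]+)$ /cgi/search_script?d=1 [L]
```

If nothing real ever lives under /user/, the two RewriteCond lines could probably be dropped as well, but that depends on how the rest of the site is laid out.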

Jim

rocknbil

8:07 pm on Dec 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you Jim, this is all very helpful. I'm pretty adept at regexps in Perl, just not sure if they act the same in Apache, which is why I passed it along to our admin.

Believe me, if I had the option of disallowing ".", I would have already, but I am dealing with legacy stuff involving thousands of members. The programming indeed prevents anything but [a-z0-9] from being added. But this

^/([^/]+)$

could very easily work, as there are indeed no member directories. Thank you again, back into the grind I go!