Regex in .htaccess failing on periods and plus signs

         

ntbgl

1:15 pm on Apr 12, 2009 (gmt 0)

10+ Year Member



I'm making a wiki using MediaWiki, the same software that runs Wikipedia.

By default, the software program gives you really ugly URLs:

/index.php?title=This_is_the_file_name

But with a little editing to an internal file, and the .htaccess file, you can turn that into:

/This_is_the_file_name

The .htaccess code looks like this:

RewriteRule ^[^:]*\. - [L]
RewriteRule ^[^:]*\/ - [L]
RewriteRule ^(.+)$ /index.php?title=$1 [L,QSA]

I've never seen regex looking like that, but the code worked great.

My problem is, I'm noticing some page titles result in the wrong file being pulled or an error.

Pages with a period (.) in the title result in an error.

/...Baby_One_More_Time

and pages with a plus sign (+) redirect to a different title

/1+1=2

takes you to

/1 2=3

There might be more issues, but I've only come across those two errors.

If somebody could help me modify the regex, it would be much appreciated.

jdMorgan

3:30 pm on Apr 12, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Neither problem can be fixed simply by modifying the regex.

In the first case, the code is specifically "skipping" the rewrite if a period or slash is found in the requested (pretty) URL and no colon precedes those characters (whether immediately-preceding or not). The intent is apparently to avoid rewriting URLs that have filetypes appended, such as "robots.txt" or "index.php" itself.
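To make the "skipping" concrete, here is a rough sketch in Python of how the original three rules treat a few request paths. Python's `re` module stands in for mod_rewrite's regex engine, the function name `is_rewritten` is made up for this illustration, and the paths are given without the leading slash, as mod_rewrite sees them in .htaccess.

```python
import re

# The two "skip" patterns from the original .htaccess rules.
SKIP_PERIOD = re.compile(r'^[^:]*\.')   # a period with no colon before it
SKIP_SLASH = re.compile(r'^[^:]*/')     # a slash with no colon before it

def is_rewritten(path):
    """Return True if the original rules would pass `path` to index.php."""
    if SKIP_PERIOD.search(path) or SKIP_SLASH.search(path):
        return False                    # "- [L]": leave the URL untouched
    return re.search(r'^(.+)$', path) is not None

for path in ["This_is_the_file_name", "robots.txt",
             "...Baby_One_More_Time", "URL:google.com", "images/logo"]:
    print(path, is_rewritten(path))
# This_is_the_file_name True
# robots.txt False
# ...Baby_One_More_Time False
# URL:google.com True
# images/logo False
```

The title beginning with periods is skipped for exactly the same reason as robots.txt, which is why it never reaches the Wiki script.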

In the second case, it is most likely the script itself changing plus signs to spaces. In a URL query string, a plus sign is the conventional encoding for a space, which is how most search engines encode multi-word queries.
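The plus-to-space convention is easy to see with any standard query-string parser. This is a small sketch using Python's `urllib.parse`, not the Wiki's own PHP code; the helper name `title_from_query` is hypothetical, and the assumption is only that the script decodes its query string in the usual form-encoded way.

```python
from urllib.parse import parse_qs

def title_from_query(query):
    """Decode the `title` parameter the way form-encoded data is normally read."""
    return parse_qs(query)["title"][0]

# After the rewrite, the script receives the pretty URL-path as title=...
print(title_from_query("title=1+1=2"))     # 1 1=2
# A percent-encoded plus sign survives decoding.
print(title_from_query("title=1%2B1=2"))   # 1+1=2
```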

So the first question to ask is, "Can you do without periods and plus signs in titles?" These are titles, like newspaper headlines, so periods are normally not used, and I doubt that plus signs are used much, either.

If the answer is "No," then you'll probably have to go with a rewriting solution that is a lot less efficient. By that I mean "very inefficient": it may not be suitable for a high-traffic site. You will also have to modify the Wiki script to fix the plus sign problem, as that can't be fixed in mod_rewrite.

Modification for "period" problem:


# Rewrite URL-path requests with no periods or slashes (unless preceded by a colon) to script
RewriteRule ^([^./]+|[^:]*:[^./]*[./].*)$ /index.php?title=$1 [QSA,L]
#
# Else rewrite only if requested URL-path does not end with any of the listed
# filetypes and does not resolve to a physically-existing directory or file
RewriteCond $1 !\.(gif|jpe?g|png|php[345]?|s?html?|xml|js|css|txt|rdf|pdf|flv|swf|wmv|mpe?g[234]?|avi|mp3)$
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.+)$ /index.php?title=$1 [QSA,L]

It is the additional second rule, and how it works, that is inefficient. Any requested URL-path that contains a period or slash not preceded by a colon, and that does not end with one of the listed filetypes, results in two calls to the operating system's filesystem, and possibly two disk reads, to see whether the requested URL-path resolves to a physically-existing file or directory. These extra disk operations may slow the site perceptibly if you get a *lot* of traffic, meaning hundreds of thousands of hits per day.

You can reduce the number of checks by excluding certain filetypes from being checked, as shown. Requests ending in a listed filetype never trigger a disk check, but that "filetype" can then no longer be used at the end of a title in your Wiki. The list shown here is meant to be illustrative rather than comprehensive: add or remove filetypes to suit your site's needs. You must decide where to strike the balance based on what is important to you.
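The trade-off can be sketched as a small Python model of the modified rules. This is only an illustration: the patterns are copied from the code above with solid pipes, the `route` function and its `files`/`dirs` sets are made-up stand-ins for the real filesystem, and it assumes the conditions are evaluated in order and stop at the first one that fails.

```python
import re

# Patterns copied from the modified rules, with solid pipes.
FIRST_RULE = re.compile(r'^([^./]+|[^:]*:[^./]*[./].*)$')
LISTED_TYPES = re.compile(
    r'\.(gif|jpe?g|png|php[345]?|s?html?|xml|js|css|txt|rdf|pdf'
    r'|flv|swf|wmv|mpe?g[234]?|avi|mp3)$'
)

def route(path, files=frozenset(), dirs=frozenset()):
    """Return (destination, disk_checks) for a requested URL-path."""
    if FIRST_RULE.search(path):
        return ("wiki", 0)     # first rule matches: no disk access needed
    if LISTED_TYPES.search(path):
        return ("as-is", 0)    # listed filetype: rejected before any disk check
    if path in dirs:
        return ("as-is", 1)    # the !-d condition fails, so !-f is never tested
    if path in files:
        return ("as-is", 2)    # !-d passes, !-f fails
    return ("wiki", 2)         # both checks pass: rewrite to the script

print(route("This_is_the_file_name"))                 # ('wiki', 0)
print(route("URL:google.com"))                        # ('wiki', 0)
print(route("robots.txt"))                            # ('as-is', 0)
print(route("Google.com"))                            # ('wiki', 2)
print(route("favicon.ico", files={"favicon.ico"}))    # ('as-is', 2)
```

Only requests like the last two pay for the disk checks, so how often they occur on your site decides whether the cost matters.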

Also, order the filetype-exclusion list based on your 'hit-count' stats, with the most frequently-requested filetypes at the beginning of the list.

Again it's a balance between executing more mod_rewrite code, titling freedom, and disk-check time.

Important: The alternation characters in the code above must be solid pipes ("|"). Posting on this forum converts pipes to broken pipes ("¦"), and the code won't work if any broken pipes are left in it.

And again, the problem with plus signs is not resolvable by mod_rewrite -- I'd suggest you either avoid the use of that character, or use preg_replace in a "patch" in the script to substitute either a different character or the word "plus" for all occurrences of that character in your title links (look for the "link printing" or "get title from database" directives, and add the preg_replace in or near those lines).
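As a rough illustration of that substitution idea, here it is in Python rather than PHP. Both function names are hypothetical, and `as_seen_by_script` only assumes that the script decodes its query string in the usual form-encoded way.

```python
from urllib.parse import parse_qs

def link_title(title, replacement="plus"):
    """Hypothetical patch: substitute the plus sign when printing a title link."""
    return title.replace("+", replacement)

def as_seen_by_script(path):
    """What the script would read after the path is rewritten to title=<path>."""
    return parse_qs("title=" + path)["title"][0]

print(as_seen_by_script("1+1=2"))              # 1 1=2
print(as_seen_by_script(link_title("1+1=2")))  # 1plus1=2
```

The substituted title passes through unchanged because it no longer contains a character that the decoding step treats specially.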

I should also say that I don't know why the original code looked for colons -- It may have something to do with the URLs produced by the Wiki script; Perhaps you can include periods and slashes in Wiki titles if you precede them with a colon (check the documentation, as you may not even need to modify this code). But I attempted to copy that functional behaviour into the new code above, and will leave that issue for you to investigate.

Jim

ntbgl

7:41 pm on Apr 12, 2009 (gmt 0)

10+ Year Member



Thank you very much for such a thorough reply.

I can deal with the articles with a plus very easily. With 80,000 pages, I've only found one with a plus sign in the title.

The period, however, is a bigger issue, because I really want to have titles with periods in them. Say, for example, an article on Google.com, which would be different from Google.org. I wouldn't want to just call it Google, and Google_com and Google_org look sloppy.

I was thinking maybe a decent solution would be to have a 404 error page that first looks to see if there is a page under the ugly URL with the same name before finally resolving into a 404 error.

Could this be an acceptable solution for a site aspiring to heavy traffic, with only an occasional title containing a period?

ntbgl

9:15 pm on Apr 12, 2009 (gmt 0)

10+ Year Member



Scratch that, it seems like the wiki program already has a method of handling 404 errors.

I think I'll just do the sensible thing and prevent titles from being created with periods in them, unless they also contain a colon. For URLs, I'm going to block them unless they're in the format URL:domainname.com

I'm trying to fix up a regex statement to do this:

[^URL:]*\. seems close, but not quite there.

g1smd

12:44 am on Apr 13, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You do realise that
[^URL:]*
means:

"zero or more characters, each of which is not a U, not an R, not an L, and not a colon".

It does not mean "anything other than the string URL:". I am not quite sure that is what you want. It may match some stuff that you don't want matched.
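A quick way to check this is to try the pattern on a few titles. The sketch below uses Python's `re` module and anchors the pattern at the start, as the original rules do. The second pattern, with a negative lookahead, is only a guess at the intent ("titles with a period that do not start with URL:"), not something taken from the Wiki's documentation.

```python
import re

# The attempted pattern: a character class excludes single characters,
# not the sequence "URL:".
char_class = re.compile(r'^[^URL:]*\.')

# Assumed intent: contains a period and does not start with "URL:".
lookahead = re.compile(r'^(?!URL:).*\.')

for title in ["google.com", "Lady.Gaga", "URL:google.com", "Google"]:
    print(title, bool(char_class.search(title)), bool(lookahead.search(title)))
# google.com True True
# Lady.Gaga False True
# URL:google.com False False
# Google False False
```

The character-class version misses "Lady.Gaga" simply because the title starts with an L, which is the kind of unwanted result described above.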

jdMorgan

3:35 am on Apr 13, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not sure why this complication is needed, at least at the .htaccess level. The code above (either the original or the modified version) will rewrite requested URL paths with periods to the script if a colon precedes them, and the modified version will additionally rewrite any requested URL-path with a period to the script as long as it does not resolve to a physically-existing file or directory.

Jim