...since then, I've heard hyphens seem to have emerged as the definite winner, and I'd like to replace the underscores with hyphens, but of course I don't want all our incoming links broken afterwards.
Instead of making a huge 302 redirect list for every underscored URL on the site, I'd like to put a single rule in my .htaccess that would rewrite incoming underscore-infested URLs to their new, hyphenated versions.
Is it possible to use mod rewrite to universally replace a single character (underscore) in ANY incoming URL with another character (hyphen), no matter where in the URL the character appears?
There is also a solution in mod_rewrite:
RewriteRule ^([^_]*)_(.*)$ $1-$2 [R=301,L]
RewriteRule ^([^_]*)_([^_]*)_(.*)$ http://www.example.com/$1-$2-$3 [R=301,L]
RewriteRule ^([^_]*)_(.*)$ http://www.example.com/$1-$2 [R=301,L]
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ http://www.example.com/$1-$2-$3-$4-$5 [R=301,L]
RewriteRule ^([^_]*)_([^_]*)_(.*)$ http://www.example.com/$1-$2-$3 [R=301,L]
RewriteRule ^([^_]*)_(.*)$ http://www.example.com/$1-$2 [R=301,L]
However, this can get very slow at some point if you have a lot of underscores in the URLs. You'll have to find the right trade-off between the number of rules and external redirects versus the overhead of processing these rules for every request.
If only a particular type of file is named with the underscore convention, then the rules can and should be rewritten so that they are only invoked for those file types - the more selective, the better. There are numerous other tweaks you can do if you have a lot of underscores to replace, such as avoiding the (slow) external redirect until it is required. Here's an example that only checks .html files, and avoids the external redirect until the last step:
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)\.html$ $1-$2-$3-$4-$5.html [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*)_(.*)\.html$ $1-$2-$3.html [E=unscors:Yes]
RewriteRule ^([^_]*)_(.*)\.html$ $1-$2.html [E=unscors:Yes]
#
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule ^(.*)\.html$ http://www.example.com/$1.html [R=301,L]
Ref:
Apache mod_rewrite documentation [httpd.apache.org]
Apache URL Rewriting Guide [httpd.apache.org]
Regular-Expressions Tutorial [mnot.net]
Jim
[edited by: jdMorgan at 9:19 pm (utc) on June 24, 2004]
Eek. I really need to study regex... heheh.
Most of the URLs are in the form of www.domain.com/directory_name/file_name.html, there are a few that I went a little nuts on and ended up with www.domain.com/directory_name/really_long_file_name_with_keywords.html
Shall I just go RTFM and figure out for myself which of the rules you took the time to write would actually work best? ;)
Would it be most efficient to just use one rule for www.domain.com/directory_name/file_name.html, and then put the few exceptions in regular redirect format?
I like the idea of limiting it to .html files only. The 3-4 pdf files that ended up with underscored names wouldn't be too much to do a standard redirect for.
RTM is good, but you might just want to figure out what the maximum underscore count is, and provide for that.
> Would it be most efficient to just use one rule for www.domain.com/directory_name/file_name.html, and then put the few exceptions in regular redirect format?
Well, I'm really not sure. This depends on so many things about your site -- the mix of filetypes, and whether you can take advantage of the directory structure. If all the files that need to be rewritten are of a certain type, or are contained in a limited number of directories, you can use that to minimize the performance impact of the rewrites.
> I like the idea of limiting it to .html files only. The 3-4 pdf files that ended up with underscored names wouldn't be too much to do a standard redirect for.
The focus here is more toward NOT running the rule when it is NOT needed. So, in this case, just skip all four of the rules unless .html and/or .pdf filetypes are requested:
RewriteRule !\.(html¦pdf)$ - [S=4]
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ $1-$2-$3-$4-$5 [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*)_(.*)$ $1-$2-$3 [E=unscors:Yes]
RewriteRule ^([^_]*)_(.*)$ $1-$2 [E=unscors:Yes]
#
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
I believe in being efficient both with the code and with my time, but more with my time. So, I'll trade off CPU time for my own time. Occasionally, I'll come up with some half-baked idea that cripples the server, and then I go back and rewrite the code to make it much more efficient. So again, it all depends on your site, your server, what the load is now, and how much load the new code adds. A pragmatic approach is to write the code in the simplest way possible and then test. If your server falls to its knees, then rewrite for better performance.
I don't mean to scare anyone here; mod_rewrite is certainly not any less efficient than any of the server-side scripting languages in common use on many sites, and there are lots of sites out there with thousands of lines of complex scripts running for each request. But be aware that mod_rewrite code is going to be executed for each and every HTTP request that accesses a file in or below the directory where the code resides. So it's always good to limit the code's execution to certain circumstances if those are easily identifiable; in this case, we make it skip execution for anything except html and pdf files -- no use running it for each and every gif and jpg file on your site! If the code is only intended to affect requests for files in one (or a few) subdirectories, then consider putting the code *in* that subdirectory.
As always, change the broken vertical pipes "¦" in the code above to solid vertical pipes before use. I didn't test this code, so post again if you have trouble.
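On the point about putting the code in the subdirectory itself, here is a rough, untested sketch of what that might look like. The directory name "articles" is made up for illustration, and the sketch assumes the directory name itself contains no underscore. One thing to check before trying it: as far as I know, a subdirectory .htaccess with its own rewrite rules does not inherit the rules from the parent .htaccess unless RewriteOptions inherit is set, so this may interact with other rules on the site.

```apache
# Hypothetical file: /articles/.htaccess ("articles" is an assumed name).
# In a per-directory file the pattern is matched against the path relative
# to /articles/, e.g. "file_name.html", so the redirect target has to
# spell out the directory again.
Options +FollowSymLinks
RewriteEngine on
# Two or more underscores: replace two per redirect.
RewriteRule ^([^_]*)_([^_]*)_(.*)\.html$ http://www.example.com/articles/$1-$2-$3.html [R=301,L]
# Exactly one underscore left.
RewriteRule ^([^_]*)_(.*)\.html$ http://www.example.com/articles/$1-$2.html [R=301,L]
```

With this placement, requests outside /articles/ never reach these two rules at all, which is the whole point of moving them.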
Jim
[corrected as noted below]
[edited by: jdMorgan at 1:45 am (utc) on May 1, 2004]
RewriteRule (.*) [example.com...] [R=301,L]
Being dumb here... what does this line do? It looks to the untrained eye like it would send everyone to a single destination URL...
Basically, there are two directories I can think of offhand that have underscored files. One is the pdf directory, and I'm going to leave that to a regular redirect in an .htaccess file in that specific directory (There are hundreds of technical pdfs on the site, and only three of them have underscored names). Heck, now that I think about it, I might just not mess with the PDFs at all...
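For reference, a "regular redirect" for a handful of PDFs could be as small as the sketch below, using mod_alias rather than mod_rewrite. The directory and file names are invented for illustration; only the three real underscored names would need a line each.

```apache
# Hypothetical paths; substitute the real PDF names.
# Redirect (mod_alias) takes: status, old URL-path, new absolute URL.
# The old URL-path is written from the site root, even when this line
# lives in the .htaccess of the pdf directory itself.
Redirect 301 /pdf/data_sheet.pdf http://www.example.com/pdf/data-sheet.pdf
Redirect 301 /pdf/install_guide.pdf http://www.example.com/pdf/install-guide.pdf
```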
The underscored .html files are all in a single directory (I think), but the directory name itself has an underscore, so I'd need to put code in the site root directory to deal with that.
You mentioned using mod_speling if there was only one underscore involved... Could I use mod_speling in the root directory .htaccess, just to deal with the directory name, and then put the mod_rewrite multiple-underscore code in the directory itself, so the rest of the site didn't have to deal with processing it? I'm really trying to think of the simplest solution here, from the server processing load standpoint.
mod_rewrite is one of those things I deal with SO rarely that every time I go back to it, it's like starting all over again from the beginning. Your help is GREATLY appreciated.
When I get the Apache install on my laptop set up to my liking, I plan on actually drilling this stuff into my head... but I don't want to experiment with my own half-baked ideas on my employer's site. ;)
So I guess I'm back to putting all the rewrite code in the root directory, but the rule you wrote out doesn't seem TOO excessive.
>what does this line do? It looks to the untrained eye like it would send everyone to a single destination URL
Yes, it would redirect any request to itself (and create an infinite loop, too), except that it is preceded by a RewriteCond, and the condition must be met in order for the Rule to be invoked. In this case, the 'unscors' variable must be set to "Yes", which it won't be if one of the previous three rules has not already been invoked.
Also, note that all four rules are skipped for files which are not .html or .pdf type.
The initial three rules change underscores in the URI to hyphens, but they don't tell anyone -- they just change the URI string locally, and set 'unscors' to "Yes" to indicate that they changed the URI. The final rule is invoked in order to do an external redirect and give the client the new URI. Then, in accordance with the definition of an external redirect, the client will use the new URI to re-request the resource it asked for initially, but at the new address.
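To make that sequence concrete, here is the same rule set annotated with a walk-through for one request. The path directory_name/file_name.html is a hypothetical example, the pipe is shown as a solid vertical pipe, and the comments describe the intended behaviour rather than a tested run.

```apache
# Trace for a hypothetical request: directory_name/file_name.html

# The path ends in .html, so the negated pattern does not match
# and none of the following four rules is skipped.
RewriteRule !\.(html|pdf)$ - [S=4]

# Needs at least four underscores; this path has two, so no match.
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ $1-$2-$3-$4-$5 [E=unscors:Yes]

# Matches with $1=directory, $2=name/file, $3=name.html.
# The path becomes directory-name/file-name.html (internally only)
# and unscors is set to Yes.
RewriteRule ^([^_]*)_([^_]*)_(.*)$ $1-$2-$3 [E=unscors:Yes]

# No underscores are left, so no match.
RewriteRule ^([^_]*)_(.*)$ $1-$2 [E=unscors:Yes]

# unscors is "Yes", so the client gets a single 301 to
# http://www.example.com/directory-name/file-name.html
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

When the client comes back for directory-name/file-name.html, none of the three underscore rules match, 'unscors' is never set, and the final rule is not invoked, so there is no second redirect.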
This is pretty fancy small code and there are some nuances to it. However, the links above will tell you everything you need to know to understand it... I'm sure, because that's where I learned it! (That and a few hundred server crashes) :o </added>
If you're at the 10,000 visitors per day level, you might not even notice it. At a million, you would.
Try it and see how it goes. :)
Jim
Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule \.(html¦pdf)$ - [S=4]
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ $1-$2-$3-$4-$5 [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*)_(.*)$ $1-$2-$3 [E=unscors:Yes]
RewriteRule ^([^_]*)_(.*)$ $1-$2 [E=unscors:Yes]
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule (.*) [mysiteurl.com...] [R=301,L]
...and I'm getting 404s on the underscored URLs.
I'm not even sure where to start looking for a fix. Which is, of course, the problem with letting someone else do your homework for you. ;-)
Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*).html$ $1-$2-$3-$4-$5 [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*)_([^_]*).html$ $1-$2-$3 [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*).html$ $1-$2 [E=unscors:Yes]
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule (.*) [mysiteurl.com...] [R=301,L]
...and I got this impressive ever-repeating url, and a forbidden error. hehehehe.
Good thing I saved my old .htaccess file for quick replacements when I inevitably broke something. ;)
<added>The last chunk of code posted in msg.2 did the same thing... pretty neat, but not quite the effect I was looking for.</added>
[edited by: mivox at 10:32 pm (utc) on April 26, 2004]
You don't have permission to access
/directory-name/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html on this server.
Hmm... it IS rewriting them. It's just not stopping when it's done.
Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
RewriteRule ^([^_]*)_([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
RewriteRule ^([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
It's a combo of a suggestion from post two with .html limiting... I noticed it had "L" at the end of each line, so I figured it might fix the repeating url problem, which it did.
:) :) :)
Thanks SO much for your help!
I tested it here, and it works fine, with two qualifiers: First, I'm testing in .htaccess, and second, I commented-out the RewriteBase directive.
Options +FollowSymLinks
RewriteEngine on
#RewriteBase /
RewriteRule !\.(html¦pdf)$ - [S=4]
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ $1-$2-$3-$4-$5 [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*)_(.*)$ $1-$2-$3 [E=unscors:Yes]
RewriteRule ^([^_]*)_(.*)$ $1-$2 [E=unscors:Yes]
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
Jim
Brings to mind the quote from the Apache docs for it:
Despite the tons of examples and docs, mod_rewrite is voodoo. Damned cool voodoo, but still voodoo.
<added>
I can't comment out the RewriteBase directive, because it will break other rules I have in there. Right now, everything is working perfectly, so I'm reluctant to mess with it again. I don't want the mod_rewrite gods to feel I'm being ungrateful or anything... ;)
</added>
I'll concede that simplicity wins over elegance in this case, but I wish I could make it fail here to see how to make it more bullet-proof. I can't figure out why it doesn't work on your server; the only thing I can think of is that maybe your configuration prohibits setting 'private' environment variables... But I've never heard of such a thing on modern Apache versions.
Jim
Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
RewriteRule ^([^_]*)_([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
RewriteRule ^([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
And that will correct 1, 2, 3 or 4 underscores into hyphens with one redirect, and should fix URLs with more than 4 underscores with multiple redirects, for URLs leading to .html files.
<added>And if you can leave out the RewriteBase line, you should be able to use the much spiffier code provided by jdMorgan in his last post... Test his code on your server first, it's much nicer. ;) </added>