Replacing a single character with mod rewrite?

Want to get rid of underscores...

         

mivox

12:54 am on Apr 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Back when the 'underscores vs. hyphens in URLs' debate was still unsettled, I set up a large website using underscores to separate the words in directory and file names...

...since then, I've heard hyphens seem to have emerged the definite winner, and I'd like to replace the underscores with hyphens, but of course I don't want all our incoming links broken afterwards.

Instead of making a huge 302 redirect list for every underscored URL on the site, I'd like to put a single rule in my .htaccess that would rewrite incoming underscore-infested URLs to their new, hyphenated versions.

Is it possible to use mod rewrite to universally replace a single character (underscore) in ANY incoming URL with another character (hyphen), no matter where in the URL the character appears?

jdMorgan

2:54 am on Apr 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, it's possible. And you can simply use mod_speling if there is only one underscore in each URI.
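For reference, a minimal sketch of the mod_speling route. It assumes the module is loaded and that the host allows the directive in .htaccess; both are assumptions about the server, not something established in this thread. It is a plain on/off switch that applies to every request in scope, not only to underscores:

```apache
# Sketch only: needs mod_speling loaded and AllowOverride Options in effect.
# Corrects requests that differ from a real file name by a single character.
CheckSpelling On
```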

There is also a solution in mod_rewrite:


RewriteRule ^([^_]*)_(.*)$ http://www.example.com/$1-$2 [R=301,L]

...And you can expand that to handle more than one per external redirect, say:

RewriteRule ^([^_]*)_([^_]*)_(.*)$ http://www.example.com/$1-$2-$3 [R=301,L]
RewriteRule ^([^_]*)_(.*)$ http://www.example.com/$1-$2 [R=301,L]

for up to two per redirect, or

RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ http://www.example.com/$1-$2-$3-$4-$5 [R=301,L]
RewriteRule ^([^_]*)_([^_]*)_(.*)$ http://www.example.com/$1-$2-$3 [R=301,L]
RewriteRule ^([^_]*)_(.*)$ http://www.example.com/$1-$2 [R=301,L]

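for up to four per redirect, since [L] stops at the first rule that matches. (Chained as internal rewrites ahead of a single redirect, as in the last example in this post, the same three patterns handle up to seven.) Note that if you have a rule for two or more, you must also have a rule for one following it, so as not to be left with a lone straggler.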

However, this can get very slow at some point if you have a lot of underscores in the URLs. You'll have to find the right trade-off between the number of rules and external redirects versus the overhead of processing these rules for every request.

If only a particular type of file is named with the underscore convention, then the rules can and should be rewritten so that they are only invoked for those file types - the more selective, the better. There are numerous other tweaks you can do if you have a lot of underscores to replace, such as avoiding the (slow) external redirect until it is required. Here's an example that only checks .html files, and avoids the external redirect until the last step:


RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)\.html$ $1-$2-$3-$4-$5.html [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*)_(.*)\.html$ $1-$2-$3.html [E=unscors:Yes]
RewriteRule ^([^_]*)_(.*)\.html$ $1-$2.html [E=unscors:Yes]
#
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule ^(.*)\.html$ http://www.example.com/$1.html [R=301,L]

Ref:
Apache mod_rewrite documentation [httpd.apache.org]
Apache URL Rewriting Guide [httpd.apache.org]
Regular-Expressions Tutorial [mnot.net]

Jim

[edited by: jdMorgan at 9:19 pm (utc) on June 24, 2004]

mivox

7:09 pm on Apr 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You rock! :)

Eek. I really need to study regex... heheh.

Most of the URLs are in the form of www.domain.com/directory_name/file_name.html, but there are a few that I went a little nuts on and ended up with www.domain.com/directory_name/really_long_file_name_with_keywords.html

Shall I just go RTFM and figure out for myself which of the rules you took the time to write would actually work best? ;)

Would it be most efficient to just use one rule for www.domain.com/directory_name/file_name.html, and then put the few exceptions in regular redirect format?

I like the idea of limiting it to .html files only. The 3-4 pdf files that ended up with underscored names wouldn't be too much to do a standard redirect for.

jdMorgan

7:55 pm on Apr 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Shall I just go RTFM and figure out for myself which of the rules you took the time to write would actually work best?

RTM is good, but you might just want to figure out what the maximum underscore count is, and provide for that.

> Would it be most efficient to just use one rule for www.domain.com/directory_name/file_name.html, and then put the few exceptions in regular redirect format?

Well, I'm really not sure. This depends on so many things about your site: the mix of filetypes, and whether you can take advantage of the directory structure. If all the files that need to be rewritten are of a certain type, or are contained in a limited number of directories, you can use that to minimize the performance impact of the rewrites.

> I like the idea of limiting it to .html files only. The 3-4 pdf files that ended up with underscored names wouldn't be too much to do a standard redirect for.

The focus here is more toward NOT running the rule when it is NOT needed. So, in this case, just skip all four of the rules unless .html and/or .pdf filetypes are requested:


RewriteRule !\.(html¦pdf)$ - [S=4]
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ $1-$2-$3-$4-$5 [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*)_(.*)$ $1-$2-$3 [E=unscors:Yes]
RewriteRule ^([^_]*)_(.*)$ $1-$2 [E=unscors:Yes]
#
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

As noted previously, this code will handle up to seven underscores with one external redirect. If the URL *still* has underscores in it after the first seven are changed to hyphens, then the code will run again after the redirect, and up to another seven will be replaced and another external redirect will occur. I would recommend making the code handle the highest possible number of underscores with two external redirects maximum -- You don't want the search engine spiders to get bored with redirects and leave... :)

I believe in being efficient both with the code and with my time, but more with my time. So, I'll trade off CPU time for my own time. Occasionally, I'll come up with some half-baked idea that cripples the server, and then I go back and rewrite the code to make it much more efficient. So again, it all depends on your site, your server, what the load is now, and how much load the new code adds. A pragmatic approach is to write the code in the simplest way possible and then test. If your server falls to its knees, then rewrite for better performance.

I don't mean to scare anyone here; mod_rewrite is certainly not any less efficient than any of the server-side scripting languages in common use on many sites, and there are lots of sites out there with thousands of lines of complex scripts running for each request. But be aware that mod_rewrite code is going to be executed for each and every HTTP request that accesses a file in or below the directory where the code resides. So it's always good to limit the code's execution to certain circumstances if those are easily identifiable. In this case, we make it skip execution for anything except html and pdf files -- No use running it for each and every gif and jpg file on your site! If the code is only intended to affect requests for files in one (or a few) subdirectories, then consider putting the code *in* that subdirectory.
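As a sketch of that last suggestion, here is what a rule might look like when it lives in a subdirectory's own .htaccess. The directory name some-directory and the host www.example.com are placeholders, and the code is untested. In per-directory context the pattern is matched against the path relative to that directory, so the directory has to be written back into the redirect target:

```apache
# Hypothetical /some-directory/.htaccess; directory and host are placeholders.
# Assumes Options +FollowSymLinks is already in effect for this directory.
RewriteEngine on
# The pattern sees "file_name.html", not "some-directory/file_name.html".
# Replaces one underscore per redirect, in .html requests only.
RewriteRule ^([^_]*)_(.*)\.html$ http://www.example.com/some-directory/$1-$2.html [R=301,L]
```

Requests for files elsewhere on the site never reach this rule, which is the point of placing it there.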

As always, change the broken vertical pipes "¦" in the code above to solid vertical pipes before use. I didn't test this code, so post again if you have trouble.

Jim
[corrected as noted below]

[edited by: jdMorgan at 1:45 am (utc) on May 1, 2004]

mivox

8:31 pm on Apr 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



RewriteRule (.*) [example.com...] [R=301,L]

Being dumb here... what does this line do? It looks to the untrained eye like it would send everyone to a single destination URL...

Basically, there are two directories I can think of offhand that have underscored files. One is the pdf directory, and I'm going to leave that to a regular redirect in an .htaccess file in that specific directory (There are hundreds of technical pdfs on the site, and only three of them have underscored names). Heck, now that I think about it, I might just not mess with the PDFs at all...

The underscored .html files are all in a single directory (I think), but the directory name itself has an underscore, so I'd need to put code in the site root directory to deal with that.

You mentioned using mod_speling if there was only one underscore involved... Could I use mod_speling in the root directory .htaccess, just to deal with the directory name, and then put the mod_rewrite multiple-underscore code in the directory itself, so the rest of the site didn't have to deal with processing it? I'm really trying to think of the simplest solution here, from the server processing load standpoint.

mod_rewrite is one of those things I deal with SO rarely, that every time I go back to it it's like starting all over again from the beginning. Your help is GREATLY appreciated.

When I get the Apache install on my laptop set up to my liking, I plan on actually drilling this stuff into my head... but I don't want to experiment with my own half-baked ideas on my employer's site. ;)

mivox

8:45 pm on Apr 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



NM about the mod_speling thing. I just read up on it, and it doesn't sound like my half-baked idea about that would work at all. I was thinking I could specify one specific misspelling to 'fix' but it looks like it's just an on/off setting for everything... ;)

So I guess I'm back to putting all the rewrite code in the root directory, but the rule you wrote out doesn't seem TOO excessive.

jdMorgan

9:05 pm on Apr 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



<added> Missed this because of the double-post...

>what does this line do? It looks to the untrained eye like it would send everyone to a single destination URL

Yes, it would redirect any request to itself (and create an infinite loop, too), except that it is preceded by a RewriteCond, and the condition must be met in order for the Rule to be invoked. In this case, the 'unscors' variable must be set to "Yes", which happens only when at least one of the previous three rules has been invoked.

Also, note that all four rules are skipped for files which are not .html or .pdf type.

The initial three rules change underscores in the URI to hyphens, but they don't tell anyone -- they just change the URI string locally, and set 'unscors' to "Yes" to indicate that they changed the URI. The final rule is invoked in order to do an external redirect and give the client the new URI. Then, in accordance with the definition of an external redirect, the client will use the new URI to re-request the resource it asked for initially, but at the new address.
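To make that concrete, here is the same rule set annotated with a hand trace for a made-up request, /dir_name/long_file_name.html (three underscores), in a root-level .htaccess. The request path is hypothetical, the pipe is already a solid one, and the trace is worked by hand from the patterns, not captured from a server:

```apache
# Path as seen in a root .htaccess: dir_name/long_file_name.html
RewriteRule !\.(html|pdf)$ - [S=4]
#   ends in .html, so the negated pattern fails and nothing is skipped
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ $1-$2-$3-$4-$5 [E=unscors:Yes]
#   needs four underscores, only three present: no match
RewriteRule ^([^_]*)_([^_]*)_(.*)$ $1-$2-$3 [E=unscors:Yes]
#   match: path becomes dir-name/long-file_name.html, unscors=Yes
RewriteRule ^([^_]*)_(.*)$ $1-$2 [E=unscors:Yes]
#   match: path becomes dir-name/long-file-name.html
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#   unscors is set: redirect to http://www.example.com/dir-name/long-file-name.html
```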

This is pretty fancy small code and there are some nuances to it. However, the links above will tell you everything you need to know to understand it... I'm sure, because that's where I learned it! (That and a few hundred server crashes) :o </added>

If you're at the 10,000 visitors per day level, you might not even notice it. At a million, you would.

Try it and see how it goes. :)

Jim

mivox

9:51 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I put the following code in my .htaccess file:

Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule \.(html¦pdf)$ - [S=4]
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ $1-$2-$3-$4-$5 [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*)_(.*)$ $1-$2-$3 [E=unscors:Yes]
RewriteRule ^([^_]*)_(.*)$ $1-$2 [E=unscors:Yes]
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule (.*) [mysiteurl.com...] [R=301,L]

...and I'm getting 404s on the underscored URLs.

I'm not even sure where to start looking for a fix. Which is, of course, the problem with letting someone else do your homework for you. ;-)

mivox

10:20 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Changed it to this:
Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*).html$ $1-$2-$3-$4-$5 [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*)_([^_]*).html$ $1-$2-$3 [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*).html$ $1-$2 [E=unscors:Yes]
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule (.*) [mysiteurl.com...] [R=301,L]

...and I got this impressive ever-repeating url, and a forbidden error. hehehehe.

Good thing I saved my old .htaccess file for quick replacements when I inevitably broke something. ;)

<added>The last chunk of code posted in msg.2 did the same thing... pretty neat, but not quite the effect I was looking for.</added>

[edited by: mivox at 10:32 pm (utc) on April 26, 2004]

jdMorgan

10:23 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Doh!

Try this, replacing the first RewriteRule in the last version I posted:


RewriteRule !\.(html¦pdf)$ - [S=4]

I guess I should have made time to test that mess of code!

Jim

mivox

10:27 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Did that, and got this:

You don't have permission to access
/directory-name/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html/file-name.html/file-name.html/
file-name.html/file-name.html on this server.

Hmm... it IS rewriting them. It's just not stopping when it's done.

mivox

10:40 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This one worked:

Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
RewriteRule ^([^_]*)_([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
RewriteRule ^([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]

It's a combo of a suggestion from post two with .html limiting... I noticed it had "L" at the end of each line, so I figured it might fix the repeating url problem, which it did.

:) :) :)

Thanks SO much for your help!

jdMorgan

10:43 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmmm...

I tested it here, and it works fine, with two qualifiers: First, I'm testing in .htaccess, and second, I commented-out the RewriteBase directive.


Options +FollowSymLinks
RewriteEngine on
#RewriteBase /
RewriteRule !\.(html¦pdf)$ - [S=4]
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ $1-$2-$3-$4-$5 [E=unscors:Yes]
RewriteRule ^([^_]*)_([^_]*)_(.*)$ $1-$2-$3 [E=unscors:Yes]
RewriteRule ^([^_]*)_(.*)$ $1-$2 [E=unscors:Yes]
RewriteCond %{ENV:unscors} ^Yes$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

And, of course, I had to replace the broken vertical pipe in the 1st RewriteRule with a solid one.

Jim

jdMorgan

10:46 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



... And the problem with using external redirects with [R=301,L] is that this will only fix four, two, or one underscore per redirect, and it won't fix URLs with three underscores at all.

Jim

mivox

10:51 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just tested the code I quoted above with a three underscore redirect... it worked fine. Tested and working with 1, 2, 3 and 4 underscore urls.

Brings to mind the quote from the Apache docs for it:

Despite the tons of examples and docs, mod_rewrite is voodoo. Damned cool voodoo, but still voodoo.

<added>
I can't comment out the RewriteBase directive, because it will break other rules I have in there. Right now, everything is working perfectly, so I'm reluctant to mess with it again. I don't want the mod_rewrite gods to feel I'm being ungrateful or anything... ;)
</added>

jdMorgan

10:58 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, but it required two redirects to fix three underscores. If you want to use that approach, consider filling in between the four-underscore rule and the two-underscore rule with a three-underscore rule. The increase in overhead will be negligible, but the improvement in user experience will be much greater.
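A sketch of what that fill-in could look like, with www.example.com as a placeholder host and the .html restriction carried over from the version that worked. This is untested:

```apache
# Sketch only: www.example.com is a placeholder host.
RewriteEngine on
# Four or more underscores: fix the first four, then redirect.
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)\.html$ http://www.example.com/$1-$2-$3-$4-$5.html [R=301,L]
# Exactly three underscores: the added rule, fixed in one redirect.
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_(.*)\.html$ http://www.example.com/$1-$2-$3-$4.html [R=301,L]
# Exactly two underscores.
RewriteRule ^([^_]*)_([^_]*)_(.*)\.html$ http://www.example.com/$1-$2-$3.html [R=301,L]
# Exactly one underscore.
RewriteRule ^([^_]*)_(.*)\.html$ http://www.example.com/$1-$2.html [R=301,L]
```

Because each rule carries [L], only the first one that matches runs, so the order from most underscores to fewest matters.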

I'll concede that simplicity wins over elegance in this case, but I wish I could make it fail here to see how to make it more bullet-proof. I can't figure out why it doesn't work on your server; the only thing I can think of is that maybe your configuration prohibits setting 'private' environment variables... But I've never heard of such a thing on modern Apache versions.

Jim

mivox

11:01 pm on Apr 26, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's hosted on pair.com... They're pretty standard I think. Don't know why it was causing a problem.

But I'll add the three underscore rule too. Didn't think about the two redirects thing.

mivox

11:20 pm on Apr 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just to clarify for other site members, the final code I'm using now is:

Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
RewriteRule ^([^_]*)_([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]
RewriteRule ^([^_]*)_(.*)\.html$ [mysiteurl.com...] [R=301,L]

And that will correct 1, 2, 3 or 4 underscores into hyphens with one redirect, and should fix URLs with more than 4 underscores with multiple redirects, for URLs leading to .html files.

<added>And if you can leave out the RewriteBase line, you should be able to use the much spiffier code provided by jdMorgan in his last post... Test his code on your server first, it's much nicer. ;) </added>

winglian

9:50 pm on May 10, 2004 (gmt 0)

10+ Year Member



Is there a way to do this recursively using the [N]ext round in the rewrite? I am trying to replace commas with hyphens since Yahoo! escapes my commas in my links, but I don't want to lose the links they have spidered.

Thanks,
Wing

jdMorgan

2:37 am on May 11, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, you can use [N] if you like, but it can be horribly slow, since mod_rewrite goes back to the first mod_rewrite directive in your file and re-processes everything from there on each pass.

The method shown above using an environment variable is usually faster, but go ahead and try it (and let us know).
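A rough, untested sketch of the [N] variant for the comma case, assuming a root .htaccess and www.example.com as a placeholder host. The variable name commas is made up for this example. The first rule swaps one comma per pass and restarts the rule set, and the redirect only fires once no commas are left:

```apache
RewriteEngine on
# Replace the first comma internally, note that we did, and start over.
RewriteRule ^([^,]*),(.*)$ $1-$2 [N,E=commas:Yes]
# Reached only when the path has no commas left.
RewriteCond %{ENV:commas} ^Yes$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

Any rules placed above the first one would be re-run on every pass, which is where the slowness comes from.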

Jim