Fixing Character Encoding

rainborick

4:32 pm on Jan 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I ran into a problem on a site that unfortunately used a space in several directory names. I recently started to repair a character encoding mismatch between the server, which defaulted to utf-8, and some meta tags that specified iso-8859-1. This seems to have caused links that had been using %20 for the space to be converted to %2520. So I tried this code in my .htaccess to redirect the bad URLs:

RewriteCond %{REQUEST_URI} ^(.*)%2520(.*)$
RewriteRule ^(.*)%2520(.*)$ http://www.example.com/$1%20$2 [R=301,L]

The true original URLs would have looked something like:

http://www.example.com/directory%20name/page.html

I'd appreciate any help in getting this corrected. Thanks!

g1smd

5:22 pm on Jan 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The %25 is the % of %20 encoded again.
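As a small illustration of that double encoding (a sketch only, using Python's standard urllib.parse rather than anything from the site in question; the path is a made-up example), escaping an already-escaped URL turns the % of %20 into %25:

```python
from urllib.parse import quote, unquote

# Hypothetical path containing a space, for illustration only.
path = "/directory name/page.html"

once = quote(path)    # the space becomes %20
twice = quote(once)   # the "%" of %20 is escaped again as %25, giving %2520

print(once)
print(twice)

# Decoding the doubly-encoded form once yields the singly-encoded form,
# not the original path with a literal space.
print(unquote(twice) == once)
```

Decoding once only peels off one layer, which is why a doubly-encoded link no longer points at the original directory.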

Your best bet is to replace the space with a hyphen and redirect directly to that format (i.e. don't redirect to URLs with a space any more).

RewriteRule ^([^\%]+)\%([0-9]+)(.*)$ http://www.example.com/$1-$3 [R=301,L]

[edited by: g1smd at 6:03 pm (utc) on Jan. 1, 2009]

rainborick

5:27 pm on Jan 1, 2009 (gmt 0)

Yes, I realize that the %25 is the %. And if I'd had any more success setting up the redirect from %20 to a hyphen, I'd be doing that now. So, if you can suggest the code for that redirect, I'll be very grateful.

jdMorgan

6:49 pm on Jan 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To avoid having your code 'fooled' by the pre-decoding that takes place when Apache 'builds' the %{REQUEST_URI} and internal req_rec variables, you need to examine the raw client request:

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(.*)%(25)*20([^\ ]*)\ HTTP/
RewriteRule . http://www.example.com/%1-%3 [R=301,L]

The RewriteCond here examines the HTTP request exactly as sent by the client, for example:
GET /directory%20name/page.html HTTP/1.1
- or -
GET /directory%2520name/page.html HTTP/1.1

Because the pattern matches the '25' sequence occurring zero or more times, this RewriteCond will handle both the singly-encoded %20 and multiply-encoded sequences such as %2520 or %252520.

Jim
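To see what the pattern in the post above captures, here is a rough Python transcription of it (an illustration only, not Apache itself; the function name redirect_target and the sample request lines are made up for this sketch). Groups 1 and 3 stand in for %1 and %3 in the RewriteRule:

```python
import re

# Python transcription of the RewriteCond pattern. The escaped spaces
# ("\ ") of the Apache version are written as plain spaces here.
THE_REQUEST = re.compile(r"^[A-Z]+ /(.*)%(25)*20([^ ]*) HTTP/")

def redirect_target(request_line):
    """Return the hyphenated path, or None if no encoded space is found."""
    m = THE_REQUEST.match(request_line)
    if m is None:
        return None
    # Groups 1 and 3 play the role of %1 and %3 in the RewriteRule.
    return "/" + m.group(1) + "-" + m.group(3)

# Hypothetical raw request lines, as a client might send them.
samples = (
    "GET /directory%20name/page.html HTTP/1.1",
    "GET /directory%2520name/page.html HTTP/1.1",
    "GET /directory%252520name/page.html HTTP/1.1",
    "GET /page.html HTTP/1.1",
)
for line in samples:
    print(line, "->", redirect_target(line))
```

Because the leading (.*) is greedy, a path with several encoded spaces has only its last one replaced per match, so such a URL would presumably take one redirect per space before it is fully hyphenated.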

g1smd

7:30 pm on Jan 1, 2009 (gmt 0)

Thanks jd. That was the bit that I had forgotten in all this. Good example.

rainborick

3:38 pm on Jan 2, 2009 (gmt 0)

Yer a prince. Thanks, jd!

rainborick

4:26 pm on Jan 2, 2009 (gmt 0)

Drat, it doesn't look like this server makes %{THE_REQUEST} available to users. I also tried %{HTTP_REQUEST}, and the redirect doesn't work. I'm running Apache 2.0.52, and when I ran a Perl script to dump the $ENV variables, neither %{THE_REQUEST} nor %{HTTP_REQUEST} appeared. It's on a server that I control. Can I modify the Apache configuration files to add this variable?

All things considered, I'm on the verge of just renaming the directories properly and letting the rest of the world (namely the search engines) discover the changes on their own.

g1smd

7:43 pm on Jan 2, 2009 (gmt 0)

What do you mean by "rename correctly"?

Do you mean change the names of the folders on the server itself?

Do you mean change the words in the URLs in links on your pages?

Either way, you should be doing both of those things. Neither the folder names nor the URLs should contain spaces.

The redirect is purely to show search engines what the new names are, if they continue to request the old names.

jdMorgan

9:30 pm on Jan 2, 2009 (gmt 0)

%{THE_REQUEST} may be a mod_rewrite-only variable name -- I'm not sure if it's externally available by that name. Also, be aware that if you are rewriting the request to your Perl var-dumper script, then the previously-set variables may get renamed to "REDIRECT_<varname>".

[added] %{HTTP_REQUEST} isn't a valid varname. That would probably be %{REQUEST_URI} you were looking for. But you must use %{THE_REQUEST} in the code I posted, and the pattern must be exactly as shown... No doubt about that. [/added]

Jim

[edited by: jdMorgan at 9:34 pm (utc) on Jan. 2, 2009]

rainborick

4:32 pm on Jan 4, 2009 (gmt 0)

Just in case it helps someone in the future, I fixed the problem as best I could. I renamed the directories using a hyphen, and updated all of the internal links that were the source of the %2520 problem. I then installed a redirect that handled replacing the original %20 with a hyphen to update the search engines (which was adequate because there were no significant external links to these URLs anyway). The redirect I used was:

RewriteCond %{REQUEST_URI} ^(.*)\ name/(.*)$
RewriteRule ^(.*)\ (.*)$ http://www.example.com/$1-$2 [R=301,L]

Since the bad old internal links that caused the %2520 problem are now updated to use a hyphen, I'm just going to let the search engines continue to see 404 errors on those. And I ran Xenu Link Sleuth on the site just to make sure everything was updated properly.

jdMorgan

4:37 pm on Jan 4, 2009 (gmt 0)

Since you're not looking for escaped spaces any more, you don't need the RewriteCond. And you can speed up the code quite a bit by using a negated character class to match everything up to the first space:

RewriteRule ^([^\ ]*)\ (.*)$ http://www.example.com/$1-$2 [R=301,L]

Jim

[edited by: jdMorgan at 4:38 pm (utc) on Jan. 4, 2009]
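
The two rule patterns from the last two posts can be compared outside Apache with a short Python sketch (an illustration under assumptions: the helper name hyphenate and the sample paths are invented here, and Python's re engine stands in for Apache's). The greedy (.*) first consumes the whole path and backtracks to the last space, while the negated class stops at the first space without backtracking:

```python
import re

greedy = re.compile(r"^(.*) (.*)$")       # pattern from the earlier rule
negated = re.compile(r"^([^ ]*) (.*)$")   # pattern with the negated class

def hyphenate(pattern, path):
    """Mimic the $1-$2 substitution; return None when there is no space."""
    m = pattern.match(path)
    if m is None:
        return None
    return m.group(1) + "-" + m.group(2)

# Hypothetical decoded paths, as a rule in .htaccess would see them.
one_space = "directory name/page.html"
two_spaces = "my old directory/page.html"

print(hyphenate(greedy, one_space))
print(hyphenate(negated, one_space))

# With more than one space, each pattern replaces a different one per pass.
print(hyphenate(greedy, two_spaces))
print(hyphenate(negated, two_spaces))
```

For a path with a single space both patterns give the same result, so the difference is only in how much work the regex engine does to find it.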