
Home / Forums Index / Code, Content, and Presentation / Apache Web Server
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
htaccess remove a string of characters and redirect to a specific url
a question on htaccess or url redirect
webpilotz2




msg:4541162
 1:36 am on Feb 1, 2013 (gmt 0)


i want to redirect:

http://www.example.com/example-folder/%E2%80%8Bvirginia.html

into:

http://www.example.com/example-folder/virginia.html

thus only removing this part: %E2%80%8B from the first url

by the way, if the first link is pasted in firefox, the characters don't change. but in chrome the characters i want to remove become a square character.

thanks in advance!

----------------

on another forum or help site this was suggested but doesnt work:

RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} ^/example-folder/[^/]+virginia\.html/? [NC]
RewriteRule .* example-folder/virginia.html [R=301,L]



TIA!

 

lucy24




msg:4541172
 3:01 am on Feb 1, 2013 (gmt 0)

That Mystery String is the Zero-Width Space-- which explains why you can't see it :) --not to be confused with* the Zero-Width Nonbreaking Space which doubles as the Byte Order Mark (%ef%bb%bf). Both have caused trouble for many people over the years: try a quick Forums search and you'll see.

By the time the request reaches your htaccess it will have been decoded. So it's really only one character.

The suggested code you posted is way, way overkill. Does the problem occur only with this specific filename? If so, I smell a bad link somewhere. Or is it a generic issue and you've just illustrated with a random example?

If it is just one file-- or a limited number of files-- first step is to get their names down into the body of the Rule so your server doesn't have to stop and evaluate the conditions for every single request it ever gets for any file of any kind ever. Ever ;)


* Translation: I habitually get them mixed up myself. Sometimes, unfortunately, in forums posts that I can't edit later :(
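
One way to read "get their names down into the body of the Rule" is sketched below. It is only a sketch under stated assumptions: the file is a per-directory .htaccess in the document root, "maryland" is a made-up second filename standing in for the other affected pages, and the server's regex engine compares the decoded path byte by byte, so the zero-width space can be written as its three UTF-8 bytes. Test it against a real bad URL before relying on it.

```apache
RewriteEngine On

# Only "virginia" comes from the thread; "maryland" is a hypothetical placeholder.
# A RewriteRule pattern is matched against the decoded path, so the zero-width
# space is written as its three UTF-8 bytes (E2 80 8B), not as the text %E2%80%8B.
# Assumption: the pattern is matched bytewise (no UTF-8 mode).
RewriteRule ^example-folder/\xE2\x80\x8B(virginia|maryland)\.html$ http://www.example.com/example-folder/$1.html [R=301,L]
```

Because the filenames sit in the pattern itself, requests for any other URL fail the match on the first few characters and no conditions are evaluated for them.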

webpilotz2




msg:4541179
 4:04 am on Feb 1, 2013 (gmt 0)

hello lucy24. thanks for pointing it out. this is a big lead for me regarding the BOM. that explains why htaccess cant read it: it becomes a different character (the square character in this case).

the problem occurs in different links: 5 for now, and growing each day. google webmaster tools reports an increasing number from one particular site that links to ours.

regarding:
If it is just one file-- or a limited number of files-- first step is to get their names down into the body of the Rule so your server doesn't have to stop and evaluate the conditions for every single request it ever gets for any file of any kind ever. Ever ;)

any actual solution you can suggest?

again thank you so much, been very helpful in enlightening :)

lucy24




msg:4541234
 8:43 am on Feb 1, 2013 (gmt 0)

Uh-oh, all one site? Are you on speaking terms with them? If so, sit down and see if you can pinpoint the problem. It's probably happening with their links to other sites too.

When these requests arrive at your site, one easy solution of course is to do nothing. It's their mistake, not yours. But if they are legitimate links that might have legitimate humans attached to them, you will want to do something about it.

%E2%80%8B is not-- luckily-- the BOM. It is "only" the zero-width space. (Q: Why would you want a zero-width space? A: It's the plain-text equivalent of HTML's new <wbr> tag, meaning "you can break here if you want to". It may also have meaning in scripts that use different letterforms depending on whether you're next to a word break.)

:: wandering off to refresh memory on how to deal with invisible characters in mod_rewrite ::

:: looking vaguely around for g1 or someone similar whose memory needs no refreshing ::

webpilotz2




msg:4541236
 8:51 am on Feb 1, 2013 (gmt 0)

hello lucy24. i struck gold: [webmasterworld.com...] the post of @mslina2002 second solution.

yes its incoming and their fault. the IBL's are coming from an indexing site and google WMT reports it as 404. ;)

thanks again for the lead.

webpilotz2




msg:4541237
 8:52 am on Feb 1, 2013 (gmt 0)

for those looking for a solution check the thread i posted before or make this your guide(from mslina2002)

# for url like www.example.com/my-file-name/
# bad backlink site is doing this:
# string %E2%80%8B is randomly added to slug url
# fix will redirect back to www.example.com/my-file-name/
#
RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\%E2%80%8B(.*)\ HTTP/ [NC]
RewriteRule ^.*$ http://www.example.com/%1 [R=301,L]

g1smd




msg:4541250
 9:57 am on Feb 1, 2013 (gmt 0)

Both renditions of (.*) must be changed to a more specific pattern.

(.*) is greedy, promiscuous and ambiguous. The first one eats the entire remainder of the input. The parser then discovers that after "everything" there's supposed to be more stuff. It then has to back off and retry using "trial matches" to find the % character. With two (.*) patterns, the input string might be reparsed several thousand times to find a match. The first (.*) should likely be ([^%]+) and the second probably ([^\ ]+) here.

Additionally, you can do this all in the Rule (replacing the ^.*$ pattern). There's no need for an additional condition in this case.

webpilotz2




msg:4541257
 10:33 am on Feb 1, 2013 (gmt 0)

hello g1smd, can you write it down? im concerned about what you said about it being resource hungry. i might get banned by the host, or the server might shut down due to overuse of resources. thanks

webpilotz2




msg:4542411
 2:21 am on Feb 5, 2013 (gmt 0)

hello g1smd, this is now whats on my htaccess:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^%]+)\%E2%80%8B([^\ ]+)\ HTTP/ [NC]
RewriteRule ^.*$ http://www.example.com/%1%2 [R=301,L]

works pretty well. thanks for the heads up.

lucy24




msg:4542426
 4:46 am on Feb 5, 2013 (gmt 0)

Did you try it in the Rule? Does it not work that way? :( The pattern would be changed slightly, to read

^([^%]*)%E2%80%8B(.*)

\%E2%80%8B

Is that a typo? It's not clear to me why the first % would need escaping but not the other two-- especially since only those last two are at risk for ambiguity. (%8 can have meaning; %E can't.)

webpilotz2




msg:4542431
 4:59 am on Feb 5, 2013 (gmt 0)

hello lucy24, it works in the site pretty well.
im an htaccess noob so cant tell if it was typo or intentional. will try without the \ before the string and see how it goes.

lucy24




msg:4542462
 8:04 am on Feb 5, 2013 (gmt 0)

I meant, did you try putting the pattern into the Rule instead of detouring to a Condition? A conditionless RewriteRule is always the ideal. Second choice would be to look at %{REQUEST_URI} instead of %{THE_REQUEST}. Then at least you wouldn't have to keep track of literal spaces.
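
For anyone wanting to see the two options side by side, here is a rough sketch rather than a tested answer. It assumes a root-level .htaccess, and it assumes that both the RewriteRule pattern and %{REQUEST_URI} see the already-decoded path, so the zero-width space is matched as the bytes E2 80 8B instead of the literal text %E2%80%8B that appears in %{THE_REQUEST}. Whether the \x escapes behave this way on a given server is something to verify.

```apache
RewriteEngine On

# Option 1: conditionless rule. Everything before the zero-width space is
# captured in $1 and everything after it in $2, then the two are rejoined.
# [^\xE2]* keeps the first group from running past the target bytes.
RewriteRule ^([^\xE2]*)\xE2\x80\x8B(.*)$ http://www.example.com/$1$2 [R=301,L]

# Option 2: the same test against REQUEST_URI (use instead of Option 1,
# not in addition to it). REQUEST_URI keeps its leading slash, and there
# are no literal spaces to escape as there are with THE_REQUEST.
# RewriteCond %{REQUEST_URI} ^/([^\xE2]*)\xE2\x80\x8B(.*)$
# RewriteRule ^ http://www.example.com/%1%2 [R=301,L]
```

One limitation of this sketch: if the path contains another multi-byte character that also starts with byte E2 before the zero-width space, the first group stops early and the rule simply does not match, so the request falls through unchanged.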

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved