htaccess: remove a string of characters and redirect to a specific URL

a question on htaccess or url redirect

     
1:36 am on Feb 1, 2013 (gmt 0)

New User

joined:May 21, 2012
posts: 8
votes: 0



i want to redirect:

http://www.example.com/example-folder/%E2%80%8Bvirginia.html

into:

http://www.example.com/example-folder/virginia.html

thus removing only this part, %E2%80%8B, from the first URL.

By the way, if the first link is pasted into Firefox, the characters don't change, but in Chrome the characters I want to remove become a square character.

thanks in advance!

----------------

This was suggested on another forum or help site, but it doesn't work:

RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} ^/example-folder/[^/]+virginia\.html/? [NC]
RewriteRule .* example-folder/virginia.html [R=301,L]



TIA!
3:01 am on Feb 1, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


That Mystery String is the Zero-Width Space (U+200B)-- which explains why you can't see it :) --not to be confused with* the Zero-Width No-Break Space (U+FEFF), which doubles as the Byte Order Mark (%ef%bb%bf). Both have caused trouble for many people over the years: try a quick Forums search and you'll see.

By the time the request path reaches a RewriteRule in your htaccess it will have been percent-decoded. So %E2%80%8B is really only one character (three bytes in UTF-8).

The suggested code you posted is way, way overkill. Does the problem occur only with this specific filename? If so, I smell a bad link somewhere. Or is it a generic issue and you've just illustrated with a random example?

If it is just one file-- or a limited number of files-- first step is to get their names down into the body of the Rule so your server doesn't have to stop and evaluate the conditions for every single request it ever gets for any file of any kind ever. Ever ;)


* Translation: I habitually get them mixed up myself. Sometimes, unfortunately, in forum posts that I can't edit later :(
4:04 am on Feb 1, 2013 (gmt 0)

New User

joined:May 21, 2012
posts: 8
votes: 0


Hello lucy24, thanks for pointing that out. This is a big lead for me: the BOM explains why htaccess can't read it, because it becomes a different character (the square character in this case).

The problem occurs in several different links: 5 for now, and growing each day, as Google Webmaster Tools reports an increasing number from the one particular site that links to ours.

regarding:
If it is just one file-- or a limited number of files-- first step is to get their names down into the body of the Rule so your server doesn't have to stop and evaluate the conditions for every single request it ever gets for any file of any kind ever. Ever ;)

Is there an actual solution you can suggest?

Again, thank you so much. This has been very enlightening :)
8:43 am on Feb 1, 2013 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:13210
votes: 347


Uh-oh, all one site? Are you on speaking terms with them? If so, sit down and see if you can pinpoint the problem. It's probably happening with their links to other sites too.

When these requests arrive at your site, one easy solution of course is to do nothing. It's their mistake, not yours. But if they are legitimate links that might have legitimate humans attached to them, you will want to do something about it.

%E2%80%8B is not-- luckily-- the BOM. It is "only" the zero-width space. (Q: Why would you want a zero-width space? A: It's the plain-text equivalent of HTML's new <wbr> tag, meaning "you can break here if you want to". It may also have meaning in scripts that use different letterforms depending on whether you're next to a word break.)

:: wandering off to refresh memory on how to deal with invisible characters in mod_rewrite ::

:: looking vaguely around for g1 or someone similar whose memory needs no refreshing ::
8:51 am on Feb 1, 2013 (gmt 0)

New User

joined:May 21, 2012
posts: 8
votes: 0


Hello lucy24, I struck gold: [webmasterworld.com...], the second solution in @mslina2002's post.

Yes, it's incoming and their fault. The IBLs (inbound links) are coming from an indexing site, and Google WMT reports them as 404s. ;)

Thanks again for the lead.
8:52 am on Feb 1, 2013 (gmt 0)

New User

joined:May 21, 2012
posts: 8
votes: 0


For those looking for a solution, check the thread I linked above, or use this as your guide (from mslina2002):

# for url like www.example.com/my-file-name/
# bad backlink site is doing this:
# string %E2%80%8B is randomly added to slug url
# fix will redirect back to www.example.com/my-file-name/
#
RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\%E2%80%8B(.*)\ HTTP/ [NC]
# %1 and %2 are the parts before and after the stray string; both are kept
RewriteRule ^.*$ http://www.example.com/%1%2 [R=301,L]
9:57 am on Feb 1, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Both renditions of (.*) must be changed to a more specific pattern.

(.*) is greedy, promiscuous and ambiguous. The first one eats the entire remainder of the input. The parser then discovers that after "everything" there's supposed to be more stuff. It then has to back off and retry using "trial matches" to find the % character. With two (.*) patterns, the input string might be reparsed several thousand times to find a match. The first (.*) should likely be ([^%]+) and the second probably ([^\ ]+) here.

Additionally, you can do this all in the Rule (replacing the ^.*$ pattern). There's no need for an additional condition in this case.
10:33 am on Feb 1, 2013 (gmt 0)

New User

joined:May 21, 2012
posts: 8
votes: 0


Hello g1smd, can you write it out? I'm concerned about what you said about it being resource-hungry. I might get banned by the host, or the server might shut down due to overuse of resources. Thanks.
2:21 am on Feb 5, 2013 (gmt 0)

New User

joined:May 21, 2012
posts: 8
votes: 0


Hello g1smd, this is what's in my htaccess now:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^%]+)\%E2%80%8B([^\ ]+)\ HTTP/ [NC]
RewriteRule ^.*$ http://www.example.com/%1%2 [R=301,L]

It works pretty well. Thanks for the heads up.
4:46 am on Feb 5, 2013 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:13210
votes: 347


Did you try it in the Rule? Does it not work that way? :( The pattern would be changed slightly, to read

^([^%]*)%E2%80%8B(.*)

\%E2%80%8B

Is that a typo? It's not clear to me why the first % would need escaping but not the other two-- especially since only those last two are at risk for ambiguity. (%8 can have meaning; %E can't.)
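
For the conditionless form, one thing to watch is that a RewriteRule pattern is matched against the percent-decoded path, so a literal %E2%80%8B in the pattern may never match there. A minimal sketch, assuming a root-level .htaccess, paths that are otherwise plain ASCII, and the www.example.com placeholder host used in this thread, writes the zero-width space as its three UTF-8 bytes instead:

```apache
RewriteEngine On
# Assumption: the RewriteRule pattern sees the decoded path, so the
# zero-width space appears as the bytes E2 80 8B, not as "%E2%80%8B".
# [^\xE2]* stops at the first E2 byte, so it only suits paths that are
# otherwise plain ASCII.
RewriteRule ^([^\xE2]*)\xE2\x80\x8B(.*)$ http://www.example.com/$1$2 [R=301,L]
```

With this in place, a request for /example-folder/%E2%80%8Bvirginia.html should be redirected to /example-folder/virginia.html. It is worth trying on a test copy of the site first.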
4:59 am on Feb 5, 2013 (gmt 0)

New User

joined:May 21, 2012
posts: 8
votes: 0


Hello lucy24, it works pretty well on the site.
I'm an htaccess noob, so I can't tell if it was a typo or intentional. I'll try it without the \ before the string and see how it goes.
8:04 am on Feb 5, 2013 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:13210
votes: 347


I meant, did you try putting the pattern into the Rule instead of detouring to a Condition? A conditionless RewriteRule is always the ideal. Second choice would be to look at %{REQUEST_URI} instead of %{THE_REQUEST}. Then at least you wouldn't have to keep track of literal spaces.
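
The %{REQUEST_URI} variant might look like the following sketch. It assumes %{REQUEST_URI} holds the percent-decoded path with its leading slash, so the zero-width space is written as byte escapes and there are no literal spaces or HTTP/ suffix to match. The host name is the thread's placeholder, and the [^\xE2]* group assumes paths that are otherwise plain ASCII.

```apache
RewriteEngine On
# Assumption: REQUEST_URI is the decoded path and starts with "/".
RewriteCond %{REQUEST_URI} ^([^\xE2]*)\xE2\x80\x8B(.*)$
# %1 and %2 are the parts before and after the zero-width space;
# %1 already carries the leading slash.
RewriteRule ^ http://www.example.com%1%2 [R=301,L]
```

The bare ^ in the rule matches any request, so the condition still runs for every request. That is why the conditionless form remains the first choice.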