Forum Moderators: phranque

stripping nonsense in malformed inbound links

%e2%80%8b

         

crobb305

4:38 am on Mar 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am getting dozens of 404s in WMT from malformed IBLs, all appending %E2%80%8B before the filename.

www.example.com/%E2%80%8Bfilename.htm

I'd like to strip that gibberish out and do a 301 to the correct url. Can you guys help me form a redirect to eliminate that? I already have a rule that strips out spurious query strings, but gibberish is over my skill level.

Thanks for your help.

crobb305

4:45 am on Mar 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This was suggested, but I'm not sure I understand /the-path/to/the-page\.ext

RewriteEngine on
RewriteCond %{THE_REQUEST} !^[A-Z]{3,9}\ /the-path/to/the-page\.ext\ HTTP/1\.
RewriteRule the-path/to/the-page\.ext$ http://www.example.com/the-path/to/the-page.ext [R=301,L]

crobb305

2:02 am on Mar 17, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am still working on this problem and have thought of one POSSIBLE solution, but it's not working.

Google Webmaster Tools is reporting NUMEROUS 404s from three specific malformed URLs, all containing %E2%80%8B.

Since there are only three being reported (with dozens of 404s on each one -- given that Gbot won't stop trying to fetch), I was thinking of doing a 301 redirect (or even a 410). It is not working, however:

Redirect 301 /%E2%80%8Bfilename.htm /filename.htm

All my other redirects work just fine, but the %E2%80%8B is breaking it (or not being recognized in the htaccess). Should I be specifying those characters in another way?

I'm trying different things, and trying to understand what I can do to make this work. Granted, I'm probably obsessing too much over it, but I do not like Gbot to encounter excessive 404s, plus my site was hit by recent algorithm changes, so I am trying to cover ALL my bases. :)

jdMorgan

1:25 am on Mar 18, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's not gibberish, and even if it were, it makes no difference to mod_rewrite because mod_rewrite only deals with characters -- It ascribes no "meaning" to any of them.

Just detect the un-decoded "%E2%80%8B" substring prepended to any ".htm" page request, and redirect to strip it off:

RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /\%E2\%80\%8B(([^/?#\ ]+/)*([^.?#\ ]+\.)*htm([?#].*)?)\ HTTP/1\.
RewriteRule \.htm$ http://www.example.com/%1 [R=301,L]

Jim
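On the earlier "Redirect 301 /%E2%80%8Bfilename.htm" attempt: mod_alias compares its URL-path argument with the already %-decoded request path, which is the likely reason the literal text %E2%80%8B never matched. A minimal, untested sketch of a mod_alias alternative follows. It assumes Apache's regex engine accepts \xHH byte escapes, and filename.htm stands in for each of the three reported pages.

```apache
# Sketch only, not a tested rule.
# The decoded path contains the raw UTF-8 bytes of the zero-width space
# (E2 80 8B), so match those bytes instead of the text "%E2%80%8B".
RedirectMatch 301 ^/\xE2\x80\x8Bfilename\.htm$ http://www.example.com/filename.htm
```

One such line would be needed per affected page. The THE_REQUEST approach above handles every .htm page with a single rule, so this is mainly useful when only a handful of URLs are involved.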

crobb305

5:15 am on Mar 18, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Jim,

Thank you for helping me with that. It was gibberish to me since I am not familiar with Unicode. I am not sure why visitors on my site and Googlebot are encountering it. I see dozens of 404s from Googlebot on URLs with %E2%80%8Bfilename.htm, and I also see real visitors hitting 404s when they are navigating my site and then try to fetch filename%E2%80%8B.htm. I have checked all my internal links, and nowhere do I link with that. I am simply not sure where it's coming from. I can understand an external link being malformed with those characters, which is how Googlebot is discovering them, but for visitors to encounter it while navigating my site/internal links, I just don't know how it's happening.

jdMorgan

10:47 pm on Mar 28, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> but for visitors to encounter it while navigating my site/internal links, I just don't know how it's happening.

If you're saying that visitors are seeing these URLs, then look for internal rewrites preceding external redirects. Check your external-redirect RewriteRules. Check all scripts that can do redirects. Something is corrupting these URLs somewhere if they are being seen on your own site.

Jim

g1smd

11:08 pm on Mar 28, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Run Xenu LinkSleuth over your site and carefully study the reports.

crobb305

7:34 pm on Mar 30, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Following your suggestion above, I implemented an htaccess rule to strip %E2%80%8B from %E2%80%8Bfilename.htm, so those 404s are no longer being encountered by Googlebot. However, now that I have fixed that, brand new 404s from inbound links are being generated with the same %E2%80%8B, but now it is being placed in the middle of the filename (e.g., file%E2%80%8Bname.htm). It seems that someone is deliberately linking to my site with this. Is it possible to write a rule that strips out any occurrence of %E2%80%8B in the URL or filename?

Also, I just now saw your most recent replies, so I am scanning the site now for any internal link problems. Thank you for the tips. I haven't found anything yet.

mslina2002

5:43 am on Apr 2, 2011 (gmt 0)

10+ Year Member



I have also encountered the same thing on my site with a few of these IBL that are creating these errors on my site.

crobb, did you find out if this was being created from your server or is this bad linking done externally?

mslina2002

8:47 pm on Apr 2, 2011 (gmt 0)

10+ Year Member



I have a similar problem.

My IBLs also have the same string %E2%80%8B appended at random to my URL slugs, and they go after my queried datafeed pages as well, replacing the query delimiters with percent-encoded junk.

slug is appended by %E2%80%8B
? is replaced by %3F
= is replaced by %3D
+ is replaced by %2B

So I am trying to fix the url in two steps.
(STEP 1) Fix Queried datafeed pages
(STEP 2) Fix Slug urls


STEP 1
# for url like www.example.com/Dir/File.php?kw_search=Keyword1+Keyword2
# bad backlink site is doing this:
# ? is replaced by %3F
# = is replaced by %3D
# + is replaced by %2B
# fix will redirect back to www.example.com/Dir/File.php
#
RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\%3F(.*)\ HTTP/ [NC]
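# pattern left unanchored: the decoded path continues past ".php" with the decoded "?kw_search=..." text
RewriteRule \.php http://www.example.com/%1 [R=301,L]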

STEP 2
# for url like www.example.com/my-file-name/
# bad backlink site is doing this:
# string %E2%80%8B is randomly added to slug url
# fix will redirect back to www.example.com/my-file-name/
#
RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\%E2%80%8B(.*)\ HTTP/ [NC]
RewriteRule ^.*$ http://www.example.com/%1 [R=301,L]


I have looked through many sites on htaccess for what to do. My attempt to fix this was pieced together from code found all over, so I am not sure whether it is right, or whether this can all be done in one step.

HansJorgo

9:42 pm on Apr 2, 2011 (gmt 0)

10+ Year Member



%E2%80%8B is a Bing thing and appears in long (the visible) URLs on Bing's search result pages. Scrapers catch that URL and Google picks them up. You may see the %E2%80%8B part if you copy and paste such a long URL into your browser's address bar. I guess it functions as a kind of breakpoint for long URLs.

crobb305

12:52 am on Apr 3, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> crobb, did you find out if this was being created from your server or is this bad linking done externally?


These are being created externally. At one point, I thought I saw visitors encountering them internally, but I was mistaken. I reviewed my stats again, and I couldn't find what I thought I saw. I chalk that error up to a long day.
However, I still see Google trying to crawl from external links that have %E2%80%8B in the link. The discovery urls (as reported in WMT) are junk portals that provide Google search results, so I am surprised to hear above that this is a Bing thing. Incidentally, I am unable to duplicate the string when I go to those portals. If I perform a search, and see my url, I do not see this code.

The original form that I posted about, "%E2%80%8Bfilename.htm", was an easy fix with the help of g1smd and jdMorgan. However, now I am seeing the code being inserted at various places within the filename (not necessarily at the beginning or the end). Example: File%E2%80%8Bname.htm

I have been trying to figure out a way to strip it out if found anywhere within the filename. I haven't been successful.

mslina2002

2:55 am on Apr 3, 2011 (gmt 0)

10+ Year Member



That is the same issue as mine, the string %E2%80%8B is also appended at random spots.

Try the code in STEP 2. It's working for me so far.

RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\%E2%80%8B(.*)\ HTTP/ [NC]
RewriteRule ^.*$ http://www.example.com/%1 [R=301,L]

crobb305

3:24 am on Apr 3, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's not working for me. It seems to strip out the code if it is found at the beginning of the filename, but not in the middle or at a hyphen/underscore.
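One likely reason for the mid-filename failure: the STEP 2 rule captures the text before the sequence as %1 and the text after it as %2, but its substitution uses only %1, so everything after the sequence is dropped from the redirect target. A hedged, untested sketch that keeps both halves is shown here. It assumes the rule lives in the .htaccess file of the document root and that Apache's regex engine accepts \xHH byte escapes.

```apache
RewriteEngine on
# Sketch only, not a tested rule.
# The RewriteRule pattern is matched against the %-decoded path, so
# \xE2\x80\x8B matches the raw bytes of the zero-width space wherever it sits.
# $1 = everything before the last occurrence, $2 = everything after it.
RewriteRule ^(.*)\xE2\x80\x8B(.*)$ http://www.example.com/$1$2 [R=301,L]
```

Each pass removes one occurrence. A URL carrying several of them would be cleaned by a short chain of 301s, one per occurrence.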

tech2734

10:03 pm on May 6, 2011 (gmt 0)

10+ Year Member



For those looking for a solution to this problem when it appears in the middle, see my post at [dtidata.com...]

lucy24

3:51 am on May 7, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Zero-width spaces again? Search for that exact phrase (within these forums, I mean, not in G###) and you'll see it cropping up over and over. For example from tedster on the same date this thread started:
> %E2%80%8B is the unicode sequence for a "zero width space". It has been implicated in several exploits of various kinds, and some email address obfuscation scripts insert it to hide the real address from email harvesters.

Don't quite like the sound of that "implicated".

For variety's sake, an even more recent poster found it in its decimal html clothes, &#8203; alias 200B.

g1smd

6:05 am on May 7, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If there is the possibility of there being more than one of those sequences in a URL, I would be tempted to rewrite it to a small PHP script that fixes it.

The "hook" in .htaccess should be one of the very first rules:

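# THE_REQUEST holds the raw, un-decoded request line; a RewriteRule pattern sees only the decoded path
RewriteCond %{THE_REQUEST} \%E2\%80\%8B [NC,OR]
RewriteCond %{THE_REQUEST} &8203;
RewriteRule !^my-special-script\.php$ /my-special-script.php [L]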


This is one of the rare occasions where a rewrite will be listed in .htaccess BEFORE the redirects. Normally we would want to avoid that, with rewrites AFTER redirects. However, when the PHP script is going to do all the work, the rewrite must be right at the beginning of all of the rules.


The chunk of PHP code that fixes the URL and sends the redirect response looks like this:

<?php
// Canonical hostname: prepend "www." only when it is missing.
$server_url = $_SERVER['HTTP_HOST'];
if (!preg_match('/^www\./i', $server_url)) {
    $server_url = 'www.' . $server_url;
}

// REQUEST_URI is the originally requested, un-decoded URL and already starts with "/".
$old_url = $_SERVER['REQUEST_URI'];
$new_url = str_ireplace('%E2%80%8B', '', $old_url); // case-insensitive: also catches %e2%80%8b
$new_url = str_replace('&8203;', '', $new_url);
$new_url = 'http://' . $server_url . $new_url;

// Send the permanent redirect and stop.
header('Location: ' . $new_url, true, 301);
exit;
?>


You can do any test and any manipulation you like in the PHP code. It's one of the few occasions when it might be better to do the work in PHP, not in .htaccess.

However, with the word "exploit" being mentioned I am not so sure I want the request to even reach the PHP part of the server.
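If the request should never reach PHP, one option is to let Apache answer by itself with the 410 that was floated earlier in the thread. The following is a minimal, untested sketch. It assumes that none of the affected URLs need to be preserved by a 301 and that Apache's regex engine accepts \xHH byte escapes.

```apache
RewriteEngine on
# Sketch only, not a tested rule.
# Any request whose decoded path contains the zero-width space
# (UTF-8 bytes E2 80 8B) is answered with 410 Gone; no script is invoked.
RewriteRule \xE2\x80\x8B - [G]
```

The trade-off is that a 410 passes nothing on to the clean URL, whereas the 301 approaches keep the inbound link pointing at a working page.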