RewriteCond with special characters

     
2:33 pm on Jul 18, 2011 (gmt 0)

Preferred Member

10+ Year Member

joined:Feb 18, 2003
posts:629
votes: 0


Webmaster Tools is showing a bunch of 404s from scraper sites that are creating multiple incorrect links to our site. Normally I wouldn't worry too much about low-value links from scraper sites, but there's a bunch of them showing up in WMT and every little bit helps IMO.

The link looks like this -
example.com/City·Category

A copy/paste shows this -
example.com/City%C2%B7Category

I can't seem to get a proper condition match.
3:55 pm on July 18, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12714
votes: 244


Middle dot (U+00B7), UTF-8 bytes C2 B7. Have you tried %B7?
7:55 pm on Aug 9, 2011 (gmt 0)

Preferred Member

10+ Year Member

joined:Feb 18, 2003
posts:629
votes: 0


I couldn't get anything to work so I settled for:
RewriteCond %{REQUEST_URI} ^/City(.*)Category

Annoying because there are tons of cities and tons of categories showing up.
7:59 pm on Aug 9, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


The (.*) pattern is greedy: it first swallows the rest of the requested URL, and the parser then has to back off through a series of trial matches to find what you actually meant, when the pattern needs to match only one character.

If there is only ever one single character to be matched here, then use >> . << or >> [^a-z] << or similar instead of >> .* << here.

What should the link look like? It is likely that the solution lies in adding a few more checks to the PHP script, not in adding rules to the .htaccess file.
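
A minimal, untested sketch of the tighter pattern suggested above. It assumes the rule lives in the site's root .htaccess, that %{REQUEST_URI} holds the already-decoded path (so the mid dot arrives as the two bytes C2 B7 rather than as a single character), and that city and category names are plain ASCII letters. The hyphenated target is also an assumption; the thread never says what the correct URL looks like.

```apache
RewriteEngine On

# Assumption: names are ASCII letters only; adjust the classes to suit.
# [\x80-\xFF]{1,2} allows one or two non-ASCII bytes between the names,
# so the engine never runs to the end of the URL and backtracks as (.*) does.
RewriteCond %{REQUEST_URI} ^/([A-Za-z]+)[\x80-\xFF]{1,2}([A-Za-z]+)$
# Hypothetical target: the same two names joined by a hyphen.
RewriteRule ^ /%1-%2 [R=301,L]
```

Because the separator class excludes every ASCII byte, the redirected URL (with its hyphen) cannot match the condition again, which avoids a redirect loop.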
10:06 pm on Aug 9, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12714
votes: 244


g1smd: "If there is only ever one single character to be matched here"

If they've been encoded, as the OP seems to imply, the original single character will come through as exactly six characters in the form %\h\h%\h\h, where \h stands for one hex digit. (DO NOT cut & paste: \h does not mean "hex digit" in htaccess. Spell it out as [0-9A-Fa-f] instead.)
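
One way to test for the encoded form, offered as an untested sketch: %{THE_REQUEST} holds the raw request line (for example GET /City%C2%B7Category HTTP/1.1), which the Apache documentation describes as not unescaped, so the percent signs are still there to match. The hyphenated redirect target, and the assumption that neither name contains a slash or percent sign, are illustrative and not something the OP stated.

```apache
RewriteEngine On

# %C2%B7 is the percent-encoded mid dot; [NC] also accepts lowercase %c2%b7.
# To accept any two encoded bytes, use %[0-9A-Fa-f]{2}%[0-9A-Fa-f]{2} in its place.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/([^/?\s%]+)%C2%B7([^/?\s%]+)[?\s] [NC]
# Hypothetical target: /City-Category
RewriteRule ^ /%1-%2 [R=301,L]
```

The %1 and %2 in the substitution refer to the two groups captured by the RewriteCond, so a single pair of lines covers every city and category combination.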

Never, ever use non-ASCII characters in a URL*, even if they seem to work fine and neither the validator nor the link checker objects. There are further restrictions**, but that's a good starting point. For a mid-dot · the obvious alternative is a hyphen -


* I mean of course when the domain name itself is ASCII. If it's in non-Roman script there will be different rules.

** g1 posted a useful link just recently, but I have already misplaced it :( Among other things, it explained all those mysterious + signs in raw logs.