Webmaster Tools is showing a bunch of 404s from scraper sites that are creating multiple incorrect links to our site. Normally I wouldn't worry too much about low-value links from scraper sites, but there's a bunch of them showing up in WMT and every little bit helps IMO.
The link looks like this - example.com/City·Category
A copy/paste shows this - example.com/City%C2%B7Category
Msg#: 4340811 posted 7:59 pm on Aug 9, 2011 (gmt 0)
The (.*) pattern says to match the whole URL request. The parser then has to back off one character at a time, trying a large number of trial matches to find what you actually meant, when the pattern needs to match only one character.
If there is only ever one single character to be matched here, then use >> . << or >> [^a-z] << or similar instead of >> .* << here.
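The difference between a greedy >> .* << and a single-character class can be sketched outside of .htaccess. This is a minimal illustration in Python, not the OP's actual rule (which isn't shown in the thread): both patterns and both function names are hypothetical, and the path is assumed to be already URL-decoded so that the mid-dot is a single character. The class is written [^A-Za-z] rather than [^a-z] because the sample path contains capital letters.

```python
import re

# Hypothetical patterns for illustration only; the OP's real rule is not shown.
# Greedy: each (.*) first swallows everything, then backtracks to find the separator.
GREEDY = re.compile(r"^(.*)[^A-Za-z](.*)$")
# Tight: letters, exactly one non-letter separator, letters. Nothing to backtrack over.
TIGHT = re.compile(r"^([A-Za-z]+)[^A-Za-z]([A-Za-z]+)$")


def split_greedy(path):
    """Return the two captured groups from the greedy pattern, or None."""
    m = GREEDY.match(path)
    return m.groups() if m else None


def split_tight(path):
    """Return the two captured groups from the tight pattern, or None."""
    m = TIGHT.match(path)
    return m.groups() if m else None


# Assumed sample path, already decoded (U+00B7 is the mid-dot).
print(split_greedy("City\u00b7Category"))  # ('City', 'Category')
print(split_tight("City\u00b7Category"))   # ('City', 'Category')

# With two separators the greedy pattern still matches something,
# while the tight pattern rejects the path outright.
print(split_tight("City\u00b7Sub\u00b7Category"))  # None
```

Both patterns agree on the simple case. The tight one just gets there without trial and error, and it refuses paths that don't have exactly the expected shape.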
What should the link look like? It is likely that the solution lies in adding a few more checks to the PHP script, not in adding rules to the .htaccess file.
Msg#: 4340811 posted 10:06 pm on Aug 9, 2011 (gmt 0)
"If there is only ever one single character to be matched here"
If it's been encoded, as the OP seems to imply, the original single character will come through as exactly six characters in the form %\h\h%\h\h, where \h stands for one hex digit; here that is %C2%B7. (DO NOT cut & paste that pattern. I don't think htaccess recognizes the \h notation; spell it out as [0-9A-Fa-f] instead.)
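The six-character form can be checked with a short Python sketch, using the standard urllib.parse functions. The variable names are made up for the example. It only illustrates the encoding itself. Whether a given rewrite rule sees the six-character form or the decoded bytes depends on which string the rule is tested against, so check that before relying on either form.

```python
import re
from urllib.parse import quote, unquote

MIDDOT = "\u00b7"  # the mid-dot character from the OP's link

# UTF-8 stores the mid-dot as two bytes, C2 and B7; each becomes one %XX escape.
encoded = quote(MIDDOT)
print(encoded)       # %C2%B7
print(len(encoded))  # 6

# Spelled-out version of the %\h\h%\h\h shorthand: two percent-encoded bytes.
TWO_ESCAPES = re.compile(r"%[0-9A-Fa-f]{2}%[0-9A-Fa-f]{2}")

raw = "City%C2%B7Category"  # the copy/paste form from the first post
print(TWO_ESCAPES.search(raw).group(0))  # %C2%B7
print(unquote(raw) == "City" + MIDDOT + "Category")  # True
```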
Never, ever use non-ASCII characters in a URL*, even if they seem to work fine and neither the validator nor the link checker objects. There are further restrictions**, but that's a good starting point. For a mid-dot · the obvious alternative is a hyphen -
* I mean of course when the domain name itself is ASCII. If it's in non-Roman script there will be different rules.
** g1 posted a useful link just recently, but I have already misplaced it :( Among other things, it explained all those mysterious + signs in raw logs.
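The hyphen substitution suggested above could be done in the script that handles the request. The thread mentions PHP. The sketch below uses Python only to show the idea, and the function name hyphenate_middots is hypothetical. It also assumes the correct URL really is the hyphenated one, which the OP hasn't confirmed.

```python
from urllib.parse import quote, unquote


def hyphenate_middots(raw_path: str) -> str:
    """Return raw_path with every mid-dot (raw or percent-encoded) replaced by a hyphen."""
    decoded = unquote(raw_path)             # %C2%B7 becomes the single character U+00B7
    fixed = decoded.replace("\u00b7", "-")  # swap the mid-dot for a plain ASCII hyphen
    return quote(fixed)                     # re-encode anything else that needs escaping


print(hyphenate_middots("/City%C2%B7Category"))  # /City-Category
```

If the result differs from the requested path, the script could answer with a 301 redirect to the cleaned-up URL instead of a 404.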