
Apache Web Server Forum

RewriteCond with special characters

 2:33 pm on Jul 18, 2011 (gmt 0)

Webmaster Tools is showing a bunch of 404s from scraper sites that are creating multiple incorrect links to our site. Normally I wouldn't worry too much about low-value links from scraper sites, but there's a bunch of them showing up in WMT and every little bit helps IMO.

The link looks like this -

A copy/paste shows this -

I can't seem to get a proper condition match.



 3:55 pm on Jul 18, 2011 (gmt 0)

Mid dot, UTF-8 C2 B7. Have you tried %B7?
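
A minimal sketch of that suggestion, assuming the scraper links carry the mid dot percent-encoded. mod_rewrite decodes %{REQUEST_URI} before matching, so a literal %B7 in the pattern is tested here against %{THE_REQUEST}, the raw request line, instead. "City" and "Category" are stand-in names, not real paths, and the hyphenated redirect target is an assumption about what the correct URL looks like.

```apache
RewriteEngine On

# THE_REQUEST looks like: GET /City%C2%B7Category HTTP/1.1
# Accept either the UTF-8 form (%C2%B7) or the single-byte form (%B7).
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/City(?:%C2)?%B7Category [NC]
# Assumed target: the same path with a hyphen in place of the mid dot.
RewriteRule ^ /City-Category [R=301,L]
```

The [NC] flag is there so that lowercase hex (%c2%b7) matches as well.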


 7:55 pm on Aug 9, 2011 (gmt 0)

I couldn't get anything to work so I settled for:
RewriteCond %{REQUEST_URI} ^/City(.*)Category

Annoying because there are tons of cities and tons of categories showing up.


 7:59 pm on Aug 9, 2011 (gmt 0)

The (.*) pattern says to match everything up to the end of the requested URL. The parser then has to back off and try a long series of trial matches to find what you actually meant, when the pattern needs to match only one character.

If there is only ever one single character to be matched here, then use >> . << or >> [^a-z] << or similar instead of >> .* << here.
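
As a rough sketch of that narrower pattern, keeping the City and Category placeholders from the earlier post: the quantifier is bounded rather than a bare dot because, as far as I know, Apache matches byte by byte, and a UTF-8 mid dot arrives in the decoded %{REQUEST_URI} as two bytes (C2 B7). The {1,2} bound is an assumption, not something tested against the OP's links.

```apache
# Before: greedy, matches any run of characters between the two words.
# RewriteCond %{REQUEST_URI} ^/City(.*)Category

# After: one or two bytes that are not ASCII letters, digits, or a slash.
RewriteCond %{REQUEST_URI} ^/City[^a-zA-Z0-9/]{1,2}Category
# (the existing RewriteRule for this condition follows here, unchanged)
```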

What should the link look like? It is likely that the solution lies in adding a few more checks to the PHP script, not in adding rules to the .htaccess file.


 10:06 pm on Aug 9, 2011 (gmt 0)

"If there is only ever one single character to be matched here"

If they've been encoded--as the OP seems to imply--the original single character will come through as exactly six characters in the form %\h\h%\h\h. (DO NOT cut & paste-- I don't think htaccess recognizes this terminology).
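
In actual mod_rewrite syntax a hex digit can be written as [0-9A-Fa-f] (in PCRE, \h means horizontal whitespace, not a hex digit). A hedged sketch of the six-character form, tested against %{THE_REQUEST} because that variable is not decoded. The two captures and the hyphenated redirect target are assumptions for illustration, not the OP's real URL scheme.

```apache
RewriteEngine On

# Raw request line, e.g.: GET /Some%C2%B7Thing HTTP/1.1
# %1 = path before the encoded pair, %2 = path after it.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/([^%?\s]+)%[0-9A-Fa-f]{2}%[0-9A-Fa-f]{2}([^%?\s]+)
# Hypothetical fix-up: rejoin the two halves with a hyphen.
RewriteRule ^ /%1-%2 [R=301,L]
```

This matches any two consecutive percent-encoded bytes, not just the mid dot, so it would need tightening if other encoded characters are legitimate on the site.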

Never, ever use non-ASCII characters in a URL*, even if they seem to work fine and neither the validator nor the link checker objects. There are further restrictions**, but that's a good starting point. For a mid-dot · the obvious alternative is a hyphen -

* I mean of course when the domain name itself is ASCII. If it's in non-Roman script there will be different rules.

** g1 posted a useful link just recently, but I have already misplaced it :( Among other things, it explained all those mysterious + signs in raw logs.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved