Return 410 on a URI with a plus sign (+)?


1script - 5:26 pm on Sep 22, 2012 (gmt 0)


Hi all,

I am trying to return a 410 Gone HTTP status for every URI that contains a plus sign. These URIs are remnants of a terrible programming mistake in our site search that ran amok a few years ago and created 2M+ bogus URLs that Google keeps coming back to.

The bad URIs have this structure:

http://www.example.com/word1+word2-search.htm
http://www.example.com/word1+word2+word3-search.htm
...
and so on. I don't even know if there's a limit to the number of words, but even the simplest example has two words with a plus sign between them. I think that browsers (and Googlebot) treat the plus sign as a space (%20), and therefore my server 301-redirects these requests to http://www.example.com/word1, which does not exist and returns a 404.

Because of that 301 before the final 404, Google still thinks the bogus URL exists and keeps coming back for it.

I tried this:


RewriteCond %{REQUEST_URI} (.*)search\.htm [NC]
RewriteRule ^.*$ - [G,L]



It didn't work.


RewriteCond %{REQUEST_URI} (.*)(\+)*(.*)search\.htm [NC]
RewriteRule ^.*$ - [G,L]


This didn't work either. Neither did:


RewriteCond %{REQUEST_URI} (.*)(\%20)?(.*)search\.htm [NC]
RewriteRule ^.*$ - [G,L]


By "didn't work" I mean that the server still behaves as if the rule did not exist.

So, my question is, how can I catch URLs containing a space break? Or how can I prevent the conversion of the plus sign to a space break so I can then catch it with an .htaccess rule?
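
For reference, here is a minimal sketch of one way to test the URL exactly as the client sent it, rather than the decoded path. It is untested on this site. It assumes that mod_rewrite is enabled, that these lines sit in the root .htaccess above whatever rule or module issues the 301, and that every bad URL ends in -search.htm as in the examples above.

```apache
RewriteEngine On

# THE_REQUEST holds the raw request line as received,
# e.g. "GET /word1+word2-search.htm HTTP/1.1", before any URL decoding.
# The condition matches a literal plus sign (or an encoded space, %20)
# anywhere in a path that ends in -search.htm, ignoring any query string.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/[^?\s]*(\+|\%20)[^?\s]*-search\.htm[?\s] [NC]

# The G flag returns 410 Gone and stops rewrite processing for this request.
RewriteRule ^ - [G]
```

If the 301 still appears with something like this in place, the redirect is presumably issued before these lines are reached, so their position relative to the other rules would be the next thing to check.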

Thanks!


Thread source: http://www.webmasterworld.com/apache/4498631.htm