Apache Web Server Forum

    
Return 410 on a URI with a plus sign (+) ?
can it be done in .htaccess at all?
1script




msg:4498633
 5:26 pm on Sep 22, 2012 (gmt 0)

Hi all,

I am trying to return a 410 Gone HTTP code on all URIs containing plus signs. These are remnants of a terrible programming mistake in site search that ran amok a few years ago and led to the creation of 2M+ bogus URLs that Google keeps coming back to.

The bad URIs have this structure:

http://www.example.com/word1+word2-search.htm
http://www.example.com/word1+word2+word3-search.htm
...
and so on. I don't even know if there's a limit to the number of words. But in the simplest example, there would always be two words and a plus sign between them. I think that browsers (and Googlebot) treat the plus sign as a space break (%20) and therefore my server 301-redirects them to http://www.example.com/word1, which does not exist and results in a 404 code returned.

Because of that 301 before the final 404, Google still thinks the bogus URL exists and keeps coming for it.

I tried this:


RewriteCond %{REQUEST_URI} (.*)search\.htm [NC]
RewriteRule ^.*$ - [G,L]



It didn't work


RewriteCond %{REQUEST_URI} (.*)(\+)*(.*)search\.htm [NC]
RewriteRule ^.*$ - [G,L]


didn't work either. Neither did


RewriteCond %{REQUEST_URI} (.*)(\%20)?(.*)search\.htm [NC]
RewriteRule ^.*$ - [G,L]


By "didn't work" I mean that the server still behaves as if the rule did not exist.

So, my question is, how can I catch URLs containing a space break? Or how can I prevent the conversion of the plus sign to a space break so I can then catch it with an .htaccess rule?

Thanks!

 

g1smd




msg:4498662
 7:46 pm on Sep 22, 2012 (gmt 0)

Did you try just...

RewriteRule \+ - [G]


It's a rare occasion you'll need (.*) in a rule and you NEVER need more than one. If you feel you do, you're using the wrong pattern.

1script




msg:4498686
 10:47 pm on Sep 22, 2012 (gmt 0)

Thanks, g1smd. Do you mean just the rule without a condition? I would like to be sure only that particular pattern (with search.htm on the end) is affected by the rule. Just in case there's a valid URL among some 1000+ static URLs (very old site) that I would rather keep.

Actually, come to think of it: such static URL is most likely invalid, too: it would result in Apache trying to serve the part of the URL before the "+" as a file/directory name.

lucy24




msg:4498694
 11:34 pm on Sep 22, 2012 (gmt 0)

Do you mean just the rule without a condition?

Never put something into a condition when you can put it into the Rule itself.

Conceptually it's like this:

Rule {somestuff} {do what rule says}
Condition {"somestuff" includes "xyz"}

versus

Rule {xyz} {do what rule says}
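
In actual mod_rewrite syntax, the two forms might look like the sketch below. It assumes the rules sit in the site's root .htaccess and that the bogus URLs always end in "search.htm", as in the examples earlier in the thread; it has not been tested against this particular server. Only one of the two forms is needed.

```apache
# Form 1: catch-all rule, with the real test tucked into a condition.
# REQUEST_URI is the full path, including the leading slash.
RewriteCond %{REQUEST_URI} \+[^/]*search\.htm$ [NC]
RewriteRule .* - [G]

# Form 2: the same test in the rule pattern itself, no condition needed.
# In .htaccess the pattern is matched against the path without its leading slash.
RewriteRule \+[^/]*search\.htm$ - [NC,G]
```

The [G] flag returns 410 Gone and stops rule processing on its own, so a separate [L] is not required.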

1script




msg:4498721
 2:58 am on Sep 23, 2012 (gmt 0)

This has me stunned: adding this rule before anything else still does not change the behavior: given /word1+word2-search.htm it still 301 redirects to /word1 , then sends 404.

RewriteRule \+ - [G,L]

What gives? I thought having no conditions would surely make this match ANY URI with a plus sign in it. Is this hard-coded Apache 2 behavior? I've found one legitimate page with a plus sign in the URL; it also no longer works, although it worked fine in Apache 1.3.

Can anyone comment? Thanks!

1script




msg:4498728
 3:21 am on Sep 23, 2012 (gmt 0)

OK, I think I got it:

I needed to look at the entire request and not the URI. So, this works as expected:

RewriteCond %{THE_REQUEST} \+.*search\.htm
RewriteRule .* - [G]

lucy24




msg:4498750
 7:32 am on Sep 23, 2012 (gmt 0)

such static URL is most likely invalid, too: it would result in Apache trying to serve the part of the URL before the "+" as a file/directory name

This looked promising at first glance, but it turns out not to work; see below. (I did some more experimenting after I hit Submit, so I hope you don't read too fast :))

In any case I'm not wildly thrilled about the current form

%{THE_REQUEST} \+.*search\.htm

It will work, but it's not optimal. Constructions with non-final .* or .+ are never a good idea. The Regular Expression will find a plus and then whizz along all the way to the end of its search string-- in this case the whole request-- before screeching to a halt: "Oh, oops, I was supposed to leave room for a 'search.htm'". This isn't even the very end of the request; the search should have been stopped sooner. At an absolute minimum, constrain it to \S non-spaces.
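
A sketch of the constrained version, assuming the bogus URLs still end in "search.htm" and contain no literal spaces (untested on the original poster's server):

```apache
# THE_REQUEST holds the raw request line, for example:
#   GET /word1+word2-search.htm HTTP/1.1
# \S* cannot cross the space before "HTTP/1.1", so the scan is
# confined to the URL instead of running to the end of the request.
RewriteCond %{THE_REQUEST} \+\S*search\.htm
RewriteRule .* - [G]
```

The match is still greedy, but it can only backtrack within the URL itself.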

Does your site use query strings that might contain plusses? Do you have a real page whose name ends in "search.htm"? If not, all you need is a bare \+ in the Condition.

given /word1+word2-search.htm it still 301 redirects to /word1 , then sends 404

I don't get where the redirect is coming from. I tried some random requests containing plusses (don't be alarmed, webmaster, that was just me experimenting) on various domains including my own, and nothing extra happened. I got taken straight to the 404 page. So if you're getting redirected, there's code in place somewhere that's explicitly causing it to happen. Even if the plusses were converted to spaces-- which they aren't, in the body of an URL-- the request wouldn't be truncated. You can use spaces; they just don't travel well.

1script




msg:4498844
 3:45 pm on Sep 23, 2012 (gmt 0)

@lucy24: Thanks for your input, lucy24. I understand what you're saying about the greedy regex; I just wanted to be safe in case a "legitimate" URL with a plus sign in it existed on this site. Turns out, it does exist (I only had enough time to find one so far) but it also does not work, so in the end, I am going to simplify the regex and make it stop on the plus sign. Looks like there's no other way but to rename the file which now happens to be called '600+.shtml' into '600plus.shtml' and update the links to it. Not a big deal, at least not unless I find many more of those.

I don't get where the redirect is coming from.

This must be an internal Apache 2 feature. The URL with the plus sign used to work in Apache 1.3, and I have looked at both the .htaccess and the httpd.conf: there's nothing in either config that would trigger that redirect.

You can use spaces; they just don't travel well.

I don't believe actual spaces travel at all: the browser converts them to %20 or to "+", so Apache never sees an actual space. By the way, "%20" in a URL works just fine; it's only the plus sign that causes trouble.

Cheers!

P.S. I forgot to mention that a plus sign in the query string works fine, too, and always has. The issue was only ever about the part before the "?".
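
Since plus signs in the query string need to keep working, a condition anchored to the path part of the request line is one way to leave them alone. This is only a sketch: it assumes the usual "METHOD URL PROTOCOL" request line and drops the "search.htm" test, so it would catch any path containing a plus.

```apache
# [^?\s]* consumes path characters only, so the plus has to appear
# before any "?" for the condition to match.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s[^?\s]*\+
RewriteRule .* - [G]
```

A request such as /page.htm?q=word1+word2 would not match, because the pattern cannot step past the "?".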

g1smd




msg:4498847
 3:57 pm on Sep 23, 2012 (gmt 0)

You should read RFC 2616, the HTTP/1.1 specification.

It contains a list of characters that are not valid in the hostname, in the path and file name and in the query string parts of a URL.

Each part has a different list.

lucy24




msg:4499008
 11:11 pm on Sep 23, 2012 (gmt 0)

Looks like there's no other way but to rename the file which now happens to be called '600+.shtml' into '600plus.shtml' and update the links to it.

... and that's why g1 is always telling people to put the more specific rules before the more general. And it illustrates something I said in another thread just a few days ago: sometimes you have to put an individual Redirect before your Gone rules so it gets picked up in the right place.

Here, all you need to do is make a separate Rule addressing that one page, and then let your generic [G] rule pick up all the others. In the long term this is good, because the + should never have been there anyway ;)
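
A rough sketch of that ordering, using the THE_REQUEST approach that worked earlier in the thread. The assumptions here are that '600+.shtml' sits in the site root, that it has been renamed to '600plus.shtml' as described above, and that www.example.com stands in for the real hostname.

```apache
# 1. Specific rule first: send the one legitimate page to its new name.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/600\+\.shtml[?\s]
RewriteRule .* http://www.example.com/600plus.shtml [R=301,L]

# 2. General rule second: any other request with a plus in the path is Gone.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s[^?\s]*\+
RewriteRule .* - [G]
```

If the order were reversed, the generic [G] rule would answer the request for the old page before the redirect had a chance to run.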

If I cut-and-paste something containing spaces directly into Camino's address bar, the spaces will be quietly eaten. This can sometimes be useful, because I may have put in the spaces myself to work with the URL.

You should read RFC 2616, the HTTP/1.1 specification.

Funny you should say that. I was there-- or somewhere similar-- just yesterday because of this very thread :)

[w3.org...]
