Rewrite to fix multiple `//' in Requests

Glance over this rewrite code, please

         

AlexK

2:02 pm on Mar 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Mostly due to my own unfortunate coding mistake, my site is receiving requests--many from Google--which contain multiple forward slashes in the URI. A recent example is:
"GET /downloads//////non-modem/...

Talk about duplicate pages (they get a 200 response).

The number of slashes can vary, and so can the position. This is the code for httpd.conf that I think should fix it:

RewriteCond %{THE_REQUEST} ^(.*)(//+)(.*)
RewriteRule ^.* [my-site.com%1...] [L,R=permanent]

Having made one ars*hole mistake, I do not want to make another, and would welcome comments/corrections.

jdMorgan

2:45 pm on Mar 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



THE_REQUEST will look something like this for your example:
GET /downloads//non-modem?name1=value1&name2=value2 HTTP/1.1

This is the original request line sent by the client (e.g. browser or SE robot). It contains the HTTP method (GET), the requested URL including any query string, and the HTTP protocol version (HTTP/0.9, HTTP/1.0, HTTP/1.1, etc.).

Therefore, it is necessary to take the 'extra' stuff into account in the RewriteCond pattern:


RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ (.*)//+([^\ ]*)\ HTTP/
RewriteRule .* http://www.example.com%1/%2 [R=301,L]
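To see why the method and protocol have to be accounted for, here is a rough Python sketch that runs both patterns against the example request line. It only models the regex matching and the string substitution, not mod_rewrite itself; the helper name `redirect_target` is made up for illustration.

```python
import re

# Example THE_REQUEST value from the post above.
the_request = "GET /downloads//non-modem?name1=value1&name2=value2 HTTP/1.1"

# First attempt: nothing accounts for the method or the protocol,
# so group 1 swallows "GET " and group 3 swallows " HTTP/1.1".
naive = re.match(r"^(.*)(//+)(.*)", the_request)

# Revised pattern: skip the method, stop before " HTTP/".
anchored = re.match(r"^[A-Z]{3,9}\ (.*)//+([^\ ]*)\ HTTP/", the_request)


def redirect_target(match):
    """Model the substitution http://www.example.com%1/%2 (hypothetical helper)."""
    return "http://www.example.com" + match.group(1) + "/" + match.group(2)


print(repr(naive.group(1)))       # includes the HTTP method
print(repr(naive.group(3)))       # includes the protocol version
print(redirect_target(anchored))  # clean URL with a single slash
```

With the first pattern, %1 would carry "GET " into the redirect URL, which is the problem the revised RewriteCond avoids.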

You can avoid future problems by using the "Live HTTP Headers" extension for Firefox to test your code to be sure that it does what you want, and equally importantly, that it does nothing more than what you want.

Jim

AlexK

8:42 pm on Mar 8, 2007 (gmt 0)




Many thanks, Jim.

My head is a little frazzled at this instant due to trying to transfer all my business stuff from one imminently-soon-to-break-down winXP computer to a new one (100+ GB of My Documents transferring across a 100Mb line as I type these words).

I knew I needed to get it checked as I typed it, hence this post.

I've spent hours removing duplicate pages in the past via both httpd.conf and PHP, but never expected that a `//+' would introduce endless other variations. My httpd.conf is now chock-a-block with defensive coding for situations that I would once have scoffed at. Getting rid of all the "GET /.?" (multiple question-marks) from my logfile notifications is next (at least that one's easy).

Many thanks, once again.

wilderness

9:39 pm on Mar 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Alex,
I had a problem with this a while back.
Even though I added some rewrite lines in an attempt to stop Google (and later MSN) from crawling the multiple slashes, the repetition of lines increased to the point of five or six.

I'd finally had enough at that point and decided (wisely) to take the necessary time to repair all the incorrect links, which finally solved the problem. (Although it did take Google and MSN some time to stop requesting these multiple slash links.)

BTW, the majority of my faulty links were in fact blank links
(<a href="">My phrase</a>) that I had intended to fill in during the initial content creation of pages and failed to complete.

AlexK

9:53 pm on Mar 8, 2007 (gmt 0)




A quick follow-up to say that it works nicely. Takes 2 bites at the cherry to get from `///' to `/', but that is fine.

jdMorgan

10:50 pm on Mar 8, 2007 (gmt 0)




> Takes 2 bites at the cherry to get from `///' to `/'

Rats -- Darn greedy ".*" pattern strikes again... Another example of why I always avoid using it...

See if this works any better (I haven't tested any of this):


RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ ((/[^/]+)*)//+([^\ ]*)\ HTTP/
RewriteRule .* http://www.example.com%1/%3 [R=301,L]
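The "two bites" behaviour can be reproduced outside Apache. The sketch below, in Python rather than mod_rewrite, runs both patterns against a request with three slashes; the request line is an assumed example, and the string concatenation only imitates the %1/%2 and %1/%3 substitutions.

```python
import re

# Earlier pattern: greedy (.*) before the slash run.
GREEDY = re.compile(r"^[A-Z]{3,9}\ (.*)//+([^\ ]*)\ HTTP/")
# Revised pattern: whole "/segment" units before the slash run.
SEGMENTS = re.compile(r"^[A-Z]{3,9}\ ((/[^/]+)*)//+([^\ ]*)\ HTTP/")

request = "GET /downloads///non-modem HTTP/1.1"  # assumed example

g = GREEDY.match(request)
# (.*) grabs "/downloads/" and leaves only "//" for the slash run,
# so one of the three slashes survives into the result.
greedy_path = g.group(1) + "/" + g.group(2)

s = SEGMENTS.match(request)
# ((/[^/]+)*) cannot end on a slash, so "//+" consumes all three.
segments_path = s.group(1) + "/" + s.group(3)

print(greedy_path)
print(segments_path)
```

The greedy version still leaves a double slash after one pass, so a second redirect is needed; the segment-based version collapses the run in one pass.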

Jim

AlexK

2:56 am on Mar 9, 2007 (gmt 0)




Thanks yet again Jim - that does work in one shot... but also not.

The wretched thing is converting embedded spaces, and triggering my hack-routines:

"GET /...Billionton/BCM2035%20Bluetooth..."

becomes

"GET /...Billionton/BCM2035%2520Bluetooth..."

Fortunately, I was able to remember something about no-escaping the URI-specials (`%' in this case) and found it within the ubiquitous Apache rewrite page [httpd.apache.org].
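The double-escaping is easy to picture with a small Python analogy. This is not what mod_rewrite runs internally; it just shows what happens when a string that already contains %20 is percent-escaped a second time. The shortened path is a stand-in for the elided URL above.

```python
from urllib.parse import quote

# Stand-in for the already-encoded path in the request (shortened, hypothetical).
already_encoded = "/Billionton/BCM2035%20Bluetooth"

# Escaping again turns the "%" itself into "%25".
escaped_again = quote(already_encoded)

# Leaving the string untouched keeps the original "%20",
# which is the effect the no-escape flag is after.
left_alone = already_encoded

print(escaped_again)
print(left_alone)
```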

This is what it all became in the end:

#
# replace multiple `//' in REQUEST (there due to my bad coding in PHP)
# 2007-03-08 added -AK
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ ((/[^/]+)*)//+([^\ ]*)\ HTTP/
RewriteRule .* [my-site.com%1...] [L,NE,R=permanent]

That works fine with "GET /directory1///...", which is my current situation. I was hoping to get a generic solution, but I am far too tired at the moment to care.
PS:
Personal note: I'm amazed that I am able to function at all; I'm totally whacked. The 128 GB transfer got to 108 GB, then filled a supposedly-160-GB disk on the new computer (there is actually only 108 GB available).

Dell & me are going to have words in the morning.

jdMorgan

3:12 am on Mar 9, 2007 (gmt 0)




Believe it or not, I'm a few days behind you: A 160GB drive on my other machine went all pear-shaped and wobbly on me today, so I'll be transferring a bunch of data from that machine to another, and then to a replacement drive. Funny thing is, that drive is an "enterprise class" drive and yet, so far, every other drive I've ever owned has lasted far longer than it did (13 months)... I will be having words with the manufacturer soon myself.

Good news: Drive is still under warranty for another 3+ years.
Bad news: I've got to buy another drive to get the data off the defective one before I send it in.

Get some rest... And then post and tell me why you still need "a generic solution" after your tweak to suppress double-escaped characters -- That is -- Is there some case you've found that still doesn't work (aside from the escaping issue)?

Jim

AlexK

1:45 pm on Mar 9, 2007 (gmt 0)




jdMorgan:
post and tell me why you still need "a generic solution"

The specific error that my coding glitch introduced adds extra `/'s after the first directory. Thus:
  • "GET /downloads///some-other-directory/..."

So, the rewrite code above fixes that quite nicely. However...

One of the big issues for any website with Google is duplicate pages, as (from past experience) the site drops dramatically in the SERPs, and also recovers when the duplication is fixed. Duplication occurs big-time with this error - the maximum that I have counted so far is 7 `/'s!

Now, OK - the fix so far implemented fixes my site's specific problem. What about multiple `/'s in other places? Perhaps accidental fat-fingers from links on other sites, or whatever.

It therefore occurred to me that, at the same time that I fixed this immediate problem, it would be good to be able to put in place a generic fix for *any* "//+" at any location. That would also be more widely useful.

Now, if you say "that's a silly thing to have to guard against" I agree entirely. However, you should see some of the stuff that I've had to defensively code against. It's quite ludicrous. [this para not aimed at you, Jim; I suspect you know this better than I do]

PS
The Dell story so far: no hidden partition on the drive (full Windows CD, drivers etc provided), but instead partitioned with a 'Backup' drive, dropping the 160GB drive to effective 108GB. Buyer beware.

Dell and me are currently negotiating to get a solution. Easy, really - if they do not sort it by the end of the day, then I do a chargeback on the credit-card.

jdMorgan

4:27 pm on Mar 9, 2007 (gmt 0)




> ..."//" at any other location...

The intent of the code I posted was to handle multiple slashes occurring anywhere in the requested URL, although only one occurrence of multiple slashes can be corrected by a single redirect. As I said, I did not test the code I posted -- Does it not work this way?
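A quick way to check the "one occurrence per redirect" behaviour without touching the server is to simulate the redirect chain. The following is a Python sketch under that assumption; `follow_redirects` is a made-up helper that rebuilds a GET request line for each hop and applies the %1/%3 substitution by hand.

```python
import re

PATTERN = re.compile(r"^[A-Z]{3,9}\ ((/[^/]+)*)//+([^\ ]*)\ HTTP/")


def follow_redirects(path, limit=10):
    """Return every path visited until the pattern stops matching (hypothetical helper)."""
    hops = [path]
    for _ in range(limit):
        match = PATTERN.match("GET " + path + " HTTP/1.1")
        if match is None:
            break  # no slash run left, so no further redirect
        path = match.group(1) + "/" + match.group(3)
        hops.append(path)
    return hops


for hop in follow_redirects("/a//b///c"):
    print(hop)
```

Each hop collapses the first run of slashes it finds, wherever it sits in the path, so a URL with two separate runs needs two redirects.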

As far as "defensive coding," yes, I'm well-aware: Perhaps you missed this thread [webmasterworld.com] in our library. :)

The backup partition you refer to is a standard Dell thing; They use it to store an image of the OS installation and driver CDs, so they don't have to ship you CDs, and also to keep backups of initial/factory registry settings, etc. They will likely point you to the fine print in the catalog or on-line system description that notes that this space is reserved and unavailable. Enterprise-class 160GB hard drives are available for $54 online, and it's easy to install a secondary hard drive... ;)

Jim

AlexK

6:37 pm on Mar 9, 2007 (gmt 0)




jdMorgan:
The intent of the code ... handle multiple slashes occurring anywhere

Mea culpa. Tried it, and it does.

Now, I am confused.

I thought that the "((/[^/]+)*)" bit would stop at the first sub-directory slash. Obviously not. Let's see if I can sort it for myself:

(
(/ # root directory slash
[^/] # anything NOT a slash
+) # 1 or many
*) # 0 or many

Nope, still cannot understand. By my reasoning, that should stop at the first slash (even though the evidence is that it does not). Sigh.

Ah! (understanding suddenly dawns): the inner () says "/{anything-not-slash}", and the outer () says "repeated until it hits `//+'". Yes! Got it!

Dell:
Correct about the contents of the partition. The point, however, is that it is 37+ GB, whilst containing just 65MB. The rest is simply wasted (inaccessible, never used by the system).

Dell have agreed to replace the computer. I am to receive a phone call on Monday. 250 GB should do it, although I may go higher.

What I want to know is: where has the rest gone? My current IDE 160 GB has formatted to 152 GB. 108 + 37 = 145 GB. I make that 7 GB missing. Good Lord, I used to dream of having a 30MB hdd, let alone 7GB!

jdMorgan

9:12 pm on Mar 9, 2007 (gmt 0)




Start inside the parentheses: "((/[^/]+)*)" says, "match a slash, followed by one or more characters not a slash, and as many of those as you like -- zero or more, and then include the whole mess (everything that has matched so far) in back-reference %1." So any number of subdirectories are supported.
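The same breakdown can be written as a commented pattern and tried against a few paths. This is a Python sketch using `re.VERBOSE`, so the group numbering matches the RewriteCond but nothing else about Apache is modelled; `prefix_before_slash_run` is a hypothetical helper name.

```python
import re

PATTERN = re.compile(r"""
    ^[A-Z]{3,9}\    # HTTP method and the space after it
    (               # group 1 (%1): everything before the slash run
      (/[^/]+)*     #   zero or more "/segment" units
    )
    //+             # the run of two or more slashes
    ([^\ ]*)        # group 3 (%3): rest of the URL up to the next space
    \ HTTP/         # protocol marker
""", re.VERBOSE)


def prefix_before_slash_run(path):
    """Return what %1 would hold for this path, or None if there is no match."""
    match = PATTERN.match("GET " + path + " HTTP/1.1")
    return None if match is None else match.group(1)


print(repr(prefix_before_slash_run("/a/b/c//d")))  # several subdirectories
print(repr(prefix_before_slash_run("//x")))        # zero repetitions
print(repr(prefix_before_slash_run("/a/b")))       # nothing to fix
```

The outer group is what makes the whole prefix available as one back-reference; the inner group on its own would only hold the last segment matched.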

Jim

AlexK

3:13 pm on Mar 10, 2007 (gmt 0)




jdMorgan:
As far as "defensive coding," yes, I'm well-aware: Perhaps you missed this thread [webmasterworld.com] in our library. :)

Blimey! Talk about trying to teach Grandmother to suck eggs!

Ah well, you will be able to remove 4 lines from that coding, now.

I've discovered that Apache 2.0.52 has the same bug as Apache 1.3.x. [archive.apache.org]

The 2 are possibly not connected, but I've discovered my own bug with Apache2 and mod_rewrite re: PHP.

`ExtendedStatus On' exposes many $_SERVER variables in PHP. $_SERVER[ 'QUERY_STRING' ] is normally urlencoded, but use of mod_rewrite urldecodes it. That leads to at least 2 undesirable effects:
  1. `+' (`%2B' in REQUEST_URI) ends up as a space in $_REQUEST
  2. `&' (`%26' in REQUEST_URI) within query values breaks $_REQUEST parsing
    (spurious, and wrong, parameter-value pairs)

Explicitly, this affects $_SERVER[ 'QUERY_STRING' ], $_GET, $_REQUEST and parse_str() (not $_POST, and I cannot recall the situation for $_COOKIE, although I think that it *is* affected).
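The two effects listed above are what you would expect from any query-string parser that is handed an already-decoded string. The sketch below shows the same thing with Python's `urllib.parse` as an analogy; it is not PHP and not mod_rewrite, and the parameter names and values are invented for the example.

```python
from urllib.parse import parse_qs, unquote

# Invented query string with an encoded "&" and an encoded "+".
encoded = "name=AT%26T&op=1%2B1"

# Parsed while still encoded: both values survive intact.
as_sent = parse_qs(encoded)

# Decoded first, then parsed: "&" splits the value and "+" becomes a space.
decoded_early = parse_qs(unquote(encoded), keep_blank_values=True)

print(as_sent)
print(decoded_early)
```

The second result contains a spurious parameter and a value with a space where the plus sign was, matching the two symptoms described.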

This points to a bug within the mod_rewrite urldecode/urlencode sequencing, and in that sense the 2 may be connected.