"GET /downloads//////non-modem/...
The number of slashes can vary, and so can the position. This is the code for httpd.conf that I think should fix it:
RewriteCond %{THE_REQUEST} ^(.*)(//+)(.*)
RewriteRule ^.* [my-site.com%1...] [L,R=permanent]
GET /downloads//non-modem?name1=value1&name2=value2 HTTP/1.1
Therefore, it is necessary to take the 'extra' stuff into account in the RewriteCond pattern:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ (.*)//+([^\ ]*)\ HTTP/
RewriteRule .* http://www.example.com%1/%2 [R=301,L]
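For anyone who wants to see what that pattern captures, here is a minimal Python sketch. It assumes Python's `re` module behaves like Apache's regex engine for this pattern; the backslash-escaped spaces from httpd.conf are written as plain spaces, and `split_request` is a hypothetical helper name, not anything Apache provides.

```python
import re

# The RewriteCond pattern from the post, transcribed for Python's re module.
# Assumption: Python's regex engine behaves like Apache's for this pattern.
THE_REQUEST_PATTERN = re.compile(r"^[A-Z]{3,9} (.*)//+([^ ]*) HTTP/")


def split_request(request_line):
    """Return (%1, %2) as the RewriteCond would capture them, or None."""
    match = THE_REQUEST_PATTERN.match(request_line)
    if match is None:
        return None
    return match.group(1), match.group(2)


line = "GET /downloads//non-modem?name1=value1&name2=value2 HTTP/1.1"
before, after = split_request(line)

# The redirect target path is assembled as %1/%2.
print(before + "/" + after)
```

The method and the trailing ` HTTP/1.1` stay outside the captures, so only the path and query string end up in the redirect target.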
Jim
My head is a little frazzled at this instant due to trying to transfer all my business stuff from one imminently-soon-to-break-down winXP computer to a new one (100+ GB of My Documents transferring across a 100Mb line as I type these words).
I knew I needed to get it checked as I typed it, hence this post.
I've spent hours removing duplicate pages in the past via both httpd.conf and PHP, but never expected that a `//+' would introduce endless other variations. My httpd.conf is now chock-a-block with defensive coding for situations that I would once have scoffed at. Getting rid of all the "GET /.?" (multiple question-marks) from my logfile notifications is next (at least that one's easy).
Many thanks, once again.
I'd finally had enough at that point and decided (wisely) to take the necessary time to repair all the incorrect links, which finally solved the problem. (Although it did take Google and MSN some time to stop requesting these multiple slash links.)
BTW, the majority of my faulty links were in fact blank links (<a href="">My phrase</a>) that I had intended to fill in during the initial content creation of pages and failed to complete.
Rats -- Darn greedy ".*" pattern strikes again... Another example of why I always avoid using it...
See if this works any better (I haven't tested any of this):
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ ((/[^/]+)*)//+([^\ ]*)\ HTTP/
RewriteRule .* http://www.example.com%1/%3 [R=301,L]
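A rough Python comparison of the two patterns shows what the greedy `.*` presumably did wrong. This is a sketch only: it assumes Python's `re` backtracks the same way Apache's engine does, and the names `GREEDY` and `SEGMENTED` are made up for the illustration.

```python
import re

# Both patterns transcribed from the posts above for Python's re module
# (assumption: Apache's regex engine backtracks the same way here).
GREEDY = re.compile(r"^[A-Z]{3,9} (.*)//+([^ ]*) HTTP/")
SEGMENTED = re.compile(r"^[A-Z]{3,9} ((/[^/]+)*)//+([^ ]*) HTTP/")

line = "GET /downloads//////non-modem HTTP/1.1"

# First version: target is %1/%2
greedy = GREEDY.match(line)
greedy_target = greedy.group(1) + "/" + greedy.group(2)

# Second version: target is %1/%3
segmented = SEGMENTED.match(line)
segmented_target = segmented.group(1) + "/" + segmented.group(3)

print(greedy_target)
print(segmented_target)
```

The greedy `.*` swallows all but the last two slashes, so each redirect strips only one slash from the run. The segment-by-segment version stops at the first empty path segment and lets `//+` consume the whole run.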
The wretched thing is converting embedded spaces, and triggering my hack-routines:
"GET /...Billionton/BCM2035%20Bluetooth..."
becomes
"GET /...Billionton/BCM2035%2520Bluetooth..."
Fortunately, I was able to remember something about no-escaping the URI-specials (`%' in this case) and found it within the ubiquitous Apache rewrite page [httpd.apache.org].
This is what it all became in the end:
#
# replace multiple `//' in REQUEST (there due to my bad coding in PHP)
# 2007-03-08 added -AK
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ ((/[^/]+)*)//+([^\ ]*)\ HTTP/
RewriteRule .* [my-site.com%1...] [L,NE,R=permanent]
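The double-escaping is easy to reproduce outside Apache. THE_REQUEST holds the request exactly as the client sent it, still percent-encoded, and without NE the redirect target is escaped once more. The Python sketch below uses `urllib.parse.quote` as a stand-in for that second escaping pass (an assumption for illustration, not Apache's actual routine), and the path is a shortened, made-up version of the one in the log line.

```python
from urllib.parse import quote

# Path as it arrives in the request line, already percent-encoded.
# (Hypothetical shortened path; the real one is elided in the log excerpt.)
requested = "/Billionton/BCM2035%20Bluetooth"

# Escaping it a second time also escapes the '%' itself, giving '%25'.
escaped_again = quote(requested, safe="/")

print(escaped_again)
```

With NE in the flags the second pass is skipped, so `%20` goes out unchanged.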
Dell & me are going to have words in the morning.
Good news: Drive is still under warranty for another 3+ years.
Bad news: I've got to buy another drive to get the data off the defective one before I send it in.
Get some rest... And then post and tell me why you still need "a generic solution" after your tweak to suppress double-escaped characters -- That is -- Is there some case you've found that still doesn't work (aside from the escaping issue)?
Jim
post and tell me why you still need "a generic solution"
One of the big issues for any website with Google is duplicate pages, as (from past experience) the site drops dramatically in the SERPs, and also recovers when the duplication is fixed. Duplication occurs big-time with this error - the maximum that I have counted so far is 7 `/'s!
Now, OK - the fix so far implemented fixes my site's specific problem. What about multiple `/'s in other places? Perhaps accidental fat-fingers from links on other sites, or whatever.
It therefore occurred to me that, at the same time that I fixed this immediate problem, it would be good to be able to put in place a generic fix for *any* "//+" at any location. That would also be more widely useful.
Now, if you say "that's a silly thing to have to guard against" I agree entirely. However, you should see some of the stuff that I've had to defensively code against. It's quite ludicrous. [this para not aimed at you, Jim; I suspect you know this better than I do]
PS
The Dell story so far: no hidden partition on the drive (full Windows CD, drivers etc provided), but instead partitioned with a 'Backup' drive, dropping the 160GB drive to effective 108GB. Buyer beware.
Dell and me are currently negotiating to get a solution. Easy, really - if they do not sort it by the end of the day, then I do a chargeback on the credit-card.
The intent of the code I posted was to handle multiple slashes occurring anywhere in the requested URL, although only one occurrence of multiple slashes can be corrected by a single redirect. As I said, I did not test the code I posted -- Does it not work this way?
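To illustrate the "one occurrence per redirect" point, here is a small Python simulation. It is a sketch under assumptions: Python's `re` stands in for Apache's regex engine, `redirect_chain` is a hypothetical helper, and the path `/a//b///c//d` is invented for the example.

```python
import re

# The RewriteCond pattern from the earlier post, transcribed for Python.
SEGMENTED = re.compile(r"^[A-Z]{3,9} ((/[^/]+)*)//+([^ ]*) HTTP/")


def redirect_chain(path):
    """Follow simulated redirects until the pattern no longer matches."""
    chain = [path]
    while True:
        match = SEGMENTED.match("GET " + chain[-1] + " HTTP/1.1")
        if match is None:
            return chain
        # Each redirect rebuilds the path as %1/%3.
        chain.append(match.group(1) + "/" + match.group(3))


for step in redirect_chain("/a//b///c//d"):
    print(step)
```

Each pass collapses the first run of slashes only, so a URL with three separate runs takes three redirects to clean up.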
As far as "defensive coding," yes, I'm well-aware: Perhaps you missed this thread [webmasterworld.com] in our library. :)
The backup partition you refer to is a standard Dell thing; They use it to store an image of the OS installation and driver CDs, so they don't have to ship you CDs, and also to keep backups of initial/factory registry settings, etc. They will likely point you to the fine print in the catalog or on-line system description that notes that this space is reserved and unavailable. Enterprise-class 160GB hard drives are available for $54 online, and it's easy to install a secondary hard drive... ;)
Jim
The intent of the code ... handle multiple slashes occurring anywhere
Now, I am confused.
I thought that the "((/[^/]+)*)" bit would stop at the first sub-directory slash. Obviously not. Let's see if I can sort it for myself:
(
(/ # root directory slash
[^/] # anything NOT a slash
+) # 1 or many
*) # 0 or many
Nope, still cannot understand. By my reasoning, that should stop at the first slash (even though the evidence is that it does not). Sigh.
Ah! (understanding suddenly dawns): the inner () says "/{anything-not-slash}", and the outer () says "repeated until it hits `//+'". Yes! Got it!
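The same insight can be checked in a few lines of Python. This is a sketch that assumes Python's `re` treats nested, repeated groups the way Apache's engine does; the request path is invented for the example.

```python
import re

# The RewriteCond pattern from the thread, transcribed for Python's re module.
SEGMENTED = re.compile(r"^[A-Z]{3,9} ((/[^/]+)*)//+([^ ]*) HTTP/")

match = SEGMENTED.match("GET /downloads/drivers/modems//non-modem HTTP/1.1")

outer = match.group(1)  # %1: every "/segment" before the slash run
inner = match.group(2)  # %2: only the last repetition of the inner group
rest = match.group(3)   # %3: everything after the slash run

print(outer, inner, rest)
```

The outer group keeps the whole repeated stretch, while the inner group is overwritten on each repetition and ends up holding just the last segment. That is why the RewriteRule uses %1 and %3 and skips %2.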
Dell:
Correct about the contents of the partition. The point, however, is that it is 37+ GB, whilst containing just 65MB. The rest is simply wasted (inaccessible, never used by the system).
Dell have agreed to replace the computer. I am to receive a phone call on Monday. 250 GB should do it, although I may go higher.
What I want to know is: where has the rest gone? My current IDE 160 GB has formatted to 152 GB. 108 + 37 = 145 GB. I make that 7 GB missing. Good Lord, I used to dream of having a 30MB hdd, let alone 7GB!
Jim
As far as "defensive coding," yes, I'm well-aware: Perhaps you missed this thread [webmasterworld.com] in our library. :)
Ah well, you will be able to remove 4 lines from that coding, now.
I've discovered that Apache 2.0.52 has the same bug as Apache 1.3.x. [archive.apache.org]
`ExtendedStatus On' exposes many $_SERVER variables in PHP.
$_SERVER[ 'QUERY_STRING' ] is normally urlencoded, but use of mod_rewrite urldecodes it. That leads to at least 2 undesirable effects:
$_SERVER[ 'QUERY_STRING' ], $_GET, $_REQUEST and parse_str() (not $_POST, and I cannot recall the situation for $_COOKIE, although I think that it *is* affected).
This points to a bug within the mod_rewrite urldecode/urlencode sequencing, and in that sense the 2 may be connected.
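One general reason an early urldecode of the query string is harmful can be sketched in Python. This is not a reproduction of the Apache behaviour described above, only an illustration of what happens when any query string is decoded before it is parsed; the query string itself is hypothetical.

```python
from urllib.parse import parse_qs, unquote

# Hypothetical query string; the value of name1 contains an encoded '&' and '='.
raw_query = "name1=a%26b%3Dc&name2=value2"

# Parsing the still-encoded string keeps the value of name1 intact.
parsed_raw = parse_qs(raw_query)

# Decoding first turns the encoded '&' and '=' into real separators,
# so the same request now appears to carry an extra parameter 'b'.
parsed_decoded = parse_qs(unquote(raw_query))

print(parsed_raw)
print(parsed_decoded)
```

Once the encoded separators are decoded too early, a parser can no longer tell them apart from the real ones.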