I have duplicate content available on mydomain.com/index.html/whatever and mydomain.com/whatever.
I'm trying to add a rule to .htaccess to redirect any URL in the first format to the second.
I added this:
RewriteCond %{REQUEST_URI} index.html/.+
RewriteRule ^index.html/(.+)$ [mydomain.com...] [L,R=301]
And the first redirect works fine:
[mydomain.com...]
GET /index.html/whatever.html HTTP/1.1
HTTP/1.x 301 Moved Permanently
Server: Apache/1.3.33 (Unix) PHP/4.3.10 mod_perl/1.29
Location: [mydomain.com...]
But then the target URL (/whatever.html) keeps issuing a 301 redirect to itself.
Here is the whole .htaccess.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^mydomain.com [NC]
RewriteRule ^(.*)$ [mydomain.com...] [L,R=301]
RewriteCond %{REQUEST_URI} ^.+/$
RewriteCond %{REQUEST_URI} !(/index\.html)
RewriteRule ^(.+)/$ [mydomain.com...] [L,R=301]
RewriteCond %{REQUEST_URI} .asp/.+
RewriteRule ^(.+)\.asp/(.+)$ [mydomain.com...] [L,R=301]
RewriteCond %{REQUEST_URI} index.html/.+
RewriteRule ^index.html/(.+)$ [mydomain.com...] [L,R=301]
RewriteCond %{REQUEST_URI} !([whole list of exempted OR'd directories])
RewriteRule ^(.*)$ index.html/$1
RewriteCond %{REQUEST_URI} /admin
RewriteRule ^(.*)$ index.html?1=admin
Any help would be appreciated, I've been fighting with this one for weeks now! :)
I moved the domain canonicalization redirect to be the last external redirect, in order to avoid 'stacked' redirects in cases where both the domain and the URL need to be corrected. Stacked redirects won't pass PageRank.
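To illustrate the ordering argument, here is a rough Python sketch that counts how many external redirects a client would follow under each rule order. It is only a simulation under stated assumptions: Python's re module stands in for Apache's regex matching, only two of the rules are modeled, and the function names (strip_trailing_slash, canonical_host, count_redirects) are hypothetical helpers, not part of mod_rewrite.

```python
import re

def strip_trailing_slash(host, path):
    # Models: RewriteCond %{REQUEST_URI} !/index\.html
    #         RewriteRule ^(.+)/$ http://www.example.com/$1 [L,R=301]
    m = re.match(r"^(.+)/$", path)
    if m and not re.search(r"/index\.html", "/" + path):
        # The target hard-codes the www host, so the domain is fixed too.
        return "www.example.com", m.group(1)
    return None

def canonical_host(host, path):
    # Models: RewriteCond %{HTTP_HOST} ^example\.com [NC]
    #         RewriteRule (.*) http://www.example.com/$1 [L,R=301]
    if re.match(r"^example\.com", host, re.IGNORECASE):
        return "www.example.com", path
    return None

def count_redirects(rules, host, path, limit=10):
    """Follow redirects until no rule fires; return (hops, host, path)."""
    hops = 0
    while hops < limit:
        for rule in rules:
            target = rule(host, path)
            if target is not None:
                host, path = target
                hops += 1
                break  # [L]: stop at the first matching redirect
        else:
            return hops, host, path
    return hops, host, path

# Domain rule first: two stacked redirects for example.com/widgets/
domain_first = count_redirects([canonical_host, strip_trailing_slash],
                               "example.com", "widgets/")
# Domain rule last: the URL fix also corrects the host, so one redirect
domain_last = count_redirects([strip_trailing_slash, canonical_host],
                              "example.com", "widgets/")
```

In this model, a request for example.com/widgets/ takes two hops when the domain rule runs first and one hop when it runs last, because every URL-fixing rule already redirects to the www host.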
There were many instances of literal characters that needed to be escaped, and redundant RewriteConds. Also, the rule order was non-optimal and, in one case, possibly incorrect. I have left these lines in place, but commented them out. You may remove them after evaluating the changes.
To the extent that I could infer intent, I added comments about the function of each rule. Check these, because if I was wrong about the intent, then the code might also be wrong.
Anyway, here is how I would code this:
RewriteEngine On
#
# Redirect to remove trailing slash unless URI starts with "index.html"
# Delete first RewriteCond. It is broken due to a missing leading
# slash, and would be redundant even if it wasn't broken.
# No parentheses needed, either.
# RewriteCond %{REQUEST_URI} ^.+/$
# RewriteCond %{REQUEST_URI} !(/index\.html)
RewriteCond %{REQUEST_URI} !/index\.html
RewriteRule ^(.+)/$ http://www.example.com/$1 [L,R=301]
#
# Redirect to remove ".asp" in middle of URLs
# No "+" needed, because pattern is not end-anchored.
# That RewriteCond is redundant, anyway.
# Escape literal periods by preceding them with "\".
# For efficiency, avoid ambiguous leading patterns when
# a "floating" match is sought (".asp" in this case).
# RewriteCond %{REQUEST_URI} \.asp/.+
# RewriteRule ^(.+)\.asp/(.+)$ http://www.example.com/$1/$2 [L,R=301]
RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$1/$3 [L,R=301]
#
# Redirect to remove "index.html" prefixes from requested URLs
# RewriteCond is redundant as written, and won't prevent a
# loop when interacting with "DirectoryIndex" directives or
# with the internal "index.html" rewrite below.
# RewriteCond %{REQUEST_URI} index\.html/.+
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html/.
RewriteRule ^index\.html/(.+)$ http://www.example.com/$1 [L,R=301]
#
# Redirect "example.com/<whatever>" to "www.example.com/<whatever>"
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [L,R=301]
#
# Internally rewrite /admin requests to index.html?1=admin
# RewriteCond not needed if pattern moved to RewriteRule
# RewriteCond %{REQUEST_URI} /admin
RewriteRule ^admin /index.html?1=admin
#
# Internally rewrite to prefix all requested URLs with
# "index.html" unless this has already been done
RewriteCond %{REQUEST_URI} !/index\.html/
RewriteCond %{REQUEST_URI} !(<whole list of exempted OR'd directories>)
RewriteRule (.*) /index.html/$1
Here is the same code with the commented-out directives and editing notes removed:
RewriteEngine On
#
# Redirect to remove trailing slash unless URI starts with "index.html"
RewriteCond %{REQUEST_URI} !/index\.html
RewriteRule ^(.+)/$ http://www.example.com/$1 [L,R=301]
#
# Redirect to remove ".asp" in middle of URLs
RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$1/$3 [L,R=301]
#
# Redirect to remove "index.html" prefixes from requested URLs
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html/.
RewriteRule ^index\.html/(.+)$ http://www.example.com/$1 [L,R=301]
#
# Redirect "example.com/<whatever>" to "www.example.com/<whatever>"
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [L,R=301]
#
# Internally rewrite /admin requests to index.html?1=admin
RewriteRule ^admin /index.html?1=admin
#
# Internally rewrite to prefix all requested URLs with
# "index.html" unless this has already been done
RewriteCond %{REQUEST_URI} !/index\.html/
RewriteCond %{REQUEST_URI} !(<whole list of exempted OR'd directories>)
RewriteRule (.*) /index.html/$1
For reference, %{THE_REQUEST} holds the request line exactly as the client sent it, for example:
GET /index.html/widget HTTP/1.1
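The reason the THE_REQUEST condition stops the loop can be sketched in Python. This is a simplified model, not Apache itself: it assumes Python's re module behaves like Apache's regex engine for these patterns, and external_redirect is a hypothetical helper. The key point it models is that an internal rewrite changes the path the RewriteRule sees, while the client's original request line stays the same.

```python
import re

# Pattern from the RewriteCond, unchanged; "\ " is an escaped literal space.
THE_REQUEST_COND = re.compile(r"^[A-Z]{3,9}\ /index\.html/.")
# Pattern from the RewriteRule, matched against the per-directory path.
RULE_PATTERN = re.compile(r"^index\.html/(.+)$")

def external_redirect(the_request, current_path):
    """Return the redirect target, or None when the rule does not fire."""
    m = RULE_PATTERN.match(current_path)
    if m and THE_REQUEST_COND.search(the_request):
        return "http://www.example.com/" + m.group(1)
    return None

# The client explicitly asked for the index.html-prefixed URL: redirect.
direct = external_redirect("GET /index.html/widget HTTP/1.1",
                           "index.html/widget")

# The client asked for /widget; an earlier internal rewrite changed the
# path to index.html/widget, but the request line is unchanged: no redirect.
rewritten = external_redirect("GET /widget HTTP/1.1",
                              "index.html/widget")
```

Testing %{REQUEST_URI} instead would not distinguish these two cases, since the internally rewritten path looks the same as a direct client request for it.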
There were a lot of changes, so hope there aren't any typos! :)
Jim
ps, there was only one typo :)
RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$1/$3 [L,R=301]
should have been
RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$2/$3 [L,R=301]
but I caught it during testing ;)
Thanks again.
RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$2/$3 [L,R=301]
Similarly, you would have problems with a path such as "this.directory/file.asp" because the first part of the directory-pathname would be dropped, yielding "directory/file.asp".
If there was some other problem not related to this behaviour, please let me know.
(To determine the matched-contents of the numbered back-references, count left parentheses.)
Jim
Hmm you're right :)
The problem I was encountering was that, written as
RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$1/$3 [L,R=301]
it was redirecting /this-file.asp/another-file to /this-file./another-file. I figured the 2nd backreference held the string without the dot, so it looked like it worked. But yeah, on closer inspection, it introduces the problem you mentioned with URLs like /this.directory.asp/whatever.
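The back-reference behaviour described here can be checked with a small Python sketch. It assumes Python's re module treats repeated groups the way the thread reports Apache doing (a repeated group keeps only its last iteration); groups_for is a hypothetical helper name.

```python
import re

# The pattern under discussion. Counting left parentheses:
#   $1 = (([^.]+)\.)   outer repeated group, includes the trailing dot
#   $2 = ([^.]+)       inner group, text without the dot
#   $3 = (.+)          everything after "asp/"
PATTERN = re.compile(r"^(([^.]+)\.)+asp/(.+)$")

def groups_for(path):
    """Return the tuple ($1, $2, $3) for a path, or None if no match."""
    m = PATTERN.match(path)
    return m.groups() if m else None

# One dot-separated segment before "asp": $2 holds the whole prefix.
simple = groups_for("this-file.asp/another-file")

# Two segments: the repeated group keeps only its LAST iteration,
# so "this." is lost from both $1 and $2.
dotted = groups_for("this.directory.asp/whatever")
```

With a single segment, $1/$3 produces "this-file./another-file" while $2/$3 produces "this-file/another-file". With two segments, $2/$3 produces "directory/whatever", which is the dropped-prefix problem.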
Btw, what does the caret in here [^.] represent?
You could fix it like this:
RewriteRule ^(([^.]+\.)*([^.]+))\.+asp/(.+)$ http://www.example.com/$1/$4 [L,R=301]
RewriteRule ^([^.]+)\.+asp/(.+)$ http://www.example.com/$1/$2 [L,R=301]
RewriteRule ^(.+)\.+asp/(.+)$ http://www.example.com/$1/$2 [L,R=301]
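In these patterns, [^.] is a negated character class: the caret right after the opening bracket means "any single character except those listed", here any character other than a period. As a rough comparison of the three alternatives, the following Python sketch applies each pattern to two sample paths. It assumes Python's re module matches these patterns the same way Apache does; the labels "nested", "no_dots" and "greedy" and the helper redirect_path are hypothetical names introduced for illustration.

```python
import re

# Each entry: (compiled pattern, (back-reference for the prefix,
#                                 back-reference for the remainder))
FIXES = {
    # Captures the full dotted prefix in $1, remainder in $4.
    "nested": (re.compile(r"^(([^.]+\.)*([^.]+))\.+asp/(.+)$"), (1, 4)),
    # Simpler, but only matches when no period precedes ".asp".
    "no_dots": (re.compile(r"^([^.]+)\.+asp/(.+)$"), (1, 2)),
    # Simplest; relies on greedy matching and backtracking.
    "greedy": (re.compile(r"^(.+)\.+asp/(.+)$"), (1, 2)),
}

def redirect_path(name, path):
    """Return the path the redirect would point to, or None if no match."""
    pattern, (head, tail) = FIXES[name]
    m = pattern.match(path)
    return f"{m.group(head)}/{m.group(tail)}" if m else None

results = {
    name: (redirect_path(name, "this-file.asp/another-file"),
           redirect_path(name, "this.directory.asp/whatever"))
    for name in FIXES
}
```

In this sketch all three handle /this-file.asp/another-file, but the second pattern does not match /this.directory.asp/whatever at all, so it is only suitable when the part before ".asp" never contains a period.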
For more information, see the concise regex tutorial cited in our forum charter [webmasterworld.com].
Jim