Need help with this .htaccess rule

     
5:18 pm on Apr 25, 2007 (gmt 0)

New User

10+ Year Member

joined:Dec 31, 2006
posts:29
votes: 0


Situation:

I have duplicate content available on mydomain.com/index.html/whatever and mydomain.com/whatever.

I'm trying to add a rule to .htaccess to redirect any url in the first format to the 2nd.

I add this:

RewriteCond %{REQUEST_URI} index.html/.+
RewriteRule ^index.html/(.+)$ [mydomain.com...] [L,R=301]

And the first redirect works fine

[mydomain.com...]

GET /index.html/whatever.html HTTP/1.1
HTTP/1.x 301 Moved Permanently
Server: Apache/1.3.33 (Unix) PHP/4.3.10 mod_perl/1.29
Location: [mydomain.com...]

But then it keeps doing a 301 redirect to itself. (/whatever.html)

Here is the whole .htaccess.

RewriteEngine On
RewriteCond %{HTTP_HOST} ^mydomain.com [NC]
RewriteRule ^(.*)$ [mydomain.com...] [L,R=301]

RewriteCond %{REQUEST_URI} ^.+/$
RewriteCond %{REQUEST_URI} !(/index\.html)
RewriteRule ^(.+)/$ [mydomain.com...] [L,R=301]

RewriteCond %{REQUEST_URI} .asp/.+
RewriteRule ^(.+)\.asp/(.+)$ [mydomain.com...] [L,R=301]

RewriteCond %{REQUEST_URI} index.html/.+
RewriteRule ^index.html/(.+)$ [mydomain.com...] [L,R=301]

RewriteCond %{REQUEST_URI} !([whole list of exempted OR'd directories])
RewriteRule ^(.*)$ index.html/$1

RewriteCond %{REQUEST_URI} /admin
RewriteRule ^(.*)$ index.html?1=admin

Any help would be appreciated, I've been fighting with this one for weeks now! :)

8:42 pm on Apr 25, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


The main problem is that your index.html removal redirect has no provision to prevent it from interacting with either a DirectoryIndex directive defining index.html as a possible index file, or with the internal rewrite executed later in the file. Remember that the [L] flag only affects the current pass through .htaccess, and that .htaccess will be re-invoked as the server responds either to a new client request due to a redirect or to a new internal request resulting from a previous internal rewrite. That is, you must code .htaccess to handle recursion.
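The loop can be reproduced with a toy model. The sketch below is not Apache. It is a hypothetical Python simulation (the names `one_pass` and `follow` are made up for illustration) that assumes the rules are re-run from the top after every internal rewrite, and that a 301 makes the client issue a fresh request. It models only the original index.html redirect and the original internal rewrite, and ignores the exempted directories.

```python
import re

def one_pass(path):
    """One pass through the two simplified rules.

    `path` has no leading slash, as in per-directory RewriteRule matching.
    Returns (action, new_path).
    """
    # External redirect: index.html/<rest> -> /<rest>
    m = re.match(r"^index\.html/(.+)$", path)
    if m:
        return ("redirect", m.group(1))
    # Internal rewrite: <anything> -> index.html/<anything>
    return ("rewrite", "index.html/" + path)

def follow(client_path, max_redirects=5):
    """Count the 301 responses a client would receive, up to a cap."""
    redirects = 0
    path = client_path
    while redirects < max_redirects:
        action, path = one_pass(path)
        if action == "redirect":
            redirects += 1  # the client re-requests the new URL
        # after a rewrite, the rules simply run again on the rewritten path
    return redirects

hops = follow("whatever.html")
```

In this model `follow("whatever.html")` never settles and stops only at the cap, which is the "301 to itself" symptom described in the first post.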

I moved the domain canonicalization redirect to be the last external redirect, in order to avoid 'stacked' redirects in cases where both the domain and the URL need to be corrected. Stacked redirects won't pass PageRank.

There were many instances of literal characters that needed to be escaped, and several redundant RewriteConds. Also, the rule order was non-optimal and, in one case, possibly incorrect. I have left these lines in place, but commented them out. You may remove them after evaluating the changes.

To the extent that I could infer intent, I added comments about the function of each rule. Check these, because if I was wrong about the intent, then the code might also be wrong.

Anyway, here is how I would code this:


RewriteEngine On
#
# Redirect to remove trailing slash unless URI starts with "index.html"
# Delete first RewriteCond. It is broken due to a missing leading
# slash, and would be redundant even if it wasn't broken.
# No parentheses needed, either.
# RewriteCond %{REQUEST_URI} ^.+/$
# RewriteCond %{REQUEST_URI} !(/index\.html)
RewriteCond %{REQUEST_URI} !/index\.html
RewriteRule ^(.+)/$ http://www.example.com/$1 [L,R=301]
#
# Redirect to remove ".asp" in middle of URLs
# No "+" needed, because pattern is not end-anchored.
# That RewriteCond is redundant, anyway.
# Escape literal periods by preceding them with "\".
# For efficiency, avoid ambiguous leading patterns when
# a "floating" match is sought (".asp" in this case).
# RewriteCond %{REQUEST_URI} \.asp/.+
# RewriteRule ^(.+)\.asp/(.+)$ http://www.example.com/$1/$2 [L,R=301]
RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$1/$3 [L,R=301]
#
# Redirect to remove "index.html" prefixes from requested URLs
# RewriteCond is redundant as written, and won't prevent a
# loop when interacting with "DirectoryIndex" directives or
# with the internal "index.html" rewrite below.
# RewriteCond %{REQUEST_URI} index\.html/.+
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html/.
RewriteRule ^index\.html/(.+)$ http://www.example.com/$1 [L,R=301]
#
# Redirect "example.com/<whatever>" to "www.example.com/<whatever>"
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [L,R=301]
#
# Internally rewrite /admin requests to index.html?1=admin
# RewriteCond not needed if pattern moved to RewriteRule
# RewriteCond %{REQUEST_URI} /admin
RewriteRule ^admin /index.html?1=admin
#
# Internally rewrite to prefix all requested URLs with
# "index.html" unless this has already been done
RewriteCond %{REQUEST_URI} !/index\.html/
RewriteCond %{REQUEST_URI} !(<whole list of exempted OR'd directories>)
RewriteRule (.*) /index.html/$1

Without all the commentary, it looks like this:

RewriteEngine On
#
# Redirect to remove trailing slash unless URI starts with "index.html"
RewriteCond %{REQUEST_URI} !/index\.html
RewriteRule ^(.+)/$ http://www.example.com/$1 [L,R=301]
#
# Redirect to remove ".asp" in middle of URLs
RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$1/$3 [L,R=301]
#
# Redirect to remove "index.html" prefixes from requested URLs
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html/.
RewriteRule ^index\.html/(.+)$ http://www.example.com/$1 [L,R=301]
#
# Redirect "example.com/<whatever>" to "www.example.com/<whatever>"
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [L,R=301]
#
# Internally rewrite /admin requests to index.html?1=admin
RewriteRule ^admin /index.html?1=admin
#
# Internally rewrite to prefix all requested URLs with
# "index.html" unless this has already been done
RewriteCond %{REQUEST_URI} !/index\.html/
RewriteCond %{REQUEST_URI} !(<whole list of exempted OR'd directories>)
RewriteRule (.*) /index.html/$1

Using THE_REQUEST in the rule above ensures that the rule is applied only when "index.html/" is present in the URL requested by the client browser or robot; the rule will not be applied as the result of an internal rewrite or DirectoryIndex. THE_REQUEST is the entire request line received from the client, for example:
GET /index.html/widget HTTP/1.1

While REQUEST_URI is updated after any internal URL rewrite (including DirectoryIndex rewrites), THE_REQUEST is defined solely by the client, and never changes while any given request is being processed.
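To see the difference concretely, the old and new conditions can be tried against sample strings in Python. This is only a sketch: Python's `re` module stands in for Apache's regex engine, and the sample request line and rewritten URI are assumed values for a client that asked for /widget.

```python
import re

# The two conditions, written as Python raw strings.
OLD_COND = re.compile(r"index\.html/.+")               # was tested against REQUEST_URI
NEW_COND = re.compile(r"^[A-Z]{3,9}\ /index\.html/.")  # tested against THE_REQUEST

# The client asked for /widget; the internal rewrite then changed the URI.
the_request = "GET /widget HTTP/1.1"  # fixed for the life of the request
request_uri = "/index.html/widget"    # value after the internal rewrite

# The old condition sees the rewritten URI and would fire the redirect again.
old_fires = OLD_COND.search(request_uri) is not None

# The new condition looks at what the client actually sent, so it stays quiet.
new_fires = NEW_COND.search(the_request) is not None

# A client that really requested the long URL is still redirected.
direct_fires = NEW_COND.search("GET /index.html/widget HTTP/1.1") is not None
```

Here `old_fires` is true and `new_fires` is false for the same request, while `direct_fires` is true, so only genuine client requests for "index.html/..." trigger the redirect.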

There were a lot of changes, so hope there aren't any typos! :)

Jim

4:12 pm on Apr 26, 2007 (gmt 0)

New User

10+ Year Member

joined:Dec 31, 2006
posts:29
votes: 0


Jim, thank you so much! It worked perfectly. Thanks for putting the effort in to help me.. you rock! Not only did you fix the problem for me, I learned a lot from your post/reformulation too ;)

ps, there was only one typo :)

RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$1/$3 [L,R=301]

should have been

RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$2/$3 [L,R=301]

but I caught it during testing ;)

Thanks again.

5:07 pm on Apr 26, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


If you use $2 instead of $1 like this:

RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$2/$3 [L,R=301]

then only the last part of a multi-dotted filename such as "this.file.asp" would be back-referenced. The result would be that $2 would be "file" instead of "this.file". By contrast, $1 would include the whole filename.

Similarly, you would have problems with a path such as "this.directory/file.asp" because the first part of the directory-pathname would be dropped, yielding "directory/file.asp".

If there was some other problem not related to this behaviour, please let me know.

(To determine the matched-contents of the numbered back-references, count left parentheses.)

Jim

7:27 pm on Apr 26, 2007 (gmt 0)

New User

10+ Year Member

joined:Dec 31, 2006
posts:29
votes: 0



Quote: Similarly, you would have problems with a path such as "this.directory/file.asp" because the first part of the directory-pathname would be dropped, yielding "directory/file.asp".

Hmm you're right :)

The problem I was encountering was that as

RewriteRule ^(([^.]+)\.)+asp/(.+)$ http://www.example.com/$1/$3 [L,R=301]

it was redirecting /this-file.asp/another-file to /this-file./another-file. I figured the 2nd back-reference held the string without the dot, so it looked like it worked. But yes, on closer inspection, it introduces the problem you mentioned with URLs like /this.directory.asp/whatever.
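For anyone who wants to check the back-references offline, here is a rough sketch in Python. It assumes Apache's regex engine, like Python's `re`, keeps only the last repetition of a repeated group. The helper name `backrefs` is made up for illustration.

```python
import re

# Same pattern as the RewriteRule above; groups are numbered by left parenthesis.
PATTERN = re.compile(r"^(([^.]+)\.)+asp/(.+)$")

def backrefs(path):
    """Return ($1, $2, $3) for a path without its leading slash, or None."""
    m = PATTERN.match(path)
    return m.groups() if m else None

# One dot: $1 keeps the trailing dot, $2 does not.
single = backrefs("this-file.asp/another-file")

# Several dots: both $1 and $2 hold only the last repetition.
multi = backrefs("this.file.asp/another-file")
```

With these inputs `single` is `('this-file.', 'this-file', 'another-file')`, which matches the stray dot seen in testing, and `multi` is `('file.', 'file', 'another-file')`, so neither $1 nor $2 carries the full "this.file".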

Btw, what does the caret in here [^.] represent?

8:01 pm on Apr 26, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Ugh, yeah, you're right...

You could fix it like this:


RewriteRule ^(([^.]+\.)*([^.]+))\.+asp/(.+)$ http://www.example.com/$1/$4 [L,R=301]

but that's getting too complicated. You might as well just use

RewriteRule ^([^.]+)\.+asp/(.+)$ http://www.example.com/$1/$2 [L,R=301]

if you only ever have one 'dot' in your URL-paths, or something more like your original pattern if you do or might have multiple dots:

RewriteRule ^(.+)\.+asp/(.+)$ http://www.example.com/$1/$2 [L,R=301]
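The three alternatives can be compared side by side with a small Python sketch. Python's `re` is used here as a stand-in for Apache's regex engine, and the labels "nested", "single-dot" and "greedy" as well as the helper `rewrite` are names invented for this example.

```python
import re

# (pattern, substitution template) for each candidate rule above.
CANDIDATES = {
    "nested": (r"^(([^.]+\.)*([^.]+))\.+asp/(.+)$", r"\1/\4"),
    "single-dot": (r"^([^.]+)\.+asp/(.+)$", r"\1/\2"),
    "greedy": (r"^(.+)\.+asp/(.+)$", r"\1/\2"),
}

def rewrite(name, path):
    """Return the redirect path a candidate would produce, or None if no match."""
    pattern, template = CANDIDATES[name]
    m = re.match(pattern, path)
    return m.expand(template) if m else None

one_dot = {name: rewrite(name, "this-file.asp/another-file") for name in CANDIDATES}
two_dots = {name: rewrite(name, "this.file.asp/another-file") for name in CANDIDATES}
```

All three give "this-file/another-file" for the one-dot path. For the two-dot path, "nested" and "greedy" give "this.file/another-file", while "single-dot" does not match at all and so leaves the URL untouched.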

"[^.]+" means "match one or more characters NOT equal to a literal period." This is a negative character group. Groups may contain multiple alternate characters. A group of [abc] matches one character that is one of a, b, or c. [^abc] matches any one character that is NOT a, b, or c. Any quantifier such as ?, +, *, or {2,5} can be used to modify a group.
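A quick way to experiment with these groups is a few lines of Python. This is just a sketch using Python's `re` module, which follows the same character-group rules for these simple cases. The variable names are arbitrary.

```python
import re

# [abc] matches exactly one character that is a, b, or c.
one_of = re.fullmatch(r"[abc]", "b") is not None

# [^abc] matches exactly one character that is NOT a, b, or c.
not_one_of_b = re.fullmatch(r"[^abc]", "b") is not None
not_one_of_z = re.fullmatch(r"[^abc]", "z") is not None

# [^.]+ takes everything up to, but not including, the first period.
up_to_dot = re.match(r"[^.]+", "this-file.asp").group(0)

# A quantifier applies to the whole group: two to five characters from a, b, c.
quantified = re.fullmatch(r"[abc]{2,5}", "abca") is not None
too_short = re.fullmatch(r"[abc]{2,5}", "a") is not None
```

Here `up_to_dot` is "this-file", which is why "[^.]+" is a handy way to say "the part before the dot".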

For more information, see the concise regex tutorial cited in our forum charter [webmasterworld.com].

Jim