Avoiding duplicates with htaccess

Stopping all duplicate pages for spiders


Peter

1:40 am on Dec 4, 2004 (gmt 0)

10+ Year Member


Hello. The objective is to avoid all possible sources of apparent duplicate content for spiders.

- www.mysite.net is also accessible by .org and .com
- Except for things in cgi-bin, no parameters are used and all pages are .html
- For some reason the (shared) server responds identically to www.mysite.net// (two slashes), to www.mysite.net/index.html/qsdfgh, and also, of course, to www.mysite.net/index.html?qsdfgh (see the aside after this list)
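Aside: if the server happens to run Apache 2.x and the host allows the directive in .htaccess, the path-info case could also be refused at the source. This is only a sketch, separate from the rule set below:

# Apache 2.x only: reject trailing path-info on static files, so that
# /index.html/qsdfgh returns 404 Not Found instead of serving /index.html
AcceptPathInfo Off

On an Apache 1.3 host the directive does not exist, so the rewrite rules below have to cover that case instead.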

The following .htaccess at root level seems to prevent all these duplicates (except dups inside cgi-bin, which don't matter because they are excluded in robots.txt and by "noindex,nofollow" in the pages themselves).

Can anyone see any errors or improvements, please?

...
RewriteEngine On
...
# If wrong TLD or query outside cgi-bin, then redirect to .net without the query
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.mysite\.net [OR]
RewriteCond %{QUERY_STRING} .
RewriteCond $1 !^cgi-bin/
RewriteRule ^(.*)$ http://www.mysite.net/$1? [R=301,L]
# If anything after .html, strip it off and redirect
RewriteRule ^(.*)\.html(.) http://www.mysite.net/$1.html? [R=301,L]
# Rewrite www.mysite.net/folder//
RewriteRule ^(.*)// http://www.mysite.net/$1/? [R=301,L]
# Rewrite www.mysite.net//
RewriteRule ^/ http://www.mysite.net/? [R=301,L]
...

Thank you.

jdMorgan

4:06 am on Dec 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Peter,

Haven't seen you in a while... Welcome back to the Apache forum!

A few ideas:


# If anything after .html, strip it off and redirect
RewriteRule ^([^.]*)\.html. http://www.mysite.net/$1.html? [R=301,L]
# Redirect www.mysite.net/folder(s)//<anything>
RewriteRule ^(.+)//+(.*) http://www.mysite.net/$1/$2? [R=301,L]
# Redirect www.mysite.net//<anything>
RewriteRule ^/+(.*) http://www.mysite.net/$1? [R=301,L]

1st Rule: Improve pattern-parsing efficiency and delete unnecessary parentheses after "html"
2nd Rule: Catch 2 or more slashes, allow for subfolders
3rd Rule: Catch 2 or more slashes, added code to catch mysite.net//<something>

Other than that, good tight code!

Note that in the case where you get a request for "mysite.net//" the second rule will not be invoked. Rather, the third rule will now take care of that problem.
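For reference, the complete rule set with those changes folded in might read like this (nothing new, just the rules from this thread reassembled; the host/query block is unchanged from your original):

RewriteEngine On
# If wrong TLD or query outside cgi-bin, then redirect to .net without the query
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.mysite\.net [OR]
RewriteCond %{QUERY_STRING} .
RewriteCond $1 !^cgi-bin/
RewriteRule ^(.*)$ http://www.mysite.net/$1? [R=301,L]
# If anything after .html, strip it off and redirect
RewriteRule ^([^.]*)\.html. http://www.mysite.net/$1.html? [R=301,L]
# Redirect www.mysite.net/folder(s)//<anything>
RewriteRule ^(.+)//+(.*) http://www.mysite.net/$1/$2? [R=301,L]
# Redirect www.mysite.net//<anything>
RewriteRule ^/+(.*) http://www.mysite.net/$1? [R=301,L]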

Jim

Peter

1:15 pm on Dec 4, 2004 (gmt 0)

10+ Year Member



Thank you very much, Jim, for your welcome and your most instructive reply, as always.

By the way, can you see any legitimate reason for the server (over which I have no control) to accept these requests that we don't want, or does this suggest a bad configuration somewhere?

Peter.

jdMorgan

4:23 pm on Dec 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The Web specifications allow for requests such as "GET //robots.txt", so the servers allow them. The intent was to make the Web more robust in the face of human error.

You can ignore such requests, redirect them, or 403-Forbid them, as you choose.
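If you go the 403 route, one possible sketch (hypothetical, untested) keys on THE_REQUEST, which contains the request line exactly as the client sent it, before any path cleanup:

# Hypothetical: forbid any request whose path begins with two or more slashes
RewriteCond %{THE_REQUEST} ^[A-Z]+\ //
RewriteRule ^ - [F]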

Jim

Peter

7:42 pm on Dec 4, 2004 (gmt 0)

10+ Year Member



Thanks, that's all very clear now.

Peter.