Forum Moderators: phranque

Removing 100% duplicate site from search index


fish_eye

5:37 am on Oct 26, 2005 (gmt 0)

10+ Year Member



Following on from this thread [webmasterworld.com].

The problem is a site that has had both the www and non-www versions spidered and added to the indexes. Given that search engines' reactions to 301s, 404s and 410s are many and varied, would this do the trick better?

RewriteCond %{HTTP_HOST} ^example\.com
RewriteCond %{REQUEST_URI} ^robots.txt
RewriteRule ^robots\.txt /goaway.txt [L]

Where goaway.txt would contain:

User-agent: *
Disallow: /

fzx5v0

9:35 am on Oct 26, 2005 (gmt 0)

10+ Year Member



I do not know if this is the correct way to do things, but I always set up the non-www version as a separate virtual host in Apache and put an .htaccess file in its document root containing:

Redirect 301 / [yourdomain.co.uk...]
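
For anyone unfamiliar with that setup, it might look roughly like this in httpd.conf (the domain names and DocumentRoot path are placeholders, not taken from the post above):

```apache
# Catch-all vhost for the bare domain; its only job is to redirect.
<VirtualHost *:80>
    ServerName example.com
    # mod_alias: permanently redirect every request, preserving the path,
    # to the same URL on the www hostname.
    Redirect 301 / http://www.example.com/
</VirtualHost>

# The real site lives only on the www hostname.
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/site
</VirtualHost>
```

Because `Redirect` matches by URL prefix, a request for /page.html on the bare domain is redirected to http://www.example.com/page.html automatically.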

jdMorgan

7:15 pm on Nov 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, with a couple of minor tweaks (see 2nd RewriteCond line below), that would serve an alternate robots.txt for the example.com domain. You could go even further, and serve based on user-agent.

RewriteCond %{HTTP_HOST} ^example\.com
RewriteCond %{REQUEST_URI} ^/robots\.txt
RewriteCond %{HTTP_USER_AGENT} slurp [NC]
RewriteRule ^robots\.txt /slurp_robots.txt [L]

... or, in even more detail, with a separate robots.txt file per robot:

RewriteCond %{HTTP_HOST} ^example\.com
RewriteCond %{REQUEST_URI} ^/robots\.txt
RewriteCond %{HTTP_USER_AGENT} (Googlebot|msnbot|Slurp|Teoma)
RewriteRule ^robots\.txt /%1_robots.txt [L]

However, I share the opinion posted in the original thread that a simple domain redirection should work --given time-- and that it is the correct way to prevent duplicate sites on www and non-www. I have always included this redirection where needed on any and all new sites, and have never had any canonicalization problems. It's one of those things that, if done at the very start, prevents potentially-big problems.
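
The simple domain redirection described above can be sketched as a single .htaccess rule set on the canonical host (www.example.com stands in for the real domain; this is the common mod_rewrite pattern, not code from the thread):

```apache
RewriteEngine On
# If the request arrived on any hostname other than the canonical one,
# issue a permanent (301) redirect to the same path on www.example.com.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

This collapses both hostnames onto one, so search engines eventually consolidate the duplicate entries on their own.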

Jim