Duplicate content and robots.txt - need clarification - Google Search and SEO forum at WebmasterWorld - WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Duplicate content and robots.txt - need clarification

kvegas

8:25 pm on Nov 28, 2007 (gmt 0)

10+ Year Member

Good afternoon,

I have been trying to learn more about SEO pratices to promote web site but there is one area that I can't get clarity. Currently advertiser with a firm who I am told use duplicate content to promote and bring traffic, explain to me :

The most common way to block the search engines is by use of a “robots.txt” tag in the root directory of your web site. The robots.txt file is primarily used to tell the search engines where they can and cannot go.
# go away
User-agent: *
Disallow: /
Simply put, User-agent is the name the search engine calls itself when it requests information from your site (”*” = any). So in other words all search engines need to pay attention to what follows. “Disallow” should be obvious, and “/” is the root directory of the web server (and by default anything below).

Have a associate who disputes this and ask the following of me to which I am hoping to get answers to the following

1.) What directory do they put that in?
2.) How can they do that if they exclude the site from getting crawled?

Just wanting to see if I stay the current course or if it's better to move away from this duplicate content all together?

tedster

8:43 pm on Nov 28, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Hello kvegas, and welcome to the forums.

Robots.txt is a standalone file called, naturally enough, robots.txt. It is placed in the root directory of the domain and tells spiders where they should not index. So yes, if you intentionally create duplicate or near duplicate pages for marketing purposes, it's a good idea to only allow one version to be indexed.

While a full discussion of robots.txt syntax is better suited for our robots.txt forum [webmasterworld.com], you don't need to disallow the entire site. You can build rules in robots.txt that will disallow only one directory or even only one url. Takeing even further, Google supports pattern matching wild card rules, too, even though wild cards are not currently part of the robots.txt standard. In your Webmaster Tools account, Google offers tools to help you validate your robots.txt file and make sure that your rules actually do what you intend.

There's also another way to restrict indexing of any given page - use a robots meta tag in the url's <head>.

g1smd

1:24 am on Nov 30, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

There is some thinking that using robots.txt might not be the best way to handle the duplicate content issue, because the URLs that are blocked from spidering can still acquire PR from inbound links pointing to that URL, and that that PR is therefore "wasted".

There is also the issue of "dangling URLs". I have been mulling these issues for a little while, but several bloggers have now published some very convincing articles in recent weeks on this subject.

g1smd

1:31 am on Nov 30, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Disallow: /

This disallows any URL that starts with "/" and therefore disallows every URL on the entire site.