Forum Moderators: Robert Charlton & goodroi
I have been trying to learn more about SEO practices to promote my web site, but there is one area where I can't get clarity. I currently advertise with a firm that, I am told, uses duplicate content to promote sites and bring traffic. It was explained to me:
The most common way to block the search engines is with a "robots.txt" file in the root directory of your web site. The robots.txt file is primarily used to tell the search engines where they can and cannot go.
# go away
User-agent: *
Disallow: /
Simply put, User-agent is the name the search engine calls itself when it requests information from your site ("*" = any). So in other words, all search engines need to pay attention to what follows. "Disallow" should be obvious, and "/" is the root directory of the web server (and by default anything below it).
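If you want to sanity-check how a standards-compliant crawler would read those two lines, Python's standard library ships a parser for the original robots.txt standard. A minimal sketch (the example.com URLs are just placeholders):

```python
# Parse the "go away" robots.txt from above and ask whether a crawler may fetch pages.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# With "Disallow: /", every path is off-limits to every user-agent.
print(parser.can_fetch("Googlebot", "https://www.example.com/"))           # False
print(parser.can_fetch("SomeBot", "https://www.example.com/page.html"))    # False
```

Note that this stdlib parser implements the original standard only, so it is useful for checking plain Disallow rules, not Google's wildcard extensions.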
I have an associate who disputes this and asked me the following questions, to which I am hoping to get answers:
1.) What directory do they put that in?
2.) How can they do that if they exclude the site from getting crawled?
I just want to see whether I should stay the current course or whether it's better to move away from this duplicate content altogether.
Robots.txt is a standalone file called, naturally enough, robots.txt. It is placed in the root directory of the domain and tells spiders which parts of the site they should not crawl. So yes, if you intentionally create duplicate or near-duplicate pages for marketing purposes, it's a good idea to only allow one version to be indexed.
While a full discussion of robots.txt syntax is better suited for our robots.txt forum [webmasterworld.com], you don't need to disallow the entire site. You can build rules in robots.txt that will disallow only one directory or even a single URL. Taking it even further, Google supports pattern-matching wildcard rules, too, even though wildcards are not currently part of the robots.txt standard. In your Webmaster Tools account, Google offers tools to help you validate your robots.txt file and make sure that your rules actually do what you intend.
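For instance, a robots.txt along these lines would block only the duplicated material while leaving the rest of the site crawlable (the `/duplicates/` directory and the `sessionid` parameter are hypothetical examples, and the wildcard rule relies on Google's extension, not the base standard):

```
User-agent: *
# Block one directory of duplicate pages
Disallow: /duplicates/
# Block a single URL
Disallow: /print-version.html

# Google-only pattern matching: block any URL containing a session ID parameter
User-agent: Googlebot
Disallow: /*?sessionid=
```

Since other engines may ignore the wildcard line, it's worth validating the file in Webmaster Tools before relying on it.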
There's also another way to restrict indexing of any given page - use a robots meta tag in the url's <head>.
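For a duplicate page you'd rather keep crawlable but out of the index, the meta tag approach would look something like this (a sketch; "noindex, follow" tells compliant engines not to index the page but still to follow its links):

```html
<head>
  <meta name="robots" content="noindex, follow">
</head>
```

Unlike robots.txt, this requires the engine to actually fetch the page to see the tag, so don't combine it with a Disallow rule for the same URL or the tag will never be read.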
There is also the issue of "dangling URLs". I have been mulling over these issues for a little while, and several bloggers have published some very convincing articles on this subject in recent weeks.