Forum Moderators: open
However, I do not want Google to detect duplicate content. (I want Google to credit the article to the originating magazine, if to anyone.) I do not want spiders to go into a loop.
What advice should I give to my programmer?
For example, if a village newspaper were to publish an original article on their web site which was then duplicated by the New York Times web site, Google may deem the New York Times version to be the definitive one.
The only way to be sure is to prevent Google from indexing the other versions of the article using robots.txt or robots meta tags.
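For the robots.txt route, a minimal sketch (the /reprints/ path is just a placeholder for wherever the duplicate copies live on the other sites):

User-agent: Googlebot
Disallow: /reprints/

That stops GoogleBot from crawling anything under that directory, so the copies never make it into the index in the first place.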
As the admin of the larger site, I'm not too worried about who gets credit in G for the article. What I want to avoid is G's duplication filter.
Maybe what you are saying is that I should put a robots meta tag at the top of the article page every time another magazine runs the article. Then G would only find the article in one place?
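If so, I guess that would mean something like this in the head of each duplicated copy (just my reading of the suggestion, not something I've tested):

<meta name="robots" content="noindex, follow">

That way the copy can still be crawled and its links followed, but it shouldn't be indexed.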
I also want to avoid spiders getting caught in loops. Maybe the robots file would help here, too.
When I refer to an article as being "given credit" or "the definitive article", I mean that it is the article that Google does not filter out. Therefore it is the article that is most likely to appear in the search results for a given search term that appears on both article pages.
By using a robots exclusion, you are effectively filtering the article yourself before Google gets a chance to do so.
The reason that people might exclude an article themselves is that they are fearful of a duplicate content penalty on their site.
Some people believe that Google imposes some kind of duplicate content penalty. If you believe that your site would be affected by this penalty then excluding these duplicate articles using robots exclusion could be a preventative option for you.
However, please bear in mind that the robots exclusion only excludes robots (in this case, GoogleBot). I personally don't believe this is enough to hide content from Google. Google could well be analysing pages based on data they get from the Google Web Accelerator and Google Secure Access services. Although these services are currently in limited use, the user base is likely to increase in future, so as time progresses the likelihood of your duplicate content being detected increases.
If Google are analysing this kind of data, you could well get penalised more for both having duplicate content and trying to hide it.
If your site has a good ratio of original articles to duplicate articles, I wouldn't be too worried about possible penalties if I were you.
I think you have a great point about Google having "other ways" to detect duplicate content.
I want to set up this site and let it run without constant SEO maintenance. So I think I just want to avoid duplicate content altogether: tell publications to link to articles in sister publications, but not to duplicate them.
For the long term, that's the only safe way.