Avoiding dup content using robots.txt?

morags

6:48 am on Sep 2, 2005 (gmt 0)

10+ Year Member



I have 2 sites. Both sites share the same MySQL database. I need to display the content of the database on both sites. But this has obvious "duplicate content" implications.

What I am considering is using robots.txt to exclude spiders from one of the sites (let's call it SiteA), so that SiteB will be the only one spidered.
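To be concrete, the robots.txt I have in mind for SiteA is just the standard blanket disallow (shown here purely as illustration):

    User-agent: *
    Disallow: /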

This sounds reasonable, but I have just learned that if someone links to a page on SiteA, then that page WILL be indexed even if it is never spidered - meaning a dup content penalty.

Is there a surefire way to prevent SiteA pages from being indexed - one that will also allow PR to pass through the page? (e.g. if someone links to the page, and the page contains a link to the home page, I would like the PR to pass to the home page.)

g1smd

4:35 pm on Sep 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No, the content will not be indexed.

The page will appear as a URL-only entry simply because Google knows that it exists.

I would prefer to have <meta name="robots" content="noindex"> on each page instead. I believe that stops the URL from even appearing in the SERPs.
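As a sketch, the tag sits in the <head> of every SiteA page. Adding "follow" (my assumption about what the poster wants, given the PR question above) tells the spider it may still follow the links on the page even though the page itself is kept out of the index:

    <head>
      <title>Example page on SiteA</title>
      <meta name="robots" content="noindex,follow">
    </head>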

MHes

5:05 pm on Sep 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>But this has obvious "duplicate content" implications.

So what? Google will show one or the other - let them decide.

g1smd

5:55 pm on Sep 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> I don't think they make a full index when they have to decide. <<

Here is some data from a site that had been online for 2 years, had 120 pages, and served both www and non-www versions with status 200. A Google site: search showed about 150 entries, a mixture of www and non-www. Many pages appeared as URL-only entries; a lot were indexed as both www and non-www (often both as URL-only), and some content was not indexed at all.

A 301 redirect from www to non-www was added in March, and within a couple of weeks all 120 non-www pages were showing as fully indexed, while the www pages were turning URL-only and/or dropping out en masse.
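On Apache, the usual rule for that kind of redirect looks something like this (a sketch only - I'm assuming mod_rewrite is available, and example.com is a placeholder for the real domain):

    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
    RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]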

I can recommend making Google's life easier and showing just one set of content.

MHes

9:20 pm on Sep 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not convinced. You are comparing duplication on the same URL with duplication between two different domains and suggesting the problems are the same. They may not be. There are millions of examples of duplication where Google very effectively selects one page and puts the others into 'supplemental results'. I personally would let Google make the choice. They will pick the best domain according to their rules; if you prejudge, you may choose the wrong one.

cws3di

5:14 am on Sep 3, 2005 (gmt 0)

10+ Year Member



morags,

I suggest that if you are worried about a duplicate content penalty, you code the actual .php pages on the two sites so that the static content "surrounding" the dynamic database content is completely different.

Make sure the database elements are arranged in a different order, with a different theme, style, and structure on each of the two sites.

There are also some really nice PHP random-text tools that can be customized to dynamically display an extra paragraph or two of your own original content - every time spiders or users see your page, there is fresh, on-topic content that you have written.
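A minimal sketch of the idea in PHP (the variable name and paragraphs are purely illustrative - write your own on-topic text):

    <?php
    // Hand-written, on-topic paragraphs; one is rotated in on each request.
    $blurbs = array(
        'First original paragraph, written for this site...',
        'Second original paragraph, written for this site...',
        'Third original paragraph, written for this site...',
    );
    // array_rand() returns a random key; use it to pick one paragraph.
    echo '<p>' . $blurbs[array_rand($blurbs)] . '</p>';
    ?>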

There are also many ways to pull dynamic (and different) content into your <title> tags, meta descriptions, etc.
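For example, something along these lines (a sketch only - $row is a hypothetical record fetched from the shared database, and the surrounding wording would differ on each site):

    <?php
    // Build a unique <title> from the current database record.
    // Phrase the static text differently on SiteA and SiteB.
    $title = htmlspecialchars($row['product_name']) . ' - reviews and prices';
    echo '<title>' . $title . '</title>';
    ?>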

In other words, it is easy to use the same database input if you carefully create unique content on the two different sites.

cws3di

5:23 am on Sep 3, 2005 (gmt 0)

10+ Year Member



I forgot to mention in the above post that I am advocating NOT using robots.txt to block out the "good" spiders.

Keep in mind that there are massive numbers of other bots - often scrapers - that do NOT obey robots.txt, and yes, they will end up creating many miscellaneous links to the pages you "thought" were blocked.

I think it is better to put in a little extra effort at the outset to make the two sites very different from each other, then take advantage of having both sites fully spidered.

MHes

8:11 am on Sep 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



cws3di
Good advice. Two sites in Google is better than one, and either one could be flavour of the month.