
Internal Duplicate Content

Printer versions, etc..


yetanotheruser

1:13 am on Feb 27, 2003 (gmt 0)

10+ Year Member



When is duplicate content duplicate content...

I'm about to re-build one of our sites which is about 85k pages... 65k or so are genuinely unique content, and one of the things I was going to do this build was to add a 'Printer Version' of the main content pages..

Is Google going to [not-mind/ignore/penalize] 65k pairs of similar content? (They obviously won't be exactly the same, but the bulk of the text will be.)

While I'm at it, how many times would be too many? The Apache mod_perl site, httpd.apache.org for example, has html/pdf/pod versions of most pages..

Cheers, J. :)

deejay

1:20 am on Feb 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



uhhh.. the mere fact that you used 'mod_*' in a post would normally be enough to scare my somewhat novice butt off :) .... but here goes anyway.

1. If the printable version keeps the text menus and other text bits, that in itself may produce a 10% difference (rule of thumb), which would negate the problem.

2. Do you really want Google to index the (usually not very viewer-attractive or navigable) print versions? Why not just pop them all in a /printversion/ directory and ban bots from it?
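A minimal robots.txt sketch of that approach (the /printversion/ directory name is just an example):

```
# Keep all crawlers out of the print-version copies
User-agent: *
Disallow: /printversion/
```

The file goes at the site root (e.g. example.com/robots.txt); any path starting with /printversion/ would then be off-limits to well-behaved bots.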

[edited by: deejay at 1:20 am (utc) on Feb. 27, 2003]

Macguru

1:20 am on Feb 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi yetanotheruser,

Interesting question.

My instinctive move would be to simply store them in folders banned by the robots.txt file.
I've had no problem with that yet.

buckworks

1:28 am on Feb 27, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Another way to do this would be to just have one page, but link to a separate stylesheet specifically for printing.
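A sketch of that approach (the stylesheet names are made up): the media attribute lets the browser pick the right stylesheet for screen or paper, so there is only ever one URL per page for Google to index:

```html
<link rel="stylesheet" media="screen" href="/css/screen.css">
<link rel="stylesheet" media="print" href="/css/print.css">
```

The print stylesheet can then hide navigation and other screen-only elements (e.g. with display: none), giving a clean printout without a duplicate page.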

hmpphf

1:29 am on Feb 27, 2003 (gmt 0)

10+ Year Member



We recently did a major accessibility project for one of our clients and achieved W3C priority 2 standard. I had hoped to sell them a 'printable version of this page' function but it turned out that one of the byproducts of the accessibility project was that the pages print perfectly.

I've been wanting to put the 'printable page' feature on some of our clients' sites for a while now, because I've often thought that the printable version of the page could carry a list of (search engine optimized) article keywords, and nobody would ever notice unless they did a really careful check.

yetanotheruser

1:46 am on Feb 27, 2003 (gmt 0)

10+ Year Member



deejay,

that in itself may produce a 10% difference

I wondered about this.. but our spider ignores link content when it produces the page snapshots.. In fact, printable versions are the reason I'm having to add duplicate content spotting to our algo, and they become very easy to spot when you ignore links..

hmpphf,

Sadly I'm moving from a currently printable template to a fancy stretchy one.. (I'm not the designer, I just build the things!) Well done on the W3C score though :)

Macguru, deejay,

I hadn't thought of robots.txt's.. thanks..

dare I try it? (it would increase the latent PR by about 1/2 ;) - but not sure that it'll benefit me.. ) .. I've got about 12hrs to decide before I re-mill.. grrr.. decisions decisions ;)

Macguru

1:51 am on Feb 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One thing is for sure, I don't want visitors' first sight of a site to be a printable version.

And another thing, I don't want any problems with the text of the printable versions being any different from the screen versions.

yetanotheruser

1:56 am on Feb 27, 2003 (gmt 0)

10+ Year Member



Macguru,

good point.. but the fancy version would have more PR, and so would appear first, AFAIK... I hope! ;)

Macguru

2:04 am on Feb 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



yetanotheruser,

Your point seems perfectly logical. I would be surprised if G penalised the multitude of sites that don't care about duplicate content. AFAIK, it just does not index 'new' duplicate content.

Just instinctively, I tend to block the ugliest version. Won't take risks on my clients' domains.

aspdesigner

4:48 am on Feb 27, 2003 (gmt 0)

10+ Year Member



Another possible problem, if the duplicate page filter decides pages are identical, it will only list one of them in the index and drop the rest. This could result in the ugly "print" version being the only one listed in Google!

I agree, the best thing to do is to use robots.txt to ban the bot.

yetanotheruser

10:51 am on Feb 27, 2003 (gmt 0)

10+ Year Member



ugly "print" version being the only one listed in Google

I'm banking on the print version having lower PR than its counterpart and being the one that gets dropped, but it is a risk...

I'll build them, and as you suggest - put them in a folder, that way I can easily decide later whether GB gets to read them..

Thanks y'all for the help :)

TheDave

11:15 am on Feb 27, 2003 (gmt 0)

10+ Year Member



I noticed the other day that my site's print pages were being listed in Google as nested pages, off the main listing/products page. I'm expecting them to fall out in the next update because I banned them with robots.txt at least a month ago. Does anyone know if, by putting a page in robots.txt, Google will eventually remove it from the database, or will they just never visit that page again?

aspdesigner

11:32 am on Feb 27, 2003 (gmt 0)

10+ Year Member



TheDave, check out this page -

[google.com...]

imran

2:01 pm on Feb 27, 2003 (gmt 0)

10+ Year Member



I also have this issue. I am using similar software to that used by the popular online encyclopedia, Wikipedia.

Wikipedia allows one page to redirect to another, but it still leaves you at the URL of the original request.

For example, if you go to [wikipedia.org...] it redirects you to the page for 'open source' (it should be two words). The only difference between the 'opensource' page and the 'open source' page is that one says '(Redirected from OpenSource)'.

Wikipedia is well liked by Google. It seems that it is not penalized for this, and only 1 of these pages is in the cache. Both pages contain the meta information

<meta name="robots" content="index,follow">

So, how does Google know which page to drop? Could it be based on PageRank? Fewer incoming links?
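If you wanted to steer that choice yourself rather than leave it to Google, one option (not what Wikipedia does here) would be to change the meta tag on the duplicate page only, leaving the canonical page as index,follow:

```html
<meta name="robots" content="noindex,follow">
```

That tells the bot to keep following the links on the redirect page but to keep the page itself out of the index, so only the 'open source' version would be listed.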