I'm about to rebuild one of our sites, which is about 85k pages... 65k or so are genuinely unique content, and one of the things I was going to do with this build was add a 'Printer Version' of the main content pages..
Is Google going to [not-mind/ignore/penalize] 65k pairs of similar content? (Obviously they won't be exactly the same, but the bulk of the text will be.)
While I'm at it, how many versions would be too many? The Apache mod_perl site, httpd.apache.org for example, has HTML/PDF/POD versions of most pages..
Cheers, J. :)
1. If the printable version leaves out the text menus and other text bits, that in itself may produce a 10% difference (rule of thumb), which would be enough to avoid the problem.
2. Do you really want Google to index the (usually not very attractive or navigable) print versions? Why not just pop them all in a /printversion/ directory and ban bots from it?
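Two lines in robots.txt would cover it (a sketch, assuming the directory really does end up called /printversion/):

    User-agent: *
    Disallow: /printversion/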
I've been wanting to put the 'printable page' feature on some of our clients' sites for a while now, because I've often thought that the printable version of the page could carry a list of (search engine optimized) article keywords and nobody would ever notice unless they did a really careful check.
"that in itself may produce a 10% difference"
I wondered about this.. but our spider ignores link content when it produces the page snapshots.. In fact, printable versions are the reason I'm having to add duplicate content spotting to our algo, and they become very easy to spot when you ignore links..
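Roughly the idea (a minimal sketch in Python, not our actual spider code): strip the anchors, flatten the remaining markup, and hash what's left, so a print page and its full page collapse to the same snapshot:

    import hashlib
    import re

    def snapshot(html):
        # drop the anchors entirely, then the remaining tags, then extra whitespace
        text = re.sub(r'(?is)<a\b[^>]*>.*?</a>', ' ', html)
        text = re.sub(r'(?s)<[^>]+>', ' ', text)
        text = ' '.join(text.split())
        return hashlib.md5(text.encode()).hexdigest()

    # once links are ignored, a full page and its print version hash the same:
    # snapshot(full_html) == snapshot(print_html)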
hmpphf,
Sadly I'm moving from a currently printable template to a fancy stretchy one.. (I'm not the designer, I just build the things!) Well done on the W3C score though :)
Macguru, deejay,
I hadn't thought of robots.txt.. thanks..
dare I try it? (it would increase the latent PR by about 1/2 ;) - but not sure that it'll benefit me.. ) .. I've got about 12hrs to decide before I re-mill.. grrr.. decisions decisions ;)
ugly "print" version being the only one listed in Google
I'm banking on the print version having lower PR than its counterpart and being the one that gets dropped, but it is a risk...
I'll build them and, as you suggest, put them in a folder; that way I can easily decide later whether GB gets to read them..
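Then it's just one robots.txt line to flip later (a sketch, assuming the folder is /printversion/ and I only care about Googlebot):

    User-agent: Googlebot
    Disallow: /printversion/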
Thanks y'all for the help :)
[google.com...]
Wikipedia allows one page to redirect to another, but it still leaves you at the URL of the original request.
For example, if you go to [wikipedia.org...], it redirects you to the page for 'open source' (it should be two words). The only difference between the 'opensource' page and the 'open source' page is that one says '(Redirected from OpenSource)'.
Wikipedia is well liked by Google. It seems that it is not penalized for this, and only one of these pages is in the cache. Both pages contain the meta information
<meta name="robots" content="index,follow">
So, how does Google know which page to drop? Could it be based on PageRank? Fewer incoming links?
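If you wanted to take the choice out of Google's hands, the duplicate copy could carry a noindex variant of that tag instead (a sketch using the standard robots meta values; Wikipedia itself doesn't appear to do this):

    <meta name="robots" content="noindex,follow">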