HTML & PDF: Duplicate Content?

is having the same content in different formats okay?

         

ccDan

5:54 am on Mar 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



On my revised "under re-construction" web site, I was planning on making some of my content available as PDFs (one of these days I will have to stop thinking of new features to add so I can finish this site already...).

Then, I got to thinking, why not make all of my content available as PDFs? In other words, visitors would read the article in HTML, and then if they wanted to save the article or print it out, they could download the PDF version.

Since Google (and I presume others) can parse PDF files, will having the same content as both HTML and PDF be considered duplicate content, and thus lower my rankings?

And, if so, I presume I can just add an entry to my robots.txt file to prevent Google from parsing PDFs?

aus_dave

10:27 am on Mar 31, 2004 (gmt 0)

10+ Year Member



I did this on a site a while ago (duplicate HTML/PDF files). As far as I could tell Google indexed all the PDFs ok and there were no penalties. They didn't rank too well though, as the corresponding HTML pages had SEO working in their favour.

Eventually I gave up doing this as it was more work and the PDFs have a 'finality' about them that I don't like. It is easy to edit some HTML when an article needs a change but it takes more time to create a new PDF.

ccDan

6:21 am on Apr 1, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks. It's not important to me if or how Google ranks the PDFs, just so long as it does not have an effect on my HTML page ranking.

As for the PDF, it's not much more work than HTML to make a change in it. I count just two extra steps to make a PDF versus updating HTML.

engine

5:05 pm on Apr 1, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Publishing articles as PDFs gives you all the benefits of control. Look upon PDFs primarily as a way to get better document control and compatibility, rather than as a duplicate of information.

The PDF files should have little impact upon your rankings and I've got no conclusive proof that it would be a negative step. All my tests have proved it to be a positive step to generate articles in PDF files.

Indeed, you can block indexing easily with robots.txt, and if it is duplicate content, I would consider it to be a good idea in this instance.
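
As a rough illustration of the robots.txt approach, here is a minimal Python sketch that checks a rule with the standard library's `urllib.robotparser`. It assumes, purely for illustration, that every PDF copy lives under a `/pdf/` directory; the `example.com` URLs and the `is_crawlable` helper are hypothetical. A plain directory prefix is used because it works with the original robots.txt rules and with this parser, which does simple prefix matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout: all PDF copies are stored under /pdf/,
# so one prefix rule keeps crawlers away from them.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdf/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # The HTML article stays crawlable; the PDF copy does not.
    print(is_crawlable(ROBOTS_TXT, "https://example.com/articles/widgets.html"))
    print(is_crawlable(ROBOTS_TXT, "https://example.com/pdf/widgets.pdf"))
```

Keep in mind that a `Disallow` rule stops compliant crawlers from fetching the files; it does not affect visitors who follow the download link from the HTML page.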

sovidiu

11:59 am on Apr 18, 2004 (gmt 0)

10+ Year Member



From what I have seen so far, using the same textual content in both an HTML and a PDF document does not affect the indexing process, and it's not considered duplicate content.

mumbledawg

12:02 am on Apr 20, 2004 (gmt 0)

10+ Year Member



Never occurred to me that it would be considered duplicate content. I put up PDF files as a printable option. The HTML files are too wide to fit on a standard printed page.

fom2001uk

8:36 am on Apr 21, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What about Word files?

sem4u

8:46 am on Apr 21, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wouldn't worry about the duplicate content issue too much with Word and PDF files. If you want to exclude them from the search engines then just use robots.txt

piskie

9:56 am on Apr 21, 2004 (gmt 0)

10+ Year Member



I put the entire site content into a PDF for download. The PDF had 125 of 126 pages; only the Home Page was left out.

That was over 2 years ago, and the positions of all the HTML pages were not affected one little bit.

sovidiu

10:02 am on Apr 21, 2004 (gmt 0)

10+ Year Member



fom2001uk, please refrain from using MSWord documents, since viewing them properly requires users to have MSOffice installed, with a paid license. You can always add a WordPad document (WordPad is the default Windows document viewer, besides Notepad). MSWord documents are not really recommended.

sem4u

10:09 am on Apr 21, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Microsoft's Word Viewer software is free...

sovidiu

10:21 am on Apr 21, 2004 (gmt 0)

10+ Year Member



MSWord documents are opened by default with WordPad, if MSOffice is not installed, causing some minor display problems.

kevinpate

11:23 am on Apr 21, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We have a passel of pages which provide download links to registration forms or to retain page content. We usually go beyond offering PDF and include options for DOC and TXT as well.

sovidiu

11:41 am on Apr 21, 2004 (gmt 0)

10+ Year Member



That is the right solution, since the visitors can easily print your web site's content for non-Internet related activities. And Google does not see this type of document arrangement as duplicate content.

mgream

1:09 pm on Apr 21, 2004 (gmt 0)

10+ Year Member



I have not been penalised for having a number of equivalent documents in PowerPoint, Word, PDF and HTML formats.

If anything, the only issue is that if you allow the SE (Google) to index them all, you may find that (for example) the PPT version is shown in the SE, but the PDF version is obscured unless you expand the results. Visitors may click through to the wrong context (e.g. they get a PDF file with no surrounding navigation, whereas you prefer to have them enter HTML pages). The end result is that your site is inconsistently represented in the SE. That may not be what you want.

I have since deprecated all of my content other than PDF and HTML. I did have the problem just mentioned (PPT v PDF).

Maybe it would be a better tactic to allow the SE to pick up only one version and mark the others noindex. In a better world, your site would be content neutral: only one format exported to SEs, but when a visitor selects a per-page conversion (equivalent to "printable version"), your server's engine would translate in real time (or from cache ...) and serve up the alternate format. Commercial services do this (Westlaw for one: it can serve up text, RTF, Word, PDF, on demand, although arguably the content is not media rich and largely plaintext anyway).
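
Files such as PDF or DOC cannot carry an HTML meta tag, so one way to approximate "the others are noindex" is an `X-Robots-Tag: noindex` HTTP response header, a mechanism that postdates this thread. The Python sketch below only shows the decision logic; the `extra_headers` helper and the extension list are hypothetical, and wiring it into a real server is left out.

```python
from pathlib import PurePosixPath

# Hypothetical set of alternate formats to keep out of the index;
# the HTML version is the only one meant to appear in search results.
NOINDEX_EXTENSIONS = {".pdf", ".doc", ".ppt", ".rtf", ".txt"}


def extra_headers(url_path: str) -> dict:
    """Return extra response headers for the requested URL path."""
    suffix = PurePosixPath(url_path).suffix.lower()
    if suffix in NOINDEX_EXTENSIONS:
        # Ask search engines not to index this alternate-format copy.
        return {"X-Robots-Tag": "noindex"}
    return {}


if __name__ == "__main__":
    print(extra_headers("/articles/widgets.pdf"))
    print(extra_headers("/articles/widgets.html"))
```

A crawler has to be able to fetch a file to see this header, so it does not combine with a robots.txt block on the same URL.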

sovidiu

1:34 pm on Apr 21, 2004 (gmt 0)

10+ Year Member



Google also offers a PDF-to-HTML conversion, since PDF documents are usually presented along with images and other pertinent information about your company. So there should be no problem with adding the same content in PDF and HTML format.