Forum Moderators: Robert Charlton & goodroi

To Turn PDF Content Into Web Pages?

         

kurzo

8:26 pm on Sep 14, 2015 (gmt 0)

10+ Year Member



I have a 27 page PDF that is full of great content, would it make sense to go through and pull the content apart and build out 10 new pages of content that are tightly themed by keyword/topic? And then remove the PDF from being indexed?

Robert Charlton

9:52 pm on Sep 14, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Yes, I think it can make a lot of sense to turn pdfs into web pages.

A well-structured website, with page titles, html text, internal linking, and more opportunities for inbound links, offers much more opportunity than a pdf does for your content to be seen, to attract links, and to rank. In my experience, on straightforward website searches, you could well see dramatic increases in traffic and ranking.

I'm not sure you should arbitrarily break the article up into 10 sections, though. Depending on the length, topic, and structure of your pdf, you might do better with longer but fewer sections.

Also, I truly don't know right now whether it's necessary to remove the pdf from the index... as it's possible that Google might rank both the pdf and the html pages.

Many scholarly fields rely heavily on pdf content, and that is something for you to consider. You should check to see whether you do get traffic from visitors searching specifically for pdfs. For certain kinds of content, searchers might assume that the material will more likely be found in pdfs and will search by file-type.

tangor

10:36 pm on Sep 14, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Great points by Robert. I have, in some cases, gone the other direction for different reasons, i.e. web pages to PDF, for the simple reason that html of any kind is "too loose" for some text layout necessities ... and some of my pdfs are actually sales items.

Since the major search engines can index pdf content, you lose very little in "keyword" and general traffic.... there's just a "PDF" marker by the entry in the serps.

Additionally, there are a few sites where I keep both the web version and the pdf version available, as some folks prefer one over the other, and I have never seen a "duplicate content" hit because, while the words and images in either might be the same, they are NOT the same in layout and intent and the users treat them differently.

martinibuster

12:00 am on Sep 15, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I do not encourage tightly theming a website by keyword. Organizing web pages by topic is less likely to resemble term spamming.


Mod's note: martinibuster has started a separate thread on the topic of theming a website by keyword, and you can find that discussion here....
[webmasterworld.com...]

[edited by: Robert_Charlton at 4:23 am (utc) on Sep 15, 2015]
[edit reason] added mod's note and link [/edit]

tangor

4:01 am on Sep 15, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



True. Most of us (me included) have been doing that for years.

To web page or PDF is not as great a problem now as it was back in the early days: Search Engines worth their salt (and the ones we want) all read PDF just as easily as html.

The OP mentioned a 27 page pdf he now has.... if it's his, then he already has the source material, which makes conversion to web pages relatively simple. If not, then we're talking something else and might want to take further discussion to the copyright forum. [webmasterworld.com...]

aakk9999

4:10 am on Sep 15, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Search Engines worth their salt (and the ones we want) all read PDF just as easily as html.

It is not a problem of SEs understanding PDF. The angle I am looking at is the call to action to different parts of your website. When I land on a PDF, there is nowhere else to go. If I land on a page, I can scan the main menu, the secondary menu, there could be other calls to action, and steering the visitor to convert is so much easier from a web page than from a PDF.

Robert Charlton

4:16 am on Sep 15, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



tangor, thanks. I'm glad to hear that you've had no dupe content issues with PDFs. I was suspecting there might not be.

While search engines do in general read PDFs, pdf files are at such a ranking disadvantage that I've found it best to search by file-type, and I suspect that many regular readers of pdfs (which are extremely common in technical areas) will do that as well. It's good that the OP can keep both html and pdfs available for search. I tend to be super cautious, so I would probably test one file... and wait a while... before doing all of them.

Robert Charlton

4:40 am on Sep 15, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



PS: aakk9999... I was posting in a different tab and didn't see your post. Yes, I like the way that's put, about calls to action and ease of conversion.

(Many pdfs do link back to the parent site, btw, and we've recently discussed them in a spam context, but I don't have that thread reference handy. I don't think such self-links would be significantly helpful to boost the parent site, though... and I'd caution against using pdf links in place on a site with that in mind.)

With regard to martinibuster's post about theming by keywords, it might or might not be a necessary caution, depending on the OP's mindset about the term "keyword".

I've seen ongoing examples of attempted theming by actual keyword, so the caution is perhaps important with regard to this question. It also depends, of course, on what martinibuster means by keyword.

tangor

5:30 am on Sep 15, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And my apologies to all for using that term (keyword) in my original response.... I was only hoping to let rank and file know that PDFs, too, can be searched and ranked like an ordinary web page (though not with the same precision).

Also, to echo Robert, pdf linking is generally krap. The pdf itself takes the user outside the site, so to speak, though a back button will return them to the site. We, as webmasters, however, may not rely on that (to us) ordinary action, and that also references aakk9999's commentary.

Just letting folks know my experience: across 300+ sites, about half of which ROUTINELY have PDFs for specific layout (or sales) reasons, I have seen no diminished rank for having a PDF, or for having both html and PDF of the same content on the same site. These files (clearly marked) attract different kinds of users, and in some cases, provide better readability.

Having said the above, the vast majority of webmasters have little to no use for PDF as a normal RWD site is pretty good at what is necessary to display "stuff". If, however, you have technical or express layout considerations, PDF will work, Fixed In Stone, so to speak, with other attributes such as encryption, passwords and more. PDF is a different beast, just one of the tools of the enterprising webmaster.

And there's little wonder that PDF is among the ebook formats as it is versatile in many directions.

But the average webmaster really doesn't need it.

And back to OP once again: Unless there is a pressing need to put ads three up page by page of the converted PDF, just put it in one file. Your users will love you. (Some of my "pages" are in the 100,000 to 300,000 word length, but those they pay for!)

Kratos

9:45 am on Sep 15, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



My main issue with leaving the PDF and creating an HTML page out of it (i.e. duplicated content) is that, as far as I recall, you cannot put a canonical tag on a PDF (the same way that you cannot nofollow links in PDFs). This could create a problem with regards to what Google ranks, especially if the PDF document has external backlinks pointing to it or strong internal links pointing to it.

I quickly browsed the following link as I'm just skimming the forum for the September thread but this may be a good read with regards to PDFs and canonicals (apparently YES you can add the canonical element to a PDF)
[googlewebmastercentral.blogspot.com...]

Anyone have experience using "canonical" on PDFs? (just to add to the thread, not meaning to derail it)

pageoneresults

12:54 pm on Sep 15, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



...as far as I recall, you cannot put a canonical tag on a PDF.

Ah, but you can...

Specify a Canonical Link in Your HTTP Header
[support.google.com...]

If you can configure your server, you can use rel="canonical" HTTP headers to indicate the canonical URL for HTML documents and other files such as PDFs. Say your site makes the same PDF available via different URLs (for example, for tracking purposes).
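As a rough illustration of what that header looks like on the wire, here is a minimal Python sketch that builds a `Link: <...>; rel="canonical"` response header for a PDF. The function names, the path-to-URL mapping, and the example.com URLs are all hypothetical; how you actually attach the header depends on your server (Apache, nginx, an application framework, etc.).

```python
def canonical_link_header(canonical_url: str) -> tuple[str, str]:
    """Return a (name, value) pair for an HTTP Link header with rel="canonical"."""
    # Reject characters that would break the <...> URL delimiter or the header line.
    if any(ch in canonical_url for ch in '<>"\r\n '):
        raise ValueError("canonical URL contains characters unsafe in a Link header")
    return ("Link", f'<{canonical_url}>; rel="canonical"')


def headers_for(path: str, canonical_map: dict[str, str]) -> list[tuple[str, str]]:
    """Build extra response headers for a requested path.

    canonical_map is a hypothetical mapping from PDF paths on this site to the
    URL that should be treated as canonical for that PDF.
    """
    headers: list[tuple[str, str]] = []
    if path.lower().endswith(".pdf"):
        headers.append(("Content-Type", "application/pdf"))
        target = canonical_map.get(path)
        if target:
            headers.append(canonical_link_header(target))
    return headers


if __name__ == "__main__":
    # Example: the PDF points at an HTML page as its canonical version.
    mapping = {"/downloads/white-paper.pdf": "https://www.example.com/white-paper.html"}
    for name, value in headers_for("/downloads/white-paper.pdf", mapping):
        print(f"{name}: {value}")
```

The same header can usually be set directly in the web server configuration instead of in application code; the sketch only shows the shape of the header value.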

You can also do some nifty things with the Advanced Properties of the PDF...

Acrobat Help / PDF Properties and Metadata
[helpx.adobe.com...]

In the PDF settings for Acrobat, you can set a base Uniform Resource Locator (URL) for web links in the document. Specifying a base URL makes it easy for you to manage web links to other websites. If the URL to the other site changes, you can simply edit the base URL and not have to edit each individual web link that refers to that site. The base URL is not used if a link contains a complete URL address.

I have a 27 page PDF that is full of great content, would it make sense to go through and pull the content apart and build out 10 new pages of content that are tightly themed by keyword/topic?

Yes, it would make sense. I wouldn't get too caught up in the whole "keyword" thing but the topic relevancy is a plus.

Also keep in mind that when HTML pages have supporting documents such as PDFs, I think it adds a little more oomph to the overall performance of the page. A well structured page may include additional supporting documents in various formats with PDF being one of them.

martinibuster

1:23 pm on Sep 15, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Nice post P1R! :)

pageoneresults

1:51 pm on Sep 15, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks martinibuster! You know, I just want to properly case your username: MartiniBuster

Regarding PDFs and the optimization thereof. I wrote a very lengthy article years ago about how to optimize PDFs. Much of the content I've used for developing initial websites came from PDFs such as client brochures, line cards, mailers, etc. While dissecting all of those PDFs, I found myself stuck to the Adobe website reading and learning everything about PDF structure. ZING! Google LOVES optimized PDFs. They love anything that is optimized properly. With PDFs, you have a host of options available to you through the Properties Dialog. The really advanced options are only available in the full version of Acrobat.

Custom (Acrobat only): Lets you add document properties to your document.

Certain content justifies supporting documents. Technical sites are loaded with PDF documents. It's unfortunate that MANY of them have failed to optimize them properly. Whenever I see a file name as the title of a PDF for a search result, I know the document is NOT optimized. It's the first thing you do: set a title, description and a few primary keywords for the document.

Edit Document Metadata
[helpx.adobe.com...]

Click Advanced to display all the metadata embedded in the document. (Metadata is displayed by schema—that is, in predefined groups of related information.) Display or hide the information in schemas by schema name. If a schema doesn’t have a recognized name, it is listed as Unknown. The XML name space is contained in parentheses after the schema name.

kurzo

8:36 pm on Sep 15, 2015 (gmt 0)

10+ Year Member



Sorry - just getting to this now. Thanks to all for the responses and advice.

I was using "10 pages" as an arbitrary number, I have not looked too closely at it to determine how to extract and organize the content yet.

Just wanted to get a consensus, glad to hear it is positive.

SEMachine

12:27 pm on Sep 16, 2015 (gmt 0)

10+ Year Member



Don't mean to hijack the thread but I have a related question that seemed to piggyback: If we have a very tight keyword strategy aligning pages to specific keywords, what do you do if you have a landing page for a PDF download? They would both be very optimized (naturally) for the given keyword so that would lead to cannibalization. Noindex one? Optimize one page for a different iteration/slight variation of the keyword?

aakk9999

12:51 pm on Sep 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@SEMachine,
I presume both are showing in SERPs, the PDF download page as well as the actual PDF?
If so, does one rank better than the other? Where do visitors land more, on the download page or on the PDF itself?

SEMachine

12:59 pm on Sep 17, 2015 (gmt 0)

10+ Year Member



Both indexed and both are found. But as a growing organization it's essential to have a protocol in place as we can't try to attack every content piece individually based on whether the landing page or the actual content piece is ranking better after a given amount of time. I see three main options: 1) keep them both indexed and ignore best practices around cannibalization, 2) noindex the landing pages, 3) keep the landing page indexed but add a noindex to the content pieces.
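For option 3, note that a PDF cannot carry an HTML meta robots tag, so a noindex on the PDF itself is typically sent as an `X-Robots-Tag` HTTP response header. A minimal Python sketch, assuming the content pieces can be recognised by file extension (the function name and the default suffix list are hypothetical):

```python
from typing import Optional, Sequence, Tuple


def robots_header(path: str, noindex_suffixes: Sequence[str] = (".pdf",)) -> Optional[Tuple[str, str]]:
    """Return an X-Robots-Tag noindex header for matching paths, else None.

    noindex_suffixes is an assumed site-wide rule: every file whose path ends
    with one of these suffixes is kept out of the index.
    """
    if path.lower().endswith(tuple(noindex_suffixes)):
        return ("X-Robots-Tag", "noindex")
    return None


if __name__ == "__main__":
    for requested in ("/downloads/guide.pdf", "/guide-download.html"):
        print(requested, "->", robots_header(requested))
```

A single rule like this is one way to get the "protocol" described above, since it applies to every content piece without per-file decisions.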

aakk9999

1:08 pm on Sep 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I know you cannot analyse each page separately, I thought you may have a "big picture" (cumulative picture) on whether the download page or PDF itself is where visitors land more.

Regarding cannibalisation, I would think in terms of your conversion. If both pages drive traffic and both pages convert, you are not really cannibalising. The angle you should explore is whether, by removing one of the two, you would make the remaining one stronger, so that it would rank better and therefore bring more visitors that would hopefully convert. For example, whether returning the canonical info in the PDF's HTTP header, pointing to the Download page, would make the Download page rank better and attract more visitors than the two separate pages combined.

What I would do is test the water with a few PDFs and their download pages - returning a canonical pointing to the Download page and monitoring what happens - and then I would make a decision based on this.
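One way to get that cumulative picture, and to monitor a test like this, is to count search-referred landings on PDFs versus HTML pages in the server access log. The Python sketch below assumes logs in the common "combined" format and a hand-picked list of search engine hostnames; both are assumptions, and many searches no longer pass a useful referrer, so treat the numbers as a rough indicator only.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches the request, status, size and referrer fields of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)

# Assumed list of search engine hostname fragments; adjust for your own traffic.
SEARCH_HOSTS = ("google.", "bing.", "duckduckgo.")


def count_search_landings(lines):
    """Count successful requests referred by a search engine, split into pdf/html."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or match.group("status") != "200":
            continue  # only full, successful responses are counted
        host = urlparse(match.group("referer")).netloc.lower()
        if not any(fragment in host for fragment in SEARCH_HOSTS):
            continue  # not referred by one of the assumed search engines
        path = match.group("path").split("?", 1)[0]
        kind = "pdf" if path.lower().endswith(".pdf") else "html"
        counts[kind] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [15/Sep/2015:10:00:00 +0000] "GET /docs/guide.pdf HTTP/1.1" 200 5120 "https://www.google.com/" "Mozilla/5.0"',
        '1.2.3.5 - - [15/Sep/2015:10:01:00 +0000] "GET /guide-download.html HTTP/1.1" 200 2048 "https://www.bing.com/search?q=guide" "Mozilla/5.0"',
        '1.2.3.6 - - [15/Sep/2015:10:02:00 +0000] "GET /docs/guide.pdf HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    ]
    print(dict(count_search_landings(sample)))
```

Running the same count before and after adding the canonical header to a few PDFs gives a simple baseline for the comparison.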