|Is it valid HTTP-wise to redirect a PDF to an HTML file?|
Hi guys, I'm looking for ideas from more experienced comrades on how to handle this situation:
I have a blog where I have a number of PDF files I've created as parts of tutorials on particular products. By parts I mean that the PDFs are not particularly useful if taken as a standalone source of information. Besides, the pages of the site that the PDFs are linked from, contain not only the rest of the (important) information but also some ads that I derive a tiny little bit of revenue from - nothing much but covers the expense of my doing the research on the product - something that I would like to continue. The PDFs contain mostly pictures and diagrams that would not be easy to use unless they are printed out, hence the PDF format.
Anyhow, what happens is: Google and that one other SE still standing are very happy to link to PDFs directly from their SERP pages and I see most of the usage of those PDFs as drive-by downloads. Meaning, a person just clicks on the link on Google, wastes anywhere between 500 kB and 4 MB (depending on the PDF) of my bandwidth, reads the PDF out of context and probably understands nothing or very little, gets disappointed and never opens another URL on my site. This apparently is not working for anyone: I cannot imagine Google liking this much either, and it probably even reflects badly on my site as a whole because user experience metrics should suffer from this use.
What I would like to do (if it's valid HTTP and will be understood by browsers) is redirect any request for a PDF that does not have my site as a referrer to the matching page of the blog that the PDF is linked from. Since the PDFs were uploaded as "media" in WP, I think I have enough information in the WP database to take the requested PDF URL and translate it to the URL of the post where it's been used. I haven't looked at the DB yet but I'm not very concerned with the internal programming, I'll do it one way or another. I'm more concerned with how this will be handled by different browsers out there and, of course, how Google will look at it (wouldn't want to be whacked for cloaking if they can take it that way).
So, is this something that's possible to do, and does anyone already use it perhaps? Any tips or things to watch out for anyone can offer?
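For anyone following along, the referrer check described above could be sketched roughly like this in Python. This is only an illustration under assumptions: SITE_HOST, the PDF_TO_POST map and the function name are hypothetical, and on a real WP install the map would have to come from the database lookup the poster mentions.

```python
from urllib.parse import urlparse

# Hypothetical values: the host name and the PDF-to-post map are assumptions.
SITE_HOST = "example.com"
PDF_TO_POST = {
    "/wp-content/uploads/widget-diagram.pdf": "/tutorials/widget-setup/",
}


def handle_pdf_request(path, referer):
    """Return (status, location) for a PDF request.

    location is None when the file should simply be served.
    """
    referer_host = urlparse(referer).hostname if referer else None
    if referer_host == SITE_HOST:
        # The visitor clicked the link on our own page: serve the PDF.
        return 200, None
    # No referrer, or a foreign one: send the visitor to the parent post,
    # falling back to the home page when the PDF is not in the map.
    return 302, PDF_TO_POST.get(path, "/")
```

The sketch uses a 302 rather than a 301 because the answer depends on the referrer: a permanent redirect may be cached by the browser, which could then keep redirecting the same visitor even after they click the link on the page itself. Note that the comparison is an exact host match, so a "www." variant would need to be added if the site answers on both.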
You would also be redirecting Googlebot if you do this, so your PDF would no longer rank.
If you opted to still serve Google the PDFs, it's definitely cloaking, and Google run automated tests to check for this (by pretending to be a browser with a Google search referrer). What you're proposing is to get Googlebot to rank content and then send visitors to somewhere you'd prefer them to be when they try to access the content that was ranked: the very essence of cloaking. I would proceed with extreme caution.
There's no tech issue, since file extensions are irrelevant. If you serve a .pdf as text/html it will open in a browser just like anything else.
Thanks, Andy. No, I would also redirect Google. I have no desire to have the PDF files rank by themselves - only the pages they are linked from. In fact, I suspect that having indexable PDFs is now hindering the ranks of the HTML page they're linked from - the KWs the PDFs are ranked for are almost invariably the title of the PDF which was used to link to them. In other words, since the text with that title is actually on the HTML page but within a link to elsewhere, Google must be "thinking" that elsewhere is better than right here.
Anyhow, the issue of ranking aside, like I said, I don't really want them to rank and if I only returned X-Robots-Tag: noindex , they could still be downloaded in a drive-by fashion. I want to eliminate the whole notion of using these PDF files as standalone sources of information, in other words there should be only one way to get them - visit the HTML page and download by clicking the link on it.
I was actually thinking specifically of the technical issue with it that you alluded to: the PDFs are actually served as application/pdf and not text/html . In fact, I think I should check whether I have Apache set up with "ForceType application/pdf", because their main use was for printing and I thought it wouldn't make much sense to have them open in the browser.
Perhaps my caffeine is taking way longer than usual to kick in today, but I am thinking a browser would "expect" a binary file - how will it handle a text/html being served instead? Actually, no, it would get a 301 redirect HTTP header served and will probably not follow it. Just like you cannot really redirect an image to a page - all the browser will do is show an error page within the square defined for the picture (if the size is known).
Anyway, like I said, I'm in the beginning of the research on the subject, and I'm probably not making much sense at this point, so I would appreciate if you point out any holes in my logic.
Oh, and on the subject of converting from one way of handling no-referrer PDF requests to another, I should probably take that to the Google forum - I do want to proceed with extreme caution and would like to avoid bringing down the house in the process.
Thank you for your input!
I think you're confusing the desktop with the web somewhat. On a PC, there are "file associations" whereby particular applications open particular file types (which are known by their extensions). There are no file extensions or file associations on the web. The browser determines what to do with a particular response by the content-type header, so if you want to name an HTML file .pdf, you can, as long as you serve it as text/html. Whether browsers download or display PDFs depends primarily on what PDF software is installed on an end-user's machine.
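To make the point about the content-type header concrete, here is a minimal Python WSGI sketch (not anyone's actual setup). The app ignores the URL entirely and always answers with text/html, so a request for a path ending in .pdf still comes back as HTML. The `app` and `call` names are made up for the example.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Answer every path, including ones ending in .pdf, with an HTML page."""
    body = b"<html><body><p>This is HTML, despite the URL.</p></body></html>"
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call(path):
    """Invoke the app in-process and return (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

Calling `call("/manual.pdf")` returns a 200 status with a text/html content type, which is the header a browser would act on.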
|if I only returned X-Robots-Tag: noindex , they could still be downloaded in a drive-by fashion |
They're not mutually exclusive approaches, though. First step should be to slap the no-index header on all your pdfs and pull them out of g###s index. This by itself will cut way back on people going straight for the pdfs, because now you've only got the people who already know about them.
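As a rough illustration of that first step, here is a Python WSGI middleware sketch that adds the header to PDF responses only. On Apache this would more likely be a config directive; `noindex_pdfs` is a hypothetical name, and matching on the ".pdf" path suffix is an assumption about how the files are addressed.

```python
def noindex_pdfs(app):
    """Wrap a WSGI app so responses for .pdf paths carry X-Robots-Tag: noindex."""

    def wrapped(environ, start_response):
        # Assumption: PDFs are recognizable by a path ending in ".pdf".
        is_pdf = environ.get("PATH_INFO", "").lower().endswith(".pdf")

        def patched_start_response(status, headers, exc_info=None):
            if is_pdf:
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped
```

Responses for every other path pass through untouched, so ordinary pages stay indexable.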
I think it's pretty closely analogous to hotlink protection. But since the pdfs are linked from the pages rather than displayed inline, you can set up alternative accesses for human users whose browsers don't send a referer. And meanwhile you can redirect referer-less requests to a page that says "I'm sorry but..."
Thanks, lucy24. Yes, I'll start sending X-Robots-Tag: noindex right away to make sure they are deindexed before I start redirecting them, just to avoid any possible cloaking issues.
As far as similarity to image hotlinking protection, I think it's *exactly* like hotlinking because many (most?) people have PDF plugins in their browsers that open them inside browsers and also inside frames. I haven't seen it done to my PDFs (and perhaps I just didn't look hard enough) but there are sites that do exactly that - they rank for the KWs found in the titles of your PDF but when you open their site, there would be a frameset where one frame is *their* ads and another is *your* PDF. Your wasted bandwidth, their ads revenue.
So, yes, I think this PDF redirection thing is pretty clearly needed. After that my next step would be to tackle image hotlinking - another rampant bandwidth waste which happens even more often though wastes less bandwidth (at least in my case, simply because images are smaller in size than PDFs)
Thanks for your input!
I have a few pdf files on my site - if people aren't on my site (external requests for .pdf) I have an "answer this question" form that must be passed before files can be read, downloaded etc.
I was redirecting using the no hot link rule [F], but I like this better.
Aside from the scrapers riding up the SERPs on the backs of other people's content, I had literally thousands of 404s from malformed external links.
Still have them since Google still crawls all the pdf scraper sites. Also use no index but I always check several (other) pages at redbot dot org after using these rules to make sure there is no weird inheriting going on.
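The "weird inheriting" check could also be scripted. The Python sketch below is not what redbot does, just a local stand-in: it takes response headers you have already collected (for example with HEAD requests) and lists any non-PDF URL that carries a noindex X-Robots-Tag. The function name and the dict-of-dicts input shape are assumptions.

```python
def unexpected_noindex(headers_by_url):
    """Return the non-PDF URLs whose response headers include noindex.

    headers_by_url maps each URL to a dict of its response headers.
    """
    leaks = []
    for url, headers in headers_by_url.items():
        path = url.split("?", 1)[0].lower()  # ignore any query string
        tag = ""
        for name, value in headers.items():
            # Header names are case-insensitive.
            if name.lower() == "x-robots-tag":
                tag = value.lower()
        if "noindex" in tag and not path.endswith(".pdf"):
            leaks.append(url)
    return leaks
```

An empty list means the header shows up only on the PDFs among the URLs you sampled.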