Forum Moderators: Robert Charlton & goodroi

IIS challenge - Is it ok to use 302's to deliver PDF etc. documents?


waveform

12:52 pm on Jun 18, 2009 (gmt 0)

10+ Year Member



Sorry for the long preamble, but this is a bit tricky...

I'm working with a client's site which is written in .NET and uses a 404 script to deliver dynamic PDF documents. They wanted "search-friendly" URLs on the site, hence the 404 technique to capture the requests from browsers for these non-existent paths.

The problem is, Google is slurping up masses of bandwidth, re-downloading these PDF files in full each time it crawls the site. I found the problem was that IIS is not passing the "If-Modified-Since" header through to the code. Somehow it goes missing, perhaps because the 404 process sits in the middle. Very odd.

My proposed solution is to: a) detect when a bot is crawling the site (via user-agent), then b) redirect the bot to another script (via 302) which, no longer hampered by the 404 process, can read the If-Modified-Since header on the new request and respond accordingly.
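A minimal sketch of that two-step logic (in Python purely for illustration; the real site is .NET, and the bot signatures and target script name here are hypothetical):

```python
# Sketch of the proposed bot-detection + 302 redirect.
# All names (BOT_SIGNATURES, /servedoc.aspx) are illustrative assumptions.

BOT_SIGNATURES = ("googlebot", "bingbot", "slurp")  # assumed UA substrings

def handle_pdf_request(user_agent, path):
    """Return a (status, location) pair for a PDF request.

    Bots get a 302 to a directly-requested script that can still see the
    conditional headers; ordinary browsers keep the existing 404 delivery.
    """
    ua = user_agent.lower()
    if any(sig in ua for sig in BOT_SIGNATURES):
        # 302 (temporary) so the engine keeps the original URL indexed
        return 302, "/servedoc.aspx?doc=" + path
    return 200, path  # normal delivery through the 404 script
```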

Now my question: If I set up these 302's as a permanent fixture (perhaps a 307 would be more appropriate?) for all requests to PDF documents on the site, will that make Google or other engines look upon the site with digital disdain and derision? Will it affect status or ranking?

Note this will only be the case for *files* (PDF, DOC, GIF, JPG, etc) but *not the page itself*, which will behave normally.

Sorry again for the long post. Any advice much appreciated!

tedster

6:46 pm on Jun 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My understanding is that a 301 redirect would be more appropriate, not a 302. However, serving one response to Googlebot and a different response to other visitors can create ranking problems and even penalties.

The 404 approach to redirects on IIS is a kind of hack - it's not the technical intention of a 404 status at all. I'd suggest investing in the third-party plug-in called ISAPI Rewrite and using it to generate all redirects.

waveform

7:32 pm on Jun 18, 2009 (gmt 0)

10+ Year Member



Hi tedster, thanks. A 301 makes Google replace the old URL with the new one, which isn't what I want. I gather that providing a 302 or 307 makes Google retain the original link in its records.

"However serving one response to googlebot and a different response to other visitors can create ranking problems and even penalties."

How would Google know I'm doing that? This isn't meant to deceive the search engine or anything, it's merely to get around a technical limitation in IIS.

tedster

7:55 pm on Jun 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google gathers information in many ways - I'm sure they would discover that you are doing this. Whether they would have a problem with it is an open question (it is a form of user-agent cloaking, after all). I appreciate that your intention is not manipulative - and in the best-case scenario, so would they.

My sense is that there is a more standard way around the technical limitation in IIS, however I'm not sure what your entire challenge is here. Is it not possible to create cached copies of the PDF files and serve them from a stable address?

wilderness

8:01 pm on Jun 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"However serving one response to googlebot and a different response to other visitors can create ranking problems and even penalties."

"How would google know I'm doing that? This isn't meant to deceive the search engine or anything, it's merely to get around a technical limitation in IIS."

One possible solution is to place the PDF's in a directory that is disallowed in robots.txt (or at least make the path appear to come from such a directory). Not sure about the ranking/penalty issues, but the action certainly sets a precedent.

Be forewarned! I've placed PDF's in a directory disallowed via robots.txt for nearly ten years and it doesn't stop Google and the other major SE's from grabbing the files. Whether the files are then made available in search results is another issue.
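For reference, the disallow rule in question would look something like this (the directory name is purely illustrative):

```
User-agent: *
Disallow: /pdfs/
```

As noted above, well-behaved crawlers should skip anything under /pdfs/, but in practice compliance varies.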

waveform

8:18 pm on Jun 18, 2009 (gmt 0)

10+ Year Member



"I'm not sure what your entire challenge is here."

It's a bit complicated, outlined briefly in the OP. The site is using a 404 error script to deliver files. Unfortunately, when a 404 is in place, IIS (for some doubtlessly "by-design" reason) strips two important items from the client headers: "If-Modified-Since" and "If-None-Match".

The upshot is I can't tell whether to respond to a crawler with a full document or a 304 Not Modified. So my idea is to redirect the bot to a normal .aspx script, which (because it will be requested directly, not via a 404) will be able to see those headers and respond accordingly.

If you have any other ideas, I'm all ears. That's the only one I can think of, aside from rewriting their entire site. :)

dstiles

9:56 pm on Jun 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a dual-use 404 system in use.

If a known browser comes in to a missing (removed) page, it gets a 301 redirect to the home page. If it's an SE, it gets a 404 to tell it the truth: no such page. E.g. Google came in today looking for default.html (we've NEVER had such a page, EVER, so it got a 404). A customer would have got the home page.

Could you implement this kind of thing for your PDFs?

(The reason for my dual action: some ISPs deploy 404 catchers that send their customers adverts instead of the site's proper 404 response, even a customized one such as "Sorry, try clicking here for widgets." I object to MY customers being taken away from MY site to enrich someone else. With a proper 404, the customer can try one of the returned links or modify the URL and try again.)
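That dual-action 404 can be sketched as follows (Python for illustration; the bot list is an assumption, and a real implementation would sit in the server's error handler):

```python
# Sketch of a dual-action response for a missing page:
# crawlers get an honest 404, human visitors get a 301 to the home page.

KNOWN_BOTS = ("googlebot", "bingbot", "slurp")  # assumed UA fragments

def missing_page_response(user_agent):
    """Return (status, location) for a request to a non-existent page."""
    if any(bot in user_agent.lower() for bot in KNOWN_BOTS):
        return 404, None  # tell the engine the truth: no such page
    return 301, "/"       # send a human visitor to the home page
```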

tedster

11:14 pm on Jun 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Serving a true 404 in the HTTP header for a URL that does not exist is really important for Google. They test this functionality all the time, ESPECIALLY on IIS websites. Some sites have been dropped from the index until they fixed this issue.

waveform

3:58 am on Jun 19, 2009 (gmt 0)

10+ Year Member



Thanks for your help and ideas. It does send proper 404's for requests where a page doesn't exist, no probs there.

dstiles, I see what you're saying, however these links are valid ones. The redirect would only be for crawlers, just so I can get hold of those headers in code, but I want the search engines to retain the *original* link in their systems, not discard it for the new one (which is what a 301 implies).

I had another idea... I could make a subtle but significant change to their URL structure. Instead of a URL like "www.site.com/docpath/friendly-doc-name.pdf" which triggers the 404 script, change it to "www.site.com/?/docpath/friendly-doc-name.pdf"

Putting the "?" in there changes all their paths to query strings passed to default.aspx for handling. This is a workable solution on all levels, but the question is - will having a "friendly querystring" instead of a "friendly URL" mess up their SEO?
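The routing trick is just that everything after the "?" arrives as the query string of a single front script. A sketch of the extraction (Python for illustration; the real handler would be default.aspx):

```python
# Sketch of "friendly querystring" routing: the path-like query string
# after "?/" is treated as a virtual document path by one front script.
from urllib.parse import urlsplit

def doc_path_from_url(url):
    """Extract the virtual path from a URL like
    http://www.site.com/?/docpath/friendly-doc-name.pdf"""
    query = urlsplit(url).query
    if query.startswith("/"):
        return query
    return None  # not a friendly-querystring request
```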

That is, does Google etc. treat query strings differently from URLs? Are "flat" sites like this (where everything comes from the one script) bad for SEO?