There's a bunch of different technologies involved here (Tomcat, JSP, etc), but I'm posting this to the Apache group because I think that mod_rewrite can be part of the solution. I also need to support this under IIS, and am hoping to use one of the mod_rewrite ports for IIS.
I have a large number of documents that I need to provide access to via a web server. There are two primary requirements:
1) I need to log access to these documents. Initially, this is actually just a view count, but may eventually involve authentication/authorization. For a variety of reasons, using the web server logs is not an option.
2) I need to be able to turn the availability of any single document on and off via a database flag.
Now, I should give a little info about these documents. They are compound documents made up of a main .htm file that pulls in JPEGs, Flash files, other HTML files, video and audio. In reality, the one document is actually made up of potentially hundreds of sub-documents. They are arranged as such:
/guid-1
    /document.htm
    /document_files
        document.swf
        image1.jpg
        image2.jpg
        audio.mp3
        video.rm
/guid-2
    /document.htm
    /document_files
        document.swf
        image1.jpg
        image2.jpg
        audio.mp3
        video.rm
I already provide the ability to download all the files as a single ZIP file. However, I need to be able to provide the ability to view the content directly off the webserver with the previous requirements in place.
I've already built a Java servlet that, given a GUID in a querystring, knows how to redirect to the content (http://localhost/view?GUID=12345). This servlet can check the availability of the content and increment the view count. Where I'm having the difficulty, however, is making sure that users use the servlet URL to access the content rather than the direct URL.
This is where I've been trying to use mod_rewrite. I've set up rules to rewrite the following URL:
[localhost...]
To...
[localhost...]
The last URL does the view counting and availability checking, and then redirects back to the original URL.
So, the first thing I found was that I had set up a loop. To deal with the loop, I had the servlet set a session cookie that indicated that I had been through the servlet. My mod_rewrite ruleset then recognized the existence of the cookie and skipped processing on the second pass. This looked great, but it only worked for the first document because my cookie name wasn't unique for each document. If I attempted to view another document during the same session, the view counting/access control was bypassed.
This solution was the closest to working so far, but I couldn't find a way to create a cookie name unique to each document that I could then read in a RewriteCond directive. I can certainly create a document-specific cookie name in my servlet (DOC12345, for example). Is it possible to dynamically construct the name of the cookie to be checked in the RewriteCond statement based on the original URL (http://localhost/content/12345/document.htm)?
I've tried a variety of other ways, like appending a querystring on the rewritten url so that it doesn't trigger the rule on the next pass through. Again, this worked on first look, but it could be bookmarked and therefore the user could bypass the checking on subsequent views.
I'd thought of trying to use the HTTP_REFERER server variable. The idea was that if the REFERER wasn't .*/view?GUID.* , then redirect. However, because of the nature of these documents and that some of the .HTM within the document actually redirect to other pages, I don't know if the HTTP_REFERER value can be trusted. I've also heard that sometimes the media players (Real, Macromedia, Windows Media) don't preserve referer info.
One other complicating factor (directly related to the compound nature of these docs): not only do I want to force access to the document through the 'view?GUID=' mechanism (meaning no [localhost...]), but I also want to prevent direct access to all of the files in the _files subdirectory. They should only be accessed as a result of coming through the top-level .htm file.
I know this is long. Please let me know if I can clarify any points.
I appreciate any help anyone can offer.
Thanks,
Rob
Welcome to WebmasterWorld!
I don't build complex sites or scripts myself, but here are a few suggestions from a "theoretical" standpoint:
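You can check the original, unrewritten URL by extracting it from the server variable %{THE_REQUEST} in mod_rewrite. That variable holds the request line exactly as the client sent it (for example, "GET /content/12345/document.htm HTTP/1.1") and is not changed by internal rewrites, so it will help solve your looping problem without requiring cookies.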
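To make the shape of that value concrete, here is a rough Java sketch of the kind of pattern test involved. It is not mod_rewrite configuration: the class name, method name, and regular expression are made up for illustration, and the /content/<guid>/ layout is assumed from the example URL in the question. An equivalent pattern in a RewriteCond would need to be checked against your own URL layout.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: inspects a raw request line such as THE_REQUEST would hold. */
final class RequestLine {
    // Assumed layout: "<METHOD> /content/<guid>/<path>[?query] HTTP/<version>"
    private static final Pattern DIRECT =
        Pattern.compile("^[A-Z]+ /content/([^/ ?]+)/([^ ?]*)(?:\\?[^ ]*)? HTTP/");

    /** Returns the GUID if the client asked for a content URL directly, else null. */
    static String guidOfDirectRequest(String requestLine) {
        Matcher m = DIRECT.matcher(requestLine);
        return m.find() ? m.group(1) : null;
    }
}
```

Because the test looks at what the client originally sent, a request that arrived as /view?GUID=... does not match, even if it is later mapped internally onto a path under /content/.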
Then, put your script between the client and the component files, making your script output the content to the client by building the page completely within the script. Since the script accesses the pages' components using server-internal filesystem requests, you can disallow *all* HTTP access to these component files, while still allowing the script to "read them in" to build the pages.
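As a rough sketch of that idea in Java (since you already have a servlet), something like the class below could sit behind the servlet. Everything here is assumed for illustration: the class and method names are hypothetical, the set of enabled GUIDs stands in for your database flag, and the in-memory counter stands in for your view-count table. It uses only the standard library, so the servlet wiring itself is left out.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical gatekeeper that reads document files from disk on the client's behalf. */
final class DocumentGate {
    private final Path contentRoot;
    private final Set<String> enabledGuids; // stand-in for the database availability flag
    private final Map<String, AtomicLong> viewCounts = new ConcurrentHashMap<>();

    DocumentGate(Path contentRoot, Set<String> enabledGuids) {
        this.contentRoot = contentRoot.toAbsolutePath().normalize();
        this.enabledGuids = enabledGuids;
    }

    /** Maps (guid, relative path) to a file, or null if disabled or outside the document. */
    Path resolve(String guid, String relativePath) {
        if (!enabledGuids.contains(guid)) {
            return null; // document is switched off (or unknown)
        }
        Path docRoot = contentRoot.resolve(guid).normalize();
        Path target = docRoot.resolve(relativePath).normalize();
        // Reject "../" tricks and absolute paths that escape this document's folder.
        return target.startsWith(docRoot) ? target : null;
    }

    /** Streams the file to out; counts a view only for the top-level document.htm. */
    boolean serve(String guid, String relativePath, OutputStream out) throws IOException {
        Path target = resolve(guid, relativePath);
        if (target == null || !Files.isRegularFile(target)) {
            return false; // caller would answer 404 or 403 here
        }
        if (relativePath.equals("document.htm")) {
            viewCounts.computeIfAbsent(guid, g -> new AtomicLong()).incrementAndGet();
        }
        Files.copy(target, out);
        return true;
    }

    long views(String guid) {
        AtomicLong count = viewCounts.get(guid);
        return count == null ? 0 : count.get();
    }
}
```

With this arrangement the content directory no longer has to be reachable over HTTP at all, so there is no direct URL to bookmark, and the files under document_files are counted as part of the document rather than as separate views.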
Jim