I'm going to try to explain the situation I have and what I want to do as best as possible. Hopefully some generous person can help point me in the right direction.
I run a site that hosts several gigs of shareware and freeware software. The site runs on two servers: the main one, forums.example.org, holds the HTML pages and descriptions, and the other, archive.example.org, simply houses all the .zip, .hqx and .sit files.
Looking through my logs, I'm getting about 2.5 GB of downloads per day at archive.example.org, but only 10% of that is being referred from my other domain, forums.example.org, which houses the description pages. This means that the vast majority of it is being downloaded from links on external sites. Right now, if you go to one of those sites and click a link to download a file that is hosted on my server, it simply downloads as if it were on their own server, and you'd never know it was actually hosted on mine.
So what I want to do is this: for any request that comes from a site that is NOT one of my two servers, redirect to a page that says something like "Thank you for downloading from Info-Mac!" to acknowledge where they're getting their files from, and THEN the download should begin. Similar to how SourceForge works. I don't want to break the link, just have a little recognition in the least annoying way possible.
So I need to redirect requests to .hqx, .sit and .zip files that come from anywhere but my own two servers to this page, but then also be able to detect what the originally requested file was, so an automatic download will start.
How should I go about doing this?
Thanks in advance!
[edited by: tedster at 6:52 pm (utc) on Oct. 8, 2008]
[edit reason] switch to example.com - no member domain names please [/edit]
Sniff the referrer: If it is not blank and it is not your site.
Sniff the requested file extension: If it is one of "those".
If both are true, then process the redirect.
A search in this forum for "hotlink blocking" will find a number of examples that simply block access to the files. You'll need some of that code, followed by the modified redirect code.
It won't block all of them (browsers that do not send a referrer will still get through) but should cut it substantially.
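In outline, and as a rough sketch only (not tested code; the host name is a placeholder for your own domains), the rules in the archive server's .htaccess would look something like this:

RewriteEngine On
# Referrer is not blank...
RewriteCond %{HTTP_REFERER} !^$
# ...and is not one of your own sites:
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/ [NC]
# Requested file is one of "those" extensions: send the visitor to the thanks page.
RewriteRule \.(zip|sit|hqx)$ http://www.example.com/thanks.html [NC,R=302,L]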
Let's say someone clicks a link to [archive[...]
I want to redirect users to a page on [www[...] which is on a different server.
Upon loading thanks.html, I want it to know what the path to the originally requested file was (http://archive.example.com/file.zip) so that it can automatically download it after like 5 seconds, and/or provide a link to it, or show what it is... etc.
Is this possible? I think I might need a combination of things, including .htaccess redirects.
Thus far, at archive.example.com's .htaccess file I have:
<FilesMatch "\.(zip|sit|hqx|dmg|iso)$">
redirectMatch .*\.zip$ http://www.example.com/thanks.html
redirectMatch .*\.sit$ http://www.example.com/thanks.html
redirectMatch .*\.hqx$ http://www.example.com/thanks.html
redirectMatch .*\.dmg$ http://www.example.com/thanks.html
redirectMatch .*\.iso$ http://www.example.com/thanks.html
</FilesMatch>
The server successfully redirects to thanks.html, but using a simple bit of JavaScript to display the referrer, it reports only the original site that the person clicked from, NOT the archive.example.com domain that the file is stored on.
What should I do from here?
I'm so lost! :(
For multiple matches, you can simplify the repeating rules by using a | pipe symbol for "OR", like this: (zip|sit|hqx|dmg|iso). Use RewriteCond and RewriteRule in the .htaccess to sniff the initial referrer value and then only redirect based on the rules I stated above ("referrer is not blank and is not this site" and "requested file is one of those to be protected").
You could then append the requested URL as a parameter to the end of the new target URL (the thanks page), so that the initial referrer value is available for processing at the other site after you have been redirected to it.
So, I am at some other site and I click on a link that points directly to example.example.com/file.zip on your site.
The .htaccess on your site detects that the referrer is from the wrong site, and issues a redirect to http://www.example.com/thanks.html?refer=example.example.com/file.zip
The thanks.html page should have a meta robots noindex tag on it, and the script within it should detect the parameter that was passed in the URL and use it to populate the new destination URL in the new outgoing link.
The script should also detect that the outgoing URL is for your domain, otherwise you will find other people using your redirect to bounce people off to other spammy sites and virus-laden files.
If all is OK, clicking on this link will deliver the expected file just fine, because the referrer passed with it will now be for a URL that is within your site.
.
The other way to do this is with cookies. Any page of your site loads a cookie on to the visitor's machine (and each page view updates it), and that cookie is set to be valid only for a short time (at most a few hours, maybe one day). To get the ZIP file, you must have a valid cookie, otherwise you are redirected to the thanks page.
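As a rough sketch of the cookie variation (the cookie name "dl_ok" is made up for this example, and your description pages would have to set it themselves, e.g. from PHP or a Header directive):

RewriteEngine On
# No valid cookie present on this request...
RewriteCond %{HTTP_COOKIE} !(^|;\s*)dl_ok=1 [NC]
# ...and the request is for a protected archive type: bounce to the thanks page.
RewriteRule \.(zip|sit|hqx|dmg|iso)$ http://www.example.com/thanks.html [NC,R=302,L]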
So now, my revised strategy is: on files.example.com, check the referrer... if it is not one of my sites, and not blank, then redirect to www.example.com/thanks.php?variable=files.example.com/therequestedfile.zip, and thanks.php will automatically work with the variable.
However, I'm still not sure I'm getting how to do a few things. First, how do I make the redirect happen only if the referrer is NOT one of my own URLs? Is there some kind of if...then statement in .htaccess? Also, how do I check just the domain part? Like, ANYTHING that comes from http://www.example.com?
Secondly, how do I get .htaccess to redirect to another URL + ?variable= + the path of the requested file? How do I pass all that as the new redirect location?
Thanks for all your help so far
You'll need one entry that looks at "." - that's a single dot and means "not blank".
Then use "!" for "not" and have one line for each of the "allowed" sites (if the referrer is "not" one of these... do the redirect). Don't forget to cater for www and non-www and any other subdomains too.
There are loads of "prevent hotlinking" examples in this forum. You'll need all but the last line of those examples, and you'll replace the last line (which usually completely blocks access) with your redirect to the "thanks" page.
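As a sketch of how that might end up (the host names are placeholders; list every domain and subdomain that is allowed to link directly):

# Referrer is not blank (a single dot means "at least one character")...
RewriteCond %{HTTP_REFERER} .
# ...and is not one of the allowed sites, www and non-www:
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^http://forums\.example\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^http://archive\.example\.com/ [NC]
# A plain hotlink-blocking example would end with something like:
#   RewriteRule \.(zip|sit|hqx|dmg|iso)$ - [F]
# Replace that last line with the redirect to the "thanks" page instead:
RewriteRule \.(zip|sit|hqx|dmg|iso)$ http://www.example.com/thanks.html [NC,R=302,L]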
.
In your script (on the thanks page), you'll need to pull the {QUERY_STRING} data off the latest request and use that to populate the link on the page. How you do that depends on whether you are using PHP, or JavaScript, or something else.
The "thanks" page *must* have a meta robots noindex tag on it, or better yet, be blocked by robots.txt. You will be in a world of woe if the "thanks" page is indexed by search engines under hundreds of different query string variations of URL.
Take a look at our Forum Charter for links to useful resources about mod_rewrite and related subjects.
Jim
Options +FollowSymlinks
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/ [NC]
RewriteRule .*\.(zip|hqx|sit|dmg|iso|bin)$ http://www.example.com/referer_test.html?a=test [NC]
Everything works except one thing that I haven't implemented. At the very end of the last line, where it says ?a=test, how do I replace "test" with the path of the originally requested file? Is it possible to have it just the path, and not the full http://? This way, on my thanks.php page I can add http://example.com before it and prevent people from using my page to link to nasty stuff.
[edited by: jdMorgan at 12:46 am (utc) on Oct. 11, 2008]
[edit reason] example.com [/edit]
Put parentheses ( ) around the part of the pattern that you want to capture; it becomes available as $1, and you can then re-use that information on the right-hand side.
Using .* isn't very efficient. You might try ^([^.]+\.(zip|hqx|sit|dmg|iso|bin))$ which says "capture everything that isn't a dot, followed by a dot, followed by the file extension". It might be better to break on / rather than on the dot, but anything is likely better than .*, which is inefficient.
To finish, simply drop $1 on to the end of the target URL, as in ...?a=$1, and the collected information will be appended on the end.
This stuff is very powerful and is completely unforgiving of typos and errors in thought process. That is why you have to define *exactly* what you want it to do before you start any coding (or develop on a test server where errors are not going to get an entire site de-listed from search engines). I hope I haven't made any errors, but jd will likely look in on this thread and offer a more efficient solution if it exists.
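To make that concrete, the last line of the ruleset might become something like this (a sketch only; the "a" parameter name just matches your test code, and the same RewriteCond lines stay in front of it):

# Capture the requested path and extension as $1 and pass it along to the thanks page.
RewriteRule ^([^.]+\.(zip|hqx|sit|dmg|iso|bin))$ http://www.example.com/thanks.php?a=$1 [NC,R=302,L]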
[edited by: g1smd at 8:50 pm (utc) on Oct. 10, 2008]
EDIT: Many of my files are located in subdirectories, and subdirectories within subdirectories and so on. Basically, I'd fancy the ability to break after the http://example.com/ and catch the subdirectories and the file name only. Will this still work with the above example?
EDIT#2: I have, indeed, been doing all this on a test server first.
[edited by: jdMorgan at 12:47 am (utc) on Oct. 11, 2008]
[edit reason] example.com [/edit]
By default, neither the host name nor the query string is part of what the RewriteRule pattern sees. They can be found in %{HTTP_HOST} and %{QUERY_STRING} etc., tested separately in RewriteCond.
The path information that is present is also localised, such that if the .htaccess is in a folder, then information about folders closer to the root is not included.
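One way around that limitation (a sketch, and only one of several options) is to use the %{REQUEST_URI} server variable in the substitution, since it always carries the full path from the document root, e.g. /sub/dir/file.zip, no matter which folder the .htaccess sits in:

# Same RewriteCond lines as before, then pass the full requested path along.
RewriteRule \.(zip|hqx|sit|dmg|iso|bin)$ http://www.example.com/thanks.php?a=%{REQUEST_URI} [NC,R=302,L]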
Glad you got it working.
You may find the PHP and/or JavaScript forums at WebmasterWorld to be helpful in getting your "thanks" script to do what you want it to, but it does sound like you are already well on the way.
One thing to do as you go along, is to ask yourself "how could someone abuse this?" and take steps to prevent that. I mentioned the usage of robots disallow.
What would happen if someone linked to your "thanks" page? What would happen if they crafted a "nasty" value for the filename? There's more data checking to be done in the script to reject malicious requests, and to ensure only valid archives are requested.
I think this should have it all taken care of, yes?