Forum Moderators: phranque

Complex results desired - with .htaccess?

Redirecting people from different servers to different places.

         

dpaanlka

4:24 pm on Oct 8, 2008 (gmt 0)

10+ Year Member



Hello!

I'm going to try to explain the situation I have and what I want to do as best as possible. Hopefully some generous person can help point me in the right direction.

I run a site that hosts several gigs of shareware and freeware software. The site has two servers - one is the main one with HTML pages and descriptions - forums.example.org - and the other server simply houses all the .zip, .hqx and .sit files - archive.example.org.

Looking through my logs, I'm getting about 2.5 GB of downloads per day at archive.example.org, but only 10% of that is referred from my other domain, forums.example.org, which houses the description pages. This means the vast majority is being downloaded via links on external sites. Right now, if you go to one of those sites and click a link to download a file that is hosted on my server, it simply downloads as if it were on their own server, and you'd never know it was actually hosted on mine.

So what I want to do is, any request that comes from a site that is NOT one of my two servers, I want to redirect to a page that says something like "Thank you for downloading from Info-Mac!" to acknowledge where they're getting their files from, and THEN the download should begin. Similar to how SourceForge works. I don't want to break the link, just have a little recognition in the least-annoying way possible.

So I need to redirect requests to .hqx, .sit and .zip files that come from anywhere but my own two servers to this page, but then also be able to detect what the originally requested file was, so an automatic download will start.

How should I go about doing this?

Thanks in advance!

[edited by: tedster at 6:52 pm (utc) on Oct. 8, 2008]
[edit reason] switch to example.com - no member domain names please [/edit]

g1smd

6:28 pm on Oct 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You'll need just a few lines of code in your .htaccess file to make this work.

Sniff the referrer: If it is not blank and it is not your site.

Sniff the requested file extension: If it is one of "those".

If both are true, then process the redirect.

A search in this forum for "hotlink blocking" will find a number of examples that simply block access to the files. You'll need some of that code, followed by the modified redirect code.

It won't block all of them (browsers that do not send a referrer will still get through) but should cut it substantially.
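To make that concrete, here is a minimal sketch of the idea (the hostnames, extensions and thanks-page URL are placeholders for your own, and it assumes mod_rewrite is enabled on the archive server):

```apache
# Sketch only: placeholder hostnames and extensions.
RewriteEngine On

# Referrer is present...
RewriteCond %{HTTP_REFERER} !^$
# ...but is not one of your own two servers
RewriteCond %{HTTP_REFERER} !^http://forums\.example\.org/ [NC]
RewriteCond %{HTTP_REFERER} !^http://archive\.example\.org/ [NC]

# Protected extensions get bounced to the thanks page
RewriteRule \.(zip|sit|hqx)$ http://forums.example.org/thanks.html [NC,R,L]
```

The "hotlink blocking" examples in this forum use the same RewriteCond structure, just with a forbid rule in place of the final redirect.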

dpaanlka

8:39 am on Oct 10, 2008 (gmt 0)

10+ Year Member



Ok let me simplify this...

Lets say someone clicks a link to [archive[...]

I want to redirect users to a page on [www[...] which is on a different server.

Upon loading thanks.html, I want it to know what the path to the originally requested file was (http://archive.example.com/file.zip) so that it can automatically download it after like 5 seconds, and/or provide a link to it, or show what it is... etc.

Is this possible? I think I might need a combination of things, including .htaccess redirects.

Thus far, at archive.example.com's .htaccess file I have:

<FilesMatch "\.(zip|sit|hqx|dmg|iso)$">
redirectMatch .*\.zip$ http://www.example.com/thanks.html
redirectMatch .*\.sit$ http://www.example.com/thanks.html
redirectMatch .*\.hqx$ http://www.example.com/thanks.html
redirectMatch .*\.dmg$ http://www.example.com/thanks.html
redirectMatch .*\.iso$ http://www.example.com/thanks.html
</FilesMatch>

The server successfully redirects to thanks.html, but when I use a simple JavaScript to display the referer, it shows only the original site that the person clicked from, NOT the archive.example.com domain that the file is stored on.

What should I do from here?

I'm so lost! :(

g1smd

11:01 am on Oct 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Don't use RedirectMatch; you need a bit more power than it offers. And don't use .* as a pattern, as it is not efficient.

For multiple matches, you can simplify repeating rules by using the pipe symbol | for "OR", like this: (zip|sit|hqx|dmg|iso)

Use RewriteCond and RewriteRule in the .htaccess to sniff the initial referrer value and then only redirect based on the rules I stated above ("referrer is not blank and is not this site" and "requested file is one of those to be protected").

You could then append the requested URL as a parameter on the end of the new target URL (the thanks page), so that the originally requested file is available for processing at the other site after you have been redirected to it.

So, I am at some other site and I click on a link that points directly to archive.example.com/file.zip on your site.

The .htaccess on your site detects that the referrer is from the wrong site, and issues a redirect to http://www.example.com/thanks.html?refer=archive.example.com/file.zip

The thanks.html page should have a meta robots noindex tag on it, and the script within it should detect the parameter that was passed in the URL and use it to populate the new destination URL in the new outgoing link.

The script should also detect that the outgoing URL is for your domain, otherwise you will find other people using your redirect to bounce people off to other spammy sites and virus-laden files.

If all is OK, clicking on this link will deliver the expected file just fine, because the referrer passed with it will now be for a URL that is within your site.

.

The other way to do this, is with cookies. Any page of your site loads a cookie on to the visitors machine (and each page view updates it), and that cookie is set to be valid only for a short time (at most a few hours, maybe one day). To get the ZIP file, you must have a valid cookie, otherwise you are redirected to the thanks page.
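As a rough illustration of the cookie variant (a sketch only; the cookie name dl_ok, the 60-minute lifetime, and the hostnames are assumptions, and it relies on mod_rewrite's CO flag to set the cookie):

```apache
# On the pages server: every page view sets/refreshes a cookie
# valid across the whole domain. The CO flag lifetime is in
# minutes; the cookie name "dl_ok" is just an example.
RewriteEngine On
RewriteRule \.html$ - [CO=dl_ok:1:.example.org:60:/]

# On the archive server: a request without a valid cookie is
# redirected to the thanks page instead of getting the file.
RewriteEngine On
RewriteCond %{HTTP_COOKIE} !dl_ok=1
RewriteRule \.(zip|sit|hqx|dmg|iso)$ http://forums.example.org/thanks.html [NC,R,L]
```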

dpaanlka

5:48 pm on Oct 10, 2008 (gmt 0)

10+ Year Member



Ok, that works too...

So now, my revised strategy is, on files.example.com I am to check the referer... if it is not one of my sites, and not blank, then redirect to www.example.com/thanks.php?variable=files.example.com/therequestedfile.zip and thanks.php will automatically work with the variable.

However, I'm still not sure how to do a few things. First, how do I redirect only if the referer is not one of my own URLs? Is there some kind of if...then statement in .htaccess? Also, how do I check just the domain part? Like, ANYTHING that comes from http://www.example.com?

Secondly, how do I get htaccess to redirect to another URL + ?variable= + the path of the requested file? How do I pass all that as the new redirect location?

Thanks for all your help so far

g1smd

5:55 pm on Oct 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Look at adding a set of RewriteCond statements, with each one looking at %{HTTP_REFERER} data.

You'll need one entry that tests for a single dot . which means "not blank".

Then use ! for "not" and have one line for each of the "allowed" sites (if the referrer is "not" one of these, do the redirect). Don't forget to cater for www and non-www and any other subdomains too.

There are loads of "prevent hotlinking" examples in this forum. You'll need all but the last line of those examples, and you'll replace the last line (which usually completely blocks access) with your redirect to the "thanks" page.

.

In your script (on the thanks page), you'll need to pull the query string data off the latest request and use it to populate the link on the page. How you do that depends on whether you are using PHP, or JavaScript, or something else.

The "thanks" page *must* have a meta robots noindex tag on it, or better yet, be blocked by robots.txt. You will be in a world of woe if the "thanks" page is indexed by search engines under hundreds of different query string variations of URL.

jdMorgan

6:34 pm on Oct 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just so we don't get a communication disconnect here, we are discussing using mod_rewrite in .htaccess. Mod_rewrite has the RewriteCond and RewriteRule directives which support conditional URL operations.

Take a look at our Forum Charter for links to useful resources about mod_rewrite and related subjects.

Jim

dpaanlka

7:49 pm on Oct 10, 2008 (gmt 0)

10+ Year Member



OK, I think I am getting closer than ever. This is the code I now have:

Options +FollowSymlinks
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/ [NC]
RewriteRule .*\.(zip|hqx|sit|dmg|iso|bin)$ http://www.example.com/referer_test.html?a=test [NC]

Everything works except one thing that I haven't implemented. At the very end of the last line, where it says ?a=test, how do I replace "test" with the path of the originally requested file? Is it possible to have it just the path, and not the full http://? This way, on my thanks.php page I can add http://example.com before it and prevent people from using my page to link to nasty stuff.

[edited by: jdMorgan at 12:46 am (utc) on Oct. 11, 2008]
[edit reason] example.com [/edit]

g1smd

8:38 pm on Oct 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes. You are well on the way I think.

.

In the condition, using a single dot . is functionally equivalent to !^$ but do note:

!^$ means "is not blank". . means "one or more characters". The single dot is parsed a lot quicker.

g1smd

8:45 pm on Oct 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The trick in passing the requested path over is to use brackets to capture that information into a backreference, $1, and then re-use it on the right-hand side.

Using .* isn't very efficient. You might try

^([^.]+\.(zip|hqx|sit|dmg|iso|bin))$

which says "capture everything that isn't a dot, followed by a dot, followed by one of the file extensions". It might be better to break on / rather than on the dot, but anything is likely to be better than the inefficient .* pattern.

.

To finish, simply drop $1 on the end of the target URL, as in ...=$1 so that the captured information is appended on the end.

This stuff is very powerful and is completely unforgiving of typos and errors in thought process. That is why you have to define *exactly* what you want it to do, before you start any coding (or develop on a test server where errors are not going to get an entire site de-listed from search engines). I hope I haven't made any errors, but jd will likely look in on this thread and offer a more efficient solution if it exists.

[edited by: g1smd at 8:50 pm (utc) on Oct. 10, 2008]
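Putting those pieces together with the code earlier in the thread, the capture-and-append version might look something like this (a sketch only; the parameter name refer and the domains are placeholders to adjust):

```apache
Options +FollowSymlinks
RewriteEngine On

# Referrer present, but not from your own site (www or non-www)
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/ [NC]

# Capture the requested path in $1 and append it as ?refer=...
# Note: [^.]+ assumes directory names contain no dots.
RewriteRule ^([^.]+\.(zip|hqx|sit|dmg|iso|bin))$ http://www.example.com/thanks.php?refer=$1 [NC,R,L]
```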

dpaanlka

8:49 pm on Oct 10, 2008 (gmt 0)

10+ Year Member



Any ideas on how to do the last bit? Attaching the requested file path to the new redirect location as a URL variable?

EDIT: Many of my files are located in subdirectories, and subdirectories within subdirectories and so on. Basically, I'd fancy the ability to break after the http://example.com/ and catch the subdirectories and the file name only. Will this still work with the above example?

EDIT#2: I have, indeed, been doing all this on a test server first.

[edited by: jdMorgan at 12:47 am (utc) on Oct. 11, 2008]
[edit reason] example.com [/edit]

dpaanlka

8:59 pm on Oct 10, 2008 (gmt 0)

10+ Year Member



Well, I used ^([^.]+\.(zip|hqx|sit|dmg|iso|bin))$ + the $1 at the end and it does seem to be working now. Thanks for all your help! I'll test this out for a bit and if anything happens I'll post here.

g1smd

12:21 am on Oct 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The left-hand side of a RewriteRule has only the folder and file path available to be tested and evaluated.

The hostname and the query string are not part of that match; they can be found in %{HTTP_HOST} and %{QUERY_STRING} and tested separately in RewriteCond lines.

The path information that is present is also localised, such that if the .htaccess file is in a folder, then information about folders closer to the root is not included.

Glad you got it working.

You may find the PHP and/or JavaScript forums at WebmasterWorld to be helpful in getting your "thanks" script to do what you want, but it does sound like you are already well on the way.

One thing to do as you go along, is to ask yourself "how could someone abuse this?" and take steps to prevent that. I mentioned the usage of robots disallow.

What would happen if someone linked to your "thanks" page? What would happen if they crafted a "nasty" value for the filename? There's more data checking to be done in the script to reject malicious requests, and to ensure only valid archives are requested.

dpaanlka

9:37 pm on Oct 11, 2008 (gmt 0)

10+ Year Member



I've done a few things to ensure security: blocking the page in robots.txt; parsing the passed variable into HTML-friendly code with $variable = htmlspecialchars($_GET['variable']); (so <script> won't run as a script, for example); and passing only the path of the file, not the domain, then adding the domain back on. In other words, the .htaccess only passes directory/file.ext and the script on the thanks page combines it with http://www.example.com/ to create the full URL. No matter what someone passes to the thanks page, it always gets appended to http://www.example.com/, so they can't send people elsewhere.

I think this should have it all taken care of, yes?