helleborine -- a simple way to strip referer is to deliver a link via an onclick event, and there are numerous other ways to make the referer wrong if you are serving the referring page. While you can try to block on referer, it will fail for the most determined sites ... probably the ones you want to block most. It takes more effort to spoof IP addresses, but it's pretty easy, especially through proxies, and now you can get a new IP from the cloud in a matter of minutes. In short, another losing battle.
There are two approaches I have tried, for various sites with content that people like to copy.
The first took me a couple of days to code, and a couple of months of refinements every so often, with quarterly touchups. By scanning a 5- to 15-minute stream of requests, it's quite easy to pick out the bots. Typically a given burst of requests will share the same IP, or user agent, or lack of referer (or same referer) -- a few common patterns. My code kept the most recent "suspects" in memory, with counters incremented every time a given instance of a pattern was recognized.
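For what it's worth, here's a rough sketch of that kind of suspect-tracking loop in Python (not my original code; the field names and thresholds are just assumptions):

    import time
    from collections import defaultdict

    WINDOW = 10 * 60     # roughly the 5-15 minute scan window
    THRESHOLD = 100      # hits per window before a pattern becomes a "suspect"

    # hit timestamps keyed by (pattern_type, pattern_value), e.g. ("ip", "203.0.113.9")
    hits = defaultdict(list)

    def record(request):
        """Tally the patterns this request exhibits; 'request' is assumed to be a
        dict with 'ip', 'user_agent' and 'referer' keys."""
        now = time.time()
        patterns = [
            ("ip", request["ip"]),
            ("ua", request["user_agent"]),
            ("referer", request.get("referer") or "MISSING"),
        ]
        suspects = []
        for key in patterns:
            # keep only hits still inside the window, then add this one
            hits[key] = [t for t in hits[key] if now - t < WINDOW] + [now]
            if len(hits[key]) > THRESHOLD:
                suspects.append(key)
        return suspects   # non-empty means: start serving this client the special pages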
Suspects would get a custom page: one with a honeypot (for crawlers), one with content that only a determined bot could tell was not meant for human consumption, or some suitably interdependent javascript that could easily confuse them. Any suspect who failed one or more of the tests would start getting 503's or (my favorite) a really slow response. Often that was enough to discourage them. Those who went away quickly enough had their details recorded, but were taken off the suspects list.
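The slow-response/503 part is trivial; something along these lines (a minimal WSGI-flavored sketch, assuming a failed_tests counter kept elsewhere -- not the code we actually ran):

    import time

    def punish(environ, start_response, failed_tests):
        """Hypothetical handler for suspects who failed one or more of the tests."""
        if failed_tests >= 3:
            # persistent offenders get an outright 503
            start_response("503 Service Unavailable", [("Retry-After", "3600")])
            return [b"Service temporarily unavailable.\n"]
        # otherwise, the really slow response: dribble out a useless page
        start_response("200 OK", [("Content-Type", "text/html")])
        def dribble():
            for chunk in [b"<html><body>"] + [b"<p>...</p>"] * 20 + [b"</body></html>"]:
                time.sleep(5)   # a human would have given up long ago
                yield chunk
        return dribble()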
Repeat offenders got put in a "quick 403" list, which we would review as new entries came in (which wasn't that often, actually). This was just a simple map that Apache used to block on whatever pattern they had shown. The list recorded the original date, the most recent offense, and the number of offenses.
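Maintaining that list is a few lines of bookkeeping plus a flat file for Apache to read -- e.g. the key/value format a txt RewriteMap expects (a sketch; the file names and the "deny" value are assumptions):

    import json, time

    BLOCKLIST = "quick403.json"      # our bookkeeping: first seen, last offense, count
    APACHE_MAP = "quick403_map.txt"  # the flat map Apache reads

    def add_offender(pattern):
        """Record an offense against a pattern (an IP, a UA string, etc.) and
        regenerate the map used for the quick 403s."""
        try:
            with open(BLOCKLIST) as f:
                db = json.load(f)
        except FileNotFoundError:
            db = {}
        entry = db.setdefault(pattern, {"first_seen": time.time(), "offenses": 0})
        entry["offenses"] += 1
        entry["last_offense"] = time.time()
        with open(BLOCKLIST, "w") as f:
            json.dump(db, f, indent=2)
        # one "key value" line per offender
        with open(APACHE_MAP, "w") as f:
            for key in sorted(db):
                f.write(f"{key} deny\n")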
That was enough for our needs, and it probably blocked 95% of the unruly bots, either by giving them bogus content or just by making them go away quickly. The total number of hits on expensive content dropped dramatically, and occasions when enough mal-bots hit our sites simultaneously to cause capacity or performance problems were effectively eliminated.
We set up a simple Google Alert for the special bogus content we had delivered, and regularly found the sites that had ripped us off. We either reported them to Google or, in several cases, were able to deliver content that would certainly get them penalized by Google and other SEs. I would have automated that process if I'd had the chance.
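If I were automating it today, the first step would be making the bogus content traceable: give each suspect a unique marker string, log it, and then the alert (or a periodic search) on that marker tells you exactly whose copy got scraped. A minimal sketch (the secret, file name, and token format are all just assumptions):

    import hashlib, time

    SECRET = "rotate-me-quarterly"   # any private salt will do

    def watermark(suspect_key):
        """Produce a unique, innocuous-looking token to embed in the bogus page
        served to this suspect, and log it so a search hit can be traced back."""
        token = hashlib.sha1(f"{SECRET}:{suspect_key}".encode()).hexdigest()[:12]
        with open("watermarks.log", "a") as f:
            f.write(f"{time.time():.0f}\t{suspect_key}\t{token}\n")
        # slip the token into the page as a plausible-looking product code or footnote
        return f"ref-{token}"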
This was some of the most rewarding software I have ever written. We did some similar stuff with the form-bots and our email server was not a friendly place for those who wanted to hack us. There's little that is more gratifying than taking down an attacker by deflecting their punches and using their own power against them -- malware kung fu :-)
But then I went to another company and found that our needs were different. In that case, we just added enough servers to handle the load the bots put on us, and trusted Google to do the right thing, which they almost always have. When a page we cared about suddenly changed, we manually checked to see how many copies of it there were on the web, and hired an intern to rat on the obvious ones to Google. It's less automated, but it only comes up once or twice a month, and it requires no special code.
I was thinking about using Amazon Mechanical Turk to do the human part of these tasks, but then realized that's probably how the bots are getting so clever at getting past automated or heuristic prevention techniques. And this is probably the most alarming aspect of the whole thing: people are getting paid pennies to find ways to spam your site, and once you have people using real browsers to outwit your automated defenses, it's game over. People are always smarter and harder to detect.
I recommend the second approach, unless you have good reason to go for the first.
Tom