|Spam filtering: now with 100% less CAPTCHA|
Implementing anti-spam measures in web forms
| 3:48 am on Sep 9, 2010 (gmt 0)|
Recently, I've been doing some research into the behavior of automated spambots, tracking how they register and use accounts on forum software and other "easy" targets that I have access to. I've noticed my CAPTCHA implementations becoming increasingly ineffective, and the need arose for a new type of protection.
As a result of my investigations, I've noticed two primary patterns and one secondary pattern that can, in most cases, be tracked and blocked. I decided to implement an anti-spam mechanism on a website I run that has been particularly hard-hit by spambots. Immediately, the amount of spam received dropped to almost 0. While it's too much to hope that no spam whatsoever could get through, I haven't received any spam whatsoever on this first site in the months following the implementation of the anti-spam mechanisms. Even better, the methods I chose to implement are fairly fast, easy to build, and can be generalized to work with almost every conceivable platform and programming language. Needless to say, I was delighted with this discovery!
The spambots I tracked fell into one of two categories almost without exception. I have termed these categories human-led and heuristic throughout this post.
Human-led spambots
This class of spambot relies on a human "leader" to fill in a given form for the 'bot, giving each form field a unique value. The human records what POST/GET data is sent to the server and tracks where specific values appear. An automated robot can then mimic the human's form submission, with the desired values in the saved POST/GET data replaced by spam messages. Forms are often located via a spider disguised with a real user agent, and the spambot usually mimics the human's user agent as well. Humans are often hired to do this from low-wage areas or through some sort of Mechanical Turk-style project, since with the right software the human needs almost no training.
A subclass of this type of 'bot is entirely automated, using a heuristic engine to discover form fields, submit bogus data, discover whether and where the sent data is displayed on the site, and save this information for repeated automated submissions later, in the same manner as described above.
Heuristic spambots
These 'bots use a parsing engine to request a page, find the form fields on it, fill them in, and submit the form. Since most forum software and many comment systems, especially on popular CMS software, behave similarly and use similar form field names, guessing which data is likely to be displayed on the page is relatively trivial.
Blocking automated submissions
Since there are two main categories of spambots that hit these targets, the anti-spam method I implemented had two corresponding parts.
Blocking human-led spambots
These are the hardest 'bots to block reliably. Due to the human leader, the first submission of the form will probably go through successfully, and there's very little you can do to block it. However, the spambot submissions I came across following this method all shared one thing in common: they came many hours or even days after the leader's submission.
To combat this, I added an encrypted timestamp to a hidden form field. When reading the form submission, I decrypt the timestamp and compare it to the current time. If the submission does not occur within a reasonable timeframe (humans will almost certainly not submit a form of any complexity within 5 seconds of its creation, or after 6 hours), I throw it out the window. While not a perfect mechanism for blocking such spambots, this technique blocks the vast majority of incoming spam from this type of robot and thus reduces the return on investment for the spammers. From a spammer's perspective, paying a human leader to re-submit a given form once every 6 hours just to allow your spambots to continue spamming that form is hardly an effective investment. The encryption of the timestamp prevents advanced spambots from recognizing it and changing it to a more recent value.
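As a rough illustration of the timestamp check, here is a minimal Python sketch. Note that it signs the timestamp with an HMAC instead of encrypting it: signing is enough to stop a 'bot from forging a newer value, though the time itself stays readable. The names `SECRET_KEY`, `make_token`, and `token_is_acceptable` are my own assumptions for the example, and the 5-second and 6-hour limits are simply the ones mentioned above.

```python
import hashlib
import hmac
import time

# Assumption: a server-side secret; load it from configuration in practice.
SECRET_KEY = b"replace-with-a-long-random-secret"
MIN_AGE_SECONDS = 5            # faster than this is unlikely to be a human
MAX_AGE_SECONDS = 6 * 60 * 60  # older than this is treated as a stale replay


def _sign(issued):
    """HMAC-SHA256 of the timestamp string, as a hex digest."""
    return hmac.new(SECRET_KEY, issued.encode(), hashlib.sha256).hexdigest()


def make_token(now=None):
    """Return 'timestamp:signature' for embedding in a hidden form field."""
    issued = str(int(time.time() if now is None else now))
    return f"{issued}:{_sign(issued)}"


def token_is_acceptable(token, now=None):
    """Verify the signature, then check the form's age against the window."""
    issued, sep, signature = token.partition(":")
    if not sep:
        return False
    try:
        issued_at = int(issued)
    except ValueError:
        return False
    # Constant-time comparison; a mismatch means a forged or altered timestamp.
    if not hmac.compare_digest(signature.encode(), _sign(issued).encode()):
        return False
    age = (time.time() if now is None else now) - issued_at
    return MIN_AGE_SECONDS <= age <= MAX_AGE_SECONDS
```

The output of `make_token()` goes into the hidden field when the form is rendered, and `token_is_acceptable()` runs on the submitted value before anything else is processed.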
Blocking heuristic spambots
These 'bots are actually fairly easy to block, for the most part. By inserting a normal text input form field or textarea with a particularly juicy name, such as "comment" or "url", and using CSS to hide it from users, we can create an effective honeypot. Heuristic 'bots will see the form field in the HTML source, fill it in, and submit the form. If the form field gets filled in, you know you're dealing with a spambot. For extra security, keep the CSS rule in an external stylesheet, under an obscure class name. If you're feeling particularly paranoid, use a creative method of hiding the form field, such as a large negative margin, absolute positioning that places it off the edge of the page, or z-index layering that puts the field beneath the visible area. CSS is flexible enough to support a multitude of ways of hiding the field!
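The server-side half of the honeypot can be very small. The following Python sketch assumes the submitted fields arrive as a plain dictionary of strings. The field name `url`, the class name `fld-x7`, and the function `looks_like_bot` are hypothetical choices for illustration only. If the honeypot takes a name like "url" or "comment", the real field needs a different name.

```python
# Hypothetical honeypot field name; real users never see this field.
HONEYPOT_FIELD = "url"

# Markup sketch: "fld-x7" is an arbitrary class name, and the CSS rule that
# hides it is assumed to live in an external stylesheet.
HONEYPOT_HTML = '<input type="text" name="url" class="fld-x7" autocomplete="off">'


def looks_like_bot(form):
    """Return True when the hidden honeypot field came back non-empty."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())
```

A submission for which `looks_like_bot()` returns True can be dropped silently, so the 'bot gets no hint about which field gave it away.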
While this may sound complex, it takes very little actual programming to implement in most cases. In most forum systems, adding this type of security to the post functions is quick and easy, and implementing it on registrations is just as simple.
Moreover, this system uses no CAPTCHA. A CAPTCHA-less form has long been a dream of mine, since CAPTCHAs (especially the better, hard-to-read ones) present an annoying level of difficulty to users. The image CAPTCHA, too, is dropping rapidly in usefulness as improvements to OCR technology allow heuristic spambots to solve CAPTCHAs with success rates approaching that of humans.
Since discovering this system, I've implemented it on several other sites and forums under my control. In every case, the spam rates have instantly dropped to nearly 0. An occasional message may get through, but one message every week or month is tolerable, compared to the absolute deluge of spam received by an equivalent unprotected system.
So, what are your thoughts? What possible improvements could be made? Am I entirely off-base in my analysis of spammers' activities, or is this a real, plausible semi-solution to web form spamming?
| 3:35 pm on Sep 9, 2010 (gmt 0)|
I like the encrypted timestamp method. It sounds very easy to implement, and I think I will start adding it to my sites.
| 4:25 pm on Sep 9, 2010 (gmt 0)|
That's some great stuff, Wesley. Thanks for sharing.
| 5:55 am on Sep 10, 2010 (gmt 0)|
What have you done to look at the type of content they are hitting you with?
I'd take a closer look at that. Example: link drops account for 95% of the attacks I've seen. Take away their candy (i.e., kill anything with link drops) and you've beat half the battle.
Say no to CAPTCHA. I've never had to use one, not once. Well, once: a co-developer wanted to prove me wrong and started spamming from a China IP, convincing the client to use reCAPTCHA. Never saw it again, so I'm sure that was the source.
| 11:58 am on Sep 10, 2010 (gmt 0)|
With these methods implemented, I've never had to deal with content modifications. The systems in question were forums for the most part, and of the type that posting links was a necessity. I'm sure some sort of fast letter-by-letter Markov analysis could be done to determine the likelihood of certain character sequences, but that might end up catching people who misspell words as well as spammers.
| 9:11 pm on Jan 11, 2011 (gmt 0)|
Great post, thanks! What methodology did you use to establish the human-led bot behavior?
|I'd take a closer look at that. Example: link drops account for 95% of the attacks I've seen. Take away their candy (i.e., kill anything with link drops) and you've beat half the battle. |
My experience is removing the "candy" doesn't stop them trying and there's always registration SPAM to contend with...
| 10:10 pm on Jan 11, 2011 (gmt 0)|
Mostly manual log inspection. I noticed a pattern with one registration or submission being sent with apparently "dummy" values--usually random strings of characters--that provided no possible benefit to the spammer whatsoever. A few seconds or a few minutes later, I would start to see the same user agent/IP spidering pages, requesting several common URLs (such as, in the case of the forums, the "registered user" profile page).
In the case of the human leaders, the "user" then did nothing else. Though the request was (to all appearances) made by a human, I never saw that human again. No activity could be seen on their account, no contribution of any sort, just a garbled profile that seemed to have been written by someone's pet hamster playing on the keyboard (or perhaps a random string generator).
However, several hours or days later (the length of time varied), I would begin getting periodic submissions from different IP addresses using the exact same user agent, and not requesting content files (CSS, JS, images, etc). While the user agent string in this case isn't a perfect identifier (people also used the same user agent string to make valid, non-spam requests regularly), these frequent, rapid spam submissions using an identical user agent did not occur without an apparently "valid", but spamlike profile using the same user agent--and upon log inspection, the profile had been created by someone apparently using a regular browser.
There are several somewhat weak links there, but the "human leader" hypothesis is the only reasonable conclusion I could come up with to explain why a human would apparently create a garbage profile, followed several hours later by a large volume of spam submitted using the same user agent.
There are a couple of other CAPTCHA-less techniques I would propose to go along with those in my first post:
1. Unique form instance identification
Every time you generate a form (for instance user registration form X), assign it a unique ID. Store the particulars of the form--such as the time it was generated and whether or not the form instance has been submitted--in a database or file, and put the unique ID in a hidden form field. When the form is submitted, update the form's stored data by noting that it has been submitted. Compare the submission time to the generation time, as I suggested in the original post. Don't allow submissions from forms that were generated too long ago or have already been submitted.
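Here is a minimal Python sketch of that bookkeeping, under some loud assumptions: the in-memory dictionary `_forms` stands in for the database or file, and the function names and time limits are mine, chosen to match the window from the original post.

```python
import secrets
import time

MIN_AGE_SECONDS = 5
MAX_AGE_SECONDS = 6 * 60 * 60

# Hypothetical in-memory store; a real site would use a database or file.
_forms = {}  # form_id -> {"created": float, "submitted": bool}


def issue_form_id(now=None):
    """Create a unique ID for one rendered form and record when it was made."""
    form_id = secrets.token_urlsafe(16)
    _forms[form_id] = {
        "created": time.time() if now is None else now,
        "submitted": False,
    }
    return form_id


def accept_submission(form_id, now=None):
    """Accept only a known, unused form instance within the time window."""
    record = _forms.get(form_id)
    if record is None or record["submitted"]:
        return False  # unknown ID, or this form instance was already used
    age = (time.time() if now is None else now) - record["created"]
    if not MIN_AGE_SECONDS <= age <= MAX_AGE_SECONDS:
        return False  # submitted too quickly or from a stale form
    record["submitted"] = True
    return True
```

In this sketch a form is only marked as submitted once it is accepted, so a real user who is rejected for being too quick can simply try again. Marking it as used on every attempt would be the stricter choice.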
2. Randomly-named field with random value
This concept is mostly to easily detect and keep out simple bots that simply look for instances of a particularly well-known piece of software to spam, such as a phpBB registration form. Essentially, add a hidden field to the form that is named using a randomly-generated string and contains as its value another randomly-generated string. Store the name/value pair in a cookie (obfuscated in some way if desired, such as hashing, encrypting, or encoding it), and retrieve it when the form is submitted. Compare the name and value submitted with the form to the name and value stored in the cookie, and throw the submission out if they don't match what is expected.
This method may, however, cause problems for users with multiple tabs open on your site, particularly if they have two forms open at once. I'd put a definite YMMV on this one...
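For what it's worth, the random field idea might look roughly like this in Python. This is a sketch under assumptions, not a drop-in implementation: I have chosen to store a keyed hash (HMAC) of the name/value pair in the cookie, which is one of the obfuscation options mentioned above, and `SECRET_KEY`, `new_random_field`, and `submission_matches` are hypothetical names.

```python
import hashlib
import hmac
import secrets

# Assumption: a server-side secret; load it from configuration in practice.
SECRET_KEY = b"replace-with-a-long-random-secret"


def _digest(name, value):
    """Keyed hash of one name/value pair, used as the cookie contents."""
    return hmac.new(SECRET_KEY, f"{name}={value}".encode(), hashlib.sha256).hexdigest()


def new_random_field():
    """Return (field_name, field_value, cookie_value) for a fresh form."""
    name = "f_" + secrets.token_hex(8)
    value = secrets.token_hex(16)
    return name, value, _digest(name, value)


def submission_matches(form, cookie_value):
    """True if some submitted name/value pair hashes to the cookie value."""
    return any(
        hmac.compare_digest(_digest(name, value).encode(), cookie_value.encode())
        for name, value in form.items()
    )
```

The name and value go into a hidden field when the form is rendered, and the third item goes into the cookie. A 'bot that posts a canned set of field names without ever loading the form will have neither the field nor the cookie, which is the case this check is meant to catch.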
| 2:03 pm on Jan 12, 2011 (gmt 0)|
Text-based browsers like Lynx don't apply CSS, so you might catch legitimate registrations.
FYI this technique has been used for years. ;)
|1. Unique form instance identification |
Every time you generate a form (for instance user registration form X), assign it a unique ID. Store the particulars of the form--such as the time it was generated and whether or not the form instance has been submitted--in a database or file, and put the unique ID in a hidden form field. When the form is submitted, update the form's stored data by noting that it has been submitted. Compare the submission time to the generation time, as I suggested in the original post. Don't allow submissions from forms that were generated too long ago or have already been submitted.
Look into how phpBB handles this: each form is given a token, and the form becomes invalid after X amount of time, where X is set in the ACP.
| 6:35 pm on Jan 14, 2011 (gmt 0)|
|there's always registration SPAM to contend with... |
Unmoderated registration is an invitation to insanity.