Recently, I've been doing some research into the behavior of automated spambots, tracking how they register and use accounts on forum software and other "easy" targets that I have access to. I've noticed my CAPTCHA implementations becoming increasingly ineffective, and the need arose for a new type of protection.
As a result of my investigations, I've noticed two primary patterns and one secondary pattern that can, in most cases, be tracked and blocked. I decided to implement an anti-spam mechanism on a website I run that has been particularly hard-hit by spambots, and the amount of spam received immediately dropped to almost zero. While it's too much to hope that no spam whatsoever will get through, I haven't received a single spam message on this first site in the months since implementing the anti-spam mechanisms. Even better, the methods I chose are fairly fast, easy to build, and can be generalized to work with almost every conceivable platform and programming language. Needless to say, I was delighted with this discovery!
The spambots I tracked fell into one of two categories almost without exception: I have termed these categories human-led and heuristic throughout this post.
Human-led spambots
This classification of spambot relies on a human "leader" to fill in a given form for the 'bot, filling in each form field with a unique value. The human tracks and saves what POST/GET data is sent to the server, and tracks where specific values may appear. An automated robot can then mimic the human's form submissions, with desired values being changed to spam messages in the saved POST/GET data. Forms are often located via a spider disguised with a real user agent, and the spambot usually mimics the human's user agent as well. Humans are often hired to do this from low-wage areas or through some sort of mechanical turk project, as with the right software almost no training is required for the human.
A subclass of this type of 'bot is entirely automated, using a heuristic engine to discover form fields, submit bogus data, discover if the sent data is displayed on the site and where, and save this information for repeated automated submissions later in the same method as described above.
Heuristic spambots
These 'bots use a parsing engine to request a page, find form fields on it, and fill them in, then submit the form. Since most forum software and many comment systems, especially on popular CMS software, behave similarly and use similar form field names, guessing which data is likely to be displayed on the page is relatively trivial.
Blocking automated submissions
Since there are two main categories of spambots that hit these targets, the anti-spam method I implemented had two corresponding parts.
Blocking human-led spambots
These are the hardest 'bots to block reliably. Due to the human leader, the first submission of the form will probably go through successfully, and there's very little you can do to block it. However, the spambot submissions I came across following this method all shared one thing in common: they came many hours or even days after the leader's submission.
To combat this, I added an encrypted timestamp into a hidden form field. When reading the form submission, I decrypt the timestamp and compare it to the current time. If the submission does not occur within a reasonable timeframe (humans will almost certainly not submit a form of any complexity within 5 seconds of its creation, or after 6 hours), I throw it out the window. While not a perfect mechanism for blocking such spambots, this technique blocks the vast majority of incoming spam from this type of robot and thus reduces the return on investment for the spammers. From a spammer's perspective, paying a human leader to re-submit a given form once every 6 hours just to allow your spambots to continue spamming that form is hardly an effective investment. The encryption of the timestamp prevents advanced spambots from recognizing and changing it to a more recent value.
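As a rough sketch of the idea, here's one way the tamper-proof timestamp could look. This example uses an HMAC signature rather than encryption (the value stays readable, but a 'bot can't forge a fresher one without the secret key); the key, field names, and the 5-second/6-hour window are the assumptions described above, not a specific implementation from any particular forum package.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-long-random-secret"  # hypothetical server-side secret


def make_token(now=None):
    """Create a signed timestamp to embed in a hidden form field."""
    ts = str(int(now if now is not None else time.time()))
    sig = hmac.new(SECRET_KEY, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"


def token_is_valid(token, now=None, min_age=5, max_age=6 * 3600):
    """Reject submissions that are too fast, too old, or tampered with."""
    try:
        ts, sig = token.split(":")
        expected = hmac.new(SECRET_KEY, ts.encode(), hashlib.sha256).hexdigest()
        # Constant-time comparison: a forged or altered timestamp fails here.
        if not hmac.compare_digest(sig, expected):
            return False
        age = (now if now is not None else time.time()) - int(ts)
        return min_age <= age <= max_age
    except (ValueError, TypeError):
        return False
```

On form render, `make_token()` goes into the hidden field; on submission, a `token_is_valid()` failure means the post is discarded.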
Blocking heuristic spambots
These 'bots are actually fairly easy to block, for the most part. By inserting a normal text input form field or textarea with a particularly juicy name, such as "comment" or "url", and using CSS to hide it from users, we can create an effective honeypot. Heuristic 'bots will see the form field in the HTML source, fill it in, and submit the form. If the form field gets filled in, you know you're dealing with a spambot. For extra security, keep the CSS rule in an external stylesheet, with an obscure class name. If you're feeling particularly paranoid, use a creative method of hiding the form field, such as a large negative margin or absolute positioning to position it off the edge of the page, or using z-index layers to put the field beneath the visible area. CSS is flexible enough to support a multitude of ways of hiding the field!
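The server-side half of the honeypot is tiny. The sketch below assumes a hidden field named "url" (the field name, class name, and CSS rule shown in the comment are illustrative choices, not fixed values):

```python
# Name of the decoy field. The markup might look like:
#   <input type="text" name="url" class="xq7" autocomplete="off">
# hidden via an external stylesheet, e.g.:
#   .xq7 { position: absolute; left: -9999px; }
HONEYPOT_FIELD = "url"


def is_spambot(form_data):
    """A real user never sees the hidden field, so any non-blank
    value in it marks the submission as a heuristic 'bot."""
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())
```

Submissions where `is_spambot()` returns True can be silently dropped, which avoids giving the 'bot any signal that it was caught.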
While this may sound complex, it takes very little actual programming to implement in most cases. In most forum systems, adding this type of security to the post functions is quick and easy, and implementing it on registrations is just as simple.
Moreover, this system uses no CAPTCHA. A CAPTCHA-less form has long been a dream of mine, since CAPTCHAs (especially the better, hard-to-read ones) present an annoying level of difficulty to users. The image CAPTCHA, too, is rapidly losing its usefulness as improvements in OCR technology allow heuristic spambots to solve CAPTCHAs with success rates approaching those of humans.
Since discovering this system, I've implemented it on several other sites and forums under my control. In every case, the spam rates have instantly dropped to nearly 0. An occasional message may get through, but one message every week or month is tolerable, compared to the absolute deluge of spam received by an equivalent unprotected system.
So, what are your thoughts? What possible improvements could be made? Am I entirely off-base in my analysis of spammers' activities, or is this a real, plausible semi-solution to web form spamming?