|Spam Control with hashcash|
A better alternative to Captcha and Akismet?
| 4:34 pm on Sep 29, 2008 (gmt 0)|
The Drupal community finally got most of the modules I wanted updated to Drupal 6, so I made the big upgrade. Everything was great, except that none of the spam-control modules I use had been updated, and I soon found myself swimming in hundreds of spambot comment submissions every day.
I did some of the obvious:
- exclude comment submissions that have no referrer or any referrer other than my site
- exclude submissions without user agents.
- check IPs against a list of proxy servers.
No effect. I needed something more. There are a few more active approaches to spam control, all of which have some drawbacks.
So What Choices Do I Have?
* Akismet and Mollom run your comment submissions through a third party, evaluate them, and put them in a separate queue for you to approve or delete as you wish.
Pros:
- Generally speaking, Akismet is quite accurate, and I've heard that Mollom's performance is similar.
- Spam comments are eventually deleted automatically after a user-set expiration date.
Cons:
- Dependent on a third-party server.
- Sorts submissions after the fact, so you still get the comments, and the occasional false positive is almost certain to get deleted unless you manually scan through all your spam.
- If you let spam get auto-deleted, the visitor who submitted the comment never knows that his or her comments are being rejected, or why.
* The Drupal Spam module [drupal.org] does an admirable job of identifying spam submissions and putting them in a separate queue. I would say it works almost as well as Akismet, but it runs on your own server and you control it fully. So aside from the dependence on a third-party server, it has most of the same pros and cons as Akismet.
* CAPTCHA and reCAPTCHA. We've recently had some extended discussions about these (automated CAPTCHA attacks [webmasterworld.com]; baked jake's CAPTCHA rant [webmasterworld.com]; CAPTCHA-cracking in India [webmasterworld.com]). Personally, I'm not a fan. I have good eyes. Not as young as they once were, but nevertheless quite good. Professionally, I make my living as a historian and paleographer and am considered a top expert in reading illegible handwriting from the sixteenth century. And yet I would say that at least 25% of the time, I fail to solve a text-image CAPTCHA. I can't imagine how hard these are for people with bad vision. And though I love the idea of reCAPTCHA (digitizing books with distributed labor), my tests resulted in lengthy delays waiting for the reCAPTCHA server to respond. I've heard it's better now, but I'm too damn lazy to monitor my own sites as I should, let alone third-party servers that my sites depend on.
* Hashcash. Hashcash depends on the "proof of work" concept to verify that a human is submitting a form or sending an email. There are many potential proofs of work, but generally this approach gets its name because it uses [tech alert] hashed values of known data. A hash is basically the result of an algorithm that takes an object (text string, file, whatever) and manipulates it to generate a fixed-length number (usually written in hexadecimal). This number is not necessarily unique, but it is very difficult to find a "collision" (two inputs with the same hash), so hashes are often used to verify that a file has not been corrupted or modified (see [hashcash.org...] for more information).
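To make the proof-of-work idea concrete, here is a minimal sketch in Python. It is not how the Drupal or WordPress hashcash modules are implemented (those do the work in the visitor's browser with JavaScript). The function names, the `challenge:nonce` string layout, and the choice of SHA-1 with a leading-zero-bits target are all assumptions made for illustration. What it shows is the asymmetry: the submitter has to try many hashes, while the server checks the answer with a single hash.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # Count the leading zeros inside the first non-zero byte, then stop.
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty: int) -> int:
    """Submitter side: search for a nonce whose hash meets the difficulty."""
    for nonce in count():
        digest = hashlib.sha1(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the submitted nonce."""
    digest = hashlib.sha1(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


# Hypothetical challenge string; a real form would embed its own value.
nonce = solve("form-token-123", 12)
print(verify("form-token-123", nonce, 12))
```

With `difficulty=12` the search needs about 4,096 attempts on average, and each extra bit doubles the work. A site would presumably hand out a fresh challenge with each form so that a solved nonce cannot simply be replayed.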
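Hashcash depends on JavaScript running in the visitor's browser, so you have to decide what to do with users who have JS turned off. Some options:
- just send an error message and ask them to turn on their JS.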
- degrade to a CAPTCHA that the user must solve
- put those comments into an approval queue and run them through Akismet.
- [your idea here]
Hashcash is available for WordPress [wordpress.org] for all recent versions.
The Drupal hashcash module [drupal.org] has not been upgraded to Drupal 6, but I generated and uploaded to drupal.org a Drupal 6 version [drupal.org] (page down to the September 28 version - hashcash-6.x-1.4alpha.zip, not the Sept. 22 version labelled hashcash.zip, which is completely fubarred). It's certainly "alpha" but it seems to be working for me. For three days now, I have logged in to find *no* automated comment submissions in my approval queue.
What do you think?
How do you do it on your site?
Is blocking users with JS off too high a price to pay?
Are you willing to put up with the usability issues with CAPTCHA?
Do you have a better method altogether?
| 1:59 pm on Oct 1, 2008 (gmt 0)|
I am using Akismet and haven't had a problem with deleted legitimate comments so far. Hashcash sounds interesting. I am concerned about JS being turned off and whether someone will turn it on to make a comment. They would have to be pretty motivated to make a comment. What about cell phones? I have a couple of sites where visitors use cell phones to make comments. Will hashcash work for them?
As for CAPTCHA, I will never use it. It's too much work for the user and puts them off, and there are the reasons you gave about readability.
| 4:06 pm on Oct 1, 2008 (gmt 0)|
|I am concerned about JS being turned off and if someone will turn it on to make a comment. |
That's the big drawback to hashcash. Since I was drowning in spam, I was willing to pay that price.
The cell phone question is an interesting one. How many people have non-JS-enabled browsers on their cell phones? Do you have any idea?
The best stats I could find say that about half of mobile visitors have JS-enabled browsers (that's from Oct 2007). So I guess it depends on how many you get. I don't get many, but the number is growing. It's growing, however, because of the iPhone and others with powerful browsers.
I suppose it depends on your user profile and how hard you're being hit by bots. If you don't have much spam and you have a lot of users without JS, then Akismet or similar is probably less work in the end.
One other option that I've seen people use is a hidden form field. If the user has a CSS-enabled browser, the field doesn't show. If CSS isn't applied and the field does show, its label says something like "Do not fill in this field unless you are a spammer". If the form is submitted with a value in that field, it gets treated as a bot submission. I have no idea if it works or not. Do spambots automatically fill in every field on a form? Not sure they do.
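For what it's worth, the server-side half of the hidden-field trick is tiny. This is a minimal sketch in Python; the field name `website_url` and the function name are hypothetical, and it assumes the submitted form arrives as a dict of strings. Whether bots actually fill the field in is exactly the open question above, so this only shows the check, not its effectiveness.

```python
# Hypothetical name for the field that CSS hides from human visitors.
HONEYPOT_FIELD = "website_url"


def looks_like_bot(form: dict) -> bool:
    """Treat any non-blank value in the hidden field as a bot submission."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())


print(looks_like_bot({"comment": "Nice post", "website_url": ""}))
print(looks_like_bot({"comment": "Buy now", "website_url": "http://example.com"}))
```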
I agree Akismet is pretty good and I still have it as a second layer of defense on one site, but the spam queue is largely empty after installing hashcash. My issue is that there are occasional false positives that get flagged for review. If you don't have a lot of spam, you can just review these and approve them. For me there are two issues, though:
- comments that get flagged as spam or flagged for review sit in the queue, and if you aren't checking your queue frequently, users might wonder where their comment is.
- if you have tons of spam, it's just too time-consuming to go through your spam logs every time, so I tend to just "delete all". The other day I was on a slow dialup and purged my spam queue (Drupal Spam module, not Akismet, but it also rarely has false positives). Given the slow connection, while I was waiting for the system to respond after pushing the "delete all" button, I was looking at the screen and noticed a legitimate comment, and by then it was too late. Who knows how many of those I've deleted? Probably not many. In this case, I had time to make a mental note of the subject and sender, and that person had also sent an email through my contact form, so all was good. But I just find it onerous to check my spam logs.
| 4:08 am on Oct 2, 2008 (gmt 0)|
I need to try that out. I still need Mollom or Akismet to cancel out the manual blog spammers who post links to BS. I have tried Mollom but have a complaint: when it blocks something, there is no way to go back and tell it that it's wrong. The comment is just gone. There needs to be a queue for blocked spam that I can scan to undo any false positives.
| 7:32 am on Oct 2, 2008 (gmt 0)|
I simply randomize the input field names and store them in sessions.
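// When rendering the form: generate a random field name and keep it in the session
session_start();
$_SESSION['comment'] = sha1(uniqid(rand(), true));
echo '<input type="text" name="'.$_SESSION['comment'].'">';
and then, on the page that handles the form (again after session_start()), retrieve them like this: $comment = $_POST[$_SESSION['comment']];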
Since this technique is not used very often by others, it works for me. Of course it only works against bots and not human spammers. All in all, I prefer individual ways to combat spam bots, since nobody will develop a bot solution just for my website. Off-the-shelf solutions, on the other hand, are always a target for spammers since they are widely used.
| 8:02 am on Oct 2, 2008 (gmt 0)|
All the other junk (Akismet, etc.) is overly complicated, prone to false positives, just horrific.
This type of solution has to be randomized and obfuscated so the spam scripts can't detect it and create a signature; it isn't hard to make it virtually untraceable.
[edited by: incrediBILL at 8:03 am (utc) on Oct. 2, 2008]
| 8:43 am on Oct 2, 2008 (gmt 0)|
How do you do it on your site? - a mixture of blacklists, preview-required, adaptive filtering (delaying people who are trying to post too fast), external testing (akismet-like, but I'll look at the one mentioned above) and a pre-moderation cooperative (a number of webmasters who do bulk moderation across all our sites). I also make our anti-spam policy public on most sites, which deters spammers - if you don't mention it, you look clueless (=attractive, or at least worth a test spamming).
Are you willing to put up with the usability issues with CAPTCHA? - No. CAPTCHA doesn't test for spam itself, so it's useless against human or semi-human spamming. Even its inventors say that the usability problems are unsolved. Why put resources into annoying your users *and* failing to test for spam?
Do you have a better method altogether? - see above. The field name randomising is pretty useful too. [added note: it usually works in PHP even without cookie support, because PHP will add a session ID to the URL query string if you use PHP-based link functions]
Hope that helps.
| 10:53 am on Oct 2, 2008 (gmt 0)|
Very interesting post which I shall chew over at my leisure and perhaps learn something. For now, I'll describe my methods of keeping the spam down.
First of all, I check the comment against a list of blocked IP addresses that I maintain. These include IP ranges (occasionally one has to block AOL until the children get bored and move on to somewhere else). I keep the list small and after a while lift the ban. That's all automated and for me is a single-click affair.
Okay, if they're allowed to post, then the post next gets run through a filter. I have a strict policy: no swearing, no SMS text style writing, no all capitals, etc. I also don't hyperlink web addresses (perhaps that's the most successful thing of all!). The filter also makes sure the post is within a minimum and maximum length. Two words isn't really worthwhile in my opinion, and wafflers need to learn to get to the point. I also force the use of the capital "I" in the sentence. I know this all sounds a bit tedious, but it separates those who have something worthwhile to say from those who are lazy and just there to abuse the forum.
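The two checks described so far (block list, then content filter) could be sketched roughly as follows in Python. This is an illustration under assumptions, not the poster's actual tool: the addresses are documentation-range placeholders, the length limits are made up, and the swearing and SMS-style checks are left out because they would need word lists the post doesn't give. The lowercase "i" test is deliberately crude and would also trip on things like "i.e.".

```python
import ipaddress
import re

# Hypothetical block list: single addresses and whole ranges in CIDR form.
BLOCKED = [ipaddress.ip_network(n) for n in ("203.0.113.7/32", "198.51.100.0/24")]
MIN_LEN, MAX_LEN = 20, 2000  # assumed limits, in characters


def ip_is_blocked(addr: str) -> bool:
    """True if the address falls inside any blocked address or range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED)


def rejection_reason(text: str):
    """Return why a post fails the filter, or None if it passes."""
    body = text.strip()
    if not MIN_LEN <= len(body) <= MAX_LEN:
        return "length"
    letters = [c for c in body if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "all capitals"
    if re.search(r"\bi\b", body):  # a standalone lowercase "i"
        return "lowercase 'i'"
    return None


print(ip_is_blocked("198.51.100.42"))
print(rejection_reason("well i think this is a fine post"))
```

Returning a reason rather than a bare yes/no makes it easy to tell the poster what to fix, which fits the idea of nudging people toward better posts rather than silently dropping them.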
If it passes the IP check and the filter then it gets posted, but it doesn't end there. I moderate using my own purpose built tool that quickly shows me all the comments and allows me to kill posts with a single click, pull out everything from a single IP and then delete and ban.
Perhaps that last bit sounds like too much work, but I spend ten to fifteen minutes a day moderating, some of that actually reading the stuff anyway. I guess if things get busier I may have to look into other methods (CAPTCHAs etc.), but for now it works and isn't too hard to look after...
| 12:53 pm on Oct 2, 2008 (gmt 0)|
Akismet has a good record with few to no false positives, but it also doesn't catch everything.
The most effective "captcha" I have ever seen is simple math questions.
| 2:02 pm on Oct 2, 2008 (gmt 0)|
I use trivia questions. The advantage is it's readable and effective, but again you are asking people to do something extra, which may put off some.
Avoid maths questions like the plague, however: they're becoming too common. It's important that we all vary our methods.
| 1:23 pm on Oct 3, 2008 (gmt 0)|
Third-party checkers do have some false positives, but most implementations (like the WordPress Typepad plugin) put them in a bucket you can check and retrieve non-spam ones from, which is less work than deleting all the spam yourself.
| 8:06 pm on Oct 4, 2008 (gmt 0)|
>> then without warning