How to Combat Blog Comment Spam
Why do people spam blog comments? Primarily to generate links to sites they own, in an effort to raise those sites' search engine rankings. Unfortunately, this can be a management nightmare for bloggers, littering their great content with hundreds of irrelevant comments for the latest and greatest "enhancement" drugs, online casinos, and girls girls girls. So what's a blogger to do? Here is a summary of what Technology Evangelist has done so far to keep things under control, along with a request for feedback on how we can keep spam to a minimum without making our site unusable.
First, the easiest way for us to prevent spam on the site is to moderate every comment. We've taken this approach from time to time, but prefer to avoid it because it takes too much of our time and takes away the satisfaction of seeing a comment immediately go live. With that option tabled, we're currently using four tactics:
- nofollow. Tells search engines not to give link credit to links appearing in trackbacks and comments. Smart spammers look for this and skip sites with nofollow since they won't get any link juice from their efforts.
- SpamLookup Lookups. Moderates comments submitted from known anonymous proxies (and is likely responsible for some of the lag in posting new comments).
- SpamLookup Links. Sets a limit on how many links can appear in a comment before it's moderated or junked. We welcome links in comments that are relevant to the topic, but review comments with a certain number of links before they're published. This keeps spammy lists-of-links comments from getting published.
- SpamLookup Keyword Filter. Moderates or junks comments containing keywords we designate, mostly porn-, pill-, and casino-related terms. Some home financing terms have made their way onto the list. It takes a while to build an effective spam keyword list. We moderate terms like ringtones and prequalify that could be used in non-spam comments. The auto-junk list includes a large and constantly growing set of pharmaceutical names that would never be mentioned in a legitimate comment on this site.
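As a rough illustration of how the link limit and the two keyword lists described above might combine, here is a minimal Python sketch. It is not SpamLookup's code; the threshold, the word lists, and the function name `triage_comment` are hypothetical values picked for the example.

```python
import re

# Hypothetical settings, chosen only for illustration.
MAX_LINKS = 3                          # more links than this sends a comment to moderation
MODERATE_KEYWORDS = {"ringtones"}      # terms that might appear in legitimate comments
JUNK_KEYWORDS = {"viagra", "cialis"}   # terms assumed never to be legitimate on this site

LINK_PATTERN = re.compile(r"https?://", re.IGNORECASE)
WORD_PATTERN = re.compile(r"[a-z0-9']+")


def triage_comment(text: str) -> str:
    """Return 'junk', 'moderate', or 'publish' for a comment body."""
    words = set(WORD_PATTERN.findall(text.lower()))
    if words & JUNK_KEYWORDS:
        return "junk"
    if words & MODERATE_KEYWORDS:
        return "moderate"
    if len(LINK_PATTERN.findall(text)) > MAX_LINKS:
        return "moderate"
    return "publish"
```

Checking the junk list first means a comment that trips both lists is discarded rather than queued for review, which keeps the moderation queue short.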
Even with those four tools in place, some spam slips through. At this point, we manage to catch 96% of the spam that hits our site, but enough gets through to be a pain. For example, we would like to publish a "recent comments" feature on the site, but need to raise the spam prevention bar higher before doing so. Here are a few things we're considering to turn spam prevention up another notch:
- Close comments after a certain period of time. We don't like this one because we want to hear from people who happen to stumble across things written in the past. However, it can be very effective, because smart spammers prefer to target archives, where their comments are less likely to be noticed.
- Add a CAPTCHA. Short for Completely Automated Public Turing Test to Tell Computers and Humans Apart, CAPTCHAs are the annoying sets of characters you're occasionally forced to type in while filling out a form. They're tough, though not impossible, for spammers to figure out, and would certainly help. Of course, they would also annoy all of our legitimate visitors.
- Add some other challenge. Requiring something short of a CAPTCHA may be just as effective because it makes the comment form non-standard. Most spam we see is done by robots that seek out and post to sites they determine to be blogs. If we add one additional required field with a question like "what is 1+2?" we can stop many robots in their tracks.
- Email confirmation of posts. Would you like to receive an email after every comment you leave, forcing you to click to confirm your post before it goes live? We wouldn't either. But it works to cut down on spam.
- Use TypeKey. TypeKey is a service owned by Six Apart, the owners of the Movable Type, TypePad and LiveJournal blogging platforms. The service allows people to create a trusted TypeKey account that lets them post to blogs within the Six Apart network with preapproved status. It basically allows you to do the previous item, email confirmation, once. Some sites require TypeKey in order to post. We think that's too high a hurdle, and will cause many interesting thoughts to stay bottled up rather than shared in the comments, so we're not going there. However, one hybrid strategy to consider is automatic approval of TypeKey users' comments with moderation of everything else.
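Two of the options above, closing comments on older entries and adding one extra required field, are simple enough to sketch. The Python below is only an illustration under assumed settings: the 30-day window, the form field name `challenge`, and the function name `accept_comment` are hypothetical and not part of any blogging platform.

```python
from datetime import datetime, timedelta

# Hypothetical settings for illustration only.
CHALLENGE_ANSWER = "3"                   # expected answer to "what is 1+2?"
COMMENTS_OPEN_FOR = timedelta(days=30)   # assumed window before comments close


def accept_comment(form: dict, posted_on: datetime, now: datetime) -> bool:
    """Reject submissions to closed entries or without the correct challenge answer."""
    if now - posted_on > COMMENTS_OPEN_FOR:
        return False
    answer = form.get("challenge", "").strip()
    return answer == CHALLENGE_ANSWER
```

A robot that posts a standard comment payload never fills in the extra field, so it fails the second check even though it never saw the question.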
As a reader of this blog, what forms of spam control do you find acceptable? What hoops are you willing to jump through in order to post? And what hoops will make you think twice about contributing a comment? Share your thoughts in the comments below while you easily can.
2. Posted by: brendan on March 10, 2006 7:49 PM:
Ed,
rel=nofollow is not a spam deterrent. It's designed to stop artificial manipulation of Google's page-rank system.
After both having it on and now off, there is no difference.
Spammers just don't care - if they can get their comment to stick that's all they want. Whether or not it gets indexed is just a bonus.
Stopping spam from showing up at all is the single best method. You'd be genuinely surprised how many blogs and sites I've read that use rel=nofollow yet are choked to the gills with spam.
nofollow breaks inter-blog linkage through comments (which helps everyone gain better on-line exposure), please please stop spreading FUD. :)
I agree with most of the spam counter-measures, however CAPTCHA based systems are not user friendly and prone to numerous problems. Most spam scripts do not even use the comment form, they frequently call the comment template directly, so CAPTCHA is actually a pointless exercise that will annoy the heck out of most respondents. :)
A good spam script will stop 90% of spam comments, with manual intervention for the false-positives or ones that slip through. A little effort keeping the site spam free will do more to cause a spammer to move on, than anything else.
Sites that constantly fail to register spam, will be considered a waste of time and resources, skipped over instead in favour of softer targets.
..all of which, without the need for rel=nofollow and CAPTCHA.
3. Posted by: Ed Kohler on March 10, 2006 8:00 PM:
Karl, thanks for the great comments. I've picked up some ideas that will definitely be integrated into this site. Analyzing the difference between human behavior and robots really seems to be key to solving this issue.
4. Posted by: Ed Kohler on March 10, 2006 8:14 PM:
Brendan, thanks for the great insight. Yes, nofollow is designed to prevent link juice, which is why I was under the impression that spammers passed by sites using this form of link condom. Kind of like burglars will simply move on to the next home if you have a barking dog or alarm system. Our experiences fighting spam have taught us that nofollow is by no means a magic bullet. We haven't toggled it to test it, but it sounds like you've determined that it's fairly worthless. Makes sense.
I've had some bad experiences with CAPTCHAs on other blogs in the past, and can't stand them on Blogger blogs and other places where I encounter them. Spambots hitting the comments script directly is news to me. Would adding an additional required field to the script squash this issue?
We're catching over 90% of spam now. Closer to 96% with hardly any false positives. The most effective single tactic seems to be the keyword filter. I need to turn the screw a bit tighter without making things a pain for users.
5. Posted by: brendan on March 10, 2006 9:26 PM:
"Would adding an additional required field to the script squash this issue?"
A hidden element that the comment engine looks for stops most spam scripts on the spot. The scripts in common use either attempt to 'parse' the comment form or directly query the comment engine to build a comment template to allow for mass spam - embedding content that isn't machine readable halts such nasty behaviour.
WordPress has a number of plugins that generate a 'hash' that is then passed to the comment engine - no valid hash, no valid comment. Recently the WordPress plugin Akismet has been making waves, as it uses a combination of 'learning', IP filtering and keywords to block suspect spam (and is getting very good at the job).
I believe TXP, TypePad, etc. have similar modules, code, documented hacks or plugins that achieve the same result. Making it extremely difficult for spammers to get a comment to stick is the single best method.
As you've possibly discovered, rel=nofollow isn't a magic bullet and indeed hurts incoming (and outgoing) links. Look at your incoming links, many will be due to comments you have left elsewhere, that readers have followed - just think how many more there would be if most/all of your comments had been indexed (at least in part).
rel=nofollow is a brute force indiscriminate method to stop all indexing, be it valid commentary, or spam and is imho the single biggest cause of much inter-blog linking failing.
6. Posted by: Ed Kohler on March 10, 2006 10:08 PM:
Great analysis, Brendan. It looks like we have a lot of opportunities for taking our spam control to another level. We'll definitely revisit the nofollow issue. As I see it today, links from comments should be valued differently than links within the body of a post, since one represents a 3rd party endorsement while the other potentially represents a self-referencing link. If nofollow is applied only to comments, the conversation on the blog is not inhibited, but the blog manages to avoid endorsing sites (in the form of link juice) it may not be comfortable endorsing.
7. Posted by: humancamcorder on March 10, 2006 10:39 PM:
Comment spam is one of our greatest headaches! It not only uses up my time in moderating, but is also the most common cause of our WordPress installation failing with the common "WordPress database error".
I simply looked at where the spam was coming from and added the IP block to my spam filter. In addition, we also used the hash function found in WordPress. But things are still the same. We just slowed them down, but the spam keeps on coming! Especially when some of our blogs are PR rated.
I am grateful for the post by Brendan on Akismet. I will try that out as well. With your permission, Mr. Kohler, I will also post about this topic on my blog.
8. Posted by: Ed Kohler on March 10, 2006 10:55 PM:
Please do, humancamcorder, and keep us posted on any new comment spam prevention strategies you stumble upon.
9. Posted by: Mike Sansone on March 11, 2006 7:14 AM:
Ed - Another great post!
As a user, I also think that many CAPTCHAs are too creative and hard on my eyes (must be an age thing). I don't mind the "other simple challenges," and this is probably what I will implement at some point. The fewer hoops the better.
As a publisher, I'll periodically delete the spam I find. Most of the time, the spammers seem to target older postings. I consider managing spam "important," but not "urgent."
10. Posted by: Ed Kohler on March 11, 2006 9:08 AM:
Thanks Mike. Personally, I consider comment spam to be a form of web-based graffiti. Our stats show that the majority of our site's traffic first enters our site through archive pages, so spam in our archives is just as unacceptable.
11. Posted by: Mels Lenstra on March 11, 2006 10:16 AM:
Would it be an idea to take a 200x200 white (or noise, or whatever) image and have the script generate a "button" on it, then outputting the whole image as a ? This would be a rather nice single click Turing test - all the user has to do is click within the boundaries of the generated button and the post is OK. If he clicks outside of it, the form is submitted but the script will see that the X and/or Y values of the point clicked were out-of-bounds and will tell the user (/spammer) to click ON the button.
12. Posted by: Mels Lenstra on March 11, 2006 10:17 AM:
Woops, the HTML in my previous comment was deleted. It was an input tag with attribute type set to image ;)
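For readers curious how the server side of such a click test might look, here is a minimal Python sketch. It assumes the script remembers where it drew the button and receives the click coordinates from the image input; the button size and the function names `place_button` and `click_hits_button` are hypothetical.

```python
import random

IMAGE_SIZE = 200  # the 200x200 image suggested in the comment above


def place_button(width=60, height=24, rng=random):
    """Pick a random position for the button so it fits inside the image."""
    left = rng.randint(0, IMAGE_SIZE - width)
    top = rng.randint(0, IMAGE_SIZE - height)
    return (left, top, width, height)


def click_hits_button(x, y, button):
    """Return True when the clicked point lies within the button rectangle."""
    left, top, width, height = button
    return left <= x < left + width and top <= y < top + height
```

The script would store the rectangle returned by `place_button` with the visitor's session and call `click_hits_button` when the form comes back, asking the visitor to click again on a miss.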
13. Posted by: Ed Kohler on March 11, 2006 11:08 AM:
Mels, that is an interesting concept. It probably would be a high enough bar to slow down spambots.
14. Posted by: john t unger on April 17, 2006 4:32 AM:
If you blog using TypePad, you may have been frustrated at the inability to turn off comments site-wide on older posts to prevent spam. I've posted a tutorial on how to do this using a hack for advanced templates (requires a pro account). There's also a link to instructions for how to do this in Basic or Plus accounts. Read it here
15. Posted by: John C on July 11, 2006 8:23 PM:
Hi Ed. I stumbled onto your blog while looking around to see what other people are doing to combat guestbook spam.
As much as I like what Karl Hahn has done, it's more work than I'm willing to put forth. My guestbook is part of a narrowly targeted site, with other avenues of communication that require registration.
I've taken a very aggressive approach since my site topic should be of very little legitimate interest from offending countries. Using htaccess, I simply deny to entire IP blocks from offending countries.
16. Posted by: Everything on February 2, 2007 2:11 AM:
Thanks !
17. Posted by: Simpson on October 4, 2007 1:20 AM:
I feel all 4 of those solutions are indeed very useful, but I don't think they will eliminate spam totally. They may decrease it for a while, but we have many examples that suggest the amount of spam is not really decreasing. Just look at what this article on Comment Spam shows...
18. Posted by: Discussion on January 29, 2008 3:02 PM:
Mels, that is an interesting concept. It probably would be a high enough bar to slow down spambots.
1. Posted by: Karl Hahn on March 10, 2006 5:57 PM:
Here are some measures I've taken on my guestbook page that are invisible to legitimate users, but in combination have been 100 percent effective against spammers.
1) If a spam bot attempts to post and does not present the guestbook URL as its referrer, the posting fails. Since legitimate users always post from the guestbook URL, their referrer is always correct.
2) The posting requires that the user display the guestbook, then use post-form link, then go through the preview step before posting. How do I ensure that the user has done all that? By the use of hidden hash codes. The link to the post-form in the guestbook contains a hidden code that is required in order to reach the post-form. This code automatically changes every day. The post-form itself contains another hidden code that is required in order to get to the preview page. This code is a function of the user's IP and the date. Finally the preview page contains another hidden code that is required for posting. It contains a timestamp and a hash of the timestamp. The timestamp is considered stale if it's more than 30 minutes old. This pair of codes must be correct in order to post. Legitimate users never see any of these codes. But spam-posting bots are unlikely to go through as much detail as is required here, and as a result they never have the proper code for posting.
3) When the user posts, his IP is checked for presence in a public zombie block listing service and rejected if it tests positive. The post itself is scanned for URLs. These are each resolved to IP addresses and checked against the SBL and SPEWS databases as well as my own private database of past abusers. The post is rejected if there is a positive on any of them.
4) The full contents of any rejected post is logged to a file that only I have access to, which helps me maintain my private database.
To contact me on this subject, go to www.karlscalculus.org and email me through the email webform.
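The timestamp-plus-hash code from step 2 could be sketched in Python roughly as follows. This is not Karl's code: using a keyed HMAC with a server-side secret is an assumption made here so that a bot cannot compute the hash itself, and the names `SECRET_KEY`, `make_token`, and `token_is_valid` are hypothetical. Only the 30-minute staleness window comes from the comment.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"    # hypothetical server-side secret, never sent to the browser
MAX_AGE_SECONDS = 30 * 60    # the 30-minute staleness window described in step 2


def make_token(timestamp: int) -> str:
    """Build the hidden value embedded in the preview page: timestamp plus keyed hash."""
    digest = hmac.new(SECRET_KEY, str(timestamp).encode(), hashlib.sha256).hexdigest()
    return f"{timestamp}:{digest}"


def token_is_valid(token: str, now: int) -> bool:
    """Accept the post only if the hash matches and the timestamp is fresh."""
    try:
        ts_text, digest = token.split(":", 1)
        timestamp = int(ts_text)
    except ValueError:
        return False
    expected = hmac.new(SECRET_KEY, ts_text.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(digest.encode(), expected.encode()):
        return False
    return 0 <= now - timestamp <= MAX_AGE_SECONDS
```

The preview page would carry `make_token(int(time.time()))` in a hidden field, and the posting script would call `token_is_valid` with the current time before accepting the entry.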