Pink Stuff Patrol

Spam: we hates it.

Most folks have gotten used to — resigned to — finding a steaming pile of the pink stuff in their email inbox. It's all over the place because the barrier to entry is low — the recipient (me, or you) bears the cost, not the sender. Anti-spam measures help some (if you use GMail you could probably be excused for boggling on being told that around 90% of all email sent is spam); ISPs fight a valiant battle to keep it under control. A less well-known problem is blog spam.

This blog has extensive spam-filtering features, and I've been beefing them up recently because of a new wave of attackers. Hitherto I've mostly been seeing spam posted by keyboard monkeys: cheap labour paid to post early and often, with keywords and/or links intended to boost the pagerank of whatever dubious website they're pushing in the search engines. (This blog has a Google pagerank of 5, which is respectable and makes it a fat target for spammers.) The spam bin autoempties after 30 days, and has typically had 200 quivering pink slices in it at any time.

Not any more. As of two weeks ago the spam load began climbing. If the trend continues, later today the 30-day spam bin should pass a thousand chunks of ... well, you don't want to see it; it's mostly random word salad (some of it Chinese) with keywords for various expensive designer products embedded in it. And the word salad chunks are large. If I could impose a 300 word limit on comments before auto-holding them for moderation, it'd trap 95% of the spam. Again, if I could be arsed to write a Movable Type plugin to ban comment posts with an empty name field, that'd work. Annoyingly, these jerks appear to be using a botnet; blocking by IP address is next to useless.
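For illustration only, the two checks floated above (hold anything over 300 words, ban posts with an empty name field) could be sketched like this. It is Python, not a Movable Type plugin; the function name and the return labels are made up for the example.

```python
def triage_comment(name: str, body: str, max_words: int = 300) -> str:
    """Return 'reject', 'hold' or 'accept' for a submitted comment.

    Hypothetical helper: the labels and the default limit are assumptions.
    """
    # An empty (or whitespace-only) name field is rejected outright.
    if not name.strip():
        return "reject"
    # Over-long comments are held for moderation rather than discarded.
    if len(body.split()) > max_words:
        return "hold"
    return "accept"
```

A comment of exactly 300 words is accepted; the 301st word tips it into the moderation queue.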

I just resorted to renaming the comment-posting script, just in case the spammers have got it hardwired. (I don't think so, but ...) Next step might be to begin permuting the fieldnames in the comment form. I don't hold with CAPTCHAs — from personal experience, they really suck if you don't have excellent visual acuity, and spammers have been known to pay people to break them.

The battle is unending. Just thought I'd let you know that it's not been lost ... yet.

117 Comments

1:

Since the comment form requires javascript anyway, have you considered using some form of HashCash system? I know there are some for WordPress, no idea about Movable Type.

The basic notion is to make posting a comment require a certain proof of work, by finding hash collisions. This is done automatically by the posting form through javascript. A delay of half a second wouldn't inconvenience a legitimate poster, but would at least set a certain threshold for spammers. Of course, if people are so cheap that you can actually pay them to manually post comments on forms and you still make money from the spam, well, then there's little we can do.

2:

Have you got Bad Behavior installed? I've found that works quite well and it does have a plugin for Movable Type (IIRC).

There is no one-size-fits-all solution for spam, but combining approaches can work quite well. Akismet does a reasonable job of trapping most of the spam on my WordPress blog.

3:

I'm a fan of the non visual (logical or content specific) captchas personally; "What is the opposite of up?" or the "Which of these is not an F1 racing team?" on a Formula one forum, etc. The secondary benefit is that you can filter out a few complete imbeciles too.

The problem with these is that you need to put in the effort to create them, and determined spammers won't be kept out. You'll kill off a lot of the 'scattergun' spammer traffic though.

4:

Clever idea for blocking robots:

1) Create one more field in the comment form, called "Don't write anything here" for instance.

2) Use clever CSS trickery to position it out of the visible area

3) Block every comment that fills the field - because robots insist on filling every available field.

The idea is not mine; I've seen it implemented as a Wordpress plugin. Found the original link too:

http://wordpress.org/extend/plugins/ntzantispam/

Sadly it won't work on human posters.
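The server-side half of the three steps above is tiny. A sketch in Python, assuming the submitted form arrives as a dictionary of strings; the field name is hypothetical, and this is not the code of the plugin linked above.

```python
# Hypothetical name for the hidden field; CSS keeps it out of sight
# for human visitors, so only a robot should ever fill it in.
HONEYPOT_FIELD = "website_confirm"


def is_probable_bot(form: dict) -> bool:
    """Flag any submission where the hidden field came back non-empty."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())
```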

5:

I got an email this week from one of the spammer suppliers (they have various pieces of software that auto-populate forums with spam posts). I don't use them; I simply think it's wise to keep an eye on what these people are doing.

They now claim to have OCR software that can read 98% of reCAPTCHA's captcha scheme so users would avoid paying the half cent or so the processing systems charge for using a rural Indian person to read the captcha.

Curiously, I remember reading that they increased the size of their posts a while back to ensure that they bypassed filters which trapped short posts from new members.

Not that I've ever posted a link on here (or followed many, to be honest), but are you adding the NOFOLLOW flag to all external URLs to tell search engines to ignore the link? I know it doesn't work, as Google and others still give a score (albeit lower) to nofollow links, but it stops the flow a little bit.

6:

Yes, adding NOFOLLOW to all external URLs. Using TypePad's antispam API. Using hand-tuned regexps. Looking into other solutions too. Preferably ones that don't require hand-hacking on Perl or JavaScript, because those will fail the next time Six Apart issue a mandatory security upgrade to MT: stuff that can be automated with boilerplate is vastly preferable to hand-rolled solutions.

7:

If you look at the source for this page, you'll see that 'rel="nofollow"' is indeed set.

8:

I read of Douglas Adams's Restaurant at the End of the Universe that it demonstrated how successful his first novel had made him: instead of jokes about being a poor hitch-hiker they were about expense accounts and expensive dining. Having just finished Rule 34 I have to ask to what extent, if any, does running a website bias your prose?

11:

I don't think the problem right now is human posters. What we're getting is typically 500 words of apparently comprehensible (but not actually relevant) text, with a few mentions of ugg boots or whatever. And a half dozen assorted links.

They're pretty obvious - just scan the spam folder pretty well ignoring anything from 'anonymous' that is a few dozen lines long with no paragraph breaks.

What's sneakier is the opportunist: at least one tribute to Steve Jobs came in that was short and sweet and contained only a single url. However, it hit the wrong thread.

12:

where are the spam cops when you need them?

...anyways, I am just glad that electronic spam has not been declared legitimate by the Powers That Be (e.g. like third class postage in the U.S.). But mostly this is a billing issue.

Still, sending in that volume should leave tracks. Unfortunately, no one designs computers "for security" these days. (And, by that, I mean: people that provide computer products are actively hostile to the idea that the owner of the computer should have or could even want a full understanding of what the thing is doing. It's death by micro-optimization.)

13:

The problem of spammers seems rather analogous in electron-space to the problem of burglars in meat-space. In both cases, you have someone who is looking to hijack your hard work in order to avoid having to make any effort of their own. In both cases, the trick isn't necessarily to make your location absolutely secure, but rather to make it more secure than insecure - make it a hard enough target that they're likely to give up and go away because they can't get in and out quickly enough, or without leaving a whole bundle of evidence of their passing.

14:

Obnoxious as the stuff is, I'm glad you're not likely to implement a graphic captcha. Mostly a lurker here, but as I'm using a screenreader that would end my sporadic posts. I really like the fact I don't have to register / login to post here. Purely out of laziness. That ease of use is unfortunately what the human spammers utilize.

Is there any way of determining how much spam is bot generated and which is done by meatware?

Is there any such software solution whereby readers that chance across spam posts on blogs can poison the search rankings for its target? Like have a mark as spam button that somehow filters back to Google? I realise such a system could be potentially horribly abused. (Note though I don't really know what I'm talking about here. Not up on SEO stuff.)

15:

Posted whilst I was typing. This answers my human vs bot question I think.

16:

The problem right now isn't spam that ends up being visible to readers — it's the spam that's getting caught automatically and being dumped into the spam folder. It's coming in at the rate of several an hour, but you lot aren't seeing it because it's being automatically filtered out.

The spammer is obviously not looking to see whether it gets through or not, or they'd have changed their behaviour - spam that is never seen is spam that doesn't actually work for SEO purposes. (Nor could you poison the links.) Also, it's arriving in almost identical format (and substantively the same content) from lots of different addresses, which rather indicates a botnet where multiple spam sources cycle round multiple spam targets, in order to avoid IP blocking.

So, the problem isn't the spam that readers are seeing; it's that the spam they aren't seeing makes it harder to notice false positives: stuff that's ended up in the spam folder when it shouldn't have. When you have several dozen long wordy spams a day in that folder, it's all too easy not to see the occasional valued comment that's got itself lost.

(And don't think that actual columnists are immune.)

17:

I used to use Drupal for my blog. The spam got so bad that I forced all non-logged-in comments to go to manual moderation. And I used a CAPTCHA. It seemed to be toning down, so I took away the CAPTCHA. Hundreds of automated spams resulted.

A little while back, I wrote my own blogging package, in Lisp. It's not ready for non-wizards to use yet, but I hope to get there. http://lisplog.org/

As part of the new blog, I wrote my own CAPTCHA system. It requires solving of a simple arithmetic problem. Two numbers from 1 to 9, find the sum, difference (guaranteed positive), or product. A timestamp to limit replays. Click "Add comment" at the end of a post at the link above to see it in action. I still moderate non-logged-in comments, but I've gotten very few of them.

Most of the spam that got through the Drupal blog appeared to be humans, doing some low-paying job. The same is true now, but apparently those humans are only trained on the big blogging platforms, and my simple arithmetic problem is too much for them. Weird.
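For the curious, an arithmetic challenge of the kind described above (two numbers from 1 to 9, sum, difference or product, plus a timestamp) could be sketched as follows. This is Python rather than the Lisp of the blog package, and the HMAC-signed token, the secret and the ten-minute lifetime are assumptions made for the example, not a description of the real implementation.

```python
import hashlib
import hmac
import random
import time

SECRET = b"change-me"  # assumption: a secret known only to the server
MAX_AGE = 600          # assumption: challenges expire after ten minutes


def _sign(answer: int, issued: int) -> str:
    """Bind the expected answer to the issue time without storing state."""
    msg = f"{answer}:{issued}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def make_challenge(now=None, rng=random):
    """Return (question text, issue timestamp, signed token)."""
    issued = int(time.time() if now is None else now)
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    op = rng.choice("+-*")
    if op == "-" and a < b:
        a, b = b, a  # keep the difference from going negative
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return f"{a} {op} {b}", issued, _sign(answer, issued)


def check_answer(answer: str, issued: int, token: str, now=None) -> bool:
    """Accept only a correct answer to a challenge that is still fresh."""
    now = time.time() if now is None else now
    if not 0 <= now - issued <= MAX_AGE:
        return False
    try:
        value = int(answer)
    except ValueError:
        return False
    return hmac.compare_digest(_sign(value, issued), token)
```

The timestamp only limits replays to the lifetime window; stopping them entirely would need the server to remember which tokens it has already seen.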

18:

Well, there's always restricting comments to the logged in. I mostly go with open comments and Akismet on blogs I run, but places with a strong enough community can easily justify the hurdle of a little registration. And OpenID in theory makes it painless, except most people don't know what it is.

I guess human spammers can still get logins but they'd get blacklisted/banned pretty quickly

19:

Spam comments are a hard issue and I don't think that there will be a solution in anything like the near future. I'm thinking it will take at least dumb AI to make good judgements allowing for the increase in spammer sophistication. There are solutions that might help limit the flow somewhat. The first that springs to mind, coming from a WordPress world, is Akismet, although you would have to look into whether there are any plugins for MT. A second approach would be something like Stupid Filter (http://stupidfilter.org/main/), although that looks like a dead project now.

Additional defenses are, as mentioned above, to make your blog a less attractive target so that spammers move on: for example, adding a delay when posting a comment, or using something like a simple maths quiz rather than a CAPTCHA. This is prone to being broken by spammers just as easily as CAPTCHAs were.

What I see being needed is something like what Google uses to determine the topic/focus of a page, applied to each of your posts. Then when each comment is added, it gets checked against that to see if it is relevant. This would work quite well for you as your posts tend to be over a couple of paragraphs.

20:

Hang in there! We're all rooting for you.

21:

Which comment neatly shows how difficult it is to determine whether a comment is truly relevant or not, and therefore whether the solution in Barney@19's final paragraph would be feasible.

22:

I wonder if the human spammers are getting paid per click, not per successful posting. They may have a disincentive for taking extra time to do it right.

23:

Where are the spam cops? Most of the spam cops of yore have given up the fight.

I used to do a bit of that back in the 1990s on Usenet. The time has long passed when a human spamcop could make any difference - the volume and automated nature of spam has made us obsolete. We're like cavalry horses.

24:

Still, sending in that volume should leave tracks.

As I said, they're using a botnet. Each of the "anonymous" word salad spams comes from a different IP address. (We've got a tool installed that hunts the spam bin for three spams in 48 hours from a single IP address then bans the hell out of it. Alas, it's getting barely any of them because the spammer in question has a herd of thousands to millions of machines to exploit.)
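The rule that tool applies (three spams in 48 hours from a single IP address) is easy to sketch. This is a Python illustration, not the tool itself; the input is assumed to be a list of (IP address, Unix timestamp) pairs pulled from the spam bin.

```python
from collections import defaultdict

WINDOW = 48 * 3600  # 48 hours, in seconds
THRESHOLD = 3       # spams from one address before it gets banned


def ips_to_ban(spam_log, threshold=THRESHOLD, window=WINDOW):
    """Return the set of IPs with `threshold` spams inside any `window`."""
    by_ip = defaultdict(list)
    for ip, stamp in spam_log:
        by_ip[ip].append(stamp)
    banned = set()
    for ip, stamps in by_ip.items():
        stamps.sort()
        # Slide over each run of `threshold` consecutive timestamps.
        for i in range(len(stamps) - threshold + 1):
            if stamps[i + threshold - 1] - stamps[i] <= window:
                banned.add(ip)
                break
    return banned
```

Against a botnet where each address posts once or twice, the returned set stays nearly empty, which is exactly the problem.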

25:

You have to be very careful when requiring people to register to add a comment. You tend to run into the situation where you have to create yet another set of login details with a unique password, and unless they are fairly invested in the topic / site then most people won't bother. OpenID is the best choice, but as you stated, very few technical people have heard of it or use it, let alone the normal web user.

It could be an interesting experiment for Charlie to require logging in for a month, to track the fall in comments and see if they are still at an acceptable level. I would suggest that the quantity would drop by well over 50%.

As a side note for all the people who don't change their password for each site stop and think for a second. The likelihood is that you have just given a complete stranger the login details to your mailbox, which in turn will provide access to most of your online presence. Start using a password manager to get rid of the cognitive load involved in remembering so many alphanumeric strings.

26:

Social networking integration and services like Disqus seem to work reasonably well. I can't remember the last time I saw spam on a Disqus-enabled blog (for example Sociological Images) although that's only an outside impression; I don't run such a blog myself and I'm sure moderation and vigilance are still required.

It does however take a significant element of control out of the blog maintainer's hands, and the fact that it has a "like" feature enabled does seem to mean that people game their comments to get a high like-to-post ratio.

27:

Two things.

Firstly, manual moderation won't work here -- my blog typically gets north of a hundred comments a day.

Secondly, your numerical captcha unfortunately discriminates against folks with dyscalculia, which is distressingly common. This may be fine if your audience are able to do mental arithmetic (as you'd expect if you're catering to programmers or mathematicians, or even folks who learned to pound a mechanical cash register back in the day), but it's a huge deterrent to a younger or more general audience.

28:

And to think that you're actually spared from most of the blog spam that goes around since you're using custom software.

Try running a WordPress site, with its rather well-documented interface and all. Automated spammers will go apeshit on it, even for sufficiently low amounts of popularity.

29:

Maybe rejecting non-ascii text would be an idea? Comments in Chinese would be of limited value for you and most of your audience, not?

30:

I take it the spammers haven't read your last book? Or it's not been translated to Chinese yet?

31:

There is actually only a small minority of such spam coming in. There is one spam with a small fraction of CJK in it a day ago, and one primarily CJK spam about two days ago.

I see no Cyrillic, Arabic, Thai or such. The flood is almost entirely English.

(It's also likely to peeve Feorag if you start filtering out the occasional posts she may make with Japanese in.)

32:

It's hardly surprising that you never see the same IP address often. The bot or person will receive a message and a set of websites (possibly hundreds, probably thousands) and try to post that message to all the sites.

Personally while I can't get my head around Disqus (it seems to want to track users across multiple sites and I can't have the pseudonym I always use) they do seem to be winning the war on spam. I think that's because the more spam you see the easier it is to identify and remove it.

33:

I've been using Spam Free Wordpress which has blocked all the automated spam. I've received all of two spam messages (entered by humans, I am sure) in the past two months. I hope there is something similar for other platforms.

I also moderate first-time commenters, but then my blog doesn't get many readers. :-)

Moderator note: this message was rescued from the spam bin

34:

I hate captchas with a passion; sometimes they're not actually legible even with (corrected to) 20/20 vision.

Also, on the "login" point, are people really so stupid that they use the same password for discussion fora, their e-mail, e-tailers they use...?

35:

Use the same method that stackexchange does. Users can flag spam. The users whose spam flags prove correct are weighted higher and get badges and such ( http://stackoverflow.com/review/ ). Turn the whole thing into a game ( http://en.wikipedia.org/wiki/Gamification ). Spammers can't win a battle of attrition with people that are catching spam for fun (some people like the taste of spam and will actually eat it).

36:

It's not so much stupid as lazy. If a user has to log in on a number of different sites, having to remember passwords for all those different sites is a monumental pain. And it's all too easy for a site that they originally went to for mere discussion to evolve into something with real potential for damage — say, a book discussion blog might get into selling books to its frequenters.

(I'll admit to having a shared password on a few sites, though none of those sites has any information even mildly critical. The worst that could be ruined is my online reputation in a few places I don't really frequent any more.)

37:

Also, on the "login" point, are people really so stupid that they use the same password for discussion fora, their e-mail, e-tailers they use...?

According to recent research, yes they are (about 40% of them, anyway).

38:

Has anyone implemented this as a plugin for Movable Type?

If not, sounds like a lot of work ...

39:

If spam defies regexes & other automated tests, maybe "Is this a piece of spam?" would make a good captcha. Follow the ReCAPTCHA approach of pairing a known spammy or non-spammy post with a recent, not-yet-classified post. Of course, each correct answer provides only one bit of information, or even less if the signal/noise balance isn't close to 50/50, so you'll want to have people moderate a few dozen comments before they're allowed to post...

More seriously, if most of your spam comes from botnet IPs that you haven't encountered before, could you turn that fact to your advantage? If Alice's IP address hits your RSS feed a few times a day, presumably she's an actual reader & commenter. If Bob's IP shows up for the first time requesting pink-stuff-patrol.html but no external CSS/image files, and then five seconds later posts a 200-word comment, then maybe he isn't one.
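As a very rough sketch of that heuristic in Python: given a web-server access log, an IP that has been fetching the feed counts as a known reader, while one that shows up moments before posting and never fetched a stylesheet or image looks suspect. The record layout, the feed path, the thresholds and the labels are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ip: str
    time: float  # Unix timestamp of the request
    path: str


# Assumption: these suffixes stand in for "external CSS/image files".
STATIC_SUFFIXES = (".css", ".png", ".jpg", ".gif", ".js")


def classify_poster(ip, post_time, access_log, feed_path="/atom.xml",
                    min_feed_hits=3, min_dwell=30.0):
    """Return 'trusted', 'suspect' or 'unknown' for a comment's source IP."""
    seen = [r for r in access_log if r.ip == ip and r.time <= post_time]
    feed_hits = sum(1 for r in seen if r.path == feed_path)
    if feed_hits >= min_feed_hits:
        return "trusted"  # Alice: a regular feed reader
    fetched_assets = any(r.path.endswith(STATIC_SUFFIXES) for r in seen)
    first_seen = min((r.time for r in seen), default=post_time)
    if not fetched_assets and post_time - first_seen < min_dwell:
        return "suspect"  # Bob: no assets, posts within seconds
    return "unknown"
```

Anything short of "trusted" would presumably just go through the existing filters rather than being blocked outright.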

40:

Write one or pay/request that someone write one (you have the powah!). As you've pointed out a few times, blog comment spam is a big problem. Movable Type users are a good chunk of the blog market.

Put out a Charlie Stross Challenge: The first person to write a good gamification spam fighting plugin wins: 'signed copy of a book' or 'gets their name in a book' or 'get a webpage on antipope proclaiming their greatness'.

41:

I ran across a short article about building comment forms that screen out spam pretty effectively. It's not a Movable Type plugin as far as I know, but it might help: http://nedbatchelder.com/text/stopbots.html

To all the people talking about forcing login to stop spam: it doesn't. I've tried in the past. All you get is spam links on the user profiles and spam in the posts.

To the question of where have all the spamcops gone, long time passing: we tend to give up after banging our heads on the wall for a while. Periodically I get re-engaged with a spam problem, hearing that siren call of Big Data, but honestly, after the first few years of people not actually caring much, it gets tiring.

Jim Rodovich - tricks like that with IP addresses can sometimes help, but you still have the problem with false negatives when Alice is on vacation or reading from the airport. It's best to dump it all into some kind of Bayes evaluation, but that's hard to get right, and there's always the integrate-with-MT part to make things exciting.

Anyway, Charlie, if you want another pair of eyes or a halfway engaged antispam coder, you know where to find me. Good luck, though.

42:

@27: Secondly, your numerical captcha unfortunately discriminates against folks with dyscalculia, which is distressingly common. This may be fine if your audience are able to do mental arithmetic ...

Whereas I generally get seriously annoyed at web sites which don't cater for accessibility, I'm not sure I'd be that bothered about a simple single digit arithmetic task. (I'm quite prepared to accept that may be a considerable bias on my part - my first degree was called "mathematical physics" until the year before I matriculated.) If the would-be poster is on something we'd call a computer they'll have a calculator app to hand, and the extra effort would act as a threshold between spammer and serious poster. You might lose casual posters, but that could be either a good or a bad thing depending on the blog. If they were on a mobile phone form factor device it wouldn't work, but I can't imagine a human spammer using a smart phone just yet, and we haven't got phone botnets yet (although I'm not looking forward to the day they arrive), so discriminating by the browser identification should allow you to avoid challenging mobiles.

43:

And those of us who live behind firewalls with multiple exit nodes and who run RequestPolicy can go whistle? Categorizing on browser behaviour/IP regime is like categorizing fake versus real names. How well is that going for the preeminent web-technology company today?

44:

The problem doesn't end even if you have a wonderful advanced automatic filter that has solved natural language processing (http://en.wikipedia.org/wiki/Natural_language_processing) and so can determine the relevance of any comment in addition to the other approaches above. As bellingham @21 pointed out, not all real comments are relevant to the discussion at hand, but they still need to be let through by the filter. 8 and 20 are good examples of this.

In addition the spammers will have these tools available to them as well. So they start generating reasonable comments possibly even seeded with appropriate links in addition to the one they want followed. What do you do now?

Jim's idea @39 sounds very workable, although it would take a while to get up to speed. If it becomes popular enough to start working on a few of the bigger sites then it becomes less useful. As a spammer you simply start priming your bots to visit target sites a few times before posting a comment.

Another possible approach would be to have the spam filter pull out & resolve any URLs (to get rid of shortening tricks) and check them against something like the Spamhaus DBL.

45:

I've only read about it from one source, so I am not sure it works as well as advertised but there is apparently a distributed spam fighting system called ATHENA http://www.amazon.com/gp/product/B004Y3I6XW

46:

Actually the value of the login is to make sure real posters are whitelisted; as Charlie pointed out, the damage is losing real comments to the automated guardians. I'd still run Akismet or something like that alongside it (I have run a forum and I know registration gets you a lot of interesting usernames with .ru addresses).

47:

So, how long before Spamnet becomes self-aware and launches a spam attack against Russia?

48:

This MT plugin is an oldie but goodie. It resolved the problem with our company blog, which lives in a VPS - when we got a heavy spam run, although TypePad Antispam did a good job of identifying spam, the requests would pile up waiting for the TypePad API to respond. Being CGI, of course, it's all one thread. Eventually it would trip an automated alert and the hosting firm's robo-admin would kill the MT server.

The plugin mines the spam pile and blocks IP addresses by updating the .htaccess file, so the blocking occurs at the webserver level before requests hit the MT server. I turned up filter aggression so that everything went into the spam file rather than 99% into manual review. Since then, no problems, with a few thousand IPs blocked. (I set the filter threshold to 3 spams, ISTR)

49:

There's no reason an IP address regime would have to be applied as a blacklist. The example I gave even hints at that: the filter might be reasonably confident that someone posting from Alice's IP is indeed an interested reader, but not at all confident that the same is true for a post from Bob's IP. So Alice's post could bypass other, more stringent filters to which Bob's post would be subjected.

50:

I've proved to myself that, as Barney said (@44), 'topic analysis/filtering' wouldn't work all that well. I've been working on a news summarization system for only a few days and, if I feed it this page instead of a bunch of news articles, the output looks on-topic. Although, it takes 7 seconds to analyze the existing text and pop out new text... this would probably dissuade anyone from using it for spam, similar to HashCash.

The Spamhaus DBL suggestion is much more doable and fairly easy. A summary of the needed code: Step 1: $links = $dom->getElementsByTagName('a'); Step 2: http://w-shadow.com/blog/2008/07/05/how-to-get-redirect-url-in-php/ Step 3: http://thecodecave.com/2006/12/02/build-your-own-spam-filter-with-php-and-dnsbls/

Assuming this server also has PHP, it looks like a day's job. I'd guess a similar work load would hold if you wanted it done in Perl.
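For comparison, steps 1 and 3 might look roughly like this in Python (step 2, following redirects, is left out). It is a sketch under assumptions: URLs are pulled out with a simple regular expression rather than a DOM parser, a listing is assumed to come back as a 127.0.1.x address, and real use would need a resolver Spamhaus accepts queries from, so check their current documentation before relying on any of it.

```python
import re
import socket
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_domains(text):
    """Collect the hostnames of every http(s) URL found in a comment."""
    domains = set()
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname
        if host:
            domains.add(host.lower())
    return domains


def dns_lookup(name):
    """Resolve a name to an IPv4 address, or None if it does not exist."""
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        return None


def listed_domains(text, zone="dbl.spamhaus.org", lookup=dns_lookup):
    """Return the domains in `text` that the blocklist zone reports."""
    hits = set()
    for domain in extract_domains(text):
        answer = lookup(f"{domain}.{zone}")
        # Assumption: 127.0.1.x marks a listing; other answers are ignored.
        if answer and answer.startswith("127.0.1."):
            hits.add(domain)
    return hits
```

The `lookup` parameter exists so the DNS call can be swapped for a stub when testing, or for a caching resolver in production.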

You can avoid upgrade syndrome by not having it integrate with MovableType. Have it reference/prune the same database, but otherwise be a completely separate job.

Now, the part I have no idea how to fix is the spam comments that don't even have links. I'm damned if I know why anybody even bothers to submit them, but they make up a good portion of spam on some sites I admin.

51:

"If the would be poster is on something we'd call a computer they'll have a calculator app to hand, and the extra effort would act as a threshold between spammer and serious poster."

but someone could write something that passes any captcha that looks like an equation to Google and grabs the calculator result. one + two gives three. e to the i pi plus one, 0.

don't allow weird character sets? write a romanisation function.

52:

As a mod for the Zero State forum I am getting seriously pissed off with it. We have made a decision to automatically bounce anything with a URL.

53:

As another data point: there's a blog I keep running for a friend using WordPress. It has 3 or 4 posts a week with 10 to 30 legit comments most days.

We manually toss less than 5 SPAM comments per day. But the one simple filter we run blocks over 120 comments per day. Ugh.

We don't require registration at this time.

54:

This report of your current travails explains the odd things I've been seeing in my inbox.

I've been using Web based Yahoo mail as my main email service for nearly 15 years. A few years ago I also added gMail as a secondary mailbox, next to my "tertiary" mailbox in hotmail/livesomething.

Normally the only time I see some spam is when I go fishing in the spam bin for things which should not have ended there. Yesterday I actually saw spam in my normally clean inbox. This means that the onslaught is becoming impressive, somewhere behind the scenes.

Still, I prefer seeing a bit of spam now and then than losing messages from friends. Bell Canada is one of those corporations that uses the service provided by Microsoft livesomething for giving Web mail to its customers, and when I send mail from Yahoo to a friend who's on that thing the spam filter sometimes gobbles it up without a trace. Bell can't/won't explain this random behaviour. So, when I want to be sure my own emails won't be gobbled up I have to go and log in to my own Bell Web mail account when I send off an email to somebody with Bell Web mail.

55:

I remember reading a story in Analog where the spam was so massive that the hero solved the problem by reversing the approach. He turned off spam filters and then used data mining to determine the legitimate posts :)

56:

Akismet is a crowdsourced spam filter for blogs. I have been using their Wordpress plugin with great success. They do have a MT plugin as well.

57:

I just checked my GMail spam folder. Apparently I have 1279 items in it. I assumed that most people got that much. Apparently I need to do a poll.

Good luck finding an answer to it!

58:

Out of curiosity, do URLs hyperlinked with a href attribute get moderated too, or is it just naked URLs? It's not that I plan to spam Zero State or anything, just interested in knowing.

59:

Right now it's hand moderated, with a requirement for membership and 3 posts before an automatic pass-through. However, we propose that anything that looks like a URL be bounced. Fortunately the mailing list is where all the action is right now, so it's not a big problem at present, but it could be in future.

BTW @ andrew.ducker.org.uk, my gmail spam box sits at 1460

60:

Pretty much any custom solution will work, nobody will bother customizing their spambot for a pagerank 5 site. I used to get hundreds of spam posts a day. Over six years, a single one got through the simple text captcha of "please type the word 'foo' into this field". And it was even a constant 'foo', never needed to make it vary over time.

61:

Out of interest, is there a reliable pattern to the spam based on time of day? I've recently seen a huge increase in the amount of email spam rejected by my mailserver - but only between 0600 and 2400 GMT; there's total silence for six hours starting at midnight GMT.

62:

I like the idea of crowd sourced spam filtering. Put a "flag as spam" button next to each post, and if it is flagged more than N times (for some small N greater than 1) then it goes away. It's actually rather satisfying to hit the flag button, like you're sticking it to the spammer.

A few possible drawbacks:

  • A few people will still see the spam
  • The botnet could turn the system around and flag legitimate posts, so you would probably also need some kind of reputation system so that only legitimate users can flag posts.
  • A few people still have to pay attention to the page. A spammer could dump a bunch of junk on an old post nobody is reading any more.
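The basic rule, with the reputation caveat from the second drawback folded in, might be sketched like this in Python. Counting distinct users rather than raw clicks, and the optional set of trusted users, are assumptions added for the example.

```python
def hidden_posts(flags, n=2, trusted=None):
    """Return the posts flagged by more than `n` distinct users.

    `flags` is an iterable of (post_id, user) pairs. If `trusted` is
    given, flags from anyone outside that set are ignored.
    """
    voters = {}
    for post_id, user in flags:
        if trusted is not None and user not in trusted:
            continue
        # A set per post means repeat clicks by one user count once.
        voters.setdefault(post_id, set()).add(user)
    return {post for post, users in voters.items() if len(users) > n}
```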

It's rather ironic: people stopped using usenet and email lists in favor of web forums in part on account of the spam, but blogspam is just as bad. Worse, in some ways.

63:

You missed: griefers will use it to get at commenters they disagree with, or just for lulz.

64:

As a person who includes links in his posts, I certainly hope Charlie's software doesn't get too many false positives. My posts will pass muster with another human, but robots do strange things.

Naturally, xkcd has already offered a solution.

65:

Spammers already exploit basic human gullibility with comments full of friendly praise and compliments. Spambots are much friendlier than the average human, at this point.

Their praise still lacks a little focus ("Great writing!" for a blog without text), but I imagine the next generation will probably be able to actually hit in the ballpark of the actual activity you're doing.

66:

I've been amused for some time by the extent to which defeating captchas has turned into a business. Several dodgy-but-real companies are at it now, competing over "number of cents per thousand solutions" and "number of seconds to solve".

I'm not sure whether they're using the porn trick to do it, or just employing children in countries that are so poor, a cent will buy you lunch.

67:

Have the comment form demand a trivia question about your books. That way the spammers have to pay you to post up in here. (Or at least pay one of your fans...)

Speaking of which, how come Miriam didn't [-SPOILER-][-SPOILER-] recognize the corvee information trap as a trap? Even if that slip-up was intended to show a change in her character, how come her personal assistant didn't spot it and tell her? I think that was the least realistic part of the whole series. Miriam the financial journalist would have recognized the honeypot immediately.

68:

I just added a new plugin that should prevent spambots from directly injecting comments into the system via the comment script. It shouldn't require any human interaction, but if it does, a simple text captcha is coming next ("are you human? If so, type 'yes'").

In answer to your question at Miriam, the word you are looking for is "hubris". (She persistently underestimated the Clan until ...)

69:

The system described has been fairly robust against griefers on the stackexchange sites (even ones like crypto.stackexchange.com with low numbers of users). The idea is that you get points for good flags and lose points for bad flags. Griefers would have to work really hard to get a high flag weight, and they would only get to spend that flag weight once or twice. In the end it is not a 'profitable' griefer activity (they would reduce more grief than they cause).
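
The points idea can be illustrated with a toy update rule. This is not Stack Exchange's actual formula, which I am not reproducing here; the reward and penalty sizes are invented purely to show the asymmetry.

```python
def update_flag_weight(weight: float, flag_was_valid: bool,
                       reward: float = 10.0, penalty: float = 50.0) -> float:
    """Return the flagger's new weight after a moderator reviews one flag.

    The reward and penalty values are assumptions for this sketch. The point
    is only that a bad flag costs far more than a good flag earns.
    """
    if flag_was_valid:
        return weight + reward
    # Weight never goes below zero; a zero-weight flagger has no influence.
    return max(0.0, weight - penalty)
```

With these numbers, someone who has built a weight of 100 through ten good flags burns all of it with two bad ones, which is the "spend it once or twice" effect.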

70:
don't allow weird character sets? write a romanisation function.

That's not really an option. I threw away a snarky response with an attempt at illustrating the problem and all, but the short version is that romanisation is just not acceptable generally. For the specific cases where it is, it still is a hard problem. (A Hard Problem, in capital letters, too.)

And it can't work if you still want to be able to have some symbols for their own sake (math, talking about language, etc) or be able to spell things out ("google for _").

At least the Chinese spam is (apparently) identified as such, so it seems it won't be necessary to worry too much about what to allow and what to block...

71:

Hmm, sentiment analysis looking for overly friendly posts to detect spam?

72:

And so we end up with the opposite of xkcd's utopia - only harsh criticism and confrontational statements are recognized as being human :)

73:

I don't focus on this anymore, but I moderated some Usenet newsgroups for 15 years until I was driven underwater by spam past the filtering I had in place and had to give it up.

Special black place in our hearts for them, we have. Yes.

74:

can you not use a variant of greylisting for comments? works really well on our mail server...

75:

Akismet seems to catch most of the spam sent to my blog, only letting an occasional one through (which I can kill manually), and incorrectly flagged legit posts are rare.

I don't know if its accuracy is based on the software alone, or if it's also using all the other blogs hosted on wordpress.com for comparison. (Most high-volume spam operations hit a lot of blogs at once, so a site that hosts lots of blogs could use the spam identified on one blog to help ID and squash similar spam on others. IIRC, Google does something similar to catch spam on its various Gmail accounts.)

76:

On the varnish.org site, we added a question where you need to show a very basic knowledge about what the Varnish HTTP cache is. That instantly killed all the robots and outsourced eyes&fingers.

The trick is to make it non-obvious for people outside the intended audience what the possible set of answers are, to make brute-force unappealing, but make it trivial for the intended audience to find the answer, if they don't already know it.

77:

limeyrock @ 3 And if you don't know ANY F1 racing teams, or even if you do, don't give a flying fuck about so-called "motorsport"? VERY bad example there, methinks.

General note I've just logged in - last log-in was yesterday @ about this time. My commercial SPAM-filter has picked up 6 in that period. And I'm just Joe Bloke, from their POV. OTOH, said simple commercial filter does seem to work. Um. I suspect, in the end, a serious international commercial treaty is going to be needed. Except, of course, the USSA ( NOTE * ) will refuse to sign. The idiots STILL haven't learnt the lesson of the CSS Alabama

@ 32 and others Disqus? Forget it. The Daily Telegraph uses it, and it's a complete bastard, from an outside-users' pov.

Dirk @ 52 You auto-bounce anything with a URL On a WEB discussion? So how do people redirect, or send useful links, then? Not clever, IMHO

NOTE * "USSA" as in the Beatles song, ... back in the USA/back in the USSAH!

78:

The dynamics of conversation in a blog discussion are such that greylisting -- aside from it not working at all because commenters are not SMTP servers -- would disrupt the discussion completely.

79:

I've taken to posting links in plaintext here; it saves me from getting caught in the queue and I figure anyone here interested knows how to cut'n paste a link. Most browsers have highlight text>go to link functionality as well nowadays.

80:

This is probably not immediately relevant to our gracious host's problem, since he'd like something off the shelf, but it seems to me that the optimal way of dealing with the issue is to do the spam filtering at comment-entry time. If the comment passes, accept it without further issue. If it fails, present the poster with hoops to jump through, either to demonstrate their humanity (i.e., some form of captcha), or to impose a small cost to make the transaction annoying to people posting (a message stating either "we're not sure you're human, please wait 20 seconds and then push 'resubmit'", or simply "server error, please wait etc.").

This would have the advantage of allowing stricter tests for posters of suspect posts without applying those costs to everyone, while allowing the paranoia of the spam filter to be turned up pretty high, as legit posters would be able to rescue their posts on their own.

Downsides would be additional processing cost at point of comment entry and giving spammers immediate feedback as to whether their posts were triggering the filter.
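
As a rough sketch of that flow: the spam score is assumed to come from some existing filter, scaled from 0 to 1, and the threshold and the function name are made up for illustration.

```python
def handle_submission(spam_score: float, challenge_passed: bool = False,
                      threshold: float = 0.5) -> str:
    """Decide what to do with a comment at the moment it is submitted.

    spam_score is assumed to come from some existing filter (0 = clean,
    1 = certainly spam); this sketch does not compute it.
    """
    if spam_score < threshold:
        return "publish"      # looks fine: no extra cost for the poster
    if challenge_passed:
        return "publish"      # suspect post rescued by its own author
    return "challenge"        # show the captcha or the "please wait" page
```

Lowering `threshold` makes the filter more paranoid without losing legitimate comments, since a false positive only costs the poster one extra step.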

81:

You might want to read limeyrock's comment again. Here's the relevant part:

'I'm a fan of the non visual (logical or content specific) captchas personally; "What is the opposite of up?" or the "Which of these is not an F1 racing team?" on a Formula one forum, etc.'

I've bolded the important bit in case you can't figure out why you might owe limeyrock an apology.

82:

"You auto-bounce anything with a URL On a WEB discussion? So how do people redirect, or send useful links, then?"

We will not, on the forum. We have a FaceBook page for that kind of stuff.

83:

Some of us don't use Facebook or other networking sites.

And I find the links posted in comments on this blog often take me to interesting and informative sites I wouldn't be aware of otherwise.

The breadth and depth of knowledge displayed by people who post here is one of the things that puts this blog top of my daily to read list.

Melvyn

84:

Mort, that's how Movable Type works already.

85:

Some of us are sort of required to use Arsebook and Titter by the very nature of our jobs.

It gives my publishers' marketing folks the warm fuzzies if they can see $AUTHOR pushing their wares on the web, and some of them get a glazed, slightly worried look if you corner them and earnestly explain how running a social hub of your own with 15,000 interested daily readers is more valuable than being grist for Zuckerberg's advertising revenue stream. Marketing folks are not necessarily as internet savvy as they think they are. So don't be surprised if over the next month or so you see FB and Twitter buttons showing up in the blog sidebar.

On the other hand, you don't need to worry about me abandoning the blog. On the contrary: I'll be updating it to auto-broadcast my effusions here to FB and T, rather than the other way round.

86:

I think we have a simpler job on the ZS forum in that it is intended to be there to discuss ZS business. It might have started differently, but that's how it seems to be ending up. So eliminating URLs would not significantly diminish the experience. It is a much narrower forum than this one.

87:

Is twitter actually useful for anything?

88:

Twitter is basically broadcast SMS messaging over TCP/IP, far as I can tell. You can broadcast at people who follow (opt in on) your tweets, or you can follow other people and receive their broadcasts.

No idea what the social applications are like as I've only had a twitter account for 48 hours.

89:

I got one and all I seem to get is upper class spam. Can't imagine the type of person who would check their twitter account every few minutes in case they missed some of the dross.

90:

You can rename the form fields and keep the old form fields invisible, some with preset values, as 'honeypots'; reject if those are altered.
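
A server-side check for that might look something like this. The field names and preset values are hypothetical; the assumption is that the old fields stay in the form, hidden with CSS, so a real browser submits them untouched.

```python
# Hypothetical honeypot fields: old field name -> preset value left in the form.
HONEYPOTS = {"email": "", "url": "http://"}


def tripped_honeypot(form: dict) -> bool:
    """Return True if any hidden field is missing or differs from its preset."""
    # A browser sends CSS-hidden inputs unchanged; a bot that fills in every
    # field it recognises, or posts its own field set, will not match.
    return any(form.get(name) != preset for name, preset in HONEYPOTS.items())
```

The real name, email and URL would travel in the renamed fields, which this check ignores.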

The automatic generation of questions is what makes captchas so incredibly annoying and is also quite unnecessary for a blog. You can add a form field where one has to e.g. enter the square root of minus one - a handcrafted question - and you'll probably be fine. It is worth the spammer's time to spam all MovableType blogs, but it is not worth the spammer's time to spam your blog personally, if its software is in the slightest bit customized.

91:

Which CharlesStross are you on Twitter, if I may be so bold as to ask?

92:

None of the above: I'm @cstross.

93:

There's another blogger I read who does that. (He doesn't directly post to twitter, ever, that I am aware of. But his blog-toolkit automatically Twitters a summary of every blog-post.)

I don't read Twitter, so I didn't notice except when he announced it in a blog-post.

Re: Spam. I don't know what to say.

I had a different problem. For a while, I ran a machine with a public-facing SSH service. I used the denyhosts plugin. (for info, see http://denyhosts.sourceforge.net )

Every once in a while, I would look at the logs, and see dozens of login attempts per hour from some obscure outside-of-the-US IP.

It's as bad as spam, and all they wanted was my machine.

94:

I run denyhosts too.

What you need to do URGENTLY is to set yourself up for public key authentication to log in on your host, then disable password logins. Otherwise the ass-hats doing the distributed botnet enabled password-guessing attacks will get lucky (because you may be as good as gold at setting a password, but someone else on the same machine will sooner or later leave the back door open by using an easily-guessed password).
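
One way to double-check that password logins really are off is to read the server's configuration back. The sketch below parses sshd_config-style text and relies on two things from the OpenSSH documentation: the first value given for a keyword wins, and PasswordAuthentication defaults to yes. It is deliberately simplified and ignores Match blocks, Include files, and the keyword=value form, so treat it as a sanity check under those assumptions rather than an audit.

```python
def effective_settings(config_text: str) -> dict:
    """Map each keyword (lowercased) to the first value given for it."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        parts = line.split(None, 1)       # keyword, then the rest of the line
        if len(parts) == 2:
            # setdefault keeps the first occurrence and ignores later ones.
            settings.setdefault(parts[0].lower(), parts[1].strip().lower())
    return settings


def password_logins_disabled(config_text: str) -> bool:
    """True only if PasswordAuthentication is explicitly set to no."""
    return effective_settings(config_text).get("passwordauthentication", "yes") == "no"
```

Feeding it the contents of the server's sshd_config and getting False back means the password-guessing botnets still have something to guess at.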

On the other hand, if you don't use ssh for logging in to a server that's accessible over the internet, you're not at risk.

(Hint: there are two ways to log into this server of mine -- one involves public key authentication over ssh, the other involves public key authentication to an ISP's terminal server then login on the serial console. The password guessing attacks haven't stopped, but they're not going to get access because even if they magically guessed my password the ssh server won't let them log in remotely with it.)

95:

Wowza, I use MT for my own blog for much the same reasons you do, but I seem to get a lot more spam. In the 7 hours since I mucked out the spam bins this morning I already have 146 messages in there. Lately, a lot more have been falling into my 'needs approval' queue and are marked Ham by Akismet; most of them are huge blocks of text with a few key words and a couple links in the URL bar. In the last couple of weeks, my 7 day spam rate is about ten thousand. And this is on a blog with two to three orders of magnitude less traffic. So I'm watching this thread and looking for tips.

I'm also a moderator on one of the StackExchange sites so have a much more detailed view on how their spam system works since I do a large part of the spam-hammering. That system only works for a very busy blog, and even then it'll require admin attention.

I also greatly wish for a content-validator for MT that has post-length as a filter.

  • On the smaller StackExchange sites, it's a moderator that eventually hammers all spam since there aren't enough users flagging to kill messages normally. Spam-flagging merely brings it to the attention of the moderator.
  • On the larger StackExchange sites, posses of users tend to band together as content police which takes some of the load off of the moderators.
  • On the largest StackExchange site (StackOverflow I'm looking at you) ad-hoc posses of users and sock-puppets can band together to grief others in content wars. Moderators get involved when they invalidate flags and undo griefing.

How many posts you can spam-flag varies based on how well you've been flagging in the past. Those content-posse members tend to have high flag weights and lots of votes. Brand new sock-puppets have very few votes.

96:

most of them are huge blocks of text with a few key words and a couple links in the URL bar.

Yup, that sounds very familiar! Did it start up in the past couple of weeks?

97:

It most certainly has. Three months ago I was doing about 3000 a week for spam, now it's up to 10K. I noticed a lot more getting past Akismet & Co. and had to go to full moderation for unauthenticated comments as a result. My comment rate makes that manageable, but still. In fact, this week has been particularly bad.

Between this comment and my last one, I'm now up to 319 in the bin. Just like an email server, I spend more CPU cycles keeping junk out than serving readers.

98:

Charlie mentioned:

word salad spams

Years ago I looked at valid incoming email versus spam and realized that a simple grammar checker could catch the majority of it. At the time I didn't find anything, but DOS grammar checkers were available in the 1980s, so I expected some type of filter would be available to filter between the MTA and MUA.

Hmm, a quick web search shows a couple of possible candidates now. I might check them out if I get the urge to join some mailing lists again. I don't know if it's a practical method for filtering on a blog, but given the literacy level of the people who post here I would expect it would be very effective.

99:

Thanks for all the useful tips on this discussion--lots to think about, for those of us starting a blog.

On another topic--

I gather a lot of the powerpoints and papers presented at the conference will become available online over the next days and weeks. I'll try and link to them when they do.

Any of those starship/space colony papers posted yet?

100:

Charlie should write a letter to President Obama asking him to declare a War on Spam. In my experience, if the U.S. Army can't fix a problem no one can.

On a more serious note, how about including in every opening post of a thread a phrase like "In order to reply, the first line of your reply must have some-random-string". Attempted posts which have the random string on the first line get accepted; the rest get rejected. You can even make it so that the random string gets filtered out before the post appears on the site. Seems simple and effective to me. It also guarantees that aspiring posters must have used a reasonable amount of attention in reading the opening post.
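
Something along these lines, as a sketch only: the function name is invented, and how the token is generated and shown in the opening post is left out.

```python
def accept_reply(body: str, token: str):
    """Return the reply with the token removed, or None to reject it."""
    first_line, sep, rest = body.partition("\n")
    if token not in first_line:
        return None                       # token missing: reject the post
    # Strip the token so it never shows up in the published comment.
    cleaned_first = first_line.replace(token, "", 1).strip()
    return (cleaned_first + sep + rest).lstrip("\n")
```

A reply that carries the token on a line of its own is published without that line; one that puts the token at the start of the first sentence simply loses the token.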

By the way, I just tried to sign up, thrice. I get a screen which says that a confirmation email has been sent to my account but I haven't received it. Yes, I did check my spam folder.

101:

I suspect, in the end, a serious international commercial treaty is going to be needed. Except, of course, the USSA ( NOTE * ) will refuse to sign. The idiots STILL haven't learnt the lesson of the CSS Alabama

Since earlier in your post you complain about unwarranted assumed familiarity of people with formula 1, how about people who are not familiar with either the Beatles song or CSS Alabama? I had a quick browse through the wikipedia page on the latter and I didn't see a connection with international treaties.

102:

How much time and resources are wasted with all this spam? How much of this effort can be usefully re-directed to solve some of the world's problems, like hunger, poverty and global warming? Just stopping the spam being sent out will save on our overall energy bills... as far as I'm concerned spammers are just bullies hiding behind data masks.

103:

Another thing you can do, if you don't have to support logins for anyone but yourself – and after disabling password logins – is moving the SSH daemon to a different port number. That'll stop the botnets even wasting CPU cycles when they connect; when I did so, it cut attempts to zero.

104:

Rosie @ 101 My point, exactly.

The (probable) only way out is a binding International Treaty.

I'm assuming there must be SOME profit in Spam, otherwise why is it being done? cui bono indeed. I wonder what the profit-margin is on Spam - is it really worth it for the spammers, or have they actually locked themselves into an illusory zero-sum game?

Any real figures on this, anyone?

105:

Not SMTP servers, true, but the set {browser setting +OS+software versions+accessible cookies+source IP} is presumably pretty defining - didn't EFF do a demo of how good that was at identifying your machine?

So you could use that, plus the name/email address/url combination, as your greylisting key. And just require the equivalent of a second click after a few seconds ("no I'm a spammer" "yes I'm a human") for greylisted IDs.
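
A sketch of what that key and the delay check might look like. The particular fields hashed together, the five-second delay, and the class name are all assumptions; a real fingerprint would use whatever the server can actually observe.

```python
import hashlib

GREYLIST_DELAY = 5.0   # seconds; "a few seconds" in the comment, exact value assumed


def greylist_key(user_agent, ip, name, email, url):
    """Hash the browser fingerprint plus commenter details into one key."""
    # Join with a separator that cannot appear in normal form input.
    raw = "\x1f".join([user_agent, ip, name, email, url])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Greylist:
    def __init__(self):
        self.first_seen = {}   # key -> time of first attempt

    def allow(self, key, now):
        """Defer the first attempt; accept a retry once the delay has passed."""
        first = self.first_seen.setdefault(key, now)
        return now - first >= GREYLIST_DELAY
```

The caller passes in the current time (for example from `time.time()`), and a False result is where the "yes I'm a human" second click would be shown.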

106:

I don't expect the US military to be any better than anyone else at winning a war on an abstraction. (Indeed, the whole 'war on [abstraction]' seems to have done very well for the military contractors and police agencies, but I don't expect the wars on drugs and terror to be won any time soon -- the first, because drugs are a movable feast (what about alcohol?) and demand for them is one of the constants of human nature, and the latter because terrorism is a movable feast (look how many respectable politicians are ex-'terrorists') and it seems to be one of the constants of human nature (insofar as repressive regimes generate opposition, and if the opposition turns violent then whatever tools they use are categorized as 'terrorism').)

NB: Need morning tea before I post.

107:

I think that, if you're someone who would never respond to spam, it can be hard to understand where the profit margin is, but remember that it only takes a few people per million messages to click through and buy.

I work with a woman who fell for one of those "The secret to white teeth with 2 common products" scam ads on a website, and I never would have thought an intelligent person would fall for that, so there you go.

The costs of sending spam are negligible, especially since generally speaking, the spammers aren't the ones paying. Used to be there were notorious spamhaus ISPs that allowed people to exploit them (wanadoo.fr was one of them) but it's more sophisticated than that now. If they're running a botnet, ordinary home and business users are doing it for them for nothing, all unaware.

Anyway, a quick search gets you this 2003 article from El Reg. I assume the principles still hold true.

108:

Our tools indicate that my server is not originating spam but I am getting 65,000 backscatter a day, so it is obvious that my domain is being spoofed and probably thousands of millions of illegitimate messages attributed to it. I am dismayed by the costs associated with this behaviour, and wish there were a way to bring a class action against the perpetrators to recover their ill-gotten gains.

The custom email address I have given you is part of my spam control efforts. By giving everyone their own eddress and using white lists, I can capture my good messages rather than trying to filter out the cruft, and if one becomes tainted (as a spam source, for instance), it doesn't contaminate the entire pool.

In similar fashion, perhaps you could include a keyword in your post that would be entered in a field in the comment indicating that the commenter has actually read the post they are responding to, and use that field to gather the sincere comments and ignore the rest?

109:

One use of Twitter is chit-chat to chums. Another is announcing releases of something (eg software or music). Or "hey, how are you doing? Did the interview go well?".

I've seen very little spam, but I'm likely too tiny a target.

110:

"Another is announcing releases of something (eg software or music)."

Over and over again, i.e. spam

111:

Re Twitter: Two possible answers:

  • A few very interesting people use it to, essentially, curate links. Examples would be William Gibson (@GreatDismal) or Bruce Sterling (@bruces).
  • Twitter exists so Warren Ellis can put his disease in us. (Note: this references a t-shirt.)

113:

Hi there,

I'm very impressed that you have such a thriving comments section. I have two websites and due to spam, I am starting to give up allowing comments. The main site gets so much that it was ending up being hundreds a day. So I disabled all comments.

The second one is a science fiction site, a bit more niche, which currently gets about 25 spammers a day. Which is close to the limit I can be bothered with.

It's a sad state of affairs though. If I could be bothered, I would learn how to hack and would wreak bloody revenge on every site that spams me. But I can't.

Some good tips here though.

114:

Or at a frequency that is in line with what the reader (who has to opt-in) wants i.e. like an RSS feed. I doubt that one in 500 tweets that I see could be classed as spam.

115:

Something that may or may not be helpful is CloudFlare:

http://cloudflare.com

They're a DNS based caching system and CDN, but also have a lot of pretty nifty security options that have grown out of Project Honeypot.

I've seen this do wonders for comment spam, so either this or some other application firewall might be helpful in your case.

116:

Speaking as someone who works at Cloudflare, we'd be happy to help. I won't spam your blog with all the features etc. (:-)), but we do a bunch of things (for free) to try to improve the web. Feel free to drop me a note if you want more details...

117:

On CAPTCHAs: Shamus Young, when he was in the middle of his justly famous DM of the Rings screencap comic, instigated CAPTCHA-control of comments because he suddenly became very attractive to spammers.

He found that after about three days he didn't need to actually go to the trouble of building the word/phrase and morphing it. He found that by using an easily-recognised phrase as the same CAPTCHA day in, day out, the spam became a minor problem again.

Of course, there was always some twit ready to point out for the benefit of the bots that the CAPTCHA wasn't working. A quick perl script would have taken care of that.

About this Entry

This page contains a single entry by Charlie Stross published on October 14, 2011 12:53 PM.
