Pink Stuff Patrol

David | October 14, 2011 13:22

1:

Since the comment form requires JavaScript anyway, have you considered using some form of HashCash system? I know there are some for WordPress, no idea about Movable Type.

The basic notion is to make posting a comment require a certain proof of work, by finding partial hash collisions. This is done automatically by the posting form through JavaScript. A delay of half a second wouldn't inconvenience a legitimate poster, but would at least set a certain threshold for spammers. Of course, if people are so cheap that you can actually pay them to manually post comments on forms and you still make money from the spam, well, then there's little we can do.
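A minimal sketch of the proof-of-work idea, written in Python rather than the JavaScript a real comment form would need. It is not the actual Hashcash stamp format: the SHA-256 hash, the `challenge:nonce` layout and the function names are all assumptions made for illustration.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(challenge: str, difficulty: int) -> int:
    """Client side: search for a nonce whose hash has enough leading zero bits."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash checks the work the client claims to have done."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    # Hypothetical per-comment challenge string issued by the server.
    nonce = mint("comment-form-42", 16)
    print(nonce, verify("comment-form-42", nonce, 16))
```

Each extra bit of difficulty doubles the expected work for the poster, while verification stays at one hash, so the delay can be tuned towards the half second suggested above.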

davidrutt | October 14, 2011 13:28

2:

Have you got Bad Behavior installed? I've found that works quite well and it does have a plugin for Movable Type (IIRC).

There is no one-size-fits-all solution for spam, but combining approaches can work quite well. Akismet does a reasonable job of trapping most of the spam on my WordPress blog.

limeyrock | October 14, 2011 13:28

3:

I'm a fan of the non-visual (logical or content-specific) captchas personally; "What is the opposite of up?" or the "Which of these is not an F1 racing team?" on a Formula One forum, etc. The secondary benefit is that you can filter out a few complete imbeciles too.

The problem with these is that you need to put in the effort to create them, and determined spammers won't be kept out. You'll kill off a lot of the 'scattergun' spammer traffic though.

Laur | October 14, 2011 13:38

4:

Clever idea for blocking robots:

1) Create one more field in the comment form, called "Don't write anything here" for instance.

2) Use clever CSS trickery to position it out of the visible area

3) Block every comment that fills the field - because robots insist on filling every available field.

The idea is not mine; I've seen it implemented as a WordPress plugin. Found the original link too:

http://wordpress.org/extend/plugins/ntzantispam/

Sadly it won't work on human posters.
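The server-side half of the hidden-field trick can be sketched in a few lines. This is only an illustration, not the code of the plugin linked above: the field name `leave_blank` and the shape of the submitted form (a plain dictionary of strings) are assumptions.

```python
def is_probable_bot(form: dict, honeypot_field: str = "leave_blank") -> bool:
    """Return True when the hidden honeypot field was filled in.

    A human never sees the field (it is positioned off-screen with CSS),
    so any non-blank value suggests an automated form filler.
    """
    return bool(form.get(honeypot_field, "").strip())


if __name__ == "__main__":
    human = {"author": "Laur", "text": "Nice post", "leave_blank": ""}
    robot = {"author": "x", "text": "buy boots", "leave_blank": "http://spam.example"}
    print(is_probable_bot(human), is_probable_bot(robot))
```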

Ben Thompson | October 14, 2011 13:38

5:

I got an email this week from one of the spammer suppliers (they have various pieces of software that auto-populate forums with spam posts). I don't use them; I simply think it's wise to keep an eye on what these people are doing.

They now claim to have OCR software that can read 98% of reCAPTCHA's captcha scheme so users would avoid paying the half cent or so the processing systems charge for using a rural Indian person to read the captcha.

Curiously, I remember reading that they increased the size of their posts a while back to ensure that they bypassed filters which trapped short posts from new members.

Not that I've ever posted a link on here (or followed many to be honest) but are you adding the NOFOLLOW flag to all external URLs to tell search engines to ignore the link? I know it doesn't fully work, as Google and others still give a score (albeit lower) to nofollow links, but it stops the flow a little bit.

Charlie Stross replied to this comment from Ben Thompson | October 14, 2011 13:43

6:

Yes, adding NOFOLLOW to all external URLs. Using TypePad's antispam API. Using hand-tuned regexps. Looking into other solutions too. Preferably ones that don't require hand-hacking on Perl or JavaScript, because those will fail the next time Six Apart issue a mandatory security upgrade to MT: stuff that can be automated with boilerplate is vastly preferable to hand-rolled solutions.

Moderator Alan replied to this comment from Ben Thompson | October 14, 2011 13:43

7:

If you look at the source for this page, you'll see that 'rel="nofollow"' is indeed set.

Justin Boden | October 14, 2011 13:47

8:

I read of Douglas Adams's Restaurant at the End of the Universe that it demonstrated how successful his first novel had made him: instead of jokes about being a poor hitch-hiker they were about expense accounts and expensive dining. Having just finished Rule 34 I have to ask to what extent, if any, does running a website bias your prose?

Charlie Stross replied to this comment from Justin Boden | October 14, 2011 13:49

9:

I have no idea. Next?

Justin Boden replied to this comment from Charlie Stross | October 14, 2011 13:49

10:

:) Fair call.

Moderator Alan replied to this comment from Laur | October 14, 2011 13:49

11:

I don't think the problem right now is human posters. What we're getting is typically 500 words of apparently comprehensible (but not actually relevant) text, with a few mentions of ugg boots or whatever. And a half dozen assorted links.

They're pretty obvious - just scan the spam folder pretty well ignoring anything from 'anonymous' that is a few dozen lines long with no paragraph breaks.

What's sneakier is the opportunist: at least one tribute to Steve Jobs came in that was short and sweet and contained only a single url. However, it hit the wrong thread.

rdm | October 14, 2011 13:52

12:

where are the spam cops when you need them?

...anyways, I am just glad that electronic spam has not been declared legitimate by the Powers That Be (e.g. like third class postage in the U.S.). But mostly this is a billing issue.

Still, sending in that volume should leave tracks. Unfortunately, no one designs computers "for security" these days. (And, by that, I mean: people that provide computer products are actively hostile to the idea that the owner of the computer should have or could even want a full understanding of what the thing is doing. It's death by micro-optimization.)

Megpie71 | October 14, 2011 13:53

13:

The problem of spammers seems rather analogous in electron-space to the problem of burglars in meat-space. In both cases, you have someone who is looking to hijack your hard work in order to avoid having to make any effort of their own. In both cases, the trick isn't necessarily to make your location absolutely secure, but rather to make it more secure than insecure - make it a hard enough target that they're likely to give up and go away because they can't get in and out quickly enough, or without leaving a whole bundle of evidence of their passing.

Straylight | October 14, 2011 14:01

14:

Obnoxious as the stuff is, I'm glad you're not likely to implement a graphic CAPTCHA. Mostly a lurker here, but as I'm using a screenreader that would end my sporadic posts. I really like the fact I don't have to register / login to post here. Purely out of laziness. That ease of use is unfortunately what the human spammers utilize.

Is there any way of determining how much spam is bot generated and which is done by meatware?

Is there any such software solution whereby readers that chance across spam posts on blogs can poison the search rankings for its target? Like have a "mark as spam" button that somehow filters back to Google? I realise such a system could be potentially horribly abused. (Note though I don't really know what I'm talking about here. Not up on SEO stuff.)

Straylight replied to this comment from rdm | October 14, 2011 14:05

15:

Posted whilst I was typing. This answers my human vs bot question I think.

Moderator Alan replied to this comment from Straylight | October 14, 2011 14:17

16:

The problem right now isn't spam that ends up being visible to readers — it's the spam that's getting caught automatically and being dumped into the spam folder. It's coming in at the rate of several an hour, but you lot aren't seeing it because it's being automatically filtered out.

The spammer is obviously not looking to see whether it gets through or not, or they'd have changed their behaviour - spam that is never seen is spam that doesn't actually work for SEO purposes. (Nor could you poison the links.) Also, it's arriving in almost identical format - and substantively the same content - from lots of different addresses, which rather indicates a botnet where multiple spam sources cycle round multiple spam targets, in order to avoid IP blocking.

So, the problem isn't spam that readers aren't seeing, it's that that spam makes it harder to notice false positives — stuff that's ended up in the spam folder when it shouldn't have. When you have several dozen long wordy spams a day in that folder, it's all too easy not to see the occasional valued comment that's got itself lost.

(And don't think that actual columnists are immune.)

Bill St. Clair | October 14, 2011 14:26

17:

I used to use Drupal for my blog. The spam got so bad that I forced all non-logged-in comments to go to manual moderation. And I used a CAPTCHA. It seemed to be toning down, so I took away the CAPTCHA. Hundreds of automated spams resulted.

A little while back, I wrote my own blogging package, in Lisp. It's not ready for non-wizards to use yet, but I hope to get there. http://lisplog.org/

As part of the new blog, I wrote my own CAPTCHA system. It requires solving of a simple arithmetic problem. Two numbers from 1 to 9, find the sum, difference (guaranteed positive), or product. A timestamp to limit replays. Click "Add comment" at the end of a post at the link above to see it in action. I still moderate non-logged-in comments, but I've gotten very few of them.
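A rough sketch of an arithmetic challenge with a timestamp, in Python rather than the Lisp of the blog package described above. It is not that package's code: the HMAC-signed token, the `SECRET` value and the ten-minute lifetime are assumptions chosen for illustration.

```python
import hashlib
import hmac
import random
import time

SECRET = b"change-me"      # hypothetical server-side secret
MAX_AGE_SECONDS = 600      # hypothetical lifetime of a challenge


def _sign(answer: int, issued_at: int) -> str:
    """Bind the expected answer to the issue time without revealing it."""
    msg = f"{answer}:{issued_at}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def make_challenge(rng=random, now=None):
    """Return (question, issued_at, token) for two numbers from 1 to 9."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    op = rng.choice("+-*")
    if op == "-" and a < b:
        a, b = b, a            # keep the difference from going negative
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    issued_at = int(time.time() if now is None else now)
    return f"{a} {op} {b}", issued_at, _sign(answer, issued_at)


def check_answer(reply: str, issued_at: int, token: str, now=None) -> bool:
    """Accept only a correct answer to a challenge that has not expired."""
    current = time.time() if now is None else now
    if current - issued_at > MAX_AGE_SECONDS:
        return False
    try:
        value = int(reply.strip())
    except ValueError:
        return False
    return hmac.compare_digest(_sign(value, issued_at), token)


if __name__ == "__main__":
    question, issued_at, token = make_challenge()
    print(question, issued_at, token[:12])
```

The timestamp only narrows the window in which a solved challenge can be replayed; stopping replays entirely would also need the server to remember which tokens it has already accepted.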

Most of the spam that got through the Drupal blog appeared to be humans, doing some low-paying job. The same is true now, but apparently those humans are only trained on the big blogging platforms, and my simple arithmetic problem is too much for them. Weird.

Nestor | October 14, 2011 14:26

18:

Well, there's always restricting comments to the logged in. I mostly go with open comments and Akismet on blogs I run, but places with a strong enough community can easily justify the hurdle of a little registration. And OpenID in theory makes it painless, except most people don't know what it is.

I guess human spammers can still get logins but they'd get blacklisted/banned pretty quickly.

Barney Scott | October 14, 2011 14:26

19:

Spam comments are a hard issue and I don't think that there will be a solution in anything like the near future. I'm thinking it will take at least dumb AI to make good judgements allowing for the increase in spammer sophistication. There are solutions that might help limit the flow somewhat. The first that springs to mind, coming from a WordPress world, is Akismet, although you would have to look into whether there are any plugins for MT. A second approach would be something like Stupid Filter (http://stupidfilter.org/main/) although that looks like a dead project now.

Additional defenses are, as mentioned above, to make your blog a less attractive target so that spammers move on: for example, adding a delay when posting a comment or, rather than a CAPTCHA, using something like a simple maths quiz. This is prone to being broken by spammers just as easily as CAPTCHAs were.

What I see being needed is something like what Google uses to determine the topic/focus of a page, applied to each of your posts. Then each comment, when added, would be checked against that to see if it is relevant. This would work quite well for you as your posts tend to be over a couple of paragraphs.

Liam Hegarty | October 14, 2011 14:30

20:

Hang in there! We're all rooting for you.

Moderator Alan replied to this comment from Liam Hegarty | October 14, 2011 14:35

21:

Which comment neatly shows how difficult it is to determine whether a comment is truly relevant or not, and therefore whether the solution in Barney@19's final paragraph would be feasible.

heteromeles replied to this comment from Bill St. Clair | October 14, 2011 14:39

22:

I wonder if the human spammers are getting paid per click, not per successful posting. They may have a disincentive for taking extra time to do it right.

Ms. Sunlight replied to this comment from rdm | October 14, 2011 14:42

23:

Where are the spam cops? Most of the spam cops of yore have given up the fight.

I used to do a bit of that back in the 1990s on Usenet. The time has long passed when a human spamcop could make any difference - the volume and automated nature of spam has made us obsolete. We're like cavalry horses.

Charlie Stross replied to this comment from rdm | October 14, 2011 14:46

24:

Still, sending in that volume should leave tracks.

As I said, they're using a botnet. Each of the "anonymous" word salad spams comes from a different IP address. (We've got a tool installed that hunts the spam bin for three spams in 48 hours from a single IP address then bans the hell out of it. Alas, it's getting barely any of them because the spammer in question has a herd of thousands to millions of machines to exploit.)
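The "three spams in 48 hours from a single IP" rule can be sketched as below. This is not the tool installed on the blog; the input format (a list of IP address and timestamp pairs pulled from the spam bin) and the function name are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def ips_to_ban(spam_log, threshold=3, window=timedelta(hours=48)):
    """Return the IPs that sent at least `threshold` spams within any `window`.

    spam_log is an iterable of (ip_address, datetime) pairs.
    """
    by_ip = defaultdict(list)
    for ip, seen_at in spam_log:
        by_ip[ip].append(seen_at)

    banned = set()
    for ip, times in by_ip.items():
        times.sort()
        # Slide over every run of `threshold` consecutive spams from this IP.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                banned.add(ip)
                break
    return banned


if __name__ == "__main__":
    start = datetime(2011, 10, 14, 12, 0)
    log = [("192.0.2.1", start + timedelta(hours=h)) for h in (0, 10, 40)]
    log.append(("192.0.2.2", start))
    print(ips_to_ban(log))
```

As the comment notes, a botnet that uses each address once or twice never reaches the threshold, which is why a rule like this catches so few of them.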

Barney Scott replied to this comment from Nestor | October 14, 2011 14:46

25:

You have to be very careful when requiring people to register to add a comment. You tend to run into the situation where you have to create yet another set of login details with a unique password and unless they are fairly invested in the topic / site then most people won't bother. OpenID is the best choice but, as you stated, very few technical people have heard of it or use it, let alone the normal web user.

It could be an interesting experiment for Charlie to require logging in for a month to track the fall in comments and see if they are still at an acceptable level. I would suggest that the quantity would drop by well over 50%.

As a side note for all the people who don't change their password for each site stop and think for a second. The likelihood is that you have just given a complete stranger the login details to your mailbox, which in turn will provide access to most of your online presence. Start using a password manager to get rid of the cognitive load involved in remembering so many alphanumeric strings.

Ms. Sunlight | October 14, 2011 14:50

26:

Social networking integration and services like Disqus seem to work reasonably well. I can't remember the last time I saw spam on a Disqus-enabled blog (for example Sociological Images) although that's only an outside impression; I don't run such a blog myself and I'm sure moderation and vigilance are still required.

It does however take a significant element of control out of the blog maintainer's hands, and the fact that it has a "like" feature enabled does seem to mean that people game their comments to get a high like-to-post ratio.

Charlie Stross replied to this comment from Bill St. Clair | October 14, 2011 14:51

27:

Two things.

Firstly, manual moderation won't work here -- my blog typically gets north of a hundred comments a day.

Secondly, your numerical captcha unfortunately discriminates against folks with dyscalculia, which is distressingly common. This may be fine if your audience are able to do mental arithmetic (as you'd expect if you're catering to programmers or mathematicians, or even folks who learned to pound a mechanical cash register back in the day), but it's a huge deterrent to a younger or more general audience.

towo | October 14, 2011 14:53

28:

And to think that you're actually spared from most of the blog spam that goes around since you're using custom software.

Try having a WordPress, it having a rather well-documented interface and all. Automated spammers will go apeshit on it, even for sufficiently low amounts of popularity.

Peter Van Eynde | October 14, 2011 14:54

29:

Maybe rejecting non-ASCII text would be an idea? Comments in Chinese would be of limited value for you and most of your audience, not?

nixar | October 14, 2011 15:00

30:

I take it the spammers haven't read your last book? Or it's not been translated to Chinese yet?

Moderator Alan replied to this comment from Peter Van Eynde | October 14, 2011 15:01

31:

There is actually only a small minority of such spam coming in. There is one spam with a small fraction of CJK in it a day ago, and one primarily CJK spam about two days ago.

I see no Cyrillic, Arabic, Thai or such. The flood is almost entirely English.

(It's also likely to peeve Feorag if you start filtering out the occasional posts she may make with Japanese in.)

Ben Thompson replied to this comment from Charlie Stross | October 14, 2011 15:07

32:

It's hardly surprising that you never see the same IP address often. The bot or person will receive a message and a set of websites (possibly 100's, probably 1000's) and try and post that message to all the sites.

Personally while I can't get my head around Disqus (it seems to want to track users across multiple sites and I can't have the pseudonym I always use) they do seem to be winning the war on spam. I think that's because the more spam you see the easier it is to identify and remove it.

Sylvia Sotomayor | October 14, 2011 15:15

33:

I've been using Spam Free Wordpress which has blocked all the automated spam. I've received all of two spam messages (entered by humans, I am sure) in the past two months. I hope there is something similar for other platforms.

I also moderate first-time commenters, but then my blog doesn't get many readers. :-)

Moderator note: this message was rescued from the spam bin

paws4thot | October 14, 2011 15:31

34:

I hate captchas with a passion; sometimes they're not actually legible even with (corrected to) 20/20 vision.

Also, on the "login" point, are people really so stupid that they use the same password for discussion fora, their e-mail, e-tailers they use...?

eth3rs | October 14, 2011 15:41

35:

Use the same method that Stack Exchange does. Users can flag spam. Users who correctly flag spam have their flags weighted higher and get badges and such ( http://stackoverflow.com/review/ ). Turn the whole thing into a game ( http://en.wikipedia.org/wiki/Gamification ). Spammers can't win a battle of attrition with people that are catching spam for fun (some people like the taste of spam and will actually eat it).

Moderator Alan replied to this comment from paws4thot | October 14, 2011 15:45

36:

It's not so much stupid as lazy. If a user has to log in on a number of different sites, having to remember passwords for all those different sites is a monumental pain. And it's all too easy for a site that they originally went to for mere discussion to evolve into something with real potential for damage — say, a book discussion blog might get into selling books to its frequenters.

(I'll admit to having a shared password on a few sites, though none of those sites has any information even mildly critical. The worst that could be ruined is my online reputation in a few places I don't really frequent any more.)

Charlie Stross replied to this comment from paws4thot | October 14, 2011 15:51

37:

Also, on the "login" point, are people really so stupid that they use the same password for discussion fora, their e-mail, e-tailers they use...?

According to recent research, yes they are (about 40% of them, anyway).

Charlie Stross replied to this comment from eth3rs | October 14, 2011 15:52

38:

Has anyone implemented this as a plugin for Movable Type?

If not, sounds like a lot of work ...

Jim Rodovich | October 14, 2011 16:04

39:

If spam defies regexes & other automated tests, maybe "Is this a piece of spam?" would make a good captcha. Follow the ReCAPTCHA approach of pairing a known spammy or non-spammy post with a recent, not-yet-classified post. Of course, each correct answer provides only one bit of information, or even less if the signal/noise balance isn't close to 50/50, so you'll want to have people moderate a few dozen comments before they're allowed to post...

More seriously, if most of your spam comes from botnet IPs that you haven't encountered before, could you turn that fact to your advantage? If Alice's IP address hits your RSS feed a few times a day, presumably she's an actual reader & commenter. If Bob's IP shows up for the first time requesting pink-stuff-patrol.html but no external CSS/image files, and then five seconds later posts a 200-word comment, then maybe he isn't one.
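The Alice and Bob comparison could be turned into a crude score along these lines. Everything here is hypothetical: the `Visit` fields, the five-words-per-second typing limit and the cutoff of two signals are illustrative guesses, not measured values.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    prior_requests: int      # earlier hits from this IP (RSS feed, other pages)
    fetched_assets: bool     # did the client request external CSS/image files?
    seconds_on_page: float   # time between loading the page and posting
    word_count: int          # length of the submitted comment


def suspicion_score(v: Visit, max_words_per_second: float = 5.0) -> int:
    """Count how many of three simple bot-like signals the visit shows."""
    too_fast = (
        v.seconds_on_page <= 0
        or v.word_count / v.seconds_on_page > max_words_per_second
    )
    signals = [v.prior_requests == 0, not v.fetched_assets, too_fast]
    return sum(signals)


def needs_strict_filtering(v: Visit, cutoff: int = 2) -> bool:
    """Route doubtful visits to stricter checks instead of rejecting them."""
    return suspicion_score(v) >= cutoff


if __name__ == "__main__":
    alice = Visit(prior_requests=12, fetched_assets=True, seconds_on_page=300, word_count=200)
    bob = Visit(prior_requests=0, fetched_assets=False, seconds_on_page=5, word_count=200)
    print(needs_strict_filtering(alice), needs_strict_filtering(bob))
```

The score only decides how much extra scrutiny a comment gets, so a reader posting from a new address is slowed down rather than locked out.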

eth3rs replied to this comment from Charlie Stross | October 14, 2011 16:12

40:

Write one or pay/request that someone write one (you have the powah!). As you've pointed out a few times, blog comment spam is a big problem. Movable Type users are a good chunk of the blog market.

Put out a Charlie Stross Challenge: The first person to write a good gamification spam fighting plugin wins: 'signed copy of a book' or 'gets their name in a book' or 'get a webpage on antipope proclaiming their greatness'.

vivtek | October 14, 2011 16:13

41:

I ran across a short article about building comment forms that screen out spam pretty effectively. It's not a Movable Type plugin as far as I know, but it might help: http://nedbatchelder.com/text/stopbots.html

To all the people talking about forcing login to stop spam: it doesn't. I've tried in the past. All you get is spam links on the user profiles and spam in the posts.

To the question of where have all the spamcops gone, long time passing: we tend to give up after banging our heads on the wall for a while. Periodically I get re-engaged with a spam problem, hearing that siren call of Big Data, but honestly, after the first few years of people not actually caring much, it gets tiring.

Jim Rodovich - tricks like that with IP addresses can sometimes help, but you still have the problem with false negatives when Alice is on vacation or reading from the airport. It's best to dump it all into some kind of Bayes evaluation, but that's hard to get right, and there's always the integrate-with-MT part to make things exciting.

Anyway, Charlie, if you want another pair of eyes or a halfway engaged antispam coder, you know where to find me. Good luck, though.

Arthur Chance replied to this comment from Charlie Stross | October 14, 2011 16:15

42:

@27: Secondly, your numerical captcha unfortunately discriminates against folks with dyscalculia, which is distressingly common. This may be fine if your audience are able to do mental arithmetic ...

Whereas I generally get seriously annoyed at web sites which don't cater for accessibility, I'm not sure I'd be that bothered about a simple single digit arithmetic task. (I'm quite prepared to accept that may be a considerable bias on my part - my first degree was called "mathematical physics" until the year before I matriculated.) If the would-be poster is on something we'd call a computer they'll have a calculator app to hand, and the extra effort would act as a threshold between spammer and serious poster. You might lose casual posters, but that could be either a good or a bad thing depending on the blog. If they were on a mobile phone form factor device it wouldn't work, but I can't imagine a human spammer using a smart phone just yet, and we haven't got phone botnets yet (although I'm not looking forward to the day they arrive), so discriminating by the browser identification should allow you to avoid challenging mobiles.

anonemouse replied to this comment from Jim Rodovich | October 14, 2011 16:18

43:

And those of us who live behind firewalls with multiple exit nodes and who run RequestPolicy can go whistle? Categorizing on browser behaviour/IP regime is like categorizing fake versus real names. How well is that going for the preeminent web-technology company today?

Barney Scott | October 14, 2011 16:29

44:

The problem doesn't end even if you have a wonderful advanced automatic filter that has solved Natural language processing (http://en.wikipedia.org/wiki/Natural_language_processing) and so can determine the relevance of any comment in addition to the other approaches above. As bellingham @21 pointed out, not all real comments are relevant to the discussion at hand but need to be missed by the filter. 8 and 20 are good examples of this.

In addition the spammers will have these tools available to them as well. So they start generating reasonable comments possibly even seeded with appropriate links in addition to the one they want followed. What do you do now?

Jim's idea @39 sounds very workable, although it would take a while to get up to speed. If it becomes popular enough to start working on a few of the bigger sites then it becomes less useful. As a spammer you simply start priming your bots to visit target sites a few times before posting a comment.

Another possible approach would be to have the spam filter pull out & resolve any URLs (to get rid of shortening tricks) and check them against something like the Spamhaus DBL.

marc sobel | October 14, 2011 16:50

45:

I've only read about it from one source, so I am not sure it works as well as advertised but there is apparently a distributed spam fighting system called ATHENA http://www.amazon.com/gp/product/B004Y3I6XW

Nestor | October 14, 2011 16:54

46:

Actually the value of the login is to make sure real posters are whitelisted; as Charlie pointed out, the damage is losing real comments to the automated guardians. I'd still run Akismet or something like that alongside it (I have run a forum and I know registration gets you a lot of interesting usernames with .ru addresses).

Anatoly | October 14, 2011 17:05

47:

So, how long before Spamnet becomes self-aware and launches a spam attack against Russia?

Alex | October 14, 2011 17:07

48:

This MT plugin is an oldie but goodie. It resolved the problem with our company blog, which lives in a VPS - when we got a heavy spam run, although TypePad Antispam did a good job of identifying spam, the requests would pile up waiting for the TypePad API to respond. Being CGI, of course, it's all one thread. Eventually it would trip an automated alert and the hosting firm's robo-admin would kill the MT server.

The plugin mines the spam pile and blocks IP addresses by updating the .htaccess file, so the blocking occurs at the webserver level before requests hit the MT server. I turned up filter aggression so that everything went into the spam file rather than 99% into manual review. Since then, no problems, with a few thousand IPs blocked. (I set the filter threshold to 3 spams, ISTR)

Jim Rodovich replied to this comment from anonemouse | October 14, 2011 17:09

49:

There's no reason an IP address regime would have to be applied as a blacklist. The example I gave even hints at that: the filter might be reasonably confident that someone posting from Alice's IP is indeed an interested reader, but not at all confident that the same is true for a post from Bob's IP. So Alice's post could bypass other, more stringent filters to which Bob's post would be subjected.

John Batey | October 14, 2011 17:25

50:

I've proved to myself that, as Barney said (@44), 'topic analysis/filtering' wouldn't work all that well. I've been working on a news summarization system for only a few days and, if I feed it this page instead of a bunch of news articles, the output looks on-topic. Although, it takes 7 seconds to analyze the existing text and pop out new text... this would probably dissuade anyone from using it for spam, similar to HashCash.

The Spamhaus DBL suggestion is much more doable and fairly easy. A summary of the needed code:

Step 1: $links = $dom->getElementsByTagName('a');
Step 2: http://w-shadow.com/blog/2008/07/05/how-to-get-redirect-url-in-php/
Step 3: http://thecodecave.com/2006/12/02/build-your-own-spam-filter-with-php-and-dnsbls/

Assuming this server also has PHP, it looks like a day's job. I'd guess a similar work load would hold if you wanted it done in Perl.
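For comparison, here is roughly what steps 1 and 3 might look like in Python instead of PHP. It is a sketch under several assumptions: the redirect-resolving step 2 is left out because it needs network access, the resolver is passed in so the lookup can be tried offline, and treating any answer in `127.0.1.x` as "listed" is a simplification of the DBL return codes that should be checked against Spamhaus's own documentation.

```python
import socket
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Step 1: collect the hostnames of every <a href="..."> in a comment."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            host = urlparse(href).hostname   # None for relative links
            if host:
                self.hosts.add(host.lower())


def linked_hosts(comment_html: str) -> set:
    parser = LinkCollector()
    parser.feed(comment_html)
    parser.close()
    return parser.hosts


def is_listed(host: str, resolve=socket.gethostbyname, zone: str = "dbl.spamhaus.org") -> bool:
    """Step 3: a DNS blocklist lookup is an ordinary query for host + zone."""
    try:
        answer = resolve(f"{host}.{zone}")
    except OSError:
        return False          # no such record: the domain is not listed
    return answer.startswith("127.0.1.")


if __name__ == "__main__":
    html = '<p>Great post! <a href="http://cheap-boots.example/buy">boots</a></p>'
    print(linked_hosts(html))
```

Keeping the lookup behind a `resolve` parameter also makes it easy to run this as a separate job against the comment database, as suggested for avoiding upgrade breakage.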

You can avoid upgrade syndrome by not having it integrate with MovableType. Have it reference/prune the same database, but otherwise be a completely separate job.

Now, the part I have no idea how to fix is the spam comments that don't even have links. I'm damned if I know why anybody even bothers to submit them, but they make up a good portion of spam on some sites I admin.

Sheila | October 14, 2011 17:36

51:

"If the would be poster is on something we'd call a computer they'll have a calculator app to hand, and the extra effort would act as a threshold between spammer and serious poster."

but someone could write something that passes any captcha that looks like an equation to Google and grabs the calculator result. one + two gives three. e to the i pi plus one, 0.

don't allow weird character sets? write a romanisation function.

Dirk Bruere | October 14, 2011 17:37

52:

As a mod for the Zero State forum I am getting seriously pissed off with it. We have made a decision to automatically bounce anything with a URL.

David L | October 14, 2011 18:19

53:

As another data point, there's a blog I keep running for a friend using WordPress. It has 3 or 4 posts a week with 10 to 30 legit comments most days.

We manually toss less than 5 SPAM comments per day. But the one simple filter we run blocks over 120 comments per day. Ugh.

We don't require registration at this time.

Alain | October 14, 2011 18:23

54:

This report of your current travails explains the odd things I've been seeing in my inbox.

I've been using Web-based Yahoo mail as my main email service for nearly 15 years. A few years ago I also added Gmail as a secondary mailbox, next to my "tertiary" mailbox in hotmail/livesomething.

Normally the only time I see some spam is when I go fishing in the spam bin for things which should not have ended there. Yesterday I actually saw spam in my normally clean inbox. This means that the onslaught is becoming impressive, somewhere behind the scenes.

Still, I prefer seeing a bit of spam now and then than losing messages from friends. Bell Canada is one of those corporations that uses the service provided by Microsoft livesomething for giving Web mail to its customers, and when I send mail from Yahoo to a friend who's on that thing the spam filter sometimes gobbles it up without a trace. Bell can't/won't explain this random behaviour. So, when I want to be sure my own emails won't be gobbled up I have to go and log in to my own Bell Web mail account when I send off an email to somebody with Bell Web mail.

Frixos Kopsachilis | October 14, 2011 18:27

55:

I remember reading a story in Analog where the spam was so massive that the hero solved the problem by reversing the approach. He turned off spam filters and then used data mining to determine the legitimate posts :)

Helen | October 14, 2011 19:04

56:

Akismet is a crowdsourced spam filter for blogs. I have been using their WordPress plugin with great success. They do have a MT plugin as well.

andrew.ducker.org.uk | October 14, 2011 19:23

57:

I just checked my Gmail spam folder. Apparently I have 1279 items in it. I assumed that most people got that much. Apparently I need to do a poll.

Good luck finding an answer to it!

Lodore replied to this comment from Dirk Bruere | October 14, 2011 19:26

58:

Out of curiosity, do URLs hyperlinked with a href attribute get moderated too, or is it just naked URLs? It's not that I plan to spam Zero State or anything, just interested in knowing.

Dirk Bruere replied to this comment from Lodore | October 14, 2011 19:42

59:

Right now it's hand moderated, with a requirement for membership and 3 posts before an automatic pass-through. However, we propose that anything that looks like a URL be bounced. Fortunately the mailing list is where all the action is right now, so it's not a big problem at present, but it could be in future.

BTW @ andrew.ducker.org.uk, my gmail spam box sits at 1460.

Juho Snellman | October 14, 2011 19:45

60:

Pretty much any custom solution will work, nobody will bother customizing their spambot for a pagerank 5 site. I used to get hundreds of spam posts a day. Over six years, a single one got through the simple text captcha of "please type the word 'foo' into this field". And it was even a constant 'foo', never needed to make it vary over time.

Chrisj | October 14, 2011 19:56

61:

Out of interest, is there a reliable pattern to the spam based on time of day? I've recently seen a huge increase in the amount of email spam rejected by my mailserver - but only between 0600 and 2400 GMT; there's total silence for six hours starting at midnight GMT.

Peter Amstutz replied to this comment from eth3rs | October 14, 2011 20:44

62:

I like the idea of crowd sourced spam filtering. Put a "flag as spam" button next to each post, and if it is flagged more than N times (for some small N greater than 1) then it goes away. It's actually rather satisfying to hit the flag button, like you're sticking it to the spammer.

A few possible drawbacks:

A few people will still see the spam
The botnet could turn the system around and flag legitimate posts, so you would probably also need some kind of reputation system so that only legitimate users can flag posts.
A few people still have to pay attention to the page. A spammer could dump a bunch of junk on an old post nobody is reading any more.

It's rather ironic: people stopped using Usenet and email lists in favor of web forums in part on account of the spam, but blogspam is just as bad. Worse, in some ways.
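
The flag-button idea, together with the reputation requirement from the second drawback, could be sketched roughly as follows. The value N = 2 and the notion of a "trusted" user set are assumptions for illustration only.

```python
N = 2  # "some small N greater than 1"; the exact value is an assumption


def is_hidden(flaggers: set, trusted_users: set, n: int = N) -> bool:
    """Hide a post once more than `n` distinct trusted users have flagged it.

    Flags from accounts outside `trusted_users` are ignored, so a botnet
    of fresh accounts cannot bury a legitimate post.
    """
    trusted_flags = flaggers & trusted_users
    return len(trusted_flags) > n


trusted = {"alice", "bob", "carol", "dave"}
print(is_hidden({"alice", "bob", "carol"}, trusted))       # three trusted flags
print(is_hidden({"alice", "bob", "bot1", "bot2"}, trusted))  # only two trusted flags
```

Using sets means one user flagging the same post repeatedly still counts once.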

Charlie Stross replied to this comment from Peter Amstutz | October 14, 2011 21:05

63:

You missed: griefers will use it to get at commenters they disagree with, or just for lulz.

scott-sanford | October 14, 2011 21:21

64:

As a person who includes links in his posts, I certainly hope Charlie's software doesn't get too many false positives. My posts will pass muster with another human, but robots do strange things.

Naturally, xkcd has already offered a solution.

Nestor | October 14, 2011 21:37

65:

Spammers already exploit basic human gullibility with comments full of friendly praise and compliments. Spambots are much friendlier than the average human, at this point.

Their praise still lacks a little focus ("Great writing!" for a blog without text), but I imagine the next generation will probably be able to actually hit in the ballpark of the actual activity you're doing.

asuffield.myopenid.com | October 14, 2011 21:37

66:

I've been amused for some time by the extent to which defeating captchas has turned into a business. Several dodgy-but-real companies are at it now, competing over "number of cents per thousand solutions" and "number of seconds to solve".

I'm not sure whether they're using the porn trick to do it, or just employing children in countries that are so poor, a cent will buy you lunch.

theDAWG | October 14, 2011 22:02

67:

Have the comment form demand a trivia question about your books. That way the spammers have to pay you to post up in here. (Or at least pay one of your fans...)

Speaking of which, how come Miriam didn't [-SPOILER-][-SPOILER-] recognize the corvee information trap as a trap? Even if that slip-up was intended to show a change in her character, how come her personal assistant didn't spot it and tell her? I think that was the least realistic part of the whole series. Miriam the financial journalist would have recognized the honeypot immediately.

Charlie Stross replied to this comment from theDAWG | October 14, 2011 22:05

68:

I just added a new plugin that should prevent spambots from directly injecting comments into the system via the comment script. It shouldn't require any human interaction, but if it does, a simple text captcha is coming next ("are you human? If so, type 'yes'").

In answer to your question at Miriam, the word you are looking for is "hubris". (She persistently underestimated the Clan until ...)

eth3rs replied to this comment from Charlie Stross | October 14, 2011 22:06

69:

The system described has been fairly robust against griefers on the stackexchange sites (even ones like crypto.stackexchange.com with low numbers of users). The idea is that you get points for good flags and lose points for bad flags. Griefers would have to work really hard to get a high flag weight, and they would only get to spend that flag weight once or twice. In the end it is not a 'profitable' griefer activity (they would reduce more grief than they cause).
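
The "points for good flags, lose points for bad flags" idea can be illustrated with a small Python sketch. This is not StackExchange's actual algorithm; the class, the reward and penalty values, and the function names are all assumptions chosen to show why a griefer's weight runs out quickly.

```python
from dataclasses import dataclass


@dataclass
class Flagger:
    name: str
    weight: float = 1.0  # starting weight is an assumed value


def settle_flag(flagger: Flagger, was_spam: bool,
                reward: float = 0.5, penalty: float = 1.0) -> None:
    """Adjust a user's flag weight after a moderator reviews the flag."""
    if was_spam:
        flagger.weight += reward
    else:
        # Bad flags cost more than good flags earn; weight never goes negative.
        flagger.weight = max(0.0, flagger.weight - penalty)


def flag_score(flaggers: list) -> float:
    """Total weight of everyone who flagged a post."""
    return sum(f.weight for f in flaggers)


griefer = Flagger("griefer")
settle_flag(griefer, was_spam=True)    # earns a little weight
settle_flag(griefer, was_spam=False)   # one bad flag costs more than that
settle_flag(griefer, was_spam=False)   # a second bad flag zeroes the weight
print(griefer.weight)
```

Because the penalty exceeds the reward, weight built up slowly through good flags is spent after one or two abusive ones.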

magetoo replied to this comment from Sheila | October 14, 2011 22:21

70:

don't allow weird character sets? write a romanisation function.

That's not really an option. I threw away a snarky response with an attempt at illustrating the problem and all, but the short version is that romanisation is just not acceptable generally. For the specific cases where it is, it still is a hard problem. (A Hard Problem, in capital letters, too.)

And it can't work if you still want to be able to have some symbols for their own sake (math, talking about language, etc) or be able to spell things out ("google for _").

At least the Chinese spam is (apparently) identified as such, so it seems it won't be necessary to worry too much about what to allow and what to block...

Toby replied to this comment from Nestor | October 14, 2011 22:46

71:

Hmm, sentiment analysis looking for overly friendly posts to detect spam?

Nestor replied to this comment from Toby | October 14, 2011 23:18

72:

And so we end up with the opposite of xkcd's utopia - only harsh criticism and confrontational statements are recognized as being human :)

George William Herbert | October 14, 2011 23:34

73:

I don't focus on this anymore, but I moderated some Usenet newsgroups for 15 years until I was driven underwater by spam past the filtering I had in place and had to give it up.

Special black place in our hearts for them, we have. Yes.

rich! | October 15, 2011 00:14

74:

can you not use a variant of greylisting for comments? works really well on our mail server...

John Mark Ockerbloom | October 15, 2011 01:03

75:

Akismet seems to catch most of the spam sent to my blog, only letting an occasional one through (which I can kill manually), and incorrectly flagged legit posts are rare.

I don't know if its accuracy is based on the software alone, or if it's also using all the other blogs hosted on wordpress.com for comparison. (Most high-volume spam operations hit a lot of blogs at once, so a site that hosts lots of blogs could use the spam identified on one blog to help ID and squash similar spam on others. IIRC, Google does something similar to catch spam on its various Gmail accounts.)

Poul-Henning Kamp | October 15, 2011 09:34

76:

On the varnish.org site, we added a question where you need to show a very basic knowledge about what the Varnish HTTP cache is. That instantly killed all the robots and outsourced eyes&fingers.

The trick is to make it non-obvious for people outside the intended audience what the possible set of answers are, to make brute-force unappealing, but make it trivial for the intended audience to find the answer, if they don't already know it.

Greg. Tingey | October 15, 2011 10:00

77:

limeyrock @ 3 And if you don't know ANY F1 racing teams, or even if you do, don't give a flying fuck about so-called "motorsport"? VERY bad example there, methinks.

General note I've just logged in - last log-in was yesterday @ about this time. My commercial SPAM-filter has picked up 6 in that period. And I'm just Joe Bloke, from their POV. OTOH, said simple commercial filter does seem to work. Um. I suspect, in the end, a serious international commercial treaty is going to be needed. Except, of course, the USSA ( NOTE * ) will refuse to sign. The idiots STILL haven't learnt the lesson of the CSS Alabama

@ 32 and others Disqus? Forget it. The Daily Telegraph uses it, and it's a complete bastard, from an outside user's POV.

Dirk @ 52 You auto-bounce anything with a URL On a WEB discussion? So how do people redirect, or send useful links, then? Not clever, IMHO

NOTE * "USSA" as in the Beatles song, ... back in the USA/back in the USSAH!

Charlie Stross replied to this comment from rich! | October 15, 2011 10:47

78:

The dynamics of conversation in a blog discussion are such that greylisting -- aside from it not working at all because commenters are not SMTP servers -- would disrupt the discussion completely.

Nestor | October 15, 2011 10:52

79:

I've taken to posting links in plaintext here; it saves me from getting caught in the queue, and I figure anyone here interested knows how to cut'n paste a link. Most browsers have highlight text>go to link functionality as well nowadays.

Mort | October 15, 2011 11:10

80:

This is probably not immediately relevant to our gracious host's problem, since he'd like something off the shelf, but it seems to me that the optimal way of dealing with the issue is to do the spam filtering at comment-entry time. If the comment passes, accept it without further issue. If it fails, present the poster with hoops to jump through, either to demonstrate their humanity (i.e., some form of captcha), or to impose a small cost to make the transaction annoying to people posting (a message stating either "we're not sure you're human, please wait 20 seconds and then push 'resubmit'", or simply "server error, please wait etc.").

This would have the advantage of allowing stricter tests for posters of suspect posts without applying those costs to everyone, while allowing the paranoia of the spam filter to be turned up pretty high, as legit posters would be able to rescue their posts on their own.

Downsides would be additional processing cost at point of comment entry and giving spammers immediate feedback as to whether their posts were triggering the filter.
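
The filter-at-entry flow described in this comment might look something like the Python sketch below. The spam score is assumed to come from some existing filter, and the 0.5 threshold and the return labels are placeholders, not values from any real plugin.

```python
def triage(spam_score: float, passed_challenge: bool = False,
           threshold: float = 0.5) -> str:
    """Decide what to do with a comment at submission time.

    spam_score: output of whatever spam filter is in use (0.0 = clean,
    1.0 = certainly spam). The scale and threshold are assumptions.
    """
    if spam_score < threshold:
        return "accept"      # clean comments pay no extra cost
    if passed_challenge:
        return "accept"      # a legit poster rescued their own comment
    return "challenge"       # show a captcha or a "wait and resubmit" page


print(triage(0.1))
print(triage(0.9))
print(triage(0.9, passed_challenge=True))
```

Only suspect comments ever see the challenge, which is what lets the filter's threshold be set aggressively.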

Samuel Erkison replied to this comment from Greg. Tingey | October 15, 2011 12:42

81:

You might want to read limeyrock's comment again. Here's the relevant part:

'I'm a fan of the non visual (logical or content specific) captchas personally; "What is the opposite of up?" or the "Which of these is not an F1 racing team?" on a Formula one forum, etc.'

I've bolded the important bit in case you can't figure out why you might owe limeyrock an apology.

Dirk Bruere replied to this comment from Greg. Tingey | October 15, 2011 12:47

82:

"You auto-bounce anything with a URL On a WEB discussion? So how do people redirect, or send useful liks, then?"

We will not, on the forum. We have a FaceBook page for that kind of stuff.

Melvyn Barker replied to this comment from Dirk Bruere | October 15, 2011 13:08

83:

Some of us don't use Facebook or other networking sites.

And I find the links posted in comments on this blog often take me to interesting and informative sites I wouldn't be aware of otherwise.

The breadth and depth of knowledge displayed by people who post here is one of the things that puts this blog top of my daily to read list.

Melvyn

Charlie Stross replied to this comment from Mort | October 15, 2011 13:31

84:

Mort, that's how Movable Type works already.

Charlie Stross replied to this comment from Melvyn Barker | October 15, 2011 13:34

85:

Some of us are sort of required to use Arsebook and Titter by the very nature of our jobs.

It gives my publishers' marketing folks the warm fuzzies if they can see $AUTHOR pushing their wares on the web, and some of them get a glazed, slightly worried look if you corner them and earnestly explain how running a social hub of your own with 15,000 interested daily readers is more valuable than being grist for Zuckerberg's advertising revenue stream. Marketing folks are not necessarily as internet savvy as they think they are. So don't be surprised if over the next month or so you see FB and Twitter buttons showing up in the blog sidebar.

On the other hand, you don't need to worry about me abandoning the blog. On the contrary: I'll be updating it to auto-broadcast my effusions here to FB and T, rather than the other way round.

Dirk Bruere | October 15, 2011 13:52

86:

I think we have a simpler job on the ZS forum in that it is intended to be there to discuss ZS business. It might have started differently, but that's how it seems to be ending up. So eliminating URLs would not significantly diminish the experience. It is a much narrower forum than this one.

Dirk Bruere | October 15, 2011 13:53

87:

Is twitter actually useful for anything?

Charlie Stross replied to this comment from Dirk Bruere | October 15, 2011 14:03

88:

Twitter is basically broadcast SMS messaging over TCP/IP, far as I can tell. You can broadcast at people who follow (opt in on) your tweets, or you can follow other people and receive their broadcasts.

No idea what the social applications are like as I've only had a twitter account for 48 hours.

Dirk Bruere replied to this comment from Charlie Stross | October 15, 2011 14:36

89:

I got one and all I seem to get is upper class spam. Can't imagine the type of person who would check their twitter account every few minutes in case they missed some of the dross.

Dmytry | October 15, 2011 15:25

90:

You can rename the form fields and keep the old form fields invisible, some with preset values, as 'honeypots'; reject if those are altered.

The automatic generation of questions is what makes captchas so incredibly annoying and is also quite unnecessary for a blog. You can add a form field where one has to e.g. enter the square root of minus one - a handcrafted question - and you'll probably be fine. It is worth a spammer's time to spam all MovableType blogs, but it is not worth the spammer's time to spam your blog personally, if its software is in the slightest bit customized.
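
A rough Python sketch of the renamed-fields honeypot from the first paragraph. The field names and preset values are invented for illustration; a real form would use whatever names the stock software originally shipped with.

```python
# Old, well-known field names kept in the form but hidden from humans,
# mapped to the preset value each one is rendered with (assumed values).
HONEYPOTS = {"email": "", "url": "http://"}


def tripped_honeypot(form: dict) -> bool:
    """Return True if any hidden honeypot field is missing or was altered.

    A browser submits hidden fields unchanged, so a human's form always
    carries the presets; a bot filling in the familiar names gives itself away.
    """
    return any(form.get(name) != preset for name, preset in HONEYPOTS.items())


human = {"email": "", "url": "http://", "reply_text_x7": "Good point."}
bot = {"email": "seo@example.com", "url": "http://spam.example", "reply_text_x7": "Great writing!"}
print(tripped_honeypot(human))
print(tripped_honeypot(bot))
```

Treating a missing honeypot field as "altered" is a design choice here, aimed at bots that post straight to the comment script without loading the form.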

heteromeles replied to this comment from Charlie Stross | October 15, 2011 16:02

91:

Which CharlesStross are you on Twitter, if I may be so bold as to ask?

Charlie Stross replied to this comment from heteromeles | October 15, 2011 16:09

92:

None of the above: I'm @cstross.

karrde replied to this comment from Charlie Stross | October 15, 2011 17:18

93:

There's another blogger I read who does that. (He doesn't directly post to twitter, ever, that I am aware of. But his blog-toolkit automatically Twitters a summary of every blog-post.)

I don't read Twitter, so I didn't notice except when he announced it in a blog-post.

Re: Spam. I don't know what to say.

I had a different problem. For a while, I ran a machine with a public-facing SSH service. I used the denyhosts plugin. (for info, see http://denyhosts.sourceforge.net )

Every once in a while, I would look at the logs, and see dozens of login attempts per hour from some obscure outside-of-the-US IP.

It's as bad as spam, and all they wanted was my machine.

Charlie Stross replied to this comment from karrde | October 15, 2011 17:37

94:

I run denyhosts too.

What you need to do URGENTLY is to set yourself up for public key authentication to log in on your host, then disable password logins. Otherwise the ass-hats doing the distributed botnet enabled password-guessing attacks will get lucky (because you may be as good as gold at setting a password, but someone else on the same machine will sooner or later leave the back door open by using an easily-guessed password).

On the other hand, if you don't use ssh for logging in to a server that's accessible over the internet, you're not at risk.

(Hint: there are two ways to log into this server of mine -- one involves public key authentication over ssh, the other involves public key authentication to an ISP's terminal server then login on the serial console. The password guessing attacks haven't stopped, but they're not going to get access because even if they magically guessed my password the ssh server won't let them log in remotely with it.)

sysadmin1138 | October 15, 2011 21:17

95:

Wowza, I use MT for my own blog for much the same reasons you do, but I seem to get a lot more spam. In the 7 hours since I mucked out the spam bins this morning I already have 146 messages in there. Lately, a lot more have been falling into my 'needs approval' queue and are marked Ham by Akismet, most of them are huge blocks of text with a few key words and a couple links in the URL bar. In the last couple weeks, my 7 day spam rate is about ten thousand. And this is on a blog with two to three orders of magnitude less traffic. So I'm watching this thread and looking for tips.

I'm also a moderator on one of the StackExchange sites so have a much more detailed view on how their spam system works since I do a large part of the spam-hammering. That system only works for a very busy blog, and even then it'll require admin attention.

I also greatly wish for a content-validator for MT that has post-length as a filter.

On the smaller StackExchange sites, it's a moderator that eventually hammers all spam since there aren't enough users flagging to kill messages normally. Spam-flagging merely brings it to the attention of the moderator.
On the larger StackExchange sites, posses of users tend to band together as content police which takes some of the load off of the moderators.
On the largest StackExchange site (StackOverflow I'm looking at you) ad-hoc posses of users and sock-puppets can band together to grief others in content wars. Moderators get involved when they invalidate flags and undo griefing.

How many posts you can spam-flag varies based on how well you've been flagging in the past. Those content-posse members tend to have high flag weights and lots of votes. Brand new sock-puppets have very few votes.

Charlie Stross replied to this comment from sysadmin1138 | October 15, 2011 21:25

96:

most of them are huge blocks of text with a few key words and a couple links in the URL bar.

Yup, that sounds very familiar! Did it start up in the past couple of weeks?

sysadmin1138 replied to this comment from Charlie Stross | October 16, 2011 02:32

97:

It most certainly has. Three months ago I was doing about 3000 a week for spam, now it's up to 10K. I noticed a lot more getting past Akismet & Co. and had to go to full moderation for unauthenticated comments as a result. My comment rate makes that manageable, but still. In fact, this week has been particularly bad.

Between this comment and my last one, I'm now up to 319 in the bin. Just like an email server, I spend more CPU cycles keeping junk out than serving readers.

TRX | October 16, 2011 02:52

98:

Charlie mentioned:

word salad spams

Years ago I looked at valid incoming email versus spam and realized that a simple grammar checker could catch the majority of it. At the time I didn't find anything, but DOS grammar checkers were available in the 1980s, so I expected some type of filter would be available to filter between the MTA and MUA.

Hmm, a quick web search shows a couple of possible candidates now. I might check them out if I get the urge to join some mailing lists again. I don't know if it's a practical method for filtering on a blog, but given the literacy level of the people who post here I would expect it would be very effective.

Joan Slonczewski | October 16, 2011 04:58

99:

Thanks for all the useful tips on this discussion--lots to think about, for those of us starting a blog.

On another topic--

I gather a lot of the powerpoints and papers presented at the conference will become available online over the next days and weeks. I'll try and link to them when they do.

Any of those starship/space colony papers posted yet?

Spiros Bousbouras | October 16, 2011 08:20

100:

Charlie should write a letter to President Obama asking him to declare a War on Spam. In my experience, if the U.S. Army can't fix a problem no one can.

On a more serious note, how about including in every opening post of a thread a phrase like "In order to reply the first line of your reply must have some-random-string". Attempted posts which have the random string on the first line get accepted, the rest get rejected. You can even make it that the random string gets filtered out before the post appears on the site. Seems simple and effective to me. It also guarantees that aspiring posters must have used a reasonable amount of attention in reading the opening post.

By the way, I just tried to sign up, thrice. I get a screen which says that a confirmation email has been sent to my account but I haven't received it. Yes, I did check my spam folder.
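
The random-string idea from this comment, sketched in Python under some assumptions: the token is generated once per thread with the standard `secrets` module, and the function names are made up for the example.

```python
import secrets
from typing import Optional


def new_thread_token() -> str:
    """Generate the random string to publish in a thread's opening post."""
    return secrets.token_hex(4)  # 8 hex characters


def accept_reply(body: str, token: str) -> Optional[str]:
    """Return the cleaned reply if its first line carries the token, else None."""
    first_line, _, rest = body.partition("\n")
    if token not in first_line:
        return None
    # Strip the token so it never shows up in the published comment.
    cleaned_first = first_line.replace(token, "", 1).strip()
    return (cleaned_first + "\n" + rest).strip()


token = "x7f3"  # fixed here for the demo; normally new_thread_token()
print(accept_reply("x7f3 I agree with the post.\nSecond paragraph.", token))
print(accept_reply("Great writing! Visit my site.", token))
```

Rejected replies return `None`, leaving it to the caller to decide whether to drop them or queue them for moderation.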

Spiros Bousbouras replied to this comment from Greg. Tingey | October 16, 2011 08:39

101:

I suspect, in the end, a serious international commercial treaty is going to be needed. Except, of course, the USSA ( NOTE * ) will refuse to sign. The idiots STILL haven't learnt the lesson of the CSS Alabama

Since earlier in your post you complain about unwarranted assumed familiarity of people with Formula 1, how about people who are not familiar with either the Beatles song or CSS Alabama? I had a quick browse through the wikipedia page on the latter and I didn't see a connection with international treaties.

Rosie | October 16, 2011 09:17

102:

How much time and resources are wasted with all this spam? How much of this effort can be usefully re-directed to solve some of the world's problems, like hunger, poverty and global warming? Just stopping the spam being sent out will save on our overall energy bills... as far as I'm concerned spammers are just bullies hiding behind data masks.

magetoo replied to this comment from karrde | October 16, 2011 09:19

103:

Another thing you can do, if you don't have to support logins for anyone but yourself – and after disabling password logins – is moving the SSH daemon to a different port number. That'll stop the botnets even wasting CPU cycles when they connect; when I did so, it cut attempts to zero.

Greg. Tingey | October 16, 2011 09:49

104:

Rosie @ 101 My point, exactly.

The (probable) only way out is a binding International Treaty.

I'm assuming there must be SOME profit in Spam, otherwise why is it being done? cui bono indeed. I wonder what the profit-margin is on Spam - is it really worth it for the spammers, or have they actually locked themselves into an illusory zero-sum game?

Any real figures on this, anyone?

rich! replied to this comment from Charlie Stross | October 16, 2011 09:58

105:

Not SMTP servers, true, but the set {browser setting +OS+software versions+accessible cookies+source IP} is presumably pretty defining - didn't EFF do a demo of how good that was at identifying your machine?

So you could use that, plus the name/email address/url combination, as your greylisting key. And just require the equivalent of a second click after a few seconds ("no I'm a spammer" "yes I'm a human") for greylisted IDs.
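
One way to read this proposal, as a Python sketch: hash the browser fingerprint together with the name/email/URL into a key, and make an unknown key wait a few seconds and resubmit before it is accepted. The class name, the 5-second delay, and the in-memory storage are assumptions for illustration.

```python
import hashlib
import time
from typing import Optional


def greylist_key(user_agent: str, ip: str, name: str, email: str, url: str) -> str:
    """Combine fingerprint and identity fields into one opaque key."""
    raw = "\x1f".join([user_agent, ip, name, email, url])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Greylist:
    def __init__(self, min_delay: float = 5.0):
        self.min_delay = min_delay  # seconds before the second click counts
        self.first_seen = {}        # key -> time of first attempt
        self.known = set()          # keys that already passed once

    def check(self, key: str, now: Optional[float] = None) -> str:
        """Return "accept" for known keys, "retry" for first-time ones."""
        now = time.time() if now is None else now
        if key in self.known:
            return "accept"
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            self.known.add(key)
            return "accept"
        return "retry"


gl = Greylist()
key = greylist_key("Firefox/7.0", "192.0.2.1", "rich!", "rich@example.org", "")
print(gl.check(key, now=100.0))  # first attempt
print(gl.check(key, now=106.0))  # second click after the delay
```

Once a key has passed, later comments from the same combination go straight through, so regular commenters would only see the extra click once.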

Charlie Stross replied to this comment from Spiros Bousbouras | October 16, 2011 10:23

106:

I don't expect the US military to be any better than anyone else at winning a war on an abstraction. (Indeed, the whole 'war on [abstraction]' seems to have done very well for the military contractors and police agencies, but I don't expect the wars on drugs and terror to be won any time soon -- the first, because drugs are a movable feast (what about alcohol?) and demand for them is one of the constants of human nature, and the latter because terrorism is a movable feast (look how many respectable politicians are ex-'terrorists') and it seems to be one of the constants of human nature (insofar as repressive regimes generate opposition, and if the opposition turns violent then whatever tools they use are categorized as 'terrorism').)

NB: Need morning tea before I post.

Ms. Sunlight replied to this comment from Greg. Tingey | October 16, 2011 10:53

107:

I think that, if you're someone who would never respond to spam, it can be hard to understand where the profit margin is, but remember that it only takes a few people per million messages to click through and buy.

I work with a woman who fell for one of those "The secret to white teeth with 2 common products" scam ads on a website, and I never would have thought an intelligent person would fall for that, so there you go.

The costs of sending spam are negligible, especially since generally speaking, the spammers aren't the ones paying. Used to be there were notorious spamhaus ISPs that allowed people to exploit them (wanadoo.fr was one of them) but it's more sophisticated than that now. If they're running a botnet, ordinary home and business users are doing it for them for nothing, all unaware.

Anyway, a quick search gets you this 2003 article from El Reg. I assume the principles still hold true.

J B Tait | October 16, 2011 13:04

108:

Our tools indicate that my server is not originating spam but I am getting 65,000 backscatter a day, so it is obvious that my domain is being spoofed and probably thousands of millions of illegitimate messages attributed to it. I am dismayed by the costs associated with this behaviour, and wish there were a way to bring a class action against the perpetrators to recover their ill-gotten gains.

The custom email address I have given you is part of my spam control efforts. By giving everyone their own eddress and using white lists, I can capture my good messages rather than trying to filter out the cruft, and if one becomes tainted (as a spam source, for instance), it doesn't contaminate the entire pool.

In similar fashion, perhaps you could include a keyword in your post that would be entered in a field in the comment indicating that the commenter has actually read the post they are responding to, and use that field to gather the sincere comments and ignore the rest?

ehedgehog replied to this comment from Dirk Bruere | October 16, 2011 16:57

109:

One use of Twitter is chit-chat to chums. Another is announcing releases of something (eg software or music). Or "hey, how are you doing? Did the interview go well?".

I've seen very little spam, but I'm likely too tiny a target.

Dirk Bruere replied to this comment from ehedgehog | October 16, 2011 17:49

110:

"Another is announcing releases of something (eg software or music)."

Over and over again ie spam

michael | October 16, 2011 21:38

111:

Re Twitter: Two possible answers:

* a few very interesting people use it to, essentially, curate links. Examples would be William Gibson (@GreatDismal) or Bruce Sterling (@bruces).
* Twitter exists so Warren Ellis can put his disease in us. (note: this references a t-shirt)

Nestor | October 16, 2011 23:37

112:

Apparently 3% of twitter is at all times dedicated to Justin Bieber

thewordofward | October 17, 2011 09:39

113:

Hi there,

I'm very impressed that you have such a thriving comments section. I have two websites and due to spam, I am starting to give up allowing comments. The main site gets so much that it was ending up being hundreds a day. So I disabled all comments.

The second one is a science fiction site, a bit more niche, which currently gets about 25 spammers a day. Which is close to the limit I can be bothered with.

It's a sad state of affairs though. If I could be bothered, I would learn how to hack and would wreak bloody revenge on every site that spams me. But I can't.

Some good tips here though.

errolwi replied to this comment from Dirk Bruere | October 17, 2011 11:27

114:

Or at a frequency that is in line with what the reader (who has to opt-in) wants i.e. like an RSS feed. I doubt that one in 500 tweets that I see could be classed as spam.

Frank | October 17, 2011 19:35

115:

Something that may or may not be helpful is CloudFlare:

http://cloudflare.com

They're a DNS based caching system and CDN, but also have a lot of pretty nifty security options that have grown out of Project Honeypot.

I've seen this do wonders for comment spam, so either this or some other application firewall might be helpful in your case.

David Conrad | October 17, 2011 20:05

116:

Speaking as someone who works at Cloudflare, we'd be happy to help. I won't spam your blog with all the features etc. (:-)), but we do a bunch of things (for free) to try to improve the web. Feel free to drop me a note if you want more details...

roxysteve | October 18, 2011 19:33

117:

On CAPTCHAs: Shamus Young, when he was in the middle of his justly famous DM of the Rings screencap comic, instigated CAPTCHA-control of comments because he suddenly became very attractive to spammers.

He found that after about three days he didn't need to actually go to the trouble of building the word/phrase and morphing it. He found that by using an easily-recognised phrase as the same CAPTCHA day in, day out, the spam became a minor problem again.

Of course, there was always some twit ready to point out for the benefit of the bots that the CAPTCHA wasn't working. A quick perl script would have taken care of that.

By Charlie Stross

117 Comments
