
FAQ: spam

This blog gets hit by spammers.

A couple of years ago the spam load was pretty bad; then we moved to a server with a new IP address. I reckon the blog spammers are using tools hardwired to go to a specific IP address (saving the DNS resolver overheads) and blast some fields at a CGI script. Unfortunately in October the spammers caught up with the new IP address and the spam load hitting this site has been rising ever since, from around 200 spams/month to around 7700/month currently (and rising).

The vast bulk of the spam (around 90% of it, over 200 spams/day) comes from roboposters these days. And they have a common characteristic: they either leave the "your name" field blank when posting, or they fill it with "anonymous".

As you can see if you hang around the discussions here, not much spam gets through. We have filters; we also have volunteer moderators (both to nuke any spam that makes it past the filters and to enforce the moderation policy).

Unfortunately, the robospammers mean that some changes to the way I run this blog are necessary.

For starters: I don't insist that you use your real name when posting comments. Pseudonyms or anonymous handles will get you closer scrutiny by the moderators, because they are frequently used by trolls, but there are legitimate reasons for not wanting to use your true name on the internet. However you should use a pseudonym other than "anonymous" because all comments posted by "anonymous" go straight in the spam bin. Which is so full-to-overflowing that nobody bothers to check it for misplaced ham these days. NB: calling yourself "a. n. onymouse" (or variations thereon) will work fine.
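The name rule above boils down to a very small check. Here is a minimal sketch in Python, assuming the comparison ignores case and surrounding whitespace (the post does not say whether it does); `is_junk_name` is a hypothetical name, not part of Movable Type.

```python
def is_junk_name(name: str) -> bool:
    """Return True if the "your name" field is blank or exactly "anonymous".

    Assumption: case and surrounding whitespace are ignored.
    """
    cleaned = name.strip().lower()
    return cleaned == "" or cleaned == "anonymous"


print(is_junk_name(""))                # True
print(is_junk_name("Anonymous"))       # True
print(is_junk_name("a. n. onymouse"))  # False
```

Because the match is exact, a variation such as "a. n. onymouse" passes, which is consistent with the NB above.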

For seconds: a huge amount of spam relates to three categories: (a) dodgy financial assistance (loans, credit, etc), (b) luxury designer brands (you would not believe the number of spammers who seem to think you all want to buy cheap Karen Millen dresses and Gucci handbags), and (c) prescription medication (Cialis, Viagra, you name it, they want to sell it to you). If you want to refer to these medicines or high-end designer brands in a comment, mangle the spelling slightly. Stick a space or a punctuation character in it. (I feel safe telling you to do this here because the spammers, almost by definition, aren't part of the conversation and won't read these helpful tips).
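To see why a stray space or punctuation character helps, here is a rough sketch of a whole-word keyword filter in Python. The blocklist and the function name are hypothetical; the blog's real filters are not described in enough detail to reproduce, and a smarter filter could normalise punctuation before matching.

```python
import re

# Hypothetical blocklist; the blog's actual filter terms are not published.
BLOCKED_WORDS = {"cialis", "viagra", "gucci"}


def trips_keyword_filter(comment: str) -> bool:
    """Return True if any whole word of the comment is on the blocklist."""
    words = re.findall(r"[a-z]+", comment.lower())
    return any(word in BLOCKED_WORDS for word in words)


print(trips_keyword_filter("cheap Gucci handbags"))   # True
print(trips_keyword_filter("cheap Gu.cci handbags"))  # False
```

The mangled spelling splits into fragments ("gu", "cci") that match nothing on the list, so the comment gets through.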

Thirdly: if you post something and it vanishes into the moderation black hole, feel free to post another comment saying "Moderator HELP!". We'll look for it. (Please bear in mind, though, that we may well live in a different time zone from you and be in bed at the time. And this is not the New York Times, with a paid staff on 24x7 duty.)

Finally: as noted here, this blog doesn't take advertising (because it is an advert). However, I won't snarl at you if, in the course of a discussion you post a recommendation or a link to some commercial product or service that may interest other readers as long as you have no financial interest in the product. Found something that you like using and want to share it? That's fine. Taking revenue from click-throughs or otherwise boosting the noise level for purposes of search engine optimization? That's not cool. Not sure whether what you want to push is okay? Just ask.



Good job keeping the spam out! This can really be a PITA. I have had some success employing IP-based blocking using the anti-spam ACL from on my sites.


The pharma injunction is pretty ironic post rule 34.

Absolutely brilliant book incidentally. I just finished. But are you going to give your spamfans a cut of the take?

Seems only fair :-).

Also, what is the connection between ATHENA and Bruce Wayne?


"True Names" huh?

Didn't Ms. Le Guin have several things to say about that? Be careful what you ask for, especially if his true name is Yevaud.

I hate to say it, but a simple arithmetic question requiring an answer-in-a-box, or something similar (NOT a "captcha", please, though!) might become necessary before too much longer?

Does OGH have any further thoughts on the matter?


I still don't understand the random spam I was getting on my little blog that seemed to have no point to it, i.e. no links, and a pointless off-topic generic sentence saying something like "great post, you have got to the nub of the problem there alright", often commenting on weeks-old posts.

My wife has a blog for her art degree (it's a distance learning thing and the college recommend blogging in place of a logbook so the tutor can see her logbook notes at any time, so it's not even really a proper public blog) and yet she has started getting the same rubbish now too.

If anyone can explain to me what the hell the point of those is I would be grateful - What are they trying to achieve?


What? No offer of dates, hookups, or "Lolita Kitty wants to be your friend?"

And no porn?

You got off lightly.


Yes: honeypots.

It's possible with some CSS to add a field to the comment form that is invisible to humans but visible to robospammers. The robospammers fill it in. The comment script needs to check for that field and discard the comment automatically.
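The server-side half of the honeypot is a one-line test. A minimal sketch in Python, assuming the form arrives as a mapping of field names to values; the field name `website_url` is made up for illustration, and this is not Movable Type's comment script.

```python
from typing import Mapping

# Hypothetical name for the field that CSS hides from human visitors.
HONEYPOT_FIELD = "website_url"


def is_bot_submission(form: Mapping[str, str]) -> bool:
    """A human never sees the hidden field, so any non-blank value marks a bot."""
    return form.get(HONEYPOT_FIELD, "").strip() != ""


human = {"name": "Mark", "text": "Nice post", "website_url": ""}
robot = {"name": "", "text": "cheap loans", "website_url": "http://spam.example"}

print(is_bot_submission(human))  # False
print(is_bot_submission(robot))  # True
```

A submission that omits the field altogether is treated as human here; that is a design choice of this sketch, not something the comment specifies.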

The logic is simple. Alas, the guts of Movable Type are anything but simple, and if I hand-tweak the comment script the changes will be lost at the next update. So --


Just found an MT plugin that blocks harvesters -- i.e. botnets. It checks if a comment is submitted by an IP address other than the one that requested the comment form; if so, it's almost certainly a bot (your computer doesn't magically change IP addresses between requesting a page and submitting a form, right?).

I am going to install it ASAP.
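One stateless way to do that kind of check is to embed a signed token in the comment form and verify it on submission. The Python sketch below illustrates the idea only; it is not the plugin's code, and the secret key, function names and example addresses are all assumptions.

```python
import hashlib
import hmac

# Hypothetical server-side secret; a real deployment would keep this private.
SECRET_KEY = b"change-me"


def issue_form_token(requester_ip: str) -> str:
    """Token embedded in the comment form when the page is served."""
    return hmac.new(SECRET_KEY, requester_ip.encode(), hashlib.sha256).hexdigest()


def same_ip_submitted(token: str, submitter_ip: str) -> bool:
    """True only if the submitting IP matches the one the form was issued to."""
    expected = issue_form_token(submitter_ip)
    return hmac.compare_digest(token, expected)


token = issue_form_token("203.0.113.7")
print(same_ip_submitted(token, "203.0.113.7"))   # True
print(same_ip_submitted(token, "198.51.100.9"))  # False
```

Note that the token has to be generated per request, so the page carrying the form cannot be served as a purely static file.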

your computer doesn't magically change IP addresses between requesting a page and submitting a form, right?

Except when I open a post at home, but don't comment until I'm at work. :-/


You know the bit about using CSS to hide a text input field from humans, and if it is filled in and submitted it is a bot, right?


Users on mobile phones are also likely to change IP address between requests. From personal experience the hidden field honeypot method works very, very well.


"What are they trying to achieve?"

I get a steady stream of those in Google+. I have three theories.

  • If there's an URL attached, the URL is either for search engine optimization or else it points to malware for you to infect yourself with.

  • They'd like to make friends with a wealthy westerner, then later on they'll ask for financial or immigration help.

  • They're taking a distance learning course that requires them to post comments to English-language 'blogs. (-:

    11:

    "IP address magically changing" used to be an issue for AOL customers due to their proxying setup: each individual HTTP request could, and did, travel via a different proxy server.

    That was the situation 5-6 years ago, at any rate: at that point I switched jobs and no longer had to care about the issue.


    Mark @#4: If anyone can explain to me what the hell the point of those is I would be grateful - What are they trying to achieve?

    That sort of drive-by comment spam may have a URL underlying the username. (Mine should point to my website, for example.) This gives Google the impression that there are many links pointing to the spammer's site and this increases the spam-site's page rank. This influences how early in the search results the spam site will be.

    Or they may just be testing your site to see if their automated system can post to it. Many sites have anti-spam countermeasures, so they'll do some drive-bys to see if their script can work.


    The plugin I'm using permits comment responses from a different IP from the form request as long as they belong to the same Class C network. This should take care of the AOL or cellphone problem. (Does anyone do random IP allocation from a Class B network?)
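A "same Class C network" test corresponds to a /24 prefix match in CIDR terms. Here is a minimal sketch using Python's standard `ipaddress` module, assuming IPv4 addresses; the function name and the example addresses are illustrative and this is not the plugin's own code.

```python
import ipaddress


def same_slash_24(form_ip: str, submit_ip: str) -> bool:
    """True if both addresses fall inside the same /24 (old "Class C") block."""
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{form_ip}/24", strict=False)
    return ipaddress.ip_address(submit_ip) in network


print(same_slash_24("192.0.2.17", "192.0.2.200"))  # True
print(same_slash_24("192.0.2.17", "192.0.3.5"))    # False
```

Widening the prefix from /24 to /16 would cover the old "Class B" case, at the cost of letting many more unrelated hosts through.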

    If this comment gets through, then switching on botnet blocking doesn't block me from posting from a static IP address. ... No, didn't work. (Something wrong with my templates. The documentation is a little vague.)


    Oh for fuck's sake ...!

    Bloody thing requires PHP, for no sane reason.


    My ISP has this stupid policy where they reset your connection every twelve hours (the contract says "for technical reasons", but they could never give me an explanation of why they do it, and the only thing that makes sense is that they want to charge more for static IP addresses). Anyway, this means that my IP address changes every twelve hours, and there's nothing I can do about it. Of course, it's very unlikely that I'll read an article before the change and comment on it after it, but it's a possibility.


    Perhaps they hope to be able to come back and edit the message with some links after it's been moderated? Or to get on to your white list?


    Okay, I have an Apache/Movable Type/Perl/PHP headache now.

    TL;DR is that there is a very nice plugin that should block all the botnet spam. Trouble is, the author of the plugin assumed that anyone running Movable Type would also be running PHP, and processing MT's files through PHP on the way out the door, in order to insert the requester's IP address in the headers.

    However, this site runs on static content, because it takes a hammering; if it had to run PHP every time it served up a page, it'd do so a couple of thousand times an hour under normal work-loads.

    Oh well, scratch one good idea (and now to back out the changes I was making to support the plugin).


    I see that on a phpBB forum I frequent. Those spammers will normally have a signature with one or two spam links; perhaps your blog's comment software has signatures turned off.


    Several years ago, I noticed that spam advertising those medications was always misspelling their names. I presumed it was to get past keyword-based blocklists. The result was that apparently the only mails that spelt the names correctly were ham and not spam. Maybe this is not the case any more.


    Re: extra invisible fields: on one simple feedback form I added 2 tickboxes: one "I am a human" and one "I am a dirty spammer". If a wrong combination is ticked, the form contents is discarded. I haven't seen a single spam from that one since.
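The two-tickbox rule can be sketched as follows in Python. The field names are hypothetical, and the sketch relies on the standard HTML behaviour that an unticked checkbox is simply absent from the submitted form data.

```python
from typing import Mapping


def passes_tickbox_check(form: Mapping[str, str]) -> bool:
    """Accept only if "I am a human" is ticked and "I am a dirty spammer" is not."""
    human_ticked = "i_am_human" in form      # hypothetical field name
    spammer_ticked = "i_am_spammer" in form  # hypothetical field name
    return human_ticked and not spammer_ticked


print(passes_tickbox_check({"i_am_human": "on"}))                        # True
print(passes_tickbox_check({"i_am_human": "on", "i_am_spammer": "on"}))  # False
print(passes_tickbox_check({}))                                          # False
```

A bot that ticks every box or none fails either way; only the one correct combination gets through.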



    Have you considered giving an anti-blog-spam service like Akismet a try? While it was originally written by/for the WordPress folks, there are plugins for many systems including MT. I've found that it works remarkably well, and assuming the MT integration is decent (no promises from me on that), it's probably worth the $5/month cost.


    Jon, I already use TypePad Antispam (which is basically equivalent to Akismet, only leveraging TypePad's user base for collaborative filtering) and a variety of other anti-spam tools (notably AutoBan and MT's own SpamLookup blacklisting service). These tools are free. Adding $5 a month is an extra 5% on the cost of running the blog and I can't justify that when there are free alternatives. That's why I stopped using Akismet.


    /16; /24 : CIDR


    Nacho -- the 12 hour reset was normal behavior for ISPs that used old-style telephone modems, with cheap all-you-can-eat monthly subscriptions. To prevent someone from camping out on a telephone line all month long, they would kick everyone off after a timed interval.


    I get about 1500 spam per month, but gmail wipes out almost all of it. Any way of using a gmail account to filter the input to this blog?


    No. What you're seeing is email spam -- not the same at all. Blog spam goes in via comment scripts, such as the one the form at the bottom of this discussion points at.


    "(your computer doesn't magically change IP addresses between requesting a page and submitting a form, right?)."

    Unfortunately, yes, it might. If the user is behind a bunch of load-balanced proxy servers then the second request may go through a different proxy than the first and so show as originating from a different IP address. I believe AOL used to suffer from this; dunno if it still does. My employer used to suffer from it as well (a decade ago; I don't know if that's changed).

    Also interesting are systems such as "bluecoat" where the first request may be diverted via their network to verify it's not a restricted site, and then further requests go directly from the proxy. Makes for fun in log analysis! (e.g. I just hit my own home page; the first request came from but the single embedded image request came from my employer's IP address range).


    Thirdly: if you post something and it vanishes into the moderation black hole, feel free to post another comment saying "Moderator HELP!". We'll look for it. (Please bear in mind, though, that we may well live in a different time zone from you and be in bed at the time. And this is not the New York Times, with a paid staff on 24x7 duty.)

    Averts eyes, turns a little red, and shuffles feet. I usually try to take time difference into account, but then the other day was the first time I've had one held for no apparent reason.

    Some of the moderators seem to have some odd hours--assuming they're all in the same time zone.

    It checks if a comment is submitted by an IP address other than the one that requested the comment form

    I mentioned that my held comment was my first from my iPad, because I was, vaguely, wondering if something like that happens. I was thinking more like; if our names are associated with a particular IP address, and a comment comes in from another, is it held?


    Some of the moderators seem to have some odd hours--assuming they're all in the same time zone.

    We're not.

    Alan and I are in the UK, although both of us travel (not necessarily at the same time) to other time zones.

    We've got a mod in Massachusetts, another in California, and IIRC one in Japan.

    But one person per time zone isn't really enough to give coverage, because folks go shopping/sleep/have ISPs go tits-up for the day. So while somebody usually takes a look at the blog every few hours, there's no guarantee that this will happen.


    Some of the moderators seem to have some odd hours--assuming they're all in the same time zone.

    They're not. The main and secondary moderators are in, I think, three different timezones. Possibly more.

    Although odd hours are indeed kept.


    That doesn't take care of "the cell phone problem" for folks with data caps who try to manage their data use by using wifi when reasonably possible.

    When I wake up, my phone is talking to my home wifi, NATed behind an access point that has a static IP address. As I start going to work, at some point I'll walk out of range of that, and glom onto AT&T's network, where I'll stay during my bus ride. At some point I'll get to my workplace, and the wifi signal strength from my employer will get strong enough, and I'll get a DHCP-assigned address from their dynamic pool, right on the internet without NAT or significant firewalls.

    When I go out for lunch, I might remain on my employer's network if I pick a venue close enough (we have a lot of buildings locally), or on AT&T, or maybe on the Starbucks network if for example I pick the sandwich shop next-door to them.

    And then in the evening, the morning routine is reversed. On my iPad, similar flitting about happens, but without the AT&T bits as I am not willing to pay for a data plan for my iPad. (It's fine if I can only use it when there's wifi around, or, in an "emergency", I can tether it to my phone.)

    But it's probably fine, as long as the rejection notice contains enough information for me to know what's going on. I bet if I hit my "back" button, the forms will contain all the stuff I typed in, and I can reload the page and hit "submit" again and it'll be fine. Right?

    (If not, maybe I'll just make a point of only accessing your stuff via a VPN under my own control. That'd fix it all for sure.)


    Detecting spam may hasten the singularity.

    I'll let xkcd explain:


    A few years ago I read that about 80% of the spam dropped out when one Russian server(?) was cut out of the net. Maybe it's time for another cut.


    I can't speak to ones from bots... but, there's a lot of shady individuals and low-end advertising firms that pay US currency for "realistic" human-authored blogspam. Contact between parties is facilitated by services like Amazon Mechanical Turk.


    On a related note, you might want to check this out, Charlie.

    It's on the Spanish amazon but I've seen them on the others too, seems to be print on demand books stuffed with scraped wikipedia content and ambiguous titles designed to trick buyers into purchasing them by mistake.

    I imagine there must be rather a lot of them.


    Ah, good old VDM. Their primary Wikipedia copypasta imprints have "published" well over half a million articles.


    Re: Antibot measures. I'm slowly teaching myself PHP and currently mashing together an email contact form. (The one I'm currently using is rubbish.) Would it work to add a field that asks the contacter to type the four-letter English word meaning the opposite of up, in full caps, and then dump any forms that don't have this correctly filled in?

    Obviously this doesn't get round the human spammer, but I imagine I'm missing something here, as I don't see this simple form of measure taken in many places, so I presume it doesn't work for some reason.

    Yeah, I know I can get prewritten contact forms with antispam measures, but I want to understand by writing my own.
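The check being proposed is tiny. Here is a sketch of the logic, written in Python rather than PHP purely for illustration; the function name is hypothetical, and ignoring surrounding whitespace is an assumption that goes beyond what the comment asks for.

```python
# The four-letter English word meaning the opposite of up, in full caps.
EXPECTED_ANSWER = "DOWN"


def passes_challenge(answer: str) -> bool:
    """Exact, case-sensitive match; only surrounding whitespace is ignored."""
    return answer.strip() == EXPECTED_ANSWER


print(passes_challenge("DOWN"))  # True
print(passes_challenge("down"))  # False
print(passes_challenge(""))      # False
```

A fixed question like this only stops scripts that were not written with this particular form in mind; anyone targeting the site specifically can hard-code the answer.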


    One reason it isn't used is that it deters commenters.

    What we have at the moment are automated filters that have the following features:

  • They do not require commenters to do anything: you just give a name, click in the entry box, type away, and click 'submit'.
  • They stop about 99% of incoming spam from reaching the screen. The last 200 spams took only (checks) 23 hours to come in, and none slipped through that I'm aware of - and this is currently typical.
  • They correctly allow through about the same proportion of valid posts.
    The problem with making your genuine participants jump through lots of hoops is that some end up going 'well, sod it', and not posting after all. The end result of that would be a less-free discussion and threads would wither on the page. Even those who don't mind the hoops would end up not posting, because they have nobody to reply to. And the spammers: the Mechanical Turk ones would jump through those hoops, and we'd not have achieved anything.

    We could hold all posts for moderation and stop any spam getting through that way. But lacking anything like enough moderators for real-time monitoring, that would freeze conversation even worse than the previous option.

    As it is, for this blog, the current system appears to be working remarkably well. An occasional spam gets through and has to be killed out. Some of those get through at times when none of the team is paying attention (we sleep, we travel, we work), and may sit there for a while. And an occasional sane comment gets filtered out and needs hunting down and reinstating.

    Hoop-jumping may make sense for other sites. But no one solution fits all, and at the moment, Charlie doesn't believe it fits this one.


    If we had one or more orders of magnitude more traffic then it might make sense to add hoop-jumping measures. Then we'd be looking at over 1000 legit comments and 2000 spams per day, on average, which is a freaking inhuman workload.

    On the other hand, if we had that much more traffic I'd be making a big enough profit off book sales through the site to pay a specialist Movable Type admin to write filters and/or fine-tune the site. There is a money/time trade-off, and right now I run this blog effectively as a self-funding hobby (because being a full-time stay-at-home novelist is kind of isolating).


    AAAAAND ...

    We just rolled past 8000 spams in the self-emptying-after-30-days junk trap last night!

    Which means 270/day average, but as the level is rising it means the spam load is accelerating.

    This is why, if a comment of yours is held for moderation for more than 8 hours, you need to give us a heads-up. Nobody goes wading through sewage looking for dropped treasure without knowing that some treasure has indeed been dropped.



    Which means the average over the entire period is actually only 135/day.

    But it's 200 in the last 18 hours, which means your daily average is actually about right.

    (And the last 100 arrived in the last 6 hours. Eek! It does fluctuate by time of day though.)


    Thanks. I suppose I was thinking out loud. I appreciate that on a fast-moving blog another hurdle to commenting would have a detrimental effect. I'll try the capture idea in my contact script. Traffic is not a problem on my site. (There isn't any.)

    Was quite surprised to discover people are sometimes paid (albeit poorly) to spam sites.


    Shouldn't someone complain? I mean, they're probably just barely squeaking through legality, though the title names could be trademark infringements, but even so, Amazon ought to stop carrying them as they are pretty clearly scams intended to dupe uninformed buyers.

    I guess I'll go ahead and complain as per Dirk's "someone should do something" article.

    Re Hurdles: It can be VERY difficult to get a community going, this one here's fairly hale and hearty, but on the average blog you're looking at 1000s of spams per single human commenter, who sometimes doesn't have anything particularly interesting to add. Or just wants to complain :)


    Here's an idea: Hire a bunch of the kind of people that today are employed to solve captchas for spammers and have them identify spam instead.


    If you've ever any hope of the spam bin being useful in the future, you might as well just perma-delete posts from anonymous, rather than stick them with things filtered by other methods.


    CSS-based hacks will break if a user disables CSS or uses a browser that doesn't support it. Rare, but that covers most screen-reader type software for visually impaired folks.


    If advice and experience from the realm of email is any use for the comments domain, then here it is.

    • all manually-defined rules are a pain to compose; work for a short while before being circumvented, and catch innocent messages.
    • it's important to distinguish between volume of spam received, and volume of spam received and not filtered, because…
    • statistical model-based filters are king. They have the best accuracy and - crucially - are adaptive without intervention, but…
    • …they work best when supplied with the most evidence, which means, (sounding a lot like Martin Lewis here)

    Lots of spam isn't bad, lots of misclassified spam is bad.

    In short: someone should write a plugin for crm114 to do comment filtering; don't try to reduce the volume of spam received, because it is actually good to get lots of evidence of spammy messages; sit back and relax.
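To make the "statistical model-based filter" idea concrete, here is a toy naive Bayes classifier in Python. It is a sketch under stated assumptions: it is not crm114 and makes no claim to match its behaviour, the class and method names are invented, and the four training messages are made-up examples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesFilter:
    """Toy word-count model with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_log_odds(self, text):
        spam_words, ham_words = self.word_counts["spam"], self.word_counts["ham"]
        vocab_size = len(set(spam_words) | set(ham_words)) + 1
        spam_total = sum(spam_words.values())
        ham_total = sum(ham_words.values())
        # Smoothed prior, then one smoothed likelihood ratio per token.
        score = math.log((self.message_counts["spam"] + 1) /
                         (self.message_counts["ham"] + 1))
        for word in tokenize(text):
            p_spam = (spam_words[word] + 1) / (spam_total + vocab_size)
            p_ham = (ham_words[word] + 1) / (ham_total + vocab_size)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0


spam_filter = NaiveBayesFilter()
spam_filter.train("cheap loans and credit available now", "spam")
spam_filter.train("buy cheap designer handbags", "spam")
spam_filter.train("I enjoyed the book very much", "ham")
spam_filter.train("the plugin requires PHP for no sane reason", "ham")

print(spam_filter.is_spam("cheap credit loans"))     # True
print(spam_filter.is_spam("the book requires PHP"))  # False
```

Every extra training message, spam or ham, refines the word counts, which is the sense in which receiving lots of spam is useful evidence rather than a problem in itself.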


    January 5th: 8000 spams in the 60-day trap.

    January 10th: 8850 spams in the 60-day trap.

    So the incoming spam load has accelerated by 170 spams per day over a 60 day period.

    Aaagh ...!
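The figures above can be checked with a couple of divisions. This is a rough sketch in Python, reading the 170/day figure as the net growth of the trap between January 5th and January 10th; that reading is my interpretation and is not stated explicitly in the thread.

```python
def per_day(count_change: int, days: int) -> float:
    """Average number of spams per day over a span of days."""
    return count_change / days


# 8000 spams sitting in a trap that holds 60 days of junk.
window_average = per_day(8000, 60)

# Assumption: growth from 8000 (January 5th) to 8850 (January 10th), i.e. 5 days.
surge_jan_5_to_10 = per_day(8850 - 8000, 5)

print(round(window_average))  # 133
print(surge_jan_5_to_10)      # 170.0
```

The first figure is in line with the "only 135/day" average quoted earlier in the thread, and the second reproduces the 170 spams per day.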


    However the last 200 came in in a period of over 24 hours, whereas I've been noting something in the 18-22 hour range in recent days. So it's actually dropped back slightly from the peak.

    This might be just one actual spammer dropping off (perhaps on noting that this is a lousy target).


    Mark, the answer to that is that many blogging softwares offer a setting specifying that a first post from a new commenter must be moderated, but once they've had a post approved they can comment without moderation.

    That first comment is meant to be the Trojan horse. A jumble of enthusiastic words-- hoping you won't notice the meaninglessness of the sentences, or the way they touch on absolutely nothing in your post, and you'll hit the approve button.


    ... And we have rolled past 10,000 spams in the recycle bin.

    Which means in 15 days, the spam surge slowed from 170 extra spams per day to a mere 135 spams/day.

    My work here is done (and I'm shutting down comments on this thread lest it turn into another spammer magnet).



    About this Entry

    This page contains a single entry by Charlie Stross published on January 3, 2012 12:39 PM.

    Rudy #8. In The SF Drawer was the previous entry in this blog.

    Sometimes I hate being right is the next entry in this blog.
