Admin notice: server crash

You might have noticed that my blog has been down since Monday afternoon.

That's because the server it runs on began to experience weird out-of-memory issues and then a catastrophic file system failure (sometimes RAID 1 just makes everything worse).

We've now swapped in a new server and restored from backup. Everything seems to be working again — except the backup predates the last blog entry and, oh, about 84 comments. Luckily I've got a dump of that blog entry, so I'm about to re-post it (along with all those comments as a trailer: I'm not going to try and figure out how to turn them back into actual comment records in a MySQL database! Update: the comments are toast. Sorry ...).

Thereafter, normal service will be resumed.

20 Comments

1:

Hooray!

(LiveJournal right now is showing me a picture of a goat. If anyone ever hoped 'the whole of the Internet' could be up at the same time, they should long since have been disabused.)

2:

These things do happen occasionally; as some of you may be aware, I'm more aware than most about just how impressive an up-time record you have.

3:

Note that there are over 105,000 comments in the database. We've lost 84 -- just the ones posted after 5am BST on Monday, before the server died around 5pm.

4:

Any idea of the root cause?

Just a nosey nerd being curious.

5:

Nope, which is worrying.

My first notice was on Monday, mid-morning, when the server stopped responding. I fired up the terminal login and saw lots of processes exiting abnormally with "out of memory". (On an 8GB Linux server with plenty of swap space.) So I rebooted. It ran okay for about 15 minutes, then started again -- same pattern.

A couple of reboots later (I was trying to troubleshoot it) I ended up with fsck barfing and demanding manual intervention. Manual intervention resulted in a borked filesystem -- TBH it had probably been borderline or completely irretrievable before that point. So I called in the big guns.

$SYSADMIN checked the hardware and didn't find anything wrong with it, so he re-imaged the machine and restored from the last backup (5am that day) on Tuesday. All was fine for about 15 minutes ... then the same shit started up again.

At this point $SYSADMIN assumed hardware failure and swapped it out for a new machine. Cue the re-image, restore fandango again.

We got it restored, then restored the MySQL dump ... which trashed everything. Turns out the MySQL backup script had been failing semi-silently since October last, due to a database table with a funky character in its name. Not to worry: the generic backup from 5am was sufficiently clean that, barring a single crashed table (manually fixed), we were up and running again by 11am.
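A backup that fails semi-silently is usually one whose exit status nobody checks. As a minimal sketch (not the script that was actually running here), a wrapper along these lines would at least fail loudly. The names `run_backup` and `BackupError`, the size threshold, and the example `mysqldump` invocation in the docstring are all assumptions for illustration.

```python
import subprocess
from pathlib import Path


class BackupError(RuntimeError):
    """Raised when the dump command fails or produces a suspiciously small file."""


def run_backup(command, dump_path, min_bytes=1):
    """Run a dump command, write its stdout to dump_path, and fail loudly.

    `command` is an argument list, e.g. ["mysqldump", "--all-databases"]
    (a hypothetical invocation; adjust for the real setup).
    Returns the size of the dump in bytes.
    """
    dump_path = Path(dump_path)
    with dump_path.open("wb") as out:
        result = subprocess.run(command, stdout=out, stderr=subprocess.PIPE)

    # A non-zero exit status is the signal a silent failure throws away.
    if result.returncode != 0:
        message = result.stderr.decode(errors="replace").strip()
        raise BackupError(f"dump exited with status {result.returncode}: {message}")

    # An empty or tiny dump is suspicious even when the exit status is zero.
    size = dump_path.stat().st_size
    if size < min_bytes:
        raise BackupError(f"dump is only {size} bytes, expected at least {min_bytes}")
    return size
```

Called from cron, the idea is that an uncaught `BackupError` produces a non-zero exit and an email, rather than a stale dump nobody notices until restore day.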

... Then we discovered a bunch of permissions problems, which we're still tracking down and fixing, but those only affect mail for antipope.org and a handful of other domains. Not your problem.

(Those backup scripts are about to receive some more scrutiny. Meanwhile, this all coincided with me getting massively upgraded broadband speeds at home -- I now have up to 20Mbps outgoing and 80Mbps incoming. So I'm considering setting up a fallback server at home, which can at least serve up a "help, things aren't working!" page if this happens again.)
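For the curious, a fallback of that kind can be very small. The following is a rough sketch using only Python's standard library, assuming the home machine just needs to answer every request with a holding page. The handler name, the page text, the `Retry-After` value, and port 8080 are all made up for the example, and pointing DNS at the fallback is a separate problem not covered here.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

FALLBACK_BODY = b"Help, things aren't working! Normal service will be resumed.\n"


class FallbackHandler(BaseHTTPRequestHandler):
    """Answer every GET with the same 503 holding page."""

    def do_GET(self):
        self.send_response(503)  # Service Unavailable
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(FALLBACK_BODY)))
        self.send_header("Retry-After", "3600")  # hint: try again in an hour
        self.end_headers()
        self.wfile.write(FALLBACK_BODY)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real server would log requests


def make_server(host="127.0.0.1", port=8080):
    """Build (but do not start) the fallback server; port 8080 is arbitrary."""
    return HTTPServer((host, port), FallbackHandler)
```

Running `make_server().serve_forever()` starts it. Returning 503 rather than 200 tells crawlers and caches that the outage is temporary, so the holding page should not be stored as the real content.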

6:

Charlie,

On Monday you replied to someone in the language of alienation thread and said

"I am unfamiliar with this person. Where should I start? (Bearing in mind that stand-up comedy often doesn't translate well across cultures, even within the same language group.)"

Do you remember who that was? I have been wondering since the blog went down. Thanks.

8:

I thought it may have been Paul Krugman's fault. He posted about Neptune's Brood on Monday and I noticed the site down around then.

Coincidence? Or are there a lot more economics nerds lurking than at first glance?

9:

You start with Louie, season 2. You can start with season one, but he was doing an experiment with the series and it's after the first season that he finds his groove. Don't worry, the humour WILL translate.

http://en.wikipedia.org/wiki/Louie_%28TV_series%29

(Lucky Louie was his earlier, more conventional sitcom which is an easy miss. It's not bad! It's just not as good.)

10:

Nope, server went tech before that piece came out.

11:

"server went tech"

I don't know what that means (typo?) but it sounds like it belongs in the "alienation" thread :-)

12:

"tech": Tends to be used as shorthand for "technobabble". In this case, probably the only way of meaningfully summarising a complicated set of circumstances where the server ate itself in a very messy way.

Refer to TVTropes at http://tvtropes.org/pmwiki/pmwiki.php/Main/TechnoBabble for more info and further definition and uses of "technobabble" and "tech" (specifically the entries on Star Trek).

13:

Have we already covered why you don't use a cloud service? Seems like I remember the talk, but I can't seem to spot it.

14:

Yes.

1. Inertia.

2. I like owning my own box. Or renting it by the year. Whatever. It's more concrete.

3. Last time I looked, cloud service costs weren't much cheaper than I was already paying. More to the point, cloud services charge on the basis of demand. One DDoS attack or major traffic spike (like the time I accidentally broke a political news story a couple of days before the newspapers picked up on it) and I'd be in the hole for enough extra £££ to wipe out any savings. And that sort of thing happens to me. (This blog gets boingboing'd/HN'd/slashdotted on a monthly or sometimes even weekly basis.)

15:

For informational purposes, the cloud providers I have used* charge by system image "turned on" time, not on CPU usage thereof. You install a committed usage Medium-large sized webserver image, you leave it running, you're paying the same hourly price x 24 hrs/day x 30ish days/month x 12 months/year etc. Regardless of whether it's 1% busy or 100% busy during those times.

They usually have some cents per GB of web I/O across your systems as well, but the systems themselves are flat rate.

* I have used AWS and Joyent at significant scale.
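The pricing model described above (flat hourly rate for the running image, plus some cents per GB of web I/O) can be sketched as simple arithmetic. The rates and traffic volumes below are invented for illustration and are not any provider's real tariff; `monthly_cost` is a hypothetical helper.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # "30ish", as in the comment above


def monthly_cost(hourly_rate, gb_transferred, rate_per_gb):
    """Flat instance charge for a month plus a per-GB transfer charge."""
    instance = hourly_rate * HOURS_PER_DAY * DAYS_PER_MONTH
    transfer = gb_transferred * rate_per_gb
    return instance + transfer


# Hypothetical prices and traffic volumes, NOT real figures.
quiet_month = monthly_cost(hourly_rate=0.10, gb_transferred=100, rate_per_gb=0.10)
spike_month = monthly_cost(hourly_rate=0.10, gb_transferred=2000, rate_per_gb=0.10)

print(f"quiet: {quiet_month:.2f}, spike: {spike_month:.2f}")
# prints "quiet: 82.00, spike: 272.00"
```

Under these assumed numbers the instance charge is identical in both months; only the transfer component grows with traffic.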

16:

"Gone tech" is an aviation term used to describe the situation when an aircraft has a technical problem. The first flight out to Paris seems to regularly go tech while we are sitting on it, wondering whether we're going to make the connection.

17:

I'm not doubting your voracity in terms of what you're hearing at Turnhouse, but IME the more usual terms are "gone U/s" or "FUBAR".

18:

I do so hope you actually meant 'veracity' there.

19:

Mai proff-reiding appeerz 2 bee FUBAR. ;-)

20:

Apropos of nothing at all, could we be mistaking Cthulhu's 'voracity' for Cthulhu's 'veracity'? Arabic script has no vowels, I think. So the Mad Arab might have written 'vrcty'.

That would put thongs in a new light.

About this Entry

This page contains a single entry by Charlie Stross published on May 22, 2013 11:19 AM.

Crib Sheet: The Atrocity Archive(s) was the previous entry in this blog.

The language of alienation is the next entry in this blog.
