
Bloody computers.

I spent most of Tuesday driving home. Checked my email before I left; there seemed to be a network issue with my server, but I didn't bother checking until the evening, when I got home.

Bad mistake.

The box this blog runs on threw a kernel panic some time late on Monday night/early on Tuesday, and panicked repeatedly on reboot. Diagnostics demonstrated that (a) the RAID array was hosed, and (b) the machine's memory had gone bad ... right as it was handling probably 30,000-50,000 emails and 50,000-100,000 HTTP requests a day.

After a brisk disk transplant into a new machine, we spent the whole of Wednesday fscking the hell out of the main filesystem, which showed worrying signs of bit rot. This morning, the filesystem was recovered, so I spent another few hours pulling the contents down my home ADSL line (all 39Gb of them; it took about four hours). If you can read this blog entry, it means I've finally succeeded in un-gunking the MySQL database enough to post via Movable Type, which is a Good Thing. Hopefully we haven't lost anything (but cloud backups are going to start happening in the near future, i.e. just as soon as the new machine gets a clean bill of health).
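For a rough sense of what that download implies about the line, here is a back-of-envelope sketch. "39Gb" could be read as gigabytes or gigabits, so it works out both; the decimal unit sizes and the flat four-hour figure are assumptions for illustration, not measurements.

```python
def average_rate_mbit_per_s(size_bytes: float, hours: float) -> float:
    """Average throughput in megabits per second for a transfer."""
    seconds = hours * 3600
    return size_bytes * 8 / seconds / 1e6


# Assumption: decimal units (1 GB = 1e9 bytes, 1 Gbit = 1e9 bits).
as_gigabytes = average_rate_mbit_per_s(39e9, 4)      # reading "39Gb" as 39 GB
as_gigabits = average_rate_mbit_per_s(39e9 / 8, 4)   # reading "39Gb" as 39 Gbit

print(f"39 GB in 4 h   ~ {as_gigabytes:.1f} Mbit/s")
print(f"39 Gbit in 4 h ~ {as_gigabits:.1f} Mbit/s")
```

The two readings differ by a factor of eight, which is the whole point of being careful about the capital B.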

We're not out of the woods yet. The mail transport daemon is squatting in a corner and gibbering to itself (metaphorically, I hope). If you use a mailing list hosted on my machine, it won't be back until we've got exim sorted out. (This means you.) There's probably other stuff I've forgotten about. It's not beyond the realm of possibility that there's hidden damage to some subsystem or other and we're going to have to rebuild from scratch on a new Debian system.

But if you can see this, we're on the way back.

36 Comments

1:

Good luck. I think we've all been there, to some degree or other. Totally sucks.

2:

It won't help with your data, but debsums -s will check if any packaged files are damaged.

4:

I'm not sure if this is useful, but this post showed up in my feed reader (Google Reader) and I can view it on http://www.antipope.org/charlie/blog-static/2009/11/bloody_computers.html but when I go to http://www.antipope.org/charlie/blog-static/index.html it's not visible. (Firefox 3.5.5 on XP, if that matters.)

5:

Good to see you sort of back with us.

6:

Lest I speak too soon: the dpkg status file is somewhat borked. So is exim's configuration. There are other issues.

All that's definitely back from the dead is the web side of things, and the blogs. (Although that's by far the most important bit.)

7:

I'm not sure I understand. Is your server at home, so to revert to an offsite backup you had to suck the bits down the domestic wet string? Or is your server colocated somewhere, but your backup was at home, so you had to blow the bits up the string?

You can get faster fscking with a suitable choice of filesystem. Softupdates, journalling, etc etc.

What sort of RAID were you running? We use a GEOM-based mirroring RAID now. We can unplug any one of the disks and put it into any other machine and it automagically comes up as a working filesystem; add more disks and the new machine will rebuild the mirror.

Take care with cloud backups. They can compound the wet string problem.

8:

It has begun. The machines are rising. The test run on Mr. Stross's machine was the first nibble.

9:

Re: cloud backups, check out Backblaze. I tried Mozy and eventually kicked it to the curb (and spat upon it) but Backblaze has been a dream. $5/month and I've got nearly a terabyte backed up. No bandwidth throttling, so you can push it up there about as fast as your connection allows.

10:

Nick: the server's colocated. As for backups ... what backups?

(I just sucked the files down so I had something to poke at offline.)

11:

Christopher@7: nah, Charlie's Facebook-using alter ego has explained why that's not the case: http://www.youtube.com/watch?v=LzRF-5IWrm0

12:

Sorry to hear about your troubles! But at least one good thing came out of it.

I have been lurking around since I found my way here about half a year ago (I had never read your books), but had never actually got around to buying one. As I was really enjoying the discussion in your last entry, I was very frustrated not to be able to follow it when your server crashed.

Good thing is, in the blog's absence the only thing I could do was run down to the book store and get Accelerando and Singularity Sky. So now you officially have one more reader thanks to the crash.

13:

Turns out there is hidden damage -- the dpkg database is somewhat corrupted. I've hand-massaged bits of it back to a semblance of life, but it ain't happy and further work is going to have to wait on $SYSADMIN getting back to within spitting distance of a terminal. We're in deep Debian doo-doo here, and I suspect I may be doing a salvage-and-move to a clean system over the weekend, rather than simply resurrecting this one.

(At least the blogs are salvaged, though.)

14:

Benedict: you have a dirty cache (or your ISP is caching stuff for you and they have a dirty cache).

15:

When I searched for news about the status of your server while it was down I couldn't find anything relevant. Is there a news site or service online that handles such "internet news"?

16:

Re: cloud backup - check out tarsnap, cheap, secure, Amazon-cloud based backup, usage-based pricing. The founder/developer is or was the Security Officer for FreeBSD, and a very sharp guy.

17:

Charlie, you weren't parsing HTML w/ Regular Expressions by any chance: http://www.codinghorror.com/blog/archives/001311.html

18:

For cloud backup I can recommend http://www.jungledisk.com/

19:

Another suggestion for Internet backup services: Tahoe.

It provides a distributed file-store, where multiple nodes are used to provide redundancy. Storage nodes are not trusted with unencrypted data.

So you can use it with a centrally-provided storage service, or if you have enough friends with some spare storage capacity, you can hook up a private friendnet and store each other's backups securely.

I don't know if it would be suitable for you given your bandwidth concerns; something based on rsync may work better (after the initial sync, and assuming comparatively small churn).

I've been meaning to set it up with some friends for quite a while now.

http://allmydata.com/ http://allmydata.org/trac/tahoe

20:

I'm sorry for your trouble, but all I could think about while reading this was whether it is necrophilia if you spend a weekend fscking a file system that exhibits signs of bit rot.

21:

Depending how far the rot has spread, you might be able to update the dpkg status file from apt's Packages file. Have a look at dpkg --update-avail and friends. Newer versions of Debian seem to hold dpkg at one remove from the package management, but then again I've never had to resuscitate a machine with a corrupted root partition...

22:

Aaaaand ... $SYSADMIN has fixed the dpkg database and the glitch with exim; looks like we're back in business!

(I expect other snags to surface, but for now the server's functional again.)

23:

These are the just deserts for your previous posting: you asked people how to design a society and didn't ask the machines, so they went on strike.

24:

The FAA is having computer problems and a lot of flights are delayed. I wonder which of you kept more people from doing what they wanted.

25:

All I can say is that I'm glad it's back up and that the damage/loss appears to be minimal.

I just hope things are going well with the cat, too.

26:

I thought you may have been victim of a DNS snatch at first, but it was just old-fashioned grumpy machines. Glad things are approaching normal again...

27:

That was a scare.

28:

Next time think twice before installing some piece of software coming from the Laundry. :)

29:

Ah yes, ye aulde bitrot. Never fun, but not absolutely unexpected over time. Glad $sysadmin's been able to get it rocking again for everyone. Debian's usually pretty stable, but eventually the uptime of any machine will approach zero.

30:

Charlie, glad to see the gremlins have been chased back to mountain and fen and the System is once again Clean and Lovely. I didn't comment earlier because (a) I had nothing useful to say and (b) I had no doubt that you would triumph in fairly short order.

I didn't want to whinge before (well I did really, but I manfully restrained myself), but one of your links hit one of my Tunbridge Wells lobes. As you well know, when that happens one is powerless to prevent some spleen venting. (Note to furriners: Royal Tunbridge Wells is a small affluent town whose inhabitants have a reputation for writing angry letters to the press on trivial subjects.)

Why oh why oh why ... does anyone who wants to co-opt a vaguely revolutionary look and feel take the Russian alphabet and substitute their "ya" letter for our "R" throughout? Your chums at "Wyaiter's Blfk" have gone one further and substituted "f" for "o". Perhaps they are subconsciously close to accepting that cruel, capitalist society is more likely to employ their talents in the customer service trades?

Come on, chaps and chappesses. You've come up with a great name: you can do the look and feel with fonts and layout. You don't need cheap jokes. (Russian, being phonetic, would spell it "Rayters Blok").

Sorry about that. I know it's a weakness. I struggle against it. I'll try not to let it happen again.

31:

TBR@15: I use isitup.org to check if an unreachable site is really down or if the problem is at my end. There may be situations where a site is up but unreachable by both my computer and isitup.org's, but as a second opinion it's quite useful.

32:

Was glad to see you back up yesterday morning, but had nothing useful to say (still don't), and wasn't sure if it was safe to comment.

Two days without a fix was rough.

33:

Welcome back, Charlie - you and your blog were keenly missed.
AHS, ASS. One hopes your trip was enjoyable and productive (at least the part before returning home).

TwistedByKnaves @30: this furriner thanks you for supplying the context to 'Tunbridge Wells lobe'.

define Lower_Tadfield "Royal Tunbridge Wells" //ObGoodOmens

34:

I guess my RSS feeds gafiated at the right time; Google developed a bug that deleted iGoogle settings when you logged in from a mobile device, and a whole bunch of users, including me, innocently destroyed our home pages. My page contains all my RSS feeds, 50 or 60 of them at last count, so I had no idea that your blog was down until Google restored me earlier today (took three days for them to realize and solve the problem for all users involved).

Good to hear you're back. Did you ever make any sense out of that mail transport daemon's gibbering, or was that just the result of too much Laundry tech melting down?

35:

Bruce: yeah, the mail server's sorted out. There's a wee problem with the mailman (list server) web GUI that I'll take a poke at later (when I've recovered).

36:

Does cloud backup make sense yet? Every time I've looked into it, the prices are very expensive, so just doing rsync to a home machine always seems to win.

For example, $5/mo for 1TB still costs much more than just buying a disk with a 4 year lifespan. And every time I've looked into online backups, it costs 1 to 2 orders of magnitude more than that.

If you have enough data motion, I guess you have to factor in a better business class internet link at home, but even then the cloud price seems to end up extreme...
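The arithmetic behind that comparison can be sketched roughly as follows. The $5/month fee and the 4 year lifespan are the figures quoted above; the disk price is a hypothetical placeholder, so plug in whatever a 1TB drive actually costs you, and note that the sketch ignores power, bandwidth, and the machine the disk lives in.

```python
def cloud_cost(monthly_fee: float, years: float) -> float:
    """Total subscription cost over the given number of years."""
    return monthly_fee * 12 * years


MONTHLY_FEE = 5.0            # $/month, from the comment thread
LIFESPAN_YEARS = 4           # assumed disk lifespan, from the comment thread
ASSUMED_DISK_PRICE = 100.0   # hypothetical price of a 1TB disk; not from the source

total_cloud = cloud_cost(MONTHLY_FEE, LIFESPAN_YEARS)
ratio = total_cloud / ASSUMED_DISK_PRICE

print(f"Cloud over {LIFESPAN_YEARS} years: ${total_cloud:.0f}")
print(f"Assumed disk price:   ${ASSUMED_DISK_PRICE:.0f}")
print(f"Cloud / disk ratio:   {ratio:.1f}x")
```

Under those assumptions the subscription comes to $240 over the disk's lifetime, so the disk wins on hardware cost alone unless it costs more than that.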

About this Entry

This page contains a single entry by Charlie Stross published on November 19, 2009 3:23 PM.

Designing society for posterity was the previous entry in this blog.

Imbeciles is the next entry in this blog.
