Service interruption

The new server is up and tests out okay, so we're switching over on Monday morning (UK time; the wee hours if you're in the USA). You won't be able to post comments on the blog during the move, which will run from midnight GMT on Sunday night until roughly noon GMT on Monday. When you see the comment entry box again, you'll know the migration is complete.

3 Comments

1:

Be sure and post some extra provocative articles just before you disable comments. (-:

2:

Farcebook outage... (speaking as someone who was an electrical engineer in a former life, working with Sun Microsystems as a product/test/verification engineer)

http://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed. The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.

Patching on the fly? Setting the wrong bit in a control register via a privileged instruction across multiple processors? (OK, I can see this happening, but did anyone lose their job over this one?)

3:

If they're smart (most companies aren't), nobody lost their job, but everybody in ops got a very detailed lesson (with no blame attached) in exactly what went wrong and why, and why it would be a Bad Idea to repeat the experience.

The industries that are most safety-oriented, such as airlines or passenger rail, have a safety culture in which priority #1 when mopping up after an accident is to work out where things went wrong and then to prevent the same situation from arising in future. In other words, not a blame-oriented culture (where people are obsessed with covering their asses), but a problem-oriented one (where folks are obsessed with improving the process to reduce the risk of problems arising on their watch).

About this Entry

This page contains a single entry by Charlie Stross published on September 24, 2010 5:29 PM.

Lost in Translation was the previous entry in this blog.

Books I will not write, #2: Iron Sunrise Variations is the next entry in this blog.