Preventing the New Dark Ages: Start Here

Incidentally, that previous blog entry reminds me: people regularly ask me, "well, why don't you use (Windows | Microsoft Office | [insert program here])? Everybody else does, and it would make your life so much easier." Or they ask me "why bother using Linux? It's so much easier to use Windows." And so on.

Well. Why do I swim against the tide?

A good starting place is to read this Guardian opinion piece by the chief executive of the British Library. She starts off: "Too many of us suffer from a condition that is going to leave our grandchildren bereft. I call it personal digital disorder. Think of those thousands of digital photographs that lie hidden on our computers. Few store them, so those who come after us will not be able to look at them. It's tragic." And then she goes on, just barely scraping the surface of a dismal phenomenon that I've been aware of since the late 1980s — purely by accident.

My history with computers starts thus: in 1981, I acquired a Sinclair ZX81. And I played with it for a bit, before deciding it sucked. (I sold it and used the proceeds to buy a programmable scientific calculator, which I still own.) It wasn't until 1985 that I saw a computer that did what I wanted and that I could actually afford to buy (on a student's shoestring): the Amstrad PCW 8256. I loved my PCW, and lavished on it a memory upgrade, and a second 3" floppy disk drive (not 3.5"; the PCW ran on a unique flavour of media that nobody else used), and then — quite radical in those days — a hard disk that cost me two weeks' wages on my first post-graduation job. Then in 1987 I got a much better job, bought myself an IBM PC clone, and ...

But enough about the hardware history. The software history is quite different.

The Amstrad PCW came with a word processor called LocoScript. LocoScript got me through my final year at university and, coincidentally, was what I wrote my first professionally published short stories on. But it was very limited — if your files went over about 40Kb in size it slowed right down, and there was no word count facility. I realized early on that I wanted something better, and I bought a copy of Protext (before LocoScript 2 came out, which fixed most of my complaints). As I was also teaching myself a bit about programming (and from 1987 doing a night school course in computer science) I stopped using LocoScript, and simply started new work in Protext.

When I switched to an IBM PC clone (an Amstrad PC 1512) it had 5.25" floppy disks, rather than the eccentric 3" disks used by the PCW. And the PCW had no serial port (the serial port accessory cost all of £66, a fair bit of money in 1987). So I copied my writings onto a couple of 3" floppies and sent them off to a bureau who, for a small fee, returned them to me — along with a brace of 5.25" floppies. Then I started hunting for a WP for the PC.

Being halfway to broke at the time (I'd just bought my first flat, and interest rates went up roughly 4% over the next year, doubling my repayments) I poked around various shareware libraries first. Along the way, I settled on PC WRITE 2.4 for a while, and wrote a novel on it. (Various ideas, reworked in other forms, ultimately ended up in "The Atrocity Archive" many years later). I was reading magazine reviews, and had a fair idea of what bugged me about existing word processors; so when Borland Sprint came out, I coughed up the relevant amount of money and bought a copy of that, which served me well for roughly five years, by which time Windows 3.0 was sitting on my 386, and I'd acquired a student copy of Word 2.0 for Windows while at University.

... Are you noticing a pattern yet?

In the space of six years, I went through five word processing packages. Being naive at the time I didn't export my files into ASCII when I moved from CP/M and LocoScript to MS-DOS. I learned better, and when I switched from Sprint to Word I halfway ASCII-fied those files; they're a bit weird, but if I really wanted to I could get into them with Perl and mangle them into something editable. Along the way, I lost the 3" floppies from the PCW. Then I had a hard disk die on me — in those days, the MTBF of hard drives was around 10,000 hours — and it took the only copy of most of the early work with it.

Score to 1993: two years' work is 90% lost. And a subsequent five years' work is accessible, kinda-sorta, if I want to strip out all the formatting codes and revert to raw ASCII.

In 1992 I got a Mac (a Macintosh LC, with all of 4Mb of RAM and a 40Mb hard disk). I was also working in a UNIX shop, where text processing was an important part of my job and SGML was the coming thing. And I began to notice something ...

Every time Microsoft brought out a new release of Word, they introduced a new file format. The new version of Word could read documents created by about the last three versions, plus RTF. If you were in business and needed to exchange electronic documents with business partners, you had to upgrade in lockstep so that you could read the files they sent you. This was used quite coldly as a marketing tool, to compel the herd to buy new copies of a word processor — which, by then, was a mature technology. The upgrade cycle was about 18 months to two years long, and I suspect it had more to do with accounting and depreciation rules (so that a corporate customer for MS Word licenses would only have one generation of the software depreciating on the books at a time) than with development time. The upshot was that, unless you took precautions, your documents would become inaccessible due to designed-in obsolescence within about 4-6 years.

I am not in a business with a 4-6 year document retention cycle. I am in a business where I hope that what I wrote ten years ago will still be accessible a century hence. Microsoft's policy was deliberately destroying my life's work.

Of course, Microsoft was not (then) in the business of selling software designed to meet the requirements of novelists; it's in the business of making money by selling software to offices where the average document has a life of a couple of months to a couple of years, and where paper files are routinely destroyed after 5-10 years to save archival storage space. And realistically, how do you go about selling a mature product (word processors) into a market like that? Well, a simple solution is to get the users to give you their data — and then charge them rent for accessing it. Microsoft charged rent in the form of payments for regular rolling upgrades. Now they're pinning their hopes on Cloud Computing, where all your data will be stored in a nebulous cloud somewhere on the internet — sort of like Google Docs and Sheets with a Microsoft tax on top (Google monetize it by advertising, of course).

It's not just word processing. I briefly looked at Microsoft Outlook as an email client, once. It turns out that Outlook stores email in a proprietary data format that only Outlook can easily read. Needless to say, I wouldn't touch it with a barge-pole. Ever since I first got email in 1989, my acid test for an email system is "can I get at the content from outside?" To put this in perspective: last month I rediscovered a DC 6150 tape cartridge containing a backup of emails I'd sent and received in the period 1991-95. I'd thought it lost forever, and indeed, I had no way of reading it. But thanks to a friend of mine who did, we were able to retrieve the contents — mailboxes stored in MMDF and Mbox formats (MMDF is similar but not compatible). Both are still in use, and still readable, to this day, using open source clients — or even a text editor (they're simply long text files with individual email messages separated by a header). The tape had been written using tar, a UNIX archiving tool that's been around since the late 1970s.
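To make the "can I get at the content from outside?" test concrete, here is a minimal sketch of reading an mbox file with nothing but Python's standard library. The two sample messages, the file name `old-mail.mbox`, and the helper names are invented for illustration; they are not taken from the recovered tape.

```python
import mailbox
import os
import tempfile

# Two made-up messages in mbox form: each begins with a "From " separator line.
SAMPLE_MBOX = (
    "From alice@example.org Mon Jan  6 10:00:00 1992\n"
    "From: alice@example.org\n"
    "Subject: First message\n"
    "\n"
    "Hello from 1992.\n"
    "\n"
    "From bob@example.org Tue Jan  7 11:30:00 1992\n"
    "From: bob@example.org\n"
    "Subject: Second message\n"
    "\n"
    "Still readable.\n"
)


def list_subjects(path):
    """Return the Subject header of every message in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return [message["Subject"] for message in box]
    finally:
        box.close()


def demo():
    """Write the sample mailbox to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "old-mail.mbox")  # hypothetical file name
        with open(path, "w", newline="\n") as handle:
            handle.write(SAMPLE_MBOX)
        return list_subjects(path)


if __name__ == "__main__":
    for subject in demo():
        print(subject)
```

The same module also provides `mailbox.MMDF` for the MMDF variant, so a sketch like this should need only a one-word change to read the other kind of mailbox.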

I can't really blame the big corporations for wanting to seize all our data and charge us for access (either a monthly fee, or by forcing us to pay attention to adverts); corporations behave the way they do for structural reasons. (It's like the fable of the Scorpion and the Frog.) But I don't need to cooperate with them.

As a matter of personal policy, for those activities that involve creating data, I aim to use only software that is (a) cross-platform, (b) uses open or well-published file formats, and (c) ideally is free software.

This is in some ways a handicap; Thunderbird (my mail client of choice) and OpenOffice aren't as colourful and feature-rich as, say, Apple's Mail.app or Microsoft's latest Word. However ...

Firstly, they run on Macs, Linux systems, Windows PCs, and even on some other minority platforms. This protects my data from being held to ransom by an operating system vendor.

Secondly, they use open file formats. Thunderbird stores mailboxes internally in mbox format, with a secondary file to provide metadata. (This means I can claw back my email if I ever decide to abandon the platform.) OpenOffice uses OpenDocument, the OASIS format that is now an ISO standard for office documents (XML, style sheet, and other sub-files stored within a zip archive, if you need to go digging inside one). I can rip my raw and bleeding text right out of an OpenDocument file using command line tools if I need to. (Or simply tell OpenOffice to export it into RTF.)
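As a rough illustration of how little it takes to dig the text out of such a file, the sketch below unzips a word processor document and collects its headings and paragraphs from `content.xml`. It assumes the usual OpenDocument layout and namespaces; the sample document is built in memory purely for demonstration, and a real file with footnotes or frames nested inside paragraphs may need more careful handling.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def extract_paragraphs(odt_file):
    """Return the plain text of each heading and paragraph in an .odt file."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    wanted = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
    # itertext() flattens inline markup such as spans into plain text.
    return ["".join(el.itertext()) for el in root.iter() if el.tag in wanted]


def make_sample_odt():
    """Build a tiny stand-in document in memory (illustrative only)."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        "<text:h>Chapter One</text:h>"
        "<text:p>It was a <text:span>dark</text:span> and stormy night.</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for paragraph in extract_paragraphs(make_sample_odt()):
        print(paragraph)
```

`extract_paragraphs` accepts a path as well as a file-like object, so pointing it at a real document on disk is a one-line change.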

Thirdly, they're both open source projects and thus the developers have no incentive to lock me in so that they can charge me rent. I don't mind paying for software; where an essential piece of free software has a tipjar on the developer's website, I will on occasion use it. And I'm writing this screed on a Mac, running OS X; itself a proprietary platform. But the software I use for my work is open — because these projects are technology driven rather than marketing driven, so they've got no motivation to lock me in and no reason to force me onto a compulsory (and expensive) upgrade treadmill.

I'll make exceptions to this personal policy if no tool exists for the job that meets my criteria — but given a choice between a second-rate tool that doesn't try to steal my data and blackmail me into paying rent and a first-rate tool that locks me in, I'll take the honest one every time. And I'll make a big exception to it for activities that don't involve acts of creation on my part. I see no reason not to use proprietary games consoles, or ebook readers that display files in a non-extractable format (as opposed to DRM, which is just plain evil all of the time). But if I created a work I damn well own it, and I'll go back to using a manual typewriter if necessary, rather than let a large corporation pry it from my possession and charge me rent for access to it.


127 Comments

1:

Yes, exactly!

With one big extra advantage for me (less important for most people): If files are in an accessible format (text most often), and I'm using only software that doesn't require a proprietary format, then I can bounce the same information from database to GIS to stats software and back again. And anyone else who needs it can still read it, on windows, mac, linux, and anywhere else.

For me, it makes it entirely worth the (horrid) inconvenience of being the only linux user in a government research lab full of windows users.

After the last discussion, I did go ahead and install ubuntu-eee. Whoever told me the install was fast was quite right - it worked beautifully. But that isn't the timesink aspect I was referring to: it ALWAYS takes the rest of the day to configure everything, install essential programs, and make all the little tweaks that help to turn a computer into an extension of my brain and fingers. Still, with Ubuntu installed I can do that, rather than feeling lobotomized as I did with the Xandros install.

Is it disturbing that I feel that way about my computers??

2:

We share nearly the same pattern of "IT evolution": I started with ZX80 (1k RAM) and followed a similar path, except for the Mac (I just worked in an Apple shop for a year).

Your words must be read by many IT managers, who think that they cannot be blamed for choosing Microsoft (the same IT managers in the 80s and 90s thought they could not be blamed for choosing IBM).

Open standards and open document formats are extremely important; unfortunately, few IT professionals understand this. Many consider the issue just a matter for uber geek Linux evangelists.

With the current economic situation many IT managers are asking themselves whether it still makes sense to pour a big amount of money into Microsoft's bank accounts.

The good news is that maybe the tide is turning; the bad news is that it required a big financial crisis.

3:

Different string of pre-Win3.1 machines and software for me, and we used Lotus SmartSuite at work in the 90s -- but a rather similar pattern of "WTF?" when I noticed what was going on with MS Word on *other* people's machines. That, right there, is one of the two primary reasons why even though I've drunk Redmond's OS Koolaid, I don't voluntarily use their office software. (The other primary reason is that Word drives me to screaming incoherent fury within ten minutes if I have to use it for anything other than editing the text in a pre-formatted memo, and sometimes even then.)

I usually work in Lotus Word Pro's native file format, but save the text every so often into a plaintext file so that as long as I've got the file itself, I can read it with *something* in ten years time. For that matter, I can still open and read my native format files from when Word Pro was Ami Pro 3.1, or at least I could the last time I had reason to look at one of my 1990s files a couple of years ago.

4:

I just had a look, and Mail.app doesn't appear to be a closed-format offender; it stores messages in individual text files and indices in sqlite databases -- although since the file format it uses isn't quite the same as any standard one, it may fail the resiliency test for local folders. (See http://www.cosmicsoft.net/emlxconvert.html )

I had been using Mail.app for a while, but I started using mutt on my Eee 1000 and rather like it, so I'm probably going to switch my Mac over to that. Since I access my mail from multiple computers, I keep all folders on the imap server (fastmail.fm) and let my mail clients handle syncing everything automatically. (For mutt, I use offlineimap rather than mutt's built-in imap client because I like having access to everything while offline.)

5:

All very good points, but it's worth mentioning that a lot (not all, but a lot) of Apple's applications store data in non-proprietary formats, and where they do use proprietary formats it's often to augment the accessible stuff -- for example, if you use Mail.app, then a proprietary Spotlight database makes it fast to locate messages with particular text in them, but that's in addition to, and separate from, the messages themselves -- each of which are just saved out individually as plain ascii text files. Similarly for iTunes, which maintains an .xml file of all your tracks -- but the tracks themselves are simply stored as mp3 (de-facto cross-platform standard) or aac (ISO standard) files you can access directly. iPhoto, again, the pattern repeats itself: Standard, accessible files plus a database to rapidly index and search metadata. I think it's interesting; it's a good way to approach the problem of keeping your user's data open and accessible, while still providing a good user experience.

This was, to be fair, the problem with a lot of those early systems and file formats -- they had to be more than just ascii text in order to support the features that you (presumably) cared about if you used a word-processor (as opposed to a raw text editor)... formatting, mainly, but in those days just holding a novel-sized chunk of text in memory (or swapping in enough of it to view and edit) was a challenge. So all sorts of tricks were required, and these were reflected in the file formats. It was a technological necessity.

Somewhere along the line -- I'd argue in the mid-to-late 90s -- this stopped being the case, as computers finally got to the point where even a large technical book with formatting and screenshots and accented characters and an automatically-updated chapter index and the rest of it, was something that could either fit easily in memory or be spooled quick enough off the now-very-fast hard drive that dicking around with binary optimisation tricks was just a waste of programmer time. Which made it curious that Microsoft continued to do so, but there you go. I suppose they had the nightmare of backwards-compatibility with the hideous mess that was OLE to contend with, but even so... anyway, by that point I'd stopped using WPs at all, and just hacked all my documents out in HTML in a plain-text editor.

It is notable that Apple's office suite, iWork, is an exception to the "open standards plus metadata" pattern above. Their file formats do change irritatingly from version to version, although they're still relatively transparent (at least if you're somewhat technical -- probably not to the average iWork user). They consist of XML plus standardized/"well known" compression formats (gzip for iWork 08, PKZIP for iWork 09). I have to say though, the XML is a mess and not human-readable (at least, not in the way HTML is; you can see and extract your plain ascii text from it in a pinch). It's reverse-engineerable*, but you could say "Yes, but why even get into that position in the first place?" and I'd have to concede your point :-)

(* and I have, for reasons unrelated to long-term data archiving and archeology -- I needed one of my applications to extract some info out of iWork documents in realtime. It's pretty easy if you throw Python and ElementTree at it.)

One point regarding mail specifically, the true secret to email preservation and portability has nothing to do with your choice of email client: it's in your choice of server. If your messages are on an IMAP server, preferably one you control yourself, then your choice of client is irrelevant as long as you tick the "Leave mail on server" box. Your client's mailbox is just a cache for speed and/or offline access. You can swap back and forth between them as much as you like, they all stay in sync, and if one of them goes the way of the MSX you just install a new one and point it at the same server. I do this all the time: Start an email using "pine", postpone it, it turns up in my "drafts" folder in Mail.app where I can resume it, send it, later on I can check my sent mail folder on my iPhone, and it's there. This is how email should be, not held hostage to any particular electronic device, no matter how unproprietary the software it runs!

Incidentally, it's interesting how many writers got into computing via the PCW 8256. I mean, I know a lot of people who've been computing since the 80s, getting into it through all sorts of platforms, whether 8-bit home computers like the Sinclairs or the Beeb, or 16-bitters -- mostly the ST or the Amiga. But if they write stories or novels, I think they almost all started with LocoScript on Alan Sugar's green-screened contraption.

6:

Funny. I came to the same conclusions but for slightly different reasons.

I started with the school's CBM PETs, then the BBC Micro (which was the first computer I owned), then in university I started to learn DOS (and 8086 assembler, TSR programs etc) and Unix (BSD based) at the same time. I bought a PC clone from Morgan Computers (Atari branded!). I started to learn C and TurboPascal (I bought TurboC and TurboPascal using student discounts). I started to use a program called "Cardbox" under DOS to store my CD and book lists (fortunately for me this program actually stored its data in a plain text file with formfeeds between records so I was able to re-use that data). I started to learn TeX and tried writing up one term's worth of lecture notes in it.

When I got my first job and then a 386 based machine I played with DesqView. But then I noticed how expensive software was. No longer being a student I didn't get the discounts and couldn't afford the stay on the upgrade treadmill. WordPerfect changed from 4.2 to 5, MS Word became Word for Windows. Windows! It was costing lots of money to just stand still.

At work we were using Unix systems for everything (even the secretaries would use them) with VT220 terminals connecting via terminal servers to a telnet command prompt. All documents were created with "vi" and "nroff". We had a series of wrapper scripts to detect how the document was to be displayed and formatted according to screen/daisywheel/dot matrix/postscript etc. The same documents could be viewed on SVr2, SunOS 4, CTIX, and even DOS machines. I loved the idea! Free software with built in portability (it's text!) that I could also program to (it's text! I can do grep/sed/awk! Not some closed binary format).

So when Linux 0.10 came out I jumped on it. Never looked back.

My CV is still written in nroff. Documents I created 18 years ago may need slight formatting changes (header/footer margins mostly) but otherwise print perfectly (heh, could even do it under MacOS X now that has groff :-)). My check book software is an old version of "check book balancer" (written in tcl and perl), and this uses tab delimited text files (look! I can write programs to generate reports!); most of the time I just vi the datafiles rather than use the GUI front end, which I mostly now use for statement reconciliations.
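A small sketch of the kind of report that tab-delimited data makes easy. The three-column layout (date, category, amount) and the sample rows are assumptions made up for this example, not the actual file format of "check book balancer".

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Hypothetical ledger: date <TAB> category <TAB> amount
SAMPLE = (
    "2009-01-03\tGroceries\t-42.50\n"
    "2009-01-05\tSalary\t1500.00\n"
    "2009-01-09\tGroceries\t-17.25\n"
)


def totals_by_category(lines):
    """Sum the amount column for each category in a tab-delimited ledger."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at zero
    for _date, category, amount in csv.reader(lines, delimiter="\t"):
        totals[category] += Decimal(amount)
    return dict(totals)


if __name__ == "__main__":
    report = totals_by_category(io.StringIO(SAMPLE))
    for category in sorted(report):
        print(f"{category}\t{report[category]}")
```

Because the data is plain text, the same file stays editable in vi and usable from grep, sed, or awk; the script is only one of many ways in.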

For me, simple data storage formats win because they allow me to extend and customise and _reuse_ the data for my own needs; second is open standards or easily convertible (so I might use msql or similar because I can export to SQL statements). My original driver (poor straight-from-uni so can't afford modern software) is now not important... but the annoyance of the upgrade treadmill is enough to keep me away. Funny how cost and closed formats and regular upgrade requirements seem to go hand in hand :-)

I'm not too sure about OASIS and other such standards. They probably (hopefully!) will help prevent format obsolescence, but they're too complicated for me to write quick'n'simple one-liners against. And that programmability is real important for me!

7:

Canis: The best example of the formats you describe was Nisus Writer, which used actual text files with the formatting stored in the resource fork, so a human-readable copy was guaranteed to be there. Resource forks can be evil, but that was a good use of them. (The OS X version is AFAIK a complete rewrite, and I haven't used it so don't know whether it uses a similar format.)

8:

There's yet another reason to use open software and open formats:

Security.

In my line of work — both now and in my prior career — it's not just the archival access, but the hostility to executable code, that makes RTF an appropriate file format. (Pragma: If your archival document needs to have executable code in it to display coherently, you didn't create it properly.) RTF is, after all, just heavily commented ASCII to a filter program. Similarly, for spreadsheets, properly-commented CSV works better. And so on.

The obvious exception is when intermediate calculation methods matter... but, as any properly trained programmer should be able to tell you, that's what good documentation practices are for.

Finally, there's another aspect of the various "language barriers" that counsels using RTF. I've been in-house at a publisher that had Windows machines in editorial and Macs in production for typesetting... not to mention authors using a wide variety of Mac-based word processors outside the US sending stuff to a US publisher. RTF handled the different character-set encoding of these machines far more transparently than did any Microsoft-proprietary file format; only WordPerfect came even close, and I had to run that on my personal laptop (at the time, WP5.1 was the default word-weapon-of-choice in the US legal community, primarily because Word didn't support legal characters, forms, and formats well... and still doesn't).

In short, it's not just archival integrity; it's sheer usability. The irony that RTF is a slightly translated subset of SGML has been lost on much of the community...

9:

Phiala@4: Take notes. It takes about 50 customization steps for me to get a new Ubuntu machine just right, but I just run through the checklist. The checklist has been growing and improving for seven or eight years now (it used to be for Debian).

10:

My migration to Linux was driven by incompatibility between word processors. Back in 2006 I acquired a Dell Axim PDA, and I intended to use it to edit documents which I would then sync with my laptop running XP. However, I only had MS Works on this laptop, which isn't a bad basic office suite for most people, but (at least at that time), didn't support MS Word documents. The PDA had a pocket version of Word, which wouldn't save to a format that Works would accept, unless you were prepared to go with RTF or plain text. So it was highly inconvenient to share files between them, unless I was prepared to get a copy of MS Office, whether through legitimate or illegitimate means.
Then I discovered OpenOffice, and that was the answer to my prayers. It worked well with MS Word documents, and I was easily able to edit documents with my PDA, then sync them and edit them again on the laptop using OpenOffice.
OpenOffice was the gateway drug that led me into using free and open source software. I was also looking for a decent email client at the time and settled on Thunderbird, which I still use. I heard a lot of the buzz about Firefox and switched to using that instead of IE6.
At that point I was hearing about more and more open source software so inevitably I started hearing about Linux, and was curious to try it as the wi-fi on my laptop was a real pain under Windows and I really wanted a more reliable connection. I had a voucher for money off one of those "Dummies" books so I bought "Linux for Dummies" and that got me started. I tried a few distros - Fedora, Xandros and Linspire but eventually decided I liked Kubuntu best and have never looked back. Of course that meant I couldn't get the PDA working with it but by then I wasn't too bothered, I was just glad to have a way out from under Microsoft's control of all my documents. Now I always use Google Docs for all my documents so I can access them from anywhere.
As for the "why bother using Linux, Windows is so much easier", I think that is utterly wrong, and people just think what they use is easiest. I actually had a problem with a laptop running Windows Vista today where the registry had become corrupted, resulting in the OS failing to detect the CD drive. I had to edit that using regedit. If I hadn't found instructions on a website I would not have had the faintest idea what to edit. The text configuration files in Linux and Unix are at least vaguely comprehensible to mere mortals. Anyone who thinks Linux is harder to use than Windows has probably never delved into the registry.

11:

I call this "data lock" and I write about it fairly often:

http://notes.kateva.org/search/label/data%20lock

I tried to get "data lock" as a concept into Wikipedia, but it was rejected as being too much of a dictionary entry.

This is also very relevant to the work I do with health information systems, but that's another story.

I think we need a term for this concept that people can work with. We've been dealing with "data lock" since the 80s (I lost data in Volkswriter), but people are still rediscovering it. Clearly the meme is getting lost.

Maybe "data lock" is the wrong term. As a wordsmith and memesmith, can you suggest a better one?

12:

This argument is a bit silly. Do you believe that Word documents will not be readable in 10, 50 or 100 years? Why? Supporting old formats is cheap and easy. There's little reason to drop support as long as the underlying technology is the same. Even if the code base is rewritten, and the company decides not to include the old format in the rewrite, it can still access old components. For instance, Word 2024 could include a Word 2007 dll and use it to parse old documents.

At some point the OS will no longer be able to run old components. Then you just emulate an older OS. Or you emulate within an emulation. This could all be wrapped behind an easy-to-use GUI, say, the Give Me My Old Documents Back 2034 Pro Ultimate Edition (Windows Vista License Included).

Microsoft, other software companies, and open source developers all have an interest in helping you read your decades-old Word documents. My bet: As long as the data is preserved, and the format is not too obscure, it will be readable for all time.

Your other point is much better: By introducing new formats with every edition of Office, Microsoft pressures users to upgrade every couple of years. The pressure is gentle - you can download an Office 2007 parser for your Office 2003 for free - but it's clearly a deliberate strategy. And it works.

That said, I'm not saying you or anyone should write novels in Word, or at least not store them in Word format. I would use something like RTF that nobody has any problem parsing, simply because it's inconvenient to convert old documents to new formats all the time. But inconvenient is all it is. It won't be impossible, difficult, or expensive.

13:

FWIW, while it doesn't change the history of things leading to your conclusion, the Office formats today are in an open standard that anyone can implement. It only took 20 years or so of Windows products to get there.

14:

You put your history in a very didactic way.

In fact, as a Linux user since 1992 (I was working at University at that time) my experience is similar to yours. I almost lost some articles and even my MSc work...

One additional point in favour of open source platforms is that every time someone does something really interesting for a proprietary platform, the owners of that platform (since EULAs are just licenses to use) come out with something new that makes the interesting stuff incompatible while they "present the market with an astonishing new feature" that encompasses that thing. If the developer is lucky they purchase his company at a nearly reasonable price. If he is not, he'll have to litigate for a long, long time...

15:

Bjørn --

Word documents go inaccessible *now*.

Let me tell you a little story -- someone needs to produce a large document from the design database. The document is basically a list of bits and what their states mean; it prints as this vast rat-killing phone book of a document. No one wants to type this in by hand, so some code is lashed together to produce an rtf document from the database contents. The rtf in question is carefully reverse-engineered from the specific behavior of a specific -- the one on Fred's computer, right there -- Word version.

This is a business critical document; customers need to have this so they can perform their own design activities, other parts of the company (like the higher level software teams) need to have this document so they can tell how to do what they need to do.

It goes on being dumped to this specific version of Word for a number of years. (From Word it got stuffed through Acrobat and turned into a PDF, an operation that took on Fred's by-then old machine something on the close order of 24 hours.)

Then Fred's old machine dies, heroically trying to produce one more final version on the day of first customer ship for a new product.

That version of Word cannot be had for silver, cannot be got by trade, cannot be took by courtesy of pike and cannonade; it just plain no longer exists.

The reverse-engineered RTF produced by the program written by the guy who isn't here anymore crashed modern versions of Word about one time in three. No one has the least idea why. (It also takes on the order of an hour to load, so the utility of experimentation is kinda limited.) A couple of 16 hour days later, there's a messy, ugly, cumbersome sequence of incantations that can go from that RTF to a PDF by a completely different pathway, but it's repeatable and improvable; it's not a black box.

Meanwhile, the internal-use HTML version of that document has indexes and searches and tree views and various ways to slice and display the data, takes minutes to generate, and even if it's designed to be IE only (the evil that is & without the semicolon) it can be adjusted quickly and easily using programs like tidy. HTML is a craptastic example of an open format until you get to XHTML Strict, even. But the open format version can have the collective effort of this huge number of smart people applied to it; one does not have to solve all the problems your own self.

RTF would be lovely if the various MS products implemented it consistently; they don't. It's not really very safe. (Try table in table sometime, across a couple generations of Word for Windows and then on Mac. It's not pretty.)

The current winner for format stability is nroff/troff/groff; Brian Kernighan's original Bell Labs papers have examples, and the examples work in groff, across 30 years. (Troff is, alas, not the winner at anything much else.)

But, really, any kind of plain text, including annotated plain text like HTML, XML, SGML, troff, or TeX -- and since that now includes unicode plain text, it really doesn't matter what language you're using -- is a better bet than any format with binary data.

Note that Open Document is a compressed plain text (XML) format; you can even do funky things like forcibly replacing the styles.xml file using command line zip tools to guarantee that none of the engineering knowledge contributors have changed the styles used in the document...
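For the curious, here is a rough sketch of that styles.xml swap, done with Python's standard zipfile module rather than command line zip tools. The function name and the file names in the usage note are made up for illustration; the only assumption about the format is that an Open Document file is a zip archive with a styles.xml member.

```python
import zipfile


def replace_member(src_path, dst_path, member, new_bytes):
    """Copy the zip archive at src_path to dst_path, swapping one member's content."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            if info.filename == member:
                data = new_bytes
            else:
                data = src.read(info.filename)
            # Reusing the original ZipInfo keeps entry order and per-entry
            # compression, so a leading uncompressed "mimetype" entry stays
            # first and stays uncompressed.
            dst.writestr(info, data)
```

Something like `replace_member("report.odt", "report-fixed.odt", "styles.xml", open("house-styles.xml", "rb").read())` would then force the house styles back in, whatever the contributors did to them.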

Word, well, the Word ISO standard still has binary data in it.

16:
This argument is a bit silly. Do you believe that Word documents will not be readable in 10, 50 or 100 years? Why? Supporting old formats is cheap and easy. There's little reason to drop support as long as the underlying technology is the same. Even if the code base is rewritten, and the company decides not to include the old format in the rewrite, it can still access old components. For instance, Word 2024 could include a Word 2007 dll and use it to parse old documents.
Word 2024 could do that, yes. If it chose to. If it still exists in 2024. Or 2124, or 2224. This won't affect you; but this is at least in part about the concern of librarians and archeologists that there will be a hole in the digital archeological record covering our period of time, due to data being stored in formats for which reading technology no longer exists.

The critical point is that, whatever the makers of Word choose to do, you don't get that choice. You get to use what they give you, and if that doesn't meet your needs or if they no longer exist to provide, what then?

The Office 1997-2007 file format is published. It's also 210 pages long and of dubious readability -- as someone who has reverse-engineered file formats in the past, even with the documentation, I'd just throw up my hands and say "screw it".

If every web browser on earth vanished overnight -- not just the installed applications, but the source code, the specs, the html format documentation, the whole lot -- you could still open any HTML file in a basic text editor and extract the text, and make pretty good guesses as to the formatting too.
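To make that concrete, here is a deliberately ignorant sketch: it knows nothing about HTML beyond the guess that markup sits between angle brackets, which is about what you could work out by staring at a file in a text editor. The function name is invented for the example, and real pages (scripts, entities, comments) would need more care.

```python
import re


def crude_text(html):
    """Recover readable text from HTML using nothing but a guess about '<...>'."""
    # Treat anything between angle brackets as markup and drop it.
    no_tags = re.sub(r"<[^>]*>", " ", html)
    # Collapse the leftover runs of whitespace.
    return re.sub(r"\s+", " ", no_tags).strip()


print(crude_text("<h1>Dark Ages</h1><p>Plain text survives.</p>"))
```

Try doing the equivalent to a binary word processor file without the spec.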

Hopefully ASCII won't go anywhere for a while (and even if it does, it's susceptible to rediscovery by statistical analysis -- just treat it as a substitution cypher). UTF-8 is also a well-distributed standard, and is the same as ASCII for the first 128 code points, so unaccented text would still be retrievable even if somehow all UTF-8 documentation was lost, and would help provide a rosetta stone to recover the rest of the code-points.
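A small illustration of the ASCII/UTF-8 overlap, assuming a reader who only has an ASCII table to hand: every byte below 0x80 decodes exactly as ASCII, and only the accented characters turn into unknowns. The function name is hypothetical.

```python
def ascii_view(text):
    """Show what a reader armed only with an ASCII table recovers from UTF-8 bytes."""
    raw = text.encode("utf-8")
    # Bytes below 0x80 mean the same thing in ASCII and UTF-8;
    # anything else is part of a multi-byte sequence, shown here as '?'.
    return "".join(chr(b) if b < 0x80 else "?" for b in raw)


print(ascii_view("plain text"))
print(ascii_view("café"))
```

The unaccented words come through untouched, which is what gives you the foothold for working out the rest.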

17:

Hey, I grew up in the same house as an Amstrad PCW 8256!

And I lost docs that were created in a Lotus wordprocessor application of the early 90s (but still in use with us on the first PC in the late 90s) and saved to a floppy.

Recently an architect friend of mine and I were discussing the idea of an open specification language for building components - self-documenting buildings. Which he said would make me rich. I had to explain that making it proprietary would be not just a bad idea but, given the application, frankly irresponsible. Imagine if no-one had a copy of the client that reads the DON'T USE A POWER SAW ON THIS IF YOU VALUE YOUR LUNGS message...

18:

Microsoft would appear unlikely to go bust, on the face of it.

But then, so did Pan Am. And the Bell Telephone Co. Or Chrysler. (FIAT are acquiring a 35% stake in Chrysler for, it would appear, one bent cent plus access to FIAT's engineering and marketing channels. This suggests that Chrysler's owners currently consider the car manufacturer to be worth approximately three bent cents without a rescue package.) The list goes on and on ...

The life expectancy of a large corporation is 35-70 years. After it's gone, good luck getting support -- especially when its assets are virtual.

Good luck clawing your data back from the cloud when the hosting facility pulls the power plug on the servers owned by a company that stopped paying its bills.

19:

Bjørn Stærk @ 10:

Both Microsoft, other software companies, and open source developers have an interest in helping you read your decades-old Word documents
But Microsoft and other software companies have a more compelling (to them) interest in continuing to force customers to upgrade (that's a large part of their ongoing revenue stream). And the compatibility issues between versions only affect a fraction of their customer base, and don't really prevent most of that fraction from upgrading, so it's not cost-effective to put much engineering resources into format interchange utilities. And the only users who can reliably use the more esoteric interchange techniques (emulations of emulations) are typically software developers themselves, rather than ordinary users, or even ordinary sysadmin / IT people.

My wife discovered the problem with Microsoft Word versions when she wrote a few novels in the early 90's on Word 5.1. Several computers later, that version didn't run on any of the machines we had, and Word 2000 didn't translate the files correctly. Luckily, there is a commercial translation utility that converts any Word format to any other; I just convert all the files to Word 2k / XP format, and Open Office can read them without trouble.

I'm largely agnostic as to word processors; all I care about is that they do reasonable formatting without lots of magic incantations, that they generate both RTF and something that Open Office can read, and that there be some way to generate PDF from them (on Macs these days, that just means printing the file; Windows still can't do that easily, and even with a third-party PDF generator installed it takes a couple of steps when it should be as simple as clicking on "Print to PDF"). So I'll use whatever is handy on the machine I'm on. Similarly, I'm one of the few people I know who use both vi and emacs routinely; vi's great for changing a few lines in a config file, or typing in a quick note to a text file. Emacs is much nicer if you use several different programming languages and need to switch formatting rules between files, and I can keep and maintain easily (I'm fluent in Lisp) a bunch of macros that are useful in complex manipulations of structured text like programs. Of course, neither one is really a word processor, so I still keep copies of Open Office on all my computers and Linux partitions and virtual machines, and have the latest Mac word processor (currently Pages.app) on the Mac OS partitions and VMs.

It is indeed deeply ironic that file formats seem to be inexorably moving to some form of XML, given that they are all recognizably SGML, the parent of XML, which has been written off several times as a "failed experiment".

20:

If you're worried about storing your data in the cloud, there's an interesting open-source project:
http://allmydata.org/trac/tahoe
which allows people to back up their data on their friends' computers (in encrypted form). It stores the data on multiple computers, so if one person's computer goes down, you can still get to your data.

Having said that, I don't actually know how far along they are with making it easy to use. (checks web page) The instructions still seem rather techie oriented.

21:

Agree 100%. How do you plan to address the legacy problem, though? By legacy, I mean the trunk of papers that most people before now have left behind when they died. All of my data is stored in various places in open formats, but it's all locked behind password barriers, as I imagine yours is. How do we as a society assure both security now and allow our descendants and colleagues access to our data (be it interesting or banal) after our deaths?

22:

Evan: it's called making a will, setting up a literary trust, and leaving behind a sealed written record of passwords and servers to hit for the executors.

Right now, this is strictly for the professionals whose residual post-mortem income is likely to be of interest to their bereaved inheritors.

But yes, it's going to be an interesting cultural problem when everybody wakes up and realizes 99.9% of the information that they've assembled during their life is going to go away when they die. Like being buried in an unmarked grave.

23:

I have attempted (mostly successfully) to avoid MS Office, in part because of the file format changes but mainly because, as a trained biologist, I know monocultures are unstable. Also, I'm tightfisted: WP Office 6 cost less than a third of the MS Office of the time.

I've needed long-term access to my data, so for 15 years I stuck with an old DOS-based spreadsheet and two WP programs: WordStar 6 (I know!) and WordPerfect Office (as versions 6 through 10 used the same file format, I continued to use V6, with 10 to read the Word docs people insisted on sending). I've used Open Office since 1.4, but it's only since V2 that I've moved wholly over to it. For OOo to gain access to the spreadsheets, I need to open them in Quattro Pro 10 (the WP Office 10 spreadsheet component), save as XLS, and transfer to Calc. WP documents are trickier unless I save as ASCII, as I've already discovered that implementations of RTF vary between WP 6 and 10, let alone assorted MS Office variants. (More recent files will save to MS DOC 98/2000 and can be read by OOo; earlier ones have differing success rates.)

My pictures, on the other hand, are high res, low/no compression JPG. I get twitchy about the 'professional' RAW formats as there is no standardisation; each company's 'standard' is proprietary and requires special software.

It's annoying when I get tricked-out PowerPoint in the multimeg range when the same data could fit in less than 10K of plain text. Pretty, yes; useful, like H***.

24:

I'm an archaeologist who occasionally moonlighted as an IT manager for archaeological units, and trying to get legacy data from old projects, where you have very little information about the original proprietary software, continues to be a major problem! You may be interested in the work of the Archaeology Data Service, http://ads.ahds.ac.uk/
which has been working on this sort of thing for some time now.

25:
Recently an architect friend of mine and I were discussing the idea of an open specification language for building components - self-documenting buildings
I believe I read about this in the Bruce Sterling novel, Distraction. There are these sort of "technologically-mediated, asynchronous barn-raisings". A bunch of self-describing components, a "smart blueprint" and some tech to convey this to humans. Anyone could wander up and help out. A bit of Googling turned up this quote:
Oscar peeled a strip of tape from a yellow spool and wrapped the tape around a cinder block. He swept a hand-scanner over the block, activating the tape... "I'm a cornerstone," the cinder block announced.

"Good for you," Oscar grunted.

"I'm a cornerstone. Carry me five steps to your left." The construction system was smart enough to manage a limited and specific vocabulary. Unfortunately, the system simply didn't hear very well. The tiny microphones embedded in the talking tape were much less effective than the tape's thumbnail-sized speakers. Still, it was hard not to reply to a concrete block when it spoke up with such grace and authority. The concrete blocks all sounded like Franklin Roosevelt.

It always sounded interesting if not-entirely-practical with current tech. Although we're getting closer all the time.

26:

Anybody (Bjørn) who thinks Charlie's arguments are a bit silly should try reading a Microsoft Works document from the early 1990s:

http://www.edavies.nildram.co.uk/2007/03/max/Manual.html

When I came to do the conversion I had a copy of Works (2.0?) which could read the files but not convert them to anything else of any use. The copy of Word 95 I had could, as far as I can remember, only read Works 3.0 documents. It was lucky that I found a copy of Works 2000 in a cupboard which was not in use but which I could install temporarily to give a conversion path. With the various travelling backwards and forwards I did, just getting these documents into a modern word processor was most of a couple of days' work.

27:

Graydon: "That version of Word cannot be had for silver, cannot be got by trade, cannot be took by courtesy of pike and cannonade; it just plain no longer exists."

In your example no information is lost. The company just doesn't want to hire a consultant for a week to update their software. If that is their approach to software, they'll likely have the same problem with all their other decades-old technologies.

Canis: "The critical point is that, whatever the makers of Word choose to do, you don't get that choice. You get to use what they give you, and if that doesn't meet your needs or if they no longer exist to provide, what then?"

If there are many people like me, somebody will write a parser for old Office documents. You say you'd just throw your hands up and say "screw it", but if there were no such software, there would be a lot of money in it. You might give up, others would not.

"This won't affect you; but this is at least in part about the concern of librarians and archeologists that there will be a hole in the digital archeological record covering our period of time, due to data being stored in formats for which reading technology no longer exists."

But that is no different from the challenge of reading these documents 20 or 50 years from now. Future historians won't need to write an Office 2003 parser, they'll use ones that exist today or will be written in the near future, run through multiple layers of emulation if necessary.

Charles Stross: "Good luck clawing your data back from the cloud when the hosting facility pulls the power plug on the servers owned by a company that stopped paying its bills."

Yes. You can take backups - but that depends on there being some way to export the data. I don't see an export button on GMail, for instance. For business critical software I would expect responsible IT departments to have a policy about this, (or get one when this takes off.)

Bruce Cohen: "And the only users who can reliably use the more esoteric interchange techniques (emulations of emulations) are typically software developers themselves, rather than ordinary users, or even ordinary sysadmin / IT people."

But if the need for this was large enough, one could make an easy to use GUI wrapper for this. If it's possible to do it at all, it's possible to make it easy to use - and it is possible.

"Luckily, there is a commercial translation utility that converts any Word format to any other; I just convert all the files to Word 2k / XP format, and Open Office can read them without trouble."

Well, exactly. That's my point. Now fast forward 20 years. There is an insane amount of Office 200x documents lying around in old databases. Microsoft goes bust, or launches a new Office that only supports Office 14 or greater. So some other company steps in and makes a lot of money. Problem solved.

Tileman: "trying to get legacy data from old projects, where you have very little information about the original proprietary software continues to be a major problem!"

Yes. I've worked on exporting data from databases that were only a few years old, but nobody was around to explain how they worked, and nobody had ever documented them. This is a real problem in the software consulting business. But it's not directly relevant to the world's most popular file formats. Even assuming there is no specification of a format, the incentive to reverse engineer it is large.

28:

The problem goes (slightly) further than just open / closed formats: it's information loss.
I got a Sony eReader this Christmas, mostly as an experiment. I used to read books on a PDA until it died, and replaced it with the eReader.

For books it's fine (Charlie: can you prod someone to release Saturn's Children in something other than Kindle format? Is it DRM'd, or could Calibre read it?). But most of my reading material is scientific papers, in PDF format.

Double columns. Lots of equations. Which I get to read in flyspeck 2.

When you try to resize equations in PDFs, unlike text they get destroyed. This is because PDF is a graphical format, not a semantic one: the equations are not in MathML, but whatever font makes them look right on an A4 page.

The trouble is, just about all scientific docs I see now are in PDF format. Information is then lost in archiving them this way. Ok possibly you could OCR it, but OCR is still cr*p at math notation: correcting OCR'd docs is slower than re-entering them by hand.

(Now, PDF can have some semantic content, and there are ways of doing "good" PDFs (e.g. tagged PDFs), but most software doesn't do this.) So I carry literally thousands of A4 pages of papers purely because of bad software.

A jihad for ODF and mathml is needed, I think.

29:

Bjørn, why would Charlie or anyone else want to pay large amounts of money to fund a reverse-engineering effort (or buy a commercial RE'd translator, which is effectively the same thing) when they could just use an open format now and sidestep the whole issue? And, from an archeological perspective, I'm not sure I'd want to lay my faith in The Free Market(tm) to provide solutions to niche academic interests. And, in some situations and territories, such solutions might also be illegal, were the legal framework the same as it is today.

I don't see an export button on GMail, for instance
It doesn't need one. Hint: it's a mail server.

30:

Bjorn: there is indeed an export button from Gmail: it's called imap.gmail.com, and I use it routinely. (In fact, I don't use the webmail interface at all, if I can help it.)

I am, however, pessimistic about getting at elderly data formats. Getting into MS Word might be possible, but what about Protext or LocoScript? Those are going to be dead, dead, dead in fifty years. And I know a couple of novelists -- serious ones -- who still use those platforms for their work. Don't rule out the importance of preserving minority formats.

Alastair: I have no power over my publisher's choices in ebook formats. Hell, my editors don't get a say. (That's corporate IT policy -- set several levels up and several divisions across from them, inside a multinational publishing group. And I have yet to see a multinational publishing group ebook policy that has been written by people who can find their arse with a map, a tub of Vaseline, and a periscope.)

Agreed about the evils of PDF. On the other hand, once we have A4 300dpi e-paper displays, it oughta be just about controllable. (And pigs will fly.)

31:

Bjørn Stærk, so basically your solution to the problem is "I don't need to care now, somebody somewhere sometime will make the problem disappear".

Well, thank you very much but in this age of pervasive software patents and increasing pressure towards hardware-protected data paths I'm not going to trust my data to anything that is not free, open-source, and standardized, with publicly available and free-of-charge specifications, RIGHT NOW.

Are you sure that your elusive company is going to be able to reverse engineer a proprietary format when the parent company is not around anymore and nobody can really find out who's in charge of their "intellectual property"?

32:

Canis: "Bjørn, why would Charlie or anyone else want to pay large amounts of money to fund a reverse-engineering effort (or buy a commercial RE'd translator, which is effectively the same thing)"

There's a large difference between funding a product and buying it. About the difference between 1 million euro and 10 euro.

"I'm not sure I'd want to lay my faith in The Free Market(tm) to provide solutions to niche academic interests"

No need to. Trust academics themselves. Or open source. Look - imagine if there was a way that we could recover half of all the documents that were written in Europe between 1700 and 1740. All it takes is that somebody spend a few thousand hours reverse-engineering an old technology. I don't think such a project would lack in either funds or volunteers.

And that's assuming you need to reverse engineer a complex format. With Office 2007, you could get all the text out by just knowing how to parse XML.
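As a rough sketch of what "just knowing how to parse XML" amounts to for an Office 2007 .docx: the file is a zip archive, and the body text lives in `w:t` elements inside `word/document.xml`. The function name below is invented for the example, and it ignores headers, footnotes and anything nested in text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more w:t runs.
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

No copy of Word is involved at any point, only a zip reader and an XML parser.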

"It doesn't need one. Hint: it's a mail server."

So what do you do when GMail shuts down? Select all + forward?

33:

John Wilson @ 23:

My pictures on the other hand are High res low/no compression JPG I get twitchy about the 'professional' RAW formats as there is no standardisation, each company's 'standard' is proprietary and requires special software.
It's more complicated than file format incompatibility. The image data in a raw file is dependent on the hardware design of the camera that captured it. Several camera-dependent transformations on that data are necessary to produce a camera and application-independent file.

However, there is now a standard raw file format: DNG. This was developed by Adobe, but they've published the spec, and provide a free conversion utility from camera-dependent files to DNG, running on Windoze and Mac, and there is a (relatively new) open source converter for Linux.


Canis @ 25:

Recently an architect friend of mine and I were discussing the idea of an open specification language for building components - self-documenting buildings.
At one point a few years ago, the architect Christopher Alexander was trying to recruit software developers to work on a similar project: a self-describing language for specifying architectural "Pattern Languages" and designs described in them. I've lost track of what Alexander's been doing lately, and am not active in the software community he was talking to, but looking quickly at his website, it looks like he's been developing manual processes, so that project hasn't produced anything yet.

34:

My path to computer literacy went by way of the Sinclair Spectrum and Sinclair 128 – unlike the ZX81, eminently useful even for quite advanced things and languages other than the built-in BASIC. From there, it was the Atari ST line, and as far as wp went, something called Papyrus. The program was feature rich and early on Word compatible to a large extent. The developers were also behind interactive intellispell, which they did a deal with MS on.

For work, I did however need better compatibility with MS Word (then in v6) so had to have a machine with Windows. For a while, I could be clever and run the GUI on top of IBM Warp 3.0, much more stable, but MS did some deliberate API changes that ruined this strategy. When Word finally crashed one day, overwriting every doc file it knew about, including those in the MRU list, with a corrupted temp fragment, I knew I couldn't take its eccentric behavior any more. Thus I became an early OpenOffice user, just when its Word compatibility was getting good enough for my needs.

By then I was dabbling in Linux, settled on Debian and never turned back. I currently run LinuxMint, an Ubuntu derivative, loaded with both GNOME and KDE. Happy.

35:

Giacomo: "Bjørn Stærk, so basically your solution to the problem is "I don't need to care now, somebody somewhere sometime will make the problem disappear"."

Yes.

"Are you sure that your elusive company is going to be able to reverse engineer a proprietary format when the parent company is not around anymore and nobody can really find out who's in charge of their "intellectual property"?"

Technical ability? Absolutely. These are not encrypted data we're talking about.

Legal ability? Yes. It might be a gray area, but let us assume that nobody knows how to read Office 200x documents in 2050. Some company somewhere probably owns all the old Microsoft patents, but they're not using them. So a company makes an Office converter. Nobody cares. End of story. Or maybe ex-Microsoft cares, and asks for money. They get it, (end of story), or they drive the company out of business, then (having realized there's a market there) make a converter themselves. Or maybe they just drive it out of business and then do nothing, because they are _evil_. Then somebody releases the source code on Freenet. End of story.

This isn't about proprietary vs commercial, Microsoft vs Linux, evil monopolies vs invisible hand. We are talking about a simple problem, (at least from my perspective as a programmer). The real issue is if there is sufficient motivation to solve that problem.

Charles Stross: "there is indeed an export button from Gmail: it's called imap.gmail.com"

Huh. That actually never crossed my mind. I'm beginning to think too much like a web person.

"Don't rule out the importance of preserving minority formats."

Yes, and that's where the threat lies. Not with Word, but with obscure formats. And not because it's technically difficult, but because there may be insufficient motivation to do it.

36:
There's a large difference between funding a product and buying it. About the difference between 1 million euro and 10 euro
Well, I originally wrote "...the same thing, except on timeshare" but I left it out in the end, figuring people would get the gist.

Spread out or lump sum, it's still a million € that could've gone to better things.

So what do you do when GMail shuts down? Select all + forward?
It's a mail server. It supports both POP and IMAP. You just hook up your client of choice and click "get new mail".

37:

I use Mac OS X because it works nicely, I find it easy to use, and it's open enough that I don't feel like I'm getting locked in if I ever want to move away from Apple.

My mail is in a format close enough to mbox that I'm happy with it, graphics I create are stored in multiple formats, but a layered tiff and/or a pdf is close enough to a standard that I'm happy with that too.

Minor nitpick with this article, and the previous one though. There's no slash in Mac OS X

38:

Damn right, Charlie. Large chunks of my own PhD - documents that I created on Word2 on Windows 3.1 (still the best...) are now unreadable on my current set-up (Word 2003 on XP). Procedure: (1) Open using OpenOffice. (2) Save as .rtf. (3) Open in Word.

Perhaps I should have pasted them into WinVN and posted them all to USENET?

However, I am now going to say why I stick with this particular set of defaults. All my work communications (including quite a bit of multiple comments on complex documents) are, as yet, predicated on me using Outlook, Word, IE and Acrobat. I'd have to translate everything both ways, and I'd be forever labelled as the Faculty Software Weirdo-Bore. More to the point, if I got stuck I'd have to sort it out myself, because my next-level IT support are good at solving problems in MS Office, but don't want to have to learn a whole different suite. So won't.

I know it's wrong, but I keep on doing it.

On the other hand, my workplace is the world's largest user of Moodle, so the default may yet move towards freedom.

39:

Chris: You're a victim of peer pressure just like everyone else in the workplace :(

I took a geeky vow around 1999 or so to only use free software and free/open file formats in my work, and so far it's working quite well :) So it IS possible if one accepts the consequences. And interestingly, people are slowly catching up; the formats are more and more accepted. It's great to see this happen.

But like Charlie, I have no problem playing proprietary games. My vow concerns only software and formats used in my work, or files given to other human beings :)

40:

Re Gmail etc... the easier way to do that if you're just after backups is to install getmail or similar and have it run every X time units (once a day or whatever). The IMAP client approach is better for most people but suffers from the human issue - most people will not actually do it.
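For anyone who wants to see how little is involved, here is a minimal sketch of the same idea using Python's standard imaplib and mailbox modules instead of getmail. The function name and file names are made up, it does no de-duplication (run it twice and you get every message twice), and it assumes you have already logged in to the server.

```python
import imaplib
import mailbox


def backup_folder(conn: imaplib.IMAP4, folder: str, mbox_path: str) -> int:
    """Append every message in one IMAP folder to a local mbox file; return the count."""
    conn.select(folder, readonly=True)          # read-only: leave server flags alone
    status, data = conn.search(None, "ALL")     # data[0] is a space-separated id list
    box = mailbox.mbox(mbox_path)               # created if it does not exist
    count = 0
    try:
        for num in data[0].split():
            status, parts = conn.fetch(num, "(RFC822)")
            box.add(parts[0][1])                # parts[0][1] holds the raw message bytes
            count += 1
        box.flush()
    finally:
        box.close()
    return count
```

Usage would be along the lines of creating `imaplib.IMAP4_SSL("imap.gmail.com")`, calling `login(...)` on it, and then `backup_folder(conn, "INBOX", "inbox-backup.mbox")` from a daily cron job. The result is a plain mbox file that any mail client can read.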

Charlie's concerns extend to the online world too. What happens if Facebook disappears (or Bebo, Myspace, etc)? Millions of people have uploaded photos, etc up there. What happens to those? Sure, most of the photos are of family holiday parties, etc. but even that could be important to some future anthropologist.

41:

I run System 9/10 Macs and solve my access and archiving problems with my old documents in the following ways.

i) When I buy a new computer I just network it in to all my existing ones. If I find an old document from 1990, I can send it back to my Mac Plus running System 7 if need be.

ii) I convert the contents of any document in any program I have to PDF files by initiating the printing of the document but saving it as a PostScript file once the print driver interface comes up. I then open the PostScript file with Adobe Illustrator and save it as a PDF. (Any good PostScript reading program should do.) This keeps the formatting and layout of the original document.

Programs with proprietary formats have a weak spot. The documents have to be able to be printed, so they have to be set up in such a way that a LaserWriter driver can read them and convert them to a PostScript format. (Some LaserWriters have drivers that produce more universally readable PostScript files than others, but you don't have to have the actual LaserWriter, just the print driver, to do this.) I assume the situation is similar for Windows machines.

42:

I'm a fellow mbox user and have mail archives going back to the late 1980s in the format that are perfectly usable today. The original backup tapes have all fallen apart, but by moving them to modern RAID drives (thanks to the timeless tar format), the contents have been preserved.

On the file format front, the Long Now Foundation is working to prevent the "Digital Dark Ages" in which we're unable to read anything produced from this era. Even with a codicil, I'm not sure that it is anywhere as easy to handle digital media and formats as older analog ones.

As a personal example of how quickly technical obsolescence appears, recently I found my parents' 1964 wedding footage, shot on a Super-8 camera. We were able to identify it easily by holding it up to the light and easily convert it to a modern format by telecine. Even if we were unable to locate a Super-8 projector, we could have scanned the film strip and done some image processing work to extract individual frames from the image to produce a movie. In the same box was a Betamax tape that we believed to be my uncle's 1980 wedding, but there was no way to play it back to verify the contents nor to do a trivial conversion to anything else.

Knuth created TeX for exactly the reason that you discussed -- his second printing of TAOCP was "ruined" by an upgrade in the typesetting process at his publisher, so he designed his own that would never be changed. Since the input files are plain text, there are no conversion issues (other than line endings, which are ignored anyway). The design has been frozen since 1989 and only occasional bug fixes have been required. In fact he pays a hefty bounty for any bugs that are discovered, but very few have been reported.

43:

Hudson: I used to work with a guy who has not one but two cheques from Knuth. Both of them post-1997.

And yes, I've been thinking about the digital dark ages for a long while; a chunk of the plot of GLASSHOUSE was predicated on the question "what will the "dark ages", 1950-2040, look like 700 years hence?" ... and then there's this.

44:

rick@40: Best precedent for this would be when Pownce closed down recently. They gave everyone two weeks to export their data then shut down.
That's not really a long period of time in the grand scheme of things. You could easily have been on holiday during that period and come back to find your Pownce profile gone and no way to retrieve your data. If this happened with a web app like Zoho Writer you could easily lose important documents without a hope in hell of getting them back.
As far as I know, networks like Pownce aren't under any kind of obligation to let you have your data if they shut down. If Facebook went belly-up there's no guarantee you could get your pictures and other content back. Maybe sites like this ought to sign up to some kind of voluntary data preservation charter backed up by a third-party organisation so that there's some kind of guarantee that if the site closes down you'll still be able to get your information for a pre-specified period of time. A year would be reasonable. Or maybe there could be a company that offers content insurance - the site in question, say Facebook for instance, insures its content against insolvency or closure and if they have to close down the insurer steps in and pays for the content to be available for export for a specified period of time. Something like that seems prudent to me.

45:

I learned my lesson about data preservation in college: I don't have any of that work anymore because of my choice of word processors. All of my writing now is in vi, and when required, I render it into other formats (Word, PDF, whatever).

I was worried about my pictures being mostly taken in Canon RAW, but worked around that at first by making sure to archive a quality JPG, so that there would be something. Now, with DNG, that's fixed, too.

My biggest concern at this point is database related; I'm a database geek by profession, and my last 15 years of financial information is in one. I do export it to text representation, so it isn't that big a deal and I don't expect SQL to go anywhere (for better or worse), but simply reading a text dump of a 47 table DB to try to answer a simple question is less than optimal. And this is just a schema for my relatively uncomplicated finances; I've worked on projects for companies with multiple, multi-thousand table schemas. Even dumping those can approach impossible, for interlocking business and technical reasons.
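For what it's worth, the "text dump is hard to query" problem can be softened by reloading the dump into a throwaway database. The sketch below uses Python's built-in `sqlite3` module; it assumes the dump is SQL that SQLite can execute, which will not be true of every vendor's export, and the helper names `dump_to_text` and `load_dump` are mine.

```python
import sqlite3


def dump_to_text(conn):
    """Serialise a whole SQLite database as plain SQL text."""
    # Connection.iterdump() yields the schema and data as SQL statements.
    return "\n".join(conn.iterdump())


def load_dump(sql_text):
    """Rebuild a queryable in-memory database from a SQL text dump."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(sql_text)
    return conn
```

The text file stays the archival copy; the in-memory database is disposable and exists only to answer the question of the moment, e.g. `load_dump(text).execute("SELECT ...")`.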

I'm reminded of Vernor Vinge's _Deepness In The Sky_, explaining archives that can't be duplicated, but also _Guns, Germs and Steel_, where I first got the notion that ownership of data was actually a curation responsibility, not actually an ownership position.

46:

Charlie -- did your friend cash his Knuth reward cheques? Unfortunately in October 2008 he stopped sending real cheques due to bank fraud. As of 2008 there were only 427 known bugs, which is pretty phenomenal.

Perhaps it will be in everyone's best interest if every ten years or so there was some sort of infopocalypse in which Facebook et al lost all of their older data. That way the embarrassing or actionable photographs from teenage parties would not live on to haunt future Presidents or MPs.

48:

An old CP/M machine (WordStar, unreadable 5.25"), an Amiga (data more or less completely lost, even if emulators exist), WordPerfect on DOS (still somehow accessible by Word), WP and Word on Win3.1 to WinXP. Stuck with Word 2000 at home and Word 2000+1 at the office. Thinking about OpenOffice/StarOffice a lot (Asus even ships its XP Eee PC with it), but my writing practices are MS Word practices, and I don't want to relearn them, and my document formats are MS Word formats, and I don't want to redo them. Not sure about the next step.

49:

I like Linux a lot - and thanks to Wubi I am trying it again. If I can get dual monitor support to work properly this time I will probably try to stick with it.

That's always been my sticking point in the past - getting video to work properly. Once that gets solved I will feel a lot more comfortable using Linux.

50:

I like the last sentence of fishbane's comment.

I suspect that many of the problems mentioned here will diminish somewhat, because the software changes that occurred in many programs had more to do with changes in hardware and capability. Computing is a much more mature technology than it was in the '90s. Computers are still increasing in power and all that, but net utility has been increasing at a much slower rate.

I also think that the combination of emulation and a philosophy of distributed media--thumb-drives, discs, hdds, clouds as well as distributed computing such as PDAs, ebooks and cell phones--will drive programs towards supercompatibility. We will have more problems keeping data confined to a specific workspace than preserving it as time goes on.

Lastly, I think these concerns are ultimately silly. Not silly for anybody who *works* at this stuff like the author here, but this stuff is a matter of metis. Metis is always ephemeral, and there is no way you can bottle success at an occupation. Moreover, both the metis and the tools that are used in the activity are one way or another cloudspace-derived. That is, the user is *always* dependent on the community. When Charles Stross chooses open source software, he is really talking about which community he trusts. I know he said this in his post, but I think it's important to emphasize just how deep and multidimensional the statement is. The community that stores the skills best gets used the most.

51:

It's a problem archivists are trying to find an answer to. It's increasingly common that when they're given a deceased person's "papers" there is actually an assortment of disks included. What used to be personal letters, manuscripts, notes, journals are now emails & word processor files. And things like blog posts are entirely lost. Who knows how they'll deal with social networks if they become a permanent fixture of our culture.

For text I wonder if an international archival standard bitmap might not be the way to go. Converting it into editable text again could be done via OCR, assuming fonts aren't too unusual. But for other sorts of communications, I'm at a loss.

52:

First computer I built/owned was a Science of Cambridge Mk14 in 1974, although I had used other micro-based computers before then, including an Opus from the Open University. No word processors for something that simple. First word-processor experience was writing code for 8080-based dedicated WP systems, then showing pharmacology grad students how to compose their theses on the Uni mainframe in runoff and saving them 50p a page in typist's fees.

Word processing has had a lot of bells and whistles added to its functionality, growing from a "simple" text entry and formatting tool to something much more capable than a dedicated desktop publishing package of a decade ago. The underlying file formats have had to change to support that extra functionality. There is a commercial decision to be made to stop supporting older formats; it costs money and effort to include that support in new packages and the returns (in terms of profit) are minimal. As someone else pointed out MS' obsolete formats are actually documented and publicly available. The whine is that it takes effort to understand the formats -- this reminds me of the old programmer's joke about badly/un-documented code. The expectation that MS should spend money to fix this problem is, I think, misplaced.

As for archiving stuff, I'm not sure... we create a lot of digital information these days, and I doubt even 0.0001% of it will ever be of any interest fifty years from now even to the most obsessive researcher. Forgetfulness can be a blessing, sometimes.

53:

Bruce @33 --

dcraw, Dave Coffin's raw converter, has been around for quite a while now. It knew about DNG pretty much as soon as there was DNG to know about. So I'm a bit curious about which utility you mean.

Bjorn @27 --

Lots of information was lost, we just didn't know what the information was. We lost the information required to print the document. (Information causes change; that's how you can tell it from data. There was definitely a change.) Minimum time estimate to do the reverse-engineering again for Word was given as a quarter, three months. The document had to ship within a week. Being late at all cost major money.

Was that a stupid situation to be in? Sure. Was it easy to fix? No. Was much of the problem treating Word as a stable format? Oh yes.

54:

"why bother using Linux? It's so much easier to use Windows."

My experience is the exact opposite: I started using Linux because doing stuff with a collection of shell and awk scripts was easier than using MS's bloated tools.

55:

The problem of proprietary data formats going stale and unreadable is a real one -- even non-proprietary but obscure ones suffer from this -- but Charlie, I have no idea what you are talking about when it comes to Word docs. I just checked and I can read MS Word documents I wrote as far back as 1994 (which is the oldest document I can even find).

56:

Your linkage of "open formats matter" (true) with "using Linux is better than Windows/OS X etc." is false.

Your point about open formats is very important, and I always liked your characterization of the "information dark ages" in Glasshouse. However, you can use open formats in most OS's. OpenOffice that you use can be run across OS platforms, so that should make the OS choice irrelevant.

What might be relevant about OSes is rent-seeking by commercial organizations that try similar games with the OS: the move from 8 to 16 to 32-bit systems, the deliberate removal of old OS versions. OTOH, fossilizing the OS will prevent technology improvements through new concepts. Since data is important, but not the platform, I am of the opinion that data should have open formats (preferably published) but that OS platforms can be whatever you like best, with support for the data formats likely to be an important criterion for purchase.

57:

I had decent computers at work, back in the late 70s to mid 80s and never had a computer at home until after I had the first renal failure and didn't go back to work. I haven't had transition problems with data (I've only owned four computers, one of them brand new). But because I don't write often, I didn't install Word when I put XP on this one, for example. I use an earlier version of MSWorks, and actually usually write in Wordpad.

What I do to save data is to copy it onto a thumbdrive and swap it into my safe deposit box at the credit union every week. It does take some organization, but I wouldn't want to leave things on other people's servers.

John (Gordon) Faughnan @11, re: health information s/w, I'd be happy for lab software that copied over the standing orders. Kaiser gets new lab software at least once a year and I have to take my list of 23 standing orders and email the doctors to ask them to put them back in. I'd also like lab software that is keyed to me instead of the doctors, so I don't have to bring the list and prompt the lab techs into getting all the orders right when I come to be stuck.

58:

Graydon @ 53: My brain slipped a cog and forgot about dcraw; I was talking about digiKam's converter.

Dave Robinson @ 49: The last 2 computers I built for work had a primary boot partition with Ubuntu 8.04, and both of them had dual 19" displays. The graphic cards were Nvidia, and the Nvidia drivers (which are now open source, I believe) made it relatively easy to set up the 2 displays as a single huge desktop. That said, I have to say that making that setup work correctly with KDE was a pain in the wazoo; Gnome was relatively easy, but it took me quite a while to get KDE to hook up to both displays correctly. Don't ask me why; I'll just look grumpy and growl at you.

59:

Charlie: Glad to see you on a roll, both with posts and with topics that draw interest, at the least mine but it seems to be more widespread than that.

Rescuing old data is sometimes sadly at the mercy of compatible hardware as much as software standards, which can be even uglier in its way. Some hardware and the standards they rode in on have already long since passed and legacy computer electronics seems to be even more of a specialized realm than legacy A/V gear.

My new laptop and OpenOffice go together like, well, I don't know how to describe it. My desktop just updated to OpenOffice3 and it still loads painfully slow - perhaps by a factor of seven.

Like many it seems before me, I ran into the dreaded pay-upgrade treadmill with MS products. I absolutely cannot justify to myself spending the money on MS Office 2007, even the most lightweight version. Even MS Office 2003 is an expenditure I could do without.

I had not greatly considered the archaeological aspect of old data until reading some of the replies here. In much the same way as people who never do backups have the moment where a power surge or hard drive failure brings them into a new point of view, I think there will have to be a generational data-loss of irreplaceable nature for people to consider seriously how important access to their digital information is for their heirs, assigns, or lawyers. Third party services, in some ways similar to code escrow services, could definitely serve a role in such a situation. Though a death certificate and other legal proofs tend to unlock barriers such as passwords, a major component and difficulty is determining what information is available - especially with web based applications.

Given the number of e-mail addresses, logins/profiles, and other data storage methods that a typical heavy computer user accumulates this will prove more difficult over time. Tools such as legal discovery in the US allow huge dragnets to be cast for information, but only if the proper parties are involved.

Taking a step beyond the death of the immediate producer of these files, you have the death of the company. As I mentioned above, code escrow seems to be a favored method for some contract situations. This ensures that, upon the (inevitable) demise of the company, parties whose information is locked in a proprietary format have the legal ability to access source code which allows them to transition their information out of that format or even possibly maintain the format themselves. Mind you IANAL and I am absolutely certain that even should Microsoft suddenly implode there would be other parties immediately purchasing and maintaining viciously the rights to their source code as property.
But at least it helps.

However, there is as clearly stated here an easier solution. Open formats. MS's OfficeOpenXML is a nod in the right direction, but nowhere near enough. As part of working through the MS Office 2003 to MS Office 2007 transition at a .com running Gentoo on server side, we learned exactly how difficult they made it to work with when we started working on support for that file format. Especially before its release, just stripping the files down to nothing but raw text was at least irritating though far from fatal and caused our developers a few weeks of headaches.
So, @Bjorn: It takes a fair amount more effort than a 'simple' XML parsing to clean up documents from Office 2007 or even 2003. Direct, uncleaned text output is easier. But you've got multiple stage issues to deal with - embedded spreadsheets, pictures, change-tracking, notation, and other layers of data which ARE lost or at best scoured of proper reference points. This information can be just as important, from archival and historical standpoints.

Even RTF is not truly acceptable - MS wrote and owns Rich Text Format. Though they might suffer from not having protected it from possible infringement, it is theirs. While RTF is marginally better than the primary Word format in terms of cross-platform usage, from a legal standpoint it is exactly the same. If your license expires and cannot be renewed you are in the same situation as if you were using MS Word.

Much the same can be said for other things. Hotmail. You want to access Hotmail using Thunderbird? You have to use a customized "webmail" extension, because MS doesn't believe anyone using a non-MS email client should have access. Mind you, Hotmail is almost as bad as AOL mail. But there you are.

IM is another issue. I know that open IM clients have faced great resistance and even legal challenges against the branding companies and their proprietary chat software. Not to mention the neutered corporate versions, such as MS Communicator, which of course lovingly archive everything done into a neat little proprietary package.

While the fact that some things will eventually become 'abandonware' and thus supported by a (small) interested community is nice, it remains largely in the realm of the possible and not the absolute and certain. Looking at it from a very long term point of view, we could save our descendants tens of thousands of hours of work.

Open source and free software may not be for everyone. It may be painful for many people to even consider changing email clients, web browsers, word processors, spreadsheets. Learning new things means a loss of time, of effort, of money, of patience.

I'm amazed nobody has broken out the "total-cost-of" arguments, honestly. But those get to be circular extremely fast, so hopefully no one does.

Last thoughts before sleep takes over:
Free and open source projects often work on shoestring budgets, if any. That means that if you make a suggestion and a donation, your voice means something. You can be heard. If you're not heard and are capable you can fork it and implement the change or suggestion yourself. Or pay someone to program it directly - and that work can then be donated back into the community, without increasing the cost to you and likely without loss of revenue. All this without violation of your software license.

When you want a special feature added to a proprietary software program, you have to either pay the "one off" and then continued support fees (likely far outside your budget) or hope there is enough community pressure to implement this feature. If you're a small subset of users with a specific need, don't hold your breath. Many of the in-house workarounds or programming slap patches directly violate license or ToS agreements, but have to be disregarded because there is no alternative. Given how draconian licensing and ToS have become, this is likely to get worse rather than better.

60:

Robert Sneddon wrote: "an Opus from the Open University".

Tell me more. I have a terrible feeling that we (the OU - although my personal responsibility for this one is limited because I'd have been about six at the time) haven't kept one of everything.

61:

Bruce @33
"However, there is now a standard raw file format: DNG. This was developed by Adobe, but they've published the spec, and provide a free conversion utility from camera-dependent files to DNG, running on Windoze and Mac, and there is a (relatively new) open source converter for Linux."

That I didn't know, I'll go hunting. I've played with Linux on and off (last full system install was SuSE Linux prior to the Novell takeover - then the HD died). I'm toying with installing Ubuntu on one of the systems here but I guess I'm nervous and lazy (and I need time to play!)

62:

Same here. So glad someone introduced me to TeX at university (1992). Haven't lost a document since. I think I've all my email since 1995. As with Charles, stuff before that is lost.

63:

Matthew @43 - I expect something like that will only emerge after two or three high profile data losses for any of the variety of reasons discussed here and elsewhere. Even after that, I'm sure there will continue to be new and interesting ways to lose data.

Facebook's terms of use make it clear that they own everything you upload to them and you just have a license to use it as they permit. This is probably to avoid people suing when their embarrassing teenage photos etc. get shown to everyone with a web browser and their mother, but if they own it, in theory they can just dump it.

So (referencing rick @40 and carbon sequestration from 2 threads ago) clearly we need to insist they keep copies in underground bunkers. On paper from renewable sources. In triplicate. This will remove carbon from the atmosphere, have a stupidly unwieldy backup that politicians can actually understand, a valuable resource for future historians and in a few million years valuable deposits of coal and/or diamonds.

64:

Chris @ 60: This is from memory, which in my case is faint, but while I was at Stirling University in 1974-75 or thereabouts doing a computing course I got to play with an Opus (I think the name is correct). I was told it was designed and built for the Open University as a tool for teaching introductory computing fundamentals -- I do know that Stirling ran OU summer schools and as one of the few Scottish universities at the time with a computing department they might have operated that sort of school and hence been in possession of lab kit such as the Opus.

It was a small wooden box with a small keypad, some toggle switches and pushbuttons and some (two or four) 7-segment displays. On powering it up a couple of dozen bytes had to be toggled into the RAM to get the keypad to work, and after that code and data could be entered in hex a bit more easily.

It might have been based on a 4040 or an 8008, I can't remember, but I don't think it was an 8080. The Z80 was still in the future.

65:

I don't think the idea of closing off our access to our old data even makes sense for money-grubbing corporations - how can it benefit them? All they gain is bad feelings on the part of users who have lost out to them and an increased chance that said users will gravitate towards friendlier products.

It pains me to have to use Windows, Orifice and Outlook at my job but we're tied in to Microsoft for some obscure reason which everyone seems to have forgotten. My attempts to move things over to open source have been pitiful failures so far but I've only been here two months and been moved from contract to permanent so there's still hope.

By the way thanks again for the Linux post the other day, finally installed Eeebuntu and I couldn't be happier. So much simpler for an OS noob like myself...

66:

I really wish there was a realistic, open alternative for film/video editors and effects artists. Avid and Final Cut are not particularly interoperable (apart from with EDLs and OMFs, which are not 100% reliable). And when it comes to After Effects / Shake / Nuke or anything like that, it's just not happening. There just doesn't seem to be a free / open alternative. So much of our history is recorded in a visual, documentary form that this seems to me to be a real problem.

67:

'But yes, it's going to be an interesting cultural problem when everybody wakes up and realizes 99.9% of the information that they've assembled during their life is going to go away when they die.'

A nice point but what's the significant difference between losing 99.9% or losing 99.99% like everyone who died in the last million years?

I also note that no-one has suggested _printing_ their work as a means of saving it for posterity...

Too many nerds in here!

68:

I agree completely with Charlie about format lock-in. I don't mind so much about ephemera, but for anything I expect to keep for more than a day, it needs to be able to be stored on my home Linux fileserver, and it should at least be theoretically possible to get at the data with no more than a trivial amount of Perl. Thus, I didn't bother getting a DVD player until the format was well and truly cracked, and so BluRay is unlikely to grace my home any time soon.

As for the specific complaint about Apple's Mail.app being proprietary, this was trivially fixed by not having Mail.app be the canonical repository of my mail: I run Dovecot - an IMAP server - on a Unix box which holds my mail going back to 1991 in Maildir format, and point Mail.app at that. Mail.app turns out to be rather more efficient at grepping many gigabytes of archived witterings than, er, grep.

IMAP also offers the rather useful benefit of having multiple clients to the mailbox and they all magically keep in sync.

69:

This 2nd dark age thing is something that regularly worries me too. Photos, Music, everything that everyone seems to commit to a hard drive is waiting to crash into non-existence. And most people don't back up. Future generations will not find hauls of old records and photos in the loft that belonged to granny. Most of them won't even know what granny looked like when she was young!

Pretty much everything in our age is designed without thought for the future. By the way - your books are printed on paper so acid that in 10 years they will all be brittle and yellow! In 100 years they will be dust, I suspect!

I have books by the likes of Asimov, Heinlein, Doc Smith, Van Vogt, etc. from the mid 70s that are all but unreadable without the dark yellow brittle pages falling out of the glue binding. Conversely my wife bought a mid/late 19th century volume of Byron poetry on ebay last year and apart from the odd black spot the well thumbed volume is absolutely fine.

70:
"I also note that no-one has suggested _printing_ their work as a means of saving it for posterity..."
That's because it's actually not a very good method, unless you take great care over the choices of paper and ink you use. Otherwise you can easily come back to a piece of faded, crumbling cellulose that you could've sworn used to have something on it...

There is such a thing as archival paper and there are places that specialise in printing documents for long-term storage, but I believe most office or home printers and consumables aren't suitable.

71:

Bruce Cohen @ 19: It seems to me that whenever people feel the need to write something off as "failed" multiple times, it is anything but.

This discussion has been very enlightening to me. I think my insistence on sensible data storage, which has never been too mild, will increase.

72:

Alex @56 --

Actually, you're wrong about that, though perhaps not for the reason you might expect.

Fundamental constraint on software is the number of smart people available to work on it. Closed formats and code bases are a requirement to do work over again, and, worse, a requirement to figure out evolutionary peculiarities of how the work was done the first time. Given that the number of appropriately-smart people is limited, this is a huge drag on getting to the ideal state, where you only have to solve your problem, rather than a bunch of logistics and infrastructure problems that aren't specific to your goals.

The closedness of OSes produces a nasty tangle of complexity when one is trying to produce cross-platform software; this is why something like Qt can charge thousands of dollars per seat for a developer license, because that really is cheap at the price to abstract the platform-fiddling away for the closed platforms.

Just as an example, Eclipse, which has gods-be-feathered IBM behind it and pushing, depends on the OS for aspects of image handling. Which means that SVG works on Linux and doesn't work on XP, because XP has never heard of SVG and can't readily be patched to comprehend it as a mime type.

For those considering Linux because of Charlie's post and/or this thread, there's a basic step no one seems to have mentioned yet.

These days, instead of the research, install, and pray method, you find out about hardware capability by downloading a live CD for the distro you're considering. You burn that, stick it in the optical drive, and if it works and you can do what you want in the way of playing music or whatever, that hardware is compatible. If it doesn't, you haven't altered your hard drive in any way, so no harm done.

73:

canis@70 "There is such a thing as archival paper and there are places that specialise in printing documents for long-term storage, but I believe most office or home printers and consumables aren't suitable."

Interestingly cheap newsprint has lasted for well over 100 years - see your local library for archived newspapers, and public dumps seem to preserve newsprint quite readably for over 50 years at least - longer than any electronic format has been around.

As an SF writer, I'm surprised Charlie hasn't considered/mentioned the consequences of sticking with -ix based OS's. If OS's based on command line operations are being used just a few hundred years in the future, the evolution of language alone will make the unix commands less and less comprehensible and more like a symbolic incantation. If alphabetic languages become the minority versus pictographic ones, like Chinese, this again makes unix fundamentally problematic.

As for the longevity of file formats, the real issue over the long term is not the format, but rather preservation of the file's integrity. Simple bit-flipping can render a file unreadable, compared to hard copy text and images that are still decipherable after serious degradation. The Archimedes Codex being a good example - it would have been totally lost if it had been recorded on reusable media.
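Bit-flips can at least be detected, if not repaired, by keeping a checksum manifest alongside the archive. Here is a minimal sketch with Python's standard `hashlib`; the helper names (`sha256_of`, `build_manifest`, `changed_files`) are made up for illustration, and a real archive would also want redundant copies to restore from once damage is found.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """List files that are missing or whose contents no longer match."""
    root = Path(root)
    damaged = []
    for rel, expected in sorted(manifest.items()):
        path = root / rel
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(rel)
    return damaged
```

Build the manifest once when the archive is written, store it somewhere separate, and rerun `changed_files` whenever the data is copied to new media. A single flipped bit changes the digest, so silent corruption shows up before the last good copy is overwritten.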

74:

Alex Tolley @ 73:

"the evolution of language alone will make the unix commands less and less comprehensible and more like a symbolic incantation."
The names are already pretty much gibberish to most people who didn't grow up along with Unix. Who knows what 'grep', 'sed', and 'awk' stand for, let alone 'cat'?
"If alphabetic languages become the minority versus pictographic ones"
This seems highly unlikely to me. Alphabets have significant advantages in terms of representing grammatical forms and variations of tense, number, person, voice, and mode; these in turn make alphabets much more flexible in handling new concepts. Alphabets allow change in the language itself more easily as well, since it's easier to become fluent enough in an alphabet to coin a neologism that can propagate quickly through society.

75:

Alex Tolley
I have a scrapbook with newspaper clippings in it running from 50 to 100-plus years old. Some of them are in fairly good shape, and others are yellowed and fragile. It's a matter of the paper.

You can actually get archival-quality (acid-free at least) paper for home use. The inks are a problem.

(I remember 8-inch floppies. And 14-inch RK05 hard disks, holding 5MB, or 10MB if you had a 5F. Yesterday I bought a 3-inch 1TB hard drive for half what my 30MB drive cost in 1987. I'm still croggled.)

76:

Hi, I'm a PC user. I have a friend who sent me a link to this thread and after reading several of the posts I felt compelled to reply.

I do not follow the Windows lock-step because I want to but merely as a matter of convenience. I could easily get into Linux as an example of my 'If it doesn't need bells and whistles then don't give it any' philosophy but I have to admit that the idea of learning to operate a new system, no matter how simple or convenient, bothers me a little.

Imagine that you drive your car to work every day. Your car is not the best but it serves the purpose. If it breaks you need someone qualified (which you likely are not) to repair it. This generally costs you money. You pay it because you don't have the time, the energy, the skill or the tools to learn to repair a car by yourself.

Now you COULD drive your car to a common parking spot and ride a bike to work because the high price of gas bothers you or because you don't like the idea of having to rely on an expensive and possibly ruthless mechanic to keep the car running. But would the inconvenience really be worth it in a world where so many drive cars?

I equate many Linux users with the at-home mechanics who work on their own cars. They have the tools, skills and know-how to do all but the most serious repairs and they have these because they've been doing it for so long. Their friends, however, couldn't care less about how their car operates and would rather spend their time doing something else. Neither side is wrong in their thinking, merely different.

After reading the top post I noticed several things:

1) The author was involved with computers for quite a while. He apparently used them frequently both at work and at home and so HAD to become well-versed in their use.
2) He lost his 3" floppies for his PCW. No offense but if the data on them was that important then he should have backups. His hard disk then crashed and he lost yet more work. You see a trend? Important data MUST be backed up, preferably in multiple formats. I'm an idiot and I know and practice this.
3) How much of the stuff that was lost was really important? If you look in your computer, on your shelves, in your closets how much of the stuff that collects there is needed and how much of it is so much junk? A couple of years ago I had a catastrophic computer failure. The experts managed to remove the one folder with all of my writing ideas and current work and transfer it to CD for me. All of the rest of it was lost. My wife and I spent a little time every day over the past month reconstructing our lost data from other sources...mostly hard-copy. At first she lamented over the loss of our Favorites list but after a week we sat down and thought about how much of the stuff we'd lost was really that important.

Pictures: Nope, got hard copies of the important ones. Many others were sent by friends or downloaded and so could be duplicated.
Email addresses: We remembered some, got mutual friends to send us more and simply called on the phone for the rest. A good reason to keep in touch by the way.
Financial records: Nope, got hard copies of all of that.
Writings: Nope, managed to get that removed and saved and now I have a stack of CDs to prevent this from happening again.
Games: We still had those and reinstalled the ones we wanted. The rest were given to friends who might actually play them. The lost time was like forgetting an episode of television...it was entertaining at the time but now we need to forget and move on.

My point is that relying on an outside source for something that we use in our daily lives is not new, unusual or necessarily a bad thing. How many of us generate our own electricity or water? How many groaned at the rising price of gasoline but paid it anyway because we had to?

Linux is, IMHO, the car that car buffs drive. They can work with it, tinker under the hood and generally spend enjoyable time with it. If it breaks they can probably fix it too and that's always a good thing.

I don't care how my car works. I KNOW...but I don't care any more than I care what makes a plane fly or gravity work. It just does and that's good enough for me. Yes, I pay more money for the convenience but then I spend less time overall worrying about what makes my computer go. That time is worth more to me than the money I spend for upgrades, the money I offer to friends and family members to help me when it breaks or the store I buy the new comp from when the old one dies.

Linux and other open-source programs are not the savior of the world any more than MS is the anti-Christ. Each works for what they do and the people who want to work with them. Anyone who becomes sufficiently frustrated with the one is able to go to the other and no harm done to either side and that's all for the better.

77:

Chas: Hi, I'm a PC user.

There's your problem in a nutshell. Not the PC: it's the fact that you're a user.

Welcome to my blog, and feel free to hang around. But you might like to note that a large subset of the folks you're opining about are not users: we build or program the bloody things. (Bits I missed out in my essay: the comp. sci. degree and the years in the development group in the dotcom startup before I quit to write novels instead.)

I don't care how my car works ... It just does and that's good enough for me.

No, what's good enough for you is that other people care how your car works. And that there's an industry that builds cars. And worries about how to dispose of them after you lose interest in them.

We are those people. And you'd do well not to lecture us on our speciality.

78:

P J Evans@75: "Yesterday I bought a 3-inch 1TB hard drive for half what my 30MB drive cost in 1987. I'm still croggled.)"

Storage space has improved in cost and performance so rapidly that I am within spitting distance of being able to put all my 900+ DVDs on hard drives for a reasonable cost and being able to play them from a jukebox as I would with mp3 files. Unlike CPUs, there seems to be a looooong waaaaay to the bottom of this trend.

Bruce Cohen @73. Good point about unix command names already being effectively symbolic. Maybe it won't be so hard a few hundred years from now to use those commands, although a traveler to that era might not understand the pronunciation of those commands and the command's origin might become even more obscure. As for types of languages in the future, I agree that alphabet languages are more flexible (but I'm biased). However, we cannot assume that better wins. If some oriental culture that uses a pictogram language as its lingua franca dominates the globe, then it is entirely possible for alphabet languages to fade away. Consider what it might mean if in the future using a command line OS was like having to learn Latin or Greek before you could read texts.

79:

I'd like to throw in a major discontent with the car analogy.

All cars are not the same. Some cars are reliable; some are not. Some are good value for money; some are not. Some are fun to drive; some are not. Some do what you need them to do, somewhere in the range between "impress girls" and "six kids, two weeks' groceries, and the dog, hence to thence", and some just don't. Some do it, but not in a way that makes you happy. And so on; very big range.

If we're going to do the car analogy, MS Windows is the Detroit product, Macs are Toyota Priuses, and Linux has no good analogy; it would be something like the Stanley Steamer having been taken over by a workers' collective with lots of investment by heavy equipment manufacturers, and now producing its first commuter cars, after years of producing field survey vehicles and pit trucks. The folks who used to road-modify the field survey vehicles to do their grocery shopping with have mixed feelings about this, but generally agree that it's progress.

Also note -- you can have perfect backups, and still wind up scrod if the available hardware goes away on you. Punch cards and various formats of tape have been through this already. USB sockets will do it, given time. The 3" drives Charlie describes had a single, proprietary source. They went away, in the days before cheap huge hard drives. Oops.

As for "was it important", well, I'm an amateur (very amateur) photographer. A year's worth of photos is roughly 350 GB. Someday, someone is going to be a significant photographer who has done all their work in digital, and the artistic executors and biographers are going to want to be able to go back through that pile of digital negatives and see what's there.

80:

Alex @78: Plenty of people learn "dead" languages for the sake of scholarship. My sister can tell you roughly what something written in Classical Latin says. My mother-in-law knows biblical Hebrew.

Anyway, the functionality of a command line doesn't depend on the exact words used. Some of the oldest commands are already aliased to a different program behind the scenes. And 90% of them are either filesystem manipulation or logically deducible tools for handling text. You could write a working implementation of *NIX in not too much time (and people have).
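As a rough illustration of why the words themselves don't matter, here is a minimal sketch of a command table with an alias layer on top. Everything in it is hypothetical: the two toy commands only imitate the spirit of `cat` and `grep`, and the alias names (`show`, `find-text`) are invented for the example.

```python
from typing import Callable, Dict, List

Command = Callable[[List[str], List[str]], List[str]]


def cmd_cat(args: List[str], lines: List[str]) -> List[str]:
    """Pass the input lines through unchanged."""
    return list(lines)


def cmd_grep(args: List[str], lines: List[str]) -> List[str]:
    """Keep only the lines containing the first argument as a substring."""
    return [line for line in lines if args[0] in line]


# The behaviour lives here, under whatever names history happened to give it.
COMMANDS: Dict[str, Command] = {"cat": cmd_cat, "grep": cmd_grep}

# The vocabulary lives here and can be swapped without touching the tools.
ALIASES: Dict[str, str] = {"show": "cat", "find-text": "grep"}


def run(name: str, args: List[str], lines: List[str]) -> List[str]:
    """Resolve an alias (if any) to its canonical command and run it."""
    canonical = ALIASES.get(name, name)
    return COMMANDS[canonical](args, lines)


text = ["the quick brown fox", "jumps over", "the lazy dog"]
print(run("grep", ["the"], text))
print(run("find-text", ["the"], text))  # same tool, different word
```

Replacing the `ALIASES` table with names in another language or script would leave `COMMANDS` untouched, which is the separation the comment describes.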

81:

Hi, I'm a PC user.

Sorry, but after this comment I doubt most of even the casual readers here like me would bother reading the rest. The usage of "PC" to mean a Windows machine, which I became aware of from those John Hodgman advertisements on YouTube, is symptomatic of the (non)understanding you have of the issue being discussed. I see Charlie already elaborated, so that's all I'll say.
And I am by no means even close in technical proficiency to the people who post here.

82:

Chris@80 "Plenty of people learn "dead" languages for the sake of scholarship. "

But do you want to have to learn a dead language in order to do your actual work using a computer? Programmers will. But computer users, I'm not sanguine about that idea. Your point about name aliases is fine, one could do that indefinitely to handle language changes, although you could end up with a conflicting mess of nested aliases. But if the contemporary language is not alphabetic, then what do you do? GUIs handle language types easily compared to command line (although admittedly I don't know if someone has aliased a Chinese pictogram keyboard to -nix commands).

The point is why would you keep going through these hoops? If we, for example, compute through a brain interface at some point, would it make sense to retain a file based OS like unix, or map the OS more directly using some other approach?

83:

Evans @75:

I had a similar reaction when I bought a video card with greater MB capacity than the first hard drive I ever bought.

84:

Assemble what you want to preserve and print it out! Archive it carefully and store it.
Someone noted that newspapers can last for a hundred years? In my researching days I have handled books from the days when printing was first invented in Europe in the 15th century - the Chinese had it five hundred years earlier.
What is worthwhile lasts, and history tells us that writing or printing on paper can and does survive. It doesn't tell us that about any electronic means of storing words. Go figure!

85:

What is the durability of microfilm? How about micro engraving on metal plates?

86:

Elizabeth Patrick: "In my researching days I have handled books from the days when printing was first invented in Europe in the 15th century - the Chinese had it five hundred years earlier."

Yes, it depends on the paper. Acid-free materials such as vellum and papyrus have survived for at least 1000 years. Poor quality paper without the acids removed in processing degrades very quickly.

Theoretically, if you made your electronic storage media of metal, like tungsten, and stored the bits as etched notches in the surface, like the old laser discs, it should last for much longer than paper. How long the reading device would survive is another matter, which argues strongly for human readable formats for survivability.

87:

This is exactly the issue about self-documenting building components; you can't use RFID because the chips may just fail sooner than expected. You can maybe use QR codes, but you're limited to putting a URI on them, which means you're relying on a) that technology lasting, b) the WWW and c) me, or rather some institution. Even stamping the URI on the component relies on b) and c). I'm fairly confident of b), but the rest?

So I reckon it would need both a QR, or whatever, plus a printed card, or rather two, one attached physically to the thing, with the data printed on it. Which is tough if it's a nail.

But, y'know, if people are interested, I reckon it's a cracking idea.

88:

Subversive version: put the H&S record on it, and you could have an app that asks "Are you wearing your mask?" If not, "do you have a mask?" If not, file an instant report with UCATT and maybe the HSE. Perhaps mobile sw dev is the answer to false consciousness.

89:

Alex: we're not talking about using an ancient operating system for your everyday computing needs, but rather accessing files created by that ancient system on whatever system you happen to have. The ability to do one without the other is what Charlie is getting at. I can't remember how command lines came into it, but I blame the Windows users :)

90:

Elizabeth Patrick @ 86: Hard copy may be a solution for archiving text, and might be one for graphics and photographs (though in both cases there's going to be information loss), but how about video? Or audio (music for instance)? Or medical data like long-term EEG or EKG recordings? Or customized technical manuals (airliners, for instance, are custom-modified for each customer, and the technical documentation can be composed of tens of thousands of pages per model, of which any given page may be different for some customer or set of customers)? Or the records of large distributed sensor arrays (automated weather systems, ocean current monitors, etc.)?

As the capabilities of our computing systems grow, more and more kinds of data in different formats and organizations will be developed, and many of them will have archival requirements for longer than one or two generations of computer hardware or OS. Airplanes, for instance, often have lifetimes of 30 to 50 years (and the US B-52 fleet is expected to operate into the 2040s, for a lifetime of 90-95 years). Hardcopy is not going to do the job here.

91:

Graydon @ 79, Elizabeth Patrick @ 86, Bruce Cohen @ 90:

To effectively retrieve (old) data from any type of record medium, you need functional (and mutually compatible) versions of three distinct types of tools:
(a) a physical medium (hardware) upon which the data is stored;
(b) a (reversible) coding/uncoding technology which can be used to store and retrieve that data from the hardware; and
(c) a software application which can be used to accurately (or at least usefully) transform the original data to/from its coded form.

One widely used set of such tools includes paper and ink (the hardware), written language symbols (the technology), and literacy (the software application). Several more recently developed tool sets use patterns of electrical or magnetic charges on a variety of physical substrates, in a wide variety of formats, accessed through an even wider variety of software applications.

As several of the folks in this discussion have noted, many of the recently developed tools in all three of these categories have been subject to very rapid obsolescence. However, several rounds of "format wars" seem to have shaken out a lot of the competing types of hardware, and at least a few of the less widely used coding technologies. (As exemplified by the relative rarity of 3-inch and 8-inch floppies these days, ditto the data storage formats used by BASIC and CP/M WP applications 25 or 30 years ago.)

Fortunately, a gratifyingly large fraction of current software applications seem to be effective at retrieving stored data from a substantial fraction of the more common hardware formats, when stored in one of the more common coding formats, and I expect this pattern to continue. (E.g., many documents saved in ASCII are going to remain readable by new applications, whose developers weren't even born at the time those documents were first saved to disc.)
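The ASCII case can be sketched in a few lines, following the three layers listed above: stored byte values (the medium), the ASCII table (the coding), and a trivial reader (the application). The sample bytes and the `read_ascii` helper are assumptions made up for illustration, not anything from a real archive.

```python
def read_ascii(stored: bytes, salvage: bool = False) -> str:
    """Decode stored bytes as ASCII.

    With salvage=True, undecodable bytes become U+FFFD instead of raising,
    so the readable parts of a damaged or mislabelled file still come through.
    """
    return stored.decode("ascii", errors="replace" if salvage else "strict")


# (a) the medium: nothing but a sequence of byte values
stored = bytes([72, 101, 108, 108, 111, 44, 32, 49, 57, 56, 53])

# (b) the coding: ASCII assigns a character to each value from 0 to 127
text = read_ascii(stored)

# (c) the application: here, just splitting the text into words
words = text.split()

print(text)
print(words)

# A byte outside the ASCII range cannot be read strictly, but can be salvaged.
print(read_ascii(b"caf\xe9", salvage=True))
```

Because the coding layer is this small and this widely documented, any future application can reimplement it, which is a large part of why plain text tends to outlive the programs that wrote it.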

When a specific software application (even one as widely used as MS Word is today) is no longer generally available, a sufficiently large "installed base" of data coded in a standard format associated with that application will provide a substantial economic incentive for successor applications to be backwards compatible, at least to the extent of being able to read data stored in that format.

92:

Leroy --

Only if it happens all at once.

Ever been near a medium-large (50 lawyers) law office switching off Word Perfect? They'd like to be able to pay someone to make the pain stop, but they're just one office. If every law office that was going to do that (over about 15 years, because some of them still haven't switched) did it at once, there would have been a market. Because it happened in bits and straggles, there wasn't. The archival case for Word is going to be smeared out over 100 years, easily.

Format conversion is hard, idiosyncratic (did the source environment have some sort of processing macros in it? Has someone encoded semantic meaning in the formatting?) and it doesn't happen all at once. The business case for reverse-engineering obsolete Word formats is really hard to make in each individual case because it's really expensive to do it well. (Look at how much money the Open Office folks, mostly Sun, have sunk into this. Look at how almost it is.)

Hardware is proliferating, not consolidating; we're getting new fundamental storage types. The front ends are generally consolidating (some kind of SCSI commands over some kind of serial bus) but the file systems aren't. Solid state storage for portable devices -- Compact Flash, SD, SDHC, MemoryStick, etc. -- is hitting sizes where the DOS-descended VFAT file system just can't cope. So VFAT's going to be replaced, probably with multiple competing proprietary solutions, one per major flash memory vendor.

Solid state storage devices have inherent speed/volume tradeoffs in terms of number of bits per cell, and those are going to reflect how stuff is stored on the chip; the algorithms for doing that are intensely proprietary and often implemented directly in hardware. This is going to be an intense archival challenge.

93:

Hi, I'm a PC user. I have a friend who sent me a link to this thread and after reading several of the posts I felt compelled to reply.

There's your mistake right there.

94:

Graydon @79, I still have my first program -- on paper tape coiled in an old round zipper case -- but I don't expect to run it again.

95:

I feel sorry for the "PC User" - I'm one myself.

And there is a good explanation for this ....
I started learning about computing in the early '70s, and started with both BASIC (as a training tool) and FORTRAN IV - 80 chars per line, one line per card, and real, actual core store ....
Incidentally, I was pleased to find, last month, that I can still write (simple) BASIC code, even after 25 years away !
Because of faction-fighting at my then employers, I didn't STAY in computing, though I continued to use computers and write simple programs from time to time, and later did some testing of programs - my previous experience in industrial research was valuable as a mind-set for that.
Break for unemployment/retraining/teaching/doing an M.Sc. - which was typed in Lotus 123, because WordPervert was CRAP.
Both the earlier Macs and PCs were well into service by then - we are now at about 1993-5.
So we bought a home computer, a PC, running (deliberate choice) WinNT, as we didn't trust the home/business packs offered by Win then ('95 was known to be trouble, and Win for Workgroups was getting old).
Why not a Mac, you say?

COST

The original mistake by Jobs and Mac, by producing the "perfect" home-computer, at high cost, with no upgrades. Put me off for years, that did.
We're now on our third home PC (don't laugh, it's a DELL) - and changing to Mac MIGHT be a good idea, but can I be bothered with the hassle, and the re-learning.
Yes, I've used some forms of UNIX in the past, even written simple code in it, when doing testing, but I don't understand people liking vi or similar.
I loathe vi - it is SO clunky - frankly reverting to FORTRAN IV would be easier, except it doesn't handle that sort of thing .....

I agree that MicroShaft are not the best-in-the-world, but frankly, is Mac really that much better, given their contempt of the customer in their early years?

And I'm VERY suspicious of the missionary zeal of their followers (nothing personal, may I add)

96:

Re-reading, and first-reading some of the comments that I skipped first-time round, can I add a bit more?

The car analogy is good, but ....
After 12 years without one (we were very short of money 1991-2002) we decided to get another.
Now I'm a fairly competent mechanic, as well as a trained scientist, and I do have an Engineering M.Sc. so we decided to get a car that:
I could (mostly) maintain myself, would last a LONG time (preferably 30 years plus), would spare us hiring vehicles to move stuff, and was safe.
There is (in England anyway) only one valid solution to that equation: a "proper" second-hand but in good condition, Land-Rover, large, green, and rectangular. Has the added advantage of really nice views, the driver's and passengers' heads being at least half a metre higher up!

But, we were starting from scratch, and, unlike computers, there was no data to carry over, or re-learn, or re-adapt to other uses.
The other problem is there is a multiplicity of car/van manufacturers.
Software, for home users, there is either MicroShaft, or there is Mac/UNIX. There ARE NO OTHER REAL POSSIBILITIES, that I know of, anyway.

Comments?

97:

I'm finding all the anti-Windows comments a bit tedious. I realise it's Mr Stross' blog and he's a serious 'ix dude so can make any comment he likes, but the rest of you can get stuffed.

I've _programmed_ Windows based apps for 15+ years. There's no easy solution to any sort of programming regardless of OS and hardware. So all these stupid anti-Windows comments just make you look overly biased.

Don't make me come over there and point out some problems that 'ix has, you wouldn't like it.

98:

Graydon & others: Yes, you can get hold of old MS software. Try googling "torrents MS Office 95"

99:

Charlie@22:
But yes, it's going to be an interesting cultural problem when everybody wakes up and realizes 99.9% of the information that they've assembled during their life is going to go away when they die.

I don't really have a problem with that. I'm collecting a huge amount of digital information, 90% of which is of little or no interest to anyone else. The 10% or less that is of interest to others (e.g. family photos, etc) is, as a result of its digital nature, already distributed to the people who care. And most of that 10% has a limited useful lifespan. I doubt that my grandchildren will care much about having thousands of vacation snapshots.

If only 0.1% of my stuff gets archived, that's probably enough. I have a few physical knick-knacks from my great grandparents (WWI medal, etc) but only two photographs and absolutely no written material. I can't say that I have felt any loss. If my grandchildren retain a couple of hardcopy photos and can tell their children a few stories, that's good enough. My great grandchildren may remember my name, but I don't expect even that from their children.

100:

Nick: your digital footprint is of no interest to you after you're dead. But if nothing else, it's rich grit for the statistical mill of historians trying to work out how early-21st-century folks lived. You may not feel any loss from your grandparents, but social historians are interested in that sort of thing. Who are you to say what will or will not be of interest to people living after your death?

101:

Who are you to say what will or will not be of interest to people living after your death?

Who is anyone to say? I can only make a guess, based on our collective reaction to the loss of 99% of the ephemera owned by our Edwardian great grandparents. We're mildly interested in the stuff that survives and most people don't lose much sleep over the stuff that's gone.

Certainly, it would be rich grit for future social historians, but ehh, they're pretty low on the list of people I'm concerned about after I'm gone. I'll bet that even if 99.9% of our early 21st century digital footprint vanishes, they'll still have orders of magnitude more material to play with than current social historians who study our past.

I'm definitely concerned about losing access to stuff that I want to keep during my lifetime, though I have had more trouble with inaccessible storage media than inaccessible file formats. I deal with that by buying paper books rather than ebooks, printing electronic documents, and by distributing digital photographs to multiple relatives. A low tech version of that worked pretty well for my parents when family photo albums were lost in a revolution; they could be reconstructed from copies held by siblings and other relatives. I also try to remember that stuff is, well, just stuff. It wouldn't kill me to lose it.

102:

Robin I @ 97: Touchy, touchy.

OK, let's compare: I've been programming computers for over 30 years, everything from assembly language on bare metal to (in no particular order) various languages like C, Lisp, Java, Smalltalk, Prolog, C++, Basic, Ocaml, and some more I'll remember when the second cup of coffee kicks in. That's on OSs like TOPS20, CP/M, half a dozen multi-threaded embedded OSs, at least 10 flavors of Unix/Linux, MacOS of several flavors, and, yes, DOS and Windows. And Windows has one of the most bollixed sets of APIs I've ever seen: it's over-complicated, under-performing, and buggy (and the bugs don't seem to get fixed, they just move from one release to another). If you like it, fine, have at it. I'm sick of spending two weeks finding out that some library I'm using that works fine on 4 other platforms is broken on Windows because of an underlying defect or limitation in the OS.

103:

Coffee kicking in: add Python, Fortran, Pascal. Not bragging, just proving that I really am language-agnostic, and mostly OS-agnostic as well. But some languages and OSs are less equal.

104:

Office 95? That's recent. I have a working copy of Office 4.2. Running under Win98.
(For what it cost at the time I bought it, I couldn't see getting rid of it. On the other hand, that machine also has StarOffice on it, so I'm not out of luck.)

105:

Speaking as a C21st social historian, I can predict the time when the disciplines of history and archaeology will merge. Dig in the landfill. Find the hard drive. Scan the bits. Run 'Gateser' on them to decipher the file formats. Put the files into yr office VR. Ask questions of it.

Repeat til n gets to be about 100, and statistical analysis of the unwitting evidence (what time did people knock off work? How much pr0n in the cache? etc) will get interesting.

Hmm . . . if (a) I didn't have kids and (b) the technology existed, I might even demand to be put in stasis so I could do it myself.

In other words, we'll be alright. The bugger might not be the bits, but data protection (whose hard drive is it?) and reproduction rights (the Mouse still owns that pirated movie on it).

106:

Merrilee @94 --

My first program was stored on an audio cassette from a TRS-80. I don't have it, but if I did, I don't think I'd have much chance of being able to run it, either.

Val @98 --

There is, in a large corporate environment, this pesky requirement for "legally obtained". There is also, often, a pesky requirement to get exactly the spin of Windows -- the one for that exact Dell or HP machine -- which Dell or HP have themselves not necessarily retained. It can make life far too interesting trying to figure out what exact version that was.

Robin @97 --

There's a difference between my personal dislike of Windows because that's not how my brain is wired (which, in the larger scheme of things, doesn't matter at all) and stuff like Windows lying about encodings. (Microsoft Shift-JIS isn't, for example, but it says it is.) Everybody else has to code in tests and workarounds for what ought to be a solved problem, consuming resources and effort that could be going to doing something useful -- solving their own particular business problems -- instead.
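For a taste of what such a workaround looks like, here is a minimal sketch in Python, whose standard library happens to ship both a standard `shift_jis` codec and `cp932`, the Microsoft code page 932 variant. The `decode_japanese` helper and the sample strings are assumptions made up for illustration; this is one possible fallback strategy, not the one any particular product uses.

```python
from typing import Tuple


def decode_japanese(raw: bytes) -> Tuple[str, str]:
    """Decode bytes labelled "Shift-JIS", falling back to Microsoft's variant.

    Returns the decoded text and the name of the codec that actually worked.
    """
    for codec in ("shift_jis", "cp932"):
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue  # try the next, more permissive, codec
    raise ValueError("not decodable as Shift-JIS or cp932")


# Plain text that both codecs agree on.
plain = "日本語".encode("shift_jis")

# The circled digit one exists in the Microsoft variant but not in the
# standard character set, so only cp932 can encode or decode it.
vendor = "①".encode("cp932")

print(decode_japanese(plain))
print(decode_japanese(vendor))
```

The fallback loop is exactly the sort of extra code the comment complains about: it exists only because two different encodings travel under the same label.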

MS does this on purpose, in an attempt to enforce a monoculture. This is just plain evil, without regard for anybody's personal preferences in computer interface style.

107:

This may reveal my Luddite roots but I now know that if I care to keep anything, I need to print it out. All you need to get it back is a typist, and I am a fast one.

108:

The insightful newspaper The Onion has an article where archeologists discover What Life Was Like In 1995. It's surprisingly relevant to the conversation.

109:

Nick @101 - that's fair enough, or your descendants might be more like me, who after finding out an ancestor was chief gardener at Leeds Castle (in Kent, obviously) went down and compared the photo of the cottage to the actual cottage (unchanged, but the climbing roses climb in different places) and asked if they had any information on his gardening work (just accounts; if he kept plans, they got tossed out); or going into the history of the 10 foot long dresser my Mum inherited (most modern houses don't have 10 feet of uninterrupted wallspace); or idly speculating what my Grandfather was doing when he needed the "Pass this man without question or delay" piece of paper signed by a Lt Gen in WWII.

I think part of the point of this thread is that people are swapping ideas and techniques, and maybe with a little effort we can do much better than 99% loss after death while making things cheaper and easier for ourselves in the medium term.

Paper of course is vulnerable to the old Fahrenheit 451/Library of Alexandria failure mode. Maybe carving your most important documents onto stone is a better bet.

110:

Bruce @ 58: NVidia's graphics drivers are not open source, and NVidia has stubbornly refused to release programming specs for their hardware. They do release some basic 2d programming specs, so you can at least see what you're doing while you install.

NVidia's drivers generally do work relatively well, but, well, caveat emptor. If NVidia goes out of business or stops supporting their older cards, you may be stuck unable to upgrade your OS.

By the way, the Free Software Foundation was founded due to exactly this sort of problem. With unsupported drivers and no programming specs, your hardware will rapidly become pretty useless. Of course, with hardware, you can always just buy a new computer, so it's not like losing access to your data.

111:

I'm glad somebody mentioned Vinge's "A Deepness in the Sky" - one of my favorite parts of the book was the concept of a "software archaeologist". The idea was that each member starship of a particular large and spread-out group of interstellar traders would have somebody whose job it was to be able to go spelunking through hundreds of years' worth of file formats and programs, and to generally know what sorts of stuff was lurking in the depths of the ships' computer systems.

Heck, it's only been in the last thirty years or so that people have been using computers on any kind of large scale (socially speaking), and look at the mess we're in w.r.t. file formats and media. Imagine what it'll be like in thirty more, much less one hundred... I can definitely see software archaeology evolving into an entire profession - there could be people whose specialty is in virtualizing antique operating systems, others specializing in particular XML schemata, etc.

112:

@ 109
Now there's a thought.

One of these days, I'll find out what my great-great-grandmother was doing in the British Embassy during the siege of Paris, besides being a confidential servant (i.e. a Brit national, not a local hired hand) ...
Particularly as my grandmother's older sister kept a French bayonet that her mother had acquired, and we NEVER found out how she'd managed to snaffle it off its owner ...

113:

@ 30: For what it's worth, I did have a go at reverse-engineering the LocoScript 1 file format a few years ago. It helped that I was able to run Loco 1 under emulation and create test documents with various formatting codes. Never got round to Loco 2, 3 or 4 or the PC versions.

114:

I also went from LocoScript to Protext, and stuck with Protext. I've been using it since 1985.

115:

Steve @ 111:

The sheer number of variant file formats and media in use will have less impact than their distribution pattern. Like worldwide agriculture during the last couple of centuries or so, the trend in IT has been toward a relatively small number of product monocultures, each dominant in rather vaguely defined regions. Similarly, we have in both fields a much larger number of locally specialized products, with narrowly based but intense support from the much smaller number of enthusiasts for each of these.

Like heirloom vegetables, the specific combinations of hardware and storage format most at risk of early disappearance are those which have neither been adopted as de facto standards for a very large user base, nor attracted the support of enthusiasts who find the particular product so satisfying in use, as to warrant the substantial extra effort necessary to maintain its existence in their highly specialized area of interest.

The specific monoculture strains of field corn which presently occupy such a large fraction of Iowa won't last forever, and neither will MS Office. However, the very large user bases for each aren't likely to go away either soon or rapidly, nor will the substantial business and industrial complexes that have been developed in symbiosis with each.

116:

Huw Bowen @ #66:

It's not just video. Cross application compatibility and forward compatibility are hideous for most content creation tools.

Take 3D CGI for instance. The two tools that dominate this market--3DS Max and Maya--use hideously complex formats that are essentially impossible to reverse engineer and in many cases lack even forward compatibility.

117:

Libraries at various universities around the world and in Australia are calling this topic digital sustainability and doing various projects to come up with archiving solutions for digital material. I worked on a solution based on a project called Fedora http://fedora.info/ and a front-end called Fez http://dev-repo.library.uq.edu.au/wiki/ and touched on another called DSpace http://dspace.org/. Part of the strategy is called format obsolescence detection, which analyses a repository and notifies the archivist if there are any file formats that are not based on an open standard or that contain encapsulated data which can't be decoded (see http://www.apsr.edu.au/aons2/ ). Notice that open source and Linux are a big part of tackling these problems, as the source code can be archived.
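
To make the format obsolescence detection idea a little more concrete, here is a minimal Python sketch. It is not how AONS, Fedora, Fez or DSpace actually work; the extension allowlist and the function name `flag_at_risk_files` are assumptions invented for this example, and a real tool would identify formats from file contents rather than trusting file extensions.

```python
from pathlib import Path

# Hypothetical allowlist: extensions assumed (for this example only) to
# correspond to openly documented formats.
OPEN_FORMAT_EXTENSIONS = {".txt", ".csv", ".xml", ".html", ".tex", ".png", ".odt"}


def flag_at_risk_files(repository_root, open_extensions=OPEN_FORMAT_EXTENSIONS):
    """Return sorted relative paths of files whose extension is not allowlisted."""
    root = Path(repository_root)
    at_risk = []
    for path in root.rglob("*"):
        # Compare case-insensitively so THESIS.DOC and thesis.doc are treated alike.
        if path.is_file() and path.suffix.lower() not in open_extensions:
            at_risk.append(path.relative_to(root).as_posix())
    return sorted(at_risk)
```

An archivist could run something like this on a schedule and review whatever list comes back, converting or documenting the flagged files while the original software can still be run.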

118:

Greg Tingey @112 - getting ever more off topic, I hypothesise that having this discussion on a novelist's site means people here have an above average appreciation for stories, which rolls over into family stories and history, social and other history etc., and so are more likely to be concerned about keeping family and other data available for future generations.

I further note anecdotally that although readers* do spring from non-reader families, readerness and love of stories can run in families, which leads to a richer archive of family stuff and the stories around them, which in turn feeds the desire to keep records etc.

* Here meaning people who read at least** a book a week.
** "Hi, my name is Neil, and I'm a bibliophile. It's been 98 minutes since I last read a book and I'm still taking it... Oh is that Lemony Snicket's A Series of Unfortunate Events?"

119:

@ 117
Agreed, but, in history, as well as in fiction, don't you sometimes wonder what happened to people OTHER than the main protagonists?
Not the "Spear-carriers", but the minor characters who wander in and out. They have lives and stories too!

120:

@ 118

[Don't you sometimes wonder what happened to people OTHER than the main protagonists?]

No.

Er, just kidding. But don't you hate it when an author goes back and retells a story that you already know from other characters' viewpoints? Len Deighton was guilty of that in one of his Bernard Samson books... but to be fair, the Berlin Wall had just come down and destroyed the entire rationale of his novel sequence, so flailing around and trying to work out what happens next might be a reasonable reaction.

(And my current trilogy interweaves with my earlier books so I'm in grave danger of falling into that very trap even as we speak. Er, post.)

BTW, that poor user person left with his/her/its neurons burned out and smoking from the cybercounterattack, but there was a good point about daily backups in there, as we old veterans of the three-sets-of-tapes backup routines should appreciate.

I carry around a copy of my entire life's work on a thumbnail-sized memory stick attached to my keyring. (It also contains a backup of Charlie's personality in a .jar file.)

121:

@111

Re Vinge: That sense of awakening ancient software evil was exactly what I felt some years back, when I landed on the Moon of Doom, known to my team members as The Y2K Project...

122:

I carry around a copy of my entire life's work on a thumbnail-sized memory stick attached to my keyring. (It also contains a backup of Charlie's personality in a .jar file.)

So if anyone finds John's keyring, polish it diligently.

I'll make your three wishes come true, honest.

123:

Thanks for the last two posts -- really takes me back to where I started with computers. In high school in the late '70s I ditched class to program on the physics teacher's home-built computer he had in a back closet that he let me play with (even if I wasn't supposed to be there).

Philosophically, what everyone says here is very appealing, but I'm not creating enough original work that I feel compelled to fight the power (mostly I do stats work nowadays). Now, if I could get myself to actually create something original, then maybe I'd get myself to go back to building it from the ground up...

Also -- first saw the bit about the crooked timber thing on Krugman's website -- wow!

124:

Lots of interesting food for thought here, both in the post and the comments.

I got my start with computers slightly before I learned to write - in fact, the first word I learned to spell was "R-U-N-Enter". Back in the '70s we had a TRS-80 Model I with an application called "Electric Pencil" for word processing. I've both come a long way and not very far since then. Many years later my father found a TRS-80 emulator online, and spent a few afternoons importing all of our old TRS-80 disks to images on the PC, via a BASIC program and an RS-232 cable. In the end, a decade's computer use compressed down into a 300MB archive we could burn to a CD along with the software to run it.

An important part for me was not just the nostalgia of all the old games I used to play, but little stories that were the first ones I'd ever written and games that were the first I'd ever made. Insights into the kind of person I was when I was growing up. I'm glad we had the presence of mind and the method to preserve those things while we still could, before the hardware to read those old 5.25" single density disks failed or the bits faded into the mylar.

Recently, a bunch of my dead grandfather's letters have come to light. Some that I've had read to me or seen scans of say a lot about where I come from in terms of knowledge, values and family culture. We're now looking at scanning more and translating them into digital text, so that everyone in the family has access to that legacy. A letter dealing with Great Uncle Maurice having an explosive accident while melting lead at the age of 11 might not be of great historical import, but to a modern-day engineer and teacher it has resonance.

And then there's university work. Given instabilities in the format and software that I've encountered over the years, I've avoided writing large documents in Word. LaTeX may be antiquated, but it's stable, open and carries a degree of precision that I can reasonably entrust a CompSci thesis to. That's not my major academic concern right now, though.

I've been doing work recently in the field of making games and virtual worlds for educational purposes. I'm in the process of starting a PhD in that general area. The problem I'm most concerned about in creating educational assets is making sure that they'll (a) be effective at teaching and (b) be usable and continue to be usable in the future.

Some of our educational game work was done using the Neverwinter Nights Aurora toolset. Of course NWN is proprietary, so there are limits on how we can use it. Even now it's not completely compatible with Vista, so that body of work in its current form has a limited lifespan in which it can feasibly operate.

Moving into virtual worlds, we've created a few things in Second Life (like many other academics around the world). However, that work is only persistent as long as our account, land lease and Linden Lab persist. Especially in the wake of Google Lively being released and then dying, the stability of any one-vendor system is a cause for worry.

That's why we've been looking into alternatives for an open-platform, open-standard system: we're reaching the point where concern over vendor/data lock-in isn't just about preservation of word processing and static multimedia content. Having tools to teach a new generation that aren't easy to preserve (or translate, as technology progresses) greatly limits their potential usefulness. So far it's looking like OpenSim as a de facto alternative, but it'll be interesting to watch how the virtual environment space evolves from a standards and interoperability point of view.

And yes, I really enjoyed and appreciated Halting State, thank you. It's a very timely piece of fiction.

125:

It's simply untrue that Microsoft changed the Word format with every version. It remained the same for so long that when Word 2000 had a new format as default it was a cause of considerable consternation at the time. The real problem was an error in the way it generated files for earlier Word versions when asked. Rather than a Word 97 DOC, it really generated an RTF. This was easily overlooked by the majority doing simple documents and so it took a while to build up a scandal, but there was eventually much gnashing of teeth before a patch appeared to correct the problem. They screwed up but it wasn't an intentional attempt to force upgrades.

The only documents I've lost are those I couldn't be bothered to preserve or transfer to a more portable format. The first would be on the Atari 800, then on a mix of machines with the bulk on the Amiga in ProText. If any of these were published items with the potential for future sales, I'd have taken more care to at least have printed copies stored somewhere that could be transcribed back into digital if need be.

But really, it strikes me as incredibly unlikely that anything I did in text during the 1980s will mean anything to anyone in the 2080s. There will be vast numbers of surviving printed material that will serve for future historians.

Realistically, physical media is a vastly greater problem for preservation than file formats. If you can read the bits you have as long as necessary to derive meaning from them. If you cannot read the bits you just have a lump that was once a disk drive, no better than an ancient tome moldered into illegibility. If I had all of my old Atari 800 files on a current medium I could easily read them by running the old software under emulation. Atariwriter works as well at 2.93 GHz as it did at 1.79 MHz.

I just cannot buy into the whining about file formats. The only time I've ever had difficulty moving documents around was when they involved a lot of complex items that just aren't found in novels. Some of my co-workers deal in such documents and they use Word because everything else they've ever tried falls down badly in some area or another when doing this sort of work. That isn't to say Word makes it a breeze either, but its issues and how to work around them are widely known and discussed, producing a great store of helpful knowledge.

If 20 years from now a proposal we submitted to some city cannot be read or easily converted intact for use on the current app of choice, they'll just have to settle for extracting the text and rebuilding the complicated parts, or OCRing it if a printed copy still exists. But this seems unlikely to be a problem beyond the application of emulation.

Even if the emulation must run multiple levels deep. Hey, I once played Apple II Choplifter running under emulation on an Amiga that in turn was running under emulation on a P-4 system. Played fine and the system wasn't even breaking a sweat.

So, as long as we can preserve the bits and see them, we'll be able to run the code to extract the data.
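
For what "preserving the bits" might look like in practice, here is a small Python sketch of a checksum manifest: record a SHA-256 digest for every file once, then re-check after each copy to new media. The function names (`sha256_of`, `build_manifest`, `changed_or_missing`) are made up for this illustration and this is only one possible approach, not a description of any particular archiving tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_or_missing(root, manifest):
    """List manifest entries whose file is now missing or has different bits."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

The manifest itself is just a dictionary of names and hex strings, so it can be saved as plain text alongside the archive and checked again decades later by whatever tool is current.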

Graydon, regarding large external drives, it's called exFAT. There may be some licensing bugaboos but that isn't going to really stop anybody from supporting it on non-Microsoft platforms. It's far more likely to get an open license than NTFS, especially since there is some argument to what extent Microsoft owns FAT.

126:

But really, it strikes me as incredibly unlikely that anything I did in text during the 1980s will mean anything to anyone in the 2080s. There will be vast numbers of surviving printed material that will serve for future historians.

You're missing the point completely.

History depends on written sources. But stuff that's mass-published (and hence most likely to survive) is very different from your everyday shopping lists, bits of hacked-out script, and so on. Compare your parents' home movies to what was on TV at the time. There's a gigantic gap between what a culture considers to be worthy of preservation at the time, and what successor cultures wish they'd preserved -- generally material that was pervasive and low-level because it encoded the cultural assumptions of the day, but which is easily lost (due to the lack of preservation).

MS changing the Word file format; I take it you don't remember Word 3.0 to Word 4.0, then. Or Word 5.1a for Macintosh to Word 6.0 for Macintosh. Or Word 95 to Word 97. IIRC they changed their file format something like 5 or 6 times in a single decade. Given that I expect my documents to have a half-life of thirty years, that's not good ...