Typesetting with Linux


If you mention typesetting and computers, most people will assume you're going to talk about Macintoshes. This is a reasonable assumption; the association between the Mac and the pre-press trade is long-standing and well known. In recent years Windows has been chewing away at the margins of this market, but the core tools of the publishing business still run on MacOS for the most part, with a few high-level systems running on UNIX minicomputers.
Linux is not a platform that looks -- at first sight -- to have a lot to do with typesetting. The only serious commercial DTP product to be ported to Linux, Adobe FrameMaker 5.5, was withdrawn at the end of the beta test period when Adobe decided not to market it. However, there's more to producing a book than throwing a bunch of text files into a program like Quark or PageMaker: although WYSIWYG all-in-one desktop publishing tools are unavailable, Linux sports some of the heaviest-duty typesetting programs in the business.
(Incidentally, it would be unwise to dismiss non-WYSIWYG systems on the basis that they're old-fashioned or hard to use. There are a number of very strong arguments in their favour: as Conrad Taylor noted in a paper on DTP systems, "the degree of control that such programs offer may be more illusory than real ... desktop publishing programs let you fiddle interactively with the details of your typography until the cows come home, but they do not let you control the default behaviors of the composition algorithms in a way that efficiently and automatically delivers the kind of quality typography that was formerly expected of trade compositors." Systems such as TeX, on the other hand, provide a level of control that few WYSIWYG systems can approach. And, as we'll see, systems like LyX promise to fuse ease of use with high typographical quality.)

UNIX and typesetting

Not many people these days realise that Linux is descended, by way of UNIX, from a word processing system. Back in 1970, when a group of programmers at Bell Labs were experimenting with their new operating system, they obtained a PDP-11 minicomputer and a departmental budget by promising to deliver an interactive word-processing system for use in the telephone company's offices. Dennis Ritchie (co-inventor of the C programming language) wrote:
"At the time of the placement of the order for the PDP-11, it had seemed natural to promise a system dedicated to word processing. ... We transliterated the roff text formatter into PDP-11 assembler language, starting from the PDP-7 version that had been transliterated from McIlroy's BCPL version on Multics, which had in turn been inspired by J. Saltzer's runoff program on CTSS. In early summer, editor and formatter in hand, we felt prepared to fulfill our charter by offering to supply a text-processing service to the Patent department for preparing patent applications. At the time, they were evaluating a commercial system for this purpose; the main advantages we offered (besides the dubious one of taking part in an in-house experiment) were two in number: first, we supported Teletype's model 37 terminals, which, with an extended type-box, could print most of the math symbols they required; second, we quickly endowed roff with the ability to produce line-numbered pages, which the Patent Office required and which the other system could not handle."
Ritchie was writing about the development of the roff/troff typesetting system -- one of the first production text processing systems on the market.

troff, tbl, and eqn -- text processing in the old days

roff (short for run-off) was one of the first computer typesetting systems. Developed during the 1960's, roff is essentially a macro processor: it reads a text file full of formatting commands and raw text, and generates an output file that replaces the formatting commands with control codes that can drive a printer or phototypesetter. The simplest way to use a roff-like formatter is to write your document in straight ASCII text using a text editor; then, when you're ready to print it, go through the text inserting commands to adjust fonts, format tables, and so on. (A roff input file with all the embedded commands rapidly becomes an unreadable mess as far as the human eye is concerned.)
An important point to note is that software tools like roff only became practical with the advent of time-sharing, interactive computer operating systems. Back in the mid-1960's, the majority of computers were expensive mainframes oriented towards batch-processing of jobs submitted on punched cards. For text processing to be possible, three inventions were necessary: an interactive operating system, some sort of terminal on which users could edit text, and a printer. These weren't generally available until the mid-1960's. At that time, most typesetting was carried out using electromechanical machines such as the VariTyper (a typewriter-like device capable of outputting proportionally spaced text direct to film). One of the major innovations of the UNIX typesetting system was its ability to directly drive some of the first digital phototypesetting machines.
roff commands are extremely low level. They come in two flavours: dot commands (which take the form of a line beginning with a period, followed by a letter, such as .P to indicate a new paragraph, or .SH "heading" to introduce a formatted subheading), and escape commands embedded in the flow of the text, prefixed with a backslash (like \fB to introduce bold text or \*(Tm to indicate a trademark symbol). The precise types of commands available are defined in a macro package that sets up page and paper layouts and defines some special dot commands that the user can use to do things like indent the text column; there are several standard macro packages, of which the most important are the "man" macros (which are used to write UNIX or Linux manual pages) and the "ms" macros (used to format "manuscript" documents -- typically journal articles, books, and technical reports). It's worth noting that these macro sets are quite ancient -- they're still in use today, but date to the mid-1970's -- and documents prepared using them in the early days are still available. (A far cry from files produced using, say, Microsoft Word 2.0 for Macintosh in the late 1980's.)
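To make the two flavours concrete, here is a minimal Python sketch of how a formatter might tell them apart. It is not how roff itself is implemented: the function name classify is made up for illustration, only the \fB, \fI and \fR font escapes are recognised, and the font is reset to roman at the start of every line (a real formatter carries the font state across lines).

```python
import re

# Inline font escapes. This toy pattern covers only \fB, \fI and \fR.
FONT_ESCAPE = re.compile(r"\\f([BIR])")

def classify(line):
    """Return ('request', name, args) for a dot command, else ('text', spans).

    Each span is a (font, text) pair. Hypothetical helper, for illustration only.
    """
    if line.startswith("."):
        parts = line[1:].split(None, 1)
        name = parts[0] if parts else ""
        args = parts[1] if len(parts) > 1 else ""
        return ("request", name, args)

    spans = []
    font = "R"   # assumption: every line starts in the roman face
    pos = 0
    for match in FONT_ESCAPE.finditer(line):
        if match.start() > pos:
            spans.append((font, line[pos:match.start()]))
        font = match.group(1)
        pos = match.end()
    if pos < len(line):
        spans.append((font, line[pos:]))
    return ("text", spans)
```

Feeding it the line .SH "heading" yields a request named SH, while a line of running text comes back as a list of spans, each tagged with the font that was in force when it was read.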
As the roff tools spread, they were expanded and extended (although the death of troff's author, Joe Ossanna, in 1977 prevented a re-write and the adoption of a less arcane command syntax). First, roff itself was split into a front end (that parsed the macro files and produced an output file in a special format called ditroff, device-independent troff) and a back end (that converted ditroff files to bitmaps compatible with specific models of printer). Second, additional tools with mini-languages were invented for special tasks: these tools parsed inclusions of their own mini-language within a roff file and expanded them into roff macros. The core tools were tbl, used for laying out tabular information, eqn, used for typesetting mathematical equations, pic, used for drawing diagrams (like writing PostScript by hand!), refer (which would source in and link up bibliographic references from a file) and a couple of other non-standard tools (such as dag, used for drawing directed graphs). These tools would be chained together, so that you could format a complex chapter using a UNIX pipeline:
  cat chapter.ms | pic | tbl | eqn | troff -ms | lpr
(Typesetting a book in the old days was a long, challenging task -- back when a computer with 256K of RAM was a departmental minicomputer, it could well be an overnight job, and the ability to print overnight from a shell script was useful.)
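The reason such a pipeline works is that each preprocessor only touches the region of the file written in its own mini-language and passes everything else through untouched. The Python sketch below imitates that behaviour under stated assumptions: the block markers (.PS/.PE for pic, .TS/.TE for tbl, .EQ/.EN for eqn) are the conventional ones, but the "expansion" is a stand-in that merely turns each enclosed line into a roff comment, and make_preprocessor and pipeline are invented names rather than anything from the real tools.

```python
def make_preprocessor(start, end, label):
    """Build a filter that rewrites only the lines between `start` and `end`."""
    def run(lines):
        out, inside = [], False
        for line in lines:
            if line == start:
                inside = True
                out.append(line)
            elif line == end:
                inside = False
                out.append(line)
            elif inside:
                # Stand-in for real expansion: emit a roff comment line.
                out.append(f'.\\" {label}: {line}')
            else:
                out.append(line)  # not our mini-language: pass through
        return out
    return run

pic = make_preprocessor(".PS", ".PE", "pic")
tbl = make_preprocessor(".TS", ".TE", "tbl")
eqn = make_preprocessor(".EQ", ".EN", "eqn")

def pipeline(lines, stages):
    """Run the document through each stage in order, like a shell pipe."""
    for stage in stages:
        lines = stage(lines)
    return lines
```

Calling pipeline(document_lines, [pic, tbl, eqn]) mirrors the ordering of the shell command above; because each stage ignores the others' blocks, a stage with nothing to do is harmless.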
The typesetting machines troff (the typesetter version of roff) was designed to drive were primitive by today's standards. There was, in particular, no page description language like PostScript; the machines could print text in a variety of sizes and four faces (roman, sans-serif, italic, and bold), and could just about do boxes and symbols. The roff formatters were geared up to send a series of commands to the printer that would explicitly tell it to feed the page up a few points, change typeface, eject the paper, and so on. As such, roff spent more time worrying about the output appearance on paper than about the legibility of the document. As a result, the usual way of writing a paper or article was to use a text editor like ed (or later, vi), write it as straight ASCII text, then after finishing it go back and add roff commands.
Linux systems today almost invariably come with a descendant of the original AT&T tools called groff -- GNU roff. Developed by James Clark (whose name we'll meet again), groff is a complete troff-style formatting system that can generate PostScript page images. It also comes with GNU nroff (new-roff), a terminal-oriented formatter that outputs formatted ASCII text with boldface and other terminal control characters embedded in it, suitable for viewing on the Linux console. nroff is the standard tool used for displaying man pages. When you type "man myprog", the man scripts first see if a cat-able (formatted) page image for the requested program exists: if so, they throw this at your screen by way of a pager program such as more (or less). If no such formatted version exists, they feed the source of the program's documentation through nroff, cache the formatted output, and display it.
You probably won't need to use troff typesetting commands in anger unless you specifically want to write a UNIX man page for a program that you're creating: both KDE and GNOME have their own, different, systems for producing online documentation, and it can be argued that man is a dying format. However, numerous books and papers have been written using troff as a typesetter: this is the granddaddy of word processors such as WordStar (which inherited the idea of dot-commands directly from roff, and threw in a built-in text editor), and it's still around.
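The look-up-then-format behaviour described here is an ordinary cache. A rough Python sketch follows; it is not the code of any real man implementation. The function name show_man_page, the directory layout, the ".1" and ".txt" file suffixes and the format_page callback (standing in for a formatter run) are all assumptions, and unlike a real system it never checks whether the cached copy is older than the source.

```python
from pathlib import Path

def show_man_page(name, source_dir, cache_dir, format_page):
    """Return formatted text for `name`, formatting and caching on a miss.

    `format_page` is a hypothetical callable that turns roff source into
    displayable text; in real life that job belongs to the formatter.
    """
    cached = Path(cache_dir) / f"{name}.txt"
    if cached.exists():
        # A pre-formatted page exists: use it as-is.
        return cached.read_text()

    source = Path(source_dir) / f"{name}.1"
    formatted = format_page(source.read_text())

    # Save the result so the next request skips the formatting step.
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_text(formatted)
    return formatted
```

The second request for the same page never calls format_page at all, which is why cached pages appeared almost instantly on slow machines.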

TeX, LaTeX and literate programming

Back in the early 1970's, when roff was being written, Professor Donald Knuth of Stanford University had a problem. Knuth was writing a series of books (collectively known as "The Art of Computer Programming") which were to be the definitive text on algorithmics. Knuth's problem was that he had a very clear idea of how he wanted the text to appear on paper, and none of the computer typesetting tools available to him were up to the job. He was appalled by the idea of a computing textbook being typeset using traditional methods and hot lead, so he decided to spend six months or so writing a formatter that would provide an unprecedented degree of control over the typographic appearance of its output. Eight years later, TeX was unleashed on the world ...
Although TeX looks superficially similar to troff -- they're both command-line programs that process an input file containing text and embedded macros and output a file in some sort of device-independent format -- TeX is far more advanced.
Knuth's central innovation was in realising that a page of typeset text can be decomposed into cells, each of which contains other cells. At the smallest level, we have character glyphs (individual letters rendered in a typeface scaled to an appropriate height and width and weight). These are grouped into words, and a group of words may be connected into a line: multiple lines make up a paragraph, and so on. Knuth investigated the best algorithms for laying out cells relative to one another: for example, he determined the optimal auto-hyphenation method, worked out how best to insert fractional inter-word spaces when justifying text, how to balance paragraphs, and so on.
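To give a flavour of what "best" means when breaking a paragraph into lines, here is a deliberately simplified Python sketch. It is not TeX's algorithm, which works with stretchable spaces, hyphenation points and penalties; this version assumes fixed-width characters, a single space between words, and a made-up cost of squared leftover space on every line except the last. The function name break_lines is hypothetical.

```python
def break_lines(words, width):
    """Split words into lines, minimising the total squared trailing space.

    The last line costs nothing, and a word longer than `width` is allowed
    to sit on a line of its own.
    """
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: cheapest way to set words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: where the line starting at i ends
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1
        for j in range(i + 1, n + 1):
            length += 1 + len(words[j - 1])
            if length > width and j > i + 1:
                break  # words[i:j] no longer fits on one line
            cost = 0 if j == n else (width - length) ** 2
            if cost + best[j] < best[i]:
                best[i] = cost + best[j]
                nxt[i] = j

    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines
```

The point of considering the whole paragraph at once is that it can accept a slightly short first line in order to avoid a badly short one later, which a line-at-a-time greedy approach cannot do.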
Although TeX is not a WYSIWYG environment, most of the core concepts of desktop publishing and word processing were developed by Knuth. For example, Adobe InDesign was recently upgraded ... to use the TeX algorithm for optimizing paragraph layout. As another -- even more fundamental -- example, the character glyphs TeX outputs aren't simply delegated to a phototypesetter: they're generated from vector font descriptions using a tool called Metafont. Metafont pioneered the idea of using a vector description of a font to generate a bitmap font scaled to appear at a given size on an output device of a given resolution (measured in dots per inch). Knuth worked with font designers to create the Computer Modern fonts, an elegant typeface family designed for legibility and good appearance when output on digital equipment. Metafont is a conceptual ancestor of TrueType and Adobe's scalable font technology: both of the latter can produce on-screen bitmap renditions of their fonts as well as a much higher resolution version for use on a 600 or 1200 dpi typesetter. (The major difference is that the more modern font renderers are optimized for speed -- essential when they're feeding fonts to a screen rather than a printer.)
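The arithmetic behind "a given size on a device of a given resolution" is simple enough to sketch in Python. The only fact relied on is that TeX's point is 1/72.27 of an inch; the function name pixels_for is invented for this example, and real font generation involves far more than one multiplication.

```python
TEX_POINTS_PER_INCH = 72.27  # TeX's definition of the printer's point

def pixels_for(point_size, dpi):
    """Size in device pixels of a font of `point_size` points at `dpi`."""
    return point_size / TEX_POINTS_PER_INCH * dpi

# The same 10pt design needs very different bitmaps on different devices.
for dpi in (300, 600, 1200):
    print(dpi, "dpi:", round(pixels_for(10, dpi)), "pixels")
```

This is why a bitmap font has to be regenerated for each combination of size and output device, while the vector description stays the same.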
TeX, like troff, is a macro language: commands embedded in the flow of text control how TeX lays out the page, what fonts text is presented in, and so on. However, there is a much wider range of macro packages available for TeX, and unlike troff, it is relatively easy to handle multiple fonts. (Note that the keyword here is "relatively"! It's still not trivial.) Probably the most important TeX macro package you'll meet is Leslie Lamport's LaTeX package. LaTeX is a set of TeX macros that separate the layout information in a document from its structure. A LaTeX document (such as a paper or a letter) has a set structure; instead of inserting macros that say "format this line right-justified in an italic font", you tag it as belonging to a component of the document -- such as the author's street address -- and leave LaTeX to worry about the formatting. (You can, of course, override LaTeX's default styles.) LaTeX and similar systems were the first to introduce the idea of semantic markup -- in which the structure of a document is deemed to be separate from its typographical appearance: this concept is currently becoming mainstream on the world wide web, with the separation of style and layout instructions (embedded in cascading style sheets) from text (in HTML).
LaTeX, and some related macro packages (such as AMS-TeX, the macro package specified by the American Mathematical Society), are the default layout packages required by many journals for academic papers. LaTeX and the related academic packages are very good at keeping track of things such as heading and subheading numbering, diagram captions, and so on: the BibTeX system provides features for handling bibliographies, and there are a variety of indexing tools.
TeX particularly shines at typesetting mathematics: even though it's not WYSIWYG and has been around for twenty years, it is far more powerful and flexible than the equation editing tools provided by the likes of Microsoft Word. This is no coincidence: Knuth was specifically interested in providing a mathematical typesetting system for his books.
TeX translates its files into a format called DVI (short for device independent). DVI isn't a simple bitmap -- it incorporates font metrics and information sufficient to permit scaling, so that using a suitable output driver DVI files can be printed with optimal output quality on anything from an ancient 9-pin dot matrix printer to a 3000dpi phototypesetter. A number of DVI translators are available: dvips, to produce PostScript output, dvipdf, to produce PDF (Acrobat) files with hyperlinks between internal document cross-references, the dvilj family of drivers (for HP LaserJet printers), and xdvi, a previewer that displays DVI files.
Raw TeX is a bit intimidating to edit, if you're not used to it: even LaTeX takes a bit of learning. There have been a couple of commercial attempts to take TeX and turn it into the back-end of a WYSIWYG editing application. However, the most spectacular success in this field is a little piece of open source software called LyX.
LyX is a document editing tool that outputs LaTeX documents. (LyX's own internal file format isn't LaTeX, but is designed to be easier for a WYSIWYG tool to handle; when you select the "Print" menu option, it exports a LaTeX file which is then processed by the LaTeX system on your computer.) You can use it to output PostScript into your Linux system's print spooler, or you can view the intermediate DVI file (or emitted PostScript!) in a print preview window.
In use, LyX is similar to a conventional WYSIWYG word processor. The main difference is that it expects your documents to follow the structural conventions of a LaTeX document. For example, you may be accustomed to hitting the space bar a couple of times after every full stop. LyX will ignore this -- it understands the concept of a sentence and will adjust the space between the full stop and the following capital letter according to its layout algorithm, rather than letting you insert a couple of fixed-width spaces manually. (You can explicitly tell it to let you put some spaces in, but this risks throwing off the balance of the neatly formatted paragraph that TeX will generate.)
Similarly, when you select formatting commands from the "Layout" menu, you aren't implicitly telling the word processor to switch to a bold or italic typeface: you're telling it that when you output the LaTeX file, the selected text will have an additional attribute applied to it -- essentially character-level style hints. This sort of markup is familiar to users of high-end publishing systems such as FrameMaker but may come as a bit of a shock if all you're used to are word processors.
LyX has most of the features you'd expect of a high-end word processor. You create a new document by selecting a template (which sets LyX up to generate output for use with a given set of TeX macros -- for example, the Latex8 template provides compatibility with LaTeX). Once you've done this, you can edit the document much as you'd edit a Word document. The major difference is that you may want to preview the file to see how TeX will render it: LyX isn't a 100% What-You-See-Is-What-You-Get system, and may break pages or paragraphs slightly differently from TeX itself.
LyX provides extensive facilities for inserting levels of subheadings, tables, and included diagrams -- these are ideally generated as encapsulated PostScript using an external program (such as dia, the GNOME diagram editor). LyX shines when it comes to features for producing books and long technical documents: it has the indexing support you'd expect, several different ways of handling footnotes and endnotes, bibliography and reference facilities, dynamic cross-references, formatting of URLs, and so on. LyX doesn't include a spelling checker, but integrates support for the UNIX ispell and pspell checkers (which are multilingual and able to recognize and ignore LaTeX macros). It's also possible to customize LyX to your favourite key bindings; by default it supports CUA keystrokes (compatible with IBM's common user interface, like Windows 3.1), but it can also support Emacs keystrokes (and there's some discussion of support for vi).
The most noteworthy thing about LyX is that its mathematics support far outstrips anything that Microsoft have attempted with Word. This shouldn't be surprising -- LyX was largely written by academics, as a front-end for TeX, which was in turn written by an academic who wanted to write books about algorithmics -- but it's still interesting to see the effect of combining a well-designed equation editor with a best of breed typesetting engine behind it.
As a final note, LyX can produce PostScript, PDF, or DVI output. It can also export documents in LaTeX, for final hand-tuning -- or, using a special template, it can export DocBook SGML files. And thereon hangs a tale, because SGML is going to become increasingly important over the next few years as the world wide web switches to XML.

SGML -- the document as database

Back in the 1970's, it was becoming apparent to some documentation specialists that they had a problem. There exist large projects where the volume of documentation, and its life expectancy, defy the ability of ordinary systems to maintain it. For example, a military aircraft like the F-16 typically has on the order of ten million pages of maintenance, procedures, and operational documentation, and this must be maintained across a period of up to half a century. Printing this mass on paper and filing it is enough to fill a library; searching it by hand is next to impossible. But searching it by electronic means was a difficult problem too -- and this sort of problem is made worse by mixing typographical markup information (such as troff or TeX macros) in with the text.
A team of specialists at IBM's Almaden Research Centre, led by Charles F. Goldfarb, set out to do for text what Codd and Date did for databases: to come up with an abstract model of document semantics which could be used to tag text, making it possible to search large repositories automatically, and which could be used to automate the typesetting process. In effect, the text would become a database, and it could be searched online or printed in a consistent format automatically. The product of their research was SGML, the Standard Generalized Markup Language.
In relational database theory, a database is defined using a data definition language (the CREATE statement in SQL fills this role) and accessed using a data access language (most of the rest of SQL). In SGML, the structure of the data is defined using a DTD, or Document Type Definition; the document is then marked up using tags defined in the DTD (instead of macros, as in TeX or troff). Each DTD defines a subset of SGML, called an instance, with different named tags that bear different hierarchical relationships to each other. On the rendering side, SGML is supposed to be turned into some printable format using a different tool, DSSSL (Document Style Semantics and Specification Language), which bears about the same relationship to style sheets that SGML bears to HTML -- that is, DSSSL is a language for defining style sheet architectures.
You're probably more familiar with SGML than you realise, because HTML -- the language of the world wide web -- is actually an SGML instance: there are DTD's for each version of HTML. Indeed, XML (extensible markup language), the next generation of web content, is actually a cleaned-up version of SGML with some added semantics for retrieving DTD's via hypertext transport protocol, thus moving SGML onto the web.
Most SGML tools are difficult to use; the flexibility of the DTD/Instance mechanism means that a true SGML editor needs to be able to parse a DTD then re-jig its behaviour so that it conforms to the document structure defined in it. Some specialist companies write commercial SGML and XML editors (SoftQuad are the leaders). Linux, however, benefits from a wide range of open source SGML and XML tools. The granddaddy of these is Emacs, or XEmacs (a fork of Emacs); this graphical text editor includes an extensive SGML processing mode that provides syntax highlighting, an interface to an SGML parser for conformance testing, and a huge slew of features for editing SGML.
The key to being able to process SGML or XML is a conformant parser. Linux documentation produced by the Linux Documentation Project (including a number of books, many of which are published on paper by O'Reilly and Associates) is written using SGML tools and the DocBook DTD, a DTD designed by a consortium of companies for writing software manuals. The core Linux tools to use are the SGMLTools (or the more recent SGMLTools-lite), which run on top of the SP parser from James Clark's OpenJade toolkit. OpenJade's DSSSL system includes tools for exporting SGML documents into Microsoft's RTF, TeX, and MIF (FrameMaker's export format), and for transformation into different types of SGML/XML formats.
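The idea that a DTD constrains which tags may appear inside which others can be sketched in a few lines of Python. This is a toy, not a DTD parser: the CONTENT_MODEL table and its tag names are invented, it records only which children an element may contain (a real DTD also specifies order, repetition and attributes), and it leans on Python's XML parser because the standard library has no SGML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> child elements it may contain.
CONTENT_MODEL = {
    "article": {"title", "section"},
    "section": {"title", "para"},
    "title": set(),
    "para": {"emphasis"},
    "emphasis": set(),
}

def validate(element, model=CONTENT_MODEL, path=""):
    """Return a list of structural errors found at or below `element`."""
    here = f"{path}/{element.tag}"
    if element.tag not in model:
        return [f"{here}: undeclared element"]
    errors = []
    for child in element:
        if child.tag not in model[element.tag]:
            errors.append(f"{here}: <{child.tag}> not allowed here")
        else:
            errors.extend(validate(child, model, here))
    return errors
```

A document that puts a para directly inside an article is rejected with a message naming the offending element, which is the kind of conformance check a validating parser performs against a real DTD.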
Sitting on top of the SGML parser and the various converter scripts (such as sgml2tex, which uses OpenJade to turn SGML files into TeX files suitable for typesetting, or sgml2html, which turns SGML source into HTML files for web publishing) are editing tools.
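What a converter of this kind does, in essence, is walk the parsed document and swap each structural tag for whatever the target format uses. The Python sketch below shows the principle only and makes no claim about how sgml2html or OpenJade work internally: the HTML_FOR mapping and its tag names are invented, unknown tags fall back to span, and the input is parsed as XML because that is what Python's standard library provides.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical style mapping: structural tag -> HTML tag.
HTML_FOR = {"article": "div", "title": "h1", "para": "p", "emphasis": "em"}

def to_html(element):
    """Recursively render a parsed element as an HTML string."""
    tag = HTML_FOR.get(element.tag, "span")
    parts = [f"<{tag}>", escape(element.text or "")]
    for child in element:
        parts.append(to_html(child))
        # Text that follows the child but still belongs to this element.
        parts.append(escape(child.tail or ""))
    parts.append(f"</{tag}>")
    return "".join(parts)

source = "<article><title>Hi</title><para>A <emphasis>b</emphasis> c</para></article>"
print(to_html(ET.fromstring(source)))
```

Swapping in a different mapping table would target a different output format without touching the source document, which is the whole point of keeping structure and appearance apart.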
A number of WYSIWYG tools are under development, including Conglomerate, a client-server GUI environment for managing SGML document repositories: as the world wide web switches to being written in XML, the importance of support for SGML-based toolkits is going to become more obvious. In the meantime, tools for processing SGML source and generating HTML or TeX files from it are available. In combination with Linux's other features for handling large numbers of files, managing version control, and providing access to servers, this makes for a formidable document management environment -- even if it's not as cute, cuddly, and approachable as a WYSIWYG system.
