Essays

Maintaining large web sites


[ Comments ] [ Copyright ] [ Main contents ]


Big web sites are hard to maintain.

No, scratch that.

Big web sites are damned near impossible to maintain, unless you have planned them very carefully from the outset.

If you're going to put more than about five files on the web, you need to design the site effectively. There are three stages to building a website:

Architecture
Architecture refers to the overall structure of the site -- how the files relate to one another.

For example, it's important to bear in mind that although almost all webs have a 'flyleaf' page describing what they're about, people can enter them from anywhere, at any point -- so it's essential that every page on the site has a link to its parent page, all the way up to the 'flyleaf'.

It's also important to adequately link the pages together, so that the users aren't constantly bouncing on their browser's 'back' button, and it's useful to partition them into subdirectories organized on the basis of some functional affinity.

(I banged on about this for an entire chapter in my book, so I won't bother you with any more of it here.)

Content
We've all seen web sites where there were about fifty brilliant, high-resolution, large, and optionally animated images ... and the only substantive content on them was the address of the company which put it together (mostly in hope of touting for trade).

Content is indefinable, but crucially important. If you don't have content, you don't have a website; as Tim O'Reilly put it, the correct model for businesses using the web is publishing, and whoever bought a book with a flashy cover but no words or pictures inside it?

Let's get realistic. A web site that contains your curriculum vitae or your company's product range is just that: a CV or a sales brochure. It's not going to attract an audience, any more than your name in the phone book attracts an audience. While people looking for a widget manufacturer via one of the search engines might find your company homepage useful, the only way to be sure that your site is useful is to pump useful stuff into it.

That's a big effort. (Clue: this website is an outgrowth and evolutionary descendant of a hypertext I've been working on since 1993, when I first met the world-wide web. Its current design dates to early 1995, and has been stable -- give or take occasional reshuffles -- ever since. The main way it changes is to accrete content, not to acquire a flashy graphical overhaul every time some bunch of stoned software engineers in California release a new lump of bug-ridden beta-testware. And believe me, the content doesn't come easy.)

Design
Design is the icing on the cake.

Design does distinguish a professional newsstand magazine from a badly-photocopied amateur publication. But the web is not a magazine rack; the web is a platform-independent information interchange medium. It's difficult to design for the web because you can't tell what platform your readers are working on. (I bang on about this issue in my book, too.)

Some design work is worthwhile. This site uses some HTML 3.2 tags -- BGCOLOR for example -- because I happen to believe that they make it a bit more readable than the usual browser defaults on those browsers that support colour, and they don't get in the way of those browsers that don't (Lynx, for example).

Other design work is non-obvious. This site is designed to work well with Lynx 2.5, the de facto standard browser for non-graphical systems and blind users. As a spin-off, it works reasonably well with graphical browsers. There's nothing wrong with design -- as long as it doesn't get in the way of content delivery.

Why is all this relevant to large webs?

Big websites

You can't work on a big website without developing an acute respect for the hierarchy of priorities outlined above.

My personal website isn't big. There are about 190 files in it. This entire site (www.antipope.org, aka www.antipope.demon.co.uk) contains perhaps 280 files.

But that's still too big to manage by hand!

For example, suppose I want to change the style of the menu bars that feature on every page. Currently they look like this:

[ link ] [ link ] [ link ]

(The reason for text-only button bars is that they load far faster than button bars with separate icons, they aren't as browser-specific or as server-intensive as client-side or server-side imagemaps, and they look fine, if a bit plain, in any browser.)

If I want to replace them with a chain of iconic images, or with a JavaScriptlet that explicitly adds 'back' and 'forward' links to each page, or just change those square brackets to round ones ...

( link ) ( link ) ( link )

I'd have to edit all 190 or so files by hand.

From here on in, it gets worse. When I worked for SCO, we were maintaining a documentation set that ran to three million or so words of text. When exported to HTML for the OpenServer online documentation set, this turned out to be on the order of 24,000 files with 200,000 links in them!

You really can't maintain a site like that by hand, can you?

But there's worse. The biggest website I've ever worked on is Hantsweb, Hampshire County Council's government server. It was built by REXX scripts running on a cluster of mainframes; the scripts iterated over the entire accumulated, public, non-confidential, online archives of a government organization, and exported them (as HTML) to a web server. Last time I looked there were 250,000 files on Hantsweb, and over a million links.

How do you maintain a big website?

Macros, build engines, and production cycles

Big websites aren't written in HTML.

This may come as a revelation, but if you stop to think about it, it makes sense. Big agglomerations of HTML documents are very hard to change consistently. They suffer from broken links. And they tend to lose structure with time, as different people hack on them.

The smart approach is either to store your content in a database and export it into HTML via some kind of build engine that takes care of the links, or to write in some precursor language that's better suited to document management and process it into HTML.

In either case, HTML is the output format -- the way PostScript is an output format for word processors.

Wherever possible, the valuable content is insulated from the vagaries of architecture (how it's linked into a web) and design (how the output is formatted).

In a database-driven system, the build engine scans through each record, digs out the data, slaps it into HTML, and adds links to neighbouring entities -- then creates a file in an output directory somewhere.
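To make that concrete, here's a toy sketch in Python of a database-driven build engine. The three-page schema, the table name, and the `export` function are all my invention for illustration -- no real site's storage engine looks quite this simple:

```python
import sqlite3

# Hypothetical schema: one row per page, in reading order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, slug TEXT, title TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO pages (slug, title, body) VALUES (?, ?, ?)",
    [("one", "Page one", "First."),
     ("two", "Page two", "Second."),
     ("three", "Page three", "Third.")],
)

def export(conn):
    """Scan each record, slap the data into HTML, and add links
    to the neighbouring entries; return filename -> HTML."""
    rows = conn.execute(
        "SELECT slug, title, body FROM pages ORDER BY id"
    ).fetchall()
    out = {}
    for i, (slug, title, body) in enumerate(rows):
        nav = []
        if i > 0:
            nav.append('[ <A HREF="%s.html">Previous</A> ]' % rows[i - 1][0])
        if i < len(rows) - 1:
            nav.append('[ <A HREF="%s.html">Next</A> ]' % rows[i + 1][0])
        out[slug + ".html"] = (
            "<HTML><HEAD><TITLE>%s</TITLE></HEAD>\n"
            "<BODY>%s\n<P>%s</BODY></HTML>" % (title, body, " ".join(nav))
        )
    return out
```

A real build engine would then write each entry of the returned dictionary into a file in an output directory somewhere.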

Macro-driven systems are similar, but instead of repeatedly processing records in a database, a macro processor scans files in some input format, transforms them into HTML, and writes the HTML files into an output directory. A macro processor is a program that processes files and obeys instructions it finds in them. Most macro processors can update variables, make editing changes, execute instructions to load, interpolate, or delete files, and generally behave a lot like your word processor's mailmerge feature on steroids.

Database-driven webs are ideal for really humongously huge sites, where the full power of an Oracle or DB2 storage engine comes into play. Macro based systems can also work on a big scale, but tend to require an external database to keep track of their links; at some point, they break down, or become a database-driven site in all but name.

Macro-based sites, however, are easy to slap together and make for a good small to medium web platform.

How this site is put together

I use a system called Meta-HTML.

Meta-HTML is a seriously heavyweight macro processing system for the web. It's free software, and it can run in a variety of modes; its job is to read documents in a language called meta-html, and turn them into HTML.

It functions as a standalone web server (reading metahtml files and translating them to HTML on the fly when it receives a request), as a CGI script (so you can embed it in another web server), and as a standalone file processor. The file processor slurps in a meta-html file and spits out plain ordinary HTML.

Because this site is mirrored in places where I can't run Meta-HTML as a CGI script, I use the latter facility. All my Meta-HTML files live in a directory tree. I wrote a simple Perl script that recursively descends the tree. Whenever it sees a Meta-HTML file, it processes it and copies the output to an equivalent point below a parallel, "build" directory. Whenever it sees a file that isn't Meta-HTML, it simply copies it over. Whenever it runs into a directory it doesn't know about, it creates an empty one of the same name below the build root.
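A rough Python equivalent of that Perl script might look like this. The names are my invention, and the `process` hook stands in for a call out to the standalone Meta-HTML file processor:

```python
import shutil
from pathlib import Path

def build_site(src, dest, process):
    """Mirror the source tree below a parallel build directory.

    `process` is whatever turns Meta-HTML text into plain HTML --
    here, any callable taking and returning a string.
    """
    src, dest = Path(src), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for entry in src.iterdir():
        target = dest / entry.name
        if entry.is_dir():
            # A directory: create the equivalent below the build
            # root, then recurse into it.
            build_site(entry, target, process)
        elif entry.suffix == ".mhtml":
            # A Meta-HTML file: process it and write the output to
            # the equivalent point in the build tree.
            target.with_suffix(".html").write_text(process(entry.read_text()))
        else:
            # Anything that isn't Meta-HTML is simply copied over.
            shutil.copy2(entry, target)
```

Run against a source tree, this leaves a parallel tree of plain ordinary HTML ready to upload to any mirror.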

All my files begin with two or three lines like this:

<set-var chapter-name="Essays">
<set-var page-title="Maintaining large web sites">
<include header.mhtml>

These lines are Meta-HTML. They create a couple of variables (called chapter-name and page-title) associated with the file, then read in another file (header.mhtml) which is inserted into the flow of the text at that point.

The header.mhtml file sets up the standard document header. It looks like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META http-equiv="PICS-Label" content='(PICS-1.0 "http://www.rsac.org/ratingsv01.html" l gen
true comment "RSACi North America Server" by "charlie@antipope.demon.co.uk" for
"http://www.antipope.demon.co.uk/webbook/" on "1996.04.16T08:15-0500" exp
"1997.01.01T08:15-0500" r (n 0 s 0 v 0 l 0))'> 
<TITLE>
<get-var page-title>
</TITLE>
</HEAD>
<BODY link="8B0000" vlink="473C8B" bgcolor="F0F0F0">
<BR>
<IMG SRC="../gifs/earth-ball.gif" ALIGN="RIGHT">
<H1>
<get-var chapter-name>
</H1>
<H2>
<get-var page-title>
</H2>
<P><HR><P>
<include menubar.mhtml>
<P><HR><P>
<BLOCKQUOTE>

This looks like gibberish, doesn't it? It sets up a PICS tag (for content rating of web pages), interpolates the variable page-title into the document's TITLE tag, then interpolates the chapter-name in the right place. It also inserts another file: menubar.mhtml.

menubar.mhtml looks like this:

[ <A HREF="/charlie/misc/bio/mailto.html">Comments</A> ]
[ <A HREF="copyright.html">Copyright</A> ]
[ <A HREF="../index.html">Main contents</A> ]

Each directory in this website has a header.mhtml, footer.mhtml, and menubar.mhtml file. I like to keep it that way so that I can edit the menu bars (for example) for an entire directory; I could equally well keep them all in one place, so that changing a single file would change the look and feel of every document in the website.

All I need to do is edit menubar.mhtml and re-run my macro processing script, and the menubars are changed for every file in this directory.

I can also make changes to the overall "look and feel" of the site by changing the tags in the header and footer files.

Some changes are, of course, considerably harder to make. If I wanted to entabulate the entire site, replacing every <P> tag with a table row/column tag, so that I could add things like multi-column layout and side-bars, I'd have to rework the entire site by hand. (I may yet do this, turning my site into pure Meta-HTML.) Likewise, it would be sensible to build a database system to track document URLs, and upgrade my macro script to use it. But for now, this website is small enough that I can keep links from breaking. In general, each directory holds less than twenty files, and only the index.mhtml pages have links to other directories (or files in them). Thus, I've partitioned the maintenance problem into manageability for the time being: and if I ever need to add ten thousand more files to my site, I have most of the infrastructure in place to handle them.
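In the meantime, a simple link checker goes a long way towards keeping links from breaking. Here's a sketch in Python -- my invention, not part of my build script -- that walks a built tree and flags relative links whose target file doesn't exist:

```python
import re
from pathlib import Path

def broken_links(root):
    """Scan every HTML file below root; return a list of
    (page, href) pairs whose relative target is missing."""
    broken = []
    for page in Path(root).rglob("*.html"):
        for href in re.findall(r'HREF="([^"]+)"',
                               page.read_text(), re.IGNORECASE):
            if href.startswith(("http:", "https:", "ftp:", "mailto:", "#")):
                continue  # external links and fragments aren't checked
            target = (page.parent / href.split("#")[0]).resolve()
            if not target.exists():
                broken.append((str(page), href))
    return broken
```

Run after every rebuild, it catches a dangling link before a reader does -- a poor man's version of the URL-tracking database.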

(Oh yes. You were wondering if I'd include my website management scripts here, weren't you? The answer is: I will, but not yet. They're not fully mature, and I want to ensure they're usable and effective before unleashing them on the world at large. Meanwhile, I can't emphasize strongly enough that if you want to manage a site with more than a dozen files, you want a macro processor: and for UNIX boxes, Meta-HTML offers a very versatile, powerful, and robust solution.)

