A prettyprinter for HTML documents


Old, bad, HTML is unreadable to the Mark One Human Eyeball. Never mind what Netscape Gold or Hotmetal Pro makes of the file; if, like me, you insist on writing your web pages by hand, you need to have an HTML file that you can scan easily.

To demonstrate the problem, consider this short perl script, using LWP (lib-WWW perl 5.01). It invokes an HTML parser and spits out the raw HTML; it might equally well be a CGI script that generates text on the fly:

use HTML::Parse;
print parse_htmlfile("pp.html")->as_HTML;
Run against this very file, it produces some horrible output that looks (to the naked eye) like this:
<HTML> <HEAD> <META CONTENT="Charlie Stross, UNIX-oriented text 
mangler, using vi on Linux" NAME="Generator"> <META CONTENT="A 
prettyprinter for HTML documents" NAME="Description"> <META CONTENT="html
sgml coding sex portability" NAME="Keywords"> <TITLE>A prettyprinter 
for HTML documents</TITLE> </HEAD> <BODY BGCOLOR="F0F0F0" 
LINK="8B0000" VLINK="473C8B"> <P><BR> <IMG ALIGN="RIGHT" 
SRC="../gifs/earth-ball.gif" VALIGN="TOP"> <H1>A prettyprinter for 
HTML documents</H1> <P><HR><P> Old, bad, HTML is 
unreadable to the Mark One Human Eyeball. Never mind what Netscape Gold or 
Hotmetal Pro makes of the file; if, like me, you insist on writing your web 
pages by hand, you need to have an HTML file that you can scan easily.<P> 

You really don't want to try and maintain stuff that looks like that, even though it's actually well-formed HTML that a web browser or SGML parser should be able to handle.

This is a pretty serious problem with machine-generated HTML; it's simply incomprehensible. This is also a familiar problem in computer programming in general -- it's easy to write programs that are so obfuscated that they're unreadable, and you can't go back to them in six months' time with any real hope of understanding how they work.

To avoid the rewrite blues that tend to go with such a situation, programmers adopt a variety of strategies, ranging from explanatory comments to neatly indenting and word-wrapping their files. You can do that, too, if you start at the beginning: it's a lot easier to maintain an HTML file if it looks like this:

    <meta content="charlie stross, unix-oriented text mangler, using vi on linux" name="generator">
    <meta content="a prettyprinter for html documents" name="description">
    <meta content="html sgml coding sex portability" name="keywords">
    <title>A prettyprinter for HTML documents</title>
  <body bgcolor="f0f0f0" link="8b0000" vlink="473c8b">
      <img align="right" src="../gifs/earth-ball.gif" valign="top">
    A prettyprinter for HTML documents
          Old, bad, HTML is unreadable to the Mark One Human
          Eyeball. Never mind what Netscape Gold or Hotmetal Pro
          makes of the file; if, like me, you insist on writing
          your web pages by hand, you need to have an HTML file
          that you can scan easily.

Nevertheless, as a sad fact of life, we get to deal with lots of HTML that isn't well-formed. It may be amateur work, or it may be stuff that's been patched a few thousand times over the years, but whatever the case: it's well-nigh unusable.

Which is where an appropriate tool for pretty-printing HTML comes in handy. Here's a brief Perl script that relies heavily on some facilities of the Lib-WWW Perl 5.01 library. You'll need a full Perl 5.002 (or later) distribution, and the Lib-WWW Perl modules, in order to use it.

hp - HTML pretty-printer

Usage:
  hp [ options ] filename > destfile

Options:
  -w n, --wrap=n
      wrap document to n columns; default: [70]
  -t n, --tab=n
      indent n characters per tag level; default: [3]
  --implicit, --noimplicit
      specify insertion of implicit close tags (where missing from original document); default: [on]
  --ignore, --noignore
      ignore or don't ignore unknown tags; default: [off]
  -h, --help
      print a usage message

Example:
  hp -w 80 -t 2 index.html.old >index.html

hp reads the specified file and parses it using the LWP HTML parser. It then recursively traverses the parse tree, printing formatted output to stdout (which may be redirected to a file).

hp can format HTML tags in two ways: as breakable and non-breakable. Breakable HTML tags (such as H1, HEAD, TABLE) are followed by a line break and an increase in the indentation level for their contents. Non-breakable tags (such as B, I, IMG, or A HREF) are word-wrapped along with their enclosing text.
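To make the two classes concrete, here's a minimal sketch in plain Perl. It is not hp itself: the toy tree of nested arrays, the cut-down %breaktag table and the render routine are all made up for illustration, and it skips word-wrapping entirely. It only shows how a breakable tag gets its own line and a deeper indent while a non-breakable tag stays in the text flow.

```perl
use strict;
use warnings;

# Hypothetical miniature of the breakable-tag table; the real one is larger.
my %breaktag = map { $_ => 1 } qw(head body h1 table p);

# Toy parse tree: [ tag, child, child, ... ]; plain strings are text nodes.
my $tree = [ 'body',
             [ 'h1', 'A title' ],
             [ 'p', 'Some ', [ 'b', 'bold' ], ' text.' ] ];

# Return the formatted output lines for one node at the given depth.
sub render {
    my ($node, $depth, $tab) = @_;
    return $node unless ref($node);          # text node: passed through as-is
    my ($tag, @kids) = @$node;
    my $pad  = ' ' x ($depth * $tab);        # indent of the tag itself
    my $body = $pad . (' ' x $tab);          # indent of its contents
    if ($breaktag{$tag}) {
        # breakable: tag on its own line, contents one level deeper
        my @lines  = ("$pad<$tag>");
        my $inline = '';
        for my $kid (@kids) {
            if (ref($kid) && $breaktag{ $kid->[0] }) {
                # flush pending inline text before a nested breakable tag
                push @lines, $body . $inline if length $inline;
                $inline = '';
                push @lines, render($kid, $depth + 1, $tab);
            } else {
                $inline .= render($kid, $depth + 1, $tab);
            }
        }
        push @lines, $body . $inline if length $inline;
        return (@lines, "$pad</$tag>");
    }
    # non-breakable: rendered inline, along with the enclosing text
    return "<$tag>" . join('', map { render($_, $depth, $tab) } @kids) . "</$tag>";
}

print "$_\n" for render($tree, 0, 2);
```

Here body, h1 and p each open a new line and push their contents in by two columns, while the b element rides along inside the paragraph text.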

You can use the wrap option to adjust the width of the output file, and the tab option to adjust how 'steep' the indentation looks. The implicit flag is switched on by default; it causes the parser to generate 'close' tags where they are missing from the input document. (Be careful about turning implicit off!)
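Here's a rough sketch of how the two settings interact, using the standard Text::Wrap module. The wrap_at_depth helper is hypothetical (hp does its wrapping inline rather than through a routine like this), and I've switched off Text::Wrap's habit of turning leading spaces into tabs so that the indentation stays visible as spaces.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: wrap $text to $cols columns, indented by $depth
# levels of $tab spaces each.
sub wrap_at_depth {
    my ($text, $depth, $tab, $cols) = @_;
    my $indent = ' ' x ($depth * $tab);
    local $Text::Wrap::columns  = $cols;   # page width
    local $Text::Wrap::unexpand = 0;       # keep spaces, don't convert to tabs
    return wrap($indent, $indent, $text);
}

my $text = 'Old, bad, HTML is unreadable to the Mark One Human Eyeball.';
print wrap_at_depth($text, 2, 2, 40), "\n\n";   # narrow page, gentle indent
print wrap_at_depth($text, 2, 4, 40), "\n";     # same page, steeper indent
```

The page width stays fixed, so a steeper indent leaves less room for text on each line and the paragraph runs to more lines.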

Limitations

The first known limitation comes from the parser: hp depends on LibWWW-Perl 5.03's HTML::Parse module for parsing. The parser engine used doesn't have any easy mechanism for ignoring specific tags such as <PRE>; thus, preformatted blocks may be reformatted along with the flow of body text. You can get round this by (a) using precut (below) to burst the preformatted blocks out of the file, (b) running hppp, and (c) running mhc (the MetaHTML macro processor) over the resulting file, to source the preformatted blocks back in.

See More HTML Maintenance tools for a discussion of these programs.
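If you don't have those tools to hand, the same burst-and-restore idea can be sketched in a few lines of Perl. This is not how precut or mhc work; protect_pre, restore_pre and the __PREn__ placeholder tokens are my own inventions for illustration, and the whitespace-squashing line merely stands in for a real formatting pass.

```perl
use strict;
use warnings;

# Swap each <pre>...</pre> block for a placeholder token and remember it.
sub protect_pre {
    my ($html) = @_;
    my @saved;
    $html =~ s{(<pre\b.*?</pre>)}{
        push @saved, $1;
        '__PRE' . $#saved . '__';
    }gsie;
    return ($html, \@saved);
}

# Put the remembered blocks back in place of their placeholder tokens.
sub restore_pre {
    my ($html, $saved) = @_;
    $html =~ s{__PRE(\d+)__}{$saved->[$1]}g;
    return $html;
}

my $doc = "<p>Intro   text</p>\n<pre>\n  keep   this\n    layout\n</pre>\n<p>More</p>";
my ($stripped, $saved) = protect_pre($doc);
(my $reflowed = $stripped) =~ s/\s+/ /g;   # stand-in for the real formatter
print restore_pre($reflowed, $saved), "\n";
```

The placeholder is plain text rather than an HTML comment, on the assumption that a parser may throw comments away; a single unbroken word should survive word-wrapping intact.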

(A future prettyprinter may utilize the HTML::TreeParse object-oriented parser skeleton ... once I figure out how to override methods and wrap my head around the deeper OOP wizardry.)

More importantly: the implicit flag is either on or off for an entire document. If it's on, it adds closing tags to everything, including elements such as <META> and <IMG SRC> (which arguably shouldn't have them at all).
use LWP::Simple;
use HTML::Parse;
use HTML::Entities;
use Text::Wrap;
use Getopt::Long;

use strict;
require 5.0;


=head1 hp -- HTML prettyprinter

Online documentation will go here

=cut


# initialize global variables

%MAIN::breaktag = ( # list of tags to automatically insert line breaks around 
              "p"      => 1,
              "meta"   => 1,
              "title"  => 1,
              "head"   => 1,
              "body"   => 1,
              "h1"     => 1,
              "h2"     => 1,
              "h3"     => 1,
              "h4"     => 1,
              "h5"     => 1,
              "h6"     => 1,
              "pre"    => 1,
              "table"  => 1,
              "tr"     => 1,
              "td"     => 1,
              "center" => 1,
              "br"     => 1,
              "dl"     => 1,
              "dt"     => 1,
              "dd"     => 1,
              "li"     => 1,
              "al"     => 1,
              "set-var"   => 1,
              "get-var"   => 1,
              "include"   => 1,
);

%MAIN::junkclosure = ( # list of tags for which we want to drop the close
                       # tag; flagged as IMPLICIT in the DTD
              "p"      => 1,
              "hr"     => 1,
              "img"    => 1,
              "meta"   => 1,
              "set-var"   => 1,
              "get-var"   => 1,
              "include"   => 1,
);

$MAIN::tmpbuff = ""; # global scratch workspace for accumulating text stream

# initialize variables local to main loop 

my ($tabspace) = 3;  # default num. of columns per tabstop
my ($wrapcols) = 70; # default page width for line wrap
my ($implicit) = 1;  # flag: insert implicit tags, or not
my ($ignore)   = 0;  # flag: ignore unknown tags, or not
my ($help)     = 0;  # do we want help?
my ($target)   = ""; # name of current file to process

# set up command line options

GetOptions("w=i"       => \$wrapcols,
           "wrap=i"    => \$wrapcols,
           "t=i"       => \$tabspace,
           "tab=i"     => \$tabspace,
           "implicit!" => \$implicit,
           "ignore!"   => \$ignore,
           "h"         => \$help,
           "help"      => \$help);

# issue help message, if needed

if ($help || !@ARGV) {
   print "\nhppp -- html prettyprinter\n",
         "reformat an HTML document to be more easily maintainable.\n",
         "-w, --wrap=<n>        wrap document to <n> columns\n",
         "                      default: [$wrapcols]\n",
         "-t, --tab=<n>         indent <n> characters per tag level\n",
         "                      default: [$tabspace]\n",
         "--implicit, --noimplicit\n",
         "                      specify insertion of implicit close tags\n",
         "                      (where missing) \n",
         "                      default: [on]\n",
         "--ignore, --noignore\n",
         "                      ignore or don't ignore unknown tags\n",
         "                      default: [off]\n",
         "\nTypical usage:\n",
         "hp -w 80 -t 2 index.html.old >index.html\n\n";
   exit 0;  
}

# foreach specified file, prettyprint it!

foreach $target (@ARGV) {
    # configure the parser before building the tree
    $HTML::Parse::IGNORE_UNKNOWN = $ignore;
    $HTML::Parse::IMPLICIT_TAGS = $implicit;
    my ($p) = parse_htmlfile($target)
        or die "hp: cannot parse $target: $!\n";
    print "";
    treewalk($p, 0, $tabspace, $wrapcols);
    print "\n\n";
}

exit 0;

#------------------ support subroutine --------------------------
sub treewalk {
    my ($p) = shift @_;        # current node in the parse tree
    my ($depth) = shift @_;    # nesting depth of that node
    my ($tabspace) = shift @_; # columns of indent per level
    my ($wrap) = shift @_;     # page width for line wrap
    my ($entity) = "";
    my ($indent) = 0;
    my ($bdindent) = 0;
    my ($outname) = "";
    my ($out) = "";
    if (ref($p->content)) {
        foreach $entity (@{ $p->content } ) {
            if (ref($entity) eq "HTML::Element") {
                $indent =  " " x ($depth * $tabspace);
                $bdindent = $indent . (" " x $tabspace);
                $out  = lc($entity->starttag);
                $out  =~    /^\S*?<       # start of line to start of tag
                              ([-\w]+)    # word contents of tag
                              \W+         # something that ain't a word
                            /x;
                $outname = $1;
                if (defined $MAIN::breaktag{$outname} ) {
                    # breakable tag: flush pending inline text at the body
                    # indent first (wrap arguments assumed)
                    if ($MAIN::tmpbuff !~ /^[\s\n]+$/)  {
                        $Text::Wrap::columns = ($wrap - $depth);
                        print "\n", 
                              wrap("$bdindent ", 
                                   "$bdindent ", 
                                   $MAIN::tmpbuff);
                    }
                    print wrap("\n$indent", 
                               $indent, HTML::Entities::decode($out));
                    $MAIN::tmpbuff = "";
                } else {
                    # non-breakable tag: keep it in the text flow
                    $MAIN::tmpbuff .= HTML::Entities::decode($out);
                }
                # let's dive down the tree, now ...
                treewalk($entity, $depth+1, $tabspace, $wrap);
                # and it's time to close out whatever tag we were in before 
                if (defined $MAIN::breaktag{$outname}) {
                    if (($MAIN::tmpbuff !~ /^[\s\n]+$/) && 
                        (length($MAIN::tmpbuff) > 0)) {
                        $Text::Wrap::columns = ($wrap - $depth);
                        print "\n", 
                              wrap("$bdindent ", 
                                   "$bdindent ", 
                                   $MAIN::tmpbuff);
                    }
                    $MAIN::tmpbuff = "";
                    if ( ! defined ($MAIN::junkclosure{$outname})) {
                        print ("\n", $indent, lc($entity->endtag));
                    }
                } else {
                    # drop the trailing space added after the last text chunk
                    if ($MAIN::tmpbuff !~ />$/) {
                        $MAIN::tmpbuff = substr($MAIN::tmpbuff, 0, -1);
                    }
                    if ( ! defined ($MAIN::junkclosure{$outname})) {
                        $MAIN::tmpbuff .= lc($entity->endtag) ;
                    }
                }
            } elsif (! ref($entity)) {
                # plain text: accumulate it until the next breakable tag
                if ($entity ne "") {
                    $MAIN::tmpbuff .=  $entity; 
                    $MAIN::tmpbuff .= " ";
                }
            }
        }
    }
}

#------------------------- end treewalk --------------------------

This isn't a cure, but it's a useful palliative. Certainly, if you need to tidy up a load of old HTML files, running them through hp will make it easier to understand their structure by eye.

But for a real cure for the disease, it's necessary to look at HTML document repositories and maintenance tools.
