html-pretty [ -? ] [ -a ] [ -c ] [ -f filename ] [ -h ] [ -i nnn ] [ -n ] [ -v ] [ -w nnn ] file(s) > outfile
html-pretty filters its HTML input from stdin, or from one or more named files given on the command line, and prettyprints it to stdout.
HTML (HyperText Markup Language) is the language used to specify formatting instructions in text files intended for viewing with World-Wide Web (WWW) client programs (browsers), such as arena(1), hotjava(1), lynx(1), netscape(1), and xmosaic(1).
The WWW idea began in late 1992. Because viewer programs support display of text, line drawings, color raster images, and hypertext links, as well as uniform access to several Internet services, including file transfer, the number of WWW servers grew from zero to several hundred thousand in the first two years, and some of the more popular sites receive up to two million accesses a day from all over the Internet. Consequently, many Internet computer users are beginning to write HTML documents for their own home pages, and html-pretty is written for them.
The goal of a prettyprinter is to recognize all legal inputs and to produce output that is indented to reflect the structure, with line lengths restricted for improved readability. Irregularities in coding practice, and outright errors, are more likely to be detected in the prettyprinted output than in the input.
SGML (Standard Generalized Markup Language, ISO 8879) and its particular document type definition instance, HTML, follow a rigorous grammar for text markup that makes it possible to clearly identify document parts, such as headings, sections, subsections, paragraphs, figures, tables, equations, and so on. Files with such standardized markup are particularly good candidates for prettyprinting.
The current definition of HTML is still in flux (version 2.0 is nearing standardization, and version 3.0, informally called HTML Plus, is under development in early 1995). html-pretty follows the grammar of version 3.0, which is a superset of that of version 2.0. Version 3.0 introduces several new tags (see the FORMATTING CONVENTIONS section below), and supports figures, input forms, tables, and a limited math mode.
One significant difference between the two grammar versions is that the HTML tag <P> is a paragraph separator in version 2.0, while it is a paragraph begin in version 3.0, and consequently expects to have a matching </P> end tag that is not required in 2.0. html-pretty will supply missing </P> tags, and delete empty <P> . . . </P> environments. Since HTML translators ignore unrecognized tags, this is transparent to HTML version 2.0 implementations, and causes no problems.
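The paragraph handling just described can be illustrated with a minimal Python sketch. It is not html-pretty's own code; the function name close_paragraphs is hypothetical, and the sketch assumes the input is a fragment consisting only of paragraphs, so that end of input may safely close the last one.

```python
import re

# Matches <P>, <P attr=...>, or </P>, case-insensitively (but not <PRE>).
P_TAG = re.compile(r"<(/?)P(\s[^>]*)?>", re.IGNORECASE)


def close_paragraphs(html):
    """Supply missing </P> tags and drop empty <P>...</P> pairs.

    Hypothetical helper: a paragraph is closed by the next <P>, by an
    explicit </P>, or by end of input.
    """
    out = []          # finished pieces of output
    open_tag = None   # text of the currently open <P ...> tag, if any
    body = []         # text collected since the last P tag
    pos = 0

    def flush():
        nonlocal open_tag, body
        text = "".join(body)
        if open_tag is None:
            out.append(text)              # text outside any paragraph
        elif text.strip():
            stripped = text.rstrip()
            trailing = text[len(stripped):]
            out.append(open_tag + stripped + "</P>" + trailing)
        # else: an empty paragraph, which is dropped entirely
        open_tag, body = None, []

    for match in P_TAG.finditer(html):
        body.append(html[pos:match.start()])
        pos = match.end()
        flush()
        if not match.group(1):            # a begin tag opens a paragraph
            open_tag = match.group(0)
    body.append(html[pos:])
    flush()
    return "".join(out)
```

With version 2.0 style input such as "<P>one" followed by "<P>two", each paragraph gains its own end tag, and an input containing only "<P></P>" produces no output at all.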
html-pretty expects that its input is reasonably well-formed. Usually it is sufficient that the file can be displayed by one or more WWW browsers, producing the expected form. However, it would be unwise to write a large amount of program code without a compiler to check it, and it is similarly unwise to write documentation in HTML or SGML without at least a validating parser to ensure that the text is syntactically correct.
Fortunately, at least two such programs are publicly available (thanks to the generosity of their author, James Clark): nsgmls(1) and sgmls(1), together with UNIX shell scripts, html-check(1) and html-ncheck(1), to facilitate their use with HTML files. In addition, the nsgmls(1) distribution is accompanied by two SGML tag normalizers, sgmlnorm(1) and spam(1), and there is a UNIX shell script, html-spam(1), for one of them. You may therefore find it useful to apply html-spam(1), and either html-check(1) or html-ncheck(1), to your HTML files, and fix all of the errors that they detect, before filtering the files with html-pretty.
HTML strictly requires a certain amount of boiler-plate to be wrapped around the text, and there is ample evidence that most HTML files omit these wrappers, because WWW browsers are written to be tolerant of grammatical deviations. html-pretty will supply the wrappers if they are omitted; indeed, if given an empty input file, html-pretty produces output similar to this:
    <!-- -*-html-*- -->
    <!-- Prettyprinted by html-pretty lex version 0.09 [05-May-1995] -->
    <!-- on Sat May 6 09:55:25 1995 -->
    <!-- for Nelson H. F. Beebe (beebe@sunrise) -->
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
    <HTML>
      <HEAD>
        <TITLE>
          <!-- Please supply a descriptive title here -->
        </TITLE>
        <!-- Please supply a correct e-mail address here -->
        <LINK REV="made" HREF="mailto:USERNAME@HOSTNAME>">
      </HEAD>
      <BODY>
      </BODY>
    </HTML>

This example, minus the comments <!-- ... -->, shows the minimal markup that should be expected in an HTML file, although the grammar permits the HTML, HEAD and BODY environments to be implicitly assumed if they are omitted. While WWW browsers ignore the DOCTYPE declaration, it is essential for SGML parsers, since it identifies the grammar rules that apply to what follows.
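As a rough illustration of supplying omitted wrappers, the following Python sketch adds the DOCTYPE, HTML, HEAD, and BODY boilerplate when it is missing. The names has_tag and add_wrappers are hypothetical, the sketch handles only the simple case where HEAD and BODY are both absent, and it makes no claim to match html-pretty's actual logic or output layout.

```python
import re

DOCTYPE = '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">'


def has_tag(text, name):
    """Return True if a begin tag <name ...> occurs anywhere in text."""
    pattern = r"<" + name + r"(\s[^>]*)?>"
    return re.search(pattern, text, re.IGNORECASE) is not None


def add_wrappers(text):
    """Wrap text in the HTML, HEAD, and BODY environments it lacks.

    Assumption: if either HEAD or BODY is already present, the input
    is taken to have supplied both.
    """
    if not has_tag(text, "HEAD") and not has_tag(text, "BODY"):
        text = ("<HEAD>\n<TITLE>\n</TITLE>\n</HEAD>\n"
                "<BODY>\n" + text + "</BODY>\n")
    if not has_tag(text, "HTML"):
        text = "<HTML>\n" + text + "</HTML>\n"
    if "<!DOCTYPE" not in text.upper():
        text = DOCTYPE + "\n" + text
    return text
```

Applying add_wrappers to an empty string yields a skeleton document, and applying it again to that result leaves it unchanged.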
Any argument that begins with a hyphen is expected to be an option, and will raise an error if it is not recognized. If a filename begins with a hyphen, you therefore need to disguise it by supplying a leading directory path. For example, ./-foo represents the file named -foo in the current directory in UNIX.
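A script that invokes html-pretty on arbitrary filenames could apply this disguise automatically. The sketch below is only an illustration; the helper name disguise is hypothetical.

```python
import os


def disguise(filename):
    """Prefix a filename that begins with a hyphen with the current
    directory, so that it cannot be mistaken for an option."""
    if filename.startswith("-"):
        return os.path.join(os.curdir, filename)
    return filename


# Example: build an argument list that is safe to pass on the command line.
safe_names = [disguise(name) for name in ["index.html", "-foo"]]
```

On UNIX, the second entry of safe_names is ./-foo, while index.html is passed through unchanged.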
HTML 3.0 augments the 2.0 grammar with 53 additional tags: ABBREV, ABOVE, ACRONYM, ARRAY, ATOP, AU, BAR, BELOW, BIG, BOX, BQ, BT, CAPTION, CHOOSE, CREDIT, DDOT, DEL, DFN, DIV, DOT, FIG, HAT, INS, ITEM, LANG, LEFT, LH, MATH, NOTE, OF, OVER, OVERLAY, PERSON, PRE, Q, RIGHT, ROOT, ROW, S, SMALL, SQRT, STYLE, SUB, SUP, T, TAB, TABLE, TD, TH, TILDE, TR, U, and VEC.
These tags are identified by their occurrence in the html.dtd and html-3.dtd document type definition files in lines like these:
    <!ENTITY % font " TT | B | I ">
    <!ENTITY % phrase "EM | STRONG | CODE | SAMP | KBD | VAR | CITE ">
    <!ELEMENT (%font;|%phrase) - - (%text)+>
    <!ELEMENT XMP - - %literal>

ENTITY declarations define text string substitutions, and ELEMENT declarations define the tags recognized by the grammar.
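To show how tag names might be collected from declarations like these, here is a minimal Python sketch. It is an assumption-laden illustration, not the procedure actually used to compile the tag lists: the function name element_names is hypothetical, only parameter entities with quoted literal values are handled, and entity references are expanded one level deep.

```python
import re

DTD = '''
<!ENTITY % font " TT | B | I ">
<!ENTITY % phrase "EM | STRONG | CODE | SAMP | KBD | VAR | CITE ">
<!ELEMENT (%font;|%phrase) - - (%text)+>
<!ELEMENT XMP - - %literal>
'''

ENTITY = re.compile(r'<!ENTITY\s+%\s+(\S+)\s+"([^"]*)"\s*>')
ELEMENT = re.compile(r'<!ELEMENT\s+(\([^)]*\)|\S+)')
REFERENCE = re.compile(r'%([A-Za-z][\w.-]*);?')


def element_names(dtd):
    """Return the tag names declared by the ELEMENT lines of dtd."""
    entities = dict(ENTITY.findall(dtd))
    names = []
    for group in ELEMENT.findall(dtd):
        # Replace each %name; reference by its declared text, if known.
        expanded = REFERENCE.sub(
            lambda m: entities.get(m.group(1), ""), group)
        names.extend(re.findall(r"[A-Za-z][A-Za-z0-9]*", expanded))
    return names
```

For the four declarations shown, element_names(DTD) returns the three font tags, the seven phrase tags, and XMP, in that order.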
The HTML grammar permits certain end tags to be omitted, when their implied position can be determined from the grammatical context. In HTML 3.0, this includes the following tags: BODY, DD, DT, HEAD, ITEM, LI, MESSAGE, OPTION, P, TD, TH, and TR. Supporting such a feature requires the ability to parse a complete SGML grammar, which requires a great deal more code than html-pretty provides. Consequently, it does not support optional end tags; based on typical usage, they are expected to be always present, or always absent, according to the rules given below.
HTML comments are prettyprinted on separate lines. Their internal form is preserved exactly, without any line wrapping, since they will often contain specially-formatted material.
The following HTML tag names occur in begin/end pairs (<TAG> and </TAG>), often with substantial amounts of intervening text. They are prettyprinted on separate lines, with their enclosed text indented one level: A, ABBREV, ABSTRACT, ACRONYM, ADDED, ADDRESS, ARG, AROW, ARRAY, AU, BLOCKQUOTE, BODY, BQ, CAPTION, CITE, CMD, CREDIT, DIV1, DIV2, DIV3, DIV4, DIV5, DIV6, FIG, FN, FOOTNOTE, FORM, H1, H2, H3, H4, H5, H6, HEAD, HIDE, HTML, LANG, LH, MARGIN, MATH, MESSAGE, NOTE, OPTION, OVERLAY, P, PERSON, Q, QUOTE, REMOVED, SELECT, TABLE, TEXTAREA, and TITLE.
These HTML tag names occur in begin/end pairs, usually with smaller amounts of enclosed material. They appear inline in the running text, and do not alter indentation: B, BIG, BT, CODE, DFN, EM, I, KBD, Q, REV, S, SAMP, SMALL, STRONG, T, TT, U, and VAR.
These HTML tag names occur in begin/end pairs, and delimit lists. They appear on separate lines, with their enclosed text indented two levels: DIR, DL, MENU, OL, and UL.
These HTML tags mark the beginning of list items, and have matching end tags which are supplied if they are absent. They are output on separate lines, indented one level from the enclosing list: DD, DT, and LI.
These HTML tag names appear inline, without affecting indentation: TAB, TAG, TD, TH, and TR.
These HTML tag names occur only inside MATH mode, and appear inline, without affecting indentation: ABOVE, ATOP, BAR, BELOW, BOX, CHOOSE, DDOT, DEL, DOT, HAT, INS, LEFT, OF, RIGHT, ROOT, ROW, SQRT, SUB, SUP, TILDE, and VEC.
This HTML tag marks an explicit line break, and has no matching end tag; preceding space is deleted, and a newline follows: BR.
These HTML tags have no matching end tag; they appear alone on separate lines: BASE, CHANGED, HR, IMG, ITEM, INPUT, ISINDEX, LINK, META, NEXTID, OVER, RENDER, STYLE, and STYLES.
These HTML tags appear in begin/end pairs, and delimit preformatted, or verbatim text. The beginning and ending tags are output on separate lines, with the enclosed material copied exactly as it appeared in the input stream: CDATA, LISTING, PRE, and XMP.
Finally, the HTML tag PLAINTEXT marks the beginning of verbatim text that continues to end-of-file; it appears on a separate line. Although some HTML viewers will terminate the verbatim text environment on reaching a matching end tag, </PLAINTEXT>, that practice is now considered erroneous.
All tags that are not explicitly named above are treated as normal text, and have no effect on the indentation.
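The tag groups described above amount to a lookup table from tag name to formatting behavior. The Python sketch below models that idea with a small subset of each group; the category labels, the names classify and enclosed_indent, and the table layout are all hypothetical and are not taken from html-pretty.

```python
# A subset of the tags named in each group; category labels are invented.
CATEGORIES = {
    "block":    {"A", "BODY", "H1", "HEAD", "HTML", "P", "TABLE", "TITLE"},
    "inline":   {"B", "CODE", "EM", "I", "STRONG", "TT", "VAR"},
    "list":     {"DIR", "DL", "MENU", "OL", "UL"},
    "item":     {"DD", "DT", "LI"},
    "break":    {"BR"},
    "empty":    {"BASE", "HR", "IMG", "INPUT", "LINK", "META"},
    "verbatim": {"LISTING", "PRE", "XMP"},
}

# Indentation levels applied to the text enclosed by a begin/end pair.
INDENT = {"block": 1, "list": 2}


def classify(tag):
    """Return the category of a tag such as '<UL>' or '</p>'.

    Tags not named in the table are treated as normal text.
    """
    parts = tag.strip("</>").split()
    name = parts[0].upper() if parts else ""
    for category, names in CATEGORIES.items():
        if name in names:
            return category
    return "text"


def enclosed_indent(tag):
    """Return how many levels the text enclosed by tag is indented."""
    return INDENT.get(classify(tag), 0)
```

A table of this kind keeps the formatting rules in one place, so supporting a new tag means adding its name to a set rather than writing new code.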
In SGML, this is conventionally done in SGML declaration files for syntax definitions, and in Document Type Definition (DTD) files, which are analogous to LaTeX style files. Low-level typesetting commands of the flavor ``select a 14-point Lucida-BoldItalic font'', ``skip down vertically 6 picas'', and ``draw a horizontal rule 3 ems long'' are notably absent. SGML documents are expected to use only high-level markup commands, leaving the visual appearance entirely up to the DTD specification, and the formatting software.
Unlike LaTeX and TeX, SGML is not a typesetting system. It is only a grammar for a standard markup language, and it is left to SGML software implementors to write DTDs, and to provide for translation of SGML documents to specific document formatting, typesetting, or word processing systems. In this respect, SGML is similar to the RTF (Rich Text Format) supported by many popular word processors, in that it can serve as an intermediate language for electronic document exchange; several language translators between SGML and other text representations are mentioned in the SEE ALSO section below. Regrettably, there is wide variation in the capabilities and document models of current text formatting systems, so such translations are usually rough approximations that may require substantial hand patching to make them truly satisfactory.
HTML is a modest subset of SGML, with tag names apparently chosen from SGML, the Free Software Foundation's TeXinfo system (which in turn is modeled on the earlier Scribe document formatting system), and occasionally, also from LaTeX.
html-pretty is written in this spirit: it knows about the meaning and typical use of all standard HTML tags, but nothing about other SGML DTDs. While it would be reasonably straightforward to prepare modifications of html-pretty to support other, specific, SGML DTDs, you cannot expect it to be very effective for handling arbitrary SGML text.
A
    contained in: ADDRESS B BODY CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA B CITE CODE DFN EM I IMG KBD SAMP STRONG TT U VAR

ADDRESS
    contained in: BLOCKQUOTE BODY FORM LI
    contains: #PCDATA A HR IMG P

B
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

BASE
    contained in: HEAD

BLOCKQUOTE (obsolete)
    contained in: BODY FORM
    contains: #PCDATA ADDRESS DL HR IMG OL P PRE UL

BODY
    contained in: HTML
    contains: ADDRESS BLOCKQUOTE DL FORM H1 H2 H3 H4 H5 H6 HR IMG LISTING OL P PRE UL XMP

BR
    contained in: A ADDRESS B BLOCKQUOTE CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD P PRE SAMP STRONG TT U VAR

CITE
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

CODE
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

DD
    contained in: DL
    contains: #PCDATA A B BR CITE CODE DFN DL EM HR I IMG INPUT KBD OL P PRE SAMP STRONG TEXTAREA TT U UL VAR

DFN
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

DIR (obsolete)
    contains: LI

DL
    contained in: BLOCKQUOTE BODY DD FORM LI
    contains: DD DT

DT
    contained in: DL
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

EM
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

FORM
    contained in: BODY FORM LI
    contains: ADDRESS BLOCKQUOTE DL FORM H1 H2 H3 H4 H5 H6 HR LISTING MESSAGE OL P PRE UL XMP

H1
    contained in: BODY FORM
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

H2
    contained in: BODY FORM
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

H3
    contained in: BODY FORM
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

H4
    contained in: BODY FORM
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

H5
    contained in: BODY FORM
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

H6
    contained in: BODY FORM
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

HEAD
    contained in: HTML
    contains: BASE ISINDEX LINK META TITLE

HR
    contained in: ADDRESS BLOCKQUOTE BODY DD FORM LI

HTML
    contains: BODY HEAD

I
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

IMG
    contained in: A ADDRESS B BLOCKQUOTE BODY CITE CODE DD DFN DT EM FORM H1 H2 H3 H4 H5 H6 I KBD LI P PRE SAMP STRONG TT U VAR

INPUT
    contained in: B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD P PRE SAMP STRONG TT U VAR

ISINDEX
    contained in: HEAD

KBD
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

KEY (obsolete)
    contains: #PCDATA B CITE CODE DFN EM I KBD STRONG TT U VAR

LI
    contained in: DIR MENU OL UL
    contains: #PCDATA A ADDRESS B CITE CODE DFN DL EM FORM HR I IMG KBD OL P PRE STRONG TT U UL VAR XMP

LINK
    contained in: HEAD

LISTING
    contained in: BODY FORM
    contains: 12 A

MENU (obsolete)
    contains: LI

MESSAGE
    contained in: FORM

META
    contained in: HEAD

NEXTID (obsolete)

OL
    contained in: BLOCKQUOTE BODY DD FORM LI
    contains: LI

OPTION
    contained in: SELECT

P
    contained in: ADDRESS BLOCKQUOTE BODY DD FORM LI
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

PLAINTEXT (obsolete)

PRE
    contained in: BLOCKQUOTE BODY DD FORM LI
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

SAMP
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

SELECT
    contained in: B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD P PRE SAMP STRONG TT U VAR
    contains: OPTION

STRONG
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

TEXTAREA
    contained in: B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD P PRE SAMP STRONG TT U VAR

TITLE
    contained in: HEAD
    contains: #PCDATA

TT
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

U
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

UL
    contained in: BLOCKQUOTE BODY DD FORM LI
    contains: LI

VAR
    contained in: A B CITE CODE DD DFN DT EM H1 H2 H3 H4 H5 H6 I KBD KEY LI P PRE SAMP STRONG TT U VAR
    contains: #PCDATA A B BR CITE CODE DFN EM I IMG INPUT KBD SAMP STRONG TEXTAREA TT U VAR

XMP
    contained in: BODY DD FORM LI
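The "contains" and "contained in" lists above are two views of the same relation, and either can be derived from the other. The Python sketch below illustrates this with a handful of entries copied from the lists; the names CONTAINS, invert, and valid_path are hypothetical, and the table is deliberately incomplete.

```python
from collections import defaultdict

# A few "contains" entries taken from the lists above (not the full table).
CONTAINS = {
    "HTML": {"BODY", "HEAD"},
    "HEAD": {"BASE", "ISINDEX", "LINK", "META", "TITLE"},
    "TITLE": {"#PCDATA"},
    "DL": {"DD", "DT"},
    "DIR": {"LI"},
    "MENU": {"LI"},
    "OL": {"LI"},
    "UL": {"LI"},
}


def invert(contains):
    """Derive the 'contained in' relation from the 'contains' relation."""
    contained_in = defaultdict(set)
    for parent, children in contains.items():
        for child in children:
            contained_in[child].add(parent)
    return dict(contained_in)


def valid_path(path, contains=CONTAINS):
    """Return True if each element of path may directly contain the next."""
    return all(child in contains.get(parent, set())
               for parent, child in zip(path, path[1:]))
```

Inverting this partial table recovers, for example, that LI is contained in DIR, MENU, OL, and UL, which agrees with the LI entry above. The path HTML, HEAD, TITLE, #PCDATA is accepted, while HTML directly containing TITLE is rejected.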
Nelson H. F. Beebe, Ph.D.
Center for Scientific Computing
Department of Mathematics
University of Utah
Salt Lake City, UT 84112
Tel: +1 801 581 5254
FAX: +1 801 581 4148
Email: <[email protected]>
URL: http://www.math.utah.edu/~beebe