Unlike SGML and XML parsers, xmlfixup can be applied to incomplete XML documents, and to document fragments, making it a convenient tool for reformatting XML markup in a text-editor session, when the editor provides for filtering text regions by an external program.
Suitable indentation usually makes document structure much clearer, and can also reveal instances of improper tag nesting. However, before submitting an XML stream to other XML processing tools, it is always advisable to confirm that it is well-formed by a strict grammatical analysis with an SGML parser, such as nsgmls(1) or sgmls(1), or with an XML parser, such as xmllint(1) or xmlwf(1).
To achieve the requested linewidth goal, xmlfixup normally breaks the input stream at whitespace or a newline when the next token would cause the requested linewidth to be exceeded. However, it will not do so at space inside an XML tag, or where there is no existing breakpoint. Thus, a long punctuation-terminated sequence like

    <systemitem role="url">http://www.xml.org/long/path/</systemitem>.

will not be broken, and may exceed the specified linewidth.
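The breaking rule described above can be illustrated with a short Python sketch. It is not xmlfixup's own code: the function names split_outside_tags and fill are hypothetical, and the sketch assumes a simple greedy fill in which whitespace inside <...> never counts as a breakpoint.

```python
def split_outside_tags(text):
    """Split text at whitespace, except whitespace inside <...> tags."""
    words, current, in_tag = [], [], False
    for ch in text:
        if ch == "<":
            in_tag = True
        elif ch == ">":
            in_tag = False
        if ch.isspace() and not in_tag:
            # A breakpoint: close off the word collected so far.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def fill(text, width):
    """Greedy fill: start a new line when the next word would exceed width.

    A word longer than width is never split, so its line may be too long.
    """
    lines, line = [], ""
    for word in split_outside_tags(text):
        if not line:
            line = word
        elif len(line) + 1 + len(word) <= width:
            line += " " + word
        else:
            lines.append(line)
            line = word
    if line:
        lines.append(line)
    return lines
```

Calling fill() on text that contains the URL example above with a width of 20 leaves the whole tagged sequence, including its trailing period, on one overlong line, because it contains no whitespace outside a tag.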
This option may be abbreviated to -d.
This option may be abbreviated to -du.
The default XML markup assumed by xmlfixup is a small subset of the DocBook/XML markup tags, but alternative indentation rules for other tag sets can be easily defined in a configuration file supplied as a command-line argument.
The formatting required for documents in XML markup can be specified by assigning XML tags to one of several formatting classes, illustrated by the default configuration file output with the --dumpconfig option:
    # xmlfixup version 1.01 [02-Dec-2003]
    Begin_Verbatim : # clear existing list
    Begin_Verbatim : <screen>
    Break_Before : # clear existing list
    Break_Before : <colspec> <systemitem> <xref>
    Empty_Line_After : # clear existing list
    Empty_Line_After : </entry> </item> </listitem> </para> </row>
    Empty_Line_After : </screen> </tbody> </thead> </varlistentry>
    Empty_Line_Before : # clear existing list
    Empty_Line_Before : <entry> <item> <listitem> <para> <row>
    Empty_Line_Before : <screen> <tbody> <thead> <varlistentry>
    Empty_Line_Before : <xref>
    End_Verbatim : # clear existing list
    End_Verbatim : </screen>
    No_Break_Before : # clear existing list
    No_Break_Before : <footnote>
    Ordinary : # clear existing list
    Ordinary : <!> </citetitle> </command> </emphasis>
    Ordinary : </envvar> </filename> </firstterm>
    Ordinary : </indexterm> </literal> </option> </quote>
    Ordinary : </replaceable> </screen> </secondary>
    Ordinary : </subscript> </superscript> </systemitem>
    Ordinary : </term> </tertiary> </title> <?> <citetitle>
    Ordinary : <colspec> <command> <emphasis> <envvar>
    Ordinary : <filename> <firstterm> <footnoteref>
    Ordinary : <indexterm> <literal> <option> <primary>
    Ordinary : <quote> <replaceable> <screen> <secondary>
    Ordinary : <spanspec> <subscript> <superscript>
    Ordinary : <systemitem> <term> <tertiary> <title> <xref>
Unlike in HTML and most SGML document types, markup tags in XML are case-sensitive, and xmlfixup follows that practice. DocBook/XML tags are all lowercase, but tags for other XML document types may use mixed lettercase.
In a configuration file, comments run from a sharp sign (#) to end of line, and are removed before further processing. Blank or empty lines, and whitespace at start or end of line, are ignored.
The formatting class is defined by the name before the colon, which may optionally be surrounded by whitespace. The colon is followed by a whitespace-separated list of zero or more tags in that class. An empty list clears the list for that class; otherwise, the tags on each repeated line for a class name are added to the existing list for that class.
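The configuration syntax just described is simple enough to sketch in a few lines of Python. This is only an illustration of the stated rules, not the parser that xmlfixup uses; the function name parse_config is hypothetical, and skipping lines that lack a colon is an assumption, since the text does not say how such lines are handled.

```python
def parse_config(text, classes=None):
    """Return a dict mapping each class name to its list of tags."""
    classes = {} if classes is None else classes
    for raw in text.splitlines():
        # Comments run from '#' to end of line; surrounding space is ignored.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank, empty, or comment-only line
        name, colon, rest = line.partition(":")
        if not colon:
            continue  # assumption: lines without a colon are skipped
        name = name.strip()
        tags = rest.split()
        if tags:
            # Repeated class names augment the existing list.
            classes.setdefault(name, []).extend(tags)
        else:
            # An empty list clears the class.
            classes[name] = []
    return classes
```

Passing an existing dictionary as the second argument models a user configuration that is applied on top of the defaults: a class that is not mentioned keeps its list, and a class that is first cleared and then redefined is replaced.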
The Begin_Verbatim and End_Verbatim classes contain tag environments that must be copied verbatim, without any change whatsoever in indentation or spacing.
The Break_Before class contains tags which should begin on a new line, and thus, require a line break preceding the tag.
The No_Break_Before class contains tags, like those for footnotes, which must be attached to preceding text, and thus, must not be preceded by a line break. However, they are otherwise treated like normal tags with indented bodies. Existing whitespace preceding the open tag will not be eliminated.
The Empty_Line_After and Empty_Line_Before classes contain tags that should appear alone on their line, with an empty line after or before them, respectively.
The Ordinary class contains tags that behave like ordinary text, and thus require no special formatting; in particular, they do not cause line breaks.
Two kinds of tags receive special treatment. Formatter directives of the form <?name?> are classed according to the reduced form <?>. Comments (<!-- ...-->) and SGML directives (<!NAME ...>) are classed according to the reduced form <!>. Consequently, only those reduced forms need be used in the classification rules.
Tags may belong to more than one class.
Tag attributes are ignored when tags are classified: thus, <chapter id="mybook-ch-3"> is formatted just like <chapter> is.
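The reduction of a tag to the form used for classification can be sketched in Python as follows. The function name class_key is hypothetical, and the treatment of empty-element tags such as <xref/> (reduced here to <xref>) is an assumption that the text above does not confirm.

```python
import re


def class_key(tag):
    """Reduce a tag to the form looked up in the class lists."""
    if tag.startswith("<?"):
        return "<?>"  # formatter directives such as <?name?>
    if tag.startswith("<!"):
        return "<!>"  # comments and SGML directives
    # Keep the optional slash and the tag name; drop any attributes.
    # Lettercase is preserved, because XML tags are case-sensitive.
    m = re.match(r"<(/?)([^\s/>]+)", tag)
    if m is None:
        return tag  # not a recognizable tag; return it unchanged
    return "<%s%s>" % (m.group(1), m.group(2))
```

With this reduction, <chapter id="mybook-ch-3"> and <chapter> produce the same key, while <Para> and <para> remain distinct.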
Outside the seven classes, all other tags are assumed to define environments with indented bodies. This unnamed class thus serves as a catch-all for an unbounded set of unclassified tags. The open tag (<name>) appears on a line by itself at the current indentation level, the level is incremented by one for the body, and the close tag (</name>) appears on a line by itself at the same level as the open tag:
    <tag>
      Text text text...
      Text text text...
      <tag>
        More text text text...
        <tag>
          Even more text text text...
        </tag>
      </tag>
      Text text text...
    </tag>
xmlfixup adjusts indentation levels according to whether the tag is an open tag or a close tag: it issues a warning on stderr, and also inside an XML comment on stdout, if the tag names do not match. If this happens, it means that either the XML input is not well-formed, or else there is an inconsistency in the markup rules in a user-supplied configuration file.
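The indentation and tag-matching behavior for unclassified tags can be sketched with a stack of open tag names. The following Python is a minimal illustration under stated assumptions, not xmlfixup's algorithm: the function name indent_unclassified, the two-space indentation step, and the wording of the warning are all hypothetical, and the input is assumed to be already split into tags and text items.

```python
import re
import sys


def indent_unclassified(tokens, step=2, warn=sys.stderr):
    """Indent a token list in which every tag is an unclassified tag."""
    out, stack = [], []
    for tok in tokens:
        close = re.match(r"</([^\s>]+)", tok)
        opened = re.match(r"<([^\s/!?>]+)", tok)
        if close:
            name = close.group(1)
            if not stack or stack[-1] != name:
                # Hypothetical message text; reported on both streams.
                expected = "</%s>" % stack[-1] if stack else "no close tag"
                msg = "found </%s> but expected %s" % (name, expected)
                print(msg, file=warn)
                out.append("<!-- %s -->" % msg)
            if stack:
                stack.pop()
            # The close tag sits at the level of its open tag.
            out.append(" " * (step * len(stack)) + tok)
        elif opened:
            out.append(" " * (step * len(stack)) + tok)
            stack.append(opened.group(1))  # body is one level deeper
        else:
            out.append(" " * (step * len(stack)) + tok)
    return out
```

For example, indent_unclassified(["<a>", "text", "<b>", "more", "</b>", "</a>"]) places the two text items at levels one and two, and returns each close tag to the level of its open tag.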
xmlfixup does not currently offer any control over the formatting of the internals of SGML comments, declarations, and directives (all of which look like <!...>), or of processor directives (which look like <?...>).
It may even prove necessary to extend xmlfixup with additional tag classes to handle a wider variety of XML document types.
xmlfixup does not know anything about SGML tag minimization (use of unnamed end tags </>, and omission of end tags where their implied positions can be determined by the SGML parser from the markup grammar). A tag normalizer, such as sgmlnorm(1) or spam(1), would likely be needed to standardize SGML input before xmlfixup could format it sensibly.
xmlfixup does not currently offer an option to ignore lettercase in tags, making it less useful for HTML and SGML documents with inconsistent lettercase in tags. A tag normalizer can fix such problems. For HTML files, html-pretty(1) has extensive knowledge of HTML variants, and can do an excellent job of prettyprinting such files.
- Peter Flynn, Understanding SGML and XML tools: practical programs for handling structured text, Kluwer, 1998, ISBN 0-7923-8169-6.
- Charles F. Goldfarb and Yuri Rubinsky, The SGML handbook, Clarendon Press, 1990, ISBN 0-19-853737-9.
- Charles F. Goldfarb and Paul Prescod, Charles F. Goldfarb's XML handbook, fourth edition, Prentice-Hall PTR, 2001, ISBN 0-13-065198-2.
- Michel Goossens and Sebastian Rahtz, The LaTeX Web companion: integrating TeX, HTML, and XML, Addison-Wesley Longman, 1999, ISBN 0-201-43311-7.
- Norman Walsh and Leonard Muellner, DocBook: the definitive guide, O'Reilly & Associates, 1999, ISBN 1-56592-580-7.
References to hundreds of other books on HTML, SGML, and XML can be found in the bibliographies at:
    http://www.math.utah.edu/pub/tex/bib/index-table.html#sgml
    http://www.math.utah.edu/pub/tex/bib/index-table.html#sgml2000
    Nelson H. F. Beebe
    University of Utah
    Department of Mathematics, 110 LCB
    155 S 1400 E RM 233
    Salt Lake City, UT 84112-0090
    Tel: +1 801 581 5254
    FAX: +1 801 581 4148
    Email: [email protected], [email protected], [email protected] (Internet)
    WWW URL: http://www.math.utah.edu/~beebe
ftp://ftp.math.utah.edu/pub/xmlfixup/
in the file xmlfixup-x.yy.tar.gz where x.yy is the current version. Other distribution formats are usually available in the same location.