dw is a handy tool for finding a common typographical error in documentation. It filters its standard input, or a list of files, printing on standard output only those words that appear two or more times in succession. Each word is prefixed by its filename and line number(s). For stdin, the filename is a dash.
A word starts with a letter or underscore, and is followed by zero or more letters, underscores, or digits. Letter case is ignored. Program options can modify that behavior.
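The behavior described above can be illustrated with a minimal Python sketch. It is not dw's implementation: the function name find_doubled_words and the tuple layout of the results are hypothetical, and the sketch assumes that punctuation between two occurrences of a word does not break the succession.

```python
import re

# Default word shape: a letter or underscore, then letters, underscores, or digits.
WORD = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def find_doubled_words(lines, filename="-"):
    """Return (filename, first_line, last_line, word) for each run of repeated words.

    Hypothetical helper for illustration; letter case is ignored when comparing.
    """
    hits = []
    prev = None          # lowercase form of the word in the current run
    first = last = 0     # line numbers of the first and last occurrence in the run
    count = 0            # number of occurrences in the current run
    for lineno, line in enumerate(lines, start=1):
        for match in WORD.finditer(line):
            word = match.group(0).lower()
            if word == prev:
                count += 1
                last = lineno
            else:
                if count >= 2:
                    hits.append((filename, first, last, prev))
                prev, first, last, count = word, lineno, lineno, 1
    if count >= 2:
        hits.append((filename, first, last, prev))
    return hits

if __name__ == "__main__":
    sample = ["This is is a", "test Test of the", "the the end"]
    for name, start, end, word in find_doubled_words(sample):
        print(f"{name}:{start}-{end}: {word}")
```

A run that spans a line break is reported with both line numbers, which is why each result carries a first and a last line.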
Pattern files may contain leading comment lines beginning with a sharp character (#): reading terminates after the first noncomment line.
For installed pattern files, a suffix .pat is automatically supplied, and the file contents are in Unicode UTF-8 encoding. Such files are named by the lowercase English name of the language, or by its ISO 639-1 alpha-2 code. Those codes are often identical to the ISO 3166-1 alpha-2 codes that are used for Internet top-level domain names. Thus, --word-pattern spanish and --word-pattern es find equivalent pattern files, spanish.pat and es.pat, for that language.
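As a rough sketch of the lookup and reading rules above, the following Python fragment appends the .pat suffix, reads the file as UTF-8, and skips leading comment lines. The function names and the pattern_dir parameter are hypothetical; the actual installation directory is not stated here.

```python
from pathlib import Path

def read_pattern(text):
    """Return the first noncomment line of a pattern file's text, or None.

    Leading lines that begin with '#' are skipped; reading stops at the
    first line that is not a comment, so later lines are never examined.
    """
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        return line.strip()
    return None

def installed_pattern_path(name, pattern_dir):
    """Build the path of an installed pattern file (pattern_dir is an assumption)."""
    return Path(pattern_dir) / (name + ".pat")

def load_installed_pattern(name, pattern_dir):
    """Read an installed pattern file, which is assumed to be UTF-8 encoded."""
    path = installed_pattern_path(name, pattern_dir)
    return read_pattern(path.read_text(encoding="utf-8"))
```

Under this sketch, the names spanish and es resolve to spanish.pat and es.pat in whatever directory is supplied.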
The default pattern, suitable for English-language text and similar to that for identifiers in many programming languages, is [A-Za-z_][A-Za-z_0-9]. That means an initial letter or underscore, followed by any number of letters, underscores, or digits. Whitespace around the bracketed patterns is ignored.
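One way to picture the two-part pattern is to split it into its bracketed sets, the first for the initial character and the second for the characters that follow. The Python sketch below does only that; split_bracketed is a hypothetical helper, and its treatment of a backslash as protecting the next character inside a set is an assumption.

```python
def split_bracketed(pattern):
    """Split a pattern such as '[A-Za-z_] [A-Za-z_0-9]' into its bracketed sets.

    Whitespace around the bracketed sets is ignored. A backslash inside a set
    is assumed to protect the next character, so '\\]' does not close the set.
    """
    parts = []
    i, n = 0, len(pattern)
    while i < n:
        if pattern[i].isspace():
            i += 1
            continue
        if pattern[i] != "[":
            raise ValueError(f"expected '[' at offset {i}")
        j = i + 1
        while j < n and pattern[j] != "]":
            j += 2 if pattern[j] == "\\" else 1
        if j >= n:
            raise ValueError("unterminated bracketed set")
        parts.append(pattern[i + 1:j])
        i = j + 1
    return parts

if __name__ == "__main__":
    initial, rest = split_bracketed("[A-Za-z_] [A-Za-z_0-9]")
    print(initial)  # characters allowed at the start of a word
    print(rest)     # characters allowed after the first
```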
Any character in the pattern, except the character range dash, may be represented by an octal escape sequence of the form \ooo, \oo, or \o, where o is an octal digit, or by a hexadecimal escape sequence of the form \xhh, where h is a hexadecimal digit in 0-9a-f (ignoring lettercase), or by a literal escape sequence of the form \c where c is any single character, representing itself. The escape character checks are made in that order, so \x begins a hexadecimal sequence, rather than being a literal x.
Thus, the vowels aeiou may be represented by normal characters aeiou, by the literal sequence \a\e\i\o\u, by the octal sequence \141\145\151\157\165, by the hexadecimal sequence \x61\x65\x69\x6f\x75, or any mixture thereof.
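The escape rules can be checked with a small decoder, sketched here in Python. It applies the checks in the stated order (octal, then hexadecimal, then literal). The name decode_escapes is hypothetical, and two behaviors are assumptions: a \x not followed by two hexadecimal digits raises an error, and a trailing lone backslash is kept as is. The sketch only decodes characters; it does not track which dashes denote ranges.

```python
OCT_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"

def decode_escapes(text):
    """Decode \\ooo, \\xhh, and \\c escapes, checked in that order."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch != "\\" or i + 1 >= n:
            out.append(ch)          # ordinary character, or trailing backslash
            i += 1
            continue
        nxt = text[i + 1]
        if nxt in OCT_DIGITS:
            # Octal: one to three octal digits after the backslash.
            j = i + 1
            while j < n and j < i + 4 and text[j] in OCT_DIGITS:
                j += 1
            out.append(chr(int(text[i + 1:j], 8)))
            i = j
        elif nxt == "x":
            # Hexadecimal: exactly two hex digits (assumption: anything else is an error).
            digits = text[i + 2:i + 4]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(f"malformed hexadecimal escape at offset {i}")
            out.append(chr(int(digits, 16)))
            i += 4
        else:
            out.append(nxt)         # literal escape: the character stands for itself
            i += 2
    return "".join(out)

if __name__ == "__main__":
    forms = [
        "aeiou",
        r"\a\e\i\o\u",
        r"\141\145\151\157\165",
        r"\x61\x65\x69\x6f\x75",
        r"a\145\x69\o\165",
    ]
    print([decode_escapes(form) for form in forms])
```

All five forms in the usage example decode to the same five vowels, matching the equivalence stated above.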
Here are examples of additional characters needed in word patterns (some glyphs may be lost due to output device and/or font limitations). All of them are recognized by dw, with language names in lowercase:
The languages above include two that are written from right to left (Arabic and Hebrew, each with distinctive non-Latin scripts).
Nelson H. F. Beebe
University of Utah
Department of Mathematics, 110 LCB
155 S 1400 E RM 233
Salt Lake City, UT 84112-0090
USA
WWW URL: http://www.math.utah.edu/~beebe
Tel: +1 801 581 5254
FAX: +1 801 581 4148
Email: [email protected], [email protected], [email protected]