A word to be spell checked may contain any ASCII letter or apostrophe, or any character in the range 128..255. All other characters are silently ignored.
With suitable locale-specific dictionaries and, optionally, suffix rules, myspell can check spelling for files in any human language that can be encoded in ASCII, or any of the ISO 8859-n code pages, or Unicode in UTF-8 encoding, provided that whitespace separates words. [Languages that lack word separators, such as Lao and Thai, require sophisticated grammatical analysis to identify words.] For Unicode, some prefiltering may be needed to remove Unicode punctuation; otherwise, the punctuation will simply be reported as spelling exceptions.
If the files to be spell checked contain document markup, that markup should usually be stripped by a suitable initial filter step; see the EXAMPLES section below.
To avoid confusion with options, if a filename begins with a hyphen, it must be disguised by a leading absolute or relative directory path, e.g., /tmp/-foo or ./-foo. Alternatively, precede the file list with the -- option.
If this option is not specified, and no locale is set, then myspell will use a built-in list of system dictionaries.
For spell(1) compatibility, this option may be abbreviated to +dictfile.
This option may be abbreviated to =rulefile.
    myspell report.txt > report.ser
    myspell +report.sok report.txt > report.ser
    deroff *.rno | myspell -s french.sfx > temp.ser
    detex *.tex | myspell -p mydict.sok > temp.ser
    dehtml *.html | myspell -l fr =french.sfx > temp.ser
    dexml *.xml | myspell -locale da > temp.ser
    dexml *.xml | myspell -l da =danish.sfx > temp.ser
Once the input is free of spelling errors, the output of myspell will be a list of exceptional words that are not in the current dictionaries, but are known to be correct. They can be added to a private dictionary that is used on subsequent runs, thereby reducing the size of future reports.
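The cycle described above can be sketched in shell. The function name add_exceptions is hypothetical and not part of myspell, and the sketch assumes that both the exception report and the private dictionary hold one word per line, which the text does not state explicitly.

```shell
# Hypothetical helper (not part of myspell): merge reviewed exception words
# into a private dictionary, keeping the result sorted and free of duplicates.
# Assumption: both files hold one word per line.
add_exceptions() {
    exceptions=$1    # e.g. report.ser, after removing genuine misspellings
    dict=$2          # e.g. report.sok, created if it does not exist yet
    touch "$dict" || return 1
    LC_ALL=C sort -u "$exceptions" "$dict" > "$dict.tmp" &&
        mv "$dict.tmp" "$dict"
}

# Typical cycle (commented out because it needs myspell itself):
#   myspell +report.sok report.txt > report.ser
#   ... edit report.ser so that only correct words remain ...
#   add_exceptions report.ser report.sok
```

Writing to a temporary file and renaming it avoids truncating the dictionary while sort is still reading it.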
There are numerous sources of word lists for various languages on the Internet (search for "word list" with your favorite search engine), e.g.,
    ftp://ftp.ox.ac.uk/pub/wordlists/
    ftp://ibiblio.org/pub/docs/books/gutenberg/etext96/pgw*
    ftp://qiclab.scn.rain.com/pub/wordlists/
    http://www.phreak.org/html/wordlists.shtml
Dictionaries for other spell checkers can usually be trivially adapted for use with myspell.
In addition, any corpus of text in a single language that is known to be relatively free of errors can be easily filtered with tr(1) and sort(1) to produce a candidate spelling dictionary for any language that is not yet supported by myspell. Internet archives of articles, books, reports, theses, and even news stories can often be readily located by Web search engines.
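A minimal sketch of such a tr(1) and sort(1) filter follows. The function name make_wordlist is hypothetical. The sketch assumes the word characters are ASCII letters, the apostrophe, and bytes 128..255, that dictionary entries should be lowercase, and that one word per line is the wanted format. Only ASCII letters are lowercased, so text in an 8-bit or UTF-8 encoding would need a locale-aware step instead.

```shell
# Hypothetical helper: turn a text corpus into a candidate word list.
# Assumptions: words consist of ASCII letters, apostrophes, and bytes
# 128..255; entries are lowercased (ASCII only) and written one per line.
make_wordlist() {
    # 1. Replace every run of nonword bytes with a single newline.
    # 2. Fold ASCII uppercase to lowercase.
    # 3. Sort and drop duplicates, then remove any empty line.
    LC_ALL=C tr -cs "A-Za-z'\200-\377" '\n' < "$1" |
        LC_ALL=C tr 'A-Z' 'a-z' |
        LC_ALL=C sort -u |
        sed '/^$/d'
}

# Example (file names are placeholders):
#   make_wordlist corpus.txt > candidate.dict
```

The result is only a candidate list: any misspellings in the corpus survive the filter, so the output still deserves a manual review before it is used as a dictionary.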
Suffix rules are defined in simple text files that contain one rule per line, beginning with a suffix regular expression, and followed by a possibly-empty list of replacement suffixes, one of which may be the empty string, indicated by adjacent quotation marks. Comments run from sharp (#) to end of line, and blank lines are ignored.
Here is a short example for English:
    '$              # Jones' -> Jones
    's$             # it's -> it
    ed$     "" e    # breaded -> bread, flamed -> flame
    ied$    ie y    # died -> die, cried -> cry
    ly$     ""      # acutely -> acute
    s$              # cats -> cat
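To illustrate how rules in this format might be applied, here is a minimal sketch built on awk(1). It is not myspell's actual implementation. The function name suffix_candidates is hypothetical, and the sketch assumes that every matching rule is tried, that an empty replacement list means the suffix is simply removed, and that a checker would then look up each candidate in its dictionaries.

```shell
# Hypothetical helper: read words on standard input and print, for each
# suffix rule that matches, the candidate base words the rule produces.
# $1 is a suffix-rule file in the format described above.
suffix_candidates() {
    awk -v rulefile="$1" '
    BEGIN {
        # Load the rules: field 1 is the suffix pattern, the rest are replacements.
        while ((getline line < rulefile) > 0) {
            sub(/[ \t]*#.*/, "", line)          # strip comments
            k = split(line, f, " ")
            if (k == 0) continue                # blank or comment-only line
            n++
            pattern[n] = f[1]
            # An empty replacement list is taken to mean: just remove the suffix.
            repl[n] = (k == 1) ? "\"\"" : ""
            for (j = 2; j <= k; j++) repl[n] = repl[n] " " f[j]
        }
        close(rulefile)
    }
    {
        for (i = 1; i <= n; i++) {
            if (match($0, pattern[i])) {
                stem = substr($0, 1, RSTART - 1)    # word without the suffix
                m = split(repl[i], r, " ")
                for (j = 1; j <= m; j++) {
                    suffix = (r[j] == "\"\"") ? "" : r[j]
                    print stem suffix
                }
            }
        }
    }
    '
}

# Example (english.sfx is a placeholder file name):
#   printf "%s\n" breaded | suffix_candidates english.sfx
```

With the English rules shown above, the word breaded matches only ed$, so the sketch proposes bread and breade; only the first would be found in a dictionary.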
While suffix rules suffice for many Indo-European languages, others don't need them at all, and still others have more complex changes in spelling as words change in case, number, or tense. For such languages, the simplest solution seems to be a larger dictionary that incorporates at least all of the common word forms.
Much more work needs to be done to provide language-specific suffix-rule files, and to collect dictionaries for many more languages.
Nelson H. F. Beebe
Center for Scientific Computing
University of Utah
Department of Mathematics, 110 LCB
155 S 1400 E RM 233
Salt Lake City, UT 84112-0090
Tel: +1 801 581 5254
FAX: +1 801 581 4148
Email: [email protected], [email protected], [email protected], [email protected] (Internet)
WWW URL: http://www.math.utah.edu/~beebe