or bibsort [-byday or -byvolume or -byyear] [ optional sort(1) switches ] BibTeXfile(s) >outfile
Sorting is normally by BibTeX citation label name, or by @String macro name, and letter case is always ignored in the sorting.
All remaining command-line words are assumed to be input files. Should such a filename begin with a hyphen, it must be disguised by a leading absolute or relative directory path, e.g. /tmp/-foo.bib or ./-foo.bib.
The sort(1) -f switch to ignore letter case differences is always supplied. The -r switch reverses the order of the sort. The -u switch removes duplicate bibliography entries from the input stream; however, such entries must match exactly, including all white space.
With -byday sorting, a day keyword is recognized (it will be standard in BibTeX 1.0), but for backward compatibility, month entries of the form
    "daynumber " # monthname
    "daynumber~" # monthname
    {daynumber } # monthname
    {daynumber~} # monthname
    monthname # "daynumber "
    monthname # "daynumber~"
    monthname # {daynumber }
    monthname # {daynumber~}
are also recognized, and will yield both a day and a month. If a day number is not available, 99 is assumed, which will sort the entry after others that have day values in the same year and month.
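As a rough illustration of the month forms above, the following Python sketch extracts a day number and a month name from a month value, and falls back to 99 when no day is present. It is not bibsort's own nawk(1) code; the helper name split_day_month and both regular expressions are assumptions made for this example only.

```python
import re

# Hypothetical helper, not part of bibsort: pull a day number and a month
# macro name out of a BibTeX month value such as  "15~" # jan  or  jan # {15 }.
DAY_FIRST = re.compile(
    r'^\s*["{]\s*~?\s*(\d{1,2})\s*~?\s*["}]\s*#\s*([A-Za-z]+)\s*$')
MONTH_FIRST = re.compile(
    r'^\s*([A-Za-z]+)\s*#\s*["{]\s*~?\s*(\d{1,2})\s*~?\s*["}]\s*$')


def split_day_month(value):
    """Return (day, monthname); day is 99 when the value carries no day."""
    m = DAY_FIRST.match(value)
    if m:
        return int(m.group(1)), m.group(2).lower()
    m = MONTH_FIRST.match(value)
    if m:
        return int(m.group(2)), m.group(1).lower()
    # No day number found: 99 sorts after every real day in the same month.
    return 99, value.strip().lower()
```

A value such as "15~" # jan would give a day of 15, while a bare jan would give the placeholder 99.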
With -byvolume sorting, warnings are issued for any entry in which any of these fields are missing, and a value of the missing field is supplied that will sort higher than any printable value.
Because -byvolume sorting is first on journal name, it is essential that there be only one form of each journal name; the best way to ensure this is to always use @String{...} abbreviations for them. Order -byvolume is convenient for checking a bibliography against the original journal, but less convenient for a bibliography user.
An unfortunate implementation limitation of the current BibTeX requires cross-referenced entries to appear after all other entries that cross-reference them, although this limitation works to the advantage of bibsort, allowing single-pass processing.
1. Introductory material such as comments, file headers, and edit logs that are ignored by BibTeX. No line in this part begins with an at-sign, ``@''.
2. Preamble material delineated by ``@Preamble{'' and a matching closing ``}'', intended to be processed by TeX. Normally, there is only one such entry in a bibliography file, although BibTeX, and bibsort, permit more than one.
3. Macro definitions (abbreviations) of the form ``@String{...}''. Any single @String specification may span multiple lines, and there are usually several such definitions.
4. Bibliography entries such as ``@Article{...}'', ``@Book{...}'', ``@InProceedings{...}'', and so on, provided that their citation labels have not already been encountered in a crossref assignment in a preceding entry. For bibsort, any line that begins with an ``@'' followed by letters and digits and an open brace is considered to be such an entry. Optional spaces and tabs may surround the ``@'', and precede the first open brace; these spaces and tabs will be deleted from the output to help standardize the appearance.
5. ``@Proceedings{...}'' bibliography entries, which are likely to be cross-referenced by ``@InProceedings{...}'' entries, and any other bibliography entries for which a crossref assignment was met before the entry itself.
The order of these parts is preserved in the output stream. Part 1 will be unchanged, but parts 2--5 will be sorted within themselves.
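The entry-recognition rule in part 4 can be sketched in Python as follows. This is only an approximation of the rule as stated, not the egrep(1) or nawk(1) patterns that bibsort itself uses; the names ENTRY_START, entry_keyword, and normalize_entry_start are hypothetical.

```python
import re

# Assumed pattern: optional blanks, "@", optional blanks, a keyword made of
# letters and digits, optional blanks, and then the first open brace.
ENTRY_START = re.compile(r'^[ \t]*@[ \t]*([A-Za-z0-9]+)[ \t]*\{')


def entry_keyword(line):
    """Return the keyword (e.g. 'Article') if line starts an entry, else None."""
    m = ENTRY_START.match(line)
    return m.group(1) if m else None


def normalize_entry_start(line):
    """Delete blanks around "@" and before "{"; other lines are unchanged."""
    return ENTRY_START.sub(lambda m: "@" + m.group(1) + "{", line, count=1)
```

For example, a line written as "  @ Article  {Knuth:1984," would be normalized to "@Article{Knuth:1984,", while a comment line passes through untouched.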
The sort key of ``@Preamble'' entries is their initial line; of ``@String'' entries, the abbreviation name. For all other BibTeX entries, the sort key is the citation label between the open curly brace and the trailing comma, unless the sort key is prefixed with additional fields as requested by -byvolume or -byyear switches.
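The default key selection just described might look like this in Python. The function sort_key is a hypothetical stand-in for the key-insertion stage, it assumes entry headers have already been normalized to the form ``@Keyword{'', and it lowercases keys to approximate the effect of the sort(1) -f switch; the prefixes added by the -by... switches are not modelled.

```python
import re

# Assumes a normalized header: "@", a keyword, and "{" with no blanks between.
HEADER = re.compile(r'^@([A-Za-z0-9]+)\{\s*(.*)$')


def sort_key(entry):
    """Return a case-folded sort key for one complete BibTeX entry."""
    first_line = entry.split("\n", 1)[0]
    m = HEADER.match(first_line)
    if m is None:
        return first_line.lower()
    kind, rest = m.group(1).lower(), m.group(2)
    if kind == "preamble":
        # @Preamble entries sort by their initial line.
        return first_line.lower()
    if kind == "string":
        # @String entries sort by the abbreviation name before the "=".
        return rest.split("=", 1)[0].strip().lower()
    # Everything else sorts by the citation label before the first comma.
    return rest.split(",", 1)[0].strip().lower()
```

Passing sort_key as the key argument of Python's sorted() then orders a list of entries by label without regard to letter case.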
bibsort will correctly handle UNIX files with LF line terminators, as well as IBM PC DOS files with CR LF line terminators; the essential requirement is that input lines be delineated by LF characters. Thus, files from the Apple Macintosh, which uses bare CR to terminate lines, would first have to be converted to UNIX or PC DOS line format before giving them to bibsort.
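A file with bare CR line terminators can be converted beforehand with any suitable tool. As one possibility, this small Python sketch rewrites each bare CR as LF and leaves CR LF pairs alone, since those are already acceptable; the function name cr_to_lf is an assumption of this example.

```python
import re


def cr_to_lf(data: bytes) -> bytes:
    """Replace each CR that is not followed by LF with a single LF."""
    return re.sub(rb"\r(?!\n)", b"\n", data)
```

The function works on bytes rather than text so that no character-set decoding is needed before the conversion.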
The user must be aware that sorting a bibliography is not without peril, for at least these reasons:
1. BibTeX has a requirement that entry labels given in crossref = label pairs in a bibliography entry must refer to entries defined later, rather than earlier, in the bibliography file. This regrettable implementation limitation of the current (pre-1.0) BibTeX prevents arbitrary ordering of entries when crossref values are present. To partially solve this problem, bibsort will place ``@Proceedings'' entries last, since they are frequently cross-referenced by ``@InProceedings'' entries. However, it is also possible for ``@Book'', ``@InBook'', and ``@InCollection'' entries to cross-reference ``@Book'' entries, and for ``@Article'' entries to cross-reference other ``@Article'' entries. Neither of these cases is dealt with by bibsort, except that ``@Book'' entries that contain a ``booktitle'' assignment, and entries that are explicitly cross-referenced before their definition, are sorted with ``@Proceedings''.
2. If the BibTeX file contains interspersed commentary between ``@keyword{...}'' entries, this material will be considered part of the preceding entry, and will be sorted with it. Leading commentary is more common, and will be moved elsewhere in the file. This is normally not a problem for the part 1 material before the ``@Preamble'', since it is kept together at the beginning of the output stream.
3. Some kinds of bibliography files should be kept in a different order than alphabetically by citation labels. Good examples are a bibliography file with the contents of a journal, or a personal publication list, for both of which chronological publication order is likely to be preferred.
While a much more sophisticated implementation of bibsort could deal with the first point, and the -byvolume switch provides a partial solution to the third point, in general, a satisfactory solution requires human intelligence and natural language understanding that computers lack.
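For the first point, a sorted file can at least be checked afterwards for crossref values that name an entry defined earlier in the file. The following Python sketch does this with two simple regular expressions; it is not part of bibsort, the name misplaced_crossrefs is hypothetical, labels are compared without regard to letter case as an assumption, and the patterns do not attempt to parse every legal BibTeX construct.

```python
import re

# Simplified patterns, assumed for this sketch only.
ENTRY = re.compile(r'^@[A-Za-z0-9]+\{\s*([^,\s]+)\s*,', re.MULTILINE)
CROSSREF = re.compile(r'crossref\s*=\s*["{]\s*([^"}\s]+)\s*["}]',
                      re.IGNORECASE)


def misplaced_crossrefs(text):
    """Return (citing_label, target_label) pairs whose target comes too early."""
    starts = [(m.start(), m.group(1)) for m in ENTRY.finditer(text)]
    seen = set()        # labels of entries already passed, lowercased
    problems = []
    for i, (pos, label) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(text)
        for m in CROSSREF.finditer(text[pos:end]):
            target = m.group(1)
            if target.lower() in seen:
                # The target was defined before the entry that refers to it.
                problems.append((label, target))
        seen.add(label.lower())
    return problems
```

An empty result means that every cross-referenced entry found by these patterns follows the entries that refer to it.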
bibsort uses octal ASCII control characters 001 through 007, 0177, and 0377, for temporary modifications of the input stream. If any of these are already present in the input, they will be altered on output. This is unlikely to be a problem, because those characters have no printable representation and are not conventionally used to mark line or page boundaries in text files.
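A cautious user could test a file for these characters before sorting it. The Python sketch below reports which of the reserved byte values occur in the input; the names RESERVED and reserved_bytes_present are assumptions of this example, not part of bibsort.

```python
# Octal 001 through 007, plus 0177 and 0377, as listed above.
RESERVED = bytes(range(0o001, 0o010)) + bytes([0o177, 0o377])


def reserved_bytes_present(data: bytes):
    """Return the sorted list of reserved byte values that occur in data."""
    return sorted(set(data) & set(RESERVED))
```

An empty list indicates that none of the reserved characters is present, so the temporary modifications cannot collide with the input.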
Some implementations of BibTeX editing support in GNU emacs(1) have a sort-bibtex-entries command that is functionally similar to bibsort. However, the file size that can be processed by emacs(1) is limited, while bibsort can be used on arbitrarily large files, since it acts as a filter, processing a small amount of data at a time. The sort stage needs the entire data stream, but fortunately, the UNIX sort(1) command is clever enough to deal with very large inputs.
The current implementation of bibsort follows the UNIX tradition of combining simple already-available tools. A six-stage pipeline of egrep(1), nawk(1), sort(1), and tr(1) accomplishes the job in one pass with about 500 lines of heavily-commented shell script, of which about 225 lines are a nawk(1) program for insertion of sort keys. The initial prototype of bibsort was written and tested on several large bibliographies in a couple of hours, and after considerable use, was later extended with advanced sorting capabilities and cross-reference recognition in a couple of days of work. By contrast, bibtex(1) is more than 11 000 lines of code and documentation, and bibclean(1) is more than 15 000 lines long; both took months to develop, implement, and test.
Nelson H. F. Beebe, Ph.D.
Center for Scientific Computing
Department of Mathematics
University of Utah
Salt Lake City, UT 84112
Tel: +1 801 581 5254
FAX: +1 801 581 4148
Email: <[email protected]>
WWW URL: http://www.math.utah.edu/~beebe