If you want to select entries by more complex criteria, such as author names, keywords, subject classifications, title words, etc., then bibsplit is not the tool you need: use bibextract(1) instead.
As long as BibTeX is asked to retrieve only a limited number of citations from database files, it does not matter how many citations there are, or how big the database files are. BibTeX simply processes each file in sequential order, and since the files are read only once, and internal processing of string lookups uses a fast constant-time algorithm, access time is strictly proportional to the amount of data read and written.
However, it is frequently desirable to be able to typeset a complete bibliography file, so that one can verify that each entry can be correctly processed by BibTeX, and correctly typeset by TeX.
This is readily done with a simple TeX file that looks like this:

    \input btxmac
    \bibliographystyle{plain}
    \nocite{*}
    \bibliography{mybib}
    \bye

or a corresponding LaTeX2e file that looks like this:

    \documentclass{article}
    \bibliographystyle{plain}
    \begin{document}
    \nocite{*}
    \bibliography{mybib}
    \end{document}
Splitting large bibliographic files is sometimes necessary, because
- internal table sizes in most TeX and BibTeX implementations limit the number of entries actually extracted from one or more BibTeX database files to about 4000;
- large database files are undesirable for World-Wide Web and FTP file transfers across slow network connections;
- large database files are slower to edit;
- large database files are more prone to massive editing disasters.
bibsplit provides the needed solution to this problem, and does so with two, or at most three, passes over the input data.
In the first pass, bibsplit writes temporary files containing all non-@String entries, partitioned according to the command-line options chosen. It saves all @String definitions in memory, and it builds up a list in memory of which definitions are needed by each of the temporary files.
In the second pass, bibsplit writes the required @String definitions into the final files, sorted in ascending lexicographic order, followed by the contents of their corresponding temporary files, and then deletes the temporary files. No further parsing is needed for the second pass, so it is relatively fast.
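The two passes described above can be modelled roughly as follows. This is only an illustrative Python sketch, not bibsplit's actual code: the function name split_with_strings, the key_of callback, and the crude rule for spotting abbreviations (any unquoted field value that starts with a letter) are all assumptions made for the example, and in-memory lists stand in for the temporary files.

```python
import re

# Assumption: an abbreviation use is an unquoted field value that starts
# with a letter, e.g. "journal = j-CACM". Real BibTeX parsing is richer.
ABBREV = re.compile(r"=\s*([A-Za-z][\w:.+-]*)")

def split_with_strings(entries, strings, key_of):
    """entries: list of (label, text) in input order.
    strings: dict mapping abbreviation name -> full @String{...} text.
    key_of: hypothetical function mapping a label to an output-file key."""
    bodies = {}   # stands in for the temporary files of pass 1
    needed = {}   # output-file key -> set of abbreviation names used there

    # Pass 1: partition the entries and note which definitions each part needs.
    for label, text in entries:
        key = key_of(label)
        bodies.setdefault(key, []).append(text)
        used = {name for name in ABBREV.findall(text) if name in strings}
        needed.setdefault(key, set()).update(used)

    # Pass 2: sorted @String definitions first, then the saved entries.
    final = {}
    for key, texts in bodies.items():
        definitions = [strings[name] for name in sorted(needed[key])]
        final[key] = definitions + texts
    return final
```

Because pass 2 only concatenates text that pass 1 has already classified, it needs no further parsing, which matches the remark above that the second pass is relatively fast.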
For user feedback, bibsplit writes a brief progress report to stdout at important stages of its work. If you do not want to see this, then simply redirect stdout to the null device; on UNIX:

    bibsplit ... > /dev/null

or else use the -silent option.
At the time of writing, bibsplit processes BibTeX data at about 1MB/sec on a fast modern UNIX workstation, so practical applications should never take more than a few seconds.
To avoid confusion with options, if a filename begins with a hyphen, it must be disguised by a leading absolute or relative directory path, e.g., /tmp/-foo.bib or ./-foo.bib.
GNU- and POSIX-style options of the form --name are also recognized: they begin with two option prefix characters.
In the event of conflicting -byxxx options, the last one specified takes precedence.
This is a synonym for -help.
This option is equivalent to, and shorthand for,
-byrange a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z.
As a special case, a zero number is interpreted as infinity; see the end of this section for a practical application.
This option creates output files suffixed by a four-digit entry count reflecting the input order of the first entry in that file, and entries are written to those output files in strict input order.
Lettercase is ignored in the range list.
Use this option when you want coarser grouping, and larger output files, than provided by -bylabel.
If command contains spaces or other characters that are significant to the shell, then of course it needs to be surrounded by protecting quotes, or the special characters need to be prefixed by a backslash. bibsplit will surround command with apostrophes (single quotes), so they cannot be used in command. Should you require apostrophes, then you must embed your commands inside a short executable script file, and use that for command.
This option is most useful for applying bibsort(1) to the output files, because even if the input bibliography was already sorted, resolution of citations and cross-references will have destroyed that order.
This is a synonym for -?.
This option can also be used for discarding messages, with, e.g., on UNIX systems, -logfile /dev/null.
To avoid the need for multiple applications of bibsplit, this option limits the number of simultaneously open files to nnnn. This does not increase the number of passes made over the input stream, but may cause additional file closing and opening.
On most modern UNIX systems, and in real applications, this option should rarely be needed.
Benchmarks show no noticeable effect on runtime when small values of nnnn are chosen. However, bibsplit's implementation language offers no way to test for an open-file limit-exceeded condition, and that limit varies between operating systems and installations, and on some even depends on other current user processes and resource quotas. Consequently, no sensible default value for nnnn can be chosen that is guaranteed to work everywhere.
bibsplit uses stdout only for a brief progress report, so there is never much data written to it.
The suffixes attached to the output filenames contain no leading separator character, so, for example, the command bibsplit -byscore gnats.bib for a bibliography containing entries from 1941 to 2001 would produce output files gnats1940.bib, gnats1960.bib, gnats1980.bib, and gnats2000.bib.
If you prefer a separator, do it like this: bibsplit -byscore -prefix gnats- gnats.bib to get output files named gnats-1940.bib, etc.
The -prefix xxx option may include a directory path, so bibsplit -byscore -prefix /usr/tmp/gnats- gnats.bib would write the split files in the directory /usr/tmp.
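The naming in these examples can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not bibsplit's implementation: it assumes a score is 20 years and that each output file is named after the first year of its 20-year block, which is inferred only from the gnats1940.bib ... gnats2000.bib example above. The function name score_filename is hypothetical.

```python
def score_filename(basename, year, prefix=None, width=20):
    """Return the output filename for an entry published in `year`.

    Assumption: blocks start at multiples of `width` years, as the
    1941..2001 example suggests. `prefix`, if given, replaces the
    input basename, mirroring the -prefix option."""
    start = (year // width) * width
    stem = prefix if prefix is not None else basename
    return f"{stem}{start}.bib"

# The three command lines shown above, for an entry dated 1941:
print(score_filename("gnats", 1941))                             # gnats1940.bib
print(score_filename("gnats", 1941, prefix="gnats-"))            # gnats-1940.bib
print(score_filename("gnats", 1941, prefix="/usr/tmp/gnats-"))   # /usr/tmp/gnats-1940.bib
```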
In the interests of maximal filename portability, bibsplit assumes that slash, backslash, and colon are directory component separators, and legal characters in filenames are letters, digits, hyphen, underscore, and dot; all others will be removed.
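As an illustration of this rule, the following Python sketch keeps the three separator characters intact and strips every character outside the legal set from each path component. It is an assumption that the rule is applied component by component to the whole path; the function name sanitize_path is hypothetical and does not come from bibsplit.

```python
import re

SEPARATORS = ("/", "\\", ":")

def sanitize_path(path):
    """Remove characters that are not letters, digits, hyphen,
    underscore, or dot from each component of `path`, leaving
    slash, backslash, and colon separators untouched."""
    # The capturing group makes re.split keep the separators in the result.
    parts = re.split(r"([/\\:])", path)
    cleaned = []
    for part in parts:
        if part in SEPARATORS:
            cleaned.append(part)
        else:
            cleaned.append(re.sub(r"[^A-Za-z0-9._-]", "", part))
    return "".join(cleaned)
```

For example, sanitize_path("/usr/tmp/gn@ts (v2)-.bib") yields "/usr/tmp/gntsv2-.bib".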
Normally, the contents of those files, if they exist, are implicitly inserted at the beginning of the command line, with comments removed and newlines replaced by spaces. Thus, those files can contain any bibsplit options defined in this documentation, written either one option (or option/value pair) per line, or several per line. Empty lines, and lines that begin with optional whitespace followed by a sharp (#), are comment lines that are discarded.
If the initialization file contains backslashes, they must be doubled because the text is interpreted by the shell before bibsplit sees it.
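The flattening of an initialization file into command-line text can be sketched as follows. This is a minimal Python model of the rules just stated (discard empty and comment lines, join the rest with spaces); the function name flatten_init_file is hypothetical, and the later shell interpretation of the result, including the backslash handling, is not modelled.

```python
def flatten_init_file(text):
    """Turn the contents of an initialization file into one line of
    options, dropping empty lines and lines whose first non-blank
    character is a sharp (#)."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # comment or empty line: discarded
        kept.append(stripped)
    # Newlines are replaced by spaces.
    return " ".join(kept)

example = "# my defaults\n-byscore\n\n   # output naming\n-prefix gnats-\n"
print(flatten_init_file(example))   # -byscore -prefix gnats-
```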
This option may also be spelled -tempdir.
In the event that bibliography entries are encountered that cannot be assigned to a suitable output file according to the particular -byxxx option chosen (or assumed by default), they will be written to a file whose basename is suffixed by the uppercase string UNKNOWN.
Similarly, @String definitions that are not used in any input bibliography entry will not be written to the normal split files, so they are collected, sorted, and written to a separate file whose basename is suffixed by the uppercase string UNUSED. The basename of that file is determined by that of the last input BibTeX file.
You could use this feature to find and remove unused @String definitions, like this:
    bibsplit -bynumber 0 mybib.bib
    mv mybib.bib mybib.bib-old
    mv mybib-000001.bib mybib.bib
If a duplicate @String definition is encountered, then a warning is issued if the definitions differ, except possibly at whitespace. Multiple differing definitions are collected and later output together in the same order they were read, so as not to lose information.
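A whitespace-insensitive duplicate check of this kind might look like the following Python sketch. It assumes that "except possibly at whitespace" means that runs of whitespace are treated as a single space; that interpretation, and the function name record_string, are assumptions for illustration rather than a description of bibsplit's code.

```python
def record_string(table, name, value):
    """Remember one @String definition; return True if a warning is due.

    `table` maps an abbreviation name to the list of differing values
    seen so far, in the order they were read."""
    normalized = " ".join(value.split())   # collapse whitespace runs
    seen = table.setdefault(name, [])
    if any(" ".join(old.split()) == normalized for old in seen):
        return False          # same definition again: nothing to report
    seen.append(value)        # keep every differing definition
    return len(seen) > 1      # warn only when it conflicts with an earlier one

table = {}
record_string(table, "j-CACM", '"Communications of the ACM"')       # first sighting
record_string(table, "j-CACM", '"Communications  of\n the ACM"')    # whitespace only
record_string(table, "j-CACM", '"Commun. ACM"')                     # differs: warning
```

After these three calls, table["j-CACM"] holds the first and third values in input order, so no information is lost.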
bibclean(1) and bibsplit assume that an input file takes the form

    % FILE HEADER COMMENTS
    @Preamble{...}
    % preamble comments
    @Preamble{...}
    % preamble comments
    @Preamble{...}
    ...
    % STRING BLOCK COMMENTS
    @String{...}
    % string comments
    @String{...}
    % string comments
    @String{...}
    ...
    % ENTRY BLOCK COMMENTS
    @Book{...}
    % entry comment
    @Article{...}
    % entry comment
    @TechReport{...}
    ...
    % FILE TRAILER COMMENTS

Blank or empty lines may appear anywhere, and are thus not shown in this sketch.
This organization has been found to be the most useful in many hundreds of BibTeX files containing hundreds of thousands of document entries, and is very similar to that commonly found in well-written computer software for several decades. In particular, comments always precede the code or data that they refer to; they never follow.
Any of the comment regions may be empty, and after the @String definitions, BibTeX entries for any supported document type may appear, and in any order.
The comment blocks in UPPERCASE take precedence over other comment blocks, and will be transferred verbatim to every output file containing BibTeX entries, preserving the order shown above.
All other comments are assumed to refer to the nearest following @Name{...} group, and will be attached to those groups, and output when they are output, preserving that order.
All input lines that are blank or empty are discarded. However, for readability and editing convenience, bibsplit takes care to incorporate blank lines around all bibliographic entries, just as bibclean(1) does.
Any other text which does not conform to the BibTeX grammar is converted to a comment, and thus preserved in at least one of the output files.
However, in some types of bibliographies, entries use the BibTeX crossref = "label" facility to include additional data from another entry; the commonest such case is an @InProceedings entry that cross references a following @Proceedings entry.
Sometimes, an entry may contain a note with a citation of another entry, such as an article series where part I cites part II, or an article citing a subsequent erratum. For closely-related articles, it is useful to include such citations in the BibTeX files, so that an author who remembers to cite only one of a series of related publications will automatically get bibliography entries for all of them.
In both these cases, the cross-referenced or cited entry follows the one that references it, so bibsplit, after reading the original entry, is able to examine it, and prepare a list of entries that it refers to. When bibsplit later encounters those entries, it outputs them not only to their normal split file, but also to all of the other files that contain earlier entries that refer to them.
Backward references are, however, more challenging. For example, an article erratum might contain a citation of the original paper, which appears earlier in a bibliography ordered by publication time. In this case, bibsplit will have already output the original entry without knowing that it will later be cited, and because it makes no attempt to hold all entries in memory (a strategy that would routinely fail on small systems), the cross reference has arrived too late for it to act. BibTeX itself would require a following LaTeX or TeX run to deposit the citation into the auxiliary file, in which a second BibTeX run would find it, and finally correctly incorporate the cross reference in the typeset bibliography.
To deal with this important case of backward references, while still being frugal with memory, bibsplit takes a different approach. As each entry is output to a split file, bibsplit augments a list with entries (citation-label, BibTeX-filename), so that it knows in which file each entry has been written. It also records the citation labels of any embedded references in a to-be-found list, with entries of the form (citation-label, BibTeX-filename-1, BibTeX-filename-2, ..., BibTeX-filename-n). It then looks in the to-be-found list to see if this entry is needed by earlier entries already written to other files as well, and if so, outputs it to those.
Once the entire input stream has been processed, and any unused-labels file has been generated, bibsplit re-examines the to-be-found list, sorts it by filename, and steps through those files in order, reading entries and writing each one found to every BibTeX file in which it is referenced but does not yet appear. Any citation label in the to-be-found list that is not in the original (citation-label, BibTeX-filename) list is diagnosed as an unsatisfied reference, since its absence is definitely an error in the bibliography.
In the worst case, this algorithm will result in bibsplit's reading the input data a total of three times, but in most cases, only a few of the split files, and sometimes, none, need to be read again.
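The bookkeeping for forward and backward references can be modelled in a few lines of Python. This is a simplified sketch, not bibsplit's implementation: it tracks only citation labels (not entry text), keeps the split files as in-memory lists instead of re-reading them from disk, and does not follow references made by entries that are themselves copied into another file. The names distribute and key_of are hypothetical.

```python
def distribute(entries, key_of):
    """entries: list of (label, referenced_labels) in input order.
    key_of: hypothetical function mapping a label to its split file.
    Returns (files, unsatisfied)."""
    files = {}        # filename -> labels written there, in order
    written_in = {}   # the (citation-label, BibTeX-filename) list
    wanted_by = {}    # the to-be-found list: label -> files needing it

    def emit(filename, label):
        bucket = files.setdefault(filename, [])
        if label not in bucket:
            bucket.append(label)

    for label, refs in entries:
        home = key_of(label)
        emit(home, label)
        written_in[label] = home
        for ref in refs:
            wanted_by.setdefault(ref, set()).add(home)
        # Forward references: earlier entries already asked for this one.
        for filename in sorted(wanted_by.get(label, ())):
            emit(filename, label)

    # Final sweep: satisfy backward references, diagnose missing labels.
    unsatisfied = []
    for ref in sorted(wanted_by):
        if ref not in written_in:
            unsatisfied.append(ref)
            continue
        for filename in sorted(wanted_by[ref]):
            emit(filename, ref)
    return files, unsatisfied
```

In this model the final sweep is cheap because everything is in memory. In bibsplit itself that step is what may require reading some split files again, which is the source of the possible third pass.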
Because all @String definitions are saved in memory, and all citation labels as well, very large jobs may exceed the memory capacity of very small systems. About two megabytes of memory should suffice for the vast majority of practical applications.
/usr/local/share/lib/bibsplit/bibsplit-1.00
Nelson H. F. Beebe
Center for Scientific Computing
University of Utah
Department of Mathematics, 322 INSCC
155 S 1400 E RM 233
Salt Lake City, UT 84112-0090
USA
Email: [email protected], [email protected], [email protected] (Internet)
WWW URL: http://www.math.utah.edu/~beebe
Telephone: +1 801 581 5254
FAX: +1 801 585 1640, +1 801 581 4148
    ftp://ftp.math.utah.edu/pub/tex/bib/
    http://www.math.utah.edu/pub/tex/bib/
in the file bibsplit-x.yy.tar.gz where x.yy is the current version. Other distribution formats are usually available at the same location.
That site is mirrored to several other Internet archives, so you may also be able to find it elsewhere on the Internet; try searching for the string bibsplit at one or more of the popular Web search sites, such as
    http://altavista.digital.com/
    http://search.microsoft.com/us/default.asp
    http://www.dejanews.com/
    http://www.dogpile.com/index.html
    http://www.euroseek.net/page?ifl=uk
    http://www.excite.com/
    http://www.go2net.com/search.html
    http://www.google.com/
    http://www.hotbot.com/
    http://www.infoseek.com/
    http://www.inktomi.com/
    http://www.lycos.com/
    http://www.northernlight.com/
    http://www.snap.com/
    http://www.stpt.com/
    http://www.yahoo.com/
########################################################################
########################################################################
########################################################################
###                                                                  ###
### bibsplit: split BibTeX bibliography files into independent parts ###
###                                                                  ###
### Copyright (C) 1999 Nelson H. F. Beebe                            ###
###                                                                  ###
### This program is covered by the GNU General Public License (GPL), ###
### version 2 or later, available as the file COPYING in the program ###
### source distribution, and on the Internet at                      ###
###                                                                  ###
###         ftp://ftp.gnu.org/gnu/GPL                                ###
###                                                                  ###
###         http://www.gnu.org/copyleft/gpl.html                     ###
###                                                                  ###
### This program is free software; you can redistribute it and/or    ###
### modify it under the terms of the GNU General Public License as   ###
### published by the Free Software Foundation; either version 2 of   ###
### the License, or (at your option) any later version.              ###
###                                                                  ###
### This program is distributed in the hope that it will be useful,  ###
### but WITHOUT ANY WARRANTY; without even the implied warranty of   ###
### MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the     ###
### GNU General Public License for more details.                     ###
###                                                                  ###
### You should have received a copy of the GNU General Public        ###
### License along with this program; if not, write to the Free       ###
### Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,   ###
### MA 02111-1307 USA                                                ###
########################################################################
########################################################################
########################################################################