Notes
Some notes and remarks on several aspects of bp_text.
Working with BibTeX files
For managing BibTeX databases (.bib files), using BibDesk <http://bibdesk.sourceforge.io> (on a Mac) is an easy way to organize the text library. However, there are a few things to consider, especially when it comes to non-standard BibTeX fields.
With BibDesk, as with other citation or library managers, the crucial point for bp_text is how files (or rather the paths to them) are stored. By default, BibDesk stores files attached via its regular attachment mechanism in a rather complex way: it writes serialized symlinks into bdsk-file-n fields of the BibTeX library. bp_text does not handle these fields. Instead, it expects files to be stored as absolute or relative paths (e.g. to PDFs or TXTs) in a file field. Although the field name is singular, it can hold several files if the filenames are separated by semicolons. Whether this makes sense depends on how the relation between BibTeX entries and their files is conceived.
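To illustrate the semicolon convention, here is a minimal sketch of how the value of a file field could be split into individual paths. The function name `split_file_field` and the example paths are hypothetical and not part of bp_text; the library's own parsing may differ.

```python
from __future__ import annotations


def split_file_field(value: str) -> list[str]:
    """Split a semicolon-separated ``file`` field into individual paths.

    Hypothetical helper for illustration only; not part of bp_text.
    """
    # Strip surrounding whitespace and skip empty segments,
    # such as the one produced by a trailing semicolon.
    return [part.strip() for part in value.split(";") if part.strip()]


# Example value as it might appear in a BibTeX entry (made-up paths).
paths = split_file_field("texts/example.pdf; texts/example.txt")
print(paths)
```

Note that this simple split assumes that no filename itself contains a semicolon.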
Keywords can also be included. They belong in a keywords field, which BibDesk should provide by default and which should also be the standard field when exporting a library from Zotero, e.g. via Better BibTeX.
Citation keys should, of course, be unique.
The primary language of a document is expected to be stored in the langid field. This is a BibLaTeX field and expects languages to be identified by an idiosyncratic id which is not the same as the standard ISO 639-1 language codes. The langid for German, for example, is ngerman (i.e. new German spelling). See Table 2: Supported Languages in the BibLaTeX documentation (https://ctan.org/pkg/biblatex). Specifying a primary language is optional but highly recommended, since e.g. the PdfFile class uses the language when performing OCR.
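The difference between langid values and ISO 639-1 codes can be made concrete with a small lookup table. This is only a sketch: the table `LANGID_TO_ISO` and the function `langid_to_iso` are hypothetical, cover just a handful of languages, and are not how bp_text itself maps languages.

```python
from typing import Optional

# Hypothetical, deliberately incomplete lookup table (not part of bp_text).
# Keys are BibLaTeX langid values, values are ISO 639-1 codes.
LANGID_TO_ISO = {
    "ngerman": "de",
    "german": "de",
    "english": "en",
    "american": "en",
    "british": "en",
    "french": "fr",
}


def langid_to_iso(langid: str) -> Optional[str]:
    """Return the ISO 639-1 code for a langid, or None if it is not in the table."""
    return LANGID_TO_ISO.get(langid.strip().lower())


print(langid_to_iso("ngerman"))
```

Several langid values can map to the same ISO code, because langid also encodes spelling and regional variants.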
Some notes on BibDesk
Some BibTeX fields used by bp_text are not available in BibDesk by default. It is therefore recommended to add them to the default fields via BibDesk Preferences -> Fields -> Custom BibTeX Fields. The following fields should be added; make sure to tick the “Is Default” checkbox for each.
| Field | Type | Is Default |
|---|---|---|
| Keywords | Textual | Y |
| File | Textual | Y |
| Langid | Textual | Y |
Zotero and BibDesk
It is quite simple to copy Zotero (https://www.zotero.org) bibliography data to BibDesk. To export Zotero data to Bib(La)TeX, first install the Better BibTeX (https://retorque.re/zotero-better-bibtex/) extension in Zotero. Then select, for example, a few entries in your Zotero database and choose “copy BibLaTeX to clipboard” in the “Better BibTeX” context menu (via right-click). Finally, paste the clipboard contents into your BibDesk library.
Better BibTeX automatically includes the paths to the PDFs attached to a Zotero item in the file field. Multiple files are separated by a semicolon. As this is also the standard way bp_text handles files, nothing more needs to be done. If you want to share your BibDesk database you might want to copy the attachments from the Zotero storage location to a directory relative to the path of the BibDesk (.bib) file. Then you need to adjust the file field in the respective entries accordingly. A relative path is completely sufficient here as bp_text searches for files relative to the database file when no absolute path is given (e.g. when calling BibtexDatabase.make_pool()).
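The lookup rule described above (absolute paths are used as-is, relative paths are interpreted relative to the .bib file) can be sketched with pathlib. The function `resolve_attachment` and the example paths are hypothetical; this illustrates the rule and is not bp_text's actual implementation.

```python
from pathlib import Path


def resolve_attachment(bib_file: str, entry_path: str) -> Path:
    """Resolve a path from a ``file`` field against the location of the .bib file.

    Hypothetical helper for illustration only; not part of bp_text.
    """
    path = Path(entry_path)
    if path.is_absolute():
        # Absolute paths are kept unchanged.
        return path
    # Relative paths are interpreted relative to the directory
    # that contains the database file.
    return Path(bib_file).parent / path


# Made-up locations: the attachment sits in a "pdfs" folder next to the database.
print(resolve_attachment("/data/library/library.bib", "pdfs/example.pdf"))
```

Keeping the attachments in a directory next to the .bib file and using relative paths means the whole folder can be moved or shared without editing the entries.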
Languages
Tokenization and tagging in this library are based on the spaCy
<https://spacy.io> library. bp_text tries to automatically determine the
language of a text and apply the proper model to it. For more information, see
the documentation for Text.