bp_text.database
This module implements database functionality. Its main purpose is to read a BibTeX file and expose its contents as a database.
Created: 2025-03-23
Author: Ruben Philipp <me@rubenphilipp.com>
$$ Last modified: 02:29:46 Sat Apr 26 2025 CEST
Functions
Convert LaTeX umlauts to ASCII umlauts.
Classes
BibTexDatabase – Implementation of a BibTeX database.
Database – An abstract superclass for a database.
- class bp_text.database.BibTexDatabase(file_path: str, split_keywords=True, split_files=True)[source]
Bases: Database
Implementation of a BibTeX database. This class provides the user with various methods to interact with the stored data.
NB: This class is limited to read-only use. To modify the contents of the actual database, it is recommended to use specialized software (e.g. BibDesk).
The data attribute contains the parsed data.
Example:
db = database.BibTexDatabase("db.bib")
db.entries["adorno1960"].get("file").value
- Parameters:
file_path (string) – The path to the .bib file.
split_keywords (boolean) – When true, split all keywords in the keywords field into a list, assuming they are separated by a comma (“,”). Default = True
split_files (boolean) – When true, split all files in the file field into a list, assuming they are separated by a semicolon (“;”). Default = True
- property data
The parsed data.
- property entries
The database entries as a dict.
Example:
# get the value of the file field of the entry with the citation
# key "heinlein2020"
db.entries["heinlein2020"].fields_dict["file"].value
# => ['heinlein2020 - katastrophen.pdf']
# ...which can also be expressed in a shorter form
db.entries["heinlein2020"].get("file").value
- find_entries(field: str, search: str)[source]
Find entries matching the search string in the given field.
- Parameters:
field (string) – The field name (e.g. “keywords”).
search (string) – The search string.
- Returns:
A list of bibtexparser.model.Entry objects.
- Return type:
list
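The lookup described above can be pictured with a minimal sketch. It does not use bp_text or bibtexparser: entries are plain dicts standing in for bibtexparser.model.Entry objects, the function name find_entries_sketch and the sample data are hypothetical, and the matching rule (a case-sensitive substring test, applied to each item when the field has been split into a list) is an assumption, since the reference does not state how the search string is compared.

```python
from typing import Dict, List, Union

FieldValue = Union[str, List[str]]
Entry = Dict[str, FieldValue]  # hypothetical stand-in for bibtexparser.model.Entry


def find_entries_sketch(entries: Dict[str, Entry], field: str, search: str) -> List[Entry]:
    """Return all entries whose `field` contains `search` (assumed substring match)."""
    matches = []
    for entry in entries.values():
        value = entry.get(field)
        if value is None:
            continue  # this entry has no such field
        # A split field (e.g. keywords) is a list; treat a plain string as one item.
        items = value if isinstance(value, list) else [value]
        if any(search in item for item in items):
            matches.append(entry)
    return matches


# Made-up sample data, keyed by citation key.
entries = {
    "adorno1960": {"title": "Example A", "keywords": ["music", "philosophy"]},
    "heinlein2020": {"title": "Example B", "keywords": ["sociology"]},
}
found = find_entries_sketch(entries, "keywords", "music")
```

With this sample data, only the "adorno1960" entry is returned; searching a field that no entry has yields an empty list.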
- get_entry_by_key(key)[source]
Get a specific entry from the database by its citation key.
This is an alias to self.entries.get(key).
- Parameters:
key (string) – The citation key (e.g. “@adorno1960”) to look for.
- get_nth_entry(n)[source]
Get the nth entry in the database.
- Parameters:
n (integer) – Index (zero-based) of the entry in the database.
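The two accessors above can be sketched on a plain dict. Since get_entry_by_key is documented as an alias to self.entries.get(key), a missing key would return None rather than raise, provided entries is an ordinary dict. The positional lookup assumes that entries keep the order in which they were inserted; that ordering, the function names, and the sample data are assumptions made for illustration.

```python
# Made-up entries, keyed by citation key (plain dicts stand in for Entry objects).
entries = {
    "adorno1960": {"title": "First entry"},
    "heinlein2020": {"title": "Second entry"},
}


def get_entry_by_key_sketch(entries, key):
    """Mirror dict.get: return the entry, or None when the key is unknown."""
    return entries.get(key)


def get_nth_entry_sketch(entries, n):
    """Return the entry at zero-based position n (assumes insertion order)."""
    return list(entries.values())[n]


by_key = get_entry_by_key_sketch(entries, "adorno1960")
missing = get_entry_by_key_sketch(entries, "unknown-key")
second = get_nth_entry_sketch(entries, 1)
```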
- load(file_path: str, split_keywords=True, split_files=True)[source]
Load and parse a BibTeX file.
- Parameters:
file_path (string) – The path to the BibTeX file.
split_keywords (boolean) – When true, split all keywords in the keywords field into a list, assuming they are separated by a comma (“,”). Default = True
split_files (boolean) – When true, split all files in the file field into a list, assuming they are separated by a semicolon (“;”). Default = True
- make_pool(cache=False, default_get_data_func=None, pdf_auto_extract=True, pdf_use_ocr=False, pdf_fallback_to_ocr=True, pdf_ocr_dpi=300, pdf_ocr_default_lang='eng', verbose=True)[source]
Create a Pool with PoolItem objects derived from the BibTexDatabase entries and the documents linked in the file fields.
While creating the Pool, this method also instantiates text-holding objects for the data linked in file. These are PdfFile (for PDFs) or TxtFile (for TXTs) objects. When a cache directory is given, this method searches that directory for pickled objects and tries to load them in order to avoid recomputing expensive NLP analyses (cf. Text). The search pattern for cache files is [citekey]-[file_checksum]-[bp_text.__version__].pickle. Besides searching for existing cached files, new cache files will automatically be created for items not found in the cache directory.
The paths in file are either relative or absolute. When relative, they are converted to absolute paths relative to the location of the database file (cf. self._base_path).
- Parameters:
cache (False or string) – When False, caching is disabled. If a directory path is given as a string, use that directory for caching (see above).
default_get_data_func (function taking the PoolItem as its argument and returning an index to the element of data which should be retrieved; set to None to use the default) – This sets the default function to get data from a PoolItem object (cf. respective doc in this class).
pdf_auto_extract (boolean) – Automatically extract the text from all pages in the file when instantiating the object? This also automatically creates PdfPage objects for each page. Default = True
pdf_use_ocr (boolean) – Use OCR by default for text extraction? Default = False
pdf_fallback_to_ocr (boolean) – If text extraction without OCR yields little text, fall back to OCR? Default = True
pdf_ocr_dpi (integer) – The resolution in DPI used for OCR. Default = 300
pdf_ocr_default_lang (string) – The default language for OCR. Default = “eng”
verbose (boolean) – Print additional information during execution when True. Default = True
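The caching and path handling described for make_pool can be illustrated with a small standalone sketch. It is not the bp_text implementation: the helper names (resolve_path, cache_file_name, load_or_compute), the VERSION constant, and the choice of SHA-256 for the file checksum are all assumptions, as the reference only gives the file name pattern [citekey]-[file_checksum]-[bp_text.__version__].pickle and states that relative paths are resolved against the location of the database file.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

VERSION = "0.0.0"  # hypothetical stand-in for bp_text.__version__


def resolve_path(base_path: Path, file_field: str) -> Path:
    """Keep absolute paths; join relative ones onto the database directory."""
    path = Path(file_field)
    return path if path.is_absolute() else base_path / path


def cache_file_name(citekey: str, file_path: Path, version: str = VERSION) -> str:
    """Build [citekey]-[file_checksum]-[version].pickle (SHA-256 is an assumption)."""
    checksum = hashlib.sha256(file_path.read_bytes()).hexdigest()
    return f"{citekey}-{checksum}-{version}.pickle"


def load_or_compute(cache_dir: Path, citekey: str, file_path: Path, compute):
    """Load a pickled result if present; otherwise compute it and write the cache file."""
    cache_file = cache_dir / cache_file_name(citekey, file_path)
    if cache_file.exists():
        with cache_file.open("rb") as handle:
            return pickle.load(handle)
    result = compute(file_path)
    with cache_file.open("wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def analyse(path: Path) -> str:
    """Stand-in for an expensive analysis; records each real invocation."""
    calls.append(path)
    return path.read_text(encoding="utf-8").upper()


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)  # plays the role of the database file's directory
    (base / "doc.txt").write_text("some text", encoding="utf-8")
    cache_dir = base / "cache"
    cache_dir.mkdir()
    doc = resolve_path(base, "doc.txt")
    first = load_or_compute(cache_dir, "heinlein2020", doc, analyse)
    second = load_or_compute(cache_dir, "heinlein2020", doc, analyse)  # served from cache
    cache_names = sorted(p.name for p in cache_dir.iterdir())
```

Because the checksum and the version are part of the file name, a changed document or a new library version simply misses the old cache file and triggers a fresh computation.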
- split_fields_by(field: str, separator=';')[source]
Destructively split the value of the given field (e.g. “keywords”) in all entries of the database by the given separator.
- Parameters:
field (string) – The field name (e.g. “keywords”).
separator (string) – The separating character. Default = “;”
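A minimal sketch of such a destructive split, using plain dicts in place of the parsed entries. The function name split_field_in_place and the sample data are hypothetical, and stripping surrounding whitespace and dropping empty parts are assumptions that the reference does not confirm.

```python
def split_field_in_place(entries, field, separator=";"):
    """Replace each string value of `field` with a list of its separated parts."""
    for entry in entries.values():
        value = entry.get(field)
        if isinstance(value, str):  # skip missing fields and already-split values
            parts = [part.strip() for part in value.split(separator)]
            entry[field] = [part for part in parts if part]


# Made-up sample data, keyed by citation key.
entries = {
    "adorno1960": {"keywords": "music, philosophy"},
    "heinlein2020": {"file": "a.pdf;b.pdf"},
}
split_field_in_place(entries, "keywords", separator=",")
split_field_in_place(entries, "file")
```

The entries are modified in place, so the original unsplit strings are no longer available afterwards; entries that lack the field are left untouched.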