bp_text.database

This module implements database functionality. Its main purpose is to read a BibTeX file and make its contents available as a database.

Created: 2025-03-23 Author: Ruben Philipp <me@rubenphilipp.com>

$$ Last modified: 02:29:46 Sat Apr 26 2025 CEST

Functions

convert_latex_umlauts(text)

Convert LaTeX umlauts to ASCII umlauts.

Classes

`BibTexDatabase`(file_path[, split_keywords, ...])	Implementation of a BibTeX database.
`Database`()	An abstract superclass for a database.

class bp_text.database.BibTexDatabase(file_path: str, split_keywords=True, split_files=True)[source]

Bases: Database

Implementation of a BibTeX database. This class provides the user with various methods to interact with the stored data.

NB: This class is limited to read-only use. To modify the contents of the actual database, it is recommended to use specialized software (e.g. BibDesk).

The data attribute contains the parsed data.

Example:

from bp_text import database

db = database.BibTexDatabase("db.bib")
db.entries["adorno1960"].get("file").value

Parameters:

file_path (string) – The path to the .bib file.
split_keywords (boolean) – When true, split all keywords in the keywords field into a list, assuming they are separated by a comma (“,”). Default = True
split_files (boolean) – When true, split all files in the file field into a list, assuming they are separated by a semicolon (“;”). Default = True

__init__(file_path: str, split_keywords=True, split_files=True)[source]

property data: The parsed data.

property entries

The database entries as a dict.

Example:

# get the value of the file field of the entry with the citation
# key "heinlein2020"
db.entries["heinlein2020"].fields_dict["file"].value
# => ['heinlein2020 - katastrophen.pdf']

# ...which can also be expressed in a shorter form
db.entries["heinlein2020"].get("file").value

find_entries(field: str, search: str)[source]

Find entries matching the search string in the given field.

Parameters:

field (string) – The field name (e.g. “keywords”).
search (string) – The search string.

Returns:

A list of bibtexparser.model.Entry objects.

Return type:

list
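The idea behind such a field search can be sketched with plain dicts instead of bibtexparser entries. The helper name find_in_field, the sample data, and the case-insensitive substring matching are assumptions made for illustration; the matching rule actually used by find_entries is not specified here.

```python
# Hypothetical stand-in for a parsed database: citation key -> fields.
entries = {
    "doe2001": {"title": "Example title A", "keywords": ["music", "philosophy"]},
    "roe2015": {"title": "Example title B", "keywords": ["sociology"]},
}


def find_in_field(entries, field, search):
    """Return the keys of all entries whose `field` contains `search`.

    Assumption: case-insensitive substring matching; list values
    (e.g. split keywords) match if any item matches.
    """
    needle = search.lower()
    matches = []
    for key, fields in entries.items():
        value = fields.get(field)
        if value is None:
            continue  # entry has no such field
        items = value if isinstance(value, list) else [value]
        if any(needle in str(item).lower() for item in items):
            matches.append(key)
    return matches


print(find_in_field(entries, "keywords", "phil"))
# => ['doe2001']
```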

get_entry_by_key(key)[source]

Get a specific entry from the database by its citation key.

This is an alias to self.entries.get(key).

Parameters:

key (string) – The citation key (e.g. “adorno1960”) to look for.

get_nth_entry(n)[source]

Get the nth entry in the database.

Parameters:

n (integer) – Index (zero-based) of the entry in the database.

load(file_path: str, split_keywords=True, split_files=True)[source]

Load and parse a BibTeX file.

Parameters:

file_path (string) – The path to the BibTeX file.
split_keywords (boolean) – When true, split all keywords in the keywords field into a list, assuming they are separated by a comma (“,”). Default = True
split_files (boolean) – When true, split all files in the file field into a list, assuming they are separated by a semicolon (“;”). Default = True

make_pool(cache=False, default_get_data_func=None, pdf_auto_extract=True, pdf_use_ocr=False, pdf_fallback_to_ocr=True, pdf_ocr_dpi=300, pdf_ocr_default_lang='eng', verbose=True)[source]

Create a Pool with PoolItem objects derived from the BibTexDatabase entries and the documents linked in the file fields.

While creating the Pool, this method also instantiates text-holding objects for the data linked in file. These can be PdfFile (for PDFs) or TxtFile (for TXTs) objects. When a cache directory is given, this method searches that directory for pickled objects and tries to load them in order to avoid recomputing expensive NLP analyses (cf. Text). The search pattern for cache files is [citekey]-[file_checksum]-[bp_text.__version__].pickle. In addition, new cache files are created automatically for items not found in the cache directory.
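The cache file naming pattern can be illustrated with a short sketch. The function name cache_file_name, the sample citation key, and the version string are hypothetical, and the use of SHA-256 for the checksum is an assumption; the checksum algorithm actually used is not stated here.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_file_name(citekey, file_path, version):
    """Build a name of the form [citekey]-[file_checksum]-[version].pickle.

    Assumption: the checksum is the SHA-256 hex digest of the file's
    bytes. The real implementation may use a different algorithm.
    """
    checksum = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    return f"{citekey}-{checksum}-{version}.pickle"


# Demonstrate with a throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    document = Path(tmp) / "example.txt"
    document.write_bytes(b"hello")
    name = cache_file_name("doe2001", document, "0.1.0")

print(name)
```

Because the checksum is part of the name, a changed source file no longer matches its old cache file, and a new version string invalidates all earlier cache files.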

The paths in file are either relative or absolute. When relative, they are converted to absolute paths relative to the location of the database file (cf. self._base_path).
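The path handling described above can be sketched with pathlib. The function name resolve_linked_file and the sample paths are hypothetical; this only illustrates the rule that relative paths are interpreted relative to the directory containing the database file.

```python
from pathlib import Path


def resolve_linked_file(file_entry, database_path):
    """Return an absolute path for a path found in a `file` field.

    Relative paths are interpreted relative to the directory that
    contains the database file; absolute paths are returned unchanged.
    """
    path = Path(file_entry)
    if path.is_absolute():
        return path
    base_path = Path(database_path).resolve().parent
    return base_path / path


resolved = resolve_linked_file("pdfs/doe2001.pdf", "refs/db.bib")
print(resolved)
```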

Parameters:

cache (False or string) – When False, caching is disabled. If a directory string is given, use this directory for caching (see above).
default_get_data_func (function or None) – Sets the default function used to get data from a PoolItem object (cf. respective doc in this class). The function must take the PoolItem as its argument and return an index to the element of data which should be retrieved. Set to None to use the default.
pdf_auto_extract (boolean) – Automatically extract the text from all pages in the file when instantiating the object? This also automatically creates PdfPage objects for each page. Default = True
pdf_use_ocr (boolean) – Use OCR by default for text extraction? Default = False
pdf_fallback_to_ocr (boolean) – If text extraction without OCR yields little text, fall back to OCR? Default = True
pdf_ocr_dpi (integer) – The resolution (in DPI) used for OCR. Default = 300
pdf_ocr_default_lang (string) – The default language for OCR. Default = “eng”
verbose (boolean) – Print additional information during processing when True. Default = True

split_fields_by(field: str, separator=';')[source]

Destructively split the value of the given field (e.g. “keywords”) in all entries of the database by the given separator.

Parameters:

field (string) – The field name (e.g. “keywords”).
separator (string) – The separating character. Default = “;”
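A minimal sketch of destructive field splitting, using plain dicts rather than the actual entry objects. The function name split_field_in_place and the sample data are hypothetical, and stripping whitespace and dropping empty items are assumptions about details not documented here.

```python
def split_field_in_place(entries, field, separator=";"):
    """Split `field` of every entry into a list, modifying `entries`.

    Assumptions: entries are plain dicts, surrounding whitespace is
    stripped, and empty items are dropped.
    """
    for fields in entries.values():
        value = fields.get(field)
        if isinstance(value, str):
            fields[field] = [
                item.strip() for item in value.split(separator) if item.strip()
            ]


entries = {
    "doe2001": {"keywords": "alpha, beta,gamma"},
    "roe2015": {"title": "No keywords here"},
}
split_field_in_place(entries, "keywords", separator=",")
print(entries["doe2001"]["keywords"])
# => ['alpha', 'beta', 'gamma']
```

"Destructive" here means that the original string value is replaced by the list; entries without the field are left untouched.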

class bp_text.database.Database[source]

Bases: ABC

An abstract superclass for a database.

abstractmethod load(file_path: str)[source]

Load a database from a file.

Parameters:

file_path (string) – The path to the database file.
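The abstract-superclass pattern can be sketched as follows. SketchDatabase and JsonDatabase are hypothetical names invented for this illustration; they mirror the interface described above (an ABC with an abstract load method) and are not part of bp_text.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class SketchDatabase(ABC):
    """Hypothetical stand-in mirroring the abstract interface above."""

    @abstractmethod
    def load(self, file_path: str):
        """Load a database from a file."""


class JsonDatabase(SketchDatabase):
    """Hypothetical subclass that reads its entries from a JSON file."""

    def __init__(self, file_path: str):
        self.data = None
        self.load(file_path)

    def load(self, file_path: str):
        with open(file_path, encoding="utf-8") as handle:
            self.data = json.load(handle)


# The abstract class itself cannot be instantiated.
try:
    SketchDatabase()
    instantiated = True
except TypeError:
    instantiated = False

# A concrete subclass only has to provide load().
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "db.json"
    path.write_text(json.dumps({"doe2001": {"title": "Example"}}), encoding="utf-8")
    db = JsonDatabase(str(path))

print(instantiated, db.data)
```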

bp_text.database.convert_latex_umlauts(text)[source]

Convert LaTeX umlauts to ASCII umlauts.

Parameters:

text (string) – Text to convert.
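The kind of conversion described can be sketched as below. This is not the module's implementation: the function name latex_umlauts_to_unicode, the set of recognized LaTeX forms, and the choice to emit Unicode umlaut characters are all assumptions, since the exact input and output formats of convert_latex_umlauts are not documented here.

```python
import re

# Assumed mapping from base letter to umlaut character.
UMLAUTS = {"a": "ä", "o": "ö", "u": "ü", "A": "Ä", "O": "Ö", "U": "Ü"}

# Recognized forms (an assumption): {\"a}, \"{a} and \"a.
# Braced forms come first so that their braces are consumed as well.
_PATTERN = re.compile(
    r'\{\\"([aouAOU])\}|\\"\{([aouAOU])\}|\\"([aouAOU])'
)


def latex_umlauts_to_unicode(text):
    """Replace LaTeX umlaut commands with the matching characters."""

    def replace(match):
        # Exactly one of the three groups holds the base letter.
        letter = next(group for group in match.groups() if group)
        return UMLAUTS[letter]

    return _PATTERN.sub(replace, text)


print(latex_umlauts_to_unicode('M{\\"u}ller and Sch\\"onberg'))
# => Müller and Schönberg
```

One limitation of this sketch: a brace pair that belongs to an enclosing command and contains only an umlaut (e.g. \textbf{\"a}) is treated as part of the umlaut and removed.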