Examples

First of all, load the library:

import bp_text

Database

This example shows how to load a BibTeX file and access the keywords of an entry via its citation key:

# load the database
db = bp_text.database.BibTexDatabase("sources.bib")

# get keywords if the field is not empty
entry = db.entries.get("chion2018")
if entry.get("keywords"):
    print(entry.get("keywords").value)

# => ['Aesthetics', 'Motion pictures', 'Sound effects', 'Sound motion
#     pictures']

Pool

The bp_text.pool.Pool class is the heart of bp_text. It is a collection of annotated/tokenized, text-holding objects (e.g. PdfFiles, TxtFiles) and can be generated from a BibTexDatabase object. Its main purpose is to facilitate interacting with a corpus of texts and the metadata provided by the BibTeX entries.

The most straightforward way to create a pool is to first load a BibTeX database and then derive a Pool from the BibTexDatabase object.

Note: It is crucial to include paths to the source files (either PDF or TXT) in the BibTeX file (cf. notes). The paths can either be absolute or relative (to the BibTeX file).
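For illustration, such an entry might look like the following. Note that the author, title, year, and in particular the `file` field name are invented for this sketch; consult the notes for the actual convention used by bp_text:

```bibtex
@book{chion2018,
  author   = {Chion, Michel},
  title    = {Audio-Vision},
  year     = {2018},
  keywords = {Aesthetics, Motion pictures, Sound effects},
  file     = {texts/chion2018.pdf}
}
```

Here the path is relative, so it would be resolved against the directory containing the BibTeX file.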

Here is an example of creating a database and a derived Pool:

import bp_text
db = bp_text.database.BibTexDatabase("/users/bp/sources.bib")
pool = db.make_pool(cache="/tmp/pool_cache")

# this returns the data according to the given default_get_data_func
pool.get("chion2018").get_data()
# => <bp_text.pdf.PdfFile object at 0x453d80ef0>

# you can also pass a different function to retrieve the data
pool.get("chion2018").get_data(bp_text.pool.random_data)
# => <bp_text.pdf.PdfFile object at 0x453d80ef0>

# this entry ("chion2018") is a PDF, so it contains multiple pages
pool.get("chion2018").get_data().get_page(20).text()
# => 'XX FOREWORD\n(the disembodied voice seems to come from (...)

Using a cache (via the cache argument, which names a directory in which to store cache files) improves the performance of bp_text.
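The cache mechanism itself is internal to bp_text, but the general pattern is worth illustrating: parse an expensive source once, serialize the result into the cache directory, and reload it on subsequent runs. A minimal, generic sketch (not bp_text's actual implementation):

```python
import os
import pickle

def load_with_cache(source_path, cache_dir, parse):
    """Parse source_path once; reuse a pickled result on later runs."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir,
                              os.path.basename(source_path) + ".pickle")
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as fh:
            return pickle.load(fh)       # cache hit: skip parsing
    result = parse(source_path)          # cache miss: do the expensive work
    with open(cache_file, "wb") as fh:
        pickle.dump(result, fh)
    return result
```

On a second call with the same arguments, `parse` is never invoked; deleting the cache directory forces a re-parse.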

org-mode

The easiest way to compile a text with material generated by bp_text is to use embedded Python code in Emacs’s org-mode [1]. The following example demonstrates a simple bp_text workflow of this kind.

#+title: Testdocument
#+author: Ruben Philipp
#+date: 2025-05-13
#+LANGUAGE: de
#+startup: overview
#+LATEX_COMPILER: xelatex
#+LATEX_CLASS: article
#+LATEX_CLASS_OPTIONS: [a4paper,10pt,oneside]
#+LATEX_HEADER: \usepackage{multicol}
#+LATEX_HEADER: \usepackage{geometry}
#+LATEX_HEADER: \geometry{left=15mm,right=15mm,top=15mm,bottom=15mm}
#+LATEX_HEADER: \usepackage[hang,flushmargin]{footmisc}
#+LATEX_HEADER: \usepackage{pdflscape}
#+LATEX_HEADER: \usepackage{lmodern}
# #+LATEX_HEADER: \usepackage{ebgaramond}
#+LATEX_HEADER: \usepackage{fontspec}
#+LATEX_HEADER: \setmainfont{Adobe Garamond Pro}
#+LATEX_HEADER: \usepackage{fancyhdr}
#+LATEX_HEADER: \usepackage{epigraph}
#+LATEX_HEADER: \usepackage[utf8]{inputenc}
#+LATEX_HEADER: \usepackage[ngerman]{babel}
# % caption styling
#+LATEX_HEADER: \usepackage[font=small,labelfont=bf]{caption}

#+cite_export: csl chicago-note-bibliography.csl

#+options: toc:nil num:nil ':t
#+bibliography: /path/to/sources.bib
#+LaTeX_HEADER: \hypersetup{hidelinks}
#+LATEX_HEADER: \usepackage{setspace}
#+EXPORT_FILE_NAME: /tmp/test
# % code listings with mono font
#+LATEX_HEADER: \lstset{basicstyle=\footnotesize\ttfamily,breaklines=true}
#+LATEX_HEADER: \lstset{keepspaces=true}
# % hyphenation in mono/texttt blocks
#+LATEX_HEADER: \usepackage[htt]{hyphenat}


#+begin_src python :session bp_text :results none :exports none

## These are some global definitions and declarations

import bp_text
import os
import random

DATA_ROOT = os.path.abspath("data")
SOURCES_BIB = DATA_ROOT + "/sources.bib"

# Just instantiate DB and POOL once as this might take a while
try:
    DB and POOL
except NameError:
    # not defined, instantiate
    DB = bp_text.database.BibTexDatabase(SOURCES_BIB)
    POOL = DB.make_pool(cache=DATA_ROOT + "/_pool-cache")


# Set this if citations for items retrieved from a pool should be included in
# the generated text.
CITE = False

#+end_src


* Text From a "Model"

#+begin_comment
This example uses an input text as a model and "replaces" the words in the
text with words with the same POS from randomly chosen pages from a text in the
pool.
#+end_comment

#+begin_src python :session bp_text :results value raw :exports results

# This is the pool item and its associated doc
thePitm = POOL.get("arendt2006")
theDoc = thePitm.get_data()

# number of pages in doc
num_pages = len(theDoc.data)

# this is the input text
# source:
# https://www.soziopolis.de/zur-diagnostischen-gefuehlskultur-der-gegenwart.html
input_sentence = """
Trauma, toxisch, triggern – Wörter wie diese sind
inzwischen fester Bestandteil des Begriffsinventars, mit dem viele Menschen,
online wie offline, über Lebenskrisen und zwischenmenschliche Konflikte
sprechen. In Digitale Diagnosen. Psychische Gesundheit als Social-Media-Trend
analysiert die Soziologin Laura Wiesböck ebenjene gegenwärtige Melange aus
therapeutischen Diskursfragmenten auf Social Media. Die Autorin, Jahrgang 1987,
hat an der Universität Wien promoviert und ist auf die Themen soziale
Ungleichheit, Gendergerechtigkeit und digitale Arbeitswelt spezialisiert. Seit
2018 ihr erstes Buch In besserer Gesellschaft. Der selbstgerechte Blick auf die
Anderen erschienen ist, zählt sie zu den medial präsentesten Gesichtern einer –
jungen und weiblichen – soziologischen Wissenschaftskommunikation. Das
vorliegende Sachbuch adressiert eine breite Öffentlichkeit jenseits der
fachwissenschaftlichen Community. Das verhandelte Phänomen ist eine so markante
Signatur unserer Gegenwart, dass Wiesböcks Buch eine gewinnbringende Lektüre für
unterschiedliche Leserschaften – vor allem mit sozialwissenschaftlichem und
psychologischem Fachhintergrund – ist.
"""

# initialize an NLP model
nlp = bp_text.text.get_nlp("de_core_news_sm")

# analyze the input text
input_doc = nlp(input_sentence)

# reset random seed
random.seed(123)

result = []
for word in input_doc:
    pg_i = random.randrange(num_pages)
    pg = theDoc.get_page(pg_i)
    for token in pg.text.doc:
        if token.pos_ == word.pos_:
            result.append(bp_text.textfragment.TextFragment(thePitm.key,
                                                            pg.page_label,
                                                            thePitm.meta,
                                                            token))
            break

# output the results as org text
bp_text.textfragment.textfragments_to_org(result,
                                          cite = CITE,
                                          force_cite = False)


#+end_src

* Literature

#+print_bibliography:

Footnotes