Examples
First of all, load the library:
import bp_text
Database
This example shows how to load a BibTeX file and access the keywords of an entry with a given citation key:
# load the database
db = bp_text.database.BibTexDatabase("sources.bib")
# get keywords if the field is not empty
entry = db.entries.get("chion2018")
if entry.get("keywords"):
    print(entry.get("keywords").value)
# => ['Aesthetics', 'Motion pictures', 'Sound effects', 'Sound motion
#     pictures']
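The lookup pattern above can be illustrated without bp_text itself. In the following sketch, `Field`, `entries`, and `keywords_for` are hypothetical stand-ins (plain Python mock-ups of the database objects), assuming that `db.entries.get(...)` returns `None` for unknown citation keys and that field objects expose a `.value` attribute as shown above:

```python
class Field:
    """Mock of a field object exposing a .value attribute."""
    def __init__(self, value):
        self.value = value

# mock of db.entries: citekey -> entry (a dict of fields)
entries = {
    "chion2018": {"keywords": Field(["Aesthetics", "Sound effects"])},
    "nokeys1999": {},  # entry without a keywords field
}

def keywords_for(citekey):
    """Return the keyword list for a citekey, or [] if absent/empty."""
    entry = entries.get(citekey)
    if entry and entry.get("keywords"):
        return entry.get("keywords").value
    return []

print(keywords_for("chion2018"))   # ['Aesthetics', 'Sound effects']
print(keywords_for("nokeys1999"))  # []
print(keywords_for("missing"))     # []
```

Guarding both the missing entry and the empty field avoids `AttributeError`s when a BibTeX entry lacks a `keywords` field.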
Pool
The bp_text.pool.Pool class is the heart of bp_text. This class
is a collection of annotated/tokenized, text-holding objects (e.g. PdfFile or
TxtFile objects) and can be generated from a BibTexDatabase object. Its main
purpose is to facilitate interacting with a corpus of texts and the metadata
provided by the BibTeX entries.
The most straightforward way to create a pool is to first load a BibTeX database and then derive a Pool from the BibTexDatabase object.
Note: It is crucial to include paths to the source files (either PDF or TXT) in the BibTeX file (cf. notes). The paths can either be absolute or relative (to the BibTeX file).
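The resolution rule for relative paths can be sketched with the standard library alone. The function name `resolve_source_path` is hypothetical (bp_text's internal handling is not shown here); the sketch only illustrates the rule that relative paths are interpreted relative to the BibTeX file:

```python
import os

def resolve_source_path(bibtex_path, file_field):
    """Resolve a source-file path from a BibTeX entry: absolute paths are
    kept as-is, relative ones are interpreted relative to the directory of
    the BibTeX file. (A sketch of the rule described above, not bp_text's
    actual code.)"""
    if os.path.isabs(file_field):
        return file_field
    return os.path.normpath(
        os.path.join(os.path.dirname(bibtex_path), file_field))

print(resolve_source_path("/users/bp/sources.bib", "pdfs/chion2018.pdf"))
# /users/bp/pdfs/chion2018.pdf
print(resolve_source_path("/users/bp/sources.bib", "/data/chion2018.pdf"))
# /data/chion2018.pdf
```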
Here is an example of creating a database and a derived Pool:
import bp_text
db = bp_text.database.BibTexDatabase("/users/bp/sources.bib")
pool = db.make_pool(cache="/tmp/pool_cache")
# this returns the data according to the given default_get_data_func
pool.get("chion2018").get_data()
# => <bp_text.pdf.PdfFile object at 0x453d80ef0>
# you can also use a different method to get the data
pool.get("chion2018").get_data(bp_text.pool.random_data)
# => <bp_text.pdf.PdfFile object at 0x453d80ef0>
# this entry ("chion2018") is a PDF, so it contains multiple pages
pool.get("chion2018").get_data().get_page(20).text()
# => 'XX FOREWORD\n(the disembodied voice seems to come from (...)
Using a cache (via the cache argument, which names a directory in which to store cache files) improves the performance of bp_text.
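Why a cache directory helps can be shown with a toy version of the idea. The helper `cached` below is a hypothetical illustration (bp_text's actual cache format and API are not specified here): an expensive computation runs once, its result is pickled into the cache directory, and later calls are served from disk:

```python
import os
import pickle
import tempfile

def cached(cache_dir, key, compute):
    """Return a cached value for key, computing and storing it on a miss.
    A toy illustration of a per-key file cache, not bp_text's own code."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, key + ".pickle")
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    value = compute()
    with open(path, "wb") as fh:
        pickle.dump(value, fh)
    return value

cache_dir = tempfile.mkdtemp()
calls = []

def expensive():
    calls.append(1)          # track how often we actually compute
    return {"pages": 243}

first = cached(cache_dir, "chion2018", expensive)
second = cached(cache_dir, "chion2018", expensive)   # served from disk
print(first == second, len(calls))  # True 1
```

Parsing and annotating PDFs is comparatively slow, so serving previously processed items from such a cache avoids repeating that work across sessions.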
Trivial Noun Search
The following lines show how to search a pool for a search_word that is used as a noun in the context of the respective text.
The bp_text.textfragment.TextFragment can, in this case, be
conceived as a “container” for the search results. It holds the token of the
search result as well as the citekey, the page_label and some metadata (taken
from the respective item’s metadata, which is most likely retrieved from the
BibTeX file). In general, TextFragment objects help to unify the results of
data extraction, e.g. from a pool.
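The shape of such a container can be sketched as a plain dataclass. `FragmentSketch` is a hypothetical stand-in (not bp_text's TextFragment class), mirroring only the fields described above: citekey, page label, item metadata, and the matched token:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class FragmentSketch:
    """Hypothetical stand-in for a search-result container holding the
    fields TextFragment is described as carrying."""
    citekey: str
    page_label: str
    meta: dict
    data: Any  # the matched token (a spaCy Token in bp_text itself)

frag = FragmentSketch("chion2018", "XX", {"title": "Example Title"}, "voice")
print(frag.citekey, frag.page_label, frag.data)  # chion2018 XX voice
```

Bundling the token with its provenance (citekey, page label, metadata) is what later makes it possible to print results with page labels or generate citations.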
# these need to be imported in order to make the type checking work (via
# isinstance())...
from bp_text.pdf import PdfFile
from bp_text.txt import TxtFile
# this is the word to find in the pool...
search_word = "sprache"
# as we, in this example (see below), use normalized words, let's apply
# lowercase...
search_word = search_word.lower()
# this dict will hold the results...
results = {}
# now, loop through all available pool items...
for key, pitm in pool.data.items():
    # get a data object (either a TxtFile or a PdfFile)...
    data = pitm.get_data()
    # these will be the matches...
    matches = []
    # only proceed if the PoolItem contains either a PdfFile or a TxtFile
    # object...
    if isinstance(data, (PdfFile, TxtFile)):
        # loop through all pages while preserving pagenum (which is the
        # page index here)...
        for pagenum, page in enumerate(data.data):
            # this is the spacy.Doc
            doc = page.text.doc
            # this is the page_label (cf. :py:module:`bp_text.pdf`)
            page_label = page.page_label
            # if the page does not contain any text, the doc might be
            # empty; handle this case...
            if doc is None:
                continue
            # now, search the spacy.Doc for nouns matching the pattern
            for token in doc:
                if (token.text.lower() == search_word
                        and token.pos_ == "NOUN"):
                    # make a TextFragment object for this search result
                    frag = bp_text.textfragment.TextFragment(key,
                                                             page_label,
                                                             pitm.meta,
                                                             token)
                    matches.append(frag)
    # add the matches to the results...
    if matches:
        results[key] = matches
# this loop goes through the items in results and prints the token and
# the page_label (if applicable)...
for key, val in results.items():
    print("-------")
    for itm in val:
        print(f"data: '{itm.data}'")
        print(f"page_label: '{itm.page_label}'")
# get the tokens around a selected token
selected_token = results["nietzsche2"][7].data
print(f"This is the token: '{selected_token}'")
print("This is the next token: "
      f"'{selected_token.doc[selected_token.i + 1]}'")
print("This is the previous token: "
      f"'{selected_token.doc[selected_token.i - 1]}'")
print("This is the sentence:")
print(f"'{selected_token.sent.text}'")
org-mode
The easiest way to compile a text with material generated by bp_text is to use embedded Python code in Emacs’s org-mode [1]. The following example shows a simple setup for working with bp_text in this fashion.
#+title: Testdocument
#+author: Ruben Philipp
#+date: 2025-05-13
#+LANGUAGE: de
#+startup: overview
#+LATEX_COMPILER: xelatex
#+LATEX_CLASS: article
#+LATEX_CLASS_OPTIONS: [a4paper,10pt,oneside]
#+LATEX_HEADER: \usepackage{multicol}
#+LATEX_HEADER: \usepackage{geometry}
#+LATEX_HEADER: \geometry{left=15mm,right=15mm,top=15mm,bottom=15mm}
#+LATEX_HEADER: \usepackage[hang,flushmargin]{footmisc}
#+LATEX_HEADER: \usepackage{pdflscape}
#+LATEX_HEADER: \usepackage{lmodern}
# #+LATEX_HEADER: \usepackage{ebgaramond}
#+LATEX_HEADER: \usepackage{fontspec}
#+LATEX_HEADER: \setmainfont{Adobe Garamond Pro}
#+LATEX_HEADER: \usepackage{fancyhdr}
#+LATEX_HEADER: \usepackage{epigraph}
#+LATEX_HEADER: \usepackage[utf8]{inputenc}
#+LATEX_HEADER: \usepackage[ngerman]{babel}
# % caption styling
#+LATEX_HEADER: \usepackage[font=small,labelfont=bf]{caption}
#+cite_export: csl chicago-note-bibliography.csl
#+options: toc:nil num:nil ':t
#+bibliography: /path/to/sources.bib
#+LaTeX_HEADER: \hypersetup{hidelinks}
#+LATEX_HEADER: \usepackage{setspace}
#+EXPORT_FILE_NAME: /tmp/test
# % code listings with mono font
#+LATEX_HEADER: \lstset{basicstyle=\footnotesize\ttfamily,breaklines=true}
#+LATEX_HEADER: \lstset{keepspaces=true}
# % hyphenation in mono/texttt blocks
#+LATEX_HEADER: \usepackage[htt]{hyphenat}
#+begin_src python :session bp_text :results none :exports none
## These are some global definitions and declarations
import bp_text
import os
import random
DATA_ROOT = os.path.abspath("data")
SOURCES_BIB = DATA_ROOT + "/sources.bib"
# Just instantiate DB and POOL once as this might take a while
try:
    DB and POOL
except NameError:
    # not defined yet, so instantiate
    DB = bp_text.database.BibTexDatabase(SOURCES_BIB)
    POOL = DB.make_pool(cache=DATA_ROOT + "/_pool-cache")
# Set this if citations for items retrieved from a pool should be included in
# the generated text.
CITE = False
#+end_src
* Text From a "Model"
#+begin_comment
This example uses an input text as a model and "replaces" the words in the
text with words with the same POS from randomly chosen pages from a text in the
pool.
#+end_comment
#+begin_src python :session bp_text :results value raw :exports results
# This is the pool item and its associated doc
thePitm = POOL.get("arendt2006")
theDoc = thePitm.get_data()
# number of pages in doc
num_pages = len(theDoc.data)
# this is the input text
# source:
# https://www.soziopolis.de/zur-diagnostischen-gefuehlskultur-der-gegenwart.html
input_sentence = """
Trauma, toxisch, triggern – Wörter wie diese sind
inzwischen fester Bestandteil des Begriffsinventars, mit dem viele Menschen,
online wie offline, über Lebenskrisen und zwischenmenschliche Konflikte
sprechen. In Digitale Diagnosen. Psychische Gesundheit als Social-Media-Trend
analysiert die Soziologin Laura Wiesböck ebenjene gegenwärtige Melange aus
therapeutischen Diskursfragmenten auf Social Media. Die Autorin, Jahrgang 1987,
hat an der Universität Wien promoviert und ist auf die Themen soziale
Ungleichheit, Gendergerechtigkeit und digitale Arbeitswelt spezialisiert. Seit
2018 ihr erstes Buch In besserer Gesellschaft. Der selbstgerechte Blick auf die
Anderen erschienen ist, zählt sie zu den medial präsentesten Gesichtern einer –
jungen und weiblichen – soziologischen Wissenschaftskommunikation. Das
vorliegende Sachbuch adressiert eine breite Öffentlichkeit jenseits der
fachwissenschaftlichen Community. Das verhandelte Phänomen ist eine so markante
Signatur unserer Gegenwart, dass Wiesböcks Buch eine gewinnbringende Lektüre für
unterschiedliche Leserschaften – vor allem mit sozialwissenschaftlichem und
psychologischem Fachhintergrund – ist.
"""
# initialize an NLP model
nlp = bp_text.text.get_nlp("de_core_news_sm")
# analyze the input text
input_doc = nlp(input_sentence)
# reset random seed
random.seed(123)
result = []
for word in input_doc:
    pg_i = random.randrange(num_pages)
    pg = theDoc.get_page(pg_i)
    for token in pg.text.doc:
        if token.pos_ == word.pos_:
            result.append(bp_text.textfragment.TextFragment(thePitm.key,
                                                            pg.page_label,
                                                            thePitm.meta,
                                                            token))
            break
# output the results as org text
bp_text.textfragment.textfragments_to_org(result,
                                          cite=CITE,
                                          force_cite=False)
#+end_src
* Literature
#+print_bibliography:
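The source block in the example above replaces each word of a model text with the first same-POS token found on a randomly chosen page. The core of that technique can be sketched without spaCy or bp_text; the `pages` and `model` structures below are toy stand-ins (lists of hypothetical `(text, pos)` pairs) for the per-page docs and the tagged input text:

```python
import random

# toy "tagged pages": each page is a list of (text, pos) pairs, standing in
# for the spaCy docs that bp_text provides per page
pages = [
    [("Klang", "NOUN"), ("hören", "VERB"), ("leise", "ADJ")],
    [("Stimme", "NOUN"), ("sprechen", "VERB"), ("laut", "ADJ")],
]
# the "model" text, already POS-tagged
model = [("Wort", "NOUN"), ("sagen", "VERB")]

random.seed(123)  # reproducible page choices
result = []
for _, pos in model:
    # pick a random page for each model word...
    page = pages[random.randrange(len(pages))]
    # ...and take its first token with a matching POS (mirroring the
    # break in the org-mode example)
    for text, token_pos in page:
        if token_pos == pos:
            result.append(text)
            break
print(result)
```

Each model word is thus swapped for corpus material that fills the same grammatical slot, which is what preserves the syntactic skeleton of the input text.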
Footnotes