.. -*- eval: (flyspell-mode); eval: (ispell-change-dictionary "en") -*-

==========
Examples
==========

First of all, load the library:

.. code-block:: python

   import bp_text

Database
--------

This example shows how to load a BibTeX file and access the keywords of an
entry by a given citation key:

.. code-block:: python

   # load the database
   db = bp_text.database.BibTexDatabase("sources.bib")

   # print the keywords if the field is not empty
   entry = db.entries.get("chion2018")
   if entry.get("keywords"):
       print(entry.get("keywords").value)
   # => ['Aesthetics', 'Motion pictures', 'Sound effects', 'Sound motion
   # pictures']

Pool
----

The :py:class:`bp_text.pool.Pool` class is the heart of `bp_text`. It is a
collection of annotated/tokenized, text-holding objects (e.g. PdfFiles,
TxtFiles) and can be generated from a BibTexDatabase object. Its main purpose
is to facilitate interacting with a corpus of texts and the metadata provided
by the BibTeX entries.

The most straightforward way to create a pool is to first load a BibTeX
database and then derive a `Pool` from the `BibTexDatabase` object.

**Note:** It is crucial to include paths to the source files (either PDF or
TXT) in the BibTeX file (cf. `notes`). The paths can be either absolute or
relative (to the BibTeX file).

Here is an example of creating a database and a derived `Pool`:

.. code-block:: python

   import bp_text

   db = bp_text.database.BibTexDatabase("/users/bp/sources.bib")
   pool = db.make_pool(cache="/tmp/pool_cache")

   # this returns the data according to the given default_get_data_func
   pool.get("chion2018").get_data()
   # =>

   # you can also use a different method to get the data
   pool.get("chion2018").get_data(bp_text.pool.random_data)
   # =>

   # this entry ("chion2018") is a PDF, so it contains multiple pages
   pool.get("chion2018").get_data().get_page(20).text()
   # => 'XX FOREWORD\n(the disembodied voice seems to come from (...)
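To illustrate the note above, here is a minimal, hypothetical sketch of a
BibTeX entry carrying a source-file path. The field name `notes` follows the
note's reference; the relative path and the elided author/title fields are
invented placeholders, not taken from an actual database:

.. code-block:: bibtex

   @book{chion2018,
     author   = {...},
     title    = {...},
     keywords = {Aesthetics, Motion pictures, Sound effects,
                 Sound motion pictures},
     notes    = {pdf/chion2018.pdf}
   }

A relative path like this would be resolved against the directory containing
the BibTeX file itself.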
Using a cache (via `cache`, a directory where cache files are stored) improves
the performance of `bp_text`.

Trivial Noun Search
^^^^^^^^^^^^^^^^^^^

The following lines show how to search a pool for a `search_word` which should
be used as a noun in the context of the respective text. The
:py:class:`bp_text.textfragment.TextFragment` can, in this case, be conceived
of as a "container" for the search results. It holds the token of the search
result as well as the citekey, the page_label and some metadata (taken from
the respective item's metadata, which is most likely retrieved from the BibTeX
file). In general, `TextFragment` objects help to unify the results of data
extraction, e.g. from a pool.

.. code-block:: python

   # these need to be imported in order to make the type checks work (via
   # isinstance())...
   from bp_text.pdf import PdfFile
   from bp_text.txt import TxtFile

   # this is the word to find in the pool...
   search_word = "sprache"

   # as we, in this example (see below), use normalized words, let's apply
   # lowercase...
   search_word = search_word.lower()

   # this is an empty dict for the results...
   results = {}

   # now, loop through all available pool items...
   for key, pitm in pool.data.items():
       # get a data object (either a TxtFile or a PdfFile)...
       data = pitm.get_data()
       # these will be the matches...
       matches = []
       # just proceed if the PoolItem contains either a PdfFile or a TxtFile
       # object...
       if isinstance(data, (PdfFile, TxtFile)):
           # loop through all pages while preserving the pagenum (which is
           # the page index here)...
           for pagenum, page in enumerate(data.data):
               # this is the spacy.doc
               doc = page.text.doc
               # this is the page_label (cf. :py:module:`bp_text.pdf`)
               page_label = page.page_label
               # if the page does not contain any text, the doc might be
               # empty. this handles this case...
               if doc is None:
                   continue
               # now, search the spacy.doc for nouns matching the pattern
               for token in doc:
                   if (token.text.lower() == search_word
                           and token.pos_ == "NOUN"):
                       # make a TextFragment object for this search result
                       frag = bp_text.textfragment.TextFragment(
                           key, page_label, pitm.meta, token)
                       matches.append(frag)
       # add the matches to the results...
       if matches:
           results[key] = matches

   # this loop goes through the items in the results variable and prints the
   # token and the page_label (if applicable)...
   for key, val in results.items():
       print("-------")
       for itm in val:
           print(f"data: '{itm.data}'")
           print(f"page_label: '{itm.page_label}'")

   # get the tokens around a selected token
   selected_token = results["nietzsche2"][7].data
   print(f"This is the token: '{selected_token}'")
   print("This is the next token: "
         + f"'{selected_token.doc[selected_token.i + 1]}'")
   print("This is the prev token: "
         + f"'{selected_token.doc[selected_token.i - 1]}'")
   print("This is the sentence:")
   print(f"'{selected_token.sent.text}'")

org-mode
--------

The easiest way to compile a text with material generated by `bp_text` is to
use embedded Python code in Emacs's `org-mode` [#f1]_. The following example
shows a simple way of working with `bp_text` in this manner.

.. code-block:: org-mode

   #+title: Testdocument
   #+author: Ruben Philipp
   #+date: 2025-05-13
   #+LANGUAGE: de
   #+startup: overview
   #+LATEX_COMPILER: xelatex
   #+LATEX_CLASS: article
   #+LATEX_CLASS_OPTIONS: [a4paper,10pt,oneside]
   #+LATEX_HEADER: \usepackage{multicol}
   #+LATEX_HEADER: \usepackage{geometry}
   #+LATEX_HEADER: \geometry{left=15mm,right=15mm,top=15mm,bottom=15mm}
   #+LATEX_HEADER: \usepackage[hang,flushmargin]{footmisc}
   #+LATEX_HEADER: \usepackage{pdflscape}
   #+LATEX_HEADER: \usepackage{lmodern}
   # #+LATEX_HEADER: \usepackage{ebgaramond}
   #+LATEX_HEADER: \usepackage{fontspec}
   #+LATEX_HEADER: \setmainfont{Adobe Garamond Pro}
   #+LATEX_HEADER: \usepackage{fancyhdr}
   #+LATEX_HEADER: \usepackage{epigraph}
   #+LATEX_HEADER: \usepackage[utf8]{inputenc}
   #+LATEX_HEADER: \usepackage[ngerman]{babel}
   # % caption styling
   #+LATEX_HEADER: \usepackage[font=small,labelfont=bf]{caption}
   #+cite_export: csl chicago-note-bibliography.csl
   #+options: toc:nil num:nil ':t
   #+bibliography: /path/to/sources.bib
   #+LaTeX_HEADER: \hypersetup{hidelinks}
   #+LATEX_HEADER: \usepackage{setspace}
   #+EXPORT_FILE_NAME: /tmp/test
   # % code listings with mono font
   #+LATEX_HEADER: \lstset{basicstyle=\footnotesize\ttfamily,breaklines=true}
   #+LATEX_HEADER: \lstset{keepspaces=true}
   # % hyphenation in mono/texttt blocks
   #+LATEX_HEADER: \usepackage[htt]{hyphenat}

   #+begin_src python :session bp_text :results none :exports none
     ## These are some global definitions and declarations
     import bp_text
     import os
     import random

     DATA_ROOT = os.path.abspath("data")
     SOURCES_BIB = DATA_ROOT + "/sources.bib"

     # Just instantiate DB and POOL once as this might take a while
     try:
         DB and POOL
     except NameError:
         # not defined, instantiate
         DB = bp_text.database.BibTexDatabase(SOURCES_BIB)
         POOL = DB.make_pool(cache=DATA_ROOT + "/_pool-cache")

     # Set this if citations for items retrieved from a pool should be
     # included in the generated text.
     CITE = False
   #+end_src

   * Text From a "Model"

   #+begin_comment
   This example uses an input text as a model and "replaces" the words in the
   text with words with the same POS from randomly chosen pages of a text in
   the pool.
   #+end_comment

   #+begin_src python :session bp_text :results value raw :exports results
     # This is the pool item and its associated doc
     thePitm = POOL.get("arendt2006")
     theDoc = thePitm.get_data()

     # number of pages in doc
     num_pages = len(theDoc.data)

     # this is the input text
     # source:
     # https://www.soziopolis.de/zur-diagnostischen-gefuehlskultur-der-gegenwart.html
     input_sentence = """
     Trauma, toxisch, triggern – Wörter wie diese sind inzwischen fester
     Bestandteil des Begriffsinventars, mit dem viele Menschen, online wie
     offline, über Lebenskrisen und zwischenmenschliche Konflikte sprechen. In
     Digitale Diagnosen. Psychische Gesundheit als Social-Media-Trend
     analysiert die Soziologin Laura Wiesböck ebenjene gegenwärtige Melange
     aus therapeutischen Diskursfragmenten auf Social Media. Die Autorin,
     Jahrgang 1987, hat an der Universität Wien promoviert und ist auf die
     Themen soziale Ungleichheit, Gendergerechtigkeit und digitale
     Arbeitswelt spezialisiert. Seit 2018 ihr erstes Buch In besserer
     Gesellschaft. Der selbstgerechte Blick auf die Anderen erschienen ist,
     zählt sie zu den medial präsentesten Gesichtern einer – jungen und
     weiblichen – soziologischen Wissenschaftskommunikation. Das vorliegende
     Sachbuch adressiert eine breite Öffentlichkeit jenseits der
     fachwissenschaftlichen Community. Das verhandelte Phänomen ist eine so
     markante Signatur unserer Gegenwart, dass Wiesböcks Buch eine
     gewinnbringende Lektüre für unterschiedliche Leserschaften – vor allem
     mit sozialwissenschaftlichem und psychologischem Fachhintergrund – ist.
""" # initialize a NLP model nlp = bp_text.text.get_nlp("de_core_news_sm") # analyze the input text input_doc = nlp(input_sentence) # reset random seed random.seed(123) result = [] for word in input_doc: pg_i = random.randrange(num_pages) pg = theDoc.get_page(pg_i) for token in pg.text.doc: if token.pos_ == word.pos_: result.append(bp_text.textfragment.TextFragment(thePitm.key, pg.page_label, thePitm.meta, token)) break # output the results as org text bp_text.textfragment.textfragments_to_org(result, cite = CITE, force_cite = False) #+end_src * Literature #+print_bibliography: .. rubric:: Footnotes .. [#f1] Cf. https://orgmode.org and https://orgmode.org/worg/org-contrib/babel/languages/ob-doc-python.html for more details on working with Python in `org-mode`.