bp_text.txt

This module implements functionality for TXT files.

Created: 2025-03-29

Author: Ruben Philipp <me@rubenphilipp.com>

Last modified: 11:04:38 Wed May 7 2025 CEST

Classes

TxtFile(file[, lang])

Implementation of the text-file (txt) class.

TxtPage([page_num, page_label, data, ...])

This is a class implementation of a TXT page.

class bp_text.txt.TxtFile(file: str, lang='')

Bases: File

Implementation of the text-file (txt) class.

Note: The data attribute holds a list of (usually one) TxtPage object(s). This is intentionally analogous to PdfPage.

Example:

## import the txt module from the bp_text package
from bp_text import txt
## instantiate the text file object and read its contents
text = txt.TxtFile("something.txt")
## get the primary language
print(text.lang)
## => "en"

Parameters:
  • file (string) – The path to the text file.

  • lang (string) – The language of the text file (e.g. “en”, “de” etc.).

__init__(file: str, lang='')

property data

Getter/setter for the data.

property file

Getter/setter for the file attribute.

property file_checksum

Get the file checksum (sha256). Read-only.
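The reference does not show how the checksum is computed. As a rough illustration only, a SHA-256 digest of a file can be obtained with the standard library as sketched here; the helper name sha256_of_file and the chunk size are assumptions for this example and are not part of bp_text.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`.

    Hypothetical helper for illustration; not taken from bp_text.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A digest like this can be compared against a stored value to tell whether a file has changed since it was last read.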

get_primary_lang()

Detect the primary language of the text in data and set the lang attribute accordingly.

property lang

Getter/setter for the language.

update()

Updates the instance.

property verbose

Getter/setter for verbose (bool).

class bp_text.txt.TxtPage(page_num=None, page_label=None, data=None, raw_text='', lang='', verbose=False)

Bases: Page

This is a class implementation of a TXT page. TXT files (.txt) usually contain only a single page. Nevertheless, to stay consistent with the structure of PdfFile objects, TxtFile objects also use (usually) one TxtPage to store the analyzed/tokenized contents.

The data attribute (read-only) is an alias to the raw_text attribute of the respective page while the text attribute contains the analyzed/tokenized text.
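The read-only alias described above can be pictured with a small stand-in class. This is a minimal sketch, not the bp_text implementation: the class name PageSketch is hypothetical, and the assumption that raw_text is writable is made for illustration only.

```python
class PageSketch:
    """Hypothetical stand-in for a page; not the bp_text implementation."""

    def __init__(self, raw_text=""):
        self._raw_text = raw_text

    @property
    def raw_text(self):
        return self._raw_text

    @raw_text.setter
    def raw_text(self, value):
        # Assumption for this sketch: raw_text may be reassigned.
        self._raw_text = value

    @property
    def data(self):
        # Read-only alias: no setter is defined, so assigning to
        # `data` raises AttributeError.
        return self._raw_text


page = PageSketch(raw_text="Hello world")
page.raw_text = "Updated"
print(page.data)
```

Because data has no setter, any change has to go through raw_text, and both names always return the same string.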

Parameters:
  • page_num (int) – The page number (zero-based) of the page in the related file.

  • page_label (string) – The actual page label of the page in the source file. This will be used e.g. for citations in generated text.

  • raw_text (string) – Holds the actual raw text of the page, extracted from the data.

  • lang (string) – The language code of the primary language in ISO 639-1 format (e.g. “de” or “en”).

  • verbose (boolean) – Print additional information during processing when True. Default = False.

__init__(page_num=None, page_label=None, data=None, raw_text='', lang='', verbose=False)
count_words()

Counts the words in the text.

Returns:

The number of words in the text.

Return type:

integer
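How words are delimited is not specified here; count_words() presumably works on the analyzed/tokenized text. As a simplified sketch under the assumption that words are whitespace-separated, a count could look like this (the function name count_words_simple is hypothetical):

```python
def count_words_simple(text):
    """Count whitespace-separated words in `text`.

    Simplified illustration; the tokenization used by bp_text may differ.
    """
    # str.split() without arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())
```

With this approach, punctuation attached to a word counts as part of that word, which a real tokenizer would likely handle differently.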

property data

Getter (alias) for the raw_text (read-only).

detect_lang(set_lang=True)

Detect the primary language of text in the page.

Parameters:

set_lang (boolean) – When True, automatically set the language attribute of the page. Default = True.
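The effect of the set_lang flag can be illustrated with a toy detector. Everything in this sketch is an assumption: the stop-word heuristic, the names STOPWORDS, guess_lang and LangPageSketch, and the fact that the method returns the detected code. The detection method actually used by bp_text is not documented here.

```python
# Toy stop-word lists, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "das", "und", "ist", "nicht"},
}


def guess_lang(text):
    """Return the language whose stop words occur most often, or ''."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else ""


class LangPageSketch:
    """Hypothetical stand-in for a page; not the bp_text implementation."""

    def __init__(self, raw_text="", lang=""):
        self.raw_text = raw_text
        self.lang = lang

    def detect_lang(self, set_lang=True):
        detected = guess_lang(self.raw_text)
        if set_lang:
            # Only overwrite the attribute when the caller asks for it.
            self.lang = detected
        return detected
```

Passing set_lang=False in this sketch lets a caller inspect the detected code without changing the stored lang value.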

property lang
property page_label
property page_num
property raw_text
property text

Getter for the text (read-only).

update()

Updates the instance.

property verbose

Getter/setter for verbose (bool).