bp_text.txt
This module implements functionality for TXT files.
Created: 2025-03-29
Author: Ruben Philipp <me@rubenphilipp.com>
Last modified: 11:04:38 Wed May 7 2025 CEST
Classes
TxtFile: Implementation of the text-file (txt) class.
TxtPage: This is a class implementation of a TXT page.
- class bp_text.txt.TxtFile(file: str, lang='')[source]
Bases: File
Implementation of the text-file (txt) class.
Note: The data attribute holds a list of (usually one) TxtPage object(s). This is intentionally analogous to PdfPage.
Example:
    from bp_text import txt

    ## instantiate the text file object and read its contents
    text = txt.TxtFile("something.txt")
    ## get the primary language
    print(text.lang)
    ## => "en"
- Parameters:
file (string) – The path to the text file.
lang (string) – The language of the text file (e.g. “en”, “de” etc.).
- property data
Getter/setter for the data.
- property file
Getter/setter for the file attribute.
- property file_checksum
Get the file checksum (sha256). Read-only.
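A SHA-256 file checksum of this kind can be computed with the standard library alone. The following is a minimal sketch, not the module's actual implementation; the helper name sha256_of_file and the chunk_size parameter are assumptions made for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`.

    Hypothetical helper: the name and the chunked reading strategy are
    assumptions, not taken from bp_text.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing two such digests is a cheap way to tell whether a file has changed since it was last read.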
- get_primary_lang()[source]
Detect the primary language of the text in data and set the lang attribute accordingly.
- property lang
Getter/setter for the language.
- property verbose
Getter/setter for verbose (bool).
- class bp_text.txt.TxtPage(page_num=None, page_label=None, data=None, raw_text='', lang='', verbose=False)[source]
Bases: Page
This is a class implementation of a TXT page. TXT files (.txt) usually contain only a single page. Nevertheless, to comply with the structure of PdfFile objects, TxtFile objects also use (usually) one TxtPage to store the analyzed/tokenized contents.
The data attribute (read-only) is an alias for the raw_text attribute of the respective page, while the text attribute contains the analyzed/tokenized text.
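The relationship between data, raw_text and text described above can be sketched with plain Python properties. This is an illustrative stand-in, not the TxtPage source: the class name PageSketch is hypothetical, the settable raw_text is an assumption, and whitespace splitting stands in for whatever analysis/tokenization the library actually performs.

```python
class PageSketch:
    """Hypothetical stand-in for a page with a read-only `data` alias."""

    def __init__(self, raw_text=""):
        self._raw_text = raw_text

    @property
    def raw_text(self):
        return self._raw_text

    @raw_text.setter
    def raw_text(self, value):
        # Assumption: raw_text is writable; the original does not say.
        self._raw_text = value

    @property
    def data(self):
        # Read-only alias: no setter is defined, so assignment fails.
        return self._raw_text

    @property
    def text(self):
        # Assumption: whitespace tokens stand in for the real tokenization.
        return self._raw_text.split()
```

Because data has no setter, an assignment such as page.data = "x" raises AttributeError, while changes made through raw_text are visible via data immediately.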
- Parameters:
page_num (int) – The page number (zero-based) of the page in the related file.
page_label (string) – The actual page label of the page in the file. This is used e.g. for citations in generated text.
raw_text (string) – Holds the actual raw text of the page, extracted from the data.
lang (string) – The ISO 639-1 code of the primary language (e.g. “de” or “en”).
verbose (boolean) – Print additional information during processing when True. Default = False.
- __init__(page_num=None, page_label=None, data=None, raw_text='', lang='', verbose=False)
- count_words()
Counts the words in the text.
- Returns:
The number of words in the text.
- Return type:
integer
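How words are delimited is not specified here. As a rough sketch, assuming words are simply whitespace-separated tokens, a count could look like this (the function name count_words_sketch is hypothetical):

```python
def count_words_sketch(text):
    """Count whitespace-separated tokens in `text`.

    Assumption: a "word" is any run of non-whitespace characters; the
    library may tokenize differently (e.g. by handling punctuation).
    """
    return len(text.split())
```

Under this assumption, punctuation attached to a word does not create an extra word, and an empty string yields 0.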
- property data
Getter (alias) for the raw_text (read-only).
- detect_lang(set_lang=True)
Detect the primary language of text in the page.
- Parameters:
set_lang (boolean) – When True, automatically set the lang attribute of the page. Default = True.
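The effect of the set_lang flag can be illustrated with a toy detector. The sketch below deliberately swaps in a naive stop-word count in place of the library's real detection method, which is not documented here; the class LangPageSketch, the STOPWORDS table, and the returned value are all assumptions.

```python
# Tiny, hypothetical stop-word lists keyed by ISO 639-1 code.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
}


class LangPageSketch:
    """Hypothetical page object used only to illustrate `set_lang`."""

    def __init__(self, raw_text="", lang=""):
        self.raw_text = raw_text
        self.lang = lang

    def detect_lang(self, set_lang=True):
        words = self.raw_text.lower().split()
        # Score each language by how many of its stop words occur.
        scores = {
            code: sum(word in stop for word in words)
            for code, stop in STOPWORDS.items()
        }
        best = max(scores, key=scores.get)
        detected = best if scores[best] > 0 else ""
        if set_lang:
            # Only overwrite the attribute when the caller asks for it.
            self.lang = detected
        return detected
```

Calling detect_lang(set_lang=False) in this sketch returns the detected code without touching lang, which is useful for inspecting a result before committing to it.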
- property lang
- property page_label
- property page_num
- property raw_text
- property text
Getter for the text (read-only).
- update()
Updates the instance.
- property verbose
Getter/setter for verbose (bool).