bp_text.page
This module implements the page class.
Created: 2025-03-28 Author: Ruben Philipp <me@rubenphilipp.com>
Last modified: 00:35:26 Tue Apr 29 2025 CEST
Classes
Page | Abstract base class for a page.
- class bp_text.page.Page(page_num=None, page_label=None, data=None, raw_text='', lang='', verbose=False)[source]
Bases: ABC
Abstract base class for a page.
Note: The text attribute holds a Text object containing tokenized text derived from the raw_text.
- Parameters:
page_num (int) – The page number (zero-based) of the page in the related file.
page_label (string) – The actual page label of the page. Its value and meaning differ from page_num because it reflects the page numbering used in the document itself. It can therefore be a roman numeral or be counted from a starting index other than 0.
data (undefined) – Holds page data.
raw_text (string) – Holds the actual raw text of the page, extracted from the data.
lang (string) – The language code of the primary language in the alpha3/ISO 639-1 form.
verbose (boolean) – Print additional information during execution when True. Default = False.
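The difference between page_num and page_label can be illustrated with a small, self-contained sketch. It does not use bp_text. The helpers int_to_roman and label_for and the front_matter_pages parameter are hypothetical names chosen for this example, and the labelling scheme (roman numerals for front matter, arabic numerals afterwards) is only one of many a document might use.

```python
def int_to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def label_for(page_num: int, front_matter_pages: int = 0) -> str:
    """Map a zero-based page_num to a printed page label (hypothetical scheme)."""
    if page_num < front_matter_pages:
        # Front matter is labelled i, ii, iii, ...
        return int_to_roman(page_num + 1)
    # The main body restarts at 1 after the front matter.
    return str(page_num - front_matter_pages + 1)


# Six pages, the first three of which are front matter.
labels = [label_for(n, front_matter_pages=3) for n in range(6)]
print(labels)
```

Here page_num always runs 0, 1, 2, ... through the file, while the label follows whatever numbering the document prints on the page.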
- count_words()[source]
Counts the words in the text.
- Returns:
The number of words in the text.
- Return type:
integer
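As a rough illustration of what a word count over page text involves, the following sketch counts word-like runs in a raw string. It is an assumption-laden stand-in: the real method operates on the page's Text object, whose tokenization is not documented here, so the counts may differ. The function name count_words_sketch is hypothetical.

```python
import re


def count_words_sketch(raw_text: str) -> int:
    """Count runs of word characters (letters, digits, underscore)."""
    # Note: this splits on apostrophes and hyphens, so "don't" counts as two.
    return len(re.findall(r"\w+", raw_text))


sample = "The page holds raw text, tokenized later."
word_count = count_words_sketch(sample)
print(word_count)
```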
- property data
- detect_lang(set_lang=True)[source]
Detect the primary language of text in the page.
- Parameters:
set_lang (boolean) – When True, automatically set the lang attribute of the page. Default = True.
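The effect of the set_lang flag can be sketched with a stand-in class. Everything below is hypothetical: LangDemoPage is not bp_text.page.Page, toy_detector is a deliberately naive stop-word counter rather than the detection backend the library uses, the language codes are illustrative, and returning the detected code is an assumption, since the documentation above does not state a return value.

```python
def toy_detector(text: str) -> str:
    """Naive stand-in detector: compare counts of a few stop words."""
    german = {"der", "die", "das", "und", "ist"}
    english = {"the", "and", "is", "of", "a"}
    words = text.lower().split()
    de_hits = sum(w in german for w in words)
    en_hits = sum(w in english for w in words)
    return "deu" if de_hits > en_hits else "eng"


class LangDemoPage:
    """Hypothetical stand-in used only to show the set_lang behaviour."""

    def __init__(self, raw_text: str = "", lang: str = ""):
        self.raw_text = raw_text
        self.lang = lang

    def detect_lang(self, set_lang: bool = True) -> str:
        detected = toy_detector(self.raw_text)
        if set_lang:
            # Only overwrite the stored language when asked to.
            self.lang = detected
        return detected  # assumption: the real method may return nothing


page = LangDemoPage(raw_text="der Hund und die Katze")
probe = page.detect_lang(set_lang=False)  # detect without storing
lang_after_probe = page.lang
page.detect_lang()                        # default: store the result
lang_after_set = page.lang
```

With set_lang=False the page's lang stays as it was, which is useful when the language was set explicitly at construction and detection is only meant as a check.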
- property lang
- property page_label
- property page_num
- property raw_text
- property text
Getter for the Text (read-only).
- property verbose
Verbose setter/getter (bool)
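The distinction between a read-only property such as text and a setter/getter property such as verbose can be sketched with Python's built-in property decorator. PropertyDemoPage is a hypothetical stand-in, and returning a list of whitespace-separated tokens is an assumption made for brevity; the real text property returns a Text object.

```python
class PropertyDemoPage:
    """Hypothetical stand-in illustrating the two property styles."""

    def __init__(self, raw_text: str = "", verbose: bool = False):
        self._raw_text = raw_text
        self._verbose = bool(verbose)

    @property
    def text(self):
        # Read-only: no setter is defined, so assignment raises AttributeError.
        return self._raw_text.split()

    @property
    def verbose(self) -> bool:
        return self._verbose

    @verbose.setter
    def verbose(self, value) -> None:
        self._verbose = bool(value)


demo = PropertyDemoPage(raw_text="some raw text")
demo.verbose = True  # allowed: verbose has a setter

try:
    demo.text = ["replaced"]  # not allowed: text is read-only
    text_is_read_only = False
except AttributeError:
    text_is_read_only = True
```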