bp_text.page

This module implements the page class.

Created: 2025-03-28 Author: Ruben Philipp <me@rubenphilipp.com>

Last modified: 00:35:26 Tue Apr 29 2025 CEST

Classes

Page([page_num, page_label, data, raw_text, ...])

Abstract base class for a page.

class bp_text.page.Page(page_num=None, page_label=None, data=None, raw_text='', lang='', verbose=False)[source]

Bases: ABC

Abstract base class for a page.

Note: The text attribute holds a Text object containing tokenized text derived from the raw_text.

Parameters:
  • page_num (int) – The zero-based index of the page in the related file.

  • page_label (string) – The label of the page as it appears in the document's own numbering. It differs in value and meaning from page_num: it could be a roman numeral, or be counted from a starting index other than 0.

  • data (undefined) – Holds page data.

  • raw_text (string) – Holds the actual raw text of the page, extracted from the data.

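  • lang (string) – The language code of the primary language in alpha-3 (three-letter) form. Note that three-letter codes are defined by ISO 639-2 and ISO 639-3; ISO 639-1 defines two-letter codes.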

  • verbose (boolean) – Print additional information during execution when True. Default = False.
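
The difference between page_num and page_label can be illustrated with a small sketch. It is not part of bp_text: the helpers to_roman and label_for, and the assumption that a document has four front-matter pages labelled with roman numerals, are hypothetical and serve only to show how a zero-based index and a printed label can diverge.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        # Greedily subtract the largest numeral value that still fits.
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def label_for(page_num: int, front_matter_pages: int = 4) -> str:
    """Map a zero-based page_num to a hypothetical printed page label.

    Assumption: the first `front_matter_pages` pages carry roman
    numerals, and the main text restarts its numbering at 1.
    """
    if page_num < front_matter_pages:
        return to_roman(page_num + 1)
    return str(page_num - front_matter_pages + 1)


labels = [label_for(i) for i in range(6)]
print(labels)  # ['i', 'ii', 'iii', 'iv', '1', '2']
```

In this sketch, the page with page_num 4 would carry the page_label '1'.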

__init__(page_num=None, page_label=None, data=None, raw_text='', lang='', verbose=False)[source]
count_words()[source]

Counts the words in the text.

Returns:

The number of words in the text.

Return type:

integer

property data
detect_lang(set_lang=True)[source]

Detect the primary language of text in the page.

Parameters:

set_lang (boolean) – When True, automatically set the lang attribute of the page to the detected language. Default = True.

property lang
property page_label
property page_num
property raw_text
property text

Getter for the Text (read-only).

update()[source]

Updates the instance.

property verbose

Getter and setter for the verbose flag (bool).
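
How a read-only text property, update() and count_words() might fit together can be sketched with a stand-in class. SketchPage below is hypothetical and is not the bp_text implementation. It assumes that update() re-derives the tokens from raw_text, that setting raw_text triggers update(), and it uses a plain list of regex-matched tokens in place of the Text object.

```python
import re


class SketchPage:
    """Hypothetical stand-in for illustration; not bp_text.page.Page."""

    def __init__(self, page_num=None, raw_text=""):
        self.page_num = page_num
        self._raw_text = raw_text
        self._text = []
        self.update()

    @property
    def raw_text(self):
        return self._raw_text

    @raw_text.setter
    def raw_text(self, value):
        # Assumption: changing the raw text refreshes the derived tokens.
        self._raw_text = value
        self.update()

    @property
    def text(self):
        # No setter is defined, so the attribute is read-only.
        return self._text

    def update(self):
        # Naive tokenization: runs of word characters count as tokens.
        self._text = re.findall(r"\w+", self._raw_text)

    def count_words(self):
        return len(self._text)


page = SketchPage(page_num=0, raw_text="A short example page.")
print(page.count_words())  # 4

page.raw_text = "Two words"
print(page.count_words())  # 2
```

Because text has a getter only, an assignment such as page.text = [] raises an AttributeError in this sketch.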