bp_text.pdf

This module implements classes for working with PDF files and for extracting text from their pages.

Created: 2025-03-27 Author: Ruben Philipp <me@rubenphilipp.com>

$$ Last modified: 11:05:21 Wed May 7 2025 CEST

Classes

`PdfFile`(file[, auto_extract, use_ocr, ...])	This is a class implementation of a PDF file.
`PdfPage`([page_num, page_label, data, ...])	This is a class implementation for a PDF page.

class bp_text.pdf.PdfFile(file: str, auto_extract=True, use_ocr=False, fallback_to_ocr=True, ocr_dpi=300, ocr_default_lang='eng', verbose=True)[source]

Bases: File

This is a class implementation of a PDF file. A PDF file object is related to an actual PDF file (e.g. retrieved from a database entry). Its methods facilitate e.g. the retrieval of data/text from the pages.

The data attribute holds the PdfPage objects as a list.

Examples:

## load a PDF file and extract the content from its pages
from bp_text import pdf
pdfFile = pdf.PdfFile("bajohr2024a.pdf", auto_extract=True)

## get the primary language
print(pdfFile.lang)
## => "de"

## get the label from the second page (in this case a roman numeral)
pdfFile.data[1].page_label
## => "II"

## get the text from the third page
pdfFile.data[2].text

Parameters:

file (string) – The filepath.
auto_extract (boolean) – Automatically extract the text from all pages in the file when instantiating the object? This also automatically creates PdfPage objects for each page. Default = True
use_ocr (boolean) – Use OCR by default for text extraction? Default = False
fallback_to_ocr (boolean) – If text extraction without OCR yields little text, fall back to OCR? Default = True
ocr_dpi (integer) – The DPI amount for OCR. Default = 300
ocr_default_lang (string) – The default language for OCR. Default = “eng”
verbose (boolean) – When True, print additional information while processing. Default = True
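
The interplay of use_ocr and fallback_to_ocr can be sketched as follows. This is a minimal illustration and not the module's actual implementation: the helper name choose_text, the two extractor callables, and the min_chars threshold are hypothetical, since the documentation does not state how "little text" is measured.

```python
from typing import Callable


def choose_text(
    extract_direct: Callable[[], str],
    extract_ocr: Callable[[], str],
    use_ocr: bool = False,
    fallback_to_ocr: bool = True,
    min_chars: int = 20,  # hypothetical threshold for "little text"
) -> str:
    """Return page text, preferring direct extraction unless OCR is requested."""
    if use_ocr:
        return extract_ocr()
    text = extract_direct()
    # Fall back to OCR only when direct extraction yields too little text.
    if fallback_to_ocr and len(text.strip()) < min_chars:
        return extract_ocr()
    return text
```

With the defaults, a scanned page that yields an empty string from direct extraction would be passed on to the OCR callable, while a page with a regular text layer would never trigger OCR.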

__init__(file: str, auto_extract=True, use_ocr=False, fallback_to_ocr=True, ocr_dpi=300, ocr_default_lang='eng', verbose=True)[source]

property auto_extract: Do auto-extraction?

property data: Getter/setter for the data.

extract_text()[source]: Extract text from a PDF using direct extraction or OCR. Returns a list of PdfPage objects.

extract_text_with_ocr()[source]: Extract text from a PDF using Tesseract OCR. Returns a list of PdfPage objects.

extract_text_without_ocr()[source]: Extract text from a PDF using pypdf. Returns a list of PdfPage objects.

property file: Getter/setter for the file attribute.

property file_checksum: Get the file checksum (sha256). Read-only.
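
For orientation, a SHA-256 file checksum can be computed with the standard library as sketched below. The function name sha256_of_file and the chunk size are assumptions made for illustration; the documentation does not show how file_checksum is computed internally.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex-encoded SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```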

get_page(page_index)[source]: Returns the PdfPage object for the page at page_index (zero-based).

get_page_label(page_num)[source]

Returns the label (i.e. the page number according to the PDF number tree) of a PDF page by index (page_num, zero-based).

Parameters: page_num (integer) – The zero-based index of the page whose label should be retrieved.
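
To illustrate why a label can differ from the zero-based index, the sketch below models a simplified, hypothetical document whose front matter is labelled with uppercase roman numerals and whose body restarts at 1. Real PDF page labels are read from the document itself; the functions to_roman and page_label and the front_matter_pages parameter are illustrative assumptions and not part of bp_text.

```python
def to_roman(number: int) -> str:
    """Convert a positive integer to an uppercase roman numeral."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    result = ""
    for value, symbol in numerals:
        while number >= value:
            result += symbol
            number -= value
    return result


def page_label(page_num: int, front_matter_pages: int) -> str:
    """Label for a zero-based page index in the simplified model:
    the first `front_matter_pages` pages are I, II, III, ... and
    the remaining pages are 1, 2, 3, ..."""
    if page_num < front_matter_pages:
        return to_roman(page_num + 1)
    return str(page_num - front_matter_pages + 1)
```

In this model, with four front-matter pages, index 1 carries the label "II" and index 4 carries the label "1".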

get_primary_lang()[source]: Get the primary language of a PDF.
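
The documentation does not say how the primary language is determined. One plausible approach, shown here purely as an assumption, is to take the most frequent per-page language code; the function name primary_lang is hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def primary_lang(page_langs: Iterable[str]) -> Optional[str]:
    """Return the most frequent non-empty language code, or None if there is none."""
    # Pages without a detected language (empty string) are ignored.
    counts = Counter(lang for lang in page_langs if lang)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```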

property lang: The language.

property reader: The pypdf.PdfReader object (read-only).

set_reader()[source]

This method sets the reader slot to the file.

This was previously done in the update method, but since pypdf objects (such as the reader) cannot be pickled, the process is separated here so that the reader can at least be reconstructed when unpickling a PdfFile object.
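
The general pattern described here, dropping an unpicklable attribute when pickling and rebuilding it afterwards, can be sketched with the standard pickle hooks. The Document class below is a hypothetical stand-in and not the PdfFile implementation; a threading.Lock takes the place of the reader because it, too, cannot be pickled.

```python
import pickle
import threading


class Document:
    """Hypothetical stand-in for an object that owns an unpicklable helper."""

    def __init__(self, file: str):
        self.file = file
        self.reader = None
        self.set_reader()

    def set_reader(self) -> None:
        # A lock stands in for the reader: it cannot be pickled either.
        self.reader = threading.Lock()

    def __getstate__(self) -> dict:
        state = self.__dict__.copy()
        state["reader"] = None  # drop the unpicklable attribute
        return state

    def __setstate__(self, state: dict) -> None:
        self.__dict__.update(state)
        self.set_reader()  # rebuild the helper from the restored state


# Round trip: the helper is recreated instead of being serialized.
restored = pickle.loads(pickle.dumps(Document("example.pdf")))
```

Keeping the reconstruction in a dedicated method means the same code path serves both normal initialization and unpickling.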

update()[source]: Updates the instance.

property verbose: Getter/setter for verbose (bool).

class bp_text.pdf.PdfPage(page_num=None, page_label=None, data=None, raw_text='', lang='', file=None, verbose=True)[source]

Bases: Page

This is a class implementation for a PDF page. A PDF page is a reference to a page in a PDF document, usually related to a bp_text.pdf.PdfFile object.

The data attribute is also capable of holding a pypdf.PageObject (optional), while the text attribute contains the analyzed/tokenized text.

Please note that the pypdf.PageObject instances in the data attribute will not be (re-)stored when (un-)pickling the PdfPage.

Parameters:

page_num (int) – The page number (zero-based) of the page in the related file.
page_label (string) – The actual page label of the page in the PDF file. The label is defined in the PDF document itself and can differ from page_num (e.g. when the numbering starts at a different index or uses roman instead of arabic numerals).
data (pypdf.PageObject) – An optional pypdf.PageObject. Default = None
raw_text (string) – Holds the actual raw text of the page, extracted from the data.
lang (string) – The language code of the primary language in ISO 639-1 format (e.g. “de” or “en”).
file (PdfFile) – An optional (back-)reference to a PdfFile object.
verbose (boolean) – When True, print additional information while processing. Default = True

__init__(page_num=None, page_label=None, data=None, raw_text='', lang='', file=None, verbose=True)[source]: Constructor method.

count_words()

Counts the words in the text.

Returns: The number of words in the text.
Return type: integer
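
As a rough illustration of word counting, the sketch below counts whitespace-separated tokens. This is an assumption: the page's text attribute is described as analyzed/tokenized, so the actual method may count words differently.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens in `text`."""
    # str.split() without arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())
```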

property data: Read/write property for the data of the object.

detect_lang(set_lang=True)

Detect the primary language of text in the page.

Parameters: set_lang (boolean) – When True, automatically set the language attribute of the page. Default = True

extract_text(update_text=True)[source]

Extract text from a PDF page using direct extraction. Returns the text as a string.

Parameters: update_text (boolean) – Update the text attribute with the extracted text? Default = True
Returns: The retrieved text.
Return type: string

property file: Read/write property.

property lang

property page_label

property page_num

property raw_text

property text: Getter for the Text (read-only).

update(): Updates the instance.

property verbose: Verbose setter/getter (bool)