bp_text.pdf

This module implements classes for working with PDF files and for extracting text from their pages.

Created: 2025-03-27 Author: Ruben Philipp <me@rubenphilipp.com>

Last modified: 11:05:21 Wed May 7 2025 CEST

Classes

PdfFile(file[, auto_extract, use_ocr, ...])

This is a class implementation of a PDF file.

PdfPage([page_num, page_label, data, ...])

This is a class implementation for a PDF page.

class bp_text.pdf.PdfFile(file: str, auto_extract=True, use_ocr=False, fallback_to_ocr=True, ocr_dpi=300, ocr_default_lang='eng', verbose=True)[source]

Bases: File

This is a class implementation of a PDF file. A PDF file object is related to an actual PDF file (e.g. retrieved from a database entry). Its methods facilitate e.g. the retrieval of data/text from the pages.

The data attribute holds the PdfPage objects as a list.

Examples:

from bp_text import pdf

## load a PDF file and extract the content from its pages
pdfFile = pdf.PdfFile("bajohr2024a.pdf", auto_extract=True)

## get the primary language
print(pdfFile.lang)
## => "de"

## get the label from the second page (in this case a roman numeral)
pdfFile.data[1].page_label
## => "II"

## get the text from the third page
pdfFile.data[2].text

Parameters:
  • file (string) – The filepath.

  • auto_extract (boolean) – Automatically extract the text from all pages in the file when instantiating the object? This also automatically creates PdfPage objects for each page. Default = True

  • use_ocr (boolean) – Use OCR by default for text extraction? Default = False

  • fallback_to_ocr (boolean) – If text extraction without OCR yields little text, fall back to OCR? Default = True

  • ocr_dpi (integer) – The resolution (in DPI) used for OCR. Default = 300

  • ocr_default_lang (string) – The default language for OCR. Default = “eng”

  • verbose (boolean) – Print additional information during execution when True. Default = True

__init__(file: str, auto_extract=True, use_ocr=False, fallback_to_ocr=True, ocr_dpi=300, ocr_default_lang='eng', verbose=True)[source]

property auto_extract

Do auto-extraction?

property data

Getter/setter for the data.

extract_text()[source]

Extract text from a PDF using direct extraction or OCR. Returns a list of PdfPage objects.
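The interplay of use_ocr and fallback_to_ocr can be sketched roughly as follows. This is a minimal illustration, not the actual bp_text code: the function extract_with_fallback, the two extractor callables and the MIN_CHARS_PER_PAGE threshold are hypothetical, and the criterion bp_text uses for "little text" is not documented here.

```python
from typing import Callable, List

# Hypothetical threshold: average number of characters per page below which
# direct extraction is considered to have yielded too little text.
MIN_CHARS_PER_PAGE = 20


def extract_with_fallback(
    direct: Callable[[], List[str]],
    ocr: Callable[[], List[str]],
    use_ocr: bool = False,
    fallback_to_ocr: bool = True,
) -> List[str]:
    """Return one text string per page, choosing between two extractors."""
    if use_ocr:
        # OCR was requested explicitly, so skip direct extraction.
        return ocr()
    pages = direct()
    total_chars = sum(len(page.strip()) for page in pages)
    too_little = total_chars < MIN_CHARS_PER_PAGE * max(len(pages), 1)
    if fallback_to_ocr and too_little:
        # Direct extraction produced almost nothing (e.g. a scanned PDF).
        return ocr()
    return pages
```

Passing the extractors as callables keeps the decision logic independent of pypdf and Tesseract, so the sketch can be tried without either installed.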

extract_text_with_ocr()[source]

Extract text from a PDF using Tesseract OCR. Returns a list of PdfPage objects.

extract_text_without_ocr()[source]

Extract text from a PDF using pypdf. Returns a list of PdfPage objects.

property file

Getter/setter for the file attribute.

property file_checksum

Get the file checksum (sha256). Read-only.
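For reference, a SHA-256 checksum of a file can be computed with the standard library as shown below. This is a sketch only; the helper name sha256_of_file and the chunk size are assumptions, and the way bp_text computes the value internally may differ.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large PDFs need not fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```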

get_page(page_index)[source]

Returns the PdfPage object for the page at page_index (zero-based).

get_page_label(page_num)[source]

Returns the label (i.e. the page number according to the PDF number tree) of a PDF page by index (page_num, zero-based).

Parameters:

page_num (integer) – The page number (zero-based) the label should be retrieved from.
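To illustrate why a label can differ from the zero-based index, the sketch below maps indices to labels for a document whose front matter is numbered in roman numerals. It is purely illustrative: in a real PDF the labels come from the file itself, and label_for_index and front_matter_pages are hypothetical names, not part of bp_text.

```python
def to_roman(number: int) -> str:
    """Convert a positive integer to an upper-case roman numeral."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in numerals:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def label_for_index(page_num: int, front_matter_pages: int) -> str:
    """Return the label for a zero-based page index (hypothetical layout)."""
    if page_num < front_matter_pages:
        # Front matter: I, II, III, ...
        return to_roman(page_num + 1)
    # Main matter restarts at 1 after the front matter.
    return str(page_num - front_matter_pages + 1)
```

With four front-matter pages, index 1 carries the label "II" and index 4 the label "1".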

get_primary_lang()[source]

Get the primary language of a PDF.
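How the primary language is determined is not stated here. One plausible approach, shown only as an assumption, is a majority vote over the per-page language codes; primary_lang is a hypothetical helper and not the bp_text implementation.

```python
from collections import Counter
from typing import Iterable, Optional


def primary_lang(page_langs: Iterable[str]) -> Optional[str]:
    """Return the most frequent non-empty language code, or None."""
    # Pages without a detected language (empty string) are ignored.
    counts = Counter(lang for lang in page_langs if lang)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```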

property lang

The primary language of the PDF.

property reader

The pypdf.PdfReader object (read-only).

set_reader()[source]

This method sets the reader attribute to a reader for the file.

This was previously done in the update method, but since pypdf objects (such as the reader) cannot be pickled, we separate this process here in order to at least be able to reconstruct the reader when unpickling a PdfFile object.
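The general pattern described above can be sketched with a stand-in class. Everything here is hypothetical (FakeReader and Document are not bp_text classes), and whether PdfFile rebuilds its reader automatically on unpickling or expects an explicit set_reader() call is an assumption of this sketch.

```python
import pickle


class FakeReader:
    """Stand-in for a reader object that refuses to be pickled."""

    def __init__(self, path):
        self.path = path

    def __reduce__(self):
        raise TypeError("FakeReader cannot be pickled")


class Document:
    """Hypothetical container that keeps its reader out of the pickle."""

    def __init__(self, path):
        self.path = path
        self.reader = None
        self.set_reader()

    def set_reader(self):
        # Kept separate so it can be called again after unpickling.
        self.reader = FakeReader(self.path)

    def __getstate__(self):
        state = self.__dict__.copy()
        state["reader"] = None  # drop the unpicklable object
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.set_reader()  # reconstruct the reader from the stored path
```

A round trip through pickle.dumps and pickle.loads then yields a Document with a fresh reader for the same path.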

update()[source]

Updates the instance.

property verbose

Getter/setter for verbose (bool).

class bp_text.pdf.PdfPage(page_num=None, page_label=None, data=None, raw_text='', lang='', file=None, verbose=True)[source]

Bases: Page

This is a class implementation for a PDF page. A PDF page is a reference to a page in a PDF document, usually related to a bp_text.pdf.PdfFile object.

The data attribute is also capable of holding a pypdf.PageObject (optional), while the text attribute contains the analyzed/tokenized text.

Please note that the pypdf.PageObject instances in the data attribute will not be (re-)stored when (un-)pickling the PdfPage.

Parameters:
  • page_num (int) – The page number (zero-based) of the page in the related file.

  • page_label (string) – The actual page label of the page in the PDF file. The actual PDF page number/label is defined in the PDF header and could differ from the page_num (e.g. by using a different start index or a roman instead of an arabic numeral).

  • data (A pypdf.PageObject) – A pypdf.PageObject. Default = None

  • raw_text (string) – Holds the actual raw text of the page, extracted from the data.

  • lang (string) – The language code of the primary language in ISO 639-1 form (e.g. “de” or “en”).

  • file (A PdfFile object.) – An optional (back-)reference to a PdfFile object.

  • verbose (boolean) – Print additional information during execution when True. Default = True

__init__(page_num=None, page_label=None, data=None, raw_text='', lang='', file=None, verbose=True)[source]

Constructor method.

count_words()

Counts the words in the text.

Returns:

The number of words in the text.

Return type:

integer

property data

Read/write property for the data of the object.

detect_lang(set_lang=True)

Detect the primary language of text in the page.

Parameters:

set_lang (boolean) – When true, automatically set the language attribute of the page. Default = True.

extract_text(update_text=True)[source]

Extract text from a PDF page using direct extraction. Returns the text as a string.

Parameters:

update_text (boolean) – Update the text attribute with the extracted text?

Returns:

The retrieved text.

Return type:

string

property file

Read/write property.

property lang
property page_label
property page_num
property raw_text
property text

Getter for the text (read-only).

update()

Updates the instance.

property verbose

Getter/setter for verbose (bool).