bp_text.pdf
This module implements functionality for PDF files.
Created: 2025-03-27
Author: Ruben Philipp <me@rubenphilipp.com>
Last modified: 11:05:21 Wed May 7 2025 CEST
Classes
PdfFile – This is a class implementation of a PDF file.
PdfPage – This is a class implementation for a PDF page.
- class bp_text.pdf.PdfFile(file: str, auto_extract=True, use_ocr=False, fallback_to_ocr=True, ocr_dpi=300, ocr_default_lang='eng', verbose=True)[source]
Bases: File
This is a class implementation of a PDF file. A PDF file object is related to an actual PDF file (e.g. retrieved from a database entry). Its methods facilitate, for example, the retrieval of data/text from the pages.
The data attribute holds the PdfPage objects as a list.
Examples:
    from bp_text import pdf

    ## load a PDF file and extract the content from its pages
    pdfFile = pdf.PdfFile("bajohr2024a.pdf", auto_extract=True)
    ## get the primary language
    print(pdfFile.lang)
    ## => "de"
    ## get the label from the second page (in this case a roman numeral)
    pdfFile.data[1].page_label
    ## => "II"
    ## get the text from the third page
    pdfFile.data[2].text
- Parameters:
file (string) – The filepath.
auto_extract (boolean) – Automatically extract the text from all pages in the file when instantiating the object? This also automatically creates PdfPage objects for each page. Default = True
use_ocr (boolean) – Use OCR by default for text extraction? Default = False
fallback_to_ocr (boolean) – If text extraction without OCR yields little text, fall back to OCR? Default = True
ocr_dpi (integer) – The DPI amount for OCR. Default = 300
ocr_default_lang (string) – The default language for OCR. Default = “eng”
verbose (boolean) – Print additional information during execution when True. Default = True
- __init__(file: str, auto_extract=True, use_ocr=False, fallback_to_ocr=True, ocr_dpi=300, ocr_default_lang='eng', verbose=True)[source]
- property auto_extract
Do auto-extraction?
- property data
Getter/setter for the data.
- extract_text()[source]
Extract text from a PDF using direct extraction or OCR. Returns a list of PdfPage objects.
- extract_text_with_ocr()[source]
Extract text from a PDF using Tesseract OCR. Returns a list of PdfPage objects.
- extract_text_without_ocr()[source]
Extract text from a PDF using pypdf. Returns a list of PdfPage objects.
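The interplay of use_ocr and fallback_to_ocr described above can be sketched roughly as follows. This is only an illustration, not the module's implementation: the function name extract_with_fallback, the two callables, and the min_chars_per_page threshold are all hypothetical, since the documentation does not state how "little text" is measured.

```python
from typing import Callable, List


def extract_with_fallback(
    direct: Callable[[], List[str]],
    ocr: Callable[[], List[str]],
    use_ocr: bool = False,
    fallback_to_ocr: bool = True,
    min_chars_per_page: int = 20,  # hypothetical threshold, not from bp_text
) -> List[str]:
    """Return one text string per page, preferring direct extraction."""
    if use_ocr:
        # OCR was requested explicitly, so skip direct extraction entirely.
        return ocr()
    pages = direct()
    # Treat the result as "little text" when the average page is nearly empty.
    total_chars = sum(len(page.strip()) for page in pages)
    too_little = total_chars < min_chars_per_page * max(len(pages), 1)
    if fallback_to_ocr and too_little:
        return ocr()
    return pages
```

Passing the two extraction strategies as callables keeps the sketch independent of pypdf and Tesseract; in the real class these would presumably correspond to extract_text_without_ocr() and extract_text_with_ocr().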
- property file
Getter/setter for the file attribute.
- property file_checksum
Get the file checksum (sha256). Read-only.
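For orientation, a SHA-256 file checksum of this kind can be computed with the standard library as sketched below. The helper name sha256_of_file and the chunked reading are assumptions made for illustration; the documentation does not say how file_checksum is computed internally.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex-encoded SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```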
- get_page_label(page_num)[source]
Returns the label (i.e. the page number according to the PDF number tree) of a PDF page by index (page_num, zero-based).
- Parameters:
page_num (integer) – The page number (zero-based) the label should be retrieved from.
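To illustrate how a zero-based index can map to a label such as "II", here is a simplified sketch of page-label resolution. It is not how get_page_label is implemented: the helper names, the (start_index, style, first_value) tuples, and the reduced set of styles ("D" decimal, "R" uppercase roman, "r" lowercase roman) are assumptions loosely modelled on PDF page-label ranges.

```python
def to_roman(number: int) -> str:
    """Convert a positive integer to an uppercase roman numeral."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in numerals:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def page_label(page_num: int, ranges: list) -> str:
    """Resolve the label for a zero-based page index.

    `ranges` is a list of (start_index, style, first_value) tuples sorted by
    start_index. Pages before the first range fall back to decimal numbering.
    """
    start, style, first = 0, "D", 1
    for range_start, range_style, range_first in ranges:
        if range_start <= page_num:
            # The last range starting at or before page_num applies.
            start, style, first = range_start, range_style, range_first
    value = first + (page_num - start)
    if style == "R":
        return to_roman(value)
    if style == "r":
        return to_roman(value).lower()
    return str(value)
```

With front matter numbered in roman numerals for the first four pages and arabic numbering restarting afterwards, `page_label(1, [(0, "R", 1), (4, "D", 1)])` resolves the second page to "II".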
- property lang
The primary language of the file (e.g. "de").
- property reader
The pypdf.PdfReader object (read-only).
- set_reader()[source]
This method sets the reader slot to the file.
This was previously done in the update method, but since pypdf objects (such as the reader) cannot be pickled, this step is separated out here so that the reader can at least be reconstructed when unpickling a PdfFile object.
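The general pattern of dropping an unpicklable member and rebuilding it after unpickling can be sketched as below. This is an assumption-laden illustration: the class PicklableHandle is hypothetical, a threading.Lock stands in for the pypdf reader (both cannot be pickled), and the use of __getstate__/__setstate__ is one possible approach that PdfFile may or may not follow.

```python
import pickle
import threading


class PicklableHandle:
    """Hypothetical stand-in for an object holding an unpicklable reader."""

    def __init__(self, path: str):
        self.path = path
        self.reader = None
        self.set_reader()

    def set_reader(self):
        # A lock cannot be pickled, much like the pypdf reader in the docs.
        self.reader = threading.Lock()

    def __getstate__(self):
        # Copy the instance state and leave the unpicklable reader out.
        state = self.__dict__.copy()
        state["reader"] = None
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then rebuild the reader from them.
        self.__dict__.update(state)
        self.set_reader()


original = PicklableHandle("example.pdf")
restored = pickle.loads(pickle.dumps(original))
```

After the round trip, `restored` carries the same path and a freshly created reader rather than a copy of the original one.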
- property verbose
Getter/setter for verbose (bool).
- class bp_text.pdf.PdfPage(page_num=None, page_label=None, data=None, raw_text='', lang='', file=None, verbose=True)[source]
Bases: Page
This is a class implementation for a PDF page. A PDF page holds a reference to a page in a PDF document, usually related to a bp_text.pdf.PdfFile object.
The data attribute is also capable of holding a pypdf.PageObject (optional), while the text attribute contains the analyzed/tokenized text.
Please note that the pypdf.PageObject instances in the data attribute will not be (re-)stored when (un-)pickling the PdfPage.
- Parameters:
page_num (int) – The page number (zero-based) of the page in the related file.
page_label (string) – The actual page label of the page in the PDF file. The actual PDF page number/label is defined in the PDF header and could differ from the page_num (e.g. because the start index differs, or because the label is a roman instead of an arabic numeral).
data (A pypdf.PageObject) – A pypdf.PageObject. Default = None
raw_text (string) – Holds the actual raw text of the page, extracted from the data.
lang (string) – The language code of the primary language in the ISO-639-1 form (e.g. “de” or “en”).
file (a PdfFile object) – An optional (back-)reference to a PdfFile object.
verbose (boolean) – Print additional information during execution when True. Default = True
- __init__(page_num=None, page_label=None, data=None, raw_text='', lang='', file=None, verbose=True)[source]
Constructor method.
- count_words()
Counts the words in the text.
- Returns:
The number of words in the text.
- Return type:
integer
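As a rough idea of what such a count involves, the sketch below splits on whitespace. This is an assumption: the documentation describes the text attribute as analyzed/tokenized, so the actual method may count tokens differently (e.g. excluding punctuation).

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words in `text` (simplified sketch)."""
    # str.split() without arguments collapses runs of spaces and newlines.
    return len(text.split())
```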
- property data
Read/write property for the data of the object.
- detect_lang(set_lang=True)
Detect the primary language of text in the page.
- Parameters:
set_lang (boolean) – When true, automatically set the language attribute of the page. Default = True.
- extract_text(update_text=True)[source]
Extract text from a PDF page using direct extraction. Returns the text as a string.
- Parameters:
update_text (boolean) – Update the text attribute with the extracted text?
- Returns:
The retrieved text.
- Return type:
string
- property file
Read/write property.
- property lang
- property page_label
- property page_num
- property raw_text
- property text
Getter for the text (read-only).
- update()
Updates the instance.
- property verbose
Getter/setter for verbose (bool).