bp_text.text

This module implements the text class.

A text is, at its core, a string containing information in a given language. This module uses Flair to split the text into sentences and tokenize them. It applies various algorithms, e.g. to detect parts of speech or entities, whose results can later be used for analysis or text production.

Created: 2025-04-24 Author: Ruben Philipp <me@rubenphilipp.com>

Last modified: 16:46:43 Wed May 7 2025 CEST

Module Attributes

LANG_SPACY_MODELS

The default spaCy model for each supported language.

Functions

get_nlp(model_name)

Loads and returns a spaCy language model.

Classes

Text([text, lang])

This is a class implementation of a Text object.

bp_text.text.LANG_SPACY_MODELS = {'de': 'de_core_news_sm', 'en': 'en_core_web_sm', 'es': 'es_core_news_sm', 'fr': 'fr_core_news_sm', 'it': 'it_core_news_sm', 'nl': 'nl_core_news_sm', 'pl': 'pl_core_news_sm', 'pt': 'pt_core_news_sm', 'sv': 'sv_core_news_sm'}: The default spaCy model for each supported language, keyed by ISO 639-1 code.
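As a minimal sketch of how this mapping might be consulted, the snippet below resolves a language code to a model name. The dictionary is copied from the documented value above; the helper `model_for` and its fallback to English are hypothetical and not part of `bp_text`.

```python
# Copied from the documented value of bp_text.text.LANG_SPACY_MODELS.
LANG_SPACY_MODELS = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
    "it": "it_core_news_sm",
    "nl": "nl_core_news_sm",
    "pl": "pl_core_news_sm",
    "pt": "pt_core_news_sm",
    "sv": "sv_core_news_sm",
}


def model_for(lang: str, fallback: str = "en") -> str:
    """Hypothetical helper: return the default spaCy model name for a language code.

    Unknown codes fall back to the model registered for `fallback`.
    """
    return LANG_SPACY_MODELS.get(lang.lower(), LANG_SPACY_MODELS[fallback])


print(model_for("de"))  # a listed language
print(model_for("xx"))  # an unlisted code, resolved via the fallback
```

Whether the real module falls back to English for unlisted languages is not stated in this documentation; the fallback here is only an illustration.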

class bp_text.text.Text(text='', lang='en')[source]

Bases: object

This is a class implementation of a Text object. A text holds a natural-language text as a string, together with segmented and analysed data derived from it. The text is tokenized (using spaCy) and analysed, e.g. for parts of speech or entities. By default, Text uses Flair’s most versatile models (e.g. ‘pos-multi’ for POS tagging and ‘ner-large’ for NER tagging). These models add some overhead on loading but allow multilingual text to be analysed more precisely.

Note: The doc contains the actual segmented text.

Parameters:

text (string) – The text to be used as a basis for the analysis.
lang (string) – The primary language of the text as an ISO 639-1 code. Default: “en”.

__init__(text='', lang='en')[source]

property doc: Getter for the doc (i.e. the tokenized and analysed elements of the text). Read-only.

property text

Getter/setter for text (string).

Changing the text also causes re-generation of the sentence analyses.

update()[source]

Update the instance.

This method also performs the text segmentation and analysis.
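The interplay between the `text` setter, the read-only `doc` property and `update()` can be sketched as follows. `SketchText` is a hypothetical stand-in written for this example, not the `bp_text` implementation: it replaces the spaCy/Flair analysis with a plain whitespace split so that the snippet runs without any models installed.

```python
class SketchText:
    """Hypothetical stand-in for Text, illustrating the property pattern only."""

    def __init__(self, text: str = "", lang: str = "en"):
        self._text = text
        self.lang = lang
        self._doc = []
        self.update()

    @property
    def doc(self):
        # Read-only: no setter is defined, so assignment raises AttributeError.
        return self._doc

    @property
    def text(self) -> str:
        return self._text

    @text.setter
    def text(self, value: str) -> None:
        self._text = value
        # Re-run segmentation and analysis whenever the text changes.
        self.update()

    def update(self) -> None:
        # Placeholder analysis (assumption): a whitespace split instead of
        # the spaCy/Flair pipeline used by the real class.
        self._doc = self._text.split()


t = SketchText("Hello world")
t.text = "A new sentence here"
print(t.doc)
```

Routing every change through the setter keeps `doc` in sync with `text`, at the cost of re-running the analysis on each assignment.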

bp_text.text.get_nlp(model_name: str)[source]

This loads and returns a spaCy language model. Additionally, it caches up to 10 language models, so that repeated calls with the same model name reuse the loaded model while the cap keeps memory usage bounded.

Parameters:

model_name (string) – The name of the spaCy model (e.g. “en_core_web_sm”).
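One common way to obtain a bounded cache of this kind is `functools.lru_cache`. The sketch below assumes that approach, which this documentation does not confirm. `get_nlp_sketch` and `_load_model` are hypothetical names, and the loader returns a plain dictionary instead of calling `spacy.load`, so the example runs without spaCy installed.

```python
from functools import lru_cache

load_calls = []  # records each simulated load, for illustration only


def _load_model(model_name: str) -> dict:
    """Hypothetical stand-in for the real loader (e.g. spacy.load)."""
    load_calls.append(model_name)
    return {"name": model_name}


@lru_cache(maxsize=10)
def get_nlp_sketch(model_name: str) -> dict:
    # At most 10 distinct models are kept; the least recently used
    # entry is evicted once an eleventh model is requested.
    return _load_model(model_name)


first = get_nlp_sketch("en_core_web_sm")
second = get_nlp_sketch("en_core_web_sm")

print(first is second)   # both calls return the same cached object
print(len(load_calls))   # number of times the loader actually ran
```

With this pattern, each `Text` instance for the same language would share one loaded model rather than loading its own copy.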