bp_text.text
This module implements the text class.
A text is, first of all, a string containing information in a given language. This module uses Flair to split the text into sentences and tokenize them. It then applies various algorithms, e.g. to detect parts of speech or entities, whose results can later be used for analysis or text production.
Created: 2025-04-24 Author: Ruben Philipp <me@rubenphilipp.com>
Last modified: 16:46:43 Wed May 7 2025 CEST
Module Attributes
LANG_SPACY_MODELS | The default spaCy models for a given language.
Functions
| Loads and returns a spaCy language model.
Classes
Text | A class implementation of a Text object.
- bp_text.text.LANG_SPACY_MODELS = {'de': 'de_core_news_sm', 'en': 'en_core_web_sm', 'es': 'es_core_news_sm', 'fr': 'fr_core_news_sm', 'it': 'it_core_news_sm', 'nl': 'nl_core_news_sm', 'pl': 'pl_core_news_sm', 'pt': 'pt_core_news_sm', 'sv': 'sv_core_news_sm'}
The default spaCy model for each supported language, keyed by ISO 639-1 language code.
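The mapping above lends itself to a simple lookup by language code. The following is a minimal sketch, not the module's own loader: the helper names `model_name_for` and `load_model` are hypothetical, and the dictionary is copied from the attribute value shown above. `spacy.load` is the standard spaCy entry point; it is imported lazily so the lookup can be used without spaCy installed.

```python
# Copied from the documented value of bp_text.text.LANG_SPACY_MODELS.
LANG_SPACY_MODELS = {
    "de": "de_core_news_sm", "en": "en_core_web_sm", "es": "es_core_news_sm",
    "fr": "fr_core_news_sm", "it": "it_core_news_sm", "nl": "nl_core_news_sm",
    "pl": "pl_core_news_sm", "pt": "pt_core_news_sm", "sv": "sv_core_news_sm",
}


def model_name_for(lang: str) -> str:
    """Hypothetical helper: resolve an ISO 639-1 code to a default model name."""
    try:
        return LANG_SPACY_MODELS[lang.lower()]
    except KeyError:
        supported = ", ".join(sorted(LANG_SPACY_MODELS))
        raise ValueError(
            f"No default spaCy model for {lang!r}; supported codes: {supported}"
        ) from None


def load_model(lang: str):
    """Hypothetical helper: load the default spaCy pipeline for a language.

    Assumes spaCy and the named model package are installed.
    """
    import spacy  # imported lazily so the lookup above works without spaCy

    return spacy.load(model_name_for(lang))
```

Raising a `ValueError` that lists the supported codes is one possible design choice; the module itself may handle unknown languages differently.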
- class bp_text.text.Text(text='', lang='en')
Bases: object
This is a class implementation of a Text object. A text holds a natural-language text as a string and additionally contains segmented and analysed data derived from it. The text is tokenized (using spaCy) and analysed, e.g. for parts of speech or entities. By default, Text uses Flair’s most versatile models (e.g. ‘pos-multi’ for POS tagging and ‘ner-large’ for NER tagging). While this introduces some overhead on loading, it has the advantage of analysing multilingual text more precisely.
Note: The doc contains the actual segmented text.
- Parameters:
text (string) – The text to be used as a basis for the analysis.
lang (string) – The primary language of the text as an ISO 639-1 code. Default: “en”.
- property doc
Getter for the doc (i.e. the tokenized and analysed elements of the text). Read-only.
- property text
Getter/setter for text (string).
Changing the text also causes re-generation of the sentence analyses.
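The relationship between the two properties (a read-only `doc` that is rebuilt whenever `text` is assigned) can be sketched with plain Python properties. This is an illustrative stand-in, not the actual implementation: the class name `TextSketch` and the method `_analyse` are hypothetical, and a naive regex split replaces the spaCy/Flair analysis.

```python
import re


class TextSketch:
    """Hypothetical stand-in for Text showing only the property behaviour."""

    def __init__(self, text="", lang="en"):
        self.lang = lang
        self.text = text  # goes through the setter, which builds the doc

    @property
    def doc(self):
        """Read-only: no setter is defined, so assignment raises AttributeError."""
        return self._doc

    @property
    def text(self):
        return self._text

    @text.setter
    def text(self, value):
        self._text = value
        # Re-generate the analysis every time the text changes.
        self._doc = self._analyse(value)

    @staticmethod
    def _analyse(text):
        # Assumption: naive sentence split and whitespace tokenization,
        # standing in for the real spaCy/Flair pipeline.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        return [sentence.split() for sentence in sentences if sentence]
```

With the real class, the equivalent usage would presumably be `t = Text("Some text.", lang="en")`, followed by reading `t.doc` or assigning a new string to `t.text`.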