somajo_transformers
This module offers tools that combine Hugging Face Transformers and SoMaJo, and it depends on both libraries.
Hint
Use pip to install the necessary dependencies for this module:
pip install mltb2[somajo_transformers]
- class mltb2.somajo_transformers.TextSplitter(max_token: int, somajo_sentence_splitter: SoMaJoSentenceSplitter, transformers_token_counter: TransformersTokenCounter, show_progress_bar: bool = False, ignore_overly_long_sentences: bool = False)
Bases: object
Split a text into sections that each contain at most a specified number of tokens.
Sections are built from whole sentences; neither words nor sentences are divided.
- Parameters:
max_token (int) – Maximum number of tokens per text section.
somajo_sentence_splitter (SoMaJoSentenceSplitter) – The sentence splitter to be used.
transformers_token_counter (TransformersTokenCounter) – The token counter to be used.
show_progress_bar (bool) – Show a progress bar during processing.
ignore_overly_long_sentences (bool) – If this is False, a ValueError exception is raised if a sentence is longer than max_token. If it is True, the sentence is simply ignored.
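The behaviour described by these parameters can be illustrated with a minimal, library-free sketch. It is not the mltb2 implementation: the function pack_sentences, the greedy packing strategy, the whitespace-based whitespace_token_count stand-in for a real token counter, and joining sentences with a single space are all assumptions made for illustration. Only the max_token limit and the ignore_overly_long_sentences semantics are taken from the description above.

```python
from typing import Callable, List


def pack_sentences(
    sentences: List[str],
    count_tokens: Callable[[str], int],
    max_token: int,
    ignore_overly_long_sentences: bool = False,
) -> List[str]:
    """Greedily pack whole sentences into sections of at most max_token tokens.

    Hypothetical helper for illustration only; not part of mltb2.
    """
    sections: List[str] = []
    current: List[str] = []  # sentences of the section being built
    current_tokens = 0       # token total of the sentences in `current`

    for sentence in sentences:
        n = count_tokens(sentence)
        if n > max_token:
            # A single sentence that does not fit cannot be split further.
            if ignore_overly_long_sentences:
                continue
            raise ValueError(
                f"Sentence has {n} tokens, which exceeds max_token={max_token}."
            )
        # Close the current section if this sentence would overflow it.
        if current and current_tokens + n > max_token:
            sections.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n

    if current:
        sections.append(" ".join(current))
    return sections


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
sections = pack_sentences(sentences, whitespace_token_count, max_token=5)
for section in sections:
    print(section)
```

In the real class, sentence splitting and token counting are delegated to the somajo_sentence_splitter and transformers_token_counter arguments. A subword tokenizer will generally report different counts than the whitespace stand-in used here.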