This module offers tools specific to Hugging Face Transformers and SoMaJo.

This module is based on Hugging Face Transformers and SoMaJo.


Use pip to install the necessary dependencies for this module: pip install mltb2[somajo_transformers]

class mltb2.somajo_transformers.TextSplitter(max_token: int, somajo_sentence_splitter: SoMaJoSentenceSplitter, transformers_token_counter: TransformersTokenCounter, show_progress_bar: bool = False, ignore_overly_long_sentences: bool = False)

Bases: object

Split the text into sections with a specified maximum token number.

Sections always consist of whole sentences; neither words nor sentences are ever divided.

Parameters:

  • max_token (int) – Maximum number of tokens per text section.

  • somajo_sentence_splitter (SoMaJoSentenceSplitter) – The sentence splitter to be used.

  • transformers_token_counter (TransformersTokenCounter) – The token counter to be used.

  • show_progress_bar (bool) – Show a progress bar during processing.

  • ignore_overly_long_sentences (bool) – If False, a ValueError is raised when a sentence is longer than max_token. If True, such a sentence is ignored and left out of the result.
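The behavior described by these parameters can be illustrated with a minimal sketch. This is not the mltb2 implementation: the function split_into_sections, the regex-based naive_sentence_splitter, and the whitespace_token_counter are hypothetical stand-ins for TextSplitter, SoMaJoSentenceSplitter, and TransformersTokenCounter, and joining sentences with a single space is an assumption. It only shows one plausible way to pack whole sentences greedily into sections of at most max_token tokens.

```python
import re
from typing import Callable, List


def split_into_sections(
    text: str,
    max_token: int,
    split_sentences: Callable[[str], List[str]],
    count_tokens: Callable[[str], int],
    ignore_overly_long_sentences: bool = False,
) -> List[str]:
    """Greedily pack whole sentences into sections of at most max_token tokens.

    Hypothetical illustration only; not the mltb2 code.
    """
    sections: List[str] = []
    current: List[str] = []  # sentences collected for the section being built
    current_tokens = 0

    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if n > max_token:
            # A single sentence can never fit: skip it or fail.
            if ignore_overly_long_sentences:
                continue
            raise ValueError(
                f"Sentence has {n} tokens, more than max_token={max_token}."
            )
        # Close the current section if this sentence would overflow it.
        if current and current_tokens + n > max_token:
            sections.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n

    if current:
        sections.append(" ".join(current))
    return sections


def naive_sentence_splitter(text: str) -> List[str]:
    """Stand-in sentence splitter: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def whitespace_token_counter(sentence: str) -> int:
    """Stand-in token counter: count whitespace-separated words."""
    return len(sentence.split())


example_text = "One two three. Four five. Six seven eight nine."
sections = split_into_sections(
    example_text,
    max_token=5,
    split_sentences=naive_sentence_splitter,
    count_tokens=whitespace_token_counter,
)
print(sections)
```

With max_token=5 the first two sentences (3 + 2 tokens under the stand-in counter) share one section and the third starts a new one. A real tokenizer would produce different counts, so section boundaries would differ.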

__call__(text: str) → List[str]

Split the text into sections.


Parameters:

text (str) – The text to be split.


Returns:

The list of text sections.

Return type: