somajo_transformers

This module offers tools specific to Hugging Face Transformers and SoMaJo.

This module is based on Hugging Face Transformers and SoMaJo.

Hint

Use pip to install the necessary dependencies for this module: pip install mltb2[somajo_transformers]

class mltb2.somajo_transformers.TextSplitter(max_token: int, somajo_sentence_splitter: SoMaJoSentenceSplitter, transformers_token_counter: TransformersTokenCounter, show_progress_bar: bool = False, ignore_overly_long_sentences: bool = False)

Bases: object

Split the text into sections with a specified maximum token number.

The text is split only at sentence boundaries: words and sentences are never divided.

Parameters:
  • max_token (int) – Maximum number of tokens per text section.

  • somajo_sentence_splitter (SoMaJoSentenceSplitter) – The sentence splitter to be used.

  • transformers_token_counter (TransformersTokenCounter) – The token counter to be used.

  • show_progress_bar (bool) – Show a progress bar during processing.

  • ignore_overly_long_sentences (bool) – If False, a ValueError is raised when a sentence is longer than max_token. If True, such a sentence is simply ignored.
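The behavior described above can be illustrated with a minimal, self-contained sketch. It is not the mltb2 implementation: the function split_into_sections, the whitespace-based whitespace_token_count, and the choice to join sentences with a single space are all hypothetical stand-ins for illustration. It assumes that sentences are packed greedily and that the token count of a section is the sum of its sentences' token counts.

```python
from typing import Callable, List


def split_into_sections(
    sentences: List[str],
    count_tokens: Callable[[str], int],
    max_token: int,
    ignore_overly_long_sentences: bool = False,
) -> List[str]:
    """Greedily pack whole sentences into sections of at most max_token tokens.

    Hypothetical helper for illustration only, not part of mltb2.
    """
    sections: List[str] = []
    current: List[str] = []  # sentences of the section being built
    current_tokens = 0  # summed token count of `current`

    for sentence in sentences:
        n = count_tokens(sentence)
        if n > max_token:
            if ignore_overly_long_sentences:
                continue  # drop the sentence, as the parameter describes
            raise ValueError(
                f"Sentence has {n} tokens, more than max_token={max_token}."
            )
        if current_tokens + n > max_token:
            # The sentence does not fit: close the current section first.
            sections.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n

    if current:
        sections.append(" ".join(current))
    return sections


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
sections = split_into_sections(sentences, whitespace_token_count, max_token=5)
for section in sections:
    print(section)
```

In this sketch the pre-split sentence list plays the role of the sentence splitter's output, and the counter callable plays the role of the token counter. A real subword tokenizer may count a joined section differently from the sum of its sentences, so the summing step is a simplification.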

__call__(text: str) → List[str]

Split the text into sections.

Parameters:

text (str) – The text to be split.

Returns:

The list of text sections.

Return type:

List[str]