somajo

This module offers SoMaJo-specific tools.

Hint

Use pip to install the necessary dependencies for this module: pip install mltb2[somajo]

class mltb2.somajo.JaccardSimilarity(language: Literal['de_CMC', 'en_PTB'])

Bases: SoMaJoBaseClass

Calculate the Jaccard similarity of two texts based on their token sets.

Parameters:

language (Literal['de_CMC', 'en_PTB']) – The language. de_CMC for German or en_PTB for English.

__call__(text1: str, text2: str) → float

Calculate the Jaccard similarity for two texts.

Parameters:
  • text1 (str) – Text one.

  • text2 (str) – Text two.

Returns:

The Jaccard similarity.

Return type:

float

get_token_set(text: str) → set[str]

Get token set for text.

Parameters:

text (str) – The text to be tokenized into a set.

Returns:

The set of tokens (words).

Return type:

set[str]
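
The idea behind the two methods above can be sketched without any dependencies. The following is a minimal illustration, not the mltb2 implementation: `simple_token_set` is a hypothetical regex stand-in for the SoMaJo tokenizer, and returning 0.0 for two texts without tokens is an assumption.

```python
import re


def simple_token_set(text: str) -> set[str]:
    """Hypothetical stand-in for SoMaJo tokenization: collect word tokens."""
    return set(re.findall(r"\w+", text))


def jaccard_similarity(text1: str, text2: str) -> float:
    """Return |A & B| / |A | B| for the token sets A and B of the two texts."""
    tokens1 = simple_token_set(text1)
    tokens2 = simple_token_set(text2)
    union = tokens1 | tokens2
    if not union:
        # Assumption: two texts without any tokens get a similarity of 0.0.
        return 0.0
    return len(tokens1 & tokens2) / len(union)


# Three shared tokens (Das, ist, Test) out of five distinct tokens overall.
similarity = jaccard_similarity("Das ist ein Test", "Das ist kein Test")
print(similarity)
```

Because the comparison works on sets, repeated words and word order do not affect the result.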

class mltb2.somajo.SoMaJoBaseClass(language: Literal['de_CMC', 'en_PTB'])

Bases: ABC

Base class for SoMaJo tools.

Parameters:

language (Literal['de_CMC', 'en_PTB']) – The language. de_CMC for German or en_PTB for English.

Note

This class is an abstract base class. It should not be used directly.

class mltb2.somajo.SoMaJoSentenceSplitter(language: Literal['de_CMC', 'en_PTB'], show_progress_bar: bool = False)

Bases: SoMaJoBaseClass

Use SoMaJo to split text into sentences.

Parameters:
  • language (Literal['de_CMC', 'en_PTB']) – The language. de_CMC for German or en_PTB for English.

  • show_progress_bar (bool) – Show a progress bar during processing.

__call__(text: str) → list[str]

Split the text into a list of sentences.

Parameters:

text (str) – The text to be split.

Returns:

The list of sentences.

Return type:

list[str]

class mltb2.somajo.TokenExtractor(language: Literal['de_CMC', 'en_PTB'])

Bases: SoMaJoBaseClass

Extract tokens from text.

Parameters:

language (Literal['de_CMC', 'en_PTB']) – The language. de_CMC for German or en_PTB for English.

extract_token_set(text: Iterable | str, keep_token_classes: str | None = None) → set[str]

Extract tokens from text.

Parameters:
  • text (Iterable | str) – The text from which to extract tokens.

  • keep_token_classes (str | None) – The token classes to keep. If None, all will be kept.

Returns:

Set of tokens.

Return type:

set[str]

extract_url_set(text: Iterable | str) → set[str]

Extract URLs from text.

An example:

from mltb2.somajo import TokenExtractor

token_extractor = TokenExtractor("de_CMC")
url_set = token_extractor.extract_url_set("Das ist ein Link: http://github.com")
print(url_set)

Example output:

{'http://github.com'}

Parameters:

text (Iterable | str) – The text from which to extract URLs.

Returns:

Set of extracted links.

Return type:

set[str]

class mltb2.somajo.UrlSwapper(token_extractor: TokenExtractor, url_pattern: str = 'https://link-{}.com')

Bases: object

Tool to swap (and reverse swap) links with a numbered replacement link.

Parameters:
  • token_extractor (TokenExtractor) – The token extractor to be used.

  • url_pattern (str) – The pattern for the replacement links. A single {} marks where the number is inserted.

reverse_swap_urls(text: str) → tuple[str, set[str]]

Revert the URL swap.

Returns:

The reverted text and a set of URLs that were unknown to the UrlSwapper.

Parameters:

text (str)

Return type:

tuple[str, set[str]]

swap_urls(text: str) → str

Swap the URLs in the text for numbered replacement links.

Parameters:

text (str)

Return type:

str
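
The swap and reverse swap round trip can be sketched with the standard library alone. This is an illustration under stated assumptions, not the mltb2 implementation: `UrlSwapperSketch` is a hypothetical class, a simple regex replaces the SoMaJo-based `TokenExtractor`, and numbering the replacement links from 1 is an assumption.

```python
import re

# Assumption: a simple regex stands in for SoMaJo-based URL extraction.
URL_REGEX = re.compile(r"https?://\S+")


class UrlSwapperSketch:
    """Hypothetical sketch of swapping URLs for numbered replacement links."""

    def __init__(self, url_pattern: str = "https://link-{}.com") -> None:
        self.url_pattern = url_pattern
        self.url_to_swap: dict[str, str] = {}
        self.swap_to_url: dict[str, str] = {}

    def swap_urls(self, text: str) -> str:
        def replace(match: re.Match) -> str:
            url = match.group(0)
            if url not in self.url_to_swap:
                # Assumption: replacement links are numbered starting at 1.
                swap = self.url_pattern.format(len(self.url_to_swap) + 1)
                self.url_to_swap[url] = swap
                self.swap_to_url[swap] = url
            return self.url_to_swap[url]

        return URL_REGEX.sub(replace, text)

    def reverse_swap_urls(self, text: str) -> tuple[str, set[str]]:
        unknown: set[str] = set()

        def restore(match: re.Match) -> str:
            url = match.group(0)
            if url in self.swap_to_url:
                return self.swap_to_url[url]
            # URLs that were never produced by swap_urls are reported back.
            unknown.add(url)
            return url

        return URL_REGEX.sub(restore, text), unknown


swapper = UrlSwapperSketch()
original = "See http://github.com and http://example.org for details"
swapped = swapper.swap_urls(original)
restored, unknown = swapper.reverse_swap_urls(swapped + " plus https://link-9.com")
print(swapped)
print(restored, unknown)
```

Keeping both mappings in the object is what makes the reverse swap possible, and the returned set shows which links in the text were not created by an earlier swap.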

mltb2.somajo.detokenize(tokens) → str

Convert SoMaJo tokens to a sentence (string).

Parameters:

tokens – The tokens to be de-tokenized.

Returns:

The de-tokenized sentence.

Return type:

str
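
De-tokenization of this kind can be sketched as follows. The sketch assumes that each token carries `text`, `space_after`, and `original_spelling` attributes, as SoMaJo tokens do. `FakeToken` and `detokenize_sketch` are hypothetical names, and the exact behavior of `mltb2.somajo.detokenize` may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeToken:
    """Hypothetical stand-in that mirrors a few SoMaJo token attributes."""

    text: str
    space_after: bool = True
    original_spelling: Optional[str] = None


def detokenize_sketch(tokens: Iterable[FakeToken]) -> str:
    """Join token texts, adding a space only where the token requests one."""
    parts = []
    for token in tokens:
        # Prefer the original spelling if the tokenizer normalized the text.
        if token.original_spelling is not None:
            parts.append(token.original_spelling)
        else:
            parts.append(token.text)
        if token.space_after:
            parts.append(" ")
    # Assumption: surrounding whitespace is stripped from the result.
    return "".join(parts).strip()


tokens = [
    FakeToken("Das"),
    FakeToken("ist"),
    FakeToken("gut", space_after=False),
    FakeToken("."),
]
sentence = detokenize_sketch(tokens)
print(sentence)
```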

mltb2.somajo.extract_token_class_set(sentences: Iterable, keep_token_classes: str | None = None) → set[str]

Extract tokens from sentences by token class.

Parameters:
  • sentences (Iterable) – The sentences from which to extract.

  • keep_token_classes (str | None) – The token classes to keep. If None, all will be kept.

Returns:

The set of extracted token texts.

Return type:

set[str]
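
Filtering by token class can be sketched like this. The sketch assumes that `sentences` is an iterable of token lists, that each token has `text` and `token_class` attributes, and that `keep_token_classes` is compared as a single class name. `FakeToken` and `extract_token_class_set_sketch` are hypothetical names, and the class names in the sample data are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeToken:
    """Hypothetical stand-in for a SoMaJo token with a text and a class."""

    text: str
    token_class: str


def extract_token_class_set_sketch(
    sentences: Iterable[Iterable[FakeToken]],
    keep_token_classes: Optional[str] = None,
) -> set[str]:
    """Collect token texts, optionally keeping only one token class."""
    result: set[str] = set()
    for sentence in sentences:
        for token in sentence:
            # Assumption: None keeps every token, otherwise the class must match.
            if keep_token_classes is None or token.token_class == keep_token_classes:
                result.add(token.text)
    return result


sentences = [
    [FakeToken("Das", "regular"), FakeToken("http://github.com", "URL")],
    [FakeToken("Link", "regular")],
]
url_texts = extract_token_class_set_sketch(sentences, keep_token_classes="URL")
all_texts = extract_token_class_set_sketch(sentences)
print(url_texts)
```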