`somajo`

This module offers SoMaJo specific tools.

Hint

Use pip to install the necessary dependencies for this module: pip install mltb2[somajo]

class mltb2.somajo.JaccardSimilarity(language: Literal['de_CMC', 'en_PTB'])

Bases: SoMaJoBaseClass

Calculate the Jaccard similarity of two texts: the size of the intersection of their token sets divided by the size of their union.

Parameters: language (Literal['de_CMC', 'en_PTB']) – The language. de_CMC for German or en_PTB for English.

__call__(text1: str, text2: str) → float

Calculate the Jaccard similarity for two texts.

Parameters:

text1 (str) – Text one.
text2 (str) – Text two.

Returns:

The Jaccard similarity.

Return type:

float
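
The calculation described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the names `jaccard_similarity` and `simple_token_set` are hypothetical, a lowercase whitespace split stands in for SoMaJo tokenization, and returning 0.0 for two empty sets is an assumption.

```python
def jaccard_similarity(tokens1: set[str], tokens2: set[str]) -> float:
    """Return |A ∩ B| / |A ∪ B| for two token sets."""
    union = tokens1 | tokens2
    if not union:
        # Assumption: two empty token sets get a similarity of 0.0.
        return 0.0
    return len(tokens1 & tokens2) / len(union)


def simple_token_set(text: str) -> set[str]:
    """Stand-in tokenizer (lowercase whitespace split), not SoMaJo."""
    return set(text.lower().split())


a = simple_token_set("the cat sat on the mat")
b = simple_token_set("the cat lay on the rug")

# Three shared tokens (the, cat, on) out of seven distinct tokens overall.
print(jaccard_similarity(a, b))
```

With a real tokenizer, punctuation and URLs would typically become separate tokens, so the values can differ from this whitespace-based sketch.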

get_token_set(text: str) → set[str]

Get the set of tokens for a text.

Parameters: text (str) – The text to be tokenized into a set.
Returns: The set of tokens (words).
Return type: set[str]

class mltb2.somajo.SoMaJoBaseClass(language: Literal['de_CMC', 'en_PTB'])

Bases: ABC

Base class for SoMaJo tools.

Parameters: language (Literal['de_CMC', 'en_PTB']) – The language. de_CMC for German or en_PTB for English.

Note

This class is an abstract base class. It should not be used directly.

class mltb2.somajo.SoMaJoSentenceSplitter(language: Literal['de_CMC', 'en_PTB'], show_progress_bar: bool = False)

Bases: SoMaJoBaseClass

Use SoMaJo to split text into sentences.

Parameters:

language (Literal['de_CMC', 'en_PTB']) – The language. de_CMC for German or en_PTB for English.
show_progress_bar (bool) – Show a progress bar during processing.

__call__(text: str) → list[str]

Split the text into a list of sentences.

Parameters: text (str) – The text to be split.
Returns: The list of sentences.
Return type: list[str]
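
A minimal usage sketch that relies only on the constructor and call signature documented above. The helper name `split_sentences` is hypothetical, and the import is done lazily so the function can be defined even where the optional dependencies are not installed.

```python
def split_sentences(text: str, language: str = "en_PTB") -> list[str]:
    """Split text into sentences (hypothetical convenience helper)."""
    # Imported lazily; requires `pip install mltb2[somajo]`.
    from mltb2.somajo import SoMaJoSentenceSplitter

    splitter = SoMaJoSentenceSplitter(language)
    return splitter(text)
```

When many texts are processed, it is probably preferable to construct the splitter once and reuse it rather than rebuilding it on every call as this helper does.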

class mltb2.somajo.TokenExtractor(language: Literal['de_CMC', 'en_PTB'])

Bases: SoMaJoBaseClass

Extract tokens from text.

Parameters: language (Literal['de_CMC', 'en_PTB']) – The language. de_CMC for German or en_PTB for English.

extract_token_set(text: Iterable | str, keep_token_classes: str | None = None) → set[str]

Extract tokens from text.

Parameters:

text (Iterable | str) – The text to extract tokens from, given as a single string or an iterable of texts.
keep_token_classes (str | None) – The token classes to keep. If None, all tokens are kept.

Returns:

Set of tokens.

Return type:

set[str]
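
The effect of keep_token_classes can be illustrated with a small stand-in. This is a sketch under assumptions, not the library's code: `ClassedToken` and `token_set` are hypothetical names, the tokens are written by hand instead of coming from SoMaJo, the class labels are illustrative, and matching the class by equality is an assumption about how the filter behaves.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ClassedToken:
    """Hypothetical stand-in for a tokenizer token with a class label."""
    text: str
    token_class: str


def token_set(
    tokens: Iterable[ClassedToken],
    keep_token_classes: Optional[str] = None,
) -> set[str]:
    """Collect token texts, optionally keeping only one token class."""
    if keep_token_classes is None:
        # No filter given: keep every token.
        return {t.text for t in tokens}
    # Assumption: a token is kept when its class equals the given value.
    return {t.text for t in tokens if t.token_class == keep_token_classes}


tokens = [
    ClassedToken("See", "regular"),
    ClassedToken("http://github.com", "URL"),
    ClassedToken("!", "symbol"),
]

print(token_set(tokens))
print(token_set(tokens, keep_token_classes="URL"))
```

Because the result is a set, duplicate tokens collapse into one entry and the original order is not preserved.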

extract_url_set(text: Iterable | str) → set[str]

Extract URLs from text.

An example:

from mltb2.somajo import TokenExtractor

token_extractor = TokenExtractor("de_CMC")
url_set = token_extractor.extract_url_set("Das ist ein Link: http://github.com")
print(url_set)

Example output:

{'http://github.com'}

Parameters: text (Iterable | str) – The text to extract URLs from.
Returns: Set of extracted links.
Return type: set[str]

class mltb2.somajo.UrlSwapper(token_extractor: TokenExtractor, url_pattern: str = 'https://link-{}.com')

Bases: object

Tool to swap (and reverse swap) links with a numbered replacement link.

Parameters:

token_extractor (TokenExtractor) – The sentence token extractor to be used.
url_pattern (str) – The pattern to use for replacement. One {} marks the place where to put the number.

reverse_swap_urls(text: str) → tuple[str, set[str]]

Revert the URL swap.

Parameters: text (str) – The text in which swapped URLs are to be restored.
Returns: The reverted text and a set of URLs that were unknown to the UrlSwapper.
Return type: tuple[str, set[str]]

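swap_urls(text: str) → str

Swap the URLs in the text.

Parameters: text (str) – The text whose URLs are to be swapped.
Returns: The text with its URLs replaced by numbered replacement links.
Return type: str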
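
The swap and reverse swap idea can be sketched without SoMaJo. This is only an illustration under stated assumptions: `SimpleUrlSwapper` is a hypothetical class, a simple regular expression stands in for the TokenExtractor-based URL detection, and numbering the replacement links from 1 is an assumption.

```python
import re

# Stand-in for SoMaJo-based URL extraction: http(s) URLs up to the next space.
URL_RE = re.compile(r"https?://\S+")


class SimpleUrlSwapper:
    """Hypothetical sketch of swapping URLs with numbered replacement links."""

    def __init__(self, url_pattern: str = "https://link-{}.com") -> None:
        self.url_pattern = url_pattern
        self._url_to_swap: dict[str, str] = {}
        self._swap_to_url: dict[str, str] = {}

    def swap_urls(self, text: str) -> str:
        def replace(match: re.Match) -> str:
            url = match.group(0)
            if url not in self._url_to_swap:
                # Assumption: replacement links are numbered starting at 1.
                swap = self.url_pattern.format(len(self._url_to_swap) + 1)
                self._url_to_swap[url] = swap
                self._swap_to_url[swap] = url
            return self._url_to_swap[url]

        return URL_RE.sub(replace, text)

    def reverse_swap_urls(self, text: str) -> tuple[str, set[str]]:
        unknown: set[str] = set()

        def restore(match: re.Match) -> str:
            url = match.group(0)
            if url in self._swap_to_url:
                return self._swap_to_url[url]
            # URLs that were never produced by swap_urls are reported back.
            unknown.add(url)
            return url

        return URL_RE.sub(restore, text), unknown


swapper = SimpleUrlSwapper()
swapped = swapper.swap_urls("Docs at http://github.com and http://example.org/a")
restored, unknown = swapper.reverse_swap_urls(swapped + " plus https://link-9.com")
print(swapped)
print(restored)
print(unknown)
```

The set of unknown URLs makes it possible to notice links that appeared in the text after swapping, for example links introduced by a later processing step.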

mltb2.somajo.detokenize(tokens) → str

Convert SoMaJo tokens to a sentence (string).

Parameters: tokens – The tokens to be de-tokenized.
Returns: The de-tokenized sentence.
Return type: str
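
De-tokenization can be pictured as joining token texts while respecting the whitespace that followed each token in the original input. The sketch below is an assumption about the general approach, not the library's code: `SpacedToken` and `detokenize_sketch` are hypothetical names, and the attributes `space_after` and `original_spelling` are modeled on what SoMaJo token objects are understood to expose.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SpacedToken:
    """Hypothetical stand-in for a token that records its trailing space."""
    text: str
    space_after: bool = True
    original_spelling: Optional[str] = None


def detokenize_sketch(tokens: Iterable[SpacedToken]) -> str:
    """Join tokens into a string, adding a space only where one followed."""
    parts = []
    for token in tokens:
        # Prefer the original spelling when the tokenizer normalized the text.
        if token.original_spelling is not None:
            parts.append(token.original_spelling)
        else:
            parts.append(token.text)
        if token.space_after:
            parts.append(" ")
    return "".join(parts).strip()


tokens = [
    SpacedToken("Hello", space_after=False),
    SpacedToken(","),
    SpacedToken("world", space_after=False),
    SpacedToken("!"),
]
print(detokenize_sketch(tokens))  # Hello, world!
```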
