somajo
This module offers SoMaJo-specific tools.
Hint
Use pip to install the necessary dependencies for this module:
pip install mltb2[somajo]
- class mltb2.somajo.JaccardSimilarity(language: Literal['de_CMC', 'en_PTB'])[source]
Bases:
SoMaJoBaseClass
Calculate the Jaccard similarity.
- Parameters:
language (Literal['de_CMC', 'en_PTB']) – The language.
de_CMC for German or en_PTB for English.
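The Jaccard similarity of two token sets A and B is |A ∩ B| / |A ∪ B|. The sketch below illustrates the measure in plain Python. It does not use mltb2 or SoMaJo: the function name jaccard_similarity is hypothetical, whitespace splitting stands in for SoMaJo tokenization, and returning 1.0 for two empty sets is a convention chosen for this sketch only.

```python
def jaccard_similarity(tokens_a: set[str], tokens_b: set[str]) -> float:
    """Return |A ∩ B| / |A ∪ B| for two token sets (hypothetical helper)."""
    union = tokens_a | tokens_b
    if not union:
        # Assumption of this sketch: two empty sets count as identical.
        return 1.0
    return len(tokens_a & tokens_b) / len(union)


# Whitespace splitting is only a stand-in for a real tokenizer such as SoMaJo.
tokens_a = set("the quick brown fox".split())
tokens_b = set("the lazy brown dog".split())

# The sets share 2 tokens ("the", "brown") out of 6 distinct tokens overall.
score = jaccard_similarity(tokens_a, tokens_b)
print(score)
```

The class documented here presumably tokenizes both texts with SoMaJo before comparing them, so its scores can differ from this whitespace-based sketch.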
- class mltb2.somajo.SoMaJoBaseClass(language: Literal['de_CMC', 'en_PTB'])[source]
Bases:
ABC
Base Class for SoMaJo tools.
- Parameters:
language (Literal['de_CMC', 'en_PTB']) – The language.
de_CMC for German or en_PTB for English.
Note
This class is an abstract base class. It should not be used directly.
- class mltb2.somajo.SoMaJoSentenceSplitter(language: Literal['de_CMC', 'en_PTB'], show_progress_bar: bool = False)[source]
Bases:
SoMaJoBaseClass
Use SoMaJo to split text into sentences.
- Parameters:
- class mltb2.somajo.TokenExtractor(language: Literal['de_CMC', 'en_PTB'])[source]
Bases:
SoMaJoBaseClass
Extract tokens from text.
- Parameters:
language (Literal['de_CMC', 'en_PTB']) – The language.
de_CMC for German or en_PTB for English.
- extract_token_set(text: Iterable | str, keep_token_classes: str | None = None) set[str] [source]
Extract tokens from text.
- extract_url_set(text: Iterable | str) set[str] [source]
Extract URLs from text.
An example:
from mltb2.somajo import TokenExtractor

token_extractor = TokenExtractor("de_CMC")
url_set = token_extractor.extract_url_set("Das ist ein Link: http://github.com")
print(url_set)
Example output:
{'http://github.com'}
- class mltb2.somajo.UrlSwapper(token_extractor: TokenExtractor, url_pattern: str = 'https://link-{}.com')[source]
Bases:
object
Tool to swap (and reverse swap) links with a numbered replacement link.
- Parameters:
token_extractor (TokenExtractor) – The sentence token extractor to be used.
url_pattern (str) – The pattern to use for replacement. One {} marks the place where the number is inserted.
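The idea of swapping links for numbered replacements and restoring them later can be sketched in plain Python. This is not the UrlSwapper implementation: the helper names swap_urls and reverse_swap_urls are hypothetical, numbering from 1 is an assumption, and the URLs are passed in explicitly rather than found by a TokenExtractor. Only the default pattern 'https://link-{}.com' comes from the signature above.

```python
URL_PATTERN = "https://link-{}.com"  # default pattern from the signature above


def swap_urls(
    text: str, urls: list[str], url_pattern: str = URL_PATTERN
) -> tuple[str, dict[str, str]]:
    """Replace each URL with a numbered placeholder (hypothetical helper)."""
    mapping: dict[str, str] = {}
    for number, url in enumerate(urls, start=1):  # assumption: numbering starts at 1
        placeholder = url_pattern.format(number)  # the single {} receives the number
        mapping[placeholder] = url
        text = text.replace(url, placeholder)
    return text, mapping


def reverse_swap_urls(text: str, mapping: dict[str, str]) -> str:
    """Restore the original URLs from the placeholder mapping."""
    for placeholder, url in mapping.items():
        text = text.replace(placeholder, url)
    return text


original = "See http://github.com and http://example.org"
swapped, mapping = swap_urls(original, ["http://github.com", "http://example.org"])
restored = reverse_swap_urls(swapped, mapping)
print(swapped)
print(restored)
```

Plain string replacement is enough for this illustration, but it would mis-handle a URL that is a prefix of another one. A token-based approach, as the token_extractor parameter suggests, avoids that kind of partial match.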
- mltb2.somajo.detokenize(tokens) str [source]
Convert SoMaJo tokens to a sentence (string).
- Parameters:
tokens – The tokens to be de-tokenized.
- Returns:
The de-tokenized sentence.
- Return type: