somajo
This module offers SoMaJo-specific tools.
Hint
Use pip to install the necessary dependencies for this module:
pip install mltb2[somajo]
- class mltb2.somajo.JaccardSimilarity(language: Literal['de_CMC', 'en_PTB'])
- Bases: SoMaJoBaseClass
- Calculate the Jaccard similarity.
- Parameters:
- language (Literal['de_CMC', 'en_PTB']) – The language: de_CMC for German or en_PTB for English.
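The Jaccard similarity of two token sets A and B is |A ∩ B| / |A ∪ B|. The following is a minimal sketch of that measure, not mltb2's implementation: it uses plain whitespace splitting as a stand-in for SoMaJo tokenization, and both the function name `jaccard_similarity` and the treatment of two empty texts are assumptions made for illustration.

```python
def jaccard_similarity(text1: str, text2: str) -> float:
    """Jaccard similarity of the lower-cased whitespace token sets of two texts."""
    # Stand-in tokenization: a SoMaJo-based tool would tokenize properly,
    # this sketch only splits on whitespace.
    tokens1 = set(text1.lower().split())
    tokens2 = set(text2.lower().split())
    union = tokens1 | tokens2
    if not union:
        # Assumption: two texts without any tokens count as identical.
        return 1.0
    return len(tokens1 & tokens2) / len(union)


# Three shared tokens out of five distinct tokens overall.
print(jaccard_similarity("Das ist ein Test", "Das ist kein Test"))  # prints 0.6
```

A value of 1.0 means both texts yield the same token set, 0.0 means they share no token.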
 
- class mltb2.somajo.SoMaJoBaseClass(language: Literal['de_CMC', 'en_PTB'])
- Bases: ABC
- Base class for SoMaJo tools.
- Parameters:
- language (Literal['de_CMC', 'en_PTB']) – The language: de_CMC for German or en_PTB for English.
- Note: This class is an abstract base class. It should not be used directly.
- class mltb2.somajo.SoMaJoSentenceSplitter(language: Literal['de_CMC', 'en_PTB'], show_progress_bar: bool = False)
- Bases: SoMaJoBaseClass
- Use SoMaJo to split text into sentences.
- Parameters:
 
- class mltb2.somajo.TokenExtractor(language: Literal['de_CMC', 'en_PTB'])
- Bases: SoMaJoBaseClass
- Extract tokens from text.
- Parameters:
- language (Literal['de_CMC', 'en_PTB']) – The language: de_CMC for German or en_PTB for English.
 - extract_token_set(text: Iterable | str, keep_token_classes: str | None = None) -> set[str]
- Extract tokens from text.
 - extract_url_set(text: Iterable | str) -> set[str]
- Extract URLs from text.
- An example:

    from mltb2.somajo import TokenExtractor

    token_extractor = TokenExtractor("de_CMC")
    url_set = token_extractor.extract_url_set("Das ist ein Link: http://github.com")
    print(url_set)

- Example output:

    {'http://github.com'}
 
- class mltb2.somajo.UrlSwapper(token_extractor: TokenExtractor, url_pattern: str = 'https://link-{}.com')
- Bases: object
- Tool to swap (and reverse swap) links with a numbered replacement link.
- Parameters:
- token_extractor (TokenExtractor) – The token extractor to be used.
- url_pattern (str) – The pattern to use for the replacement. A single {} marks the place where the number is inserted.
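The idea of replacing each link with a numbered placeholder built from url_pattern, and restoring the originals later, can be sketched roughly as follows. This is an illustration under stated assumptions, not the UrlSwapper implementation: a crude regular expression stands in for the TokenExtractor, and the function names `swap_urls` and `reverse_swap_urls` are hypothetical.

```python
import re

# Crude stand-in for SoMaJo-based URL detection (assumption for this sketch).
URL_RE = re.compile(r"https?://\S+")


def swap_urls(text: str, url_pattern: str = "https://link-{}.com") -> tuple[str, dict[str, str]]:
    """Replace each distinct URL with a numbered link; return the new text and the mapping."""
    url_to_replacement: dict[str, str] = {}

    def _replace(match: re.Match) -> str:
        url = match.group(0)
        if url not in url_to_replacement:
            # The single {} in url_pattern receives the running number.
            url_to_replacement[url] = url_pattern.format(len(url_to_replacement) + 1)
        return url_to_replacement[url]

    swapped = URL_RE.sub(_replace, text)
    # Invert the mapping so it can be used for the reverse swap.
    replacement_to_url = {rep: url for url, rep in url_to_replacement.items()}
    return swapped, replacement_to_url


def reverse_swap_urls(text: str, replacement_to_url: dict[str, str]) -> str:
    """Put the original URLs back in place of the numbered links."""
    for replacement, url in replacement_to_url.items():
        text = text.replace(replacement, url)
    return text


text = "See http://github.com and http://example.org then http://github.com again"
swapped, mapping = swap_urls(text)
print(swapped)
print(reverse_swap_urls(swapped, mapping) == text)
```

In this sketch a repeated URL receives the same number each time, so the reverse swap restores the text exactly as long as the original text does not already contain a link that matches url_pattern.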
 
 
- mltb2.somajo.detokenize(tokens) -> str
- Convert SoMaJo tokens to a sentence (string).
- Parameters:
- tokens – The tokens to be de-tokenized. 
- Returns:
- The de-tokenized sentence. 
- Return type: