text
This module offers text-specific tools.
It offers the following functionality:
- detect or clean invisible characters 
- detect or replace special whitespaces 
- remove duplicate whitespaces 
- calculate the distance between two texts to find anomalies 
- class mltb2.text.TextDistance(show_progress_bar: bool = False, max_dimensions: int = 100)
- Bases: object
- Calculate the distance between two texts.
- This class can be used to find texts with anomalies, for example texts containing HTML markup or other unusual characters.
- One or more texts must first be fitted with fit(). After that, the distance to other texts can be calculated with distance(). Once a distance has been calculated, the instance cannot be fitted again.
- Parameters:
- show_progress_bar (bool) – Whether to show a progress bar.
- max_dimensions (int) – The maximum number of most common characters to include in the distance calculation.
- Raises:
- ValueError – If max_dimensions is not greater than 0.
 - _normalize_char_counter() -> None
- Normalize the char counter to a defaultdict.
- This supports lazy postprocessing of the char counter.
- Return type:
- None
 
 - distance(text) -> float
- Calculate the distance between the fitted text and the given text.
- This implementation uses the Manhattan distance (scipy.spatial.distance.cityblock()). The distance is only calculated for the max_dimensions most common characters.
- Parameters:
- text – The text to calculate the Manhattan distance to.
- Returns:
- The Manhattan distance. The higher this value is, the more the text differs from the fitted text.
- Return type:
- float
 
 - fit(text: str | Iterable[str]) -> None
- Fit the text.
- This method must be called at least once before distance().
- Parameters:
- text (str | Iterable[str]) – The text or texts to fit.
- Raises:
- ValueError – If fit() is called after distance().
- Return type:
- None 
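The fit-then-measure lifecycle described above can be sketched without the library. This is a minimal illustration, not the mltb2 implementation: the class name `CharDistanceSketch` and its private attributes are hypothetical, the distance is computed in plain Python rather than with scipy.spatial.distance.cityblock(), and it assumes that character frequencies are taken relative to the total count of all characters.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Union


class CharDistanceSketch:
    """Hypothetical stand-in for illustration; not the mltb2 implementation."""

    def __init__(self, max_dimensions: int = 100) -> None:
        if max_dimensions <= 0:
            raise ValueError("max_dimensions must be greater than 0")
        self.max_dimensions = max_dimensions
        self._counter: Counter = Counter()
        self._profile: Optional[Dict[str, float]] = None  # frozen on first distance()

    def _normalize(self, counter: Counter) -> Dict[str, float]:
        # Assumption: frequencies are relative to the total over all characters.
        total = sum(counter.values())
        return {c: n / total for c, n in counter.most_common(self.max_dimensions)}

    def fit(self, text: Union[str, Iterable[str]]) -> None:
        if self._profile is not None:
            raise ValueError("fit() must not be called after distance()")
        for item in ([text] if isinstance(text, str) else text):
            self._counter.update(item)  # counts the characters of each text

    def distance(self, text: str) -> float:
        if self._profile is None:
            self._profile = self._normalize(self._counter)  # lazy normalization
        other = self._normalize(Counter(text))
        chars = set(self._profile) | set(other)
        # Manhattan distance: sum of absolute frequency differences.
        return sum(abs(self._profile.get(c, 0.0) - other.get(c, 0.0)) for c in chars)


sketch = CharDistanceSketch(max_dimensions=50)
sketch.fit(["plain prose text", "more plain prose"])
print(sketch.distance("some plain text"))
print(sketch.distance("<div class='x'>&nbsp;</div>"))
```

Under these assumptions, a text with the same character distribution as the fitted texts has distance 0, and a text sharing no characters with them has distance 2.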
 
 
- mltb2.text._normalize_counter_to_defaultdict(counter: Counter, max_dimensions: int) -> defaultdict
- Normalize a counter to max_dimensions.
- The number of dimensions is limited to the max_dimensions most common characters. The counter values are normalized by dividing them by the total count.
- Parameters:
- counter (Counter) – The counter to normalize.
- max_dimensions (int) – The maximum number of dimensions to keep.
- Returns:
- The normalized counter with a maximum of max_dimensions dimensions.
- Return type:
- defaultdict
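As a rough sketch of this normalization step: the helper name `normalize_counter_sketch` is hypothetical, and the code assumes that "total count" means the sum over all counted characters, not only the ones that are kept.

```python
from collections import Counter, defaultdict


def normalize_counter_sketch(counter: Counter, max_dimensions: int) -> defaultdict:
    """Hypothetical helper; assumes the total covers all counted characters."""
    total = sum(counter.values())
    normalized = defaultdict(float)  # characters that were dropped read as 0.0
    for char, count in counter.most_common(max_dimensions):
        normalized[char] = count / total
    return normalized


freqs = normalize_counter_sketch(Counter("aaabbc"), max_dimensions=2)
print(dict(freqs))  # only the two most common characters are kept
```

Returning a defaultdict means a lookup for a character outside the kept dimensions yields 0.0 instead of raising KeyError, which is convenient when comparing two frequency maps character by character.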
 
- mltb2.text.clean_all_invisible_chars_and_strip(text: str) -> str
- Clean text from invisible characters and strip the text.
- Remove invisible characters from text.
- Replace special whitespaces with normal whitespaces.
- Remove leading and trailing whitespaces.
- The invisible characters are defined in the constant INVISIBLE_CHARACTERS. The special whitespaces are defined in the constant SPECIAL_WHITESPACES.
- Returns:
- The cleaned text. 
 
- mltb2.text.clean_all_invisible_chars_and_whitespaces(text: str) -> str
- Clean text from invisible characters and whitespaces.
- Remove invisible characters from text.
- Replace special whitespaces with normal whitespaces.
- Replace multiple whitespaces with single whitespace.
- Remove leading and trailing whitespaces.
- The invisible characters are defined in the constant INVISIBLE_CHARACTERS. The special whitespaces are defined in the constant SPECIAL_WHITESPACES.
- Returns:
- The cleaned text. 
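The cleaning steps listed above can be sketched as a small pipeline. The character sets below are assumed examples chosen for illustration and may differ from the library's INVISIBLE_CHARACTERS and SPECIAL_WHITESPACES constants. The function name `clean_sketch` and its `collapse` flag are hypothetical as well, and the sketch assumes that "multiple whitespaces" means runs of plain spaces.

```python
import re

# Assumed example sets for illustration only; the library's constants
# may contain different characters.
EXAMPLE_INVISIBLE = {"\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"}
EXAMPLE_SPECIAL_WHITESPACES = {"\u00a0", "\u2002", "\u2003", "\u2009", "\u202f", "\u3000"}

_INVISIBLE_TABLE = {ord(c): None for c in EXAMPLE_INVISIBLE}
_WHITESPACE_TABLE = {ord(c): " " for c in EXAMPLE_SPECIAL_WHITESPACES}


def clean_sketch(text: str, collapse: bool = True) -> str:
    text = text.translate(_INVISIBLE_TABLE)   # 1. drop invisible characters
    text = text.translate(_WHITESPACE_TABLE)  # 2. special whitespace -> plain space
    if collapse:
        text = re.sub(r" {2,}", " ", text)    # 3. collapse runs of spaces
    return text.strip()                       # 4. trim both ends


print(repr(clean_sketch("\ufeff Hello\u00a0\u00a0 world\u200b ")))
```

Calling the sketch with `collapse=False` corresponds to the strip-only variant, which keeps inner runs of spaces intact.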
 
- mltb2.text.has_invisible_characters(text: str) -> bool
- Check if text contains invisible characters.
- The invisible characters are defined in the constant INVISIBLE_CHARACTERS.
- mltb2.text.has_special_whitespaces(text: str) -> bool
- Check if text contains special whitespaces.
- The special whitespaces are defined in the constant SPECIAL_WHITESPACES.
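Both checks amount to a membership test against a fixed character set. A minimal sketch follows, using assumed example sets and a hypothetical helper name `contains_any`. The library's actual constants may hold different characters.

```python
# Assumed example sets for illustration only.
EXAMPLE_INVISIBLE = frozenset({"\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"})
EXAMPLE_SPECIAL_WHITESPACES = frozenset({"\u00a0", "\u2003", "\u2009", "\u3000"})


def contains_any(text: str, chars: frozenset) -> bool:
    # True as soon as one character of the text is in the set.
    return not chars.isdisjoint(text)


print(contains_any("zero\u200bwidth", EXAMPLE_INVISIBLE))
print(contains_any("non\u00a0breaking", EXAMPLE_SPECIAL_WHITESPACES))
```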
- mltb2.text.has_xml_tag(text: str) -> bool
- Check if text contains XML tags (one or multiple).
- These are some XML tags we detect:
- <xml_tag>
- <xml:tag>
- </xml_tag>
- <xml_tag/>
- <xml_tag />
- Text such as a < b but x > y is not detected as an XML tag.
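One way to get the behaviour listed above is a regular expression that requires a tag name directly after the opening angle bracket. The pattern below is a hypothetical sketch, not the pattern used by the library, and it deliberately ignores tags with attributes.

```python
import re

# Hypothetical pattern: "<", optional "/", a name starting with a letter or
# underscore, optional whitespace, optional "/", then ">".
XML_TAG_SKETCH = re.compile(r"</?[A-Za-z_][\w:.\-]*\s*/?>")


def has_xml_tag_sketch(text: str) -> bool:
    return XML_TAG_SKETCH.search(text) is not None


for sample in ["<xml_tag>", "<xml:tag>", "</xml_tag>", "<xml_tag/>", "<xml_tag />", "a < b but x > y"]:
    print(sample, has_xml_tag_sketch(sample))
```

Because the name must follow the bracket immediately, a comparison such as a < b, where a space follows the bracket, does not match.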
- mltb2.text.remove_invisible_characters(text: str) -> str
- Remove invisible characters from text.
- The invisible characters are defined in the constant INVISIBLE_CHARACTERS.
- mltb2.text.replace_multiple_whitespaces(text: str) -> str
- Replace multiple whitespaces with a single whitespace.