text
This module offers text-specific tools.
It provides the following functionality:
detect or clean invisible characters
detect or replace special whitespaces
remove duplicate whitespaces
calculate the distance between two texts to find anomalies
- class mltb2.text.TextDistance(show_progress_bar: bool = False, max_dimensions: int = 100)
Bases: object
Calculate the distance between two texts.
This class can be used to find texts with anomalies, for example HTML markup or other unusual characters.
One text (or multiple texts) must first be fitted with fit(). After that, the distance to other given texts can be calculated with distance(). After the distance has been calculated for the first time, the class cannot be fitted again.
- Parameters:
- Raises:
ValueError – If max_dimensions is not greater than 0.
- _normalize_char_counter() -> None
Normalize the char counter to a defaultdict.
This supports lazy postprocessing of the char counter.
- Return type:
None
- distance(text) -> float
Calculate the distance between the fitted text and the given text.
This implementation uses the Manhattan distance (scipy.spatial.distance.cityblock()). The distance is only calculated for the max_dimensions most common characters.
- Parameters:
text – The text to calculate the Manhattan distance to. The higher the resulting distance, the more the text differs from the fitted text.
- Return type:
float
- fit(text: str | Iterable[str]) -> None
Fit the text.
This method must be called at least once before distance().
- Parameters:
- Raises:
ValueError – If fit() is called after distance().
- Return type:
None
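The fit-then-measure idea described for this class can be sketched with the standard library alone. This is a minimal illustration, not the class's implementation: the helper names char_profile and manhattan_distance are hypothetical, and comparing over the union of both character sets is an assumption.

```python
from collections import Counter


def char_profile(texts, max_dimensions=100):
    """Relative frequencies of the max_dimensions most common characters."""
    counter = Counter()
    for text in texts:
        counter.update(text)  # count every character of every text
    total = sum(counter.values()) or 1  # avoid division by zero on empty input
    return {char: count / total for char, count in counter.most_common(max_dimensions)}


def manhattan_distance(profile_a, profile_b):
    """Sum of absolute frequency differences (assumption: over the union of characters)."""
    chars = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(c, 0.0) - profile_b.get(c, 0.0)) for c in chars)


# "Fit" on ordinary texts, then compare a similar text and one with HTML markup.
fitted = char_profile(["plain text", "more plain text"])
clean = manhattan_distance(fitted, char_profile(["some plain text"]))
markup = manhattan_distance(fitted, char_profile(["<div><b>x</b></div>"]))
print(clean < markup)
```

In this sketch the text with markup ends up farther from the fitted profile than the ordinary text, which is the kind of signal the class is meant to surface.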
- mltb2.text._normalize_counter_to_defaultdict(counter: Counter, max_dimensions: int) -> defaultdict
Normalize a counter to max_dimensions.
The number of dimensions is limited to the max_dimensions most common characters. The counter values are normalized by dividing them by the total count.
- Parameters:
- Returns:
The normalized counter with a maximum of max_dimensions dimensions.
- Return type:
defaultdict
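A small worked example may make the normalization concrete. The function normalize_counter below is a hypothetical stand-in, and it assumes that "total count" means the sum over all characters, including those dropped by the limit; the description above does not settle this.

```python
from collections import Counter, defaultdict


def normalize_counter(counter, max_dimensions):
    # Assumption: the total includes characters that are cut off by max_dimensions.
    total = sum(counter.values())
    normalized = defaultdict(float)  # unseen characters read as 0.0
    for char, count in counter.most_common(max_dimensions):
        normalized[char] = count / total
    return normalized


# "aaabbc" has counts a=3, b=2, c=1; only the two most common are kept.
normalized = normalize_counter(Counter("aaabbc"), max_dimensions=2)
kept = sorted(normalized)
print(kept)
```

Returning a defaultdict means a later lookup of a character that was dropped or never seen yields 0.0 instead of raising a KeyError, which keeps distance computations simple.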
- mltb2.text.clean_all_invisible_chars_and_strip(text: str) -> str
Clean text from invisible characters and strip the text.
Remove invisible characters from text.
Replace special whitespaces with normal whitespaces.
Remove leading and trailing whitespaces.
The invisible characters are defined in the constant INVISIBLE_CHARACTERS. The special whitespaces are defined in the constant SPECIAL_WHITESPACES.
- Returns:
The cleaned text.
- mltb2.text.clean_all_invisible_chars_and_whitespaces(text: str) -> str
Clean text from invisible characters and whitespaces.
Remove invisible characters from text.
Replace special whitespaces with normal whitespaces.
Replace multiple whitespaces with single whitespace.
Remove leading and trailing whitespaces.
The invisible characters are defined in the constant INVISIBLE_CHARACTERS. The special whitespaces are defined in the constant SPECIAL_WHITESPACES.
- Returns:
The cleaned text.
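The four cleaning steps listed above can be sketched as a short pipeline. The character sets here are small illustrative subsets chosen for this example, not the module's INVISIBLE_CHARACTERS and SPECIAL_WHITESPACES constants, and clean is a hypothetical name.

```python
import re

# Assumption: illustrative subsets only; the module's constants are not shown here.
INVISIBLE = {"\u200b", "\u00ad", "\ufeff"}  # zero width space, soft hyphen, BOM
SPECIAL_WS = {"\u00a0", "\u2003", "\u2009"}  # no-break space, em space, thin space


def clean(text):
    # 1. Remove invisible characters.
    text = "".join(ch for ch in text if ch not in INVISIBLE)
    # 2. Replace special whitespaces with a normal space.
    text = "".join(" " if ch in SPECIAL_WS else ch for ch in text)
    # 3. Collapse runs of spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    # 4. Remove leading and trailing whitespace.
    return text.strip()


cleaned = clean("\ufeff  Hello\u200b,\u00a0\u2003world \u00ad ")
print(cleaned)
```

The order matters in this sketch: removing an invisible character or replacing a special whitespace can leave two spaces next to each other, so the collapsing step runs after both.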
- mltb2.text.has_invisible_characters(text: str) -> bool
Check if text contains invisible characters.
The invisible characters are defined in the constant INVISIBLE_CHARACTERS.
- mltb2.text.has_special_whitespaces(text: str) -> bool
Check if text contains special whitespaces.
The special whitespaces are defined in the constant SPECIAL_WHITESPACES.
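Both checks amount to a membership test against a set of characters. The sketch below assumes a small illustrative set rather than the module's constants, and adds a hypothetical diagnostic helper that names the characters found using the standard unicodedata module.

```python
import unicodedata

# Assumption: an illustrative subset, not the module's INVISIBLE_CHARACTERS constant.
INVISIBLE = {"\u200b", "\u00ad", "\ufeff"}


def has_invisible(text):
    """True if any character of text is in the invisible set."""
    return any(ch in INVISIBLE for ch in text)


def invisible_names(text):
    """Sorted Unicode names of the invisible characters found in text."""
    return sorted({unicodedata.name(ch) for ch in text if ch in INVISIBLE})


found = invisible_names("price:\u200b 10\u00ad20")
print(found)
```

Listing the names is optional, but it helps when deciding whether a flagged text needs cleaning or whether the character is expected in that data source.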
- mltb2.text.has_xml_tag(text: str) -> bool
Check if text contains XML tags (one or multiple).
These are some XML tags we detect:
<xml_tag>
<xml:tag>
</xml_tag>
<xml_tag/>
<xml_tag />
In contrast, text such as a < b but x > y is not detected as an XML tag.
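One way to get the behaviour described here is a regular expression that requires a tag name directly after the opening bracket. The pattern below is hand-written for illustration and may differ from the one the module uses; looks_like_xml_tag is a hypothetical name.

```python
import re

# Assumption: an illustrative pattern, not necessarily the module's own.
# "<" or "</", a name made of word characters or ":", optional space, optional "/", then ">".
XML_TAG = re.compile(r"</?[\w:]+\s*/?>")


def looks_like_xml_tag(text):
    """True if text contains at least one simple XML tag."""
    return XML_TAG.search(text) is not None


examples = ["<xml_tag>", "<xml:tag>", "</xml_tag>", "<xml_tag/>", "<xml_tag />"]
results = [looks_like_xml_tag(e) for e in examples]
print(results, looks_like_xml_tag("a < b but x > y"))
```

Because the name must follow the bracket immediately, a comparison such as a < b is not matched. This sketch also does not match tags that carry attributes, which a production pattern may or may not want to cover.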
- mltb2.text.remove_invisible_characters(text: str) -> str
Remove invisible characters from text.
The invisible characters are defined in the constant INVISIBLE_CHARACTERS.
- mltb2.text.replace_multiple_whitespaces(text: str) -> str
Replace multiple whitespaces with single whitespace.