text

This module offers text-specific tools.

It offers the following functionality:

  • detect or clean invisible characters

  • detect or replace special whitespaces

  • remove duplicate whitespaces

  • calculate the distance between two texts to find anomalies

class mltb2.text.TextDistance(show_progress_bar: bool = False, max_dimensions: int = 100)

Bases: object

Calculate the distance between two texts.

This class can be used to find texts with anomalies, for example texts that contain HTML markup or other unusual characters.

One text (or multiple texts) must first be fitted with fit(). After that, the distance to other texts can be calculated with distance(). Once distance() has been called for the first time, the instance cannot be fitted again.

Parameters:
  • show_progress_bar (bool) – Show a progress bar during processing.

  • max_dimensions (int) – The maximum number of dimensions to use for the distance calculation. Must be greater than 0.

Raises:

ValueError – If max_dimensions is not greater than 0.

_normalize_char_counter() → None

Normalize the char counter to a defaultdict.

This supports lazy postprocessing of the char counter.

Return type:

None

distance(text) → float

Calculate the distance between the fitted text and the given text.

This implementation uses the Manhattan distance (scipy.spatial.distance.cityblock()). The distance is only calculated for the max_dimensions most common characters.

Parameters:

text – The text to calculate the Manhattan distance to.

Returns:

The Manhattan distance. The higher this value is, the more the text differs from the fitted text.

Return type:

float
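The idea behind distance() can be illustrated with a minimal standard-library sketch. This is not the mltb2 implementation: the helper names char_frequencies and manhattan_distance are hypothetical, the sum of absolute differences is written out by hand instead of calling scipy.spatial.distance.cityblock(), and it is an assumption that the dimensions are taken from the most common characters of the reference text.

```python
from collections import Counter


def char_frequencies(text: str) -> dict:
    """Return the relative frequency of each character in ``text``."""
    counts = Counter(text)
    total = sum(counts.values())
    return {char: count / total for char, count in counts.items()}


def manhattan_distance(reference: str, other: str, max_dimensions: int = 100) -> float:
    """Sum of absolute frequency differences over the reference's most common characters."""
    # Assumption: only the most common characters of the reference span the dimensions.
    dimensions = [char for char, _ in Counter(reference).most_common(max_dimensions)]
    ref_freq = char_frequencies(reference)
    other_freq = char_frequencies(other)
    # Characters missing from the other text count as frequency 0.0.
    return sum(abs(ref_freq[char] - other_freq.get(char, 0.0)) for char in dimensions)


reference = "plain english text without any markup"
print(manhattan_distance(reference, "another plain english sentence"))
print(manhattan_distance(reference, "<div><p>&nbsp;</p></div>"))
```

In this sketch the markup-heavy string ends up farther from the reference than the ordinary sentence, which is the kind of signal the class description refers to when it mentions finding anomalies.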

fit(text: str | Iterable[str]) → None

Fit the text.

This method must be called at least once before distance().

Parameters:

text (str | Iterable[str]) – The text to fit.

Raises:

ValueError – If fit() is called after distance().

Return type:

None
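The documented lifecycle (fit one or more times, then measure, after which fitting is rejected) can be sketched with a small stand-in class. FitThenMeasure and its attributes are hypothetical and only mimic the behaviour described here; the distance computed is a simplified character-frequency difference, not the library's calculation.

```python
from collections import Counter
from typing import Iterable, Union


class FitThenMeasure:
    """Hypothetical stand-in that mimics the documented fit()/distance() lifecycle."""

    def __init__(self) -> None:
        self._char_counter: Counter = Counter()
        self._distance_called = False

    def fit(self, text: Union[str, Iterable[str]]) -> None:
        if self._distance_called:
            raise ValueError("fit() cannot be called after distance()")
        # A single string is one text; any other iterable is treated as many texts.
        texts = [text] if isinstance(text, str) else text
        for single_text in texts:
            self._char_counter.update(single_text)

    def distance(self, text: str) -> float:
        self._distance_called = True  # from now on the fitted state is frozen
        fitted_total = sum(self._char_counter.values()) or 1
        other = Counter(text)
        other_total = sum(other.values()) or 1
        return sum(
            abs(count / fitted_total - other[char] / other_total)
            for char, count in self._char_counter.items()
        )


model = FitThenMeasure()
model.fit("first text")
model.fit(["second text", "third text"])  # fitting repeatedly is allowed
print(model.distance("some other text"))
try:
    model.fit("too late")
except ValueError as error:
    print(error)
```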

mltb2.text._normalize_counter_to_defaultdict(counter: Counter, max_dimensions: int) → defaultdict

Normalize a counter to max_dimensions.

The number of dimensions is limited to the max_dimensions most common characters. The counter values are normalized by dividing them by the total count.

Parameters:
  • counter (Counter) – The counter to normalize.

  • max_dimensions (int) – The maximum number of dimensions to use for the normalization. Must be greater than 0.

Returns:

The normalized counter with a maximum of max_dimensions dimensions.

Return type:

defaultdict
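A minimal sketch of this kind of normalization is shown below. The function name normalize_counter is hypothetical, and two details are assumptions rather than documented behaviour: the total is taken over all characters (before truncation), and the defaultdict uses float as its factory so that unseen characters read as 0.0.

```python
from collections import Counter, defaultdict


def normalize_counter(counter: Counter, max_dimensions: int) -> defaultdict:
    """Keep the most common entries and turn their counts into relative frequencies."""
    total = sum(counter.values())  # assumption: total over all characters
    normalized = defaultdict(float)  # assumption: missing characters default to 0.0
    for char, count in counter.most_common(max_dimensions):
        normalized[char] = count / total
    return normalized


normalized = normalize_counter(Counter("aaabbc"), max_dimensions=2)
print(dict(normalized))  # "c" is dropped because only two dimensions are kept
print(normalized["z"])   # unseen characters fall back to the default value
```

Returning a defaultdict means a later distance calculation can look up any character without first checking whether it survived the truncation.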

mltb2.text.clean_all_invisible_chars_and_strip(text: str) → str

Clean text from invisible characters and strip the text.

  • Remove invisible characters from text.

  • Replace special whitespaces with normal whitespaces.

  • Remove leading and trailing whitespaces.

The invisible characters are defined in the constant INVISIBLE_CHARACTERS. The special whitespaces are defined in the constant SPECIAL_WHITESPACES.

Parameters:

text (str) – The text to clean.

Returns:

The cleaned text.

Return type:

str

mltb2.text.clean_all_invisible_chars_and_whitespaces(text: str) → str

Clean text from invisible characters and whitespaces.

  • Remove invisible characters from text.

  • Replace special whitespaces with normal whitespaces.

  • Replace multiple whitespaces with single whitespace.

  • Remove leading and trailing whitespaces.

The invisible characters are defined in the constant INVISIBLE_CHARACTERS. The special whitespaces are defined in the constant SPECIAL_WHITESPACES.

Parameters:

text (str) – The text to clean.

Returns:

The cleaned text.

Return type:

str
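The four steps listed for this function can be sketched as a small pipeline. The sets INVISIBLE and SPECIAL_SPACES below are deliberately small, hypothetical stand-ins for the module constants INVISIBLE_CHARACTERS and SPECIAL_WHITESPACES, whose actual contents are not listed here, and clean_text is a hypothetical name. It is also an assumption that only runs of the plain space character are collapsed.

```python
import re

# Hypothetical, incomplete stand-ins for the module's constants.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"}  # zero-width chars, BOM, soft hyphen
SPECIAL_SPACES = {"\u00a0", "\u2003", "\u2009", "\u3000"}  # no-break, em, thin, ideographic space


def clean_text(text: str) -> str:
    # 1. Drop invisible characters.
    text = "".join(char for char in text if char not in INVISIBLE)
    # 2. Map special whitespace characters to a plain space.
    text = "".join(" " if char in SPECIAL_SPACES else char for char in text)
    # 3. Collapse runs of spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    # 4. Trim leading and trailing whitespace.
    return text.strip()


print(repr(clean_text("\ufeff  Hello\u00a0\u200b world \u2003")))  # 'Hello world'
```

The order matters in this sketch: special whitespace is mapped to plain spaces before collapsing, so that a mix such as a no-break space next to a normal space still ends up as a single space.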

mltb2.text.has_invisible_characters(text: str) → bool

Check if text contains invisible characters.

The invisible characters are defined in the constant INVISIBLE_CHARACTERS.

Parameters:

text (str) – The text to check.

Returns:

True if the text contains invisible characters, False otherwise.

Return type:

bool
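A membership check of this kind can be sketched as follows, for example to filter suspicious records out of a dataset. INVISIBLE is a hypothetical, incomplete stand-in for INVISIBLE_CHARACTERS, and contains_invisible is a hypothetical name.

```python
# Hypothetical, incomplete stand-in for the module constant.
INVISIBLE = frozenset({"\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"})


def contains_invisible(text: str) -> bool:
    """Return True as soon as one character of ``text`` is in the invisible set."""
    return any(char in INVISIBLE for char in text)


texts = ["plain text", "zero\u200bwidth", "soft\u00adhyphen"]
# Keep only the texts that need cleaning.
suspicious = [text for text in texts if contains_invisible(text)]
print([text.encode("unicode_escape").decode("ascii") for text in suspicious])
```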

mltb2.text.has_special_whitespaces(text: str) → bool

Check if text contains special whitespaces.

The special whitespaces are defined in the constant SPECIAL_WHITESPACES.

Parameters:

text (str) – The text to check.

Returns:

True if the text contains special whitespaces, False otherwise.

Return type:

bool

mltb2.text.has_xml_tag(text: str) → bool

Check if text contains XML tags (one or more).

These are some XML tags we detect:

  • <xml_tag>

  • <xml:tag>

  • </xml_tag>

  • <xml_tag/>

  • <xml_tag />

Comparisons in running text, such as a < b but x > y, are not detected as tags.

Parameters:

text (str) – The text to check.

Returns:

True if the text contains XML tags, False otherwise.

Return type:

bool
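One way to obtain the behaviour described above is a regular expression that requires a name character directly after the opening bracket. The pattern below is an assumption made for illustration, not the expression used by mltb2; looks_like_xml_tag is a hypothetical name, and tags with attributes are intentionally not covered because they are not among the listed examples.

```python
import re

# "<" or "</", a tag name (letters, digits, "_", ":", ".", "-"), optional space, optional "/", ">".
_TAG_PATTERN = re.compile(r"</?[A-Za-z_][\w:.-]*\s*/?>")


def looks_like_xml_tag(text: str) -> bool:
    """Return True if at least one tag-like token occurs in ``text``."""
    return _TAG_PATTERN.search(text) is not None


for sample in ["<xml_tag>", "<xml:tag>", "</xml_tag>", "<xml_tag/>", "<xml_tag />", "a < b but x > y"]:
    print(f"{sample!r}: {looks_like_xml_tag(sample)}")
```

Requiring a name character right after < (or </) is what keeps the comparison a < b but x > y from matching, because there the bracket is followed by a space.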

mltb2.text.remove_invisible_characters(text: str) → str

Remove invisible characters from text.

The invisible characters are defined in the constant INVISIBLE_CHARACTERS.

Parameters:

text (str) – The text from which the invisible characters are to be removed.

Returns:

The cleaned text.

Return type:

str

mltb2.text.replace_multiple_whitespaces(text: str) → str

Replace multiple whitespaces with single whitespace.

Parameters:

text (str) – The text in which multiple whitespaces are to be replaced.

Returns:

The cleaned text.

Return type:

str
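Whether "multiple whitespaces" means only runs of the plain space character or runs of any whitespace is not stated here. The sketch below shows both readings side by side; both function names are hypothetical and neither is claimed to match the library's behaviour.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse runs of the plain space character only."""
    return re.sub(r" {2,}", " ", text)


def collapse_any_whitespace(text: str) -> str:
    """Collapse runs of any whitespace, including tabs and newlines."""
    return re.sub(r"\s+", " ", text)


sample = "a  b\n\n c"
print(repr(collapse_spaces(sample)))          # 'a b\n\n c'
print(repr(collapse_any_whitespace(sample)))  # 'a b c'
```

The first variant preserves line structure, while the second flattens the text onto one line, so the choice depends on whether newlines carry meaning in the data.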

mltb2.text.replace_special_whitespaces(text: str) → str

Replace special whitespaces with normal whitespaces.

The special whitespaces are defined in the constant SPECIAL_WHITESPACES.

Parameters:

text (str) – The text in which the special whitespaces are to be replaced.

Returns:

The cleaned text.

Return type:

str
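A one-to-one replacement like this maps naturally onto str.translate(). The sketch below assumes a small, hypothetical subset of special whitespace characters; the actual contents of SPECIAL_WHITESPACES may differ, and to_plain_spaces is a hypothetical name.

```python
# Hypothetical subset: no-break, en, em, thin, narrow no-break and ideographic space.
SPECIAL_SPACES = "\u00a0\u2002\u2003\u2009\u202f\u3000"

# Translation table that maps every special whitespace character to a plain space.
_TO_SPACE = str.maketrans({char: " " for char in SPECIAL_SPACES})


def to_plain_spaces(text: str) -> str:
    """Replace each special whitespace character with a normal space."""
    return text.translate(_TO_SPACE)


print(repr(to_plain_spaces("10\u00a0km\u2003east")))  # '10 km east'
```

Because each character is replaced by exactly one space, the length of the text is unchanged, which keeps character offsets (for example from annotations) valid after cleaning.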