bs

Beautiful Soup and HTML-specific tools.

Hint

Use pip to install the necessary dependencies for this module: pip install mltb2[bs]

mltb2.bs.extract_all(soup: BeautifulSoup, name=None, attrs: dict | None = None, **kwargs: dict[str, Any]) → Any

Extract all specified elements from a BeautifulSoup object.

Parameters:
  • soup (BeautifulSoup) – The BeautifulSoup object to extract the elements from.

  • name – Name of the tag to extract.

  • attrs (dict | None) – Attributes of the tag to extract.

  • kwargs (dict[str, Any]) – Additional keyword arguments.

Returns:

The extracted BeautifulSoup elements.

Return type:

Any
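
To illustrate how name, attrs, and the additional keyword arguments select elements, here is a minimal sketch using plain bs4. It assumes, without confirmation from this page, that extract_all forwards its arguments to BeautifulSoup's find_all. The helper name find_all_sketch and the sample HTML are hypothetical.

```python
from typing import Any, Optional

from bs4 import BeautifulSoup

# Hypothetical sample document.
HTML = """
<ul>
  <li class="item">first</li>
  <li class="item">second</li>
  <li class="other">third</li>
</ul>
"""


def find_all_sketch(soup: BeautifulSoup, name=None, attrs: Optional[dict] = None, **kwargs: Any) -> Any:
    """Hypothetical stand-in: select elements by tag name and attributes.

    Assumption: the arguments are passed through to ``find_all``.
    """
    if attrs is None:
        attrs = {}
    return soup.find_all(name, attrs, **kwargs)


soup = BeautifulSoup(HTML, "html.parser")

# Select by tag name and attribute dictionary.
items = find_all_sketch(soup, "li", {"class": "item"})
texts = [tag.get_text() for tag in items]

# Extra keyword arguments are handled by find_all, e.g. class_.
others = find_all_sketch(soup, "li", class_="other")
```

Here texts contains the text of the two li elements with class "item", and others holds the single li element with class "other".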

mltb2.bs.extract_one(soup: BeautifulSoup, name=None, attrs: dict | None = None, **kwargs: dict[str, Any]) → Any

Extract exactly one specified element from a BeautifulSoup object.

This function expects exactly one result to be found; otherwise, a RuntimeError is raised.

Parameters:
  • soup (BeautifulSoup) – The BeautifulSoup object to extract the element from.

  • name – Name of the tag to extract.

  • attrs (dict | None) – Attributes of the tag to extract.

  • kwargs (dict[str, Any]) – Additional keyword arguments.

Returns:

The extracted BeautifulSoup element.

Raises:

RuntimeError – If not exactly one result is found.

Return type:

Any
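
The exactly-one contract described above can be sketched with plain bs4 as follows. This is not the mltb2 implementation; the helper name find_exactly_one, the error message, and the sample HTML are hypothetical, and the lookup via find_all is an assumption.

```python
from typing import Any, Optional

from bs4 import BeautifulSoup


def find_exactly_one(soup: BeautifulSoup, name=None, attrs: Optional[dict] = None, **kwargs: Any) -> Any:
    """Hypothetical stand-in: return the single match or raise RuntimeError."""
    result = soup.find_all(name, attrs or {}, **kwargs)
    if len(result) != 1:
        # Zero matches and multiple matches are both treated as errors.
        raise RuntimeError(f"Expected exactly one result, found {len(result)}")
    return result[0]


soup = BeautifulSoup("<h1>Title</h1><p>a</p><p>b</p>", "html.parser")

# Exactly one <h1> exists, so the element itself is returned.
title = find_exactly_one(soup, "h1")

# Two <p> elements exist, so the lookup fails.
try:
    find_exactly_one(soup, "p")
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

Raising instead of silently returning the first match makes changes in the page structure visible early, which is useful when scraping pages whose layout may change.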

mltb2.bs.extract_text(soup: BeautifulSoup, join_str: str | None = None) → str

Extract the text from a BeautifulSoup object.

Warning

This implementation has known issues with whitespace handling.

Parameters:
  • soup (BeautifulSoup) – The BeautifulSoup object to extract the text from.

  • join_str (str | None) – String to join the text parts with. By default, a space is used.

Returns:

Text from the BeautifulSoup object.

Return type:

str
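
The following sketch shows how joining text parts with join_str can affect whitespace. It assumes that the text parts are the whitespace-stripped strings of the document (bs4's stripped_strings); this page does not confirm that, and it does not say whether this is the issue the warning refers to. The helper name join_text_sketch is hypothetical.

```python
from typing import Optional

from bs4 import BeautifulSoup


def join_text_sketch(soup: BeautifulSoup, join_str: Optional[str] = None) -> str:
    """Hypothetical stand-in: join all stripped text parts with join_str."""
    if join_str is None:
        join_str = " "  # documented default: a space
    return join_str.join(soup.stripped_strings)


soup = BeautifulSoup("<p>Hello <b>World</b>!</p>", "html.parser")

# The three text parts are "Hello", "World" and "!".
joined = join_text_sketch(soup)
joined_lines = join_text_sketch(soup, join_str="\n")

# For comparison: bs4's own get_text() keeps the original spacing.
plain = soup.get_text()
```

Under this assumption, joined is "Hello World !" (with a space before the exclamation mark), while plain is "Hello World!".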

mltb2.bs.html_to_md(html: str, mdformat_options: dict | None = None) → str

Convert HTML to Markdown.

The default mdformat options are:

  • number=True: apply consecutive numbering to ordered lists

  • wrap="no": paragraph word wrap mode

  • end-of-line="lf": use LF as line ending

See also

The mdformat Options.

Parameters:
  • html (str) – HTML text.

  • mdformat_options (dict | None) – Options for mdformat.

Returns:

The Markdown text.

Return type:

str

mltb2.bs.remove_all(soup: BeautifulSoup, name=None, attrs: dict | None = None, **kwargs: dict[str, Any]) → None

Remove all specified elements from a BeautifulSoup object.

The removal is done in place. Nothing is returned.

Parameters:
  • soup (BeautifulSoup) – The BeautifulSoup object to remove the elements from.

  • name – Name of the tag(s) to remove.

  • attrs (dict | None) – Attributes of the tag(s) to remove.

  • kwargs (dict[str, Any]) – Additional keyword arguments.

Return type:

None
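
In-place removal can be sketched with plain bs4 as follows. This is an illustration, not the mltb2 implementation: the helper name remove_all_sketch is hypothetical, and the use of find_all together with decompose is an assumption.

```python
from typing import Any, Optional

from bs4 import BeautifulSoup


def remove_all_sketch(soup: BeautifulSoup, name=None, attrs: Optional[dict] = None, **kwargs: Any) -> None:
    """Hypothetical stand-in: remove every matching element from the tree."""
    for tag in soup.find_all(name, attrs or {}, **kwargs):
        # decompose() detaches the tag from the tree and destroys it.
        tag.decompose()


soup = BeautifulSoup(
    "<div><script>x()</script><p>Keep me</p><script>y()</script></div>",
    "html.parser",
)

# The soup object itself is modified; nothing is returned.
result = remove_all_sketch(soup, "script")
remaining = str(soup)
```

After the call, remaining is "<div><p>Keep me</p></div>" and result is None, matching the in-place behavior described above.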

mltb2.bs.soup_to_md(soup: BeautifulSoup, mdformat_options: dict | None = None) → str

Convert a BeautifulSoup object to Markdown.

The default mdformat options are:

  • number=True: apply consecutive numbering to ordered lists

  • wrap="no": paragraph word wrap mode

  • end-of-line="lf": use LF as line ending

See also

The mdformat Options.

Parameters:
  • soup (BeautifulSoup) – BeautifulSoup object.

  • mdformat_options (dict | None) – Options for mdformat.

Returns:

The Markdown text.

Return type:

str