md

Markdown-specific module.

Hint

Use pip to install the necessary dependencies for this module: pip install mltb2[md]

class mltb2.md.MdTextSplitter(max_token: int, transformers_token_counter: TransformersTokenCounter, show_progress_bar: bool = False)

Bases: object

Split Markdown text into sections with a specified maximum number of tokens.

Does not separate headings from their corresponding paragraphs.

Parameters:
  • max_token (int) – Maximum number of tokens per text section. A section exceeds this limit only if a single Markdown chunk is already larger than max_token.

  • transformers_token_counter (TransformersTokenCounter) – The token counter to be used.

  • show_progress_bar (bool) – Show a progress bar during processing.

__call__(md_text: str) → List[str]

Split the Markdown text into sections.

Parameters:

md_text (str) – The Markdown text to be split.

Returns:

The list of Markdown section splits.

Return type:

List[str]
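
The max_token rule described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name split_by_token_budget_sketch, the whitespace-based token counter, and the blank-line separator are all assumptions made for illustration. The sketch packs already-chunked Markdown greedily and shows why a section can exceed max_token only when a single chunk is larger than the limit.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Hypothetical stand-in for a real token counter: counts whitespace-separated words."""
    return len(text.split())


def split_by_token_budget_sketch(
    chunks: List[str],
    max_token: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Greedily pack chunks into sections of at most max_token tokens.

    Chunks are never cut, so a chunk that is larger than max_token
    ends up as a section of its own that exceeds the limit.
    """
    sections: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for chunk in chunks:
        n = count_tokens(chunk)
        # Close the running section if adding this chunk would exceed the budget.
        if current and current_tokens + n > max_token:
            sections.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += n
    if current:
        sections.append("\n\n".join(current))
    return sections


example_chunks = [
    "# A\none two",             # 4 whitespace tokens
    "# B\nthree",               # 3 whitespace tokens
    "# C\n" + "w " * 10,        # 12 whitespace tokens, larger than the budget
]
example_sections = split_by_token_budget_sketch(example_chunks, max_token=6)
```

In this sketch the first two chunks do not fit together into a budget of 6, so each becomes its own section, and the oversized third chunk is kept whole rather than cut.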

mltb2.md._chunk_md_by_headline(md_text: str) → List[str]

Chunk Markdown by headlines.

Parameters:

md_text (str) – The Markdown text to be chunked.

Returns:

The list of Markdown chunks.

Return type:

List[str]
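
As a rough illustration of chunking by headlines, the following sketch starts a new chunk at every heading line. It is not the library's code: the name chunk_by_headline_sketch and the heading rule (a line beginning with one to six # characters followed by whitespace) are assumptions, and the sketch does not handle # characters inside fenced code blocks.

```python
import re
from typing import List

# Assumption: a headline is an ATX-style heading line ("# Title", "## Sub", ...).
_HEADLINE = re.compile(r"^#{1,6}\s")


def chunk_by_headline_sketch(md_text: str) -> List[str]:
    """Split Markdown so that every headline line starts a new chunk."""
    chunks: List[str] = []
    current: List[str] = []
    for line in md_text.splitlines():
        # A headline closes the running chunk and opens a new one.
        if _HEADLINE.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that are empty after stripping whitespace.
    return [chunk for chunk in chunks if chunk]


headline_chunks = chunk_by_headline_sketch("intro\n\n# A\ntext a\n\n## B\ntext b")
```

Text before the first headline becomes a chunk of its own, and two consecutive headlines produce a chunk that contains only a headline.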

mltb2.md.chunk_md(md_text: str) → List[str]

Chunk Markdown by headlines and merge isolated headlines.

Merges isolated headlines with their corresponding subsequent paragraphs. Headings isolated at the end of md_text (headings without content) are removed in this process.

Parameters:

md_text (str) – The Markdown text to be chunked.

Returns:

The list of Markdown chunks.

Return type:

List[str]
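
The merge behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the names is_isolated_headline and merge_isolated_headlines_sketch are hypothetical, an isolated headline is assumed to be a chunk consisting of a single line starting with #, and merged parts are joined with a blank line.

```python
from typing import List


def is_isolated_headline(chunk: str) -> bool:
    """Assumption: a chunk is isolated if it is a single line that starts with '#'."""
    stripped = chunk.strip()
    return stripped.startswith("#") and "\n" not in stripped


def merge_isolated_headlines_sketch(chunks: List[str]) -> List[str]:
    """Attach isolated headlines to the next chunk that has content."""
    merged: List[str] = []
    pending: List[str] = []  # isolated headlines waiting for content
    for chunk in chunks:
        if is_isolated_headline(chunk):
            pending.append(chunk.strip())
        else:
            # Prepend all waiting headlines to this content-bearing chunk.
            merged.append("\n\n".join(pending + [chunk.strip()]))
            pending = []
    # Headlines still pending here have no content after them and are dropped.
    return merged


merged_chunks = merge_isolated_headlines_sketch(
    ["# Title", "## Intro\nSome text.", "## Empty"]
)
```

Here "# Title" is merged into the chunk that follows it, while "## Empty" sits at the end without content and is removed, matching the documented behavior for trailing headings.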