Splitter

The splitter step (also known as chunking) takes a long Markdown document (*.md) as input and returns smaller splits (or chunks) that can be processed more easily by an embedding model or language model. The splitter keeps the length of each output chunk below a defined threshold (token limit) and tries to split without breaking the document context, e.g., it splits only at the end of a sentence, never within one.
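
Token limits refer to model tokens, not characters. As an illustration (assuming tiktoken and the encoding that matches the default tokenizer_model documented below):

>>> import tiktoken
>>> enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
>>> len(enc.encode("A short example sentence.")) <= 256
True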

Semantic Splitter

Semantic document elements (e.g., headings) are repeated in subsequent chunks to preserve context.

Splitter implementation.

__init__(token_limit=256, token_limit_buffer=32, token_limit_min=64, sentence_splitter_model='de_core_news_sm', repeat_table_header_row=True, tokenizer_model='gpt-3.5-turbo')

Initializes the SemanticSplitter class with specified token limits and a sentence splitter model.

Parameters:
  • token_limit (int, default: 256 ) –

    The maximum number of tokens allowed. Defaults to 256.

  • token_limit_buffer (int, default: 32 ) –

    The buffer size for token limit to allow flexibility. Defaults to 32.

  • token_limit_min (int, default: 64 ) –

    The minimum number of tokens required. Defaults to 64.

  • sentence_splitter_model (str, default: 'de_core_news_sm' ) –

    The name of the sentence splitter model. Defaults to "de_core_news_sm".

  • repeat_table_header_row (bool, default: True ) –

If a table is split, the header row is repeated. Defaults to True.

  • tokenizer_model (str, default: 'gpt-3.5-turbo' ) –

    The name of the tokenizer model to use for encoding. Defaults to "gpt-3.5-turbo".

Raises:
  • OSError

If the specified sentence splitter model cannot be loaded.
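
A minimal construction sketch (the import path is an assumption; the spaCy model named by sentence_splitter_model must be installed first, e.g. via python -m spacy download de_core_news_sm, otherwise the OSError above is raised):

>>> splitter = SemanticSplitter()  # documented defaults
>>> splitter = SemanticSplitter(token_limit=512, token_limit_min=128)  # custom limits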

split_markdown_document(doc)

Split a Markdown document into snippets.
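
Usage sketch, continuing from the construction above (doc is assumed to be the long Markdown document described in the overview; the exact snippet type follows the library's definition):

>>> snippets = splitter.split_markdown_document(doc)
>>> # Each snippet stays below the configured token limit.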

text_sentences(text)

Split a text into sentences using a sentence splitter model.

This deliberately does not use a regex-based approach, since such approaches break very easily on punctuation; see: https://stackoverflow.com/a/61254146
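
For example (a sketch; with the default de_core_news_sm model, German input is assumed):

>>> sentences = splitter.text_sentences("Der erste Satz. Der zweite Satz!")
>>> # Expected: the two sentences, as segmented by the spaCy model.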

Table Splitter

For Markdown tables, custom logic preserves the table structure by repeating the header row whenever a split occurs within a table, so subsequent chunks retain the semantic table information from the header row. By default, tables are never broken in the middle of a row; if a single row exceeds the budget, it is split at column boundaries instead and the full header is repeated.
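
For illustration (a sketch of the documented behavior, not captured library output), consider a split inside this table:

>>> table = (
...     "| id | name |\n"
...     "| --- | ---- |\n"
...     "| 1 | alpha |\n"
...     "| 2 | beta |\n"
... )

With repeat_header_row=True, a split between the two data rows yields two chunks, and the second chunk starts again with the header row "| id | name |" and its separator row.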

A class to split markdown tables into token-bounded chunks.

This class encapsulates the logic for splitting large markdown tables while preserving table structure. Tables are never broken in the middle of a row; if a single row exceeds the max length, it is split at column boundaries instead and the full header is repeated.

Example:

>>> import tiktoken
>>> enc = tiktoken.get_encoding("cl100k_base")
>>> splitter = MarkdownTableSplitterUtil(token_limit=8000, enc=enc)
>>> chunks = splitter.split(markdown_text)
>>> len(chunks)
3
Parameters:
  • token_limit (int) –

    Maximum tokens per chunk (model tokens, not characters).

  • enc (Encoding) –

    Tokenizer used for counting tokens.

  • repeat_header_row (bool, default: True ) –

    If True, repeat the header row in each chunk. Defaults to True.

Attributes:
  • chunks (list[str]) –

    Completed chunks of markdown.

  • buf (list[str]) –

    Current buffer of lines.

  • buf_tok (int) –

    Current token count in buffer.

  • min_safety_token_limit (int) –

A safety threshold of 10 tokens that ensures the splitter can always fit at least a minimal table structure in a chunk.

__post_init__()

Validate configuration after initialization.

split(md)

Split a markdown document into token-bounded chunks while respecting tables.

Parameters:
  • md (str) –

    Markdown document.

Returns:
  • list[str] –

    Chunks whose token counts are <= token_limit.

Sentence Splitter

The semantic splitter avoids splitting within sentences; to achieve this, it relies on a sentence splitter, which takes a longer text as input and splits it into individual sentences. Different implementations are available.

Bases: SentenceSplitter

A sentence splitter based on regular expressions.

NOTE: Using the regex splitter is not recommended, since it is based on very simple heuristics.

Heuristics:
- Split after sentence-ending punctuation (. ! ? …) and any closing quotes/brackets.
- Only split if the next non-space token looks like a sentence start (capital letter or digit, optionally after an opening quote/paren).
- Merge back false positives caused by common abbreviations, initials, dotted acronyms (e.g., U.S.), decimals (e.g., 3.14), ordinals (No. 5), and ellipses.

Notes:
- Tweak self.abbreviations for your domain/corpus.
- For chatty/poetic text where sentences may start lowercase, relax self._split_re's lookahead (see comment in init).
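
A minimal standalone sketch of this heuristic (not the library's implementation; names and the abbreviation set are illustrative): split on whitespace after sentence-ending punctuation when the next token looks like a sentence start, then merge back splits that follow a known abbreviation.

import re

# Split on whitespace that follows sentence-ending punctuation, but only when
# the next token looks like a sentence start (optional opening quote/paren,
# then a capital letter or digit).
_SPLIT_RE = re.compile(r"(?<=[.!?…])\s+(?=[\"'(\[]?[A-ZÄÖÜ0-9])")

# Illustrative abbreviation set; tweak for your domain/corpus.
ABBREVIATIONS = ("e.g.", "i.e.", "No.", "U.S.")

def naive_sentences(text: str) -> list[str]:
    parts = _SPLIT_RE.split(text)
    merged: list[str] = []
    for part in parts:
        # Merge back false positives: the previous part ended with an abbreviation.
        if merged and merged[-1].endswith(ABBREVIATIONS):
            merged[-1] = merged[-1] + " " + part
        else:
            merged.append(part)
    return merged

For example, naive_sentences("Siehe No. 5. Danach folgt ein Satz.") keeps "No. 5" together while still splitting the two sentences.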

__init__()

Initialize a regex sentence splitter (compile regex, set abbreviations).

get_sentences(text)

Split text into sentences.

Bases: SentenceSplitter

Adapter for the spaCy sentence splitter.

__init__(nlp)

Initialize a SpacySentenceSplitter.

Parameters:
  • nlp (Language) –

A spaCy model from spacy.load().

get_sentences(text)

Split text into sentences.
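
Usage sketch (assuming the de_core_news_sm model is installed; the exact return formatting follows the adapter):

>>> import spacy
>>> nlp = spacy.load("de_core_news_sm")
>>> splitter = SpacySentenceSplitter(nlp)
>>> sentences = splitter.get_sentences("Der erste Satz. Der zweite Satz!")
>>> # Expected: two sentences, as segmented by the spaCy model.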