
Splitter

The splitter step (also known as chunking) takes a long Markdown document (*.md) as input and returns smaller splits (or chunks) that can be processed more easily by an embedding model or language model. The splitter keeps the length of the output chunks below a defined threshold (token limit) and tries to split without breaking the document context, e.g., it splits only at the end of a sentence, never within one.

Semantic Splitter

Semantic document elements (e.g., headings) are repeated in subsequent chunks so that each chunk retains its context.

SemanticSplitter

Splitter implementation.

Functions

__init__(token_limit=256, token_limit_buffer=32, token_limit_min=64, sentence_splitter_model='de_core_news_sm', repeat_table_header_row=True, tokenizer_model='gpt-3.5-turbo')

Initializes the SemanticSplitter class with specified token limits and a sentence splitter model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| token_limit | int | The maximum number of tokens allowed. | 256 |
| token_limit_buffer | int | The buffer size for the token limit, allowing some flexibility. | 32 |
| token_limit_min | int | The minimum number of tokens required. | 64 |
| sentence_splitter_model | str | The name of the sentence splitter model. | 'de_core_news_sm' |
| repeat_table_header_row | bool | If a table is split, the header row is repeated. | True |
| tokenizer_model | str | The name of the tokenizer model to use for encoding. | 'gpt-3.5-turbo' |

Raises:

| Type | Description |
| --- | --- |
| OSError | If the specified sentence splitter cannot be loaded. |

split_markdown_document(doc)

Split a Markdown Document into Snippets.
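As a rough usage sketch (the import path below is an assumption, and split_markdown_document may expect the package's document object rather than a plain string):

```python
# Usage sketch only; the module path is assumed, and
# split_markdown_document may expect a document object rather than a str.
from wurzel.steps.splitter import SemanticSplitter  # assumed import path

markdown_text = "# Heading\n\nSome long Markdown content ..."

splitter = SemanticSplitter(
    token_limit=256,        # hard upper bound per chunk
    token_limit_buffer=32,  # flexibility around the limit
    token_limit_min=64,     # avoid producing tiny fragments
    sentence_splitter_model="de_core_news_sm",
    tokenizer_model="gpt-3.5-turbo",
)

chunks = splitter.split_markdown_document(markdown_text)
for chunk in chunks:
    print(len(chunk), repr(chunk[:40]))
```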

text_sentences(text)

Split a text into sentences using a sentence splitter model.

This deliberately avoids a regex-based approach, since regex heuristics break easily on punctuation; see https://stackoverflow.com/a/61254146.
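A small illustration of how the model-based splitting behaves (a sketch only; it assumes a SemanticSplitter constructed as above and that text_sentences returns a list of strings):

```python
# Sketch only: assumes `splitter` is the SemanticSplitter from the sketch above
# and that text_sentences returns the individual sentences as a list of strings.
sentences = splitter.text_sentences(
    "Dr. Müller kam um 15 Uhr an. Danach begann das Meeting."
)
# A model-based splitter should keep the abbreviation "Dr." attached to the
# first sentence instead of treating its period as a sentence boundary.
print(sentences)
```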

Table Splitter

For Markdown tables, custom logic preserves the table structure by repeating the header row whenever a split occurs within a table, so subsequent chunks retain the semantic table information carried by the header row. By default, tables are never broken in the middle of a row; if a single row exceeds the token budget, it is split at column boundaries instead and the full header is repeated.

MarkdownTableSplitterUtil dataclass

A class to split markdown tables into token-bounded chunks.

This class encapsulates the logic for splitting large markdown tables while preserving table structure. Tables are never broken in the middle of a row; if a single row exceeds the max length, it is split at column boundaries instead and the full header is repeated.

Example:

```python
>>> from wurzel.utils.tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_name("cl100k_base")
>>> splitter = MarkdownTableSplitterUtil(token_limit=8000, tokenizer=tokenizer)
>>> chunks = splitter.split(markdown_text)
>>> len(chunks)
3
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| token_limit | int | Maximum tokens per chunk (model tokens, not characters). | required |
| tokenizer | Tokenizer | Tokenizer used for counting tokens. | required |
| repeat_header_row | bool | If True, repeat the header row in each chunk. | True |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| chunks | list[str] | Completed chunks of markdown. |
| buf | list[str] | Current buffer of lines. |
| buf_tok | int | Current token count in buffer. |
| min_safety_token_limit | int | A 10-token safety threshold that ensures the splitter can always fit at least a minimal table structure in a chunk. |

Functions

__post_init__()

Validate configuration after initialization.

split(md)

Split a markdown document into token-bounded chunks while respecting tables.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| md | str | Markdown document. | required |

Returns:

| Type | Description |
| --- | --- |
| list[str] | Chunks whose token counts are <= token_limit. |
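To see the header repetition in practice, a sketch along the lines of the class example above; the import path of MarkdownTableSplitterUtil is assumed, the small token limit is chosen only to force a split, and the exact chunk boundaries depend on the tokenizer:

```python
from wurzel.utils.tokenizers import Tokenizer
from wurzel.utils.splitters import MarkdownTableSplitterUtil  # assumed import path

table = (
    "| Country | Capital |\n"
    "| --- | --- |\n"
    "| Germany | Berlin |\n"
    "| France | Paris |\n"
    "| Spain | Madrid |\n"
)

tokenizer = Tokenizer.from_name("cl100k_base")
splitter = MarkdownTableSplitterUtil(token_limit=20, tokenizer=tokenizer)

for chunk in splitter.split(table):
    # With repeat_header_row=True (the default), every chunk should begin
    # with the "| Country | Capital |" header and its separator row.
    print(chunk)
    print("----")
```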

Sentence Splitter

The semantic splitter avoids splitting within sentences; to achieve this, it relies on a sentence splitter. A sentence splitter takes longer text as input and splits it into individual sentences. Several implementations are available.

RegexSentenceSplitter

Bases: SentenceSplitter

A sentence splitter based on regular expressions.

NOTE: Using the regex splitter is not recommended since it is based on very simple heuristics.

Heuristics:

- Split after sentence-ending punctuation (. ! ? …) and any closing quotes/brackets.
- Only split if the next non-space token looks like a sentence start (capital letter or digit, optionally after an opening quote/paren).
- Merge back false positives caused by common abbreviations, initials, dotted acronyms (e.g., U.S.), decimals (e.g., 3.14), ordinals (No. 5), and ellipses.

Notes:

- Tweak self.abbreviations for your domain/corpus.
- For chatty/poetic text where sentences may start lowercase, relax self._split_re's lookahead (see comment in __init__).

Functions

__init__()

Initialize a regex sentence splitter (compile regex, set abbreviations).

get_sentences(text)

Split text into sentences.
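A short sketch of the regex splitter in use (import path assumed); the abbreviation and decimal heuristics described above are what keep the example below from over-splitting:

```python
from wurzel.utils.sentence_splitter import RegexSentenceSplitter  # assumed import path

splitter = RegexSentenceSplitter()
sentences = splitter.get_sentences(
    "Prices rose to 3.14 EUR in the U.S. market. No. 5 was sold out!"
)
# The heuristics are meant to keep "3.14", "U.S." and "No. 5" intact
# instead of splitting after every period.
print(sentences)
```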

SpacySentenceSplitter

Bases: SentenceSplitter

Adapter for the spaCy sentence splitter.

Functions

__init__(nlp)

Initialize a SpacySentenceSplitter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| nlp | Language | A spaCy model from spacy.load(). | required |
get_sentences(text)

Split text into sentences.
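A minimal sketch, assuming the class is importable from the package's sentence splitter module and that a spaCy pipeline such as de_core_news_sm is installed:

```python
import spacy

from wurzel.utils.sentence_splitter import SpacySentenceSplitter  # assumed import path

nlp = spacy.load("de_core_news_sm")  # any spaCy pipeline with sentence boundaries
splitter = SpacySentenceSplitter(nlp)
print(splitter.get_sentences("Das ist ein Test. Das ist noch ein Test."))
```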

SaTSentenceSplitter

Bases: SentenceSplitter

Adapter for wtpsplit's SaT sentence splitter.

SaT (Segment any Text) is a state-of-the-art sentence splitter. Depending on the selected model, you may want to use a GPU for faster inference.

Available models and benchmark results: https://github.com/segment-any-text/wtpsplit

Example usage:

```python
splitter = SentenceSplitter.from_name("sat-3l")
splitter.get_sentences("This is a test This is another test.")
# returns ["This is a test ", "This is another test."]
```

Functions

__init__(model_name_or_model)

Initialize a SaTSentenceSplitter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name_or_model | str | A string or Path (Hugging Face ID or local directory path). | required |
get_sentences(text)

Split text into sentences.
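Besides the SentenceSplitter.from_name factory shown above, the class can presumably also be constructed directly; a sketch (import path assumed, and the local directory shown is purely hypothetical):

```python
from wurzel.utils.sentence_splitter import SaTSentenceSplitter  # assumed import path

# Construct from a model name / Hugging Face ID ...
splitter = SaTSentenceSplitter("sat-3l")
# ... or from a local directory containing a downloaded model, e.g.:
# splitter = SaTSentenceSplitter(Path("/models/sat-3l"))  # hypothetical path

print(splitter.get_sentences("This is a test This is another test."))
```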