Splitter¶
The splitter step (also known as chunking) takes a long Markdown document (*.md) as input and returns smaller splits (or chunks) that can be processed more easily by an embedding model or language model. The splitter keeps the length of the output chunks below a defined threshold (token limit) and tries to split without breaking the document context, e.g., it splits only at the end of a sentence, never within one.
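The token limit is measured in model tokens, not characters. As a minimal illustration of what gets counted (assuming the tiktoken package, which provides the gpt-3.5-turbo encoding used by the default tokenizer below):

```python
import tiktoken

# Encoding used by the default tokenizer model ("gpt-3.5-turbo").
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "The splitter keeps each chunk below a configurable token limit."
print(len(encoding.encode(text)))  # chunk budgets compare against this count
```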
Semantic Splitter¶
When a section is split, semantic document elements (e.g., headings) are repeated in each resulting chunk so that the context is preserved.
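An illustrative sketch of this behavior (exact chunk boundaries and formatting may differ):

```text
Chunk 1:
## Setup
First part of a long section ...

Chunk 2:
## Setup
... continuation of the same section.
```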
SemanticSplitter¶
Splitter implementation.
Functions¶
__init__(token_limit=256, token_limit_buffer=32, token_limit_min=64, sentence_splitter_model='de_core_news_sm', repeat_table_header_row=True, tokenizer_model='gpt-3.5-turbo')¶
Initializes the SemanticSplitter class with specified token limits and a sentence splitter model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token_limit | int | The maximum number of tokens allowed. Defaults to 256. | 256 |
token_limit_buffer | int | The buffer size for token limit to allow flexibility. Defaults to 32. | 32 |
token_limit_min | int | The minimum number of tokens required. Defaults to 64. | 64 |
sentence_splitter_model | str | The name of the sentence splitter model. Defaults to "de_core_news_sm". | 'de_core_news_sm' |
repeat_table_header_row | bool | If a table is split, the header row is repeated in each chunk. Defaults to True. | True |
tokenizer_model | str | The name of the tokenizer model to use for encoding. Defaults to "gpt-3.5-turbo". | 'gpt-3.5-turbo' |
Raises:
Type | Description |
---|---|
OSError | If the specified sentence splitter cannot be loaded. |
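A hedged construction sketch (the import path is an assumption; adjust it to where SemanticSplitter lives in your installation). The default sentence splitter model de_core_news_sm is a spaCy model that has to be installed separately; otherwise the OSError above is raised:

```python
from wurzel.utils import SemanticSplitter  # import path is an assumption

try:
    splitter = SemanticSplitter(
        token_limit=256,
        token_limit_buffer=32,
        token_limit_min=64,
        sentence_splitter_model="de_core_news_sm",
    )
except OSError:
    # The sentence splitter model could not be loaded; install it first:
    #   python -m spacy download de_core_news_sm
    raise
```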
split_markdown_document(doc)¶
Split a Markdown document into snippets.
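A hedged usage sketch; that doc is a plain Markdown string is an assumption here, the concrete document type may differ:

```python
doc = "# Title\n\nA long Markdown document ..."  # assumed plain-string input
snippets = splitter.split_markdown_document(doc)
for snippet in snippets:
    print(snippet)  # each snippet respects the configured token limits
```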
text_sentences(text)¶
Split a text into sentences using a sentence splitter model.
This deliberately does not use a regex-based approach, as such approaches break easily on punctuation; see: https://stackoverflow.com/a/61254146
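A hedged sketch (German sample text matching the default de_core_news_sm model; a naive regex would already split after the abbreviation "Dr."):

```python
sentences = splitter.text_sentences("Dr. Müller kam um 9 Uhr. Danach begann das Meeting.")
print(sentences)
# Expected: two sentences; the period after "Dr." does not trigger a split.
```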
Table Splitter¶
For Markdown tables, custom logic preserves the table structure: if a split occurs within a table, the header row is repeated (by default), so subsequent chunks retain the semantic table information from the header row. Tables are never broken in the middle of a row; if a single row exceeds the budget, it is split at column boundaries instead and the full header is repeated.
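An illustrative sketch of the header repetition (chunk boundaries depend on the configured token limit):

```text
Input table:
| Name | Value |
|------|-------|
| a    | 1     |
| b    | 2     |

Chunk 1:
| Name | Value |
|------|-------|
| a    | 1     |

Chunk 2:
| Name | Value |
|------|-------|
| b    | 2     |
```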
MarkdownTableSplitterUtil dataclass¶
A class to split markdown tables into token-bounded chunks.
This class encapsulates the logic for splitting large markdown tables while preserving table structure. Tables are never broken in the middle of a row; if a single row exceeds the max length, it is split at column boundaries instead and the full header is repeated.
Example:
>>> from wurzel.utils.tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_name("cl100k_base")
>>> splitter = MarkdownTableSplitterUtil(token_limit=8000, tokenizer=tokenizer)
>>> chunks = splitter.split(markdown_text)
>>> len(chunks)
3
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token_limit | int | Maximum tokens per chunk (model tokens, not characters). | required |
tokenizer | Tokenizer | Tokenizer used for counting tokens. | required |
repeat_header_row | bool | If True, repeat the header row in each chunk. Defaults to True. | True |
Attributes:
Name | Type | Description |
---|---|---|
chunks | list[str] | Completed chunks of markdown. |
buf | list[str] | Current buffer of lines. |
buf_tok | int | Current token count in buffer. |
min_safety_token_limit | int | Safety threshold of 10 tokens ensuring the splitter can always fit at least a minimal table structure in a chunk. |
Functions¶
__post_init__()¶
Validate configuration after initialization.
split(md)¶
Split a markdown document into token-bounded chunks while respecting tables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md | str | Markdown document. | required |
Returns:
Type | Description |
---|---|
list[str] | Chunks whose token counts are <= token_limit. |
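A hedged usage note, continuing the example above (that the Tokenizer exposes an encode method for counting is an assumption; the exact API may differ):

```python
chunks = splitter.split(markdown_text)
# Every chunk should respect the configured budget.
assert all(len(tokenizer.encode(chunk)) <= 8000 for chunk in chunks)
```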
Sentence Splitter¶
The semantic splitter avoids splitting within sentences; to achieve this, it relies on a sentence splitter. The sentence splitter takes longer text as input and splits it into individual sentences. Several implementations are available.
RegexSentenceSplitter¶
Bases: SentenceSplitter
A sentence splitter based on regular expressions.
NOTE: Using the regex splitter is not recommended since it is based on very simple heuristics.
Heuristics:

- Split after sentence-ending punctuation (. ! ? …) and any closing quotes/brackets.
- Only split if the next non-space token looks like a sentence start (capital letter or digit, optionally after an opening quote/paren).
- Merge back false positives caused by common abbreviations, initials, dotted acronyms (e.g., U.S.), decimals (e.g., 3.14), ordinals (No. 5), and ellipses.
Notes:

- Tweak self.abbreviations for your domain/corpus.
- For chatty/poetic text where sentences may start lowercase, relax self._split_re's lookahead (see comment in init).
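A simplified, self-contained sketch of the split-then-merge heuristic described above (illustrative only; the real implementation handles more cases, e.g., closing quotes, initials, and decimals):

```python
import re

# Split after sentence-ending punctuation, but only when the next
# non-space token looks like a sentence start (handling of closing
# quotes/brackets is omitted in this simplification).
_SPLIT_RE = re.compile(r"(?<=[.!?…])\s+(?=['\"(\[]?[A-ZÄÖÜ0-9])")

ABBREVIATIONS = ("Dr.", "e.g.", "No.", "U.S.")  # tweak for your corpus

def naive_sentences(text: str) -> list[str]:
    parts = _SPLIT_RE.split(text)
    merged: list[str] = []
    for part in parts:
        # Merge back false positives caused by known abbreviations.
        if merged and merged[-1].endswith(ABBREVIATIONS):
            merged[-1] = f"{merged[-1]} {part}"
        else:
            merged.append(part)
    return merged

print(naive_sentences("Dr. Müller kam. Er ging."))
# ['Dr. Müller kam.', 'Er ging.']
```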
SpacySentenceSplitter¶
Bases: SentenceSplitter
A sentence splitter based on the spaCy NLP library.
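A hedged sketch of the underlying spaCy approach (using the de_core_news_sm default from above; the model must be downloaded first):

```python
import spacy

# Requires: python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

doc = nlp("Dr. Müller kam um 9 Uhr. Danach begann das Meeting.")
print([sent.text for sent in doc.sents])
```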
SaTSentenceSplitter¶
Bases: SentenceSplitter
Adapter for wtpsplit's SaT sentence splitter.
SaT (Segment any Text) is a state-of-the-art sentence splitter. Depending on the selected model you may want to use a GPU for faster inference.
Available models and benchmark results: https://github.com/segment-any-text/wtpsplit
Example usage:
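A minimal sketch of the underlying wtpsplit library, which this adapter wraps (the model name sat-3l-sm is an assumption; see the benchmark link above for available models):

```python
from wtpsplit import SaT

# Load a SaT checkpoint; "sat-3l-sm" is a small published model.
sat = SaT("sat-3l-sm")

# Optionally move the model to a GPU for faster inference:
# sat.half().to("cuda")

print(sat.split("This is a test This is another test."))
# ['This is a test ', 'This is another test.']
```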