Splitter¶
The splitter step (also known as chunking) takes a long Markdown document (*.md) as input and returns smaller splits (or chunks) that can be processed more easily by an embedding model or language model. The splitter keeps the length of the output chunks below a defined threshold (token limit) and tries to split without breaking the document context, e.g., it splits only at the end of a sentence, never within one.
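The token limit is measured in model tokens, not characters. As a minimal illustration of what gets counted (assuming the tiktoken package, which provides the gpt-3.5-turbo encoding used by the default tokenizer below):

```python
import tiktoken

# Encoding used by the default tokenizer model ("gpt-3.5-turbo").
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "The splitter keeps each chunk below a configurable token limit."
print(len(encoding.encode(text)))  # chunk budgets compare against this count
```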
Semantic Splitter¶
When a section is split, semantic document elements (e.g., headings) are repeated in each resulting chunk so that the context is preserved.
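An illustrative sketch of this behavior (exact chunk boundaries and formatting may differ):

```text
Chunk 1:
## Setup
First part of a long section ...

Chunk 2:
## Setup
... continuation of the same section.
```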
SemanticSplitter¶
Splitter implementation.
Functions¶
__init__(token_limit=256, token_limit_buffer=32, token_limit_min=64, sentence_splitter_model='de_core_news_sm', repeat_table_header_row=True, tokenizer_model='gpt-3.5-turbo')¶
Initializes the SemanticSplitter class with specified token limits and a sentence splitter model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token_limit | int | The maximum number of tokens allowed. Defaults to 256. | 256 |
token_limit_buffer | int | The buffer size for token limit to allow flexibility. Defaults to 32. | 32 |
token_limit_min | int | The minimum number of tokens required. Defaults to 64. | 64 |
sentence_splitter_model | str | The name of the sentence splitter model. Defaults to "de_core_news_sm". | 'de_core_news_sm' |
repeat_table_header_row | bool | If a table is split, the header row is repeated in each chunk. Defaults to True. | True |
tokenizer_model | str | The name of the tokenizer model to use for encoding. Defaults to "gpt-3.5-turbo". | 'gpt-3.5-turbo' |
Raises:
Type | Description |
---|---|
OSError | If the specified sentence splitter cannot be loaded. |
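A hedged construction sketch (the import path is an assumption; adjust it to where SemanticSplitter lives in your installation). The default sentence splitter model de_core_news_sm is a spaCy model that has to be installed separately; otherwise the OSError above is raised:

```python
from wurzel.utils import SemanticSplitter  # import path is an assumption

try:
    splitter = SemanticSplitter(
        token_limit=256,
        token_limit_buffer=32,
        token_limit_min=64,
        sentence_splitter_model="de_core_news_sm",
    )
except OSError:
    # The sentence splitter model could not be loaded; install it first:
    #   python -m spacy download de_core_news_sm
    raise
```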
split_markdown_document(doc)¶
Split a Markdown document into snippets.
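A hedged usage sketch; that doc is a plain Markdown string is an assumption here, the concrete document type may differ:

```python
doc = "# Title\n\nA long Markdown document ..."  # assumed plain-string input
snippets = splitter.split_markdown_document(doc)
for snippet in snippets:
    print(snippet)  # each snippet respects the configured token limits
```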
text_sentences(text)¶
Split a text into sentences using a sentence splitter model.
This deliberately does not use a regex-based approach, as such approaches break easily on punctuation; see: https://stackoverflow.com/a/61254146
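A hedged sketch (German sample text matching the default de_core_news_sm model; a naive regex would already split after the abbreviation "Dr."):

```python
sentences = splitter.text_sentences("Dr. Müller kam um 9 Uhr. Danach begann das Meeting.")
print(sentences)
# Expected: two sentences; the period after "Dr." does not trigger a split.
```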
Table Splitter¶
For Markdown tables, custom logic preserves the table structure: if a split occurs within a table, the header row is repeated (by default), so subsequent chunks retain the semantic table information from the header row. Tables are never broken in the middle of a row; if a single row exceeds the budget, it is split at column boundaries instead and the full header is repeated.
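An illustrative sketch of the header repetition (chunk boundaries depend on the configured token limit):

```text
Input table:
| Name | Value |
|------|-------|
| a    | 1     |
| b    | 2     |

Chunk 1:
| Name | Value |
|------|-------|
| a    | 1     |

Chunk 2:
| Name | Value |
|------|-------|
| b    | 2     |
```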
MarkdownTableSplitterUtil dataclass¶
A class to split markdown tables into token-bounded chunks.
This class encapsulates the logic for splitting large markdown tables while preserving table structure. Tables are never broken in the middle of a row; if a single row exceeds the max length, it is split at column boundaries instead and the full header is repeated.
Example:
>>> from wurzel.utils.tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_name("cl100k_base")
>>> splitter = MarkdownTableSplitterUtil(token_limit=8000, tokenizer=tokenizer)
>>> chunks = splitter.split(markdown_text)
>>> len(chunks)
3
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token_limit | int | Maximum tokens per chunk (model tokens, not characters). | required |
tokenizer | Tokenizer | Tokenizer used for counting tokens. | required |
repeat_header_row | bool | If True, repeat the header row in each chunk. Defaults to True. | True |
Attributes:
Name | Type | Description |
---|---|---|
chunks | list[str] | Completed chunks of markdown. |
buf | list[str] | Current buffer of lines. |
buf_tok | int | Current token count in buffer. |
min_safety_token_limit | int | Safety threshold of 10 tokens ensuring the splitter can always fit at least a minimal table structure in a chunk. |
Functions¶
__post_init__()¶
Validate configuration after initialization.
split(md)¶
Split a markdown document into token-bounded chunks while respecting tables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md | str | Markdown document. | required |
Returns:
Type | Description |
---|---|
list[str] | Chunks whose token counts are <= token_limit. |
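A hedged usage note, continuing the example above (that the Tokenizer exposes an encode method for counting is an assumption; the exact API may differ):

```python
chunks = splitter.split(markdown_text)
# Every chunk should respect the configured budget.
assert all(len(tokenizer.encode(chunk)) <= 8000 for chunk in chunks)
```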
Sentence Splitter¶
The semantic splitter avoids splitting within sentences; to achieve this, it relies on a sentence splitter. The sentence splitter takes longer text as input and splits it into individual sentences. Several implementations are available.
RegexSentenceSplitter¶
Bases: SentenceSplitter
A sentence splitter based on regular expressions.
NOTE: Using the regex splitter is not recommended since it is based on very simple heuristics.
Heuristics:

- Split after sentence-ending punctuation (. ! ? …) and any closing quotes/brackets.
- Only split if the next non-space token looks like a sentence start (capital letter or digit, optionally after an opening quote/paren).
- Merge back false positives caused by common abbreviations, initials, dotted acronyms (e.g., U.S.), decimals (e.g., 3.14), ordinals (No. 5), and ellipses.
Notes:

- Tweak self.abbreviations for your domain/corpus.
- For chatty/poetic text where sentences may start lowercase, relax self._split_re's lookahead (see comment in init).
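A simplified, self-contained sketch of the split-then-merge heuristic described above (illustrative only; the real implementation handles more cases, e.g., closing quotes, initials, and decimals):

```python
import re

# Split after sentence-ending punctuation, but only when the next
# non-space token looks like a sentence start (handling of closing
# quotes/brackets is omitted in this simplification).
_SPLIT_RE = re.compile(r"(?<=[.!?…])\s+(?=['\"(\[]?[A-ZÄÖÜ0-9])")

ABBREVIATIONS = ("Dr.", "e.g.", "No.", "U.S.")  # tweak for your corpus

def naive_sentences(text: str) -> list[str]:
    parts = _SPLIT_RE.split(text)
    merged: list[str] = []
    for part in parts:
        # Merge back false positives caused by known abbreviations.
        if merged and merged[-1].endswith(ABBREVIATIONS):
            merged[-1] = f"{merged[-1]} {part}"
        else:
            merged.append(part)
    return merged

print(naive_sentences("Dr. Müller kam. Er ging."))
# ['Dr. Müller kam.', 'Er ging.']
```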
SpacySentenceSplitter¶
Bases: SentenceSplitter
A sentence splitter based on the spaCy NLP library.
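A hedged sketch of the underlying spaCy approach (using the de_core_news_sm default from above; the model must be downloaded first):

```python
import spacy

# Requires: python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

doc = nlp("Dr. Müller kam um 9 Uhr. Danach begann das Meeting.")
print([sent.text for sent in doc.sents])
```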
SaTSentenceSplitter¶
Bases: SentenceSplitter
Adapter for wtpsplit's SaT sentence splitter.
SaT (Segment any Text) is a state-of-the-art sentence splitter. Depending on the selected model you may want to use a GPU for faster inference.
Available models and benchmark results: https://github.com/segment-any-text/wtpsplit
Example usage:
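A minimal sketch of the underlying wtpsplit library, which this adapter wraps (the model name sat-3l-sm is an assumption; see the benchmark link above for available models):

```python
from wtpsplit import SaT

# Load a SaT checkpoint; "sat-3l-sm" is a small published model.
sat = SaT("sat-3l-sm")

# Optionally move the model to a GPU for faster inference:
# sat.half().to("cuda")

print(sat.split("This is a test This is another test."))
# ['This is a test ', 'This is another test.']
```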