Splitter
The splitter step (also known as chunking) takes a long Markdown document (*.md) as input and returns smaller splits (or chunks) that can be processed more easily by an embedding model or language model.
The splitter keeps the length of the output chunks below a defined threshold (token limit) and tries to split without breaking the document context, e.g., it splits only at the end of a sentence, never within one.
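As an illustration, a hedged usage sketch: the import path is hypothetical and the snippets are assumed to be plain strings; the class, method, and parameter names are taken from the API reference below.

```python
import tiktoken

# Hypothetical import path; adjust to wherever SemanticSplitter lives.
from splitter import SemanticSplitter

long_markdown = "# Title\n\nSome long Markdown text ...\n"

splitter = SemanticSplitter(token_limit=256)
chunks = splitter.split_markdown_document(long_markdown)

# Every chunk should stay below the configured token limit
# (assuming the returned snippets are plain strings).
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
assert all(len(enc.encode(chunk)) <= 256 for chunk in chunks)
```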
Semantic Splitter
Semantic document elements (e.g., headings) are repeated in subsequent chunks.
Splitter implementation.
__init__(token_limit=256, token_limit_buffer=32, token_limit_min=64, sentence_splitter_model='de_core_news_sm', repeat_table_header_row=True, tokenizer_model='gpt-3.5-turbo')
Initializes the SemanticSplitter class with specified token limits and a sentence splitter model.
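A construction sketch with some defaults overridden; the comments reflect a reading of the signature and the surrounding docs, not authoritative parameter documentation:

```python
splitter = SemanticSplitter(
    token_limit=512,                            # upper bound per chunk (default: 256)
    sentence_splitter_model="de_core_news_sm",  # spaCy model for sentence boundaries
    repeat_table_header_row=True,               # repeat table headers in follow-up chunks
    tokenizer_model="gpt-3.5-turbo",            # tokenizer used for counting tokens
)
```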
split_markdown_document(doc)
Split a Markdown document into snippets.
text_sentences(text)
Split a text into sentences using a sentence splitter model.
This intentionally avoids a regex-based approach, since regex-based splitters break very easily on punctuation; see: https://stackoverflow.com/a/61254146
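For reference, sentence splitting with the default model works roughly like this (de_core_news_sm is spaCy's small German pipeline and has to be downloaded once):

```python
import spacy  # pip install spacy && python -m spacy download de_core_news_sm

nlp = spacy.load("de_core_news_sm")
doc = nlp("Dr. Müller kam um 9 Uhr. Er brachte ca. 3,5 kg Unterlagen mit.")

# The pipeline marks sentence boundaries; doc.sents iterates over them.
for sent in doc.sents:
    print(sent.text)
```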
Table Splitter
For Markdown tables, custom logic preserves the table structure by repeating the header row whenever a split occurs within a table, so that subsequent chunks retain the semantic table information from the header row. By default, tables are never broken in the middle of a row; if a single row exceeds the token limit, it is split at column boundaries instead and the full header is repeated.
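The header-repetition idea can be sketched as follows. This is an illustrative simplification, not the library's implementation; in particular, the column-boundary fallback for oversized rows is omitted:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def split_table(table_md: str, token_limit: int) -> list[str]:
    """Split a Markdown table into chunks, repeating the header in every chunk."""
    header, separator, *rows = table_md.strip().splitlines()
    chunks: list[str] = []
    current = [header, separator]
    for row in rows:
        candidate = "\n".join(current + [row])
        # Flush the current chunk once adding the row would exceed the limit.
        if len(enc.encode(candidate)) > token_limit and len(current) > 2:
            chunks.append("\n".join(current))
            current = [header, separator, row]  # next chunk starts with the header again
        else:
            current.append(row)
    chunks.append("\n".join(current))
    return chunks
```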
A class to split Markdown tables into token-bounded chunks.
This class encapsulates the logic for splitting large Markdown tables while preserving table structure. Tables are never broken in the middle of a row; if a single row exceeds the token limit, it is split at column boundaries instead and the full header is repeated.
Example:
>>> import tiktoken
>>> enc = tiktoken.get_encoding("cl100k_base")
>>> splitter = MarkdownTableSplitterUtil(token_limit=8000, enc=enc)
>>> chunks = splitter.split(markdown_text)
>>> len(chunks)
3
__post_init__()
Validate configuration after initialization.
split(md)
Split a markdown document into token-bounded chunks while respecting tables.
Sentence Splitter
The semantic splitter avoids splitting within sentences; to achieve this, it relies on a sentence splitter, which takes a longer text as input and splits it into individual sentences. Several implementations are available.
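The common interface is not shown on this page; judging from the base class and method names below, it presumably looks roughly like this (the exact signature and return type are assumptions):

```python
from abc import ABC, abstractmethod

class SentenceSplitter(ABC):
    """Common interface for sentence splitter implementations."""

    @abstractmethod
    def get_sentences(self, text: str) -> list[str]:
        """Split text into individual sentences."""
```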
Bases: SentenceSplitter
A sentence splitter based on regular expressions.
NOTE: Using the regex splitter is not recommended, since it is based on very simple heuristics.
Heuristics:
- Split after sentence-ending punctuation (. ! ? …) and any closing quotes/brackets.
- Only split if the next non-space token looks like a sentence start (capital letter or digit, optionally after an opening quote/paren).
- Merge back false positives caused by common abbreviations, initials, dotted acronyms (e.g., U.S.), decimals (e.g., 3.14), ordinals (No. 5), and ellipses.
Notes:
- Tweak self.abbreviations for your domain/corpus.
- For chatty/poetic text where sentences may start lowercase, relax the lookahead in self._split_re (see comment in __init__).
__init__()
Initialize a regex sentence splitter (compile regex, set abbreviations).
get_sentences(text)
Split text into sentences.
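To make the heuristics above concrete, here is a minimal sketch of a regex splitter with a small merge-back step. It is illustrative only, not the actual self._split_re or abbreviation list used by the class:

```python
import re

# Split on whitespace that follows sentence-ending punctuation (optionally
# plus one closing quote/bracket) and precedes a sentence-start-like token.
_SPLIT_RE = re.compile(
    r"(?:(?<=[.!?\u2026])|(?<=[.!?\u2026][\"')\]]))\s+(?=[\"'(\[]?[A-Z0-9])"
)

# Tiny example set; a real splitter needs a domain-specific abbreviation list.
_ABBREVIATIONS = ("Dr.", "No.", "e.g.", "U.S.")

def get_sentences(text: str) -> list[str]:
    sentences: list[str] = []
    for part in _SPLIT_RE.split(text):
        # Merge back false positives caused by known abbreviations.
        if sentences and sentences[-1].endswith(_ABBREVIATIONS):
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    return sentences

print(get_sentences('He paid $3. "Great!" Then No. 5 arrived.'))
# ['He paid $3.', '"Great!"', 'Then No. 5 arrived.']
```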