Docling
docling_step
Note: Known limitations with EasyOCR (EasyOcrOptions).
- Table structure is often lost or misaligned in the OCR output.
- Spelling inaccuracies are occasionally observed (e.g., "Verlängern" → "Verlängenng").
- URLs are not parsed correctly (e.g., "www.telekom.de/agb" → "www telekom delagb").
While investigating EasyOCR issues and testing alternative OCR engines, we observed that some documents produced distorted text with irregular whitespace. This disrupts the natural sentence flow and significantly reduces readability.
Example: "pra kti sche i nform ati o nen zu i h rer fam i l y card basi c Li eber Tel ekom Kunde, schön, dass Si e si ch f ür..."
Despite these limitations, we have decided to proceed with EasyOCR.
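The whitespace distortion described above can be flagged automatically. The following is a minimal sketch of one possible heuristic, not part of the step itself: it measures the share of single-character word tokens, which is high in distorted output and low in ordinary German text. The function names, the threshold of 0.15, and the de-spaced "clean" string are assumptions made for illustration.

```python
import re


def single_char_token_ratio(text: str) -> float:
    """Return the share of word tokens that are exactly one character long."""
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


def looks_distorted(text: str, threshold: float = 0.15) -> bool:
    """Heuristic check; the threshold is an assumed value, tune it on real data."""
    return single_char_token_ratio(text) > threshold


# Excerpt of the distorted OCR output quoted above.
distorted = "pra kti sche i nform ati o nen zu i h rer fam i l y card basi c Li eber Tel ekom Kunde"
# Assumed de-spaced reading of the same text, used only for comparison.
clean = "Praktische Informationen zu Ihrer Family Card Basic. Lieber Telekom Kunde"

print(looks_distorted(distorted), looks_distorted(clean))
```

A check like this could be used to route affected documents to manual review or to a different OCR engine.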
Classes
CleanMarkdownRenderer
Bases: HTMLRenderer
Custom Markdown renderer extending mistletoe's HTMLRenderer to clean up unwanted elements from Markdown input.
Source code in wurzel/steps/docling/docling_step.py
Functions
render_html_block(token)
staticmethod
Render HTML block tokens and clean up unwanted elements.
This method removes HTML comments and returns the cleaned HTML content.
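As an illustration of the comment removal, here is a stand-alone sketch using only the standard library. It is not the renderer's actual code, which works on mistletoe tokens; the function name `strip_html_comments` and the sample input are hypothetical.

```python
import re

# Non-greedy so that several comments in one block are removed separately;
# DOTALL lets a single comment span multiple lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html: str) -> str:
    """Remove all HTML comments and trim surrounding whitespace."""
    return HTML_COMMENT.sub("", html).strip()


print(strip_html_comments("<!-- note -->\n<p>Hallo</p>"))
```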
DoclingStep
Bases: TypedStep[DoclingSettings, None, list[MarkdownDataContract]]
Step to return local Markdown files with enhanced PDF extraction for German.
Functions
create_converter()
Create and configure the document converter for PDF and DOCX.
Returns:
Name | Type | Description |
---|---|---|
DocumentConverter | DocumentConverter | Configured document converter. |
extract_keywords(md_text)
staticmethod
Cleans a Markdown string using mistletoe and extracts useful content.
- Parses and renders the Markdown content into HTML using a custom HTML renderer
- Removes unwanted HTML comments and escaped underscores
- Extracts the first heading from the content (e.g., <h1> to <h6>)
- Converts the cleaned HTML into plain text
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md_text | str | The raw Markdown input string. | required |
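The cleaning steps listed above can be approximated with the standard library alone. The sketch below starts from already rendered HTML (the Markdown-to-HTML step via mistletoe is left out), then removes comments, unescapes underscores, captures the first heading, and flattens the rest to plain text. The names `clean_html` and `_TextAndHeading` are hypothetical and the real method may differ in detail.

```python
import re
from html.parser import HTMLParser
from typing import Optional

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class _TextAndHeading(HTMLParser):
    """Collects all text nodes and remembers the text of the first heading."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts = []
        self.first_heading = None
        self._heading_parts = None  # buffer while inside the first heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS and self.first_heading is None and self._heading_parts is None:
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._heading_parts is not None:
            self.first_heading = "".join(self._heading_parts).strip()
            self._heading_parts = None

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._heading_parts is not None:
            self._heading_parts.append(data)


def clean_html(html: str) -> tuple[Optional[str], str]:
    """Return (first heading or None, plain text) for a rendered HTML string."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)  # drop comments
    html = html.replace("\\_", "_")  # unescape Markdown-style underscores
    parser = _TextAndHeading()
    parser.feed(html)
    parser.close()
    # Text nodes are joined with spaces, so a word split by inline tags
    # would gain a space; acceptable for a sketch.
    text = " ".join(" ".join(parser.text_parts).split())
    return parser.first_heading, text


heading, text = clean_html(r"<!-- c --><h2>Family\_Card</h2><p>Lieber <b>Kunde</b></p>")
print(heading, "|", text)
```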
run(inpt)
Run the document extraction and conversion process for German PDFs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inpt | None | Input parameter (not used). | required |
Returns:
Type | Description |
---|---|
list[MarkdownDataContract] | List of converted Markdown contracts. |
settings
Specific docling settings.
Classes
DoclingSettings
Bases: Settings
DoclingSettings is a configuration class that inherits from the base Settings class. It provides customizable settings for document processing.
Attributes:
Name | Type | Description |
---|---|---|
FORCE_FULL_PAGE_OCR | bool | A flag to enforce full-page OCR processing. Defaults to True. |
FORMATS | list[InputFormat] | A list of supported input formats for document processing. Supported formats include: "docx", "asciidoc", "pptx", "html", "image", "pdf", "md", "csv", "xlsx", "xml_uspto", "xml_jats", "json_docling". |
URLS | list[str] | A list of URLs for additional configuration or resources. Defaults to an empty list. |
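To make the shape of these attributes concrete, the following sketch mirrors them in a plain dataclass. It is a hypothetical stand-in, not the real DoclingSettings class: the actual class inherits from wurzel's Settings base and types FORMATS as list[InputFormat], whereas this sketch uses plain strings, assumes an empty default for FORMATS, and adds a validation step for illustration.

```python
from dataclasses import dataclass, field

# Format names as listed in the attribute table above.
SUPPORTED_FORMATS = frozenset({
    "docx", "asciidoc", "pptx", "html", "image", "pdf",
    "md", "csv", "xlsx", "xml_uspto", "xml_jats", "json_docling",
})


@dataclass
class DoclingSettingsSketch:
    """Hypothetical stand-in for illustration; not the wurzel class."""

    FORMATS: list[str] = field(default_factory=list)  # assumed default
    FORCE_FULL_PAGE_OCR: bool = True  # documented default
    URLS: list[str] = field(default_factory=list)  # documented default

    def __post_init__(self) -> None:
        # Reject format names that are not in the documented list.
        unknown = sorted(set(self.FORMATS) - SUPPORTED_FORMATS)
        if unknown:
            raise ValueError(f"Unsupported formats: {unknown}")


settings = DoclingSettingsSketch(FORMATS=["pdf", "docx"])
print(settings)
```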