Docling - Wurzel

Note: Known Limitations with EasyOCR (EasyOcrOptions).

- Table structure is often lost or misaligned in the OCR output.
- Spelling inaccuracies are occasionally observed (e.g., "Verlängern" → "Verlängenng").
- URLs are not parsed correctly (e.g., "www.telekom.de/agb" → "www telekom delagb").

While investigating EasyOCR issues and testing alternative OCR engines, we observed that some documents produced distorted text with irregular whitespace. This disrupts the natural sentence flow and significantly reduces readability.

Example: "pra kti sche i nform ati o nen zu i h rer fam i l y card basi c Li eber Tel ekom Kunde, schön, dass Si e si ch f ür..."
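One rough way to flag such output automatically is to measure how many whitespace-separated tokens are only one or two characters long. The sketch below is an illustration only and is not part of wurzel: the names `short_token_ratio` and `looks_fragmented`, the threshold of 0.3, and the presumed clean reading of the sample are all assumptions made for this example.

```python
import string


def short_token_ratio(text: str, max_len: int = 2) -> float:
    """Return the share of tokens that have at most `max_len` characters."""
    tokens = [t.strip(string.punctuation) for t in text.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    if not tokens:
        return 0.0
    return sum(len(t) <= max_len for t in tokens) / len(tokens)


def looks_fragmented(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: a high share of very short tokens suggests broken word spacing."""
    return short_token_ratio(text) > threshold


# Fragment of the distorted sample above, and an assumed clean reading of it.
garbled = "pra kti sche i nform ati o nen zu i h rer fam i l y card basi c"
clean = "Praktische Informationen zu Ihrer Family Card Basic"

print(looks_fragmented(garbled), looks_fragmented(clean))
```

A check like this could be used to log or re-route suspicious documents. The threshold would need tuning on real data, since text with many short function words can also score high.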

Despite these limitations, we have decided to proceed with EasyOCR.

`CleanMarkdownRenderer`

Bases: HTMLRenderer

Custom Markdown renderer extending mistletoe's HTMLRenderer to clean up unwanted elements from Markdown input.

Source code in wurzel/steps/docling/docling_step.py

class CleanMarkdownRenderer(HTMLRenderer):
    """Custom Markdown renderer extending mistletoe's HTMLRenderer to clean up
    unwanted elements from Markdown input.
    """

    @staticmethod
    def render_html_block(token):
        """Render HTML block tokens and clean up unwanted elements.

        This method removes HTML comments and returns the cleaned HTML content.
        Remove comments like <!-- image -->
        """
        soup = BeautifulSoup(token.content, "html.parser")

        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            comment.extract()
        return soup.decode_contents().strip()

`render_html_block(token)` `staticmethod`

Render HTML block tokens and clean up unwanted elements.

This method removes HTML comments, such as `<!-- image -->`, and returns the cleaned HTML content.

Source code in wurzel/steps/docling/docling_step.py

@staticmethod
def render_html_block(token):
    """Render HTML block tokens and clean up unwanted elements.

    This method removes HTML comments and returns the cleaned HTML content.
    Remove comments like <!-- image -->
    """
    soup = BeautifulSoup(token.content, "html.parser")

    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    return soup.decode_contents().strip()
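To make the effect of this method concrete without installing mistletoe or BeautifulSoup, the following sketch reproduces the comment-stripping idea with the standard library's `html.parser`. It is a stand-in, not the code used by wurzel: the class `CommentStripper` and the function `strip_html_comments` are hypothetical names, and edge cases (doctype declarations, processing instructions) are ignored.

```python
from html.parser import HTMLParser


class CommentStripper(HTMLParser):
    """Re-emit HTML unchanged, except that comments are dropped."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        pass  # comments such as <!-- image --> are discarded


def strip_html_comments(html: str) -> str:
    """Return `html` without comments, with surrounding whitespace trimmed."""
    parser = CommentStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(strip_html_comments("<!-- image -->\n<p>Tarif &amp; Preise</p>"))
```

An HTML block that contains only a placeholder comment, such as `<!-- image -->`, comes back as an empty string.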

`DoclingStep`

Bases: TypedStep[DoclingSettings, None, list[MarkdownDataContract]]

Step to return local Markdown files with enhanced PDF extraction for German.

Source code in wurzel/steps/docling/docling_step.py

class DoclingStep(TypedStep[DoclingSettings, None, list[MarkdownDataContract]]):
    """Step to return local Markdown files with enhanced PDF extraction for German."""

    def __init__(self):
        super().__init__()
        self.converter = self.create_converter()

    def create_converter(self) -> DocumentConverter:
        """Create and configure the document converter for PDF and DOCX.

        Returns:
            DocumentConverter: Configured document converter.

        """
        pipeline_options = PdfPipelineOptions()
        ocr_options = EasyOcrOptions()
        pipeline_options.ocr_options = ocr_options

        return DocumentConverter(
            allowed_formats=self.settings.FORMATS,
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_options=pipeline_options,
                )
            },
        )

    @staticmethod
    def extract_keywords(md_text: str) -> str:
        """Cleans a Markdown string using mistletoe and extracts useful content.

        - Parses and renders the Markdown content into HTML using a custom HTML renderer
        - Removes unwanted HTML comments and escaped underscores
        - Extracts the first heading from the content (e.g., `<h1>` to `<h6>`)
        - Converts the cleaned HTML into plain text

        Args:
            md_text (str): The raw Markdown input string.

        """
        with MD_RENDER_LOCK, CleanMarkdownRenderer() as renderer:
            ast = MTDocument(md_text)
            cleaned = renderer.render(ast).replace("\n", "")
            soup = BeautifulSoup(cleaned, "html.parser")
            first_heading_tag = soup.find(["h1", "h2", "h3", "h4", "h5", "h6"])
            heading = first_heading_tag.get_text(strip=True) if first_heading_tag else ""

        return heading

    def run(self, inpt: None) -> list[MarkdownDataContract]:
        """Run the document extraction and conversion process for German PDFs.

        Args:
            inpt (None): Input parameter (not used).

        Returns:
            List[MarkdownDataContract]: List of converted Markdown contracts.

        """
        urls = self.settings.URLS
        contracts = []

        for url in urls:
            try:
                converted_contract = self.converter.convert(url)
                md = converted_contract.document.export_to_markdown(image_placeholder="")
                keyword = self.extract_keywords(md)
                contract_instance = {"md": md, "keywords": " ".join([self.settings.DEFAULT_KEYWORD, keyword]), "url": url}
                contracts.append(contract_instance)

            except (FileNotFoundError, OSError) as e:
                log.error(f"Failed to verify URL: {url}. Error: {e}")
                continue

        return contracts

`create_converter()`

Create and configure the document converter for PDF and DOCX.

Returns:	`DocumentConverter` – Configured document converter.

Source code in wurzel/steps/docling/docling_step.py

def create_converter(self) -> DocumentConverter:
    """Create and configure the document converter for PDF and DOCX.

    Returns:
        DocumentConverter: Configured document converter.

    """
    pipeline_options = PdfPipelineOptions()
    ocr_options = EasyOcrOptions()
    pipeline_options.ocr_options = ocr_options

    return DocumentConverter(
        allowed_formats=self.settings.FORMATS,
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        },
    )

`extract_keywords(md_text)` `staticmethod`

Cleans a Markdown string using mistletoe and extracts useful content.

- Parses and renders the Markdown content into HTML using a custom HTML renderer
- Removes unwanted HTML comments and escaped underscores
- Extracts the first heading from the content (e.g., `<h1>` to `<h6>`)
- Converts the cleaned HTML into plain text

Parameters:	`md_text` (`str`) – The raw Markdown input string.

Source code in wurzel/steps/docling/docling_step.py

@staticmethod
def extract_keywords(md_text: str) -> str:
    """Cleans a Markdown string using mistletoe and extracts useful content.

    - Parses and renders the Markdown content into HTML using a custom HTML renderer
    - Removes unwanted HTML comments and escaped underscores
    - Extracts the first heading from the content (e.g., `<h1>` to `<h6>`)
    - Converts the cleaned HTML into plain text

    Args:
        md_text (str): The raw Markdown input string.

    """
    with MD_RENDER_LOCK, CleanMarkdownRenderer() as renderer:
        ast = MTDocument(md_text)
        cleaned = renderer.render(ast).replace("\n", "")
        soup = BeautifulSoup(cleaned, "html.parser")
        first_heading_tag = soup.find(["h1", "h2", "h3", "h4", "h5", "h6"])
        heading = first_heading_tag.get_text(strip=True) if first_heading_tag else ""

    return heading
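As the listing shows, the value actually returned is the text of the first heading, or an empty string when no heading exists. The sketch below illustrates that lookup on already-rendered HTML using only the standard library. It is an approximation, not the mistletoe and BeautifulSoup pipeline from the source, and `FirstHeadingParser` and `first_heading` are hypothetical names.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class FirstHeadingParser(HTMLParser):
    """Collect the text of the first <h1>..<h6> element only."""

    def __init__(self):
        super().__init__()
        self.open_tag = None  # heading tag currently being read, if any
        self.done = False     # True once the first heading has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if not self.done and self.open_tag is None and tag in HEADING_TAGS:
            self.open_tag = tag

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            self.open_tag = None
            self.done = True

    def handle_data(self, data):
        if self.open_tag is not None:
            self.chunks.append(data)


def first_heading(html: str) -> str:
    """Return the stripped text of the first heading, or "" if there is none."""
    parser = FirstHeadingParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


print(first_heading("<p>Intro</p><h2>Family Card Basic</h2><h1>Later</h1>"))
```

The first heading in document order is chosen regardless of its level, which matches the `soup.find([...])` call in the source.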

`run(inpt)`

Run the document extraction and conversion process for German PDFs.

Parameters:	`inpt` (`None`) – Input parameter (not used).

Returns:	`list[MarkdownDataContract]` – List of converted Markdown contracts.

Source code in wurzel/steps/docling/docling_step.py

def run(self, inpt: None) -> list[MarkdownDataContract]:
    """Run the document extraction and conversion process for German PDFs.

    Args:
        inpt (None): Input parameter (not used).

    Returns:
        List[MarkdownDataContract]: List of converted Markdown contracts.

    """
    urls = self.settings.URLS
    contracts = []

    for url in urls:
        try:
            converted_contract = self.converter.convert(url)
            md = converted_contract.document.export_to_markdown(image_placeholder="")
            keyword = self.extract_keywords(md)
            contract_instance = {"md": md, "keywords": " ".join([self.settings.DEFAULT_KEYWORD, keyword]), "url": url}
            contracts.append(contract_instance)

        except (FileNotFoundError, OSError) as e:
            log.error(f"Failed to verify URL: {url}. Error: {e}")
            continue

    return contracts
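One detail of the `keywords` field is worth seeing in isolation: `DEFAULT_KEYWORD` defaults to an empty string, so the plain `" ".join(...)` in `run` yields a value with a leading space. The sketch below shows that behaviour next to a hypothetical helper, `join_keywords`, that skips empty parts. The helper is not part of wurzel, and the source does not say whether the leading space matters downstream.

```python
def join_keywords(*parts: str) -> str:
    """Join non-empty keyword parts with single spaces."""
    return " ".join(p.strip() for p in parts if p and p.strip())


# Behaviour of the plain join used in run() when DEFAULT_KEYWORD is "":
plain = " ".join(["", "Family Card Basic"])
print(repr(plain))

# The hypothetical helper drops empty parts instead:
print(repr(join_keywords("", "Family Card Basic")))
print(repr(join_keywords("telekom", "Family Card Basic")))
```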

Specific docling settings.

`DoclingSettings`

Bases: Settings

DoclingSettings is a configuration class that inherits from the base Settings class. It provides customizable settings for document processing.

Attributes:

`FORCE_FULL_PAGE_OCR` (`bool`) – A flag to enforce full-page OCR processing. Defaults to True.

`FORMATS` (`list[InputFormat]`) – A list of supported input formats for document processing. Supported formats include: "docx", "asciidoc", "pptx", "html", "image", "pdf", "md", "csv", "xlsx", "xml_uspto", "xml_jats", "json_docling".

`URLS` (`list[str]`) – A list of URLs of the documents that `run` passes to the converter. Defaults to an empty list.

`DEFAULT_KEYWORD` (`str`) – A keyword that `run` prepends to the extracted heading in each contract's `keywords` field. Defaults to an empty string.

Source code in wurzel/steps/docling/settings.py

class DoclingSettings(Settings):
    """DoclingSettings is a configuration class that inherits from the base `Settings` class.
    It provides customizable settings for document processing.

    Attributes:
        FORCE_FULL_PAGE_OCR (bool): A flag to enforce full-page OCR processing. Defaults to True.
        FORMATS (list[InputFormat]): A list of supported input formats for document processing.
            Supported formats include:
            - "docx"
            - "asciidoc"
            - "pptx"
            - "html"
            - "image"
            - "pdf"
            - "md"
            - "csv"
            - "xlsx"
            - "xml_uspto"
            - "xml_jats"
            - "json_docling"
        URLS (list[str]): A list of URLs for additional configuration or resources. Defaults to an empty list.

    """

    FORCE_FULL_PAGE_OCR: bool = True
    FORMATS: list[InputFormat] = [
        "docx",
        "asciidoc",
        "pptx",
        "html",
        "image",
        "pdf",
        "md",
        "csv",
        "xlsx",
        "xml_uspto",
        "xml_jats",
        "json_docling",
    ]
    URLS: list[str] = []
    DEFAULT_KEYWORD: str = ""
