Docling

docling_step

Note: Known Limitations with EasyOCR (EasyOcrOptions).

  1. Table structure is often lost or misaligned in the OCR output.
  2. Spelling inaccuracies are occasionally observed (e.g., "Verlängern" → "Verlängenng").
  3. URLs are not parsed correctly (e.g., "www.telekom.de/agb" → "www telekom delagb").

While investigating EasyOCR issues and testing alternative OCR engines, we observed that some documents produced distorted text with irregular whitespace. This disrupts the natural sentence flow and significantly reduces readability.

Example: "pra kti sche i nform ati o nen zu i h rer fam i l y card basi c Li eber Tel ekom Kunde, schön, dass Si e si ch f ür..."

Despite these limitations, we have decided to proceed with EasyOCR.
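
One way to spot the irregular-whitespace distortion shown in the example is to measure how many tokens are implausibly short. The following is a minimal sketch and not part of wurzel: the names `short_token_ratio` and `looks_fragmented`, the two-character cutoff, and the 0.3 threshold are assumptions chosen for illustration and would need tuning on real documents.

```python
def short_token_ratio(text: str, max_len: int = 2) -> float:
    """Return the fraction of whitespace-separated tokens with at most max_len characters."""
    tokens = text.split()
    if not tokens:
        return 0.0
    short = sum(1 for token in tokens if len(token) <= max_len)
    return short / len(tokens)


def looks_fragmented(text: str, threshold: float = 0.3) -> bool:
    """Flag text whose share of very short tokens exceeds the (assumed) threshold."""
    return short_token_ratio(text) > threshold


# Distorted sample taken from the note above.
distorted = (
    "pra kti sche i nform ati o nen zu i h rer fam i l y card basi c "
    "Li eber Tel ekom Kunde, schön, dass Si e si ch f ür"
)
# Our reading of the intended text, used only for comparison.
intended = (
    "praktische informationen zu ihrer family card basic "
    "Lieber Telekom Kunde, schön, dass Sie sich für"
)

print(looks_fragmented(distorted), looks_fragmented(intended))
```

A check like this could be used to log a warning for suspicious documents. It does not repair the text.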

Classes

CleanMarkdownRenderer

Bases: HTMLRenderer

Custom Markdown renderer extending mistletoe's HTMLRenderer to clean up unwanted elements from Markdown input.

Source code in wurzel/steps/docling/docling_step.py
class CleanMarkdownRenderer(HTMLRenderer):
    """Custom Markdown renderer extending mistletoe's HTMLRenderer to clean up
    unwanted elements from Markdown input.
    """

    @staticmethod
    def render_html_block(token):
        """Render HTML block tokens and clean up unwanted elements.

        This method removes HTML comments and returns the cleaned HTML content.
        Remove comments like <!-- image -->
        """
        soup = BeautifulSoup(token.content, "html.parser")

        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            comment.extract()
        return soup.decode_contents().strip()
Functions
render_html_block(token) staticmethod

Render HTML block tokens and clean up unwanted elements.

This method removes HTML comments, such as <!-- image -->, and returns the cleaned HTML content.
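
The method itself uses BeautifulSoup. To show the same idea without third-party dependencies, here is a rough stand-in built on the standard library's `html.parser`. The names `CommentStripper` and `strip_html_comments` are hypothetical, and the output is not guaranteed to match BeautifulSoup's serialization for malformed HTML.

```python
from html.parser import HTMLParser


class CommentStripper(HTMLParser):
    """Re-emit HTML as parsed, leaving out comment nodes."""

    def __init__(self):
        # Keep entity and character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        # Comments such as <!-- image --> are dropped.
        pass


def strip_html_comments(html: str) -> str:
    """Return html without comments, trimmed of surrounding whitespace."""
    parser = CommentStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(strip_html_comments("<!-- image -->\n<p>Text &amp; more</p>"))
```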


DoclingStep

Bases: TypedStep[DoclingSettings, None, list[MarkdownDataContract]]

Step to return local Markdown files with enhanced PDF extraction for German.

Source code in wurzel/steps/docling/docling_step.py
class DoclingStep(TypedStep[DoclingSettings, None, list[MarkdownDataContract]]):
    """Step to return local Markdown files with enhanced PDF extraction for German."""

    def __init__(self):
        super().__init__()
        self.converter = self.create_converter()

    def create_converter(self) -> DocumentConverter:
        """Create and configure the document converter for PDF and DOCX.

        Returns:
            DocumentConverter: Configured document converter.

        """
        pipeline_options = PdfPipelineOptions()
        ocr_options = EasyOcrOptions()
        pipeline_options.ocr_options = ocr_options

        return DocumentConverter(
            allowed_formats=self.settings.FORMATS,
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_options=pipeline_options,
                )
            },
        )

    @staticmethod
    def extract_keywords(md_text: str) -> str:
        """Cleans a Markdown string using mistletoe and extracts useful content.

        - Parses and renders the Markdown content into HTML using a custom HTML renderer
        - Removes unwanted HTML comments and escaped underscores
        - Extracts the first heading from the content (e.g., `<h1>` to `<h6>`)
        - Converts the heading into plain text

        Args:
            md_text (str): The raw Markdown input string.

        Returns:
            str: The text of the first heading, or an empty string if none is found.

        """
        with MD_RENDER_LOCK, CleanMarkdownRenderer() as renderer:
            ast = MTDocument(md_text)
            cleaned = renderer.render(ast).replace("\n", "")
            soup = BeautifulSoup(cleaned, "html.parser")
            first_heading_tag = soup.find(["h1", "h2", "h3", "h4", "h5", "h6"])
            heading = first_heading_tag.get_text(strip=True) if first_heading_tag else ""

        return heading

    def run(self, inpt: None) -> list[MarkdownDataContract]:
        """Run the document extraction and conversion process for German PDFs.

        Args:
            inpt (None): Input parameter (not used).

        Returns:
            List[MarkdownDataContract]: List of converted Markdown contracts.

        """
        urls = self.settings.URLS
        contracts = []

        for url in urls:
            try:
                converted_contract = self.converter.convert(url)
                md = converted_contract.document.export_to_markdown(image_placeholder="")
                keyword = self.extract_keywords(md)
                contract_instance = {"md": md, "keywords": " ".join([self.settings.DEFAULT_KEYWORD, keyword]), "url": url}
                contracts.append(contract_instance)

            except (FileNotFoundError, OSError) as e:
                log.warning(f"Failed to verify URL: {url}. Error: {e}")
                continue

        return contracts
Functions
create_converter()

Create and configure the document converter for PDF and DOCX.

Returns:

Type | Description
DocumentConverter | Configured document converter.
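
The source shown for `create_converter` constructs `EasyOcrOptions()` with its defaults, so `FORCE_FULL_PAGE_OCR` from `DoclingSettings` is not forwarded there. If forwarding it is desired, a sketch could look like the following. The helper name `build_converter` and its parameters are hypothetical. `force_full_page_ocr` and `lang` are fields of docling's `EasyOcrOptions`, and restricting `lang` to `["de"]` is an assumption based on the step's focus on German documents. Verify the field names against the installed docling version.

```python
def build_converter(allowed_formats, force_full_page_ocr=True, languages=("de",)):
    """Hypothetical variant of create_converter that forwards OCR settings."""
    # Imported lazily so this module can be loaded without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.ocr_options = EasyOcrOptions(
        force_full_page_ocr=force_full_page_ocr,  # would come from FORCE_FULL_PAGE_OCR
        lang=list(languages),  # assumption: German-only recognition
    )
    return DocumentConverter(
        allowed_formats=allowed_formats,
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
        },
    )
```

Inside the step this might be called as `build_converter(self.settings.FORMATS, self.settings.FORCE_FULL_PAGE_OCR)`.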

extract_keywords(md_text) staticmethod

Cleans a Markdown string using mistletoe and extracts its first heading.

  • Parses and renders the Markdown content into HTML using a custom HTML renderer
  • Removes unwanted HTML comments and escaped underscores
  • Extracts the first heading from the content (e.g., <h1> to <h6>)
  • Returns the heading as plain text, or an empty string if no heading is found

Parameters:

Name | Type | Description | Default
md_text | str | The raw Markdown input string. | required
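
To illustrate what "first heading" means without mistletoe or BeautifulSoup, here is a rough approximation that scans the raw Markdown for the first ATX heading (`#` to `######`). It is not the step's implementation: it ignores setext headings, headings inside code fences, and inline markup. The name `first_atx_heading` is hypothetical.

```python
import re

# ATX heading: up to three leading spaces, one to six '#', whitespace, then the title.
# An optional closing run of '#' is discarded.
_ATX_HEADING = re.compile(
    r"^ {0,3}#{1,6}[ \t]+(.*?)(?:[ \t]+#+)?[ \t]*$",
    re.MULTILINE,
)


def first_atx_heading(md_text: str) -> str:
    """Return the text of the first ATX heading, or an empty string if none exists."""
    match = _ATX_HEADING.search(md_text)
    return match.group(1).strip() if match else ""


sample = "Intro text\n\n## Family Card Basic\n\nMore text\n\n# Later heading"
print(first_atx_heading(sample))
```

As in `extract_keywords`, the first heading in document order wins regardless of its level.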
run(inpt)

Run the document extraction and conversion process for German PDFs.

Parameters:

Name | Type | Description | Default
inpt | None | Input parameter (not used). | required

Returns:

Type | Description
list[MarkdownDataContract] | List of converted Markdown contracts.
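
The control flow of `run` can be sketched with a stand-in converter. Here `convert_all` and `fake_convert` are hypothetical; the real step calls docling's converter and `extract_keywords`. Two details are worth noting: `FileNotFoundError` is a subclass of `OSError`, so catching `OSError` alone covers both, and `" ".join([default, heading])` leaves a leading space when the default keyword is the empty string.

```python
import logging

log = logging.getLogger(__name__)


def convert_all(urls, convert, default_keyword=""):
    """Convert each URL, skipping the ones that fail with an OS-level error."""
    contracts = []
    for url in urls:
        try:
            md, heading = convert(url)
        except OSError as err:  # also covers FileNotFoundError
            log.warning("Failed to convert %s: %s", url, err)
            continue
        contracts.append(
            {"md": md, "keywords": " ".join([default_keyword, heading]), "url": url}
        )
    return contracts


def fake_convert(url):
    """Stand-in for the docling conversion plus heading extraction."""
    if url.endswith("missing.pdf"):
        raise FileNotFoundError(url)
    return "# Title\n\nBody", "Title"


result = convert_all(["a.pdf", "missing.pdf"], fake_convert)
print(result)
```

If the leading space matters downstream, calling `.strip()` on the joined string is one possible remedy.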


settings

Specific docling settings.

Classes

DoclingSettings

Bases: Settings

DoclingSettings is a configuration class that inherits from the base Settings class. It provides customizable settings for document processing.

Attributes:

Name | Type | Description
FORCE_FULL_PAGE_OCR | bool | A flag to enforce full-page OCR processing. Defaults to True.
FORMATS | list[InputFormat] | A list of supported input formats for document processing: "docx", "asciidoc", "pptx", "html", "image", "pdf", "md", "csv", "xlsx", "xml_uspto", "xml_jats", "json_docling".
URLS | list[str] | The documents to convert; DoclingStep.run passes each entry to the converter. Defaults to an empty list.
DEFAULT_KEYWORD | str | Keyword placed before the extracted heading in each contract's keywords. Defaults to an empty string.

Source code in wurzel/steps/docling/settings.py
class DoclingSettings(Settings):
    """DoclingSettings is a configuration class that inherits from the base `Settings` class.
    It provides customizable settings for document processing.

    Attributes:
        FORCE_FULL_PAGE_OCR (bool): A flag to enforce full-page OCR processing. Defaults to True.
        FORMATS (list[InputFormat]): A list of supported input formats for document processing.
            Supported formats include:
            - "docx"
            - "asciidoc"
            - "pptx"
            - "html"
            - "image"
            - "pdf"
            - "md"
            - "csv"
            - "xlsx"
            - "xml_uspto"
            - "xml_jats"
            - "json_docling"
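        URLS (list[str]): The documents to convert; each entry is passed to the converter. Defaults to an empty list.
        DEFAULT_KEYWORD (str): Keyword placed before the extracted heading in each contract's keywords.
            Defaults to an empty string.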

    """

    FORCE_FULL_PAGE_OCR: bool = True
    FORMATS: list[InputFormat] = [
        "docx",
        "asciidoc",
        "pptx",
        "html",
        "image",
        "pdf",
        "md",
        "csv",
        "xlsx",
        "xml_uspto",
        "xml_jats",
        "json_docling",
    ]
    URLS: list[str] = []
    DEFAULT_KEYWORD: str = ""