When building LLM-powered software, one of the biggest pain points is dealing with user-provided data.
Data formatting and cleaning requirements vary dramatically from one use case to another, and you’ll almost certainly need something bespoke if you’re building anything complex. The problem is that these requirements usually only reveal themselves after you’ve already done some building and hit the limitations of out-of-the-box approaches. It’s a chicken-and-egg type of situation.
In this guide, I’ll describe how I’d tackle this problem if I wanted to get something working quickly that can handle many downstream tasks. The approach is simple: convert common end-user formats—PDF, Word documents, Excel sheets, etc—to Markdown, a format friendly to LLMs and embedding models.
Requirements
This is what I think is necessary for a generic yet complete document-to-Markdown processing engine that I’d want to use in production:
Minimal custom code: each file format has plenty of quirks, and I don’t want to spend much time maintaining this before I have more precise requirements.
Supports “every” file format (within reason) -- sorry .gif, you’re on your own.
Can deal with highly visual documents using VLLMs (some corporate flow-charts are like abstract art)
Can scrape websites & clean HTML
Tags the document with some useful metadata I’ll probably need later on and don't want to re-compute at run-time (I opted for language and token count)
Works over an HTTP server: I may want to use a single processing server for multiple projects.
The Libraries
Requirement 1 means that we’ll rely heavily on various libraries to bring everything together. I aimed to keep it open-source whenever possible:
Docling: the powerhouse that handles most non-AI processing (maintained by IBM)
Zerox: for VLLM-based document processing
Jina AI Reader: for scraping websites and cleaning HTML
langdetect: for guessing the language of a document.
FastAPI: for running an HTTP server.
Implementation
Bringing everything together was relatively straightforward. I added some document validation logic—mimetype checks and file upload size verification—and wrapped everything in two endpoints:
/process/document: for PDFs, Word documents, etc.
/process/url: for webpages.
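The post doesn’t show the validation logic itself, but a minimal sketch using only the standard library might look like this. The allowed mimetypes and the 20 MB size cap are illustrative assumptions, not the service’s actual values:

```python
# Sketch of upload validation: mimetype check + size cap.
# ALLOWED_MIMETYPES and MAX_UPLOAD_BYTES are assumptions for illustration.
import mimetypes

ALLOWED_MIMETYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}
MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # arbitrary 20 MB cap for the sketch


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return the guessed mimetype, or raise ValueError if the upload is rejected."""
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"File too large: {size_bytes} bytes")
    mimetype, _ = mimetypes.guess_type(filename)
    if mimetype not in ALLOWED_MIMETYPES:
        raise ValueError(f"Unsupported mimetype: {mimetype}")
    return mimetype
```

In the real service this check runs before any expensive processing, so bad uploads fail fast with a clear error.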
The /process/document endpoint includes a use_llm parameter, allowing you to choose when to use GPT-4o with Zerox for visual documents. I’d use this sparingly, as it’s expensive and could lead to hallucinations.
Each endpoint also returns the language of the document and the total number of tokens calculated by two popular tokenizers: cl100k_base and o200k_base.
Deployment
To make this easy to deploy anywhere, I Dockerized it. I decided to use uv to manage Python dependencies for both development and production, as it’s super fast and means you don’t have to maintain a separate requirements.txt on top of a pyproject.toml.
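A minimal Dockerfile for this setup might look like the following. This is a sketch based on uv’s documented Docker pattern; the base image tag, file layout, and the main:app entrypoint are assumptions to adapt to the repo:

```dockerfile
# Sketch: uv-based image. Tags, paths, and the entrypoint are assumptions.
FROM python:3.12-slim

# Copy the uv binary from the official image
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

WORKDIR /app

# Install locked dependencies first so this layer caches well
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev

COPY . .
CMD ["uv", "run", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Copying the lockfile and syncing before copying the application code keeps dependency installation cached across rebuilds.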
I then deployed this on Railway, an easy-to-use infrastructure platform. Simply fork the repo and follow the UI to deploy from GitHub.
Usage
The idea is that once this is deployed, you can mostly forget about it. It’s easy to use from any other project. I suggest these simple wrapper functions for Python and these for TypeScript.
```python
import os
from pathlib import Path

import requests
from pydantic import BaseModel

BASE_URL = "https://your-api-url.com"
API_KEY = os.getenv("API_KEY", "")


class TokenCount(BaseModel):
    o200k_base: int
    cl100k_base: int


class ProcessDocumentResponse(BaseModel):
    markdown: str
    language: str
    mimetype: str
    token_count: TokenCount


def process_document(file_path: Path, use_llm: bool = False) -> ProcessDocumentResponse:
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")

    with open(file_path, "rb") as file:
        files = {"file": (file_path.name, file)}
        data = {"use_llm": str(use_llm).lower()}

        response = requests.post(
            f"{BASE_URL}/process/document",
            headers={"X-API-Key": API_KEY},
            files=files,
            data=data,
        )
        response.raise_for_status()
        return ProcessDocumentResponse(**response.json())


def process_url(url: str) -> ProcessDocumentResponse:
    response = requests.post(
        f"{BASE_URL}/process/url", params={"url": url}, headers={"X-API-Key": API_KEY}
    )
    response.raise_for_status()
    return ProcessDocumentResponse(**response.json())
```