Document Processing
PDF and file pipelines are central to AI workflows: parse documents, extract content with OCR or LLMs, validate and enrich the data, and classify or route it. We’ll structure these as Hatchet DAG workflows.
Because the stages are fixed (ingest → parse → extract), document pipelines map naturally to DAG workflows. You can add child spawning within the ingest or parse stage to process multiple documents in parallel.
Step-by-step walkthrough
You’ll build a three-stage pipeline (ingest, parse, extract) using mocks so you can run it locally without API keys.
Define the DAG
Create a workflow with a fixed pipeline: ingest, parse, extract. Each stage depends on the previous.
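The dependency chain can be sketched in-process with plain Python, no Hatchet server or API keys required. In a real Hatchet workflow each function below would be declared as a task with its parent dependency; all function names here are illustrative mocks, not part of the Hatchet SDK.

```python
# Minimal sketch of the fixed three-stage pipeline: ingest -> parse -> extract.
# Each stage consumes the previous stage's output, which is exactly the
# dependency the DAG encodes declaratively.

def ingest(doc_url: str) -> dict:
    """Stage 1: fetch the raw document (mocked)."""
    return {"url": doc_url, "raw": b"%PDF-1.7 mock bytes"}

def parse(ingested: dict) -> dict:
    """Stage 2: depends on ingest; convert raw bytes to text (mocked OCR)."""
    return {"url": ingested["url"], "text": "Invoice #123 Total: $45.00"}

def extract(parsed: dict) -> dict:
    """Stage 3: depends on parse; pull structured fields (mocked extractor)."""
    return {"url": parsed["url"], "invoice_id": "123", "total": "45.00"}

def run_pipeline(doc_url: str) -> dict:
    # The fixed ordering below is what the DAG expresses as task dependencies.
    return extract(parse(ingest(doc_url)))
```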
Parse stage
The parse task depends on ingest (Step 1). It converts raw content to structured text: OCR for scanned images, a mock service in these examples.
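A parse stage typically dispatches on content type: image inputs go through OCR, while text-bearing PDFs can use their embedded text layer. The sketch below mocks both backends; the content-type routing is the part that carries over when you swap in real services.

```python
# Mock parse stage that routes by MIME type. Both backends are stubs;
# replace them with a real OCR provider and PDF text extractor in production.

def mock_ocr(raw: bytes) -> str:
    return "OCR text from scanned page"

def mock_pdf_text(raw: bytes) -> str:
    return "Embedded PDF text layer"

def parse_document(content_type: str, raw: bytes) -> str:
    if content_type in ("image/png", "image/jpeg", "image/tiff"):
        return mock_ocr(raw)
    if content_type == "application/pdf":
        return mock_pdf_text(raw)
    raise ValueError(f"unsupported content type: {content_type}")
```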
The examples above use a mock OCR service. To use a real provider, swap in one of the options below: pick a provider, then select your language.
Tesseract is an open-source OCR engine, originally developed at HP and later sponsored by Google; it is now community-maintained. It supports 100+ languages and runs entirely on your own infrastructure with no API key or cloud service required. A solid default for straightforward text extraction from images and scanned documents.
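A hedged sketch of swapping the mock OCR for Tesseract via the `pytesseract` wrapper. It assumes the `tesseract` binary plus the `pytesseract` and `Pillow` packages are installed; when they are absent (or when `force_mock=True`), it falls back to the mock so the pipeline stays runnable locally.

```python
import shutil

def tesseract_available() -> bool:
    """True only when both the pytesseract package and the binary exist."""
    try:
        import pytesseract  # noqa: F401  (pip install pytesseract)
    except ImportError:
        return False
    return shutil.which("tesseract") is not None

def ocr_page(image_path: str, force_mock: bool = False) -> str:
    if force_mock or not tesseract_available():
        return "mock OCR text"  # fallback when Tesseract isn't installed
    import pytesseract
    from PIL import Image  # pip install Pillow
    return pytesseract.image_to_string(Image.open(image_path))
```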
Extract stage
The extract task depends on parse (Step 2). Use an LLM or rules to extract entities. Rate Limits keep extract tasks within provider quotas.
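For documents with predictable layouts, a rules-based extractor can stand in for the LLM entirely. The sketch below pulls an invoice id and total with regexes; the field names and patterns are illustrative, and an LLM-based extractor would replace `extract_fields` while keeping the same output shape.

```python
import re

def extract_fields(text: str) -> dict:
    """Rules-based extraction: regex match invoice id and total amount."""
    invoice = re.search(r"Invoice\s*#?(\w+)", text)
    total = re.search(r"Total:?\s*\$?([\d,]+\.\d{2})", text)
    return {
        "invoice_id": invoice.group(1) if invoice else None,
        "total": total.group(1) if total else None,
    }
```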
The examples above use a mock extractor. To wire in a real LLM for extraction, swap get_extract_service() with a provider:
OpenAI’s Chat Completions API provides access to GPT models for text generation, function calling, and structured outputs. It’s the most widely adopted LLM API and supports streaming, tool use, and JSON mode.
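A hedged sketch of an LLM-backed extractor using the OpenAI Chat Completions API with JSON mode. The model name, prompt, and field names are illustrative assumptions; when `OPENAI_API_KEY` is unset, it returns a mocked result so the pipeline stays runnable without credentials.

```python
import json
import os

def llm_extract(text: str) -> dict:
    if not os.environ.get("OPENAI_API_KEY"):
        # Mock fallback: keeps local runs working without an API key.
        return {"invoice_id": "mock", "total": "0.00"}
    from openai import OpenAI  # pip install openai
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract invoice_id and total from the document. "
                        "Reply with a JSON object containing those two keys."},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```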
Run the worker
Start the worker. For per-document parallelism, use Child Spawning within the ingest stage.
When fanning out to many documents, ensure your workers have enough slots or use Concurrency Control to limit how many run simultaneously.
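The slot-limiting idea can be sketched in-process with a bounded thread pool: in Hatchet the cap comes from worker slots or Concurrency Control, and here `max_parallel` plays the same role. The per-document function is a mock standing in for a spawned child workflow.

```python
from concurrent.futures import ThreadPoolExecutor

def process_document(doc_url: str) -> dict:
    """Mock stand-in for one spawned per-document child run."""
    return {"url": doc_url, "status": "done"}

def fan_out(doc_urls: list[str], max_parallel: int = 4) -> list[dict]:
    # At most max_parallel documents are processed simultaneously,
    # mirroring a worker-slot or concurrency limit.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(process_document, doc_urls))
```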
Common Patterns
| Pattern | Description |
|---|---|
| Invoice extraction | Parse invoices, extract line items and totals with LLM, validate amounts, post to ERP |
| Contract analysis | Extract clauses and terms, classify risk, route for legal review |
| Resume parsing | Parse resumes, extract skills and experience, match to job requisitions |
| Form processing | Extract form fields from scans, validate against schemas, submit to backend systems |
| Document classification | Classify documents by type, route to appropriate DAG workflows |
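The validation step in the invoice-extraction pattern can be as simple as checking that extracted line items sum to the stated total before posting anywhere. A minimal sketch, with illustrative field names and `Decimal` to avoid float rounding on currency:

```python
from decimal import Decimal

def validate_invoice(line_items: list[dict], stated_total: str) -> bool:
    """Return True when the line items sum exactly to the stated total."""
    computed = sum(Decimal(item["amount"]) for item in line_items)
    return computed == Decimal(stated_total)
```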
Related Patterns
- DAG Workflows: The fixed-stage DAG pattern that document pipelines are built on.
- RAG & Data Indexing: When your goal is retrieval (chunk, embed, index) rather than extraction and transformation.
- Batch Processing: General-purpose batch patterns that apply to document workloads.
- Child Spawning: Parallelize document processing across your worker fleet.
Next Steps
- DAG Workflows: define multi-stage pipelines with task dependencies
- Rate Limits: configure rate limiting for OCR and LLM APIs
- Child Spawning: fan out to per-document tasks
- Webhooks: trigger pipelines from file upload endpoints
- Concurrency Control: limit parallel document processing