
Document Processing

PDF and file pipelines are central to AI workflows: parse documents, extract content with OCR or LLMs, validate and enrich the data, and classify or route it. We’ll structure these as Hatchet DAG workflows.

Because the stages are fixed (ingest → parse → extract), document pipelines map naturally to DAG workflows. You can add child spawning within the ingest or parse stage to process multiple documents in parallel.
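As a plain-Python illustration (not the Hatchet SDK), the fixed stage graph and its run order can be sketched like this:

```python
# Illustrative sketch: the pipeline's dependency graph as plain data.
# Each stage lists the stages it depends on.
PIPELINE = {
    "ingest": [],
    "parse": ["ingest"],
    "extract": ["parse"],
}

def run_order(graph: dict[str, list[str]]) -> list[str]:
    """Topologically order stages so each runs after its dependencies."""
    order, seen = [], set()

    def visit(stage: str) -> None:
        if stage in seen:
            return
        seen.add(stage)
        for dep in graph[stage]:
            visit(dep)
        order.append(stage)

    for stage in graph:
        visit(stage)
    return order
```

Because each stage has exactly one parent here, the order is simply ingest, parse, extract; the same declaration style extends to wider DAGs with multiple parents.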

Step-by-step walkthrough

You’ll build a three-stage pipeline (ingest, parse, extract) using mocks so you can run it locally without API keys.

Define the DAG

Create a workflow with a fixed pipeline: ingest, parse, extract. Each stage depends on the previous.
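A runnable stand-in for that shape, with plain functions in place of Hatchet tasks (all names here are ours, not the SDK's):

```python
# Each function models one DAG stage; each consumes the previous stage's
# output, mirroring the ingest -> parse -> extract edges.

def ingest(doc: dict) -> dict:
    # Stage 1: accept the raw document and record basic metadata.
    return {"doc_id": doc["id"], "raw": doc["content"]}

def parse(ingested: dict) -> dict:
    # Stage 2: depends on ingest; convert raw content to text (mocked).
    return {"doc_id": ingested["doc_id"], "text": ingested["raw"].strip()}

def extract(parsed: dict) -> dict:
    # Stage 3: depends on parse; pull out entities (mocked as tokens).
    return {"doc_id": parsed["doc_id"], "entities": parsed["text"].split()}

def run_pipeline(doc: dict) -> dict:
    # Chain the stages in dependency order.
    return extract(parse(ingest(doc)))
```

In the real workflow, each stage would be a Hatchet task declaring its parent, and the engine, not a function call, enforces the ordering.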

Parse stage

The parse task depends on ingest (Step 1). It converts raw content into structured text: real pipelines use OCR for images, while the examples here use a mock.
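A minimal sketch of such a parse stage with an injectable OCR service (MockOcr and parse_document are illustrative names, not part of the cookbook):

```python
from typing import Protocol

class OcrService(Protocol):
    def image_to_text(self, data: bytes) -> str: ...

class MockOcr:
    # Stands in for a real OCR provider so the pipeline runs offline.
    def image_to_text(self, data: bytes) -> str:
        return data.decode("utf-8", errors="ignore")

def parse_document(raw: bytes, content_type: str, ocr: OcrService) -> str:
    # Images go through OCR; plain text is decoded directly.
    if content_type.startswith("image/"):
        return ocr.image_to_text(raw)
    return raw.decode("utf-8")
```

Keeping the OCR client behind an interface is what makes the provider swap below a one-line change.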

The examples above use a mock OCR service. To use a real provider, swap one in, such as Tesseract:

Tesseract is an open-source OCR engine, originally developed at HP and later sponsored by Google. It supports 100+ languages and runs entirely on your own infrastructure with no API key or cloud service required. A solid default for straightforward text extraction from images and scanned documents.
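A thin wrapper for swapping in Tesseract might look like the following; it assumes the pytesseract and Pillow packages, and keeps the engine injectable so the code can run without a local Tesseract install:

```python
from typing import Callable, Optional

def ocr_image(path: str, engine: Optional[Callable[[str], str]] = None) -> str:
    """OCR an image file; `engine` lets tests stub out the real call."""
    if engine is not None:
        return engine(path)
    # Real path: requires the pytesseract package and a local Tesseract
    # binary; no API key or cloud service is needed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path))
```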

Extract stage

The extract task depends on parse (Step 2). Use an LLM or rules to extract entities. Rate Limits keep extract tasks within provider quotas.
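As a sketch of the rules-based option, here is a mock extractor behind a get_extract_service() factory; the regexes and field names are illustrative, not from the cookbook:

```python
import re

class MockExtractor:
    """Rules-based stand-in for an LLM extractor (offline, no API key)."""

    def extract(self, text: str) -> dict:
        # Pull out dollar amounts and invoice numbers with regexes.
        return {
            "amounts": re.findall(r"\$\d+(?:\.\d{2})?", text),
            "invoice_ids": re.findall(r"INV-\d+", text),
        }

def get_extract_service() -> MockExtractor:
    # Swap this factory's return value for a real LLM client in production.
    return MockExtractor()

def extract_stage(parsed_text: str) -> dict:
    # The extract task: depends on the parse stage's text output.
    return get_extract_service().extract(parsed_text)
```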

The examples above use a mock extractor. To wire in a real LLM for extraction, swap out get_extract_service() for a provider client:

OpenAI’s Chat Completions API provides access to GPT models for text generation, function calling, and structured outputs. It’s the most widely adopted LLM API and supports streaming, tool use, and JSON mode.
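For illustration, here is one way to build (but not send) such a request with only the standard library; the model name and prompt are assumptions, not prescriptions:

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_extraction_request(text: str, api_key: str) -> urllib.request.Request:
    # Assemble a Chat Completions call that asks for JSON output.
    payload = {
        "model": "gpt-4o-mini",  # assumed model name; use any chat model
        "response_format": {"type": "json_object"},  # JSON mode
        "messages": [
            {"role": "system", "content": "Extract entities from the document as JSON."},
            {"role": "user", "content": text},
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
```

In practice you would use the official SDK instead; the point here is the payload shape, in particular response_format for structured extraction output.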

Run the worker

Start the worker. For per-document parallelism, use Child Spawning within the ingest stage.
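A thread pool can stand in for per-document child spawning when sketching the fan-out locally (the function names are ours):

```python
from concurrent.futures import ThreadPoolExecutor

def process_document(doc: str) -> str:
    # Stand-in for spawning one child workflow per document.
    return doc.upper()

def ingest_batch(docs: list[str], max_workers: int = 4) -> list[str]:
    # Fan out one child per document, then gather results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_document, docs))
```

With Hatchet, the parent task would spawn child workflow runs instead of threads, and the engine handles scheduling and result aggregation.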

⚠️ When fanning out to many documents, ensure your workers have enough slots, or use Concurrency Control to limit how many run simultaneously.
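The slot limit behaves like a semaphore around each task; a minimal local sketch, not Hatchet's actual Concurrency Control API:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_SLOTS = 2  # at most two tasks hold a slot at once
_slots = threading.Semaphore(MAX_SLOTS)

def limited_task(doc: str) -> str:
    # Block until a slot frees up, mimicking per-worker slot limits.
    with _slots:
        time.sleep(0.01)  # simulate real work while holding the slot
        return doc.upper()

def fan_out(docs: list[str]) -> list[str]:
    # Many tasks are submitted, but the semaphore caps concurrency.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(limited_task, docs))
```

With Hatchet, you would set a concurrency key or limit on the workflow instead of managing a semaphore yourself.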

Common Patterns

| Pattern | Description |
| --- | --- |
| Invoice extraction | Parse invoices, extract line items and totals with an LLM, validate amounts, post to ERP |
| Contract analysis | Extract clauses and terms, classify risk, route for legal review |
| Resume parsing | Parse resumes, extract skills and experience, match to job requisitions |
| Form processing | Extract form fields from scans, validate against schemas, submit to backend systems |
| Document classification | Classify documents by type, route to appropriate DAG workflows |

Next Steps