Document Processing
PDF and file pipelines are central to AI workflows: parse documents, extract content with OCR or LLMs, validate and enrich the data, and classify or route it. We’ll structure these as Hatchet DAG workflows.
Because the stages are fixed (ingest → parse → extract), document pipelines map naturally to DAG workflows. You can add child spawning within the ingest or parse stage to process multiple documents in parallel.
Step-by-step walkthrough
You’ll build a three-stage pipeline (ingest, parse, extract) using mocks so you can run it locally without API keys.
Define the DAG
Create a workflow with a fixed pipeline: ingest, parse, extract. Each stage depends on the previous.
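The dependency chain can be sketched in-process with plain Python, no Hatchet server or API keys required. In a real Hatchet workflow each function below would be declared as a task with its parent dependency; all function names here are illustrative mocks, not part of the Hatchet SDK.

```python
# Minimal sketch of the fixed three-stage pipeline: ingest -> parse -> extract.
# Each stage consumes the previous stage's output, which is exactly the
# dependency the DAG encodes declaratively.

def ingest(doc_url: str) -> dict:
    """Stage 1: fetch the raw document (mocked)."""
    return {"url": doc_url, "raw": b"%PDF-1.7 mock bytes"}

def parse(ingested: dict) -> dict:
    """Stage 2: depends on ingest; convert raw bytes to text (mocked OCR)."""
    return {"url": ingested["url"], "text": "Invoice #123 Total: $45.00"}

def extract(parsed: dict) -> dict:
    """Stage 3: depends on parse; pull structured fields (mocked extractor)."""
    return {"url": parsed["url"], "invoice_id": "123", "total": "45.00"}

def run_pipeline(doc_url: str) -> dict:
    # The fixed ordering below is what the DAG expresses as task dependencies.
    return extract(parse(ingest(doc_url)))
```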
Parse stage
The parse task depends on ingest (Step 1). It converts raw content to structured text: OCR for scanned images, a mock service in these examples.
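A parse stage typically dispatches on content type: image inputs go through OCR, while text-bearing PDFs can use their embedded text layer. The sketch below mocks both backends; the content-type routing is the part that carries over when you swap in real services.

```python
# Mock parse stage that routes by MIME type. Both backends are stubs;
# replace them with a real OCR provider and PDF text extractor in production.

def mock_ocr(raw: bytes) -> str:
    return "OCR text from scanned page"

def mock_pdf_text(raw: bytes) -> str:
    return "Embedded PDF text layer"

def parse_document(content_type: str, raw: bytes) -> str:
    if content_type in ("image/png", "image/jpeg", "image/tiff"):
        return mock_ocr(raw)
    if content_type == "application/pdf":
        return mock_pdf_text(raw)
    raise ValueError(f"unsupported content type: {content_type}")
```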
The examples above use a mock OCR service. To use a real provider, swap in one of the options below: pick a provider, then select your language.
Tesseract is an open-source OCR engine, originally developed at HP and later sponsored by Google; it is now community-maintained. It supports 100+ languages and runs entirely on your own infrastructure with no API key or cloud service required. A solid default for straightforward text extraction from images and scanned documents.
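A hedged sketch of swapping the mock OCR for Tesseract via the `pytesseract` wrapper. It assumes the `tesseract` binary plus the `pytesseract` and `Pillow` packages are installed; when they are absent (or when `force_mock=True`), it falls back to the mock so the pipeline stays runnable locally.

```python
import shutil

def tesseract_available() -> bool:
    """True only when both the pytesseract package and the binary exist."""
    try:
        import pytesseract  # noqa: F401  (pip install pytesseract)
    except ImportError:
        return False
    return shutil.which("tesseract") is not None

def ocr_page(image_path: str, force_mock: bool = False) -> str:
    if force_mock or not tesseract_available():
        return "mock OCR text"  # fallback when Tesseract isn't installed
    import pytesseract
    from PIL import Image  # pip install Pillow
    return pytesseract.image_to_string(Image.open(image_path))
```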
Extract stage
The extract task depends on parse (Step 2). Use an LLM or rules to extract entities. Rate Limits keep extract tasks within provider quotas.
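For documents with predictable layouts, a rules-based extractor can stand in for the LLM entirely. The sketch below pulls an invoice id and total with regexes; the field names and patterns are illustrative, and an LLM-based extractor would replace `extract_fields` while keeping the same output shape.

```python
import re

def extract_fields(text: str) -> dict:
    """Rules-based extraction: regex match invoice id and total amount."""
    invoice = re.search(r"Invoice\s*#?(\w+)", text)
    total = re.search(r"Total:?\s*\$?([\d,]+\.\d{2})", text)
    return {
        "invoice_id": invoice.group(1) if invoice else None,
        "total": total.group(1) if total else None,
    }
```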
The examples above use a mock extractor. To wire in a real LLM for extraction, swap get_extract_service() with a provider:
OpenAI’s Chat Completions API provides access to GPT models for text generation, function calling, and structured outputs. It’s the most widely adopted LLM API and supports streaming, tool use, and JSON mode.
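A hedged sketch of an LLM-backed extractor using the OpenAI Chat Completions API with JSON mode. The model name, prompt, and field names are illustrative assumptions; when `OPENAI_API_KEY` is unset, it returns a mocked result so the pipeline stays runnable without credentials.

```python
import json
import os

def llm_extract(text: str) -> dict:
    if not os.environ.get("OPENAI_API_KEY"):
        # Mock fallback: keeps local runs working without an API key.
        return {"invoice_id": "mock", "total": "0.00"}
    from openai import OpenAI  # pip install openai
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract invoice_id and total from the document. "
                        "Reply with a JSON object containing those two keys."},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```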
Run the worker
Start the worker. For per-document parallelism, use Child Spawning within the ingest stage.
When fanning out to many documents, ensure your workers have enough slots or use Concurrency Control to limit how many run simultaneously.
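The slot-limiting idea can be sketched in-process with a bounded thread pool: in Hatchet the cap comes from worker slots or Concurrency Control, and here `max_parallel` plays the same role. The per-document function is a mock standing in for a spawned child workflow.

```python
from concurrent.futures import ThreadPoolExecutor

def process_document(doc_url: str) -> dict:
    """Mock stand-in for one spawned per-document child run."""
    return {"url": doc_url, "status": "done"}

def fan_out(doc_urls: list[str], max_parallel: int = 4) -> list[dict]:
    # At most max_parallel documents are processed simultaneously,
    # mirroring a worker-slot or concurrency limit.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(process_document, doc_urls))
```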
Common Patterns
| Pattern | Description |
|---|---|
| Invoice extraction | Parse invoices, extract line items and totals with LLM, validate amounts, post to ERP |
| Contract analysis | Extract clauses and terms, classify risk, route for legal review |
| Resume parsing | Parse resumes, extract skills and experience, match to job requisitions |
| Form processing | Extract form fields from scans, validate against schemas, submit to backend systems |
| Document classification | Classify documents by type, route to appropriate DAG workflows |
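The validation step in the invoice-extraction pattern can be as simple as checking that extracted line items sum to the stated total before posting anywhere. A minimal sketch, with illustrative field names and `Decimal` to avoid float rounding on currency:

```python
from decimal import Decimal

def validate_invoice(line_items: list[dict], stated_total: str) -> bool:
    """Return True when the line items sum exactly to the stated total."""
    computed = sum(Decimal(item["amount"]) for item in line_items)
    return computed == Decimal(stated_total)
```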
Related Patterns
- DAG Workflows: The fixed-stage DAG pattern that document pipelines are built on.
- RAG & Data Indexing: When your goal is retrieval (chunk, embed, index) rather than extraction and transformation.
- Batch Processing: General-purpose batch patterns that apply to document workloads.
- Child Spawning: Parallelize document processing across your worker fleet.
Next Steps
- DAG Workflows: define multi-stage pipelines with task dependencies
- Rate Limits: configure rate limiting for OCR and LLM APIs
- Child Spawning: fan out to per-document tasks
- Webhooks: trigger pipelines from file upload endpoints
- Concurrency Control: limit parallel document processing