Web Scraping
Web scraping workflows fetch content from external websites, process it, and store the results. Scraping is inherently unreliable: pages change layout, rate limits kick in, requests time out. Scrape tasks need retries, timeouts, and concurrency control. Hatchet provides all three, plus cron scheduling to refresh scraped data on a recurring cadence.
A typical web scraping pipeline has three parts:
- Scrape: fetch the page content (HTML, rendered JS, or structured API response)
- Process: extract, transform, or summarize the content (optionally with an LLM)
- Schedule: run the pipeline periodically via a cron workflow
Step-by-step walkthrough
You’ll build a scrape task with retries, a processing step, a cron workflow that refreshes your scraped data every 6 hours, and a rate-limited variant to avoid getting blocked.
Define the scrape task
Create a task that fetches a URL and returns the content. Set a timeout (pages can hang) and retries (transient failures are common). The examples below use a mock. Swap it for Firecrawl, Playwright, or any HTTP client.
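As a sketch, the mock can be a plain function that returns the shape a real fetch would. The commented decorator shows where Hatchet's timeout and retry configuration would attach — parameter names there are illustrative, so check the SDK reference for the exact API:

```python
def mock_scrape(url: str) -> dict:
    """Stand-in for a real scraper (Firecrawl, Playwright, httpx, ...).

    Returns the shape a real fetch would: the URL, an HTTP-style
    status code, and the raw content.
    """
    return {
        "url": url,
        "status": 200,
        "content": (
            f"<html><head><title>Example: {url}</title></head>"
            f"<body><p>Mock content for {url}</p></body></html>"
        ),
    }

# In Hatchet, timeout and retries are configured declaratively on the
# task rather than as a retry loop in your code, roughly like
# (illustrative names -- see the SDK docs):
#
#   @hatchet.task(name="scrape", retries=3,
#                 execution_timeout=timedelta(seconds=30))
#   def scrape(input, ctx):
#       return mock_scrape(input.url)
```

Because the retry policy lives on the task, a transient failure re-runs the whole fetch with no sleep/backoff logic in the function body.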
To use a real scraping provider, swap the mock for one of the options below.
Firecrawl is a managed web scraping API that returns clean markdown from any URL. It handles JavaScript rendering, anti-bot bypasses, and sitemap crawling out of the box, so you can focus on what to do with the content instead of how to extract it.
Process the scraped content
A separate task extracts or transforms the raw scraped content. This could be simple parsing, or an LLM call to summarize or extract structured data. Keeping it separate lets you retry processing independently from scraping.
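A minimal stdlib sketch of the processing step — plain HTML parsing here, but an LLM call to summarize or extract fields would sit at the same task boundary, so a failed model call retries without re-fetching the page:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the <title> and visible body text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def process_scraped(scraped: dict) -> dict:
    """Turn a raw scrape result into structured output."""
    parser = _TextExtractor()
    parser.feed(scraped["content"])
    return {
        "url": scraped["url"],
        "title": parser.title,
        "text": " ".join(parser.chunks),
    }
```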
Schedule recurring scrapes
Wrap the pipeline in a cron workflow to refresh data on a schedule. The example below runs every 6 hours and scrapes a list of URLs. Each scrape + process pair runs as child tasks, so failures on one URL don’t block the others.
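The per-URL isolation can be sketched in plain Python. In Hatchet, the loop body would instead spawn scrape and process child tasks from the cron workflow (on a schedule like `0 */6 * * *`); the scrape and process callables are passed in here only to keep the sketch self-contained:

```python
def refresh_all(urls, scrape, process):
    """Run scrape -> process for each URL, isolating failures.

    One bad URL is recorded as a failure instead of aborting the
    whole batch -- the semantics you get when each pair runs as
    child tasks of a cron workflow.
    """
    results, failures = [], []
    for url in urls:
        try:
            results.append(process(scrape(url)))
        except Exception as exc:
            failures.append({"url": url, "error": str(exc)})
    return {"results": results, "failures": failures}
```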
Add rate limiting
Target sites will block you if you send too many requests. Create a separate rate-limited scrape task that caps requests to a fixed number per minute across all workers. Hatchet holds back task executions that would exceed the limit, so you stay within budget without adding sleep logic in your code. See Rate Limits for details.
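Hatchet enforces the limit server-side across all workers; the sketch below only illustrates the fixed-window accounting such a limit performs. The clock is caller-supplied, so demonstrating "hold back, don't sleep" needs no timers:

```python
class FixedWindowLimiter:
    """Illustrative fixed-window rate limiter: at most `limit`
    executions per `window` seconds. Excess work is held back
    (reported as not allowed) rather than slept through, which is
    the behavior a declarative rate limit gives your tasks.
    """

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self._window_start = None
        self._count = 0

    def allow(self, now: float) -> bool:
        # Start a fresh window if none exists or the current one expired.
        if self._window_start is None or now - self._window_start >= self.window:
            self._window_start = now
            self._count = 0
        if self._count < self.limit:
            self._count += 1
            return True
        return False
```

With Hatchet you would not write this class: you declare the limit once and attach a key to the scrape task, and the scheduler queues any execution that would exceed the budget.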
Run the worker
Register all tasks (including the rate-limited variant) and upsert the rate limit before starting the worker. The cron schedule activates when the worker connects.
Always set timeouts and retries on scrape tasks. Pages can hang indefinitely, and transient network failures are common. See Timeouts and Retry Policies.
Common Patterns
| Pattern | Description |
|---|---|
| Price monitoring | Scrape competitor pricing pages on a schedule; alert on changes |
| Content aggregation | Scrape multiple news sources; use LLM to deduplicate and summarize |
| SEO monitoring | Scrape your own pages to verify meta tags, headings, and content |
| Lead enrichment | Scrape company websites to enrich CRM records with latest info |
| Documentation sync | Scrape external docs; chunk and embed for RAG (see RAG & Indexing) |
| Compliance checking | Scrape regulatory pages; alert when content changes |
Related Patterns
- Scheduled Jobs / Cron: Cron expressions and one-time scheduled runs for periodic scraping.
- Batch Processing: Fan out scrapes across many URLs in parallel with concurrency control.
- RAG & Data Indexing: Chunk and embed scraped content for retrieval-augmented generation.
- Document Processing: Extract structured data from scraped documents with OCR and LLM pipelines.
Next Steps
- Cron Triggers: cron expression syntax and configuration
- Retry Policies: handle transient scraping failures
- Rate Limits: throttle requests to avoid being blocked
- Concurrency Control: limit parallel scrapes per domain