
Web Scraping

Web scraping workflows fetch content from external websites, process it, and store the results. Scraping is inherently unreliable: pages change layout, rate limits kick in, requests time out. Scrape tasks need retries, timeouts, and concurrency control. Hatchet provides all three, plus cron scheduling to refresh scraped data on a recurring cadence.

A typical web scraping pipeline has three parts:

  1. Scrape: fetch the page content (HTML, rendered JS, or structured API response)
  2. Process: extract, transform, or summarize the content (optionally with an LLM)
  3. Schedule: run the pipeline periodically via a cron workflow
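The first two parts can be sketched end-to-end in plain Python. This is a minimal, framework-agnostic sketch with a mock scraper; `mock_scrape`, `process`, and `run_pipeline` are illustrative names, not Hatchet APIs:

```python
# Minimal sketch of the scrape -> process pipeline.
# mock_scrape stands in for a real HTTP client or scraping API.

def mock_scrape(url: str) -> str:
    """Pretend to fetch a page; a real implementation would do an HTTP GET."""
    return f"<html><title>Example page for {url}</title></html>"

def process(html: str) -> dict:
    """Extract a tiny bit of structure from the raw content."""
    start = html.find("<title>") + len("<title>")
    end = html.find("</title>")
    return {"title": html[start:end]}

def run_pipeline(url: str) -> dict:
    raw = mock_scrape(url)
    return process(raw)

print(run_pipeline("https://example.com"))
```

The third part, scheduling, is where a workflow engine takes over: instead of calling `run_pipeline` yourself, a cron trigger invokes it on a cadence.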

Step-by-step walkthrough

You’ll build a scrape task with retries, a processing step, a cron workflow that refreshes your scraped data every 6 hours, and a rate-limited variant to avoid getting blocked.

Define the scrape task

Create a task that fetches a URL and returns the content. Set a timeout (pages can hang) and retries (transient failures are common). The examples below use a mock. Swap it for Firecrawl, Playwright, or any HTTP client.
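As a sketch of what those task-level settings buy you, here is a mock fetch wrapped in a hand-rolled timeout budget and retry loop, using only the standard library. All names and values (`SCRAPE_TIMEOUT_SECONDS`, `MAX_RETRIES`, `fetch`) are illustrative; in Hatchet you would configure the equivalent on the task itself rather than writing this loop:

```python
import socket
import time

SCRAPE_TIMEOUT_SECONDS = 30  # illustrative; pages can hang indefinitely
MAX_RETRIES = 3

def fetch(url: str, timeout: float = SCRAPE_TIMEOUT_SECONDS) -> str:
    """Mock fetch. A real version would pass `timeout` to an HTTP client."""
    if not url.startswith("http"):
        raise ValueError(f"not a URL: {url}")
    return f"<html><body>content of {url}</body></html>"

def scrape_with_retries(url: str) -> str:
    """Retry transient failures with exponential backoff between attempts."""
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            return fetch(url)
        except (socket.timeout, ConnectionError) as exc:  # transient errors
            last_error = exc
            time.sleep(2 ** attempt * 0.1)
    raise RuntimeError(f"scrape failed after {MAX_RETRIES} attempts") from last_error

print(scrape_with_retries("https://example.com"))
```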

To use a real scraping provider, swap the mock for one of the following.

Firecrawl is a managed web scraping API that returns clean markdown from any URL. It handles JavaScript rendering, anti-bot bypasses, and sitemap crawling out of the box, so you can focus on what to do with the content instead of how to extract it.

Process the scraped content

A separate task extracts or transforms the raw scraped content. This could be simple parsing, or an LLM call to summarize or extract structured data. Keeping it separate lets you retry processing independently from scraping.
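A minimal sketch of such a processing step, assuming the raw input is HTML. The regex-based extraction here is deliberately crude and the names are illustrative; in a real pipeline this function might instead call an LLM to summarize or extract structured fields:

```python
import re

def process_content(raw_html: str) -> dict:
    """Extract simple structure from scraped HTML.
    A real pipeline might replace this body with an LLM call."""
    title_match = re.search(r"<title>(.*?)</title>", raw_html, re.S)
    text = re.sub(r"<[^>]+>", " ", raw_html)  # crude tag stripping
    return {
        "title": title_match.group(1).strip() if title_match else None,
        "word_count": len(text.split()),
    }

doc = "<html><title>Hatchet Docs</title><body>Scrape and process.</body></html>"
print(process_content(doc))
```

Because this runs as its own task, a bug here (say, a bad regex) can be fixed and retried without re-fetching the page.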

Schedule recurring scrapes

Wrap the pipeline in a cron workflow to refresh data on a schedule. The example below runs every 6 hours and scrapes a list of URLs. Each scrape + process pair runs as child tasks, so failures on one URL don’t block the others.
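The fan-out behavior can be sketched with a thread pool standing in for child tasks (in Hatchet, a cron trigger such as `0 */6 * * *` and per-URL child tasks replace the loop and pool below; all names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

URLS = ["https://example.com/a", "https://example.com/b", "https://bad"]

def scrape_and_process(url: str) -> dict:
    if "bad" in url:  # simulate one URL failing
        raise RuntimeError(f"scrape failed for {url}")
    return {"url": url, "status": "ok"}

def refresh_all(urls):
    """Run each scrape+process pair independently so one failure does not
    block the others. A cron trigger would call this every 6 hours."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(scrape_and_process, u): u for u in urls}
        for future, url in futures.items():
            try:
                results[url] = future.result()
            except RuntimeError as exc:
                results[url] = {"url": url, "error": str(exc)}
    return results

print(refresh_all(URLS))
```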

Add rate limiting

Target sites will block you if you send too many requests. Create a separate rate-limited scrape task that caps requests to a fixed number per minute across all workers. Hatchet holds back task executions that would exceed the limit, so you stay within budget without adding sleep logic in your code. See Rate Limits for details.
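To illustrate the semantics of a fixed requests-per-minute cap, here is a client-side token bucket in stdlib Python. This is only a sketch of the behavior: with Hatchet the limit is enforced server-side across all workers, so you would not write this class yourself:

```python
import time

class TokenBucket:
    """Illustrative rate limiter: at most `rate` requests per `period`
    seconds. Requests that would exceed the budget are held back."""

    def __init__(self, rate: int, period: float = 60.0):
        self.capacity = rate
        self.tokens = float(rate)
        self.fill_rate = rate / period  # tokens replenished per second
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.fill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should defer this request

bucket = TokenBucket(rate=10, period=60.0)  # 10 scrapes per minute
allowed = sum(bucket.try_acquire() for _ in range(15))
print(f"{allowed} of 15 requests allowed immediately")
```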

Run the worker

Register all tasks (including the rate-limited variant) and upsert the rate limit before starting the worker. The cron schedule activates when the worker connects.
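The ordering can be sketched with a toy worker object; the `Worker` class and its method names below are illustrative stand-ins, not the Hatchet SDK:

```python
class Worker:
    """Toy stand-in for a task worker, showing startup ordering only."""

    def __init__(self):
        self.rate_limits: dict[str, int] = {}
        self.tasks: list = []
        self.started = False

    def upsert_rate_limit(self, key: str, per_minute: int) -> None:
        self.rate_limits[key] = per_minute  # create or update, idempotent

    def register(self, *tasks) -> None:
        self.tasks.extend(tasks)

    def start(self) -> None:
        assert self.rate_limits, "upsert rate limits before starting"
        self.started = True  # cron schedules activate once connected

def scrape(): ...
def scrape_rate_limited(): ...
def process(): ...

worker = Worker()
worker.upsert_rate_limit("scrape-per-minute", per_minute=10)  # 1. limit first
worker.register(scrape, scrape_rate_limited, process)         # 2. all tasks
worker.start()                                                # 3. then start
print(worker.started, len(worker.tasks))
```

Upserting the limit before `start` matters: if the rate-limited task ran before the limit existed, nothing would hold it back.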

⚠️ Always set timeouts and retries on scrape tasks. Pages can hang indefinitely, and transient network failures are common. See Timeouts and Retry Policies.

Common Patterns

| Pattern | Description |
| --- | --- |
| Price monitoring | Scrape competitor pricing pages on a schedule; alert on changes |
| Content aggregation | Scrape multiple news sources; use an LLM to deduplicate and summarize |
| SEO monitoring | Scrape your own pages to verify meta tags, headings, and content |
| Lead enrichment | Scrape company websites to enrich CRM records with the latest info |
| Documentation sync | Scrape external docs; chunk and embed for RAG (see RAG & Indexing) |
| Compliance checking | Scrape regulatory pages; alert when content changes |

Next Steps