Web Scraping
Web scraping workflows fetch content from external websites, process it, and store the results. Scraping is inherently unreliable: pages change layout, rate limits kick in, requests time out. Scrape tasks need retries, timeouts, and concurrency control. Hatchet provides all three, plus cron scheduling to refresh scraped data on a recurring cadence.
A typical web scraping pipeline has three parts:
- Scrape: fetch the page content (HTML, rendered JS, or structured API response)
- Process: extract, transform, or summarize the content (optionally with an LLM)
- Schedule: run the pipeline periodically via a cron workflow
Step-by-step walkthrough
You’ll build a scrape task with retries, a processing step, a cron workflow that refreshes your scraped data every 6 hours, and a rate-limited variant to avoid getting blocked.
Define the scrape task
Create a task that fetches a URL and returns the content. Set a timeout (pages can hang) and retries (transient failures are common). The examples below use a mock. Swap it for Firecrawl, Playwright, or any HTTP client.
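As a sketch, the mock can be a plain function that returns the shape a real fetch would. The commented decorator shows where Hatchet's timeout and retry configuration would attach — parameter names there are illustrative, so check the SDK reference for the exact API:

```python
def mock_scrape(url: str) -> dict:
    """Stand-in for a real scraper (Firecrawl, Playwright, httpx, ...).

    Returns the shape a real fetch would: the URL, an HTTP-style
    status code, and the raw content.
    """
    return {
        "url": url,
        "status": 200,
        "content": (
            f"<html><head><title>Example: {url}</title></head>"
            f"<body><p>Mock content for {url}</p></body></html>"
        ),
    }

# In Hatchet, timeout and retries are configured declaratively on the
# task rather than as a retry loop in your code, roughly like
# (illustrative names -- see the SDK docs):
#
#   @hatchet.task(name="scrape", retries=3,
#                 execution_timeout=timedelta(seconds=30))
#   def scrape(input, ctx):
#       return mock_scrape(input.url)
```

Because the retry policy lives on the task, a transient failure re-runs the whole fetch with no sleep/backoff logic in the function body.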
To use a real scraping provider, swap the mock for one of the options below.
Firecrawl is a managed web scraping API that returns clean markdown from any URL. It handles JavaScript rendering, anti-bot bypasses, and sitemap crawling out of the box, so you can focus on what to do with the content instead of how to extract it.
Process the scraped content
A separate task extracts or transforms the raw scraped content. This could be simple parsing, or an LLM call to summarize or extract structured data. Keeping it separate lets you retry processing independently from scraping.
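A minimal stdlib sketch of the processing step — plain HTML parsing here, but an LLM call to summarize or extract fields would sit at the same task boundary, so a failed model call retries without re-fetching the page:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the <title> and visible body text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def process_scraped(scraped: dict) -> dict:
    """Turn a raw scrape result into structured output."""
    parser = _TextExtractor()
    parser.feed(scraped["content"])
    return {
        "url": scraped["url"],
        "title": parser.title,
        "text": " ".join(parser.chunks),
    }
```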
Schedule recurring scrapes
Wrap the pipeline in a cron workflow to refresh data on a schedule. The example below runs every 6 hours and scrapes a list of URLs. Each scrape + process pair runs as child tasks, so failures on one URL don’t block the others.
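The per-URL isolation can be sketched in plain Python. In Hatchet, the loop body would instead spawn scrape and process child tasks from the cron workflow (on a schedule like `0 */6 * * *`); the scrape and process callables are passed in here only to keep the sketch self-contained:

```python
def refresh_all(urls, scrape, process):
    """Run scrape -> process for each URL, isolating failures.

    One bad URL is recorded as a failure instead of aborting the
    whole batch -- the semantics you get when each pair runs as
    child tasks of a cron workflow.
    """
    results, failures = [], []
    for url in urls:
        try:
            results.append(process(scrape(url)))
        except Exception as exc:
            failures.append({"url": url, "error": str(exc)})
    return {"results": results, "failures": failures}
```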
Add rate limiting
Target sites will block you if you send too many requests. Create a separate rate-limited scrape task that caps requests to a fixed number per minute across all workers. Hatchet holds back task executions that would exceed the limit, so you stay within budget without adding sleep logic in your code. See Rate Limits for details.
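Hatchet enforces the limit server-side across all workers; the sketch below only illustrates the fixed-window accounting such a limit performs. The clock is caller-supplied, so demonstrating "hold back, don't sleep" needs no timers:

```python
class FixedWindowLimiter:
    """Illustrative fixed-window rate limiter: at most `limit`
    executions per `window` seconds. Excess work is held back
    (reported as not allowed) rather than slept through, which is
    the behavior a declarative rate limit gives your tasks.
    """

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self._window_start = None
        self._count = 0

    def allow(self, now: float) -> bool:
        # Start a fresh window if none exists or the current one expired.
        if self._window_start is None or now - self._window_start >= self.window:
            self._window_start = now
            self._count = 0
        if self._count < self.limit:
            self._count += 1
            return True
        return False
```

With Hatchet you would not write this class: you declare the limit once and attach a key to the scrape task, and the scheduler queues any execution that would exceed the budget.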
Run the worker
Register all tasks (including the rate-limited variant) and upsert the rate limit before starting the worker. The cron schedule activates when the worker connects.
Always set timeouts and retries on scrape tasks. Pages can hang indefinitely, and transient network failures are common. See Timeouts and Retry Policies.
Common Patterns
| Pattern | Description |
|---|---|
| Price monitoring | Scrape competitor pricing pages on a schedule; alert on changes |
| Content aggregation | Scrape multiple news sources; use LLM to deduplicate and summarize |
| SEO monitoring | Scrape your own pages to verify meta tags, headings, and content |
| Lead enrichment | Scrape company websites to enrich CRM records with latest info |
| Documentation sync | Scrape external docs; chunk and embed for RAG (see RAG & Indexing) |
| Compliance checking | Scrape regulatory pages; alert when content changes |
Related Patterns
- Scheduled Jobs / Cron: Cron expressions and one-time scheduled runs for periodic scraping.
- Batch Processing: Fan out scrapes across many URLs in parallel with concurrency control.
- RAG & Data Indexing: Chunk and embed scraped content for retrieval-augmented generation.
- Document Processing: Extract structured data from scraped documents with OCR and LLM pipelines.
Next Steps
- Cron Triggers: cron expression syntax and configuration
- Retry Policies: handle transient scraping failures
- Rate Limits: throttle requests to avoid being blocked
- Concurrency Control: limit parallel scrapes per domain