
Bulk Cancellations and Replays

Hatchet V1 adds the ability to cancel or replay workflow runs in bulk, which you can now do either in the Hatchet Dashboard or programmatically via the SDKs and the REST API.

In both cases, there are two ways to bulk cancel or replay workflows:

  1. You can provide a list of workflow run IDs to cancel or replay, which will cancel or replay all of the runs in the list.
  2. You can provide a list of filters, similar to the list of filters on workflow runs in the Dashboard, and cancel or replay runs matching those filters. For instance, if you wanted to replay all failed runs of a SimpleWorkflow from the past fifteen minutes that had the foo field in additional_metadata set to bar, you could apply those filters and replay all of the matching runs.

Bulk Operations by Run IDs

The first way to bulk cancel or replay runs is by providing a list of run IDs. This is the most straightforward approach.

In the Python SDK, the mechanics of bulk replaying and bulk cancelling workflows are exactly the same: the only change is replacing, e.g., hatchet.runs.bulk_cancel with hatchet.runs.bulk_replay.

First, we’ll start by fetching a workflow via the REST API.

Now that we have a workflow, we’ll get runs for it, so that we can use them to bulk cancel by run id.

And finally, we can cancel the runs in bulk.
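Putting those three steps together, a sketch using the Python SDK might look like the following. The method names come from this guide; the .rows and .metadata.id accessors are assumptions modeled on typical SDK response shapes, so check your SDK version's reference before relying on them.

```python
from hatchet_sdk import BulkCancelReplayOpts, Hatchet

hatchet = Hatchet()

# Step 1: fetch a workflow via the REST API.
# (Assumes at least one workflow exists in the tenant.)
workflows = hatchet.workflows.list()
workflow = workflows.rows[0]

# Step 2: list runs for that workflow so we have run IDs to cancel.
workflow_runs = hatchet.runs.list(workflow_ids=[workflow.metadata.id])
workflow_run_ids = [run.metadata.id for run in workflow_runs.rows]

# Step 3: cancel the runs in bulk. Swapping bulk_cancel for bulk_replay
# would turn this into a bulk replay instead.
hatchet.runs.bulk_cancel(opts=BulkCancelReplayOpts(ids=workflow_run_ids))
```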

Note that the Python SDK also exposes async versions of each of these methods:

  • workflows.list -> await workflows.aio_list
  • runs.list -> await runs.aio_list
  • runs.bulk_cancel -> await runs.aio_bulk_cancel
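The async variants mirror their synchronous counterparts, so the same flow can run inside an existing event loop. A minimal sketch, with the same assumed response-shape caveats as above:

```python
import asyncio

from hatchet_sdk import BulkCancelReplayOpts, Hatchet

hatchet = Hatchet()

async def cancel_runs_for_first_workflow() -> None:
    # Same three steps as the synchronous example, using the aio_ variants
    # so none of the calls block the event loop. The .rows and .metadata.id
    # accessors are assumptions; check your SDK version's reference.
    workflows = await hatchet.workflows.aio_list()
    runs = await hatchet.runs.aio_list(workflow_ids=[workflows.rows[0].metadata.id])
    await hatchet.runs.aio_bulk_cancel(
        opts=BulkCancelReplayOpts(ids=[run.metadata.id for run in runs.rows])
    )

asyncio.run(cancel_runs_for_first_workflow())
```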

Bulk Operations by Filters

The second way to bulk cancel or replay runs is by providing a list of filters. This is the most powerful way to cancel or replay runs in bulk, as it allows you to cancel or replay all runs matching a set of arbitrary filters without needing to provide IDs for the runs in advance.

The example below provides some filters you might use to cancel or replay runs in bulk. Importantly, these filters closely mirror the ones you can use in the Hatchet Dashboard to filter which workflow runs are displayed.
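As a hedged sketch, a bulk-cancel-by-filters request body could be assembled like this, matching the earlier scenario of replaying failed runs from the past fifteen minutes with foo set to bar in additional_metadata. Every field name below (workflowIds, since, statuses, additionalMetadata) is an assumption modeled on the Dashboard filters, and the endpoint path is omitted; consult the REST API reference for the exact schema.

```python
from datetime import datetime, timedelta, timezone

def build_cancel_filters(workflow_id: str) -> dict:
    """Build a hypothetical bulk-cancel filter payload.

    Field names are illustrative assumptions, not the confirmed API schema.
    """
    now = datetime.now(timezone.utc)
    return {
        "filter": {
            "workflowIds": [workflow_id],
            # Only consider runs from the past fifteen minutes.
            "since": (now - timedelta(minutes=15)).isoformat(),
            # Only target runs that have failed.
            "statuses": ["FAILED"],
            # Match runs whose additional_metadata has foo=bar.
            "additionalMetadata": {"foo": "bar"},
        }
    }

payload = build_cancel_filters("<workflow-id>")
```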

Running this request will cancel all workflow runs matching the filters provided.

Manual Retries

Hatchet provides a manual retry mechanism that allows you to handle failed workflow instances flexibly from the Hatchet dashboard.

Navigate to the specific workflow in the Hatchet dashboard and click on the failed run. From there, you can inspect the details of the run, including the input data and the failure reason for each task.

To retry a failed task, simply click on the task in the run details view and then click the “Replay” button. This will create a new instance of the workflow, starting from the failed task, and using the same input data as the original run.

Manual retries give you full control over when and how to reprocess failed instances. For example, you may choose to wait until an external service is back online before retrying instances that depend on that service, or you may need to deploy a bug fix to your workflow code before retrying instances that were affected by the bug.

A Note on Dead Letter Queues

A dead letter queue (DLQ) is a messaging concept used to handle messages that cannot be processed successfully. In the context of workflow management, a DLQ can be used to store failed workflow instances that require manual intervention or further analysis.

While Hatchet does not have a built-in dead letter queue feature, the persistence of failed workflow instances in the dashboard serves a similar purpose. By keeping a record of failed instances, Hatchet allows you to track and manage failures, perform root cause analysis, and take appropriate actions, such as modifying input data or updating your workflow code before manually retrying the failed instances.

It’s important to note that the term “dead letter queue” is more commonly associated with messaging systems like Apache Kafka or Amazon SQS, where unprocessed messages are automatically moved to a separate queue for manual handling. In Hatchet, the failed instances are not automatically moved to a separate queue but are instead persisted in the dashboard for manual management.