Simple Task Retries

Hatchet handles task failures through a retry policy. You specify how many times a task should be retried if it fails, which improves the reliability and resilience of your tasks.

Task-level retries can be added to both Standalone Tasks and Workflow Tasks.

How it works

When a task fails (i.e. throws an error or returns a non-zero exit code), Hatchet can automatically retry the task based on the retries configuration defined in the task object. Here’s how it works:

  1. If a task fails and retries is set to a value greater than 0, Hatchet will catch the error and retry the task.
  2. The task will be retried up to the specified number of times, with each retry being executed after a short delay to avoid overwhelming the system.
  3. If the task succeeds during any of the retries, the task will continue as normal.
  4. If the task continues to fail after exhausting all the specified retries, the task will be marked as failed.

This simple retry mechanism can help to mitigate transient failures, such as network issues or temporary unavailability of external services, without requiring complex error handling logic in your task code.
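
The four steps above can be sketched as a plain retry loop. This is only an illustration of the described semantics, not Hatchet's implementation: `run_with_retries`, `make_flaky`, and the `delay_seconds` parameter are hypothetical names introduced for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], retries: int, delay_seconds: float = 0.0) -> T:
    """Run `task`, retrying up to `retries` times after the first failure."""
    attempt = 0
    while True:
        try:
            return task()  # step 3: a success ends the loop
        except Exception:
            if attempt >= retries:
                raise  # step 4: retries exhausted, the failure propagates
            attempt += 1
            time.sleep(delay_seconds)  # step 2: short pause before the next try


def make_flaky(failures_before_success: int) -> Callable[[], str]:
    """Build a task that fails a fixed number of times, then succeeds."""
    calls = {"count": 0}

    def task() -> str:
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise ConnectionError("transient failure")
        return f"succeeded on call {calls['count']}"

    return task
```

With `retries=3`, a task that fails twice succeeds on its third call; with `retries=2`, a task that fails five times is attempted three times in total and then reported as failed.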

How to use task-level retries

To enable retries for a task, add the retries property to the task object in your task definition. The property can be set on any task, and Hatchet handles the retry logic automatically.

It’s important to note that task-level retries are not suitable for all types of failures. For example, if a task fails due to a programming error or an invalid configuration, retrying the task will likely not resolve the issue. In these cases, you should fix the underlying problem in your code or configuration rather than relying on retries. See Bypassing retry logic.

Additionally, if a task interacts with external services or databases, you should ensure that the operation is idempotent (i.e. can be safely repeated without changing the result) before enabling retries. Otherwise, retrying the task could lead to unintended side effects or inconsistencies in your data.
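
One common way to make an external write safe to repeat is an idempotency key. The sketch below uses an in-memory `PaymentLedger` class, a hypothetical stand-in for a real external service, to show why a retried call does not produce a second charge.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentLedger:
    """In-memory stand-in for an external service that records charges."""

    charges: dict[str, int] = field(default_factory=dict)

    def charge(self, idempotency_key: str, amount_cents: int) -> int:
        # A repeated key returns the recorded charge instead of charging again.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount_cents
        return self.charges[idempotency_key]

    def total(self) -> int:
        """Sum of all recorded charges, in cents."""
        return sum(self.charges.values())
```

If a task calls `charge("order-42", 1000)`, fails afterwards, and is retried, the second call with the same key leaves the total unchanged.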

Accessing the Retry Count in a Running Task

You can access the current retry count on the task’s context object.
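
As a rough illustration of why the retry count is useful, the sketch below branches on it to pick a fallback. `StubContext` is a hypothetical stand-in for the task context, and the attribute name `retry_count` and the convention that the first attempt has a count of 0 are assumptions; check the SDK reference for the exact accessor.

```python
from dataclasses import dataclass


@dataclass
class StubContext:
    """Hypothetical stand-in for the task context; only the retry count is modeled."""

    retry_count: int


def choose_endpoint(ctx: StubContext) -> str:
    # Assumption: the first attempt reports 0, each retry increments the count.
    if ctx.retry_count == 0:
        return "primary"
    return "replica"  # on any retry, fall back to a different endpoint
```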

Exponential Backoff

Hatchet also supports exponential backoff for retries, which can be useful for handling failures in a more resilient manner. Exponential backoff increases the delay between retries exponentially, giving the failing service more time to recover before the next retry.
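
To make the idea concrete, here is a minimal sketch of an exponential schedule with an upper bound. The parameter names `factor` and `max_seconds` and the exact formula are assumptions chosen for illustration; they are not taken from Hatchet's configuration, so the delays Hatchet actually applies may differ.

```python
def backoff_delays(retries: int, factor: float = 2.0, max_seconds: float = 10.0) -> list[float]:
    """Delay before each retry: factor**n for retry n, capped at max_seconds."""
    return [min(factor ** n, max_seconds) for n in range(1, retries + 1)]
```

With these defaults, five retries wait 2, 4, 8, 10, and 10 seconds: the delay doubles until it reaches the cap.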

Bypassing Retry Logic

The Hatchet SDKs each expose a NonRetryable exception, which allows you to bypass pre-configured retry logic for the task. If your task raises this exception, it will not be retried. This allows you to circumvent the default retry behavior in instances where you don’t want to or cannot safely retry. Some examples in which this might be useful include:

  1. A task that calls an external API which returns a 4XX response code.
  2. A task that contains a single non-idempotent operation that can fail but cannot safely be rerun on failure, such as a billing operation.
  3. A failure that requires manual intervention to resolve.

In these cases, even though retries is set to a non-zero number (meaning the task would ordinarily retry), Hatchet will not retry.
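
The bypass can be pictured as one extra branch in a retry loop. In this sketch, `NonRetryableError` is a local stand-in for the SDK's NonRetryable exception (the real class name and import path depend on the SDK), and `run_task` is a hypothetical helper, not Hatchet's scheduler.

```python
from typing import Callable


class NonRetryableError(Exception):
    """Local stand-in for the SDK's NonRetryable exception."""


def run_task(task: Callable[[], None], retries: int) -> tuple[str, int]:
    """Return the final status and the number of attempts that were made."""
    attempts = 0
    while True:
        attempts += 1
        try:
            task()
            return "succeeded", attempts
        except NonRetryableError:
            return "failed", attempts  # bypass: stop even if retries remain
        except Exception:
            if attempts > retries:
                return "failed", attempts  # ordinary failure, retries exhausted


def reject_bad_request() -> None:
    raise NonRetryableError("API returned 400; retrying cannot help")


def always_timeout() -> None:
    raise TimeoutError("upstream did not respond")
```

With `retries=3`, `always_timeout` is attempted four times before failing, while `reject_bad_request` fails after a single attempt.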

Python SDK Client Retry Behavior

The retry behavior described above applies to task execution inside Hatchet. The Python SDK also has separate retry behavior for certain client-side REST and gRPC calls made by the SDK itself.

Task retries and SDK client retries are separate mechanisms and are configured separately. Task retries control whether Hatchet retries a task after it fails in a worker. SDK client retries control whether the Python SDK retries certain API calls to Hatchet.

Default client retry behavior

By default, the Python SDK retries certain client calls with exponential backoff, with max_attempts defaulting to 5.

REST API calls

| Error Type | Retried by Default |
| --- | --- |
| HTTP 5xx (server errors) | Yes |
| HTTP 404 (not found) | Yes |
| HTTP 429 (too many requests) | No |
| HTTP 400, 401, 403, 409, 422 (client errors) | No |
| Transport errors (timeout, connection, TLS, protocol) | No |

gRPC calls

| Status Code | Retried |
| --- | --- |
| UNAVAILABLE, DEADLINE_EXCEEDED, INTERNAL | Yes |
| RESOURCE_EXHAUSTED, ABORTED, UNKNOWN | Yes |
| UNIMPLEMENTED, NOT_FOUND, INVALID_ARGUMENT | No |
| ALREADY_EXISTS, UNAUTHENTICATED, PERMISSION_DENIED | No |

REST 404 responses are retried by default because some REST reads can observe replication lag between the core database and the OLAP database.
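
The default decisions in the two tables above can be restated as small predicates. This is a sketch of the documented behavior only, not the SDK's internal code; the function and constant names are hypothetical.

```python
RETRIED_GRPC_CODES = frozenset(
    {"UNAVAILABLE", "DEADLINE_EXCEEDED", "INTERNAL", "RESOURCE_EXHAUSTED", "ABORTED", "UNKNOWN"}
)


def rest_status_retried_by_default(status_code: int) -> bool:
    """HTTP 5xx and 404 are retried; 429 and other 4xx responses are not."""
    return status_code == 404 or 500 <= status_code <= 599


def grpc_status_retried(status_name: str) -> bool:
    """True for the gRPC status codes listed as retried in the table."""
    return status_name in RETRIED_GRPC_CODES
```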

Configuring Python SDK client retries

The Python SDK exposes client retry configuration through TenacityConfig, either directly in ClientConfig or via environment variables.

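```python
import os

from hatchet_sdk import Hatchet
from hatchet_sdk.config import ClientConfig, HTTPMethod, TenacityConfig

# The values below match the documented defaults; change them to tune client retries.
hatchet = Hatchet(
    config=ClientConfig(
        token=os.environ["HATCHET_CLIENT_TOKEN"],
        tenacity=TenacityConfig(
            max_attempts=5,  # set to 0 to disable client retries
            retry_429=False,  # do not retry HTTP 429 responses
            retry_transport_errors=False,  # do not retry timeouts, connection, or TLS errors
            # Methods retried on transport errors once retry_transport_errors is enabled.
            retry_transport_methods=[HTTPMethod.GET, HTTPMethod.DELETE],
        ),
    )
)
```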

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| max_attempts | int | Maximum number of retry attempts. Set to 0 to disable retries. | 5 |
| retry_429 | bool | Enable retries for HTTP 429 Too Many Requests responses. | False |
| retry_transport_errors | bool | Enable retries for REST transport-level errors (timeout, connection, TLS). | False |
| retry_transport_methods | list[HTTPMethod] | HTTP methods to retry on transport errors when retry_transport_errors is enabled. | [GET, DELETE] |

You can also configure these via environment variables:

| Environment Variable | Description |
| --- | --- |
| HATCHET_CLIENT_TENACITY_MAX_ATTEMPTS | Maximum retry attempts |
| HATCHET_CLIENT_TENACITY_RETRY_429 | Enable 429 retries (true/false) |
| HATCHET_CLIENT_TENACITY_RETRY_TRANSPORT_ERRORS | Enable transport error retries (true/false) |

Idempotency considerations

⚠️ When retry_transport_errors is enabled, only idempotent HTTP methods (GET, DELETE) are retried by default. Non-idempotent methods (POST, PUT, PATCH) are excluded because retrying them after a transport error could result in duplicate operations if the original request succeeded but the response was lost.

You can add non-idempotent methods to retry_transport_methods, but only do so if:

  1. Your operations are idempotent (for example, because they use idempotency keys), or
  2. You understand and accept the risk of duplicate operations
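
The interaction between retry_transport_errors and retry_transport_methods can be sketched as a single predicate. The function below is a hypothetical restatement of the rules described here, using plain strings for HTTP methods rather than the SDK's HTTPMethod type; it is not the SDK's code.

```python
from typing import Collection

DEFAULT_TRANSPORT_METHODS = frozenset({"GET", "DELETE"})


def retries_transport_error(
    method: str,
    retry_transport_errors: bool = False,
    retry_transport_methods: Collection[str] = DEFAULT_TRANSPORT_METHODS,
) -> bool:
    """Decide whether a transport error on `method` would be retried."""
    if not retry_transport_errors:
        return False  # transport retries are off by default
    return method.upper() in retry_transport_methods
```

Enabling the flag alone retries GET and DELETE but still skips POST. POST is retried only once it has been added to the method list explicitly, which is the opt-in the warning above refers to.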

Retry timing

Python SDK client retries use exponential backoff with jitter. Fine-grained backoff timing is not currently configurable through TenacityConfig.
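
For readers unfamiliar with the term, the sketch below shows one common form of exponential backoff with jitter ("full jitter"), where each delay is drawn uniformly between zero and an exponentially growing ceiling. The `base` and `cap` values and the choice of full jitter are assumptions for illustration; the SDK's actual timing is not documented here and may differ.

```python
import random
from typing import Optional


def jittered_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 8.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    if rng is None:
        rng = random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Randomizing the delay spreads out retries from many clients that failed at the same moment, so they do not all hit the server again at once.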

Go SDK Client Retry Behavior

As with the Python SDK, task retries apply to task execution inside Hatchet, while the Go SDK separately retries some REST and gRPC calls that the SDK itself makes to Hatchet.

Task retries and SDK client retries are separate mechanisms and are configured separately. Task retries control whether Hatchet retries a task after it fails in a worker. SDK client retries control whether the Go SDK retries certain API calls to Hatchet.

Default Go client retry behavior

By default, the Go SDK retries certain client calls with exponential backoff. REST reads use up to 5 total attempts: the initial attempt plus up to 4 retries. gRPC calls keep their existing 5-attempt retry limit.

REST read retries use bounded jittered backoff. When the caller request context has no deadline, each REST attempt uses a response-header timeout without cutting off response body reads. If the caller context already has a deadline, that deadline governs the whole request.

REST API calls (bodyless GET and HEAD only)

| Error Type | Retried by Default |
| --- | --- |
| HTTP 502, 503, 504 (gateway errors) | Yes |
| HTTP 404 (not found) | No |
| HTTP 429 (too many requests) | Yes |
| HTTP 400, 401, 403, 409, 422 (client errors) | No |
| Transport errors (timeout, connection, TLS, protocol) | Yes |

For HTTP 429 responses on idempotent reads, the Go SDK honors a valid Retry-After header when it fits the client retry cap. When Retry-After is missing, invalid, or oversized, it falls back to the same bounded jittered backoff used for other retriable errors.
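
The Retry-After rule can be sketched as follows. The example is written in Python for consistency with the rest of this guide and is not the Go SDK's code: `delay_for_429` and its parameters are hypothetical, and the sketch only handles the delta-seconds form of Retry-After, treating an HTTP-date value as invalid.

```python
from typing import Optional


def delay_for_429(retry_after: Optional[str], fallback_delay: float, cap_seconds: float) -> float:
    """Use Retry-After when it is valid and within the cap, else the fallback delay."""
    if retry_after is not None:
        try:
            seconds = int(retry_after)  # delta-seconds form only
        except ValueError:
            seconds = -1  # unparseable header: treated as invalid
        if 0 <= seconds <= cap_seconds:
            return float(seconds)
    return fallback_delay  # missing, invalid, or oversized header
```

Here `fallback_delay` stands for the bounded jittered backoff value that would have been used for any other retriable error.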

Unlike the Python SDK, the Go SDK does not retry HTTP 404 responses on REST reads in this release. Python retries some 404 reads to account for replication lag between the core database and the OLAP database.

gRPC calls

| Status Code | Retried |
| --- | --- |
| UNAVAILABLE, DEADLINE_EXCEEDED, INTERNAL | Yes |
| RESOURCE_EXHAUSTED | Yes |
| FAILED_PRECONDITION | No |
| UNIMPLEMENTED, NOT_FOUND, INVALID_ARGUMENT | No |
| ALREADY_EXISTS, UNAUTHENTICATED, PERMISSION_DENIED | No |

FAILED_PRECONDITION is not retried because Hatchet uses it for non-transient control-plane signals such as inactive listeners. The unary interceptor still retries all unary RPCs, including writes, in this release.

Configuring Go SDK client retries

Use environment variables to disable SDK client retries:

| Environment Variable | Description |
| --- | --- |
| HATCHET_CLIENT_NO_RETRY | Disables both REST and gRPC SDK client retries when set to a truthy value. |
| HATCHET_CLIENT_NO_GRPC_RETRY | Legacy gRPC-only retry control. Disables gRPC SDK retries only. REST read retries remain enabled unless HATCHET_CLIENT_NO_RETRY is set. |

If both variables are set, all SDK client retries are disabled.
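
The precedence between the two variables can be summarized in a few lines. The sketch is in Python for consistency with the rest of this guide and is not the Go SDK's code. Which strings count as "truthy" is an assumption here (`1`, `true`, `yes`, case-insensitive); the SDK's actual parsing may accept a different set.

```python
from typing import Mapping


def go_client_retries_enabled(env: Mapping[str, str]) -> dict[str, bool]:
    """Report which SDK client retries stay enabled for a given environment."""

    def truthy(name: str) -> bool:
        # Assumption: these values are treated as truthy.
        return env.get(name, "").strip().lower() in {"1", "true", "yes"}

    no_retry = truthy("HATCHET_CLIENT_NO_RETRY")
    no_grpc_retry = truthy("HATCHET_CLIENT_NO_GRPC_RETRY")
    return {
        "rest": not no_retry,
        "grpc": not (no_retry or no_grpc_retry),
    }
```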

Idempotency considerations

⚠️ Go SDK REST retries apply only to bodyless GET and HEAD requests. POST, PUT, PATCH, and DELETE requests are never retried by the SDK client in this release. Bodied requests are excluded because Go http.Request bodies are one-shot unless GetBody is set or the SDK buffers and rebuilds the body. The generated REST clients do not set GetBody.

Conclusion

Hatchet’s task-level retry feature is a simple and effective way to handle transient failures and improve the reliability and resilience of your tasks. By specifying the number of retries for each task, you let tasks recover from temporary issues without complex error handling logic.

Remember to use retries judiciously and only for tasks that are idempotent. Exponential backoff is available for spacing out retries, and the NonRetryable exception lets a task opt out of retries when repeating it would be unsafe or pointless. For more advanced strategies, such as circuit breaking, stay tuned for future updates to Hatchet’s retry capabilities.