
Retry Policies

Simple Task Retries

Hatchet provides a simple and effective way to handle failures in your tasks using a retry policy. You specify the number of times a task should be retried if it fails, which helps improve the reliability and resilience of your workflows.

Task-level retries can be added to both Standalone Tasks and Workflow Tasks.

How it works

When a task fails (i.e. throws an error or returns a non-zero exit code), Hatchet can automatically retry the task based on the retries configuration defined in the task object. Here’s how it works:

  1. If a task fails and retries is set to a value greater than 0, Hatchet will catch the error and retry the task.
  2. The task will be retried up to the specified number of times, with each retry being executed after a short delay to avoid overwhelming the system.
  3. If the task succeeds during any of the retries, the task will continue as normal.
  4. If the task continues to fail after exhausting all the specified retries, the task will be marked as failed.

This simple retry mechanism can help to mitigate transient failures, such as network issues or temporary unavailability of external services, without requiring complex error handling logic in your task code.
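The steps above can be sketched in plain Python. This is an illustration of the retry semantics only, not Hatchet's implementation:

```python
import time

def run_with_retries(task, retries: int, delay_seconds: float = 1.0):
    """Illustrative retry loop: run `task`, retrying up to `retries` times."""
    attempt = 0
    while True:
        try:
            return task()  # success: the task continues as normal
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: the task is marked as failed
            attempt += 1
            time.sleep(delay_seconds)  # short delay between attempts

# Example: a task that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky, retries=3, delay_seconds=0))  # prints "ok" on the third attempt
```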

How to use task-level retries

To enable retries for a task, simply add the retries property to the task object in your task definition:
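For example, with the Python SDK (a sketch — the decorator and parameter names follow the Python SDK and may differ slightly by SDK and version):

```python
from hatchet_sdk import Context, EmptyModel, Hatchet

hatchet = Hatchet()

# `retries=3` means the task is attempted once, then retried up to 3 times.
@hatchet.task(name="fetch-data", retries=3)
def fetch_data(input: EmptyModel, ctx: Context) -> dict:
    # Task logic that may fail transiently goes here.
    ...
```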

You can add the retries property to any task, and Hatchet will handle the retry logic automatically.

It’s important to note that task-level retries are not suitable for all types of failures. For example, if a task fails due to a programming error or an invalid configuration, retrying the task will likely not resolve the issue. In these cases, you should fix the underlying problem in your code or configuration rather than relying on retries.

Additionally, if a task interacts with external services or databases, you should ensure that the operation is idempotent (i.e. can be safely repeated without changing the result) before enabling retries. Otherwise, retrying the task could lead to unintended side effects or inconsistencies in your data.

Accessing the Retry Count in a Running Task

If you need to access the current retry count within a task, you can use the retryCount method available in the task context:
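In the Python SDK, the current attempt is exposed on the task context as shown below (property naming varies by SDK — for example, retryCount in TypeScript — so check your SDK's reference):

```python
from hatchet_sdk import Context, EmptyModel, Hatchet

hatchet = Hatchet()

@hatchet.task(name="fetch-data", retries=3)
def fetch_data(input: EmptyModel, ctx: Context) -> dict:
    # retry_count is 0 on the first attempt and increments on each retry
    if ctx.retry_count > 0:
        print(f"retry attempt {ctx.retry_count}")
    ...
```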

Exponential Backoff

Hatchet also supports exponential backoff for retries, which can be useful for handling failures in a more resilient manner. Exponential backoff increases the delay between retries exponentially, giving the failing service more time to recover before the next retry.
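A sketch of configuring backoff alongside retries in the Python SDK (the parameter names backoff_factor and backoff_max_seconds are assumptions based on the Python SDK's task options; verify them against your SDK version):

```python
from hatchet_sdk import Context, EmptyModel, Hatchet

hatchet = Hatchet()

@hatchet.task(
    name="call-flaky-service",
    retries=5,
    backoff_factor=2.0,       # delay grows roughly exponentially per attempt
    backoff_max_seconds=60,   # cap on the delay between attempts
)
def call_flaky_service(input: EmptyModel, ctx: Context) -> dict:
    ...
```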

Bypassing Retry Logic

The Hatchet SDKs each expose a NonRetryable exception, which allows you to bypass pre-configured retry logic for the task. If your task raises this exception, it will not be retried. This allows you to circumvent the default retry behavior in instances where you don’t want to or cannot safely retry. Some examples in which this might be useful include:

  1. A task that calls an external API which returns a 4XX response code.
  2. A task that contains a single non-idempotent operation that can fail but cannot safely be rerun on failure, such as a billing operation.
  3. A failure that requires manual intervention to resolve.

In these cases, even though retries is set to a non-zero number (meaning the task would ordinarily retry), Hatchet will not retry.
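A sketch of the first case with the Python SDK (the exception is exposed in the Python SDK as NonRetryableException, though the exact name and import path may vary by SDK and version; billing_api is a hypothetical external client):

```python
from hatchet_sdk import Context, EmptyModel, Hatchet
from hatchet_sdk.exceptions import NonRetryableException  # name/location may vary by version

hatchet = Hatchet()

@hatchet.task(name="charge-customer", retries=3)
def charge_customer(input: EmptyModel, ctx: Context) -> None:
    resp = billing_api.charge(input)  # billing_api is hypothetical, for illustration
    if 400 <= resp.status_code < 500:
        # A 4XX response will not succeed on retry; fail immediately instead.
        raise NonRetryableException(f"billing rejected the request: {resp.status_code}")
```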

Python SDK Client Retry Behavior

The retry behavior described above applies to task execution inside Hatchet. The Python SDK also has separate, client-side retry behavior for certain REST and gRPC calls made by the SDK itself.

These two mechanisms are independent: task retries control whether Hatchet re-runs a task after it fails on a worker, while SDK client retries control whether the Python SDK retries its own API calls to Hatchet.

Default client retry behavior

By default, the Python SDK retries certain client calls with exponential backoff, with max_attempts defaulting to 5.

REST API calls

| Error Type | Retried by Default |
| --- | --- |
| HTTP 5xx (server errors) | Yes |
| HTTP 404 (not found) | Yes |
| HTTP 429 (too many requests) | No |
| HTTP 400, 401, 403, 409, 422 (client errors) | No |
| Transport errors (timeout, connection, TLS, protocol) | No |

gRPC calls

| Status Code | Retried |
| --- | --- |
| UNAVAILABLE, DEADLINE_EXCEEDED, INTERNAL | Yes |
| RESOURCE_EXHAUSTED, ABORTED, UNKNOWN | Yes |
| UNIMPLEMENTED, NOT_FOUND, INVALID_ARGUMENT | No |
| ALREADY_EXISTS, UNAUTHENTICATED, PERMISSION_DENIED | No |

REST 404 responses are retried by default because some REST reads can observe replication lag between the core database and the OLAP database.

Configuring Python SDK client retries

The Python SDK exposes client retry configuration through TenacityConfig, either directly in ClientConfig or via environment variables.

```python
import os

from hatchet_sdk import Hatchet
from hatchet_sdk.config import ClientConfig, HTTPMethod, TenacityConfig

hatchet = Hatchet(
    config=ClientConfig(
        token=os.environ["HATCHET_CLIENT_TOKEN"],
        tenacity=TenacityConfig(
            max_attempts=5,
            retry_429=False,
            retry_transport_errors=False,
            retry_transport_methods=[HTTPMethod.GET, HTTPMethod.DELETE],
        ),
    )
)
```

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| max_attempts | int | Maximum number of retry attempts. Set to 0 to disable retries. | 5 |
| retry_429 | bool | Enable retries for HTTP 429 Too Many Requests responses. | False |
| retry_transport_errors | bool | Enable retries for REST transport-level errors (timeout, connection, TLS). | False |
| retry_transport_methods | list[HTTPMethod] | HTTP methods to retry on transport errors when retry_transport_errors is enabled. | [GET, DELETE] |

You can also configure these via environment variables:

| Environment Variable | Description |
| --- | --- |
| HATCHET_CLIENT_TENACITY_MAX_ATTEMPTS | Maximum retry attempts |
| HATCHET_CLIENT_TENACITY_RETRY_429 | Enable 429 retries (true/false) |
| HATCHET_CLIENT_TENACITY_RETRY_TRANSPORT_ERRORS | Enable transport error retries (true/false) |
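
For example, a deployment might tighten client retries via the environment (the values here are illustrative):

```shell
export HATCHET_CLIENT_TENACITY_MAX_ATTEMPTS=3
export HATCHET_CLIENT_TENACITY_RETRY_429=true
export HATCHET_CLIENT_TENACITY_RETRY_TRANSPORT_ERRORS=false
```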

Idempotency considerations

⚠️ When retry_transport_errors is enabled, only idempotent HTTP methods (GET, DELETE) are retried by default. Non-idempotent methods (POST, PUT, PATCH) are excluded because retrying them after a transport error could result in duplicate operations if the original request succeeded but the response was lost.

You can add non-idempotent methods to retry_transport_methods, but only do so if:

  1. Your operations are idempotent (for example, because they use idempotency keys), or
  2. You understand and accept the risk of duplicate operations
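As a sketch, opting POST into transport-error retries would look like the following (HTTPMethod.POST is an assumption — the table above only shows GET and DELETE — so verify the member exists in your SDK version):

```python
from hatchet_sdk.config import ClientConfig, HTTPMethod, TenacityConfig

# Only safe if your POST handlers are idempotent (e.g. via idempotency keys),
# or you accept the risk of duplicate operations.
config = ClientConfig(
    tenacity=TenacityConfig(
        retry_transport_errors=True,
        retry_transport_methods=[HTTPMethod.GET, HTTPMethod.DELETE, HTTPMethod.POST],
    ),
)
```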

Retry timing

Python SDK client retries use exponential backoff with jitter. Fine-grained backoff timing is not currently configurable through TenacityConfig.
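The timing itself isn't configurable, but the shape of a full-jitter exponential schedule can be illustrated in plain Python (base and cap here are illustrative values, not TenacityConfig parameters):

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)].

    `base` and `cap` are illustrative, not the SDK's actual parameters.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# Delays grow exponentially on average but are randomized, which avoids
# synchronized retry storms across many clients.
for attempt in range(5):
    print(f"attempt {attempt}: up to {min(30.0, 0.5 * 2**attempt):.1f}s")
```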

Conclusion

Hatchet’s task-level retry feature is a simple and effective way to handle transient failures, improving the reliability and resilience of your workflows. By specifying the number of retries for each task, you can recover from temporary issues without writing complex error handling logic.

Remember to use retries judiciously and only for tasks that are idempotent and can safely be repeated. For failures against rate-limited or slowly recovering services, combine retries with the exponential backoff support described above, and use the NonRetryable exception to opt out of retries when a failure cannot safely be retried.