Worker Health Checks
The Python SDK can expose a health check endpoint that you can ping to check the status of your worker.
Usage
First, set the HATCHET_CLIENT_WORKER_HEALTHCHECK_ENABLED environment variable to True. Once that flag is set, two health check endpoints will be available (on port 8001 by default):
/health - Returns 200 when the worker listener is healthy, otherwise 503, with body {"status":"HEALTHY"} or {"status":"UNHEALTHY"}.
/metrics - A metrics endpoint intended to be used by a monitoring system like Prometheus.
Custom Port
You can set a custom port with the HATCHET_CLIENT_WORKER_HEALTHCHECK_PORT environment variable, e.g. HATCHET_CLIENT_WORKER_HEALTHCHECK_PORT=8002.
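As a rough illustration of probing the endpoint from Python, the sketch below builds the /health URL from the port variable and interprets the response. The helper names (health_url, is_healthy, check_worker) and the localhost host are hypothetical and not part of the SDK. It also assumes that a 200 response carries {"status":"HEALTHY"}.

```python
import json
import os
import urllib.error
import urllib.request

DEFAULT_PORT = 8001  # default port stated on this page


def health_url(env=None, host="localhost"):
    """Build the /health URL, honoring the custom port variable if set."""
    env = os.environ if env is None else env
    port = int(env.get("HATCHET_CLIENT_WORKER_HEALTHCHECK_PORT", DEFAULT_PORT))
    return f"http://{host}:{port}/health"


def is_healthy(status_code, body):
    """Assumption: healthy means HTTP 200 together with status HEALTHY."""
    try:
        status = json.loads(body).get("status")
    except (ValueError, AttributeError):  # not JSON, or not a JSON object
        return False
    return status_code == 200 and status == "HEALTHY"


def check_worker(timeout=2.0):
    """Probe the endpoint once; any connection failure counts as unhealthy."""
    try:
        with urllib.request.urlopen(health_url(), timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read())
    except urllib.error.HTTPError as err:  # e.g. 503 when unhealthy
        return is_healthy(err.code, err.read())
    except OSError:  # connection refused, timeout, DNS failure
        return False
```

Calling check_worker() against a running worker returns True or False, which makes it usable as the body of a container or orchestrator probe script.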
Event loop blocked threshold
If the event loop of the worker listener process is blocked for longer than a threshold, /health will return 503.
You can configure this threshold (in seconds) with:
HATCHET_CLIENT_WORKER_HEALTHCHECK_EVENT_LOOP_BLOCK_THRESHOLD_SECONDS (default: 5.0)
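To make "event loop blocked" concrete, here is a minimal asyncio sketch of one common way to measure event loop lag: schedule a short sleep and see how late it wakes up. This is not Hatchet's implementation, and the function names and the 0.1 second threshold are illustrative assumptions chosen so the demo runs quickly.

```python
import asyncio
import time


async def measure_lag(interval=0.05):
    """Sleep for `interval` and return how much later than expected we woke."""
    start = time.monotonic()
    await asyncio.sleep(interval)
    return max(0.0, time.monotonic() - start - interval)


async def demo(block_seconds=0.3, threshold=0.1):
    """Block the loop with a synchronous sleep and compare lag to a threshold."""
    probe = asyncio.create_task(measure_lag())
    await asyncio.sleep(0)     # yield once so the probe starts its timer
    time.sleep(block_seconds)  # synchronous call: the event loop cannot run
    lag = await probe
    return lag, lag > threshold


lag, exceeded = asyncio.run(demo())
print(f"lag={lag:.3f}s exceeded_threshold={exceeded}")
```

The synchronous time.sleep stands in for any blocking call made inside async code, such as a CPU-heavy loop or a blocking network client, which is the kind of work that would push the lag past the configured threshold.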
Example request to /health:
curl localhost:8001/health
{"status":"HEALTHY"}
Example request to /metrics:
curl localhost:8001/metrics
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 18782.0
python_gc_objects_collected_total{generation="1"} 4907.0
python_gc_objects_collected_total{generation="2"} 244.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 308.0
python_gc_collections_total{generation="1"} 27.0
python_gc_collections_total{generation="2"} 2.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="15",version="3.10.15"} 1.0
# HELP hatchet_worker_listener_health_my_worker Listener health (1 healthy, 0 unhealthy)
# TYPE hatchet_worker_listener_health_my_worker gauge
hatchet_worker_listener_health_my_worker 1.0
# HELP hatchet_worker_event_loop_lag_seconds_my_worker Event loop lag in seconds (listener process)
# TYPE hatchet_worker_event_loop_lag_seconds_my_worker gauge
hatchet_worker_event_loop_lag_seconds_my_worker 0.0
Example Prometheus Configuration for /metrics:
scrape_configs:
  - job_name: "hatchet"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:8001"]
Example Prometheus Query
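An example query to check whether the worker is healthy might look like the following. The or vector(0) fallback makes the query return 0 when the series is absent, for example when the worker cannot be scraped: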
(hatchet_worker_listener_health_my_worker{instance="localhost:8001", job="hatchet"}) or vector(0)
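Outside Prometheus, the same gauge can be read straight from the /metrics text output. The sketch below is a simplified parser for unlabeled metrics only, not a full implementation of the exposition format. The function names are hypothetical, and the metric name is copied from the sample output on this page. It falls back to 0 when the metric is missing, mirroring or vector(0).

```python
METRIC = "hatchet_worker_listener_health_my_worker"  # name copied from the sample output


def read_gauge(metrics_text, name):
    """Return the value of an unlabeled metric from Prometheus text output, or None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def listener_health(metrics_text):
    """Return the listener gauge, or 0.0 if absent (like `or vector(0)`)."""
    value = read_gauge(metrics_text, METRIC)
    return 0.0 if value is None else value


SAMPLE = """\
# HELP hatchet_worker_listener_health_my_worker Listener health (1 healthy, 0 unhealthy)
# TYPE hatchet_worker_listener_health_my_worker gauge
hatchet_worker_listener_health_my_worker 1.0
"""

print(listener_health(SAMPLE))
```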