Runbook: DeadlockSpike
Source of truth:
tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7, RM-0.6.4.4 (SSI).
What It Means
rate(angarabase_deadlock_detected_total[1m]) > 1: averaged over the last minute,
the deadlock detector fired more than once per second (assuming PromQL semantics,
where rate() returns a per-second average). One or two deadlocks per hour is normal;
a spike points to a problematic workload pattern.
For SERIALIZABLE transactions: a spike in 40001 errors (serialization_failure)
can look like a deadlock spike in application logs, but has a different cause
(rw anti-dependencies). See the angarabase_ssi_aborts_total metric.
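To tell the two failure modes apart in application logs, it can help to count them separately. The sketch below is a minimal example, not part of the product tooling: the function name sqlstate_counts is hypothetical, 40001 is the code named above, and 40P01 is PostgreSQL's deadlock_detected code, which is only assumed to apply here.

```shell
# Count serialization failures and deadlocks in log text read from stdin.
# 40001 is taken from this runbook; 40P01 is PostgreSQL's deadlock_detected
# SQLSTATE and is an ASSUMPTION for AngaraBase.
sqlstate_counts() {
  awk '
    # Match the code only as a standalone token, so that longer numbers
    # such as request IDs do not produce false hits.
    /(^|[^0-9A-Za-z])40001([^0-9A-Za-z]|$)/ { ser++ }
    /(^|[^0-9A-Za-z])40P01([^0-9A-Za-z]|$)/ { dl++ }
    END { printf "serialization_failure=%d deadlock_detected=%d\n", ser, dl }
  '
}

# Usage (app.log is a hypothetical application log path):
#   sqlstate_counts < app.log
```

If serialization_failure dominates, the angarabase_ssi_aborts_total metric is the better place to continue the investigation.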
Severity
critical. Each deadlock aborts at least one transaction, so client work may be lost.
For SSI: 40001 is expected behavior, but a high rate requires contention analysis.
Initial response
- Grafana Overview v2 → row “Locks”.
- Check which tables participate in the spike (see server log messages
deadlock detected: ...).
- Correlate with a recent deploy or migration: is there a new workload?
Diagnostics
curl -sf http://127.0.0.1:9898/metrics | rg 'lock_|deadlock'
journalctl -u angarabase-server -n 500 | rg -i 'deadlock'
# Waiting (not yet granted) lock requests, if a compatible view exists
psql -c "SELECT * FROM angara_stat_locks WHERE granted = false;"
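
To check whether the spike is still ongoing, the deadlock counter can be sampled twice and the readings compared. The following is a rough sketch under assumptions: the helper names counter_value and counter_delta are hypothetical, and the parser assumes Prometheus text format without trailing timestamps.

```shell
# Sum all samples of one counter from Prometheus text-format input on stdin.
# ASSUMPTION: lines carry no trailing timestamp, so the value is the last field.
counter_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)          # drop the label set, if present
      if (metric == name) sum += $NF
    }
    END { printf "%d\n", sum }
  '
}

# Number of new events between two integer readings of the same counter.
counter_delta() {
  if [ "$2" -lt "$1" ]; then
    echo "$2"                          # counter reset (e.g. restart): count from zero
  else
    echo $(( $2 - $1 ))
  fi
}

# Usage against the metrics endpoint from this runbook:
#   m=angarabase_deadlock_detected_total
#   before=$(curl -sf http://127.0.0.1:9898/metrics | counter_value "$m")
#   sleep 60
#   after=$(curl -sf http://127.0.0.1:9898/metrics | counter_value "$m")
#   echo "deadlocks in the last 60 s: $(counter_delta "$before" "$after")"
```

A delta that stays near zero across several samples suggests the spike has already subsided.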
Mitigation
| Cause | Action |
|---|---|
| Different lock acquisition orders | Standardize the order (UPDATE by PK ASC) in client code |
| Long-running txn holds a lock | See LongTransaction |
| Hot row contention | Shard the counter; use a sequence instead of UPDATE |
| Specific code needs updating | Roll back deploy, fix, redeploy |
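
The first row of the table (a consistent lock acquisition order) can be illustrated with a small generator that always emits updates in ascending primary-key order. This is only a sketch: the function name ordered_updates, the accounts table, and the id and version columns are hypothetical, and real client code would apply the same sorting in its own language before issuing the statements.

```shell
# Emit one transaction that updates the given rows in ascending primary-key
# order. Two clients touching the same set of rows then acquire row locks in
# the same sequence, so they cannot end up waiting on each other in a cycle.
# ASSUMPTION: integer primary key "id" and a "version" column (both hypothetical).
ordered_updates() {
  [ "$#" -ge 2 ] || { echo "usage: ordered_updates TABLE ID..." >&2; return 1; }
  local table="$1"
  shift
  echo "BEGIN;"
  # sort -n orders the IDs numerically; -u drops duplicates.
  printf '%s\n' "$@" | sort -n -u | while read -r id; do
    echo "UPDATE ${table} SET version = version + 1 WHERE id = ${id};"
  done
  echo "COMMIT;"
}

# Usage: the caller passes IDs in any order; the output is always sorted.
#   ordered_updates accounts 42 7 19 | psql
```

The design choice here is to sort once at the point where the statements are built, rather than relying on every caller to pass IDs in the right order.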
Escalation
If the spike persists for more than 15 minutes, treat it as blocking business operations and escalate urgently.