Runbook: `DeadlockSpike`

Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7, RM-0.6.4.4 (SSI).

What It Means

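rate(angarabase_deadlock_detected_total[1m]) > 1: the deadlock detector is triggering far more often than normal. Note that PromQL rate() returns a per-second average, so as written the expression fires above one deadlock per second averaged over the last minute, not one per minute; confirm the intended threshold against the alert rule file. One or two deadlocks per hour is normal; a spike points to a problematic workload pattern.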

For SERIALIZABLE transactions: a spike in 40001 errors (serialization_failure) can look like a deadlock spike in application logs, but has a different cause (rw anti-dependencies). See the angarabase_ssi_aborts_total metric.
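To tell the two failure modes apart in a log excerpt, a minimal sketch like the one below counts each kind separately. It assumes deadlock aborts are logged with the text `deadlock detected` (the wording this runbook quotes from the server log) and that serialization failures contain `40001` or `serialization_failure`. The function name `classify_aborts` is hypothetical, and the real log wording may differ.

```shell
# classify_aborts: read log lines on stdin and count deadlock aborts
# versus serialization failures (hypothetical helper).
# Assumption: the match strings below reflect the actual log wording.
classify_aborts() {
  awk '
    {
      line = tolower($0)                     # case-insensitive matching
      if (line ~ /deadlock detected/) deadlocks++
      else if (line ~ /serialization_failure|40001/) ssi++
    }
    END { printf "deadlocks=%d ssi_aborts=%d\n", deadlocks, ssi }
  '
}

# Usage:
#   journalctl -u angarabase-server -n 500 | classify_aborts
```

If `ssi_aborts` dominates, the angarabase_ssi_aborts_total metric is the better starting point than the deadlock counter.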

Severity

critical. Every deadlock aborts a transaction, so client work may be lost. For SSI: 40001 is expected behavior, but a high rate requires contention analysis.

Initial response

Open Grafana Overview v2 → row “Locks”.
Check which tables participate in the spike (see server log messages deadlock detected: ...).
Correlate the spike with recent deploys or migrations: did a new workload appear?
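For the correlation step, it can help to bucket deadlock log lines by minute and compare the result with the deploy timeline. The sketch below assumes the log is read with `journalctl -o short-iso`, so that each line starts with an ISO timestamp, and that the message contains `deadlock detected`. The function name `deadlocks_per_minute` is hypothetical.

```shell
# deadlocks_per_minute: read `journalctl -o short-iso` output on stdin and
# print how many "deadlock detected" lines fall into each minute.
# Assumption: the first field is an ISO timestamp such as
# 2024-05-01T12:34:56+0000.
deadlocks_per_minute() {
  awk '
    tolower($0) ~ /deadlock detected/ {
      minute = substr($1, 1, 16)             # keep YYYY-MM-DDTHH:MM
      count[minute]++
    }
    END { for (m in count) print m, count[m] }
  ' | sort
}

# Usage:
#   journalctl -u angarabase-server --since "1 hour ago" -o short-iso \
#     | deadlocks_per_minute
```

A sharp step in the per-minute counts right after a deploy or migration timestamp is a strong hint that the new workload is responsible.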

Diagnostics

curl -sf http://127.0.0.1:9898/metrics | rg -e 'lock_|deadlock'
journalctl -u angarabase-server -n 500 | rg -i 'deadlock'

# Active locks (if a compatible view exists)
psql -c "SELECT * FROM angara_stat_locks WHERE granted = false;"
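To put a number on the spike without Grafana, two scrapes of the counter can be compared by hand. This is a rough sketch: the helper names `counter_total` and `per_minute_rate` are hypothetical, and it assumes the endpoint emits plain Prometheus text exposition lines without trailing timestamps.

```shell
# counter_total NAME: sum all samples of one counter read from stdin in
# Prometheus text exposition format (all label sets are added together).
# Assumption: no trailing timestamp, so the value is the last field.
counter_total() {
  awk -v name="$1" '
    /^#/ { next }                            # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)               # strip the label set, if any
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}

# per_minute_rate BEFORE AFTER SECONDS: counter increase scaled to one minute.
per_minute_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) * 60 / s }'
}

# Usage:
#   m=angarabase_deadlock_detected_total
#   before=$(curl -sf http://127.0.0.1:9898/metrics | counter_total "$m")
#   sleep 60
#   after=$(curl -sf http://127.0.0.1:9898/metrics | counter_total "$m")
#   per_minute_rate "$before" "$after" 60
```

The result is deadlocks per minute over the sampling window, which is easier to compare with the "one or two per hour" baseline than a raw counter value.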

Mitigation

Cause	Action
Different lock acquisition orders	Standardize the order (UPDATE by PK ASC) in client code
Long-running txn holds a lock	See LongTransaction
Hot row contention	Shard the counter; use a sequence instead of UPDATE
Specific code needs updating	Roll back deploy, fix, redeploy
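The first row, a standardized lock order, can be illustrated with a small sketch that always emits row updates in ascending primary-key order, regardless of the order in which the caller supplies the keys. The table `accounts`, the column `balance`, and the function name `updates_in_pk_order` are hypothetical, and the SQL is assumed to be accepted by the server as written.

```shell
# updates_in_pk_order ID...: print one UPDATE per primary key, sorted
# ascending and de-duplicated, wrapped in a single transaction.
# Two sessions using this helper take row locks in the same order, so
# neither can end up waiting on a row the other locked earlier.
updates_in_pk_order() {
  [ "$#" -gt 0 ] || return 1                 # nothing to update
  echo 'BEGIN;'
  printf '%s\n' "$@" | sort -n -u | while read -r id; do
    # %d rejects non-numeric input instead of splicing it into the SQL
    printf 'UPDATE accounts SET balance = balance - 1 WHERE id = %d;\n' "$id"
  done
  echo 'COMMIT;'
}

# Usage:
#   updates_in_pk_order 42 7 19 | psql
```

The same idea applies in any client language: sort the keys before issuing the statements, so the ordering decision lives in one place instead of in every call site.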

Escalation

If the spike has not subsided after 15 minutes, treat it as blocking business operations and escalate urgently.
