Runbook: DeadlockSpike

Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7, RM-0.6.4.4 (SSI).

What It Means

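rate(angarabase_deadlock_detected_total[1m]) > 1 means the deadlock detector fired, on average, more than once per second over the last minute (PromQL rate() returns a per-second average, not a per-minute count). One or two deadlocks per hour is normal; a spike points to a problematic workload pattern.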
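The units of the threshold are easy to misread, so a small worked example may help. The sketch below approximates what rate() computes from raw counter scrapes. It is a simplification: it ignores counter resets and the window-edge extrapolation that Prometheus applies, and the scrape values and the per_second_rate helper are invented for illustration.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def per_second_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase of a counter across the samples.

    Simplified: ignores counter resets and the extrapolation that
    Prometheus applies inside rate().
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


# Hypothetical scrapes taken 15 s apart over one minute.
quiet = [(0, 100), (15, 100), (30, 101), (45, 101), (60, 102)]
spike = [(0, 100), (15, 120), (30, 141), (45, 160), (60, 190)]

THRESHOLD = 1.0  # the "> 1" in the alert expression

quiet_rate = per_second_rate(quiet)  # 2 deadlocks in 60 s
spike_rate = per_second_rate(spike)  # 90 deadlocks in 60 s
print(f"quiet: {quiet_rate:.3f}/s fires={quiet_rate > THRESHOLD}")
print(f"spike: {spike_rate:.3f}/s fires={spike_rate > THRESHOLD}")
```

In this sketch two deadlocks in a minute give about 0.033/s and stay below the threshold, while 90 in a minute give 1.5/s and cross it.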

For SERIALIZABLE transactions: a spike in 40001 errors (serialization_failure) can look like a deadlock spike in application logs, but has a different cause (rw anti-dependencies). See the angarabase_ssi_aborts_total metric.

Severity

critical. Deadlock = aborted transaction = potential loss of client work. For SSI: 40001 is expected behavior, but a high rate requires contention analysis.
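Because an aborted transaction only loses client work when the client does not retry it, a client-side retry wrapper is a common complement to this alert. The following is a minimal sketch, not the project's client API: TxnAborted and run_with_retry are hypothetical names, and treating 40P01 as the deadlock SQLSTATE is an assumption borrowed from PostgreSQL. Only 40001 is taken from this runbook.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# 40001 (serialization_failure) is named in this runbook. 40P01 is the
# PostgreSQL code for deadlock_detected and is an assumption here.
RETRYABLE_SQLSTATES = {"40001", "40P01"}


class TxnAborted(Exception):
    """Hypothetical driver error exposing the SQLSTATE of an aborted txn."""

    def __init__(self, sqlstate: str):
        super().__init__(f"transaction aborted, SQLSTATE {sqlstate}")
        self.sqlstate = sqlstate


def run_with_retry(txn: Callable[[], T], max_attempts: int = 5,
                   base_delay_s: float = 0.05,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run txn(), re-running the whole transaction on retryable aborts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except TxnAborted as exc:
            last_try = attempt == max_attempts
            if exc.sqlstate not in RETRYABLE_SQLSTATES or last_try:
                raise
            # Exponential backoff with full jitter, so that the retried
            # transactions do not collide again in lockstep.
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

The callable passed as txn should open, execute, and commit the full transaction, since a retry has to start from a fresh snapshot. Retries hide individual aborts from users but do not remove the contention, so the rate still needs analysis.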

Initial response

  1. Grafana Overview v2 → row “Locks”.
  2. Check which tables participate in the spike (see server log messages deadlock detected: ...).
  3. Correlate with recent deploy / migration — new workload?

Diagnostics

curl -sf http://127.0.0.1:9898/metrics | rg -e 'lock_|deadlock'
journalctl -u angarabase-server -n 500 | rg -i 'deadlock'

# Lock requests still waiting to be granted (if a compatible view exists)
psql -c "SELECT * FROM angara_stat_locks WHERE granted = false;"
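When the same filtering has to run inside a script rather than a shell pipeline, the metrics output can be parsed directly. The sketch below assumes the endpoint serves the standard Prometheus text exposition format. The lock_metrics helper, the HELP text, and the series angarabase_lock_waiters and angarabase_connections_active are made up for the example; only angarabase_deadlock_detected_total comes from this runbook.

```python
import re
from typing import Dict

# One sample line of the Prometheus text format:
# metric name, optional {labels}, then the value.
_SAMPLE_LINE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)")


def lock_metrics(exposition: str,
                 pattern: str = r"lock_|deadlock") -> Dict[str, float]:
    """Return {series: value} for samples whose metric name matches pattern."""
    wanted = re.compile(pattern)
    result: Dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        if wanted.search(name):
            result[name + (labels or "")] = float(value)
    return result


# Hypothetical scrape body; only the deadlock counter name is real.
SAMPLE = """\
# HELP angarabase_deadlock_detected_total Deadlocks detected.
# TYPE angarabase_deadlock_detected_total counter
angarabase_deadlock_detected_total 42
angarabase_lock_waiters{mode="exclusive"} 7
angarabase_connections_active 12
"""

print(lock_metrics(SAMPLE))
```

Against a live server the input string could come from urllib.request.urlopen("http://127.0.0.1:9898/metrics").read().decode(), using the same address as the curl command above.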

Mitigation

Cause | Action
--- | ---
Different lock acquisition orders | Standardize the order (UPDATE by PK ASC) in client code
Long-running txn holds a lock | See LongTransaction
Hot row contention | Shard the counter; use a sequence instead of UPDATE
Specific code needs updating | Roll back deploy, fix, redeploy
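The first mitigation, a standardized lock acquisition order, can be sketched as follows. This is an illustration under assumptions rather than code from the project: the accounts table, its id and balance columns, the %s placeholder style, and the ordered_updates helper are all hypothetical.

```python
from typing import Dict, List, Tuple

Statement = Tuple[str, Tuple[int, int]]  # (sql, (delta, primary_key))


def ordered_updates(deltas: Dict[int, int]) -> List[Statement]:
    """Build one UPDATE per row, always in ascending primary-key order."""
    # Hypothetical table and columns, used only for illustration.
    sql = "UPDATE accounts SET balance = balance + %s WHERE id = %s"
    return [(sql, (delta, pk)) for pk, delta in sorted(deltas.items())]


# Two opposite transfers touch the same rows, 7 and 3.
transfer_a = ordered_updates({7: -50, 3: +50})
transfer_b = ordered_updates({3: -20, 7: +20})

# Both transactions now lock row 3 before row 7, whatever order the
# caller listed them in.
print([params[1] for _, params in transfer_a])
print([params[1] for _, params in transfer_b])
```

If every transaction takes row locks in the same global order, one of two competing transactions simply waits for the other instead of each holding a lock the other needs, which removes this class of lock cycle.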

Escalation

If the spike persists for more than 15 minutes, treat it as blocking business operations and escalate urgently.
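The 15-minute rule can be checked mechanically against a series of rate samples. The sketch below assumes the samples are already per-second deadlock rates sorted by time. The should_escalate helper and its parameters are hypothetical, and the default threshold of 1.0 mirrors the alert expression.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, deadlocks_per_second)


def should_escalate(samples: Sequence[Sample], threshold: float = 1.0,
                    hold_s: float = 15 * 60) -> bool:
    """True if the rate has stayed above threshold for more than hold_s.

    samples must be sorted by timestamp. Only the most recent unbroken
    run of above-threshold samples counts; any dip resets the clock.
    """
    run_start: Optional[float] = None
    for ts, rate in samples:
        if rate > threshold:
            if run_start is None:
                run_start = ts
        else:
            run_start = None
    if run_start is None:
        return False
    return samples[-1][0] - run_start > hold_s


# Hypothetical one-minute samples covering 20 minutes.
sustained = [(60.0 * m, 2.0) for m in range(21)]
with_dip = [(60.0 * m, 0.5 if m == 10 else 2.0) for m in range(21)]

print(should_escalate(sustained))
print(should_escalate(with_dip))
```

The sustained series stays above the threshold for 20 minutes and qualifies, while the series with a dip at minute 10 has only been elevated for 9 minutes since the dip and does not.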