Runbook: LongTransaction
Source of truth:
tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7, RM-0.6.4.4 (SSI).
What It Means
angarabase_txn_oldest_snapshot_age_seconds > 300 — the oldest open transaction
has been alive for more than 5 minutes. This blocks MVCC GC and leads to bloat.
For SERIALIZABLE transactions: holding a transaction for a long time also blocks cleanup (GC) of SIREAD locks and the SSI conflict graph, which can increase false positive aborts (40001) for new transactions because of lock escalation.
Severity
warning. At 30+ minutes it becomes a real GC blocker.
For SSI workloads, it is critical for throughput because of aborts.
Initial response
- Grafana Overview v2 → row “GC / MVCC”.
- Find the transaction PID:
SELECT pid, age(now(), xact_start) AS age, state, query
FROM angara_stat_activity
WHERE state IN ('idle in transaction', 'active')
ORDER BY xact_start ASC LIMIT 5;
Diagnostics
curl -sf http://127.0.0.1:9898/metrics | rg -E 'txn_(oldest_snapshot|active|idle)'
curl -sf http://127.0.0.1:9898/metrics | rg gc_
Mitigation
| Cause | Action |
|---|---|
Client stuck in idle in transaction | Enable idle_in_transaction_session_timeout |
| Long analytical query | Move to a read replica; split into batches |
| Pgbouncer pool | Check server_idle_timeout, restart the pool |
| Application bug | Fix on the client side (transaction scope) |
Forced abort (last resort):
SELECT pg_terminate_backend(<pid>);
Escalation
If the transaction is older than 1 hour and blocks GC until bloat > 30%, consider terminate + escalation and document the incident.