Runbook: HighP99Latency
Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7.
What It Means
P99 query latency exceeds 100 ms for 5 minutes.
Metric: histogram_quantile(0.99, rate(angarabase_query_exec_duration_ms_bucket[5m])).
Severity
warning. Signal of degraded UX, not an outage.
Initial response
- Open the Grafana dashboard AngaraBase Overview v2 → row “Query Performance”.
- Compare with P50 and P95. If all three rose together, the problem is global (CPU, IO, or lock contention); if only P99 rose, it is tail latency (GC, fsync stall, or a single slow query).
- Check the slow_query_total rate to see whether the number of slow queries is growing.
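When Grafana is unavailable, the P50/P95/P99 comparison can also be done against Prometheus directly. The following is a minimal sketch, not part of the alert tooling: it assumes a Prometheus server at PROM_URL (the default http://127.0.0.1:9090 is a guess; the runbook does not name one), and the helper names quantile_query and show_quantiles are made up for illustration. Only the metric name and the 5m window come from the alert definition.

```shell
#!/usr/bin/env bash
# Assumption: Prometheus address. The runbook does not specify one.
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

# Build the PromQL expression for one quantile (hypothetical helper).
# Metric name and 5m window are copied from the alert's metric expression.
quantile_query() {
  printf 'histogram_quantile(%s, rate(angarabase_query_exec_duration_ms_bucket[5m]))' "$1"
}

# Query P50, P95 and P99 through the Prometheus instant-query API
# and print the raw JSON response for each.
show_quantiles() {
  local q
  for q in 0.50 0.95 0.99; do
    printf 'P%s: ' "${q#0.}"
    curl -sfG "${PROM_URL}/api/v1/query" \
      --data-urlencode "query=$(quantile_query "$q")"
    echo
  done
}
```

Source the file and run show_quantiles; if the three values moved together, treat the issue as global, otherwise as tail latency.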
Diagnostics
# Slow-query counter and latency histogram buckets
curl -sf http://127.0.0.1:9898/metrics | rg slow_query_total
curl -sf http://127.0.0.1:9898/metrics | rg query_exec_duration_ms_bucket
# Active long-running transactions
psql -c "SELECT pid, age(now(), xact_start), query FROM angara_stat_activity \
WHERE state = 'active' ORDER BY xact_start LIMIT 10;"
Cross-check with other signals: BufferPoolPressure, WALFsyncSlow, LongTransaction.
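A single scrape of slow_query_total shows only the cumulative count, so judging whether slow queries are growing takes two readings. The sketch below samples the metrics endpoint used above twice and prints a per-second rate. The helper names (sum_metric, counter_rate, slow_query_rate) and the 30-second default interval are illustrative assumptions, and the parser assumes sample lines carry no trailing timestamp.

```shell
#!/usr/bin/env bash
# Sum the values of all series whose name matches the pattern.
# Reads Prometheus text exposition format from stdin.
# Assumption: sample lines have no trailing timestamp, so the value is the last field.
sum_metric() {
  awk -v pat="$1" '
    /^#/     { next }            # skip HELP/TYPE comment lines
    $1 ~ pat { total += $NF }    # accumulate matching sample values
    END      { printf "%.0f\n", total }
  '
}

# Per-second rate between two counter readings taken dt seconds apart.
counter_rate() {
  awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN { printf "%.2f\n", (b - a) / dt }'
}

# Scrape twice, `interval` seconds apart, and print slow queries per second.
slow_query_rate() {
  local interval="${1:-30}" before after
  before=$(curl -sf http://127.0.0.1:9898/metrics | sum_metric slow_query_total)
  sleep "$interval"
  after=$(curl -sf http://127.0.0.1:9898/metrics | sum_metric slow_query_total)
  counter_rate "$before" "$after" "$interval"
}
```

A rate that keeps rising across consecutive runs of slow_query_rate suggests a growing slow-query population rather than a single outlier.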
Mitigation
- Optimization plan: see performance-tuning.md.
- ANALYZE on hot tables.
- Indexes: check angarabase_index_routing_legacy_total > 0. If so, run DROP+CREATE INDEX (see index-routing-legacy-fallback).
- Buffer pool: hit ratio < 90% → increase buffer_pool_pages.
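The 90% buffer pool threshold can be checked with simple arithmetic: hit ratio = hits / (hits + misses). The sketch below takes the two counts as arguments because the runbook does not name the underlying metrics; hit_ratio_check is a hypothetical helper, and how the hit and miss counts are obtained depends on the deployment.

```shell
#!/usr/bin/env bash
# Print the buffer pool hit ratio in percent and compare it with the 90% threshold.
# Arguments: hits misses (counts over the same time window).
# Assumption: the caller supplies the counts; no metric names are implied here.
hit_ratio_check() {
  awk -v h="$1" -v m="$2" 'BEGIN {
    if (h + m == 0) { print "no data"; exit }   # avoid division by zero
    r = 100 * h / (h + m)
    printf "%.1f%% %s\n", r, (r < 90 ? "LOW" : "OK")
  }'
}
```

For example, hit_ratio_check 850 150 reports 85.0% LOW, which is the case where increasing buffer_pool_pages is worth considering.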
Escalation
If latency stays elevated for more than 30 minutes after the standard actions above, collect a diagnostics bundle and escalate.