Runbook: HighP99Latency

Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7.

What It Means

P99 query latency exceeds 100 ms for 5 minutes. Metric: histogram_quantile(0.99, rate(angarabase_query_exec_duration_ms_bucket[5m])).
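To make the alert expression concrete, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general approach histogram_quantile takes. The bucket bounds and counts are invented for illustration and are not taken from AngaraBase; Prometheus applies the same arithmetic to per-second bucket rates rather than raw counts.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound_ms, cumulative_count) pairs, ending with the
    +Inf bucket (math.inf). Values are assumed to be non-negative.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical bucket data (upper bound in ms, cumulative count).
sample = [(10, 900), (50, 980), (100, 995), (250, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample)  # about 83.3 ms, below the 100 ms threshold
alert_condition = p99 > 100
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the buckets are spaced around the 100 ms threshold.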

Severity

warning. This alert signals degraded UX, not an outage.

Initial response

  1. Open the Grafana dashboard AngaraBase Overview v2 → row “Query Performance”.
  2. Compare P99 with P50 and P95. If all three rose together, the problem is global (CPU, I/O, or lock contention); if only P99 rose, it is a tail-latency problem (GC, fsync stall, or a single slow query).
  3. Check the rate of slow_query_total to see whether the number of slow queries is growing.
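The comparison in step 2 can be sketched as a small helper. This is only an illustration: the function name, the dictionary layout, and the 1.5x "rose" factor are assumptions, not values defined by this runbook or the dashboard.

```python
def classify_latency_shift(baseline, current, factor=1.5):
    """Classify a latency regression from percentile snapshots.

    baseline, current: dicts with keys 'p50', 'p95', 'p99' (milliseconds).
    factor: hypothetical multiplier above which a percentile counts as "rose".
    """
    rose = {k: current[k] > baseline[k] * factor for k in ("p50", "p95", "p99")}
    if all(rose.values()):
        return "global"  # CPU, I/O, or lock contention affects every query
    if rose["p99"] and not rose["p50"] and not rose["p95"]:
        return "tail"    # GC, fsync stall, or a single slow query
    return "mixed"       # no clear pattern; keep investigating

baseline = {"p50": 5, "p95": 20, "p99": 40}
print(classify_latency_shift(baseline, {"p50": 12, "p95": 60, "p99": 150}))  # global
print(classify_latency_shift(baseline, {"p50": 5, "p95": 22, "p99": 150}))   # tail
```

A "mixed" result, such as P95 and P99 rising while P50 stays flat, is deliberately kept separate so that it is not forced into either bucket.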

Diagnostics

# Slow-query counter and latency histogram buckets
curl -sf http://127.0.0.1:9898/metrics | rg slow_query_total
curl -sf http://127.0.0.1:9898/metrics | rg query_exec_duration_ms_bucket

# Active long-running transactions
psql -c "SELECT pid, age(now(), xact_start), query FROM angara_stat_activity \
         WHERE state = 'active' ORDER BY xact_start LIMIT 10;"

Cross-check with other signals: BufferPoolPressure, WALFsyncSlow, LongTransaction.
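To judge whether slow queries are growing, two scrapes of the /metrics endpoint can be compared. The sketch below parses the Prometheus text format and computes a per-second rate. The full metric name angarabase_slow_query_total, the db label, and the sample values are assumptions for illustration (the commands above only match the substring slow_query_total), and the parser assumes sample lines carry no trailing timestamp.

```python
def sum_metric(text, suffix):
    """Sum all samples whose metric name ends with `suffix`.

    text: Prometheus text exposition output, without per-sample timestamps.
    """
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name.endswith(suffix):
            total += float(value)
    return total

def per_second_rate(first, second, interval_s):
    """Counter increase per second between two scrapes (no reset handling)."""
    return (second - first) / interval_s

# Hypothetical scrapes taken 60 seconds apart.
scrape_1 = """# TYPE angarabase_slow_query_total counter
angarabase_slow_query_total{db="a"} 120
angarabase_slow_query_total{db="b"} 30
"""
scrape_2 = """# TYPE angarabase_slow_query_total counter
angarabase_slow_query_total{db="a"} 150
angarabase_slow_query_total{db="b"} 36
"""
rate = per_second_rate(sum_metric(scrape_1, "slow_query_total"),
                       sum_metric(scrape_2, "slow_query_total"), 60)
print(f"{rate:.2f} slow queries/s")  # 0.60 slow queries/s
```

A counter reset, for example after a restart, would yield a negative rate here; treating that case properly is left out of the sketch.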

Mitigation

  • Optimization plan: see performance-tuning.md.
  • ANALYZE on hot tables.
  • Indexes: check whether angarabase_index_routing_legacy_total > 0; if so, drop and recreate the affected index with DROP INDEX followed by CREATE INDEX (see index-routing-legacy-fallback).
  • Buffer pool: if the hit ratio is below 90%, increase buffer_pool_pages.
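The buffer pool check can be written as simple arithmetic. This is a minimal sketch: the hit and miss counts are hypothetical inputs, since this runbook does not name the counters that expose them, and only the 90% threshold comes from the text above.

```python
def buffer_pool_hit_ratio(hits, misses):
    """Fraction of page requests served from the buffer pool."""
    total = hits + misses
    # With no traffic there is nothing to miss, so report a perfect ratio.
    return hits / total if total else 1.0

def needs_more_buffer_pool(hits, misses, threshold=0.90):
    """True when the hit ratio is below the runbook's 90% threshold."""
    return buffer_pool_hit_ratio(hits, misses) < threshold

print(needs_more_buffer_pool(850, 150))  # True: ratio is 0.85
print(needs_more_buffer_pool(950, 50))   # False: ratio is 0.95
```

Computing the ratio from counter deltas over a recent window, rather than from lifetime totals, usually reflects the current workload better.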

Escalation

If latency has not decreased 30 minutes after applying the standard actions, collect a diagnostics bundle and escalate.