Parallel Runtime Observability Runbook

Operator runbook for diagnosing regressions in AngaraParallel. Canonical source: this runbook in angarabook/src/operations/.

Goal

Quickly determine, without deep debugging in code, which of the following is the source of a QPS drop or latency increase:

  • planner/plan shape;
  • runtime/scheduler pressure;
  • storage/IO contention.

Fast triage

  1. Compare bench metrics and server metrics in the same time window.
  2. Check QPS, p95/p99 latency, queue depth, lock waits, and error rate.
  3. Classify the issue: planner vs runtime vs storage.
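The classification step can be sketched as a small heuristic over a baseline window and a regression window. This is only an illustration: the `WindowMetrics` fields, the 1.5x growth threshold, and the choice to count lock waits as runtime pressure are assumptions, not part of AngaraParallel.

```python
from dataclasses import dataclass


@dataclass
class WindowMetrics:
    """Aggregates for one time window. Field names are illustrative only."""
    qps: float
    p99_ms: float
    queue_depth: float
    lock_wait_ms: float
    io_read_ms: float
    io_write_ms: float
    plan_changed: bool = False  # plan shape differs from the baseline run


def ratio(current: float, baseline: float) -> float:
    """Relative growth; a zero baseline counts as growth only if current > 0."""
    if baseline <= 0:
        return float("inf") if current > 0 else 1.0
    return current / baseline


def classify(baseline: WindowMetrics, current: WindowMetrics,
             threshold: float = 1.5) -> str:
    """Return 'planner', 'storage', 'runtime', or 'unclassified'."""
    # A changed plan shape is checked first: it can explain every other symptom.
    if current.plan_changed:
        return "planner"
    io_growth = max(ratio(current.io_read_ms, baseline.io_read_ms),
                    ratio(current.io_write_ms, baseline.io_write_ms))
    # Assumption: queue depth and lock waits both indicate runtime pressure.
    runtime_growth = max(ratio(current.queue_depth, baseline.queue_depth),
                         ratio(current.lock_wait_ms, baseline.lock_wait_ms))
    if io_growth >= threshold and io_growth >= runtime_growth:
        return "storage"
    if runtime_growth >= threshold:
        return "runtime"
    return "unclassified"
```

An "unclassified" result means that no signal group grew past the threshold, which usually calls for a wider time window or additional signals rather than a guess.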

Required signals

  • USDT:
      • probe_parallel_query_start
      • probe_morsel_dispatched
      • probe_morsel_completed
  • Prometheus minimum:
      • angarabase_storage_io_read_duration_ms_*
      • angarabase_storage_io_write_duration_ms_*
      • angarabase_pgwire_pool_queue_depth
      • angarabase_lock_wait_duration_ms_*
      • angarabase_slow_query_total
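Before starting an investigation it helps to confirm that the Prometheus minimum is actually exported. The sketch below scans text in the Prometheus exposition format and reports which required metrics are absent. It assumes that the `_*` suffix in the list above stands for derived series such as `_bucket`, `_sum`, and `_count`, and that the scrape output has already been fetched as a string; how it is fetched is left open.

```python
REQUIRED_PREFIXES = [
    "angarabase_storage_io_read_duration_ms_",
    "angarabase_storage_io_write_duration_ms_",
    "angarabase_pgwire_pool_queue_depth",
    "angarabase_lock_wait_duration_ms_",
    "angarabase_slow_query_total",
]


def metric_names(exposition: str) -> set[str]:
    """Extract sample names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample is 'name{labels} value' or 'name value'.
        name = line.split("{", 1)[0].split(None, 1)[0]
        names.add(name)
    return names


def missing_signals(exposition: str) -> list[str]:
    """Return the required prefixes that no exported metric matches."""
    names = metric_names(exposition)
    return [p for p in REQUIRED_PREFIXES
            if not any(n.startswith(p) for n in names)]
```

An empty result means all five metric families are present; any returned prefix should be fixed in the exporter configuration before the triage data is trusted.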

Incident playbook

  1. Capture baseline and regression run on the same profile.
  2. Collect EXPLAIN ANALYZE for slow queries.
  3. Verify that the expected parallel path is used: workers_planned, workers_launched, Vector* operators, and reason_codes.
  4. Correlate dispatch/completion with tail latency.
  5. Check the memory guardrails: under memory pressure the query should degrade gracefully rather than hard-fail.
  6. Record a short report: impact, suspect component, next action.
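Step 3 of the playbook can be sketched as a check over a summary of the EXPLAIN ANALYZE output. The dictionary layout used here (keys `workers_planned`, `workers_launched`, `operators`, `reason_codes`) is a hypothetical representation chosen for illustration; the real output format is not specified in this runbook and would need its own parsing step.

```python
def check_parallel_path(plan: dict) -> list[str]:
    """Return human-readable findings; an empty list means the plan looks parallel."""
    findings = []
    planned = plan.get("workers_planned", 0)
    launched = plan.get("workers_launched", 0)
    if planned == 0:
        findings.append("no parallel workers planned")
    elif launched < planned:
        findings.append(f"only {launched} of {planned} planned workers launched")

    # Operator names are matched by the 'Vector' prefix only.
    operators = plan.get("operators", [])
    if not any(op.startswith("Vector") for op in operators):
        findings.append("no Vector* operators in plan")

    # Surface every reason code verbatim so that it can go into the report.
    for code in plan.get("reason_codes", []):
        findings.append(f"reason_code: {code}")
    return findings
```

The returned findings can be pasted directly into the short report from step 6 as evidence for the suspect component.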

Next