Parallel Runtime Observability Runbook
Operator runbook for diagnosing regressions in AngaraParallel.
Canonical source: this runbook in angarabook/src/operations/.
Goal
Quickly determine the source of a QPS drop or latency increase without deep debugging in code:
- planner/plan shape;
- runtime/scheduler pressure;
- storage/IO contention.
Fast triage
- Compare bench metrics and server metrics in the same time window.
- Check QPS, p95/p99 latency, queue depth, lock waits, and error rate.
- Classify the issue: planner vs runtime vs storage.
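The classification step can be sketched as a small rule-based helper. This is only an illustration: the `Snapshot` fields, the 1.5x threshold, and the mapping of lock waits to the runtime bucket are assumptions, not part of AngaraParallel, and should be adjusted to the signals actually collected.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics for one time window. Field names are illustrative."""
    qps: float
    p99_ms: float
    queue_depth: float
    lock_wait_ms: float
    io_read_ms: float
    plan_changed: bool = False  # True if the plan shape differs from baseline


def ratio(after: float, before: float) -> float:
    """Growth factor, guarding against a zero baseline."""
    if before <= 0:
        return float("inf") if after > 0 else 1.0
    return after / before


def classify(base: Snapshot, cur: Snapshot, factor: float = 1.5) -> str:
    """Return a first-guess bucket for the regression.

    The threshold `factor` is an assumed starting point, not a tuned value.
    """
    regressed = cur.qps < base.qps / factor or cur.p99_ms > base.p99_ms * factor
    if not regressed:
        return "no regression"
    if cur.plan_changed:
        return "planner"
    if ratio(cur.io_read_ms, base.io_read_ms) >= factor:
        return "storage"
    # Assumption: queue depth and lock waits indicate runtime/scheduler pressure.
    if (ratio(cur.queue_depth, base.queue_depth) >= factor
            or ratio(cur.lock_wait_ms, base.lock_wait_ms) >= factor):
        return "runtime"
    return "unclear"


baseline = Snapshot(qps=1000, p99_ms=20, queue_depth=2, lock_wait_ms=1, io_read_ms=0.5)
regression = Snapshot(qps=600, p99_ms=45, queue_depth=9, lock_wait_ms=1, io_read_ms=0.5)
print(classify(baseline, regression))
```

The result is a starting hypothesis for the incident playbook, not a verdict; an "unclear" outcome means more signals are needed.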
Required signals
- USDT:
  - probe_parallel_query_start
  - probe_morsel_dispatched
  - probe_morsel_completed
- Prometheus minimum:
  - angarabase_storage_io_read_duration_ms_*
  - angarabase_storage_io_write_duration_ms_*
  - angarabase_pgwire_pool_queue_depth
  - angarabase_lock_wait_duration_ms_*
  - angarabase_slow_query_total
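Before starting triage, it can help to confirm that the minimum Prometheus signals are actually exported. The sketch below checks a scraped metrics payload (Prometheus text exposition format) against the required names, treating a trailing `*` as a wildcard. How the payload is fetched is left out, and the sample suffixes such as `_bucket` and `_sum` are assumptions about how the duration metrics are exposed.

```python
from fnmatch import fnmatchcase

# Required metric patterns, copied from the list above.
REQUIRED = [
    "angarabase_storage_io_read_duration_ms_*",
    "angarabase_storage_io_write_duration_ms_*",
    "angarabase_pgwire_pool_queue_depth",
    "angarabase_lock_wait_duration_ms_*",
    "angarabase_slow_query_total",
]


def metric_names(exposition: str) -> set[str]:
    """Extract metric names from Prometheus text exposition lines."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: name{labels} value [timestamp]
        tokens = line.split("{", 1)[0].split()
        if tokens:
            names.add(tokens[0])
    return names


def missing_signals(exposition: str, required=REQUIRED) -> list[str]:
    """Return the required patterns that match no exported metric."""
    names = metric_names(exposition)
    return [p for p in required if not any(fnmatchcase(n, p) for n in names)]


# Hypothetical scrape output for illustration only.
sample = """\
# TYPE angarabase_storage_io_read_duration_ms histogram
angarabase_storage_io_read_duration_ms_bucket{le="1"} 10
angarabase_storage_io_write_duration_ms_sum 12.5
angarabase_pgwire_pool_queue_depth 3
angarabase_slow_query_total 7
"""
print(missing_signals(sample))
```

An empty result means every required signal has at least one exported series; any pattern returned should be fixed before relying on the triage steps.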
Incident playbook
- Capture baseline and regression run on the same profile.
- Collect EXPLAIN ANALYZE for slow queries.
- Verify that the expected parallel path is used: workers_planned, workers_launched, Vector* operators, and reason_codes.
- Correlate dispatch/completion with tail latency.
- Check memory guardrails: confirm that queries degrade under memory pressure instead of hard-failing.
- Record a short report: impact, suspect component, next action.
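The parallel-path verification step can be sketched as a small check over a parsed plan. This assumes the EXPLAIN ANALYZE output has already been converted into a dictionary; the dictionary layout, the `operators` key, and the example operator and reason-code names are hypothetical, and only `workers_planned`, `workers_launched`, `Vector*`, and `reason_codes` come from the playbook.

```python
def check_parallel_path(plan: dict) -> list[str]:
    """Return human-readable findings; an empty list means nothing suspicious.

    `plan` is an assumed, pre-parsed representation of EXPLAIN ANALYZE output.
    """
    findings = []

    planned = plan.get("workers_planned", 0)
    launched = plan.get("workers_launched", 0)
    if planned == 0:
        findings.append("no parallel workers planned")
    elif launched < planned:
        findings.append(f"only {launched} of {planned} planned workers launched")

    # Vector* operators indicate the vectorized parallel path was chosen.
    operators = plan.get("operators", [])
    if not any(op.startswith("Vector") for op in operators):
        findings.append("no Vector* operators in plan")

    # Any reason codes are surfaced verbatim for the incident report.
    reasons = plan.get("reason_codes", [])
    if reasons:
        findings.append("reason_codes: " + ", ".join(reasons))

    return findings


# Hypothetical plan for illustration only.
example_plan = {
    "workers_planned": 4,
    "workers_launched": 2,
    "operators": ["VectorScan", "HashAggregate"],
    "reason_codes": ["memory_guardrail"],
}
for finding in check_parallel_path(example_plan):
    print(finding)
```

The findings map directly onto the short report: each line is a candidate for the "suspect component" entry.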
Next
- How to read query plans: detailed explanation of workers_planned, workers_launched, Vector*, and optimizer diagnostics.
- Performance tuning guide: general approaches to tuning for parallelism.
- Observability metrics checklist: general metrics that include parallel counters.