Incident runbook: debug errors in 10 minutes

Goal

Quickly localize the cause of production degradation or errors: determine what is breaking (logs and tracing) and where the bottleneck is (USDT wait events and eBPF).

Prerequisites

  • angarabase-server is running.
  • Access to server logs.
  • Access to an SQL client (psql/pgwire).
  • For USDT/eBPF: Linux with eBPF support, bpftrace installed, and permissions (CAP_BPF/CAP_PERFMON or root).

Fast path (10 minutes)

Step 1 (0-2 min): record the symptom and affected sessions

Check active sessions and wait state:

SELECT pid, state, wait_event_type, wait_event, query
FROM angara_stat_activity
ORDER BY pid;

If the problem is widespread, save the snapshot to the incident ticket/chat.
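
To attach a compact summary to the ticket alongside the raw snapshot, the rows can be grouped by wait event. The following is a minimal Python sketch, assuming the query output was exported as CSV with a header row (for example with psql's `--csv` option). The function name `summarize_waits` and the sample rows, including the `state` and `wait_event` values, are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_waits(csv_text: str) -> Counter:
    """Count sessions per (wait_event_type, wait_event) pair."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Empty cells (sessions that are not waiting) are grouped under "none".
        wait_type = row.get("wait_event_type") or "none"
        wait_event = row.get("wait_event") or "none"
        counts[(wait_type, wait_event)] += 1
    return counts


# Hypothetical snapshot rows, for illustration only.
SAMPLE = """pid,state,wait_event_type,wait_event,query
101,active,Lock,row,UPDATE t SET x = 1
102,active,Lock,row,UPDATE t SET x = 2
103,idle,,,
"""

# Print the most common wait events first.
for (wait_type, wait_event), n in summarize_waits(SAMPLE).most_common():
    print(f"{wait_type}/{wait_event}: {n}")
```

A single dominant wait event type across many sessions is a useful first hint for Step 4.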

Step 2 (2-4 min): understand exactly what is failing or degrading

Find ERROR/WARN entries in logs around the incident time:

rg "ERROR|WARN|panic|timeout|failed|degraded" /var/log/angarabase.log

If tracing is enabled (JSON), quickly filter long-running operations:

jq 'select(.fields.duration_ms != null and .fields.duration_ms > 1000)' /var/log/angarabase.log
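
If the log file mixes JSON records with plain-text lines, jq may stop at the first line it cannot parse. The sketch below applies the same filter in Python and skips non-JSON lines. It assumes one JSON object per line with the duration stored under `fields.duration_ms`, as in the jq filter above. The function name `slow_operations` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def slow_operations(lines: Iterable[str], threshold_ms: float = 1000) -> Iterator[dict]:
    """Yield JSON log records whose fields.duration_ms exceeds threshold_ms."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip plain-text lines mixed into the log
        if not isinstance(record, dict):
            continue
        fields = record.get("fields")
        duration = fields.get("duration_ms") if isinstance(fields, dict) else None
        # Keep only records with a numeric duration above the threshold.
        if isinstance(duration, (int, float)) and duration > threshold_ms:
            yield record
```

Usage: `with open("/var/log/angarabase.log") as f: slow = list(slow_operations(f))`.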

Interpretation:

  • many Lock/timeout entries -> contention is likely;
  • many io/fsync/wal entries -> a storage bottleneck is likely;
  • network errors -> the Net path is likely.
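
The interpretation above can be approximated by counting keyword hits per class. This is a rough sketch only: the keyword patterns are assumptions derived from the list above, not the server's actual log vocabulary, and should be tuned to the real log format. The names `PATTERNS` and `classify_lines` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed keyword groups; adjust to the messages the server really emits.
PATTERNS = {
    "Lock": re.compile(r"\b(dead)?lock\w*|\btimeout", re.IGNORECASE),
    "IO": re.compile(r"\b(io|fsync|wal)\b", re.IGNORECASE),
    "Net": re.compile(r"\b(network|socket|connection)\b", re.IGNORECASE),
}


def classify_lines(lines: Iterable[str]) -> Counter:
    """Count log lines matching each class; a line may count for several classes."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

The class with the most hits is only a starting hypothesis; Steps 3 and 4 confirm or reject it with runtime evidence.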

Step 3 (4-7 min): capture runtime evidence via USDT probes

Verify that probes are available:

bpftrace -l 'usdt:./angarabase-server:angarabase:*'

Quick lock-wait histogram:

bpftrace -e 'usdt:./angarabase-server:angarabase:lock_wait_end { @lock_us = hist(arg1); } interval:s:10 { print(@lock_us); clear(@lock_us); }'

Quick I/O latency histogram:

bpftrace -e 'usdt:./angarabase-server:angarabase:io_end { @io_us = hist(arg1); } interval:s:10 { print(@io_us); clear(@io_us); }'

Quick query-latency slice:

bpftrace -e 'usdt:./angarabase-server:angarabase:query_end { @q_us = hist(arg2); } interval:s:10 { print(@q_us); clear(@q_us); }'

Step 4 (7-9 min): correlation and root-cause hypothesis

Correlate:

  • angara_stat_activity.wait_event_type
  • logs/traces
  • USDT histograms

Triage rule:

  • high lock_wait_end + wait_event_type=Lock -> contention/serialization;
  • high io_end + storage warnings -> disk/flush path;
  • normal lock/io but high query_end -> planner/execute CPU path.
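
The triage rule can be written down as a small decision function, which makes the order of checks explicit. This is a sketch under stated assumptions: what counts as "high" is left to the operator and passed in as booleans, because the runbook does not define thresholds. The function name `triage` and the fallback label `unclassified` are hypothetical.

```python
from typing import Optional


def triage(
    lock_high: bool,
    io_high: bool,
    query_high: bool,
    wait_event_type: Optional[str] = None,
    storage_warnings: bool = False,
) -> str:
    """Map the three USDT signals plus context to a preliminary classification."""
    # High lock waits confirmed by the session view point at contention.
    if lock_high and wait_event_type == "Lock":
        return "contention/serialization"
    # High I/O latency confirmed by storage warnings in the logs.
    if io_high and storage_warnings:
        return "disk/flush path"
    # Slow queries with normal lock and I/O latency point at CPU-bound work.
    if query_high and not lock_high and not io_high:
        return "planner/execute CPU path"
    return "unclassified"
```

An `unclassified` result means the signals disagree or are incomplete; in that case, extend the capture window before committing to a root cause.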

Step 5 (9-10 min): record evidence and next action

Minimum report contents:

  • time window;
  • top symptoms from logs;
  • 1-2 commands and their output (histograms);
  • preliminary root cause;
  • immediate mitigation (for example, reduce load, limit heavy queries, increase monitoring).
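
To keep reports consistent between incidents, the minimum contents can be captured in a small structure and rendered as text for the ticket. This is an illustrative sketch; the class name `IncidentReport`, its field names, and the output layout are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentReport:
    time_window: str
    top_symptoms: List[str]
    evidence: List[str]  # commands together with their output (histograms)
    preliminary_root_cause: str
    immediate_mitigation: str

    def render(self) -> str:
        """Render the report as plain text suitable for a ticket or chat."""
        lines = [f"Time window: {self.time_window}", "Top symptoms from logs:"]
        lines += [f"  - {s}" for s in self.top_symptoms]
        lines.append("Evidence (commands and output):")
        lines += [f"  - {e}" for e in self.evidence]
        lines.append(f"Preliminary root cause: {self.preliminary_root_cause}")
        lines.append(f"Immediate mitigation: {self.immediate_mitigation}")
        return "\n".join(lines)
```

Filling in every field before closing the 10-minute window forces an explicit hypothesis and mitigation, even if both are later revised.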

Expected result

Within 10 minutes you have:

  • a reproducible evidence pack;
  • initial incident classification (Lock/IO/Net/CPU/Scheduler);
  • a clear next step for mitigation/fix.

Troubleshooting

| Symptom | Action |
|---|---|
| bpftrace does not see probes | Check the binary and stapsdt section: `readelf -n ./angarabase-server` |
| failed to attach probe | Run as root or grant capability to bpftrace |
| Logs exist but root cause is unclear | Increase the logging level during the incident and repeat the USDT slice for 60-120 seconds |
| phase_* probes correlate poorly across sessions | Use query_*, lock_*, io_* as the primary signal; treat phase_* as auxiliary |