Incident runbook: debug errors in 10 minutes

Goal

Quickly localize the cause of production degradation/errors: understand what is breaking (logs/tracing) and where the bottleneck is (USDT wait events + eBPF).

Prerequisites

angarabase-server is running.
Access to server logs.
Access to an SQL client (psql/pgwire).
For USDT/eBPF: Linux with eBPF support, bpftrace installed, and permissions (CAP_BPF/CAP_PERFMON or root).

Fast path (10 minutes)

Step 1 (0-2 min): record the symptom and affected sessions

Check active sessions and wait state:

SELECT pid, state, wait_event_type, wait_event, query
FROM angara_stat_activity
ORDER BY pid;

If the problem is widespread, save the snapshot to the incident ticket/chat.
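
When the snapshot has many rows, a count per wait type makes the dominant wait class easier to see. The following is a minimal sketch, assuming the query result was exported as CSV with a header row (for example with psql's --csv option, if your client supports it). The function name summarize_waits and the sample rows, including the wait event names, are hypothetical and used only for illustration.

```python
import csv
import io
from collections import Counter


def summarize_waits(csv_text: str) -> Counter:
    """Count sessions per (wait_event_type, wait_event) in a CSV snapshot."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        # Empty wait columns mean the session is not currently waiting.
        wait_type = row.get("wait_event_type") or "(none)"
        wait_event = row.get("wait_event") or "(none)"
        counts[(wait_type, wait_event)] += 1
    return counts


# Hypothetical snapshot; real wait event names may differ.
SAMPLE = """pid,state,wait_event_type,wait_event,query
101,active,Lock,row_lock,UPDATE t SET x = 1
102,active,Lock,row_lock,UPDATE t SET x = 2
103,active,IO,wal_write,INSERT INTO t VALUES (1)
104,idle,,,
"""

if __name__ == "__main__":
    # Print the most common wait classes first.
    for (wait_type, wait_event), n in summarize_waits(SAMPLE).most_common():
        print(f"{n:3d}  {wait_type}/{wait_event}")
```

Pasting this summary next to the raw snapshot in the ticket keeps the evidence compact without discarding the per-session detail.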

Step 2 (2-4 min): understand exactly what is failing or degrading

Find ERROR/WARN entries in logs around the incident time:

rg "ERROR|WARN|panic|timeout|failed|degraded" /var/log/angarabase.log

If tracing is enabled (JSON), quickly filter long-running operations:

jq 'select(.fields.duration_ms != null and .fields.duration_ms > 1000)' /var/log/angarabase.log
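
If the log file mixes JSON records with plain-text lines, a line-by-line filter that skips unparseable lines can be more forgiving than a single jq pass. The sketch below applies the same condition (fields.duration_ms above 1000) in Python. The function name slow_operations and the sample records, including the "message" key, are assumptions made for illustration.

```python
import json


def slow_operations(lines, threshold_ms=1000):
    """Yield JSON log records whose fields.duration_ms exceeds threshold_ms."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip plain-text lines mixed into the log
        if not isinstance(record, dict):
            continue
        fields = record.get("fields")
        if not isinstance(fields, dict):
            continue
        duration = fields.get("duration_ms")
        if isinstance(duration, (int, float)) and duration > threshold_ms:
            yield record


# Hypothetical log lines; the real record layout may contain other keys.
SAMPLE_LINES = [
    '{"fields": {"duration_ms": 2500, "message": "query finished"}}',
    '{"fields": {"duration_ms": 12, "message": "query finished"}}',
    '{"fields": {"message": "no duration recorded"}}',
    "plain text line that is not JSON",
]

if __name__ == "__main__":
    for record in slow_operations(SAMPLE_LINES):
        print(record["fields"]["duration_ms"], record["fields"].get("message"))
```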

Interpretation:

many Lock/timeout entries -> contention is likely;
many io/fsync/wal entries -> a storage bottleneck is likely;
network errors -> the Net path is likely.
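
The interpretation rules above can be approximated by counting log lines per keyword bucket. This is a rough sketch: the lock/timeout and io/fsync/wal keywords come from the list above, while the Net patterns (network, connection, socket) and the sample log lines are assumptions that should be tuned to the real log wording.

```python
import re
from collections import Counter

# Keyword buckets; patterns match whole words, case-insensitively.
BUCKETS = {
    "Lock": re.compile(r"\b(lock|timeout)\b", re.IGNORECASE),
    "IO": re.compile(r"\b(io|fsync|wal)\b", re.IGNORECASE),
    # Assumed wording for network errors; adjust to the actual messages.
    "Net": re.compile(r"\b(network|connection|socket)\b", re.IGNORECASE),
}


def classify(lines) -> Counter:
    """Count how many lines match each bucket (a line may hit several buckets)."""
    counts = Counter()
    for line in lines:
        for bucket, pattern in BUCKETS.items():
            if pattern.search(line):
                counts[bucket] += 1
    return counts


# Hypothetical log lines for illustration only.
SAMPLE_LOG = [
    "ERROR lock wait timeout exceeded",
    "WARN fsync took 1200 ms",
    "WARN wal flush degraded",
    "ERROR network read failed",
    "INFO checkpoint complete",
]

if __name__ == "__main__":
    for bucket, n in classify(SAMPLE_LOG).most_common():
        print(f"{bucket}: {n}")
```

The bucket with the highest count is only a hint for where to point the USDT probes in the next step, not a diagnosis.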

Step 3 (4-7 min): capture runtime evidence via USDT probes

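Verify that probes are available (run from the directory that contains the binary, or replace ./angarabase-server with its full path):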

bpftrace -l 'usdt:./angarabase-server:angarabase:*'

Quick lock-wait histogram:

bpftrace -e 'usdt:./angarabase-server:angarabase:lock_wait_end { @lock_us = hist(arg1); } interval:s:10 { print(@lock_us); clear(@lock_us); }'

Quick I/O latency histogram:

bpftrace -e 'usdt:./angarabase-server:angarabase:io_end { @io_us = hist(arg1); } interval:s:10 { print(@io_us); clear(@io_us); }'

Quick query-latency slice:

bpftrace -e 'usdt:./angarabase-server:angarabase:query_end { @q_us = hist(arg2); } interval:s:10 { print(@q_us); clear(@q_us); }'

Step 4 (7-9 min): correlation and root-cause hypothesis

Correlate:

angara_stat_activity.wait_event_type
logs/traces
USDT histograms

Triage rule:

high lock_wait_end + wait_event_type=Lock -> contention/serialization;
high io_end + storage warnings -> disk/flush path;
normal lock/io but high query_end -> planner/execute CPU path.
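
The triage rule can be written down as a small decision function, which helps keep classification consistent between responders. This is a sketch only: what counts as "high" is not defined in this runbook, so the function takes booleans that the responder decides from the histograms. The function name triage and its parameters are hypothetical.

```python
from typing import Optional


def triage(lock_high: bool, io_high: bool, query_high: bool,
           wait_event_type: Optional[str] = None,
           storage_warnings: bool = False) -> str:
    """Map the observed signals to a preliminary incident class."""
    # High lock_wait_end together with wait_event_type=Lock.
    if lock_high and wait_event_type == "Lock":
        return "contention/serialization"
    # High io_end together with storage warnings in the logs.
    if io_high and storage_warnings:
        return "disk/flush path"
    # Lock and I/O look normal, but query_end latency is high.
    if query_high and not lock_high and not io_high:
        return "planner/execute CPU path"
    # Signals disagree or nothing stands out: gather more evidence.
    return "unclassified"


if __name__ == "__main__":
    print(triage(lock_high=True, io_high=False, query_high=True,
                 wait_event_type="Lock"))
```

The rules are checked in the order listed above, so a lock signal takes precedence when several conditions hold at once. That ordering is a design choice of this sketch, not something the runbook prescribes.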

Step 5 (9-10 min): record evidence and next action

Minimum report contents:

time window;
top symptoms from logs;
1-2 commands and their output (histograms);
preliminary root cause;
immediate mitigation (for example, reduce load, limit heavy queries, increase monitoring).
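
To keep reports uniform across incidents, the minimum contents can be captured in a small structure and rendered as plain text. The sketch below is one possible layout. The class name IncidentReport, its field names, and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentReport:
    time_window: str
    top_symptoms: List[str]
    evidence: List[str]  # commands together with their output
    preliminary_root_cause: str
    immediate_mitigation: str

    def render(self) -> str:
        """Render the report as plain text for a ticket or chat."""
        lines = [f"Time window: {self.time_window}", "Top symptoms from logs:"]
        lines += [f"  - {s}" for s in self.top_symptoms]
        lines.append("Evidence (commands and output):")
        lines += [f"  - {e}" for e in self.evidence]
        lines.append(f"Preliminary root cause: {self.preliminary_root_cause}")
        lines.append(f"Immediate mitigation: {self.immediate_mitigation}")
        return "\n".join(lines)


# Hypothetical example values for illustration only.
EXAMPLE = IncidentReport(
    time_window="14:02-14:12 UTC",
    top_symptoms=["lock wait timeouts", "slow UPDATE statements"],
    evidence=["lock_wait_end histogram: see attached output"],
    preliminary_root_cause="contention/serialization",
    immediate_mitigation="limit heavy queries",
)

if __name__ == "__main__":
    print(EXAMPLE.render())
```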

Expected result

Within 10 minutes you have:

a reproducible evidence pack;
initial incident classification (Lock/IO/Net/CPU/Scheduler);
a clear next step for mitigation/fix.

Troubleshooting

Symptom	Action
`bpftrace` does not see probes	Check the binary and stapsdt section: `readelf -n ./angarabase-server`
`failed to attach probe`	Run as root or grant `bpftrace` the required capabilities (CAP_BPF/CAP_PERFMON)
Logs exist but root cause is unclear	Increase the logging level during the incident and repeat the USDT slice for 60-120 seconds
`phase_*` probes correlate poorly across sessions	Use `query_`, `lock_`, `io_` as the primary signal; treat `phase_` as auxiliary
