Incident runbook: debug errors in 10 minutes
Goal
Quickly localize the cause of production degradation/errors: understand what is breaking (logs/tracing) and where the bottleneck is (USDT wait events + eBPF).
Prerequisites
- `angarabase-server` is running.
- Access to server logs.
- Access to an SQL client (`psql/pgwire`).
- For USDT/eBPF: Linux with eBPF support, `bpftrace` installed, and permissions (`CAP_BPF/CAP_PERFMON` or root).
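Before starting the clock, the tool prerequisites can be checked with a small shell helper. This is a minimal sketch: the function name `missing_tools` is hypothetical, and it only verifies that the commands used in this runbook are on `PATH`, not that permissions or eBPF support are in place.

```sh
# Hypothetical helper: report which of the given tools are missing from PATH.
# Prints one "missing: <tool>" line per absent tool and returns non-zero
# if at least one tool is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}
```

A typical call would be `missing_tools psql bpftrace rg jq`; empty output means all four are available.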
Fast path (10 minutes)
Step 1 (0-2 min): record the symptom and affected sessions
Check active sessions and wait state:
SELECT pid, state, wait_event_type, wait_event, query
FROM angara_stat_activity
ORDER BY pid;
If the problem is widespread, save the snapshot to the incident ticket/chat.
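Saving the snapshot can be scripted so that the file name already carries the incident label and a UTC timestamp. The sketch below makes several assumptions: the helper names `snapshot_name` and `save_activity_snapshot` are hypothetical, the incident label is free-form, and `psql` is assumed to pick up connection settings from the standard `PG*` environment variables.

```sh
# Hypothetical helper: build a timestamped file name for an incident snapshot.
# $1 = incident label (free-form, assumption). UTC timestamps keep names sortable.
snapshot_name() {
    echo "activity-$1-$(date -u +%Y%m%dT%H%M%SZ).txt"
}

# Run the Step 1 query and write the result to a snapshot file.
# -X skips ~/.psqlrc, -A gives unaligned output, -F sets the field separator.
# Prints the file name on success so it can be pasted into the ticket.
save_activity_snapshot() {
    out=$(snapshot_name "$1")
    psql -X -A -F '|' \
        -c "SELECT pid, state, wait_event_type, wait_event, query FROM angara_stat_activity ORDER BY pid;" \
        > "$out" && echo "$out"
}
```

Calling `save_activity_snapshot INC-123` (the label is only an example) would leave one file per snapshot, which makes before/after comparison easier later in the incident.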
Step 2 (2-4 min): understand exactly what is failing or degrading
Find ERROR/WARN entries in logs around the incident time:
rg "ERROR|WARN|panic|timeout|failed|degraded" /var/log/angarabase.log
If tracing is enabled (JSON), quickly filter long-running operations:
jq 'select(.fields.duration_ms != null and .fields.duration_ms > 1000)' /var/log/angarabase.log
Interpretation:
- many `Lock/timeout` entries -> contention is likely;
- many `io/fsync/wal` entries -> a storage bottleneck is likely;
- network errors -> the `Net` path is likely.
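The interpretation list above can be turned into a rough counter that shows which class dominates a log excerpt. This is a sketch, not a parser for the real log format: the function name `classify_log` is hypothetical, the keywords are matched as whole lowercase words, and the network keywords (`connection`, `network`, `socket`) are an assumption because the runbook does not list them.

```sh
# Count log lines per symptom class and print "class count" pairs, highest first.
# Reads log text on stdin. Each line is padded with spaces so that keywords can
# be matched as whole words (for example, "io" must not match inside "connection").
classify_log() {
    awk '
        { line = " " tolower($0) " " }
        line ~ /[^a-z](lock|timeout)[^a-z]/              { n["Lock"]++ }
        line ~ /[^a-z](io|fsync|wal)[^a-z]/              { n["IO"]++ }
        line ~ /[^a-z](connection|network|socket)[^a-z]/ { n["Net"]++ }
        END { for (c in n) print c, n[c] }
    ' | sort -k2,2nr -k1,1
}
```

One possible use is to feed it the output of the `rg` command from this step, for example `rg "ERROR|WARN" /var/log/angarabase.log | classify_log`. The counts are only a hint for where to look first; a single line can fall into more than one class.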
Step 3 (4-7 min): capture runtime evidence via USDT probes
Verify that probes are available:
bpftrace -l 'usdt:./angarabase-server:angarabase:*'
Quick lock-wait histogram:
bpftrace -e 'usdt:./angarabase-server:angarabase:lock_wait_end { @lock_us = hist(arg1); } interval:s:10 { print(@lock_us); clear(@lock_us); }'
Quick I/O latency histogram:
bpftrace -e 'usdt:./angarabase-server:angarabase:io_end { @io_us = hist(arg1); } interval:s:10 { print(@io_us); clear(@io_us); }'
Quick query-latency slice:
bpftrace -e 'usdt:./angarabase-server:angarabase:query_end { @q_us = hist(arg2); } interval:s:10 { print(@q_us); clear(@q_us); }'
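The three one-liners above differ only in the probe name and the argument that carries the duration, so they can be generated from one template and run for a bounded window. The following is a sketch under assumptions: the helper names `hist_program` and `capture_hist` and the output file name are hypothetical, and `timeout` is assumed to be the GNU coreutils version.

```sh
# Hypothetical helper: build the bpftrace histogram program for one probe.
# $1 = probe name (e.g. lock_wait_end)
# $2 = argument holding the duration (e.g. arg1)
# $3 = path to the server binary
hist_program() {
    printf 'usdt:%s:angarabase:%s { @us = hist(%s); } interval:s:10 { print(@us); clear(@us); }\n' \
        "$3" "$1" "$2"
}

# Run one histogram for a bounded window and save the output to a file.
# $4 = window in seconds (default 30). SIGINT lets bpftrace exit cleanly.
# Needs root or CAP_BPF/CAP_PERFMON; timeout returns 124 when the window ends.
capture_hist() {
    timeout -s INT "${4:-30}" bpftrace -e "$(hist_program "$1" "$2" "$3")" > "hist-$1.txt"
}
```

For example, `capture_hist io_end arg1 ./angarabase-server 60` would collect six 10-second I/O histograms into `hist-io_end.txt`, ready to attach to the ticket.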
Step 4 (7-9 min): correlation and root-cause hypothesis
Correlate:
- `angara_stat_activity.wait_event_type`
- logs/traces
- USDT histograms
Triage rule:
- high `lock_wait_end` + `wait_event_type=Lock` -> contention/serialization;
- high `io_end` + storage warnings -> disk/flush path;
- normal lock/io but high `query_end` -> planner/execute CPU path.
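The triage rule can be written down as a small decision function, which helps keep the classification consistent between responders. This is a sketch: the function name `triage` and its argument encoding are hypothetical, and deciding whether a histogram is "high" is left to the operator (for example, by comparing against a known-good baseline).

```sh
# Map the collected signals to a first hypothesis, following the triage rule.
# $1 = dominant wait_event_type from angara_stat_activity (e.g. Lock, or empty)
# $2 = "high" or "normal" for the lock_wait_end histogram
# $3 = "high" or "normal" for the io_end histogram
# $4 = "high" or "normal" for the query_end histogram
# $5 = "yes" if the logs contain storage warnings, otherwise "no"
triage() {
    if [ "$2" = high ] && [ "$1" = Lock ]; then
        echo "contention/serialization"
    elif [ "$3" = high ] && [ "$5" = yes ]; then
        echo "disk/flush path"
    elif [ "$2" = normal ] && [ "$3" = normal ] && [ "$4" = high ]; then
        echo "planner/execute CPU path"
    else
        echo "unclassified"
    fi
}
```

An `unclassified` result means the signals do not match any single rule; in that case it is usually worth repeating the USDT slice over a longer window before committing to a hypothesis.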
Step 5 (9-10 min): record evidence and next action
Minimum report contents:
- time window;
- top symptoms from logs;
- 1-2 commands and their output (histograms);
- preliminary root cause;
- immediate mitigation (for example, reduce load, limit heavy queries, increase monitoring).
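To make sure none of the minimum report fields is forgotten under time pressure, the skeleton can be printed by a helper and filled in by hand. The function name `incident_report` and the field layout are assumptions; only the list of fields comes from this runbook.

```sh
# Hypothetical helper: print a minimal incident report skeleton on stdout.
# $1 = window start, $2 = window end,
# $3 = preliminary root cause, $4 = immediate mitigation.
# Log symptoms and histogram output are pasted in manually.
incident_report() {
    printf '%s\n' \
        "Time window: $1 .. $2" \
        "Top symptoms from logs: (paste rg/jq output)" \
        "Commands and output: (paste 1-2 bpftrace histograms)" \
        "Preliminary root cause: $3" \
        "Immediate mitigation: $4"
}
```

For example, `incident_report 10:02 10:12 "lock contention" "limit heavy queries" > report.txt` produces a five-line file that can go straight into the incident ticket.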
Expected result
Within 10 minutes you have:
- a reproducible evidence pack;
- initial incident classification (Lock/IO/Net/CPU/Scheduler);
- a clear next step for mitigation/fix.
Troubleshooting
| Symptom | Action |
|---|---|
| `bpftrace` does not see probes | Check the binary and its stapsdt section: `readelf -n ./angarabase-server` |
| `failed to attach probe` | Run as root or grant the required capabilities (`CAP_BPF/CAP_PERFMON`) to `bpftrace` |
| Logs exist but root cause is unclear | Increase the logging level during the incident and repeat the USDT slice for 60-120 seconds |
| `phase_*` probes correlate poorly across sessions | Use `query_*`, `lock_*`, `io_*` as the primary signal; treat `phase_*` as auxiliary |
Links
- Structured logging: Structured logging
- Tracing: Tracing
- USDT probes: USDT probes
- General diagnostics: Diagnostics