Runbook: `WALFsyncSlow`

Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7.

What It Means

P99 fsync latency for the WAL has exceeded 50 ms for 5 minutes. Each commit waits on the disk longer than the target budget, so TPS drops, commit latency grows, and the risk of a cascading backlog increases.

Severity

warning. At a P99 of 200 ms or more, treat it as close to critical and consider escalation.

Initial Response

Open Grafana Overview v2 → row “WAL & Durability”.
Check whether the WAL throughput rate (bytes/s) has grown, which may point to write buffer overflow.
Run `iostat -xm 1 5` on the host to check whether the WAL disk is saturated.

Diagnostics

curl -sf http://127.0.0.1:9898/metrics | rg transaction_log
iostat -xm 1 5
dmesg | tail -50   # I/O errors / SMART warnings
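
To make the `iostat` output quicker to read, the saturation check can be sketched as a small filter. This is a minimal sketch, not part of the product tooling: the function name `flag_saturated` and the 90% utilization cutoff are assumptions chosen for illustration. The `%util` column is located from the header line, because column positions differ between sysstat versions.

```shell
#!/bin/sh
# flag_saturated [LIMIT]
# Reads `iostat -x` style output on stdin and prints "<device> <%util>" for
# every device line whose %util is at or above LIMIT (default 90, an assumed
# cutoff, not a value taken from the alert definition).
flag_saturated() {
    awk -v limit="${1:-90}" '
        # Header line: remember which column holds %util.
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Device line: evaluated only after a header with %util was seen.
        # Short lines (blank lines, avg-cpu rows) fail the NF check.
        col && NF >= col && $col + 0 >= limit + 0 {
            printf "%s %s\n", $1, $col
        }
    '
}

# Usage (LC_ALL=C keeps the decimal separator a dot):
#   LC_ALL=C iostat -xm 1 5 | flag_saturated 90
```

The first of the five reports shows averages since boot, so a device that appears only in that report is usually not the current bottleneck.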

Mitigation

Cause	Action
Disk saturated	Move WAL to a separate disk; use SSD/NVMe instead of HDD
Group commit off	Enable `wal.group_commit = true` in config
Network FS	Do NOT use NFS / CIFS for `wal/` — fsync semantics are unpredictable
Large `wal_buffer_bytes`	Reduce to a reasonable value (16-64 MB)
Filesystem barriers off	Check mount options (`barrier=1`, `data=ordered`)
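
The network filesystem and barrier rows above can be checked with `findmnt` from util-linux. The following is a hedged sketch: the function names and the example path are hypothetical, and the list of network filesystem types is an assumption that may need extending. Only explicitly disabled barriers (`nobarrier`, `barrier=0`) are flagged, since default options do not always appear in the mount option list.

```shell
#!/bin/sh
# is_network_fs FSTYPE
# Succeeds (exit 0) when FSTYPE is one of the network filesystems listed
# here. The list is an assumption and is not exhaustive.
is_network_fs() {
    case "$1" in
        nfs|nfs4|cifs|smb3|smbfs) return 0 ;;
        *) return 1 ;;
    esac
}

# check_wal_mount DIR
# Prints the filesystem type and mount options of the mount that holds DIR,
# followed by a warning line for each problem found.
check_wal_mount() {
    dir=$1
    fstype=$(findmnt -n -o FSTYPE -T "$dir") || return 2
    opts=$(findmnt -n -o OPTIONS -T "$dir") || return 2
    echo "fstype=$fstype options=$opts"
    if is_network_fs "$fstype"; then
        echo "WARNING: $dir is on a network filesystem ($fstype)"
    fi
    # Wrap in commas so each option can be matched as a whole word.
    case ",$opts," in
        *,nobarrier,*|*,barrier=0,*)
            echo "WARNING: write barriers are disabled on $dir" ;;
    esac
}

# Usage (the path is a placeholder for the real WAL directory):
#   check_wal_mount /var/lib/example/wal
```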

Escalation

If fsync latency above 200 ms persists for more than 10 minutes, there is a risk of coordinated omission and commit loss. Collect a diagnostics bundle and escalate urgently (durability-critical).
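
The thresholds in this runbook can be summarized as a small decision helper. This is an illustrative sketch only: `classify_fsync` and its output labels are hypothetical names, and the inputs are assumed to be whole milliseconds and whole minutes read from the dashboard.

```shell
#!/bin/sh
# classify_fsync P99_MS DURATION_MIN
# Maps an observed P99 fsync latency (integer ms) and the time it has
# persisted (integer minutes) to the action this runbook describes.
classify_fsync() {
    p99=$1
    mins=$2
    if [ "$p99" -gt 200 ] && [ "$mins" -gt 10 ]; then
        echo "escalate"        # above 200 ms for more than 10 minutes
    elif [ "$p99" -ge 200 ]; then
        echo "near-critical"   # 200 ms or more: consider escalation
    elif [ "$p99" -gt 50 ] && [ "$mins" -ge 5 ]; then
        echo "warning"         # alert condition: above 50 ms for 5 minutes
    else
        echo "ok"
    fi
}

# Usage:
#   classify_fsync 250 12
```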
