Runbook: WALFsyncSlow
Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7.
What It Means
P99 fsync latency for WAL exceeds 50 ms for 5 minutes. Each commit waits for disk longer than the target budget — TPS drops, commit latency grows, and cascading backlog risk increases.
Severity
warning. At 200 ms+, it is close to critical (consider escalation).
Initial response
- Grafana Overview v2 → row “WAL & Durability”.
- Check whether the WAL throughput rate (bytes/s) has grown; a surge can overflow the write buffer.
- Run iostat -xm 1 5 on the host to check whether the WAL disk is saturated.
Diagnostics
curl -sf http://127.0.0.1:9898/metrics | rg transaction_log
iostat -xm 1 5
dmesg | tail -n 50 # I/O errors / SMART warnings
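To read the iostat output faster, the following is a minimal sketch of a filter that prints only devices at or above a utilization threshold. The helper name saturated_devices and the default threshold of 90 are assumptions for illustration, not part of any shipped tooling. It also assumes the sysstat extended layout, where the table header starts with Device and %util is the last column.

```sh
# Hypothetical helper: print "<device> <%util>" for each device row whose
# %util is at or above the threshold (default 90, an assumed value).
# Reads `iostat -x` style output on stdin.
saturated_devices() {
  threshold="${1:-90}"
  awk -v thr="$threshold" '
    # Header row: only trust the table if %util really is the last column.
    $1 ~ /^Device/ { in_table = ($NF == "%util"); next }
    # A blank line ends the current device table.
    NF == 0        { in_table = 0; next }
    # Device row: compare the last column numerically.
    in_table && ($NF + 0) >= thr { print $1, $NF }
  '
}

# Example (LC_ALL=C keeps a dot as the decimal separator):
#   LC_ALL=C iostat -xm 1 5 | saturated_devices 90
```

With a multi-report run such as iostat -xm 1 5, a device appears once per report in which it crossed the threshold. The first report covers the time since boot, so later reports are the ones that reflect current load.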
Mitigation
| Cause | Action |
|---|---|
| Disk saturated | Move WAL to a separate disk; use SSD/NVMe instead of HDD |
| Group commit off | Enable wal.group_commit = true in config |
| Network FS | Do NOT use NFS / CIFS for wal/ — fsync semantics are unpredictable |
| Large wal_buffer_bytes | Reduce to a reasonable value (16-64 MB) |
| Filesystem barriers off | Check mount options (barrier=1, data=ordered) |
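For the network FS and mount-option rows above, a hedged sketch of a check against the WAL directory follows. The helper names is_network_fs and check_wal_mount are hypothetical, and the WAL directory path must be supplied by the operator. The list of network filesystem types (nfs, nfs4, cifs, smb3) is an assumption that extends the NFS / CIFS entry in the table. It relies on findmnt from util-linux.

```sh
# Hypothetical helper: succeed (exit 0) if the filesystem type is one of the
# network types assumed here; extend the list to match your environment.
is_network_fs() {
  case "$1" in
    nfs|nfs4|cifs|smb3) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the filesystem type and mount options of the
# mount that contains the given directory, and warn if it is a network FS.
check_wal_mount() {
  wal_dir="$1"
  fstype="$(findmnt -n -o FSTYPE --target "$wal_dir")" || return 2
  options="$(findmnt -n -o OPTIONS --target "$wal_dir")" || return 2
  echo "fstype=$fstype options=$options"
  if is_network_fs "$fstype"; then
    echo "WARNING: $wal_dir is on a network filesystem ($fstype)" >&2
    return 1
  fi
}

# Example: check_wal_mount /path/to/wal
```

The printed options can be compared by eye against the mount options named in the table. The sketch does not interpret them, because reported defaults differ between filesystems and kernel versions.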
Escalation
If fsync > 200 ms persists for more than 10 minutes, this is a path to coordinated omission and commit loss; collect a diagnostics bundle and escalate urgently (durability-critical).
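The thresholds in this runbook can be sketched as one decision helper for use during triage. The function name fsync_severity and its output labels are hypothetical. The numbers (50 ms for 5 minutes, 200 ms, 10 minutes) come from the sections above, and both inputs are assumed to be whole numbers.

```sh
# Hypothetical helper: map an observed P99 WAL fsync latency and the time it
# has been sustained to a triage label, using this runbook's thresholds.
fsync_severity() {
  p99_ms="$1"         # observed P99 fsync latency, integer milliseconds
  sustained_min="$2"  # how long it has stayed at that level, integer minutes
  if [ "$p99_ms" -gt 200 ] && [ "$sustained_min" -gt 10 ]; then
    echo "escalate"        # durability-critical: collect bundle, escalate
  elif [ "$p99_ms" -ge 200 ]; then
    echo "near-critical"   # consider escalation
  elif [ "$p99_ms" -gt 50 ] && [ "$sustained_min" -ge 5 ]; then
    echo "warning"         # the alert condition itself
  else
    echo "ok"
  fi
}

# Example: fsync_severity 250 12   # prints "escalate"
```

The label is only a triage aid; the alert rule in the source-of-truth file remains authoritative for what actually fires.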