Runbook: WALFsyncSlow

Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7.

What It Means

The P99 WAL fsync latency has exceeded 50 ms for 5 minutes. Each commit waits on the disk longer than the target budget, so TPS drops, commit latency grows, and the risk of a cascading backlog increases.
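To make the "P99 over 50 ms" condition concrete, here is a minimal sketch of how a P99 can be estimated from cumulative histogram buckets. It assumes the fsync latency is exported as a Prometheus-style histogram, which this runbook does not confirm; the bucket bounds and counts are made-up illustration data, and the authoritative alert expression remains the one in angarabase_alerts.yaml.

```python
from math import inf

BUDGET_S = 0.050  # 50 ms target from this runbook


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    bound and ending with (inf, total). Uses linear interpolation inside
    the bucket that contains the target rank.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == inf:
                # Cannot interpolate into the +Inf bucket; report its lower edge.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket data: 1000 fsyncs, 20 of them slower than 50 ms.
buckets = [(0.01, 900), (0.05, 980), (0.1, 995), (inf, 1000)]
p99 = histogram_quantile(0.99, buckets)
print(f"p99 = {p99 * 1000:.1f} ms, over budget: {p99 > BUDGET_S}")
```

With this sample data the estimate is about 83 ms, so the 50 ms budget is breached even though 98% of fsyncs finished within it.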

Severity

warning. At 200 ms or more, treat it as close to critical and consider escalation.

Initial response

  1. Open Grafana Overview v2 → row “WAL & Durability”.
  2. Check whether the WAL throughput rate (bytes/s) has grown; a sustained increase can point to write buffer overflow.
  3. Run iostat -xm 1 5 on the host to check whether the WAL disk is saturated.
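The saturation check in step 3 can be sketched as a small parser over captured iostat output. This is an illustration only: the 90% utilization threshold is an assumption rather than a value from this runbook, and the sample text uses a trimmed set of columns (real `iostat -x` output has more). Only the last report is used, because the first iostat report shows averages since boot.

```python
def parse_iostat(text):
    """Return {device: {column: value}} for the LAST report in `iostat -x` output."""
    reports = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line ends the current device table
            continue
        if fields[0].startswith("Device"):
            header = fields[1:]
            reports.append({})
            continue
        if header is None:
            continue  # banner, avg-cpu block, etc.
        try:
            # Some locales print decimal commas; normalise before parsing.
            values = [float(v.replace(",", ".")) for v in fields[1:]]
        except ValueError:
            continue
        reports[-1][fields[0]] = dict(zip(header, values))
    return reports[-1] if reports else {}


def saturated_devices(devices, util_threshold=90.0):
    """Devices whose %util is at or above the (assumed) threshold."""
    return sorted(
        name for name, cols in devices.items()
        if cols.get("%util", 0.0) >= util_threshold
    )


# Trimmed, made-up sample with two reports.
SAMPLE = """\
Device            r/s     w/s     wMB/s  w_await  aqu-sz  %util
sda              1.00   50.00      2.00     3.10    0.20  12.00
nvme0n1          0.00  900.00     80.00    45.00   40.00  99.50

Device            r/s     w/s     wMB/s  w_await  aqu-sz  %util
sda              0.00   40.00      1.50     2.90    0.10  10.00
nvme0n1          0.00  950.00     85.00    60.00   55.00  99.80
"""

print(saturated_devices(parse_iostat(SAMPLE)))
```

On devices that serve requests in parallel (NVMe, RAID arrays), %util can read high without the device being at its limit, so it is worth looking at w_await and aqu-sz alongside it.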

Diagnostics

curl -sf http://127.0.0.1:9898/metrics | rg transaction_log
iostat -xm 1 5
dmesg | tail -n 50   # I/O errors / SMART warnings
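When the metrics need to be compared programmatically rather than eyeballed, the first command above can be approximated in Python. This is a rough sketch: it assumes the endpoint serves the Prometheus text exposition format, and the metric names in the sample are hypothetical, not taken from the product. Unlike the rg pipeline, it matches on the metric name only and skips comment lines.

```python
import re
import urllib.request

# name, optional {labels}, value, optional timestamp
LINE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$")


def fetch_metrics(url="http://127.0.0.1:9898/metrics", timeout=5):
    """Fetch the raw metrics text (same endpoint as the curl command)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def filter_metrics(text, needle="transaction_log"):
    """Return {series: value} for samples whose metric name contains `needle`."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE_RE.match(line)
        if not m or needle not in m.group(1):
            continue
        try:
            value = float(m.group(3))
        except ValueError:
            continue
        result[m.group(1) + (m.group(2) or "")] = value
    return result


# Hypothetical sample; real metric names may differ.
SAMPLE_METRICS = """\
# HELP transaction_log_fsync_seconds fsync latency
# TYPE transaction_log_fsync_seconds histogram
transaction_log_fsync_seconds_bucket{le="0.05"} 980
transaction_log_fsync_seconds_bucket{le="+Inf"} 1000
transaction_log_bytes_total 1.5e+09
http_requests_total 42
"""

for series, value in filter_metrics(SAMPLE_METRICS).items():
    print(series, value)
```

On a live host, `filter_metrics(fetch_metrics())` gives the same view as the curl command, in a form that can be diffed between two points in time.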

Mitigation

| Cause | Action |
|---|---|
| Disk saturated | Move WAL to a separate disk; use SSD/NVMe instead of HDD |
| Group commit off | Enable wal.group_commit = true in config |
| Network FS | Do NOT use NFS / CIFS for wal/ — fsync semantics are unpredictable |
| Large wal_buffer_bytes | Reduce to a reasonable value (16-64 MB) |
| Filesystem barriers off | Check mount options (barrier=1, data=ordered) |
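The network-FS and barrier rows above can be checked mechanically against /proc/mounts. The sketch below is an assumption-laden illustration: the WAL paths and sample mount table are made up, the set of network filesystem types is not exhaustive, and flagging data=writeback is an interpretation of the data=ordered recommendation for ext-family filesystems.

```python
import os

NETWORK_FS = {"nfs", "nfs4", "cifs", "smb3"}  # not exhaustive


def find_mount(mounts_text, path):
    """Return (mountpoint, fstype, options) of the mount that contains `path`."""
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue
        mountpoint = parts[1].rstrip("/") or "/"
        prefix = mountpoint.rstrip("/") + "/"
        if path == mountpoint or path.startswith(prefix):
            # Longest mountpoint wins; later entries win ties.
            if best is None or len(mountpoint) >= len(best[0]):
                best = (mountpoint, parts[2], parts[3].split(","))
    return best


def check_wal_mount(mounts_text, wal_path):
    """Return a list of human-readable warnings for the WAL directory's mount."""
    mount = find_mount(mounts_text, wal_path)
    if mount is None:
        return [f"no mount found for {wal_path}"]
    mountpoint, fstype, options = mount
    warnings = []
    if fstype in NETWORK_FS:
        warnings.append(f"{mountpoint}: network filesystem ({fstype})")
    if "nobarrier" in options or "barrier=0" in options:
        warnings.append(f"{mountpoint}: write barriers disabled")
    if "data=writeback" in options:
        warnings.append(f"{mountpoint}: data=writeback instead of data=ordered")
    return warnings


# Made-up mount table in /proc/mounts format.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/nvme0n1p1 /var/lib/db ext4 rw,noatime,nobarrier 0 0
filer:/export /mnt/wal nfs4 rw,vers=4.2 0 0
"""

for wal_dir in ("/var/lib/db/wal", "/mnt/wal", "/home/x"):
    print(wal_dir, check_wal_mount(SAMPLE_MOUNTS, wal_dir))
```

On a real host, pass the contents of /proc/mounts and the actual WAL directory instead of the sample values.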

Escalation

If fsync latency stays above 200 ms for more than 10 minutes, this is a path to coordinated omission and commit loss. Collect a diagnostics bundle and escalate urgently (durability-critical).
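The two time-based thresholds in this runbook (50 ms for 5 minutes, 200 ms for more than 10 minutes) can be sketched as a check over a series of P99 samples. This is an illustrative approximation, not the alerting engine's logic: the sample format, a list of (unix_seconds, p99_seconds) pairs, and the function names are assumptions.

```python
WARN_THRESHOLD_S = 0.050      # 50 ms
WARN_FOR_S = 5 * 60           # sustained for 5 minutes
ESCALATE_THRESHOLD_S = 0.200  # 200 ms
ESCALATE_FOR_S = 10 * 60      # sustained for more than 10 minutes


def breach_duration(samples, threshold_s):
    """Seconds covered by the most recent unbroken run of samples above threshold.

    samples: list of (timestamp_seconds, p99_seconds), sorted by time.
    Returns 0.0 if the latest sample is at or below the threshold.
    """
    start = None
    last_ts = None
    for ts, value in samples:
        if value > threshold_s:
            if start is None:
                start = ts
        else:
            start = None  # the run is broken; start over
        last_ts = ts
    return 0.0 if start is None else float(last_ts - start)


def classify(samples):
    """Return 'escalate', 'warning', or 'ok' for the current state."""
    if breach_duration(samples, ESCALATE_THRESHOLD_S) > ESCALATE_FOR_S:
        return "escalate"
    if breach_duration(samples, WARN_THRESHOLD_S) >= WARN_FOR_S:
        return "warning"
    return "ok"


# One sample per minute; values are made up.
slow = [(t * 60, 0.080) for t in range(6)]       # 80 ms for 5 minutes
very_slow = [(t * 60, 0.250) for t in range(12)]  # 250 ms for 11 minutes
recovered = very_slow + [(12 * 60, 0.010)]        # latest sample is healthy

print(classify(slow), classify(very_slow), classify(recovered))
```

A single healthy sample resets the run here, which mirrors how a "for" duration typically behaves in alerting rules; a real implementation might want to tolerate brief dips.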