Runbook: CommitLatencyTuning
Sources of truth:
- RM-0.6.3.10 (Track B S11/S12/S13) — group-commit baseline.
- RM-0.6.4.0 (Sprint 2/3) — WAL contract, SyncAtCommit mode.
What It Means
Runbook for situations where COMMIT latency is higher than expected or unstable between identical workloads (single-client cron jobs, batch DML, mixed RW).
RM-0.6.4.0 introduced the sync_at_commit mode (alias strict): each COMMIT
forces a WAL fsync before acknowledgment. This adds the sync_at_commit rows
to the expected-latency table.
Durability Modes (RM-0.6.4.0+)
Configured through ANGARABASE_TRANSACTION_LOG_DURABILITY (env).
| Mode | Env value | Behavior | Use |
|---|---|---|---|
| Relaxed | relaxed | WAL is buffered, no fsync per commit | Dev/bench only |
| Group commit | group_commit | WAL pump coalesces and fsyncs by batch | Production (default) |
| Sync at commit | sync_at_commit or strict | fsync on every COMMIT | Banks, finance, max durability |
Important:
`SET [LOCAL] durability = ...` and `COMMIT WITH DURABILITY = ...` are reserved for v0.6.5 and return `SQLSTATE 0A000 feature_not_supported`. Use env for configuration.
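Because the env variable is the only supported switch, a deploy script may want to check its value before starting the server. The following is a minimal sketch of such a check. The helper name `resolve_durability_mode` is hypothetical, and two behaviors are assumptions rather than documented facts: that an unset variable means `group_commit` (the table only marks it as the production default), and that unrecognized values should be rejected.

```python
import os

ENV_VAR = "ANGARABASE_TRANSACTION_LOG_DURABILITY"

# Accepted values from the Durability Modes table; "strict" is an alias.
_ALIASES = {
    "relaxed": "relaxed",
    "group_commit": "group_commit",
    "sync_at_commit": "sync_at_commit",
    "strict": "sync_at_commit",
}


def resolve_durability_mode(environ=None):
    """Return the canonical mode name for the configured env value.

    Assumptions (not confirmed by the runbook): an unset or empty variable
    falls back to group_commit, and unknown values raise instead of being
    silently ignored.
    """
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        return "group_commit"  # assumed fallback; table marks it "(default)"
    try:
        return _ALIASES[raw]
    except KeyError:
        raise ValueError(f"unrecognized {ENV_VAR} value: {raw!r}") from None
```

Calling `resolve_durability_mode()` with no argument reads the real process environment; passing a dict makes the check easy to exercise in a pre-flight test.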
Baseline Latency Expectations
| Mode | Condition | Expectation (guide) |
|---|---|---|
| relaxed | fsync=false | sub-ms COMMIT; not for production |
| group_commit | fsync=false | COMMIT ~0.1–5 ms; batches smooth spikes |
| group_commit | fsync=true | COMMIT 2–20 ms; disk dominates |
| sync_at_commit | NVMe | COMMIT 1–5 ms per tx (one fsync) |
| sync_at_commit | HDD | COMMIT 5–20+ ms per tx |
If p50 or p99 is significantly above the range, check the diagnostics block below.
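The table can be encoded as a small lookup so that measured percentiles are compared against it the same way every time. This is a sketch only: the runbook does not define "significantly above", so the `factor=2.0` multiplier is an assumption, as are the encodings of "sub-ms" as 1.0 ms and "20+" as 20.0 ms. The function name is hypothetical.

```python
# Upper bounds in milliseconds, copied from the Baseline Latency
# Expectations table. Keys are (mode, condition).
# Assumed encodings: "sub-ms" -> 1.0, "5-20+" -> 20.0.
EXPECTED_UPPER_MS = {
    ("relaxed", "fsync=false"): 1.0,
    ("group_commit", "fsync=false"): 5.0,
    ("group_commit", "fsync=true"): 20.0,
    ("sync_at_commit", "NVMe"): 5.0,
    ("sync_at_commit", "HDD"): 20.0,
}


def needs_diagnostics(mode, condition, p50_ms, p99_ms, factor=2.0):
    """Return True if p50 or p99 exceeds factor x the table's upper bound.

    `factor` is an assumed stand-in for "significantly above the range".
    """
    limit_ms = EXPECTED_UPPER_MS[(mode, condition)] * factor
    return p50_ms > limit_ms or p99_ms > limit_ms
```

For example, `needs_diagnostics("sync_at_commit", "NVMe", 3.0, 12.0)` flags the workload because 12 ms exceeds the 10 ms limit derived from the 5 ms upper bound.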
Which Metrics to Watch
New RM-0.6.4.0 Metrics (WAL Commit Path)
curl -sf http://127.0.0.1:9898/metrics | rg "wal_commit|wal_durability|wal_barrier"
| Metric | Meaning |
|---|---|
| angarabase_wal_commit_fsync_total | Number of WAL writer fsync calls (growth = active sync) |
| angarabase_wal_durability_epoch | Monotonic counter of durability barrier epochs |
| angarabase_wal_barrier_wait_total | Number of transactions that waited for the durability barrier |
| angarabase_wal_barrier_duration_seconds | Histogram of barrier wait time |
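When no Prometheus server is available to run queries, a percentile can be estimated directly from the scraped text. The sketch below assumes that `angarabase_wal_barrier_duration_seconds` is exposed in the standard Prometheus text format as cumulative `_bucket{le="..."}` lines and that there is a single series (no extra labels splitting it). It interpolates linearly inside the matching bucket, which is the same idea as PromQL's `histogram_quantile`, not a reimplementation of it.

```python
import math
import re

_BUCKET_RE = re.compile(r'^(\w+)_bucket\{([^}]*)\}\s+(\S+)')
_LE_RE = re.compile(r'le="([^"]+)"')


def histogram_quantile(metrics_text, name, q):
    """Estimate quantile q (0..1) from cumulative `<name>_bucket` lines.

    Returns None when the histogram is missing or has no observations.
    Assumes one series per metric name.
    """
    buckets = []
    for line in metrics_text.splitlines():
        m = _BUCKET_RE.match(line)
        if not m or m.group(1) != name:
            continue
        le = _LE_RE.search(m.group(2))
        if le:
            buckets.append((float(le.group(1)), float(m.group(3))))
    buckets.sort()
    if not buckets or buckets[-1][1] == 0:
        return None

    rank = q * buckets[-1][1]  # target cumulative count
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # falls in +Inf bucket: report last finite bound
            if count == prev_count:
                return bound  # empty span, nothing to interpolate
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound
```

The result is in seconds, so multiply by 1000 before comparing it with millisecond thresholds. Accuracy is limited by the bucket boundaries the server happens to export.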
Baseline Metrics (group commit / write path)
curl -sf http://127.0.0.1:9898/metrics | rg "write_path_phase_b|group_commit|transaction_log"
| Metric | Meaning |
|---|---|
| angarabase_write_path_phase_b_duration_seconds | Phase B histogram (commit hot path) |
| angarabase_write_path_phase_b_timeout_total | Phase B timeouts; should be low |
| angarabase_group_commit_batches_total | Number of pump batches |
| angarabase_group_commit_batch_size | Batch-size distribution |
| angarabase_transaction_log_group_commit_pumps_total | Number of pump runs |
| angarabase_transaction_log_group_commit_pump_duration_ms | Duration of one pump |
Quick Diagnostics
# New WAL metrics (RM-0.6.4.0)
curl -sf http://127.0.0.1:9898/metrics | rg "wal_(commit|durability|barrier)"
# Group commit baseline
curl -sf http://127.0.0.1:9898/metrics | rg "write_path_phase_b|group_commit|transaction_log_group_commit"
# I/O correlate
iostat -xm 1 5
If iostat shows high await/util and *_pump_duration_ms and p99 COMMIT grow at the same time,
the problem is almost always in the I/O layer.
With sync_at_commit: if angarabase_wal_commit_fsync_total grows
proportionally to the tx rate, the mode works correctly. If fsync growth is out of proportion to the tx rate,
check angarabase_wal_barrier_duration_seconds for stalls.
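The proportionality check for sync_at_commit can be made concrete by taking two scrapes of the metrics endpoint and dividing the fsync growth by the number of commits in that window. This is a sketch under assumptions: the counter is taken to be an unlabeled line in Prometheus text format, and the commit count is supplied by the caller (for example from the load generator), because the runbook does not name a commit counter. Both helper names are hypothetical.

```python
FSYNC_COUNTER = "angarabase_wal_commit_fsync_total"


def read_counter(metrics_text, name):
    """Return the value of an unlabeled metric line, or None if absent."""
    for line in metrics_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def fsyncs_per_commit(before_text, after_text, commits_in_window):
    """Ratio of fsync growth to commits between two scrapes.

    Returns None if the counter is missing or no commits were observed.
    """
    before = read_counter(before_text, FSYNC_COUNTER)
    after = read_counter(after_text, FSYNC_COUNTER)
    if before is None or after is None or commits_in_window <= 0:
        return None
    return (after - before) / commits_in_window
```

With one fsync per COMMIT, a ratio close to 1.0 is what the mode description suggests. A ratio far from that is a reason to look at the barrier histogram, not proof of a fault on its own.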
Tuning Order
- Confirm the durability mode (`ANGARABASE_TRANSACTION_LOG_DURABILITY`) and target SLA.
- For `sync_at_commit`: make sure WAL files are on NVMe / a separate spindle.
- Check whether the workload is bursty: compare tx-rate and batch-size histogram.
- For production, stabilize disk first, then tune `group_commit_interval_ms`.
- For bench/dev, `relaxed` is allowed, but record it in the report.
DML-coverage check
For triage, it is useful to confirm that the latency anomaly is not masking a regression:
- `INSERT INTO t(...) VALUES (..., now())`: should succeed.
- `UPDATE t SET x = x + 1 WHERE ...`: the expression should be applied.
- `UPDATE`/`DELETE` in autocommit and in txn should return the correct row count.
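The checks above can be scripted so that they run identically on every triage. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in so that it runs anywhere. It is not the database's own client, and `CURRENT_TIMESTAMP` replaces `now()` because SQLite has no `now()` function. The table layout and the function name are hypothetical. The sketch exercises only the driver's default transactional mode, so the autocommit variant would need a separate pass against the real server.

```python
import sqlite3


def dml_coverage_check(conn):
    """Run the three triage checks; return a dict of check name -> bool."""
    results = {}
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x INTEGER, ts TEXT)")

    # 1. INSERT with a timestamp expression should succeed.
    cur.execute("INSERT INTO t (id, x, ts) VALUES (1, 10, CURRENT_TIMESTAMP)")
    cur.execute("INSERT INTO t (id, x, ts) VALUES (2, 20, CURRENT_TIMESTAMP)")
    conn.commit()
    inserted = cur.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    results["insert"] = inserted == 2

    # 2. UPDATE: the expression is applied and the row count is correct.
    cur.execute("UPDATE t SET x = x + 1 WHERE id = 1")
    updated = cur.rowcount  # capture before the next execute resets it
    conn.commit()
    value = cur.execute("SELECT x FROM t WHERE id = 1").fetchone()[0]
    results["update"] = updated == 1 and value == 11

    # 3. DELETE inside a transaction reports the correct row count.
    cur.execute("DELETE FROM t WHERE x >= 11")
    deleted = cur.rowcount
    conn.commit()
    results["delete"] = deleted == 2
    return results
```

Running `dml_coverage_check(sqlite3.connect(":memory:"))` returns a dict with one boolean per check. Any `False` entry means the latency investigation should pause until correctness is understood.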
Escalation
- If `fsync=true` and p99 COMMIT > 200 ms for more than 10 minutes, escalate as a durability-risk incident.
- If `wal_barrier_duration_seconds` p99 > 50 ms with `sync_at_commit`, check for I/O stall.
- If `errors_total > 0` in TPC-B-lite smoke, stop performance claims until correctness is fixed.
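The first escalation rule depends on a breach being sustained, which is easy to misjudge by eye on a dashboard. A minimal sketch of that check follows. The defaults come from the rule (200 ms, 10 minutes), while the function name and the input shape, a time-ordered list of `(unix_seconds, p99_ms)` samples, are assumptions.

```python
def sustained_breach(samples, threshold_ms=200.0, min_duration_s=600.0):
    """Return True if p99 stays above threshold_ms for more than min_duration_s.

    samples: iterable of (unix_timestamp_seconds, p99_commit_ms), time-ordered.
    Any sample at or below the threshold resets the breach window.
    """
    breach_start = None
    for ts, p99_ms in samples:
        if p99_ms > threshold_ms:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach window
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None
    return False
```

The same function can be reused for the barrier rule by passing `threshold_ms=50.0`. The runbook gives no duration for that rule, so any `min_duration_s` chosen there is a local decision.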