Runbook: CommitLatencyTuning

Sources of truth:

  • RM-0.6.3.10 (Track B S11/S12/S13) — group-commit baseline.
  • RM-0.6.4.0 (Sprint 2/3) — WAL contract, SyncAtCommit mode.

What It Means

Runbook for situations where COMMIT latency is higher than expected or unstable between identical workloads (single-client cron jobs, batch DML, mixed RW).

After RM-0.6.4.0, the sync_at_commit mode was introduced (alias strict): each COMMIT forces WAL fsync before acknowledgment. This adds new modes to the expected-latency table.

Durability Modes (RM-0.6.4.0+)

Configured through the ANGARABASE_TRANSACTION_LOG_DURABILITY environment variable.

| Mode | Env value | Behavior | Use |
|---|---|---|---|
| Relaxed | relaxed | WAL is buffered, no fsync per commit | Dev/bench only |
| Group commit | group_commit | WAL pump coalesces and fsyncs by batch | Production (default) |
| Sync at commit | sync_at_commit or strict | fsync on every COMMIT | Banks, finance, max durability |

Important: SET [LOCAL] durability = ... and COMMIT WITH DURABILITY = ... are reserved for v0.6.5 and return SQLSTATE 0A000 feature_not_supported. Use env for configuration.
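Since the environment variable is the only supported switch, a typo in its value is worth catching before the server starts. A minimal sketch, assuming only the four values listed in the table above are accepted; the helper name check_durability_mode is hypothetical, and how the server reacts to an unknown value is not covered by this runbook:

```shell
# Hypothetical pre-start check: accept only the values from the mode table.
check_durability_mode() {
  case "$1" in
    relaxed|group_commit|sync_at_commit|strict) return 0 ;;
    *) echo "unsupported durability mode: '$1'" >&2; return 1 ;;
  esac
}

# Export the variable only if the candidate value passes the check.
mode="group_commit"
if check_durability_mode "$mode"; then
  export ANGARABASE_TRANSACTION_LOG_DURABILITY="$mode"
fi
```

Run this in the same shell or service wrapper that launches the server, so that the exported variable is inherited by the process.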

Baseline Latency Expectations

| Mode | Condition | Expectation (guide) |
|---|---|---|
| relaxed | fsync=false | sub-ms COMMIT; not for production |
| group_commit | fsync=false | COMMIT ~0.1–5 ms; batches smooth spikes |
| group_commit | fsync=true | COMMIT 2–20 ms; disk dominates |
| sync_at_commit | NVMe | COMMIT 1–5 ms per tx (one fsync) |
| sync_at_commit | HDD | COMMIT 5–20+ ms per tx |

If p50 or p99 is significantly above the range, check the diagnostics block below.

Which Metrics to Watch

New RM-0.6.4.0 Metrics (WAL Commit Path)

curl -sf http://127.0.0.1:9898/metrics | rg "wal_commit|wal_durability|wal_barrier"

| Metric | Meaning |
|---|---|
| angarabase_wal_commit_fsync_total | Number of WAL writer fsync calls (growth = active sync) |
| angarabase_wal_durability_epoch | Monotonic counter of durability barrier epochs |
| angarabase_wal_barrier_wait_total | Number of transactions that waited for the durability barrier |
| angarabase_wal_barrier_duration_seconds | Histogram of barrier wait time |

Baseline Metrics (group commit / write path)

curl -sf http://127.0.0.1:9898/metrics | rg "write_path_phase_b|group_commit|transaction_log"

| Metric | Meaning |
|---|---|
| angarabase_write_path_phase_b_duration_seconds | Phase B histogram (commit hot path) |
| angarabase_write_path_phase_b_timeout_total | Phase B timeouts; should be low |
| angarabase_group_commit_batches_total | Number of pump batches |
| angarabase_group_commit_batch_size | Batch-size distribution |
| angarabase_transaction_log_group_commit_pumps_total | Number of pump runs |
| angarabase_transaction_log_group_commit_pump_duration_ms | Duration of one pump |

Quick Diagnostics

# New WAL metrics (RM-0.6.4.0)
curl -sf http://127.0.0.1:9898/metrics | rg "wal_(commit|durability|barrier)"
# Group commit baseline
curl -sf http://127.0.0.1:9898/metrics | rg "write_path_phase_b|group_commit|transaction_log_group_commit"
# I/O correlation
iostat -xm 1 5

If iostat shows high await/util and *_pump_duration_ms and p99 COMMIT grow at the same time, the problem is almost always in the I/O layer.

With sync_at_commit: if angarabase_wal_commit_fsync_total grows in proportion to the tx rate (about one fsync per COMMIT), the mode is working correctly. If fsync growth is out of proportion to the tx rate, check angarabase_wal_barrier_duration_seconds for stalls.
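To put a number on the fsync growth, the counter can be sampled twice and the difference compared with the number of COMMITs issued in the same window. A minimal sketch, assuming the metrics endpoint serves the standard Prometheus text format; the helper names metric_sum and fsyncs_over_interval are hypothetical, and the COMMIT count has to come from your own load generator or application logs, since this runbook does not name a tx-rate metric:

```shell
# Sum every series of one metric from Prometheus text read on stdin.
# Matches both "name value" and "name{labels} value" lines.
metric_sum() {
  awk -v name="$1" '
    $1 == name || index($1, name "{") == 1 { sum += $NF; found = 1 }
    END { if (found) printf "%.0f\n", sum; else exit 1 }
  '
}

# Print how many WAL fsyncs happened during the given number of seconds.
fsyncs_over_interval() {
  local url="http://127.0.0.1:9898/metrics" secs="${1:-10}" before after
  before=$(curl -sf "$url" | metric_sum angarabase_wal_commit_fsync_total) || return 1
  sleep "$secs"
  after=$(curl -sf "$url" | metric_sum angarabase_wal_commit_fsync_total) || return 1
  echo $((after - before))
}
```

Calling fsyncs_over_interval 10 while the workload runs prints the fsync count for a 10-second window. Under sync_at_commit that figure should sit close to the number of COMMITs in the window; under group_commit it is expected to be lower, because the pump coalesces commits into batches.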

Tuning Order

  1. Confirm the durability mode (ANGARABASE_TRANSACTION_LOG_DURABILITY) and target SLA.
  2. For sync_at_commit: make sure WAL files are on NVMe / a separate spindle.
  3. Check whether the workload is bursty: compare the tx rate with the batch-size histogram.
  4. For production, stabilize disk first, then tune group_commit_interval_ms.
  5. For bench/dev, relaxed is allowed, but record it in the report.
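For step 3, the mean batch size is a quick first read of the batch-size histogram. A minimal sketch, assuming angarabase_group_commit_batch_size is exposed as a Prometheus histogram with unlabelled _sum and _count series; the helper name avg_from_histogram is hypothetical:

```shell
# Print the mean observation of a Prometheus histogram read from stdin,
# computed as <name>_sum / <name>_count.
avg_from_histogram() {
  awk -v s="${1}_sum" -v c="${1}_count" '
    $1 == s { sum = $2 }
    $1 == c { n = $2 }
    END {
      if (n == "" || n == 0) exit 1   # metric missing or no observations yet
      printf "%.2f\n", sum / n
    }
  '
}

# Usage (not executed here):
#   curl -sf http://127.0.0.1:9898/metrics | avg_from_histogram angarabase_group_commit_batch_size
```

The value is a lifetime average since process start. A mean close to 1 means most batches carry a single commit, so the pump has little to coalesce; a mean that rises with tx rate points to bursts being absorbed by batching.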

DML Coverage Check

For triage, it is useful to confirm that the latency anomaly is not masking a regression:

  • INSERT INTO t(...) VALUES (..., now()) — should succeed.
  • UPDATE t SET x = x + 1 WHERE ... — the expression should be applied.
  • UPDATE/DELETE in autocommit and in txn should return the correct row count.

Escalation

  • If fsync=true and p99 COMMIT > 200 ms for more than 10 minutes, escalate as a durability-risk incident.
  • If angarabase_wal_barrier_duration_seconds p99 > 50 ms with sync_at_commit, check for an I/O stall.
  • If errors_total > 0 in TPC-B-lite smoke, stop making performance claims until correctness is fixed.
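The barrier threshold in the second bullet can be checked without a full quantile calculation: p99 lies above a bucket bound exactly when fewer than 99% of observations fall at or below that bound. A minimal sketch, assuming the histogram has an unlabelled bucket with le="0.05" (50 ms), which this runbook does not confirm; the helper name p99_above_bucket is hypothetical:

```shell
# Read Prometheus text on stdin. Exit 0 if fewer than 99% of observations
# fall in the given bucket (p99 is above the bound), 1 if p99 is within
# the bound, 2 if the bucket or count series is missing or empty.
p99_above_bucket() {
  local hist="$1" le="$2"
  awk -v b="${hist}_bucket{le=\"${le}\"}" -v c="${hist}_count" '
    $1 == b { within = $2 }
    $1 == c { total = $2 }
    END {
      if (total == "" || within == "" || total == 0) exit 2
      if (within / total < 0.99) exit 0
      exit 1
    }
  '
}

# Usage (not executed here):
#   curl -sf http://127.0.0.1:9898/metrics \
#     | p99_above_bucket angarabase_wal_barrier_duration_seconds 0.05 \
#     && echo "barrier p99 above 50 ms: check for an I/O stall"
```

Histogram buckets are cumulative since process start, so this tests the lifetime p99. For a time-bounded check, take two scrapes and apply the same ratio to the differences of the bucket and count values.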