Runbook: `ReplicationLag`

Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7.

What It Means

Replication lag (angarabase_replication_lag_bytes, or the equivalent metric in seconds) exceeds 10 seconds. The replica is behind the primary, so reads from the replica can return stale data.

Severity

warning. Once lag exceeds 60 seconds, there is a risk of data loss during failover.
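
The two thresholds in this runbook (10 seconds for the alert, 60 seconds for failover risk) can be expressed as a small classifier. This is a minimal sketch: the function name `classify_lag` and the labels it returns are illustrative assumptions, not names taken from angarabase_alerts.yaml.

```python
WARNING_THRESHOLD_S = 10.0        # alert threshold stated in this runbook
FAILOVER_RISK_THRESHOLD_S = 60.0  # above this, failover risks data loss


def classify_lag(lag_seconds: float) -> str:
    """Map a replication lag in seconds to a label (labels are hypothetical)."""
    if lag_seconds > FAILOVER_RISK_THRESHOLD_S:
        return "failover-risk"
    if lag_seconds > WARNING_THRESHOLD_S:
        return "warning"
    return "ok"
```

Both comparisons are strict, matching the "> 10 seconds" and "> 60 seconds" wording used here.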

Initial response

Grafana Overview v2 → row “Replication”.
On primary: check slot status / sender backpressure.
On replica: check apply rate / disk space / network bandwidth.

Diagnostics

# Primary
curl -sf http://primary:9898/metrics | rg replication

# Replica
curl -sf http://replica:9898/metrics | rg replication

# Apply lag on the replica (returned as an interval)
psql -h replica -c "SELECT now() - pg_last_xact_replay_timestamp() AS apply_lag;"

See also replication-v2.md §Diagnostics.
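
To compare primary and replica values side by side instead of reading raw `curl` output, the same filtering can be scripted. The sketch below assumes the endpoints serve the standard Prometheus text exposition format; the helper names `parse_replication_metrics` and `fetch_metrics` are hypothetical, and only the host and port come from the commands above.

```python
import urllib.request


def parse_replication_metrics(text: str) -> dict[str, float]:
    """Return samples whose metric name contains 'replication'.

    Assumes Prometheus text format: '<name>{labels} <value> [timestamp]'.
    """
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            cut = line.rindex("}") + 1
            series, rest = line[:cut], line[cut:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        name = series.split("{", 1)[0]
        if "replication" not in name or not rest:
            continue
        try:
            samples[series] = float(rest[0])
        except ValueError:
            continue  # ignore malformed values
    return samples


def fetch_metrics(url: str, timeout_s: float = 5.0) -> str:
    """Fetch the raw metrics page, e.g. http://primary:9898/metrics."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")
```

Typical use would be `parse_replication_metrics(fetch_metrics("http://replica:9898/metrics"))`, run once per host.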

Mitigation

| Cause | Action |
| --- | --- |
| Network | Check bandwidth, RTT, and packet loss between primary and replica |
| Replica slower than primary | Upgrade replica hardware (SSD, CPU, RAM) |
| Large slot backlog | Drop the inactive slot to free the backlog (risky) |
| Apply bottleneck (single-threaded) | See replication-v2.md §Tuning |
| Competing GC on replica | Reduce query load on the replica |

Escalation

If lag exceeds 60 seconds and keeps growing for more than 15 minutes, assess split-brain risk during failover and prepare a recovery plan.
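
The escalation condition can be checked against a series of lag samples. The sketch below is one possible reading, stated as an assumption: "grows" is taken to mean each sample is strictly higher than the previous one, and the run must stay above 60 seconds for more than 15 minutes. The function name `should_escalate` and the sample format are hypothetical.

```python
def should_escalate(samples, threshold_s=60.0, window_s=15 * 60):
    """Decide whether lag has been high and growing for longer than window_s.

    samples: list of (unix_time_s, lag_s) tuples ordered by time.
    """
    if not samples:
        return False
    run_start = None  # timestamp where the current high-and-growing run began
    prev_lag = None
    for ts, lag in samples:
        if lag <= threshold_s:
            run_start = None   # back under the threshold: reset the run
        elif run_start is None or lag <= prev_lag:
            run_start = ts     # first high sample, or lag stopped growing: restart
        prev_lag = lag
    last_ts = samples[-1][0]
    return run_start is not None and last_ts - run_start > window_s
```

A looser reading (for example, comparing only the first and last sample in the window) would tolerate brief dips; which one fits depends on how noisy the lag metric is.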
