Runbook: ReplicationLag
Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7.
What It Means
Replica lag exceeds 10 seconds. The alert is driven by angarabase_replication_lag_bytes or an equivalent metric expressed in seconds.
The replica is behind the primary, so reads served from the replica return stale data.
Severity
warning. Once lag exceeds 60 seconds, a failover to this replica risks losing data that has not yet been replicated.
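The two thresholds above can be turned into a quick triage helper. This is a minimal sketch, not part of the alert definition: the function name `classify_lag` and the band labels are hypothetical, and only the 10-second and 60-second limits come from this runbook.

```bash
# Hypothetical helper: map an apply lag in seconds to the runbook's bands.
# The 10 s and 60 s limits are taken from this runbook; the labels are
# illustrative only. Fractional seconds are accepted.
classify_lag() {
  awk -v lag="$1" 'BEGIN {
    if (lag == "")         print "unknown"        # no value, e.g. nothing replayed yet
    else if (lag + 0 > 60) print "failover-risk"  # above the data-loss threshold
    else if (lag + 0 > 10) print "warning"        # above the alert threshold
    else                   print "ok"
  }'
}
```

Assuming the replica accepts the same psql connection used in Diagnostics, the input could come from `psql -h replica -Atc "SELECT extract(epoch FROM now() - pg_last_xact_replay_timestamp());"`, passed as `classify_lag "$value"`.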
Initial response
- Grafana Overview v2 → row “Replication”.
- On primary: check slot status / sender backpressure.
- On replica: check apply rate / disk space / network bandwidth.
Diagnostics
# Primary
curl -sf http://primary:9898/metrics | rg replication
# Replica
curl -sf http://replica:9898/metrics | rg replication
# Apply lag on the replica (returned as an interval, not a number of seconds)
psql -h replica -c "SELECT now() - pg_last_xact_replay_timestamp() AS apply_lag;"
See also replication-v2.md §Diagnostics.
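When comparing the two endpoints side by side, it can help to strip the metrics output down to sample lines only. The sketch below assumes the `/metrics` endpoint emits Prometheus-style text (comment lines starting with `#`, then `name value` samples), that label values contain no spaces, and that samples carry no trailing timestamp. The function name `replication_samples` is hypothetical.

```bash
# Hypothetical filter: read metrics text on stdin and print "<series> <value>"
# for each sample whose name contains "replication".
# Assumes Prometheus-style exposition text with no spaces inside label values
# and no trailing timestamps; # HELP and # TYPE lines are skipped.
replication_samples() {
  awk '!/^#/ && $1 ~ /replication/ { print $1, $NF }'
}
```

One possible use is `curl -sf http://primary:9898/metrics | replication_samples`, repeated for the replica, so that the two outputs can be diffed directly.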
Mitigation
| Cause | Action |
|---|---|
| Network | Check bandwidth, RTT, packet loss between primary and replica |
| Replica slower than primary | Upgrade hardware (SSD, CPU, RAM) on replica |
| Large slot backlog | Drop the inactive slot to free the backlog (risky: confirm the slot is truly unused first) |
| Apply bottleneck (single-threaded) | See replication-v2.md §Tuning |
| Competing GC on replica | Reduce query load on replica |
Escalation
If lag exceeds 60 seconds and keeps growing for more than 15 minutes, assess the split-brain risk of a failover and prepare a recovery plan.
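The escalation condition can be checked mechanically against a series of lag readings. The sketch below is one possible interpretation, not the alert's actual logic: it treats "grows" as "every sample in the window is above 60 seconds and the newest sample is higher than the oldest". The function name `should_escalate` is hypothetical, and collecting samples that span 15 minutes is left to the caller.

```bash
# Hypothetical check for the escalation rule.
# stdin: one lag sample in seconds per line, oldest first, covering the
# last 15 minutes. Exit status 0 means "escalate", 1 means "do not".
should_escalate() {
  awk '
    NF == 0 { next }                 # ignore blank lines
    {
      n++
      if (n == 1) first = $1         # oldest sample in the window
      last = $1                      # newest sample seen so far
      if ($1 + 0 <= 60) below = 1    # a sample at or under the 60 s threshold
    }
    END { exit !(n >= 2 && !below && last + 0 > first + 0) }
  '
}
```

For example, `printf '61\n75\n90\n' | should_escalate && echo escalate` prints `escalate`, while a window containing any reading at or below 60 seconds does not.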