Runbook: ReplicationLag

Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7.

What It Means

Replica lag exceeds 10 seconds (tracked as angarabase_replication_lag_bytes, or the equivalent metric expressed in seconds). The replica is behind the primary, so reads from the replica can return stale data.

Severity

warning. Once lag exceeds 60 seconds, there is a risk of data loss during failover.

Initial response

  1. Open Grafana Overview v2, row “Replication”.
  2. On primary: check slot status / sender backpressure.
  3. On replica: check apply rate / disk space / network bandwidth.
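
For steps 2 and 3 it can help to read the same metric from both nodes and compare the values. The helper below is a minimal sketch, not part of the AngaraBase tooling: the function name `metric_value` is hypothetical, and it assumes the `/metrics` endpoint serves the Prometheus text format without trailing timestamps.

```shell
#!/usr/bin/env bash
# metric_value NAME
# Reads Prometheus text exposition on stdin and prints the value of the
# first sample whose metric name is NAME. Labels, if present, are ignored.
# Assumption: samples carry no trailing timestamp, so the value is the last field.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      m = $1
      sub(/[{].*/, "", m)              # strip {label="..."} from the name
      if (m == name) { print $NF; exit }
    }'
}

# Usage (hosts and port as in the Diagnostics commands):
#   curl -sf http://primary:9898/metrics | metric_value angarabase_replication_lag_bytes
#   curl -sf http://replica:9898/metrics | metric_value angarabase_replication_lag_bytes
```

A large value on the primary alongside a healthy apply rate on the replica points at the sender or the network; the reverse points at the replica itself.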

Diagnostics

# Primary
curl -sf http://primary:9898/metrics | rg replication

# Replica
curl -sf http://replica:9898/metrics | rg replication

# Apply lag on the replica (interval since the last replayed transaction)
psql -h replica -c "SELECT now() - pg_last_xact_replay_timestamp() AS apply_lag;"

See also replication-v2.md §Diagnostics.
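
The apply-lag query returns an interval, which is awkward to compare with the 10-second and 60-second thresholds in this runbook. The sketch below converts it to seconds and classifies the result. The function names `replica_lag_seconds` and `classify_lag` and the label strings are hypothetical, and the query assumes the replica accepts the same PostgreSQL-style functions used in the command above.

```shell
#!/usr/bin/env bash
# Thresholds taken from this runbook: alert above 10 s, failover risk above 60 s.
WARN_S=10
CRIT_S=60

# classify_lag SECONDS -> prints ok | warning | failover-risk
# Fractions are truncated, so 10.9 is treated as 10 (still "ok").
classify_lag() {
  local lag_s="${1%.*}"     # drop the fractional part, e.g. 12.345 -> 12
  lag_s="${lag_s:-0}"       # empty input (no value returned) counts as 0
  if   [ "$lag_s" -gt "$CRIT_S" ]; then echo "failover-risk"
  elif [ "$lag_s" -gt "$WARN_S" ]; then echo "warning"
  else echo "ok"
  fi
}

# replica_lag_seconds HOST -> prints apply lag in seconds (needs psql access).
# -A/-t make psql print only the bare value.
replica_lag_seconds() {
  psql -h "$1" -At -c \
    "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0);"
}

# Usage:
#   classify_lag "$(replica_lag_seconds replica)"
```

In PostgreSQL, this time-based figure keeps growing while the primary is idle even if the replica is fully caught up, so it is worth cross-checking against the byte-based metric before acting on it alone.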

Mitigation

| Cause | Action |
|---|---|
| Network | Check bandwidth, RTT, and packet loss between primary and replica |
| Replica slower than primary | Upgrade hardware (SSD, CPU, RAM) on replica |
| Large slot backlog | Free it by dropping the inactive slot (risky) |
| Apply bottleneck (single-threaded) | See replication-v2.md §Tuning |
| Competing GC on replica | Reduce query load on replica |
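
For the network cause, packet loss can be measured from the primary with a plain `ping` run. The sketch below only extracts the loss percentage from the ping summary line; `packet_loss_pct` is a hypothetical helper name, and what counts as an acceptable loss rate is not defined by this runbook.

```shell
#!/usr/bin/env bash
# packet_loss_pct
# Reads `ping` output on stdin and prints the packet-loss percentage from the
# summary line (e.g. "10% packet loss" or "0.0% packet loss").
packet_loss_pct() {
  sed -n 's/.*[ ,]\([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Usage (hostname as in the Diagnostics commands):
#   ping -c 20 replica | packet_loss_pct
```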

Escalation

If lag stays above 60 seconds and keeps growing for more than 15 minutes, assess the split-brain risk of a failover and prepare a recovery plan.
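
The escalation condition can be checked mechanically from a series of lag samples. The sketch below is one possible reading of the rule, not the alert's actual logic: it assumes whole-second samples, oldest first, collected over more than 15 minutes, and treats "grows" as "the last sample is larger than the first". The function name `escalation_verdict` is hypothetical.

```shell
#!/usr/bin/env bash
CRIT_S=60   # seconds, from this runbook

# escalation_verdict S1 S2 ... Sn -> prints "escalate" or "hold"
# Prints "escalate" only when every sample exceeds CRIT_S and the newest
# sample is larger than the oldest one.
escalation_verdict() {
  local first="${1:-0}" last="${1:-0}" s
  if [ "$#" -lt 2 ]; then echo "hold"; return; fi   # a trend needs two points
  for s in "$@"; do
    if [ "$s" -le "$CRIT_S" ]; then echo "hold"; return; fi
    last="$s"
  done
  if [ "$last" -gt "$first" ]; then echo "escalate"; else echo "hold"; fi
}

# Usage, with samples taken every 5 minutes:
#   escalation_verdict 70 85 120 140
```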