AngaraReplica v2 Operations Guide
A short operator guide for streaming replication v2.
Canonical source: this runbook in angarabook/src/operations/.
Topology and scope
- 1 primary + up to 8 standbys (async replication).
- A standby runs in read-only mode (writes fail with SQLSTATE 25006).
- Promote is performed manually (auto-failover comes in the next major line).
Configuration baseline
Primary:
- [replication].role = "primary"
- listen_addr
- wal_retention_segments
Standby:
- [replication].role = "standby"
- primary_addr
- slot_name
- wal_path
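As a rough illustration, the key lists above can be turned into a pre-start check. The sketch below assumes that all listed keys live in the same `[replication]` table and that the table has already been parsed into a dictionary. The function name and every value in the example are hypothetical placeholders, not AngaraReplica defaults.

```python
# Required keys per role, taken from the configuration baseline above.
REQUIRED_KEYS = {
    "primary": ("listen_addr", "wal_retention_segments"),
    "standby": ("primary_addr", "slot_name", "wal_path"),
}


def missing_replication_keys(replication: dict) -> list:
    """Return the required keys that are absent from a parsed [replication] table."""
    role = replication.get("role")
    if role not in REQUIRED_KEYS:
        # Without a valid role we cannot tell which keys are required.
        return ["role"]
    return [key for key in REQUIRED_KEYS[role] if key not in replication]


# Hypothetical standby table; every value is a placeholder.
example_standby = {
    "role": "standby",
    "primary_addr": "primary.example.internal:5433",
    "slot_name": "standby_1",
}

print(missing_replication_keys(example_standby))  # ['wal_path']
```

Running such a check before the node starts surfaces a missing key as a clear message rather than as a failed connection later.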
Operations flow
- Start the primary.
- Start the standby and check the lag metrics.
- Monitor replication lag / reconnects / slots.
Promote (manual failover)
- Promote must complete via the sync-checkpoint handshake.
- The promote timeout is fail-closed (the standby does not accept writes if the handshake has not completed).
- Lease-based fencing reduces the risk of split-brain but is not a full replacement for STONITH/Raft.
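The fail-closed rule can be modelled in a few lines. This is a minimal sketch of the behaviour described above, not AngaraReplica's implementation: `promote`, `PromoteResult`, the `handshake_done` callback, and the polling interval are all hypothetical names introduced for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromoteResult:
    promoted: bool        # True only if the handshake completed before the deadline
    accepts_writes: bool  # never True when the promote timed out


def promote(handshake_done: Callable[[], bool],
            timeout_s: float,
            poll_interval_s: float = 0.01) -> PromoteResult:
    """Poll for handshake completion; stay read-only if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if handshake_done():
            return PromoteResult(promoted=True, accepts_writes=True)
        time.sleep(poll_interval_s)
    # Fail closed: on timeout the node remains a read-only standby.
    return PromoteResult(promoted=False, accepts_writes=False)
```

The point of the design is that the timeout path has no branch that enables writes, so a stalled handshake cannot produce a second writable node.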
Key monitoring signals
- angara_node_is_standby
- angara_replication_lag_bytes
- angara_replication_lag_ms
- angara_replication_reconnects_total
- angara_promote_total
- angara_promote_duration_ms_last
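A lag check over these signals might look like the sketch below. It assumes the metrics are exposed as unlabelled samples in the Prometheus text exposition format, which the metric names suggest but this guide does not state. The thresholds and the sample scrape are hypothetical.

```python
def parse_metrics(text: str) -> dict:
    """Parse unlabelled `name value` lines, skipping comments and blank lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, rest = line.partition(" ")
        # Ignore an optional trailing timestamp after the value.
        metrics[name] = float(rest.split()[0])
    return metrics


# Hypothetical alert thresholds; tune them to your own RPO targets.
THRESHOLDS = {
    "angara_replication_lag_bytes": 64 * 1024 * 1024,
    "angara_replication_lag_ms": 5_000,
}


def lag_alerts(metrics: dict) -> list:
    """Return the names of lag metrics that exceed their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]


# Hypothetical scrape output.
SAMPLE = """\
# TYPE angara_replication_lag_bytes gauge
angara_replication_lag_bytes 134217728
angara_replication_lag_ms 1200
angara_replication_reconnects_total 3
"""

print(lag_alerts(parse_metrics(SAMPLE)))  # ['angara_replication_lag_bytes']
```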
Typical incidents
- Standby does not connect: check address/port/firewall/reconnects.
- WAL segment gone: a base backup and a standby restart are required.
- Promote timeout: check the network and the WAL write path on the primary.
Next
- Disaster recovery playbook: DR scenarios built on top of replication.
- Backup and restore (operator-level): how replication complements (but does not replace) backup.
- Operational policies baseline: the SLA/RTO/RPO agreements within which replication v2 operates.
Security Context Propagation
Starting with RM-0.6.7.0, the security context (including tenant_id) is automatically propagated through the WAL replication stream.
Key Features
- Tenant Isolation: The tenant_id is embedded in WAL records, ensuring that standby nodes maintain the same multi-tenancy boundaries as the primary.
- Integrity Verification: Replication tokens are protected by CRC32C checksums.
- Fail-Closed Security: If a tampered or invalid token is detected during replication, the connection is immediately terminated to prevent unauthorized data access.
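To make the integrity check concrete, here is a small sketch of CRC32C verification with fail-closed handling. The CRC32C routine is the standard Castagnoli checksum. The token layout (payload followed by a 4-byte little-endian checksum) and the names `seal_token`, `verify_token`, and `InvalidToken` are assumptions for illustration; the actual wire format is not described in this guide.

```python
import struct

CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (slow but dependency-free)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


class InvalidToken(Exception):
    """Raised when a token fails verification; the caller should drop the connection."""


def seal_token(payload: bytes) -> bytes:
    # Hypothetical layout: payload followed by its CRC32C, little-endian.
    return payload + struct.pack("<I", crc32c(payload))


def verify_token(token: bytes) -> bytes:
    """Return the payload if the checksum matches, otherwise raise InvalidToken."""
    if len(token) < 4:
        raise InvalidToken("token too short")
    payload = token[:-4]
    (expected,) = struct.unpack("<I", token[-4:])
    if crc32c(payload) != expected:
        raise InvalidToken("checksum mismatch")
    return payload
```

A receiver following the fail-closed rule would treat `InvalidToken` as a reason to terminate the replication connection instead of skipping the record and continuing.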