Runbook: `AngarabaseDown`

Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7 (Prometheus Alert Rules v0).

What It Means

Prometheus has not received a response from the up{job="angarabase"} target for more than 30 seconds. The server either crashed, does not respond on /metrics, or the network path between Prometheus and the instance is broken.

Severity

critical. Affects service availability for all clients.

Initial response (5 minutes)

# 1. Check the process
systemctl status angarabase-server   # or your service manager
ps -ef | grep angarabase-server

# 2. Check the port
ss -ltnp | grep -E ':(5432|9898)'

# 3. Fetch metrics directly from the host
curl -sf http://127.0.0.1:9898/metrics | head -5

Diagnostics

Server log: journalctl -u angarabase-server -n 200 (or your log path).

Crash diagnostics (RM-0.6.5.6):

Panic hook: on crash, the server writes [PANIC] thread='...' message='...' backtrace: to stderr (usually redirected to wrapper.log). Look for the backtrace to understand the cause.
Supervisor crash log: manage.sh writes [CRASH] pid=N exit_code=M to wrapper.log. This line confirms that the process crashed under supervisor control.

Commands for quick diagnostics:

# Find the latest panic with backtrace (show 20 context lines):
grep -A 20 "\[PANIC\]" artifacts/golden_db/logs/wrapper.log | tail -40

# Find all crash events with exit codes:
grep "\[CRASH\]" artifacts/golden_db/logs/wrapper.log | tail -10
# Example output: [CRASH] pid=18073 exit_code=101 timestamp=2026-05-07T07:03:57Z

# Check the last 50 server-log lines before the crash:
grep -B 5 "\[CRASH\]\|\[PANIC\]" artifacts/golden_db/logs/wrapper.log | tail -30

Lease: see crash-recovery.md if the server failed because of ResourceBusy (PID file / lease).
Network: ss -s, iptables -L -n, check the firewall between Prometheus and the instance.

Mitigation

Scenario	Action
Process crashed	`systemctl restart angarabase-server` + collect a crash dump
Lease stuck	`ANGARABASE_FORCE_LEASE_TAKEOVER=1` + restart (see troubleshooting.md)
Network	Check firewall, route, DNS
`/metrics` overloaded	Lower `scrape_interval`; check timeouts in Prometheus

Escalation

If restart does not help for more than 10 minutes, collect a diagnostics bundle and escalate through the support flow.

Keyboard shortcuts

AngaraBook