Runbook: AngarabaseDown
Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7 (Prometheus Alert Rules v0).
What It Means
Prometheus has been unable to scrape the angarabase target (as reported by up{job="angarabase"}) for more than 30 seconds.
Either the server has crashed, it is not responding on /metrics, or the network path between Prometheus and the instance is broken.
Severity
critical. Affects service availability for all clients.
Initial response (5 minutes)
# 1. Check the process
systemctl status angarabase-server # or your service manager
ps -ef | grep '[a]ngarabase-server' # the brackets keep grep from matching its own process
# 2. Check the port
ss -ltnp | grep -E ':(5432|9898)'
# 3. Fetch metrics directly from the host
curl -sf http://127.0.0.1:9898/metrics | head -5
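The three checks above map onto the three causes listed under What It Means. A minimal sketch of that mapping follows. The function names triage and run_triage and the message wording are illustrative assumptions, not part of the angarabase tooling, and the port 9898 is taken from the curl command above.

```shell
# Map three check results (0 = passed, non-zero = failed) to a likely cause.
# Arguments: $1 process check, $2 port check, $3 /metrics check.
# The messages are illustrative; adapt them to your own escalation notes.
triage() {
  if [ "$1" -ne 0 ]; then
    echo "process not running: likely crash, see crash diagnostics"
  elif [ "$2" -ne 0 ]; then
    echo "process running but port closed: check the server log"
  elif [ "$3" -ne 0 ]; then
    echo "port open but /metrics failing: server not responding"
  else
    echo "host healthy: check network path from Prometheus"
  fi
}

# Run the three checks from this runbook and pass their exit codes to triage.
# Not called automatically; invoke run_triage on the affected host.
run_triage() {
  systemctl is-active --quiet angarabase-server; proc=$?
  ss -ltn | grep -q ':9898 '; port=$?
  curl -sf -o /dev/null --max-time 5 http://127.0.0.1:9898/metrics; metrics=$?
  triage "$proc" "$port" "$metrics"
}
```

If all three checks pass on the host while the alert is still firing, the remaining suspect is the path between Prometheus and the instance.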
Diagnostics
- Server log: journalctl -u angarabase-server -n 200 (or your log path).
- Crash diagnostics (RM-0.6.5.6):
  - Panic hook: on crash, the server writes [PANIC] thread='...' message='...' backtrace: to stderr (usually redirected to wrapper.log). Read the backtrace to find the cause.
  - Supervisor crash log: manage.sh writes [CRASH] pid=N exit_code=M to wrapper.log. This line confirms that the process crashed under supervisor control.

  Commands for quick diagnostics:

      # Find the latest panic with backtrace (20 lines of context after each match):
      grep -A 20 "\[PANIC\]" artifacts/golden_db/logs/wrapper.log | tail -40

      # Find all crash events with exit codes:
      grep "\[CRASH\]" artifacts/golden_db/logs/wrapper.log | tail -10
      # Example output: [CRASH] pid=18073 exit_code=101 timestamp=2026-05-07T07:03:57Z

      # Show the 5 log lines preceding each crash or panic event (last 30 lines of output):
      grep -B 5 "\[CRASH\]\|\[PANIC\]" artifacts/golden_db/logs/wrapper.log | tail -30
- Lease: see crash-recovery.md if the server failed because of ResourceBusy (PID file / lease).
- Network: ss -s, iptables -L -n; check the firewall between Prometheus and the instance.
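When filing an incident it can help to pull the PID and exit code out of the most recent [CRASH] line rather than copying them by hand. The sketch below assumes the line format shown in this runbook ([CRASH] pid=N exit_code=M, optionally followed by other fields); last_crash is a hypothetical helper name, not part of manage.sh.

```shell
# Print "pid exit_code" for the most recent [CRASH] line in a log file.
# $1: path to wrapper.log (use the path from your deployment).
# Prints nothing if the file contains no [CRASH] line.
last_crash() {
  grep -F '[CRASH]' "$1" | tail -n 1 |
    sed -n 's/.*pid=\([0-9][0-9]*\) exit_code=\([0-9][0-9]*\).*/\1 \2/p'
}
```

For example, last_crash artifacts/golden_db/logs/wrapper.log prints the two numbers separated by a space, which can be pasted directly into the escalation ticket.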
Mitigation
| Scenario | Action |
|---|---|
| Process crashed | systemctl restart angarabase-server + collect a crash dump |
| Lease stuck | ANGARABASE_FORCE_LEASE_TAKEOVER=1 + restart (see troubleshooting.md) |
| Network | Check firewall, route, DNS |
| /metrics overloaded | Increase scrape_interval (scrape less often); check scrape_timeout in Prometheus |
Escalation
If the service is still down 10 minutes after a restart, collect a diagnostics bundle and escalate through the support flow.
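The 10-minute window can be enforced with a small polling loop instead of watching the clock. This is a sketch only: wait_until is a hypothetical helper, and the 10-second polling interval in the usage note is an assumption, not a value from the alert rules.

```shell
# Retry a command until it succeeds or the timeout elapses.
# $1: timeout in seconds, $2: polling interval in seconds, rest: command to run.
# Returns 0 as soon as the command succeeds, 1 once the timeout is reached.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}
```

After a restart, wait_until 600 10 curl -sf -o /dev/null http://127.0.0.1:9898/metrics returns 0 once /metrics responds, or 1 after roughly 10 minutes, at which point escalation is due.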