Runbook: AngarabaseDown

Source of truth: tools/observability/alerts/angarabase_alerts.yaml. Backed by: RM-0.6.3.8 S7 (Prometheus Alert Rules v0).

What It Means

Prometheus has failed to scrape the angarabase target (up{job="angarabase"} reports 0) for more than 30 seconds. Either the server has crashed, it is not responding on /metrics, or the network path between Prometheus and the instance is broken.

Severity

critical. Affects service availability for all clients.

Initial response (5 minutes)

# 1. Check the process
systemctl status angarabase-server   # or your service manager
ps -ef | grep '[a]ngarabase-server'  # the [a] keeps grep from matching its own process

# 2. Check the port
ss -ltnp | grep -E ':(5432|9898)'

# 3. Fetch metrics directly from the host
curl -sf http://127.0.0.1:9898/metrics | head -5
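
The three checks above map onto the three causes listed under "What It Means". The following is a minimal sketch of that mapping, not part of the existing tooling: the triage and run_triage functions and their messages are hypothetical, and the sketch assumes the systemd unit name and the metrics port 9898 used in the commands above.

```shell
#!/bin/sh
# Hypothetical triage helper: maps the results of the three checks
# (process, port, local /metrics) to the most likely cause.
# Each argument is an exit status: 0 = check passed, non-zero = check failed.
triage() {
    proc=$1; port=$2; metrics=$3
    if [ "$proc" -ne 0 ]; then
        echo "process not running: likely crash, go to Diagnostics"
    elif [ "$port" -ne 0 ]; then
        echo "process running but metrics port not listening"
    elif [ "$metrics" -ne 0 ]; then
        echo "port open but /metrics not responding"
    else
        echo "healthy locally: suspect the network path from Prometheus"
    fi
}

# Runs the three checks on the affected host and feeds the results to triage.
# Assumes a systemd-managed unit and the metrics port from the commands above.
run_triage() {
    systemctl is-active --quiet angarabase-server; proc=$?
    ss -ltn | grep -q ':9898 '; port=$?
    curl -sf --max-time 5 http://127.0.0.1:9898/metrics >/dev/null; metrics=$?
    triage "$proc" "$port" "$metrics"
}
```

Calling run_triage on the host prints a single line naming the branch to follow. If everything passes locally, the problem is more likely on the Prometheus side or in the network between the two.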

Diagnostics

  • Server log: journalctl -u angarabase-server -n 200 (or your log path).

  • Crash diagnostics (RM-0.6.5.6):

    • Panic hook: on crash, the server writes [PANIC] thread='...' message='...' backtrace: to stderr (usually redirected to wrapper.log). Look for the backtrace to understand the cause.
    • Supervisor crash log: manage.sh writes [CRASH] pid=N exit_code=M to wrapper.log. This line confirms that the process crashed under supervisor control.

    Commands for quick diagnostics:

    # Find the latest panic with backtrace (show 20 context lines):
    grep -A 20 "\[PANIC\]" artifacts/golden_db/logs/wrapper.log | tail -40
    
    # Find all crash events with exit codes:
    grep "\[CRASH\]" artifacts/golden_db/logs/wrapper.log | tail -10
    # Example output: [CRASH] pid=18073 exit_code=101 timestamp=2026-05-07T07:03:57Z
    
    # Show the 5 log lines preceding each crash or panic marker (last 30 lines of output):
    grep -E -B 5 "\[(CRASH|PANIC)\]" artifacts/golden_db/logs/wrapper.log | tail -30
    
  • Lease: see crash-recovery.md if the server failed because of ResourceBusy (PID file / lease).

  • Network: ss -s, iptables -L -n, check the firewall between Prometheus and the instance.
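
When attaching crash details to an incident, it can help to pull the fields out of the most recent [CRASH] line instead of copying them by hand. The sketch below is one way to do that. The last_crash function is hypothetical, and it assumes the key=value layout shown in the example output above (pid, exit_code, and an optional timestamp).

```shell
#!/bin/sh
# Hypothetical helper: print "pid exit_code timestamp" for the most recent
# [CRASH] line in a wrapper log. Assumes space-separated key=value fields,
# as in: [CRASH] pid=18073 exit_code=101 timestamp=2026-05-07T07:03:57Z
# Prints nothing if the log contains no [CRASH] line.
last_crash() {
    grep -F '[CRASH]' "$1" | tail -n 1 | awk '{
        # Split every key=value field into an associative array.
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")
            if (n > 0) kv[substr($i, 1, n - 1)] = substr($i, n + 1)
        }
        if (NF > 0) print kv["pid"], kv["exit_code"], kv["timestamp"]
    }'
}

# Usage:
#   last_crash artifacts/golden_db/logs/wrapper.log
```

If the line carries no timestamp field, the third column is simply empty.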

Mitigation

Scenario              Action
Process crashed       systemctl restart angarabase-server + collect a crash dump
Lease stuck           Set ANGARABASE_FORCE_LEASE_TAKEOVER=1, then restart (see troubleshooting.md)
Network               Check firewall, route, DNS
/metrics overloaded   Increase scrape_interval (scrape less often); check scrape_timeout in Prometheus

Escalation

If the target is still down 10 minutes after a restart, collect a diagnostics bundle and escalate through the support flow.
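
The 10-minute window can be enforced with a small polling loop instead of watching the clock. The following is a minimal sketch under stated assumptions: wait_until is a hypothetical helper, and the probe in the usage comment reuses the local /metrics URL from the initial-response checks.

```shell
#!/bin/sh
# Hypothetical helper: retry a probe command until it succeeds or the
# deadline passes. Returns 0 on success, 1 on timeout.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Note: only sleep time is counted, so slow probes extend the real deadline.
wait_until() {
    timeout=$1; interval=$2; shift 2
    elapsed=0
    while :; do
        "$@" >/dev/null 2>&1 && return 0
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Usage after a restart (600 s = 10 minutes, probing every 10 s):
#   systemctl restart angarabase-server
#   if ! wait_until 600 10 curl -sf --max-time 5 http://127.0.0.1:9898/metrics; then
#       echo "still down after 10 minutes: collect a diagnostics bundle and escalate"
#   fi
```

Passing the probe as arguments keeps the loop independent of how health is checked, so the same helper works with a Prometheus-side check if one is preferred.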