Alert Runbooks

Operator-facing runbooks for each alert rule from tools/observability/alerts/angarabase_alerts.yaml (RM-0.6.3.8 S7). Each alert contains annotations.runbook_url with a link to one of the pages below — this is the binding between the observability surface and the operator remediation path.

Repo-reproducibility contract (G2-FIX cycle 2 / F-DOC-1): for each runbook_url in the alert YAML there is a backing markdown file in this directory. Verifier:

python3 - <<'PY'
import re, pathlib
rules = pathlib.Path("tools/observability/alerts/angarabase_alerts.yaml").read_text()
slugs = re.findall(r"runbooks/([a-z0-9-]+)", rules)
root = pathlib.Path("angarabook/src/operations/runbooks")
missing = [s for s in slugs if not (root / f"{s}.md").exists()]
print("OK" if not missing else f"MISSING: {missing}")
PY

By Alert Rule

| Alert | Severity | Runbook |
| --- | --- | --- |
| AngarabaseDown | critical | angarabase-down.md |
| HighP99Latency | warning | high-p99-latency.md |
| HighSlowQueryRatio | warning | high-slow-query-ratio.md |
| BufferPoolPressure | warning | buffer-pool-pressure.md |
| WALFsyncSlow | warning | wal-fsync-slow.md |
| DeadlockSpike | critical | deadlock-spike.md |
| LongTransaction | warning | long-transaction.md |
| GCBloatHigh | warning | gc-bloat-high.md |
| ReplicationLag | warning | replication-lag.md |
| IndexRoutingLegacyFallback | warning | index-routing-legacy-fallback.md |

URL Convention

The production angarabook deployment maps /operations/runbooks/<slug> → angarabook/src/operations/runbooks/<slug>.md. If your build uses a different layout, update runbook_url in the alert YAML accordingly (the source of truth is the alert file, not the runbooks themselves).
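
The mapping above can be sketched in a few lines of Python. This is an illustration under assumptions, not part of the repository: the function name runbook_source_path and the example host are hypothetical, and the slug pattern is borrowed from the verifier script earlier on this page.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Source directory named in the URL convention above.
RUNBOOK_ROOT = PurePosixPath("angarabook/src/operations/runbooks")


def runbook_source_path(runbook_url: str) -> PurePosixPath:
    """Map a runbook_url to the markdown file expected to back it."""
    path = urlparse(runbook_url).path
    # Same slug alphabet as the verifier: lowercase letters, digits, hyphens.
    match = re.search(r"/operations/runbooks/([a-z0-9-]+)", path)
    if match is None:
        raise ValueError(f"not a runbook URL: {runbook_url!r}")
    return RUNBOOK_ROOT / f"{match.group(1)}.md"


# Hypothetical host, for illustration only.
print(runbook_source_path("https://docs.example.invalid/operations/runbooks/wal-fsync-slow"))
```

Because the slug match stops at the first character outside the slug alphabet, a URL with a trailing extension such as .html resolves to the same source file. A deployment with a different layout would need a different RUNBOOK_ROOT and URL prefix.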

New Runbook Page Template

Each runbook page contains:

  1. What it means (required) — short explanation of alert semantics + PromQL link.
  2. Severity — critical / warning / info.
  3. Initial response (≤ 5 minutes) — what to do right now.
  4. Diagnostics — concrete commands (curl, psql, iostat, …).
  5. Mitigation — “symptom → action” table.
  6. Escalation — when and how to escalate.
  7. Related — links to adjacent runbooks and reference docs.
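
A new page can be checked against this template with a short script. The sketch below assumes each section appears as a markdown heading whose text starts with the section title; that heading convention, the function name missing_sections, and the sample page are assumptions made for illustration, so adjust the matching if the real pages are laid out differently.

```python
import re

# Section titles from the template list above, in order.
REQUIRED_SECTIONS = [
    "What it means",
    "Severity",
    "Initial response",
    "Diagnostics",
    "Mitigation",
    "Escalation",
    "Related",
]


def missing_sections(markdown_text: str) -> list[str]:
    """Return template sections that have no matching heading in the page."""
    # Assumption: sections are markdown headings ("#" to "######") whose text
    # starts with the section title, compared case-insensitively.
    headings = [
        m.group(1).strip().lower()
        for m in re.finditer(r"^#{1,6}\s+(.+)$", markdown_text, flags=re.MULTILINE)
    ]
    return [
        title
        for title in REQUIRED_SECTIONS
        if not any(h.startswith(title.lower()) for h in headings)
    ]


# Hypothetical page that stops after the Mitigation section.
sample = """# WALFsyncSlow
## What it means
## Severity
## Initial response (≤ 5 minutes)
## Diagnostics
## Mitigation
"""
print(missing_sections(sample))  # ['Escalation', 'Related']
```

Only "What it means" is marked as required in the list above, so a gap in any other section is probably better treated as a review warning than as a hard failure.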