Observability Metrics Reference
Full AngaraBase metrics reference with diagnostic routes and a quick reference card.
Canonical source: this runbook in angarabook/src/operations/.
Quick Reference Card (Top-10 for wallboard)
Print this and keep it near the on-call desk. These 10 metrics cover 80% of production incidents.
| # | Metric | Type | Normal range | What crossing the boundary means |
|---|---|---|---|---|
| 1 | angarabase_connections_active | gauge | < 80% max_pool | Connection leak / missing PgBouncer — check angara_stat_activity |
| 2 | angarabase_txn_rollback_total (rate 1m) | counter rate | < 5% of commit rate | Abnormal rollback rate — MVCC conflicts, deadlock, or application bugs |
| 3 | angarabase_storage_dirty_pages_total | gauge | < 10,000 pages | Checkpoint cannot keep up — lower write rate or reduce checkpoint interval |
| 4 | angarabase_checkpoint_errors_total (change) | counter | 0 | Checkpoint error = critical incident; inspect logs immediately |
| 5 | angarabase_transaction_log_flush_lsn vs durable_lsn (delta) | gauge | < 1 MB | Large gap = WAL durability lag; data-loss risk on crash |
| 6 | angarabase_query_exec_duration_ms_bucket P99 | histogram | < 100 ms | P99 degradation — check angara_stat_activity + EXPLAIN |
| 7 | angarabase_buffer_pool_miss_total (rate) | counter rate | miss ratio < 20% of lookups | Low cache hit ratio; increase buffer_pool_size_mb |
| 8 | angarabase_memory_rss_bytes | gauge | < soft_limit*0.9 | Approaching soft limit — OOM risk; check query patterns + GC |
| 9 | angarabase_qos_rejected_critical_total (rate) | counter rate | 0 | Any CRITICAL rejections = production incident candidate |
| 10 | angarabase_uptime_seconds | gauge | monotonically increasing | Value < 60 after a pause = unexpected restart / crash |
Full Metrics Reference
Connections and Sessions
| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
angarabase_connections_active | gauge | Active client connections | < max_pool * 0.8 | Check pool config, connection leaks |
angarabase_connections_accepted_total | counter | Total connections since startup | monotonic | Sudden rate spike — DDoS or reconnect storm |
angarabase_pgwire_active_tasks | gauge | Active pgwire spawn_blocking tasks | ≤ max_blocking_threads | Saturation of blocking runtime path |
angarabase_session_claims_set_total | counter | Session claims set operations (app.*) | — | Used for audit trail |
Connection diagnostics:
SELECT pid, state, consumer_id, wait_event FROM angara_stat_activity;
Transactions and MVCC
| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
angarabase_txn_begin_total | counter | Total BEGIN | — | Throughput baseline |
angarabase_txn_commit_total | counter | Total COMMIT | — | rate(1m) = TPS |
angarabase_txn_rollback_total | counter | Total ROLLBACK | < 5% of commit | Conflicts, application errors |
angarabase_txn_active_count | gauge | Transactions in flight | < 100 (OLTP) | Long txns — check txn_oldest_snapshot_age_seconds |
angarabase_txn_commit_conflicts_total | counter | MVCC conflicts | close to 0 | High rate = competing writes to the same rows |
angarabase_txn_oldest_snapshot_age_seconds | gauge | Age of oldest snapshot | < 60s | Long snapshot blocks GC → GC bloat |
angarabase_mvcc_history_versions_total | gauge | Versions in MVCC store | grows slowly | Fast growth = GC cannot keep up (see MVCC GC runbook) |
angarabase_txn_commit_epoch_current | gauge | Current commit epoch | monotonic | Does not change for > 30s under load = WAL issue |
PromQL — TPS:
rate(angarabase_txn_commit_total[1m])
PromQL — Conflict ratio:
rate(angarabase_txn_commit_conflicts_total[5m]) / rate(angarabase_txn_commit_total[5m])
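To make the arithmetic behind the two queries concrete, here is a minimal Python sketch that derives TPS and the conflict ratio from two raw counter samples. It is a simplification: the real PromQL rate() also extrapolates to the window edges, which this sketch does not do. The sample values and the helper names counter_rate and conflict_ratio are illustrative assumptions, not part of AngaraBase.

```python
def counter_rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second increase of a counter between two samples.

    A drop in the value is treated as a counter reset (process restart),
    so only the post-reset value is counted as increase.
    """
    increase = last - first if last >= first else last
    return increase / window_seconds


def conflict_ratio(conflict_rate: float, commit_rate: float) -> float:
    """Conflicts per commit; 0.0 when nothing is committing."""
    return conflict_rate / commit_rate if commit_rate > 0 else 0.0


# Hypothetical samples taken 300 s apart (a 5m window).
commit_rate = counter_rate(120_000, 150_000, 300)  # commits per second (TPS)
conflict_rate = counter_rate(40, 190, 300)         # conflicts per second
ratio = conflict_ratio(conflict_rate, commit_rate)
print(f"TPS={commit_rate:.1f} conflict_ratio={ratio:.4f}")
```

With these made-up numbers the ratio is 0.5%, which is still "close to 0" in the sense of the table above.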
WAL and durability
| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
angarabase_transaction_log_flush_lsn | gauge | LSN of last flush | monotonic | Growth stops = WAL writer hung |
angarabase_transaction_log_durable_lsn | gauge | LSN of last fsync | ≤ flush_lsn | gap > 1 MB = durability lag |
angarabase_transaction_log_last_checkpoint_id | gauge | ID of last checkpoint | monotonic | — |
angarabase_transaction_log_checkpoint_end_valid_total | counter | Successful checkpoint ends | monotonic | — |
angarabase_transaction_log_checkpoint_end_invalid_total | counter | Invalid checkpoint ends | 0 | > 0 = WAL corruption |
angarabase_wal_sync_wait_total | counter | WAL sync waits (strict mode) | — | rate grows = I/O latency |
angarabase_wal_group_commit_wait_total | counter | WAL group commit waits | — | rate grows = group commit backlog |
angarabase_transaction_log_bytes_appended_total | counter | Bytes written to WAL | — | WAL write throughput |
PromQL — WAL durability gap (bytes):
angarabase_transaction_log_flush_lsn - angarabase_transaction_log_durable_lsn
Storage and buffer pool
| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
angarabase_storage_dirty_pages_total | gauge | Dirty pages in memory | < 10,000 | Checkpoint lag; reduce write rate or checkpoint_interval |
angarabase_storage_cached_pages_total | gauge | Cached pages | grows up to bp size | Sudden drop = eviction storm |
angarabase_buffer_pool_hit_total | counter | Cache hits | — | hit rate = hits / (hits + misses) |
angarabase_buffer_pool_miss_total | counter | Cache misses | — | miss rate > 20% = larger buffer pool needed |
angarabase_buffer_pool_warmup_pages_total | counter | Pages loaded during warmup | — | After restart |
angarabase_storage_flush_ok_total | counter | Successful flushes | monotonic | — |
angarabase_storage_backpressure_events_total | counter | Backpressure events | 0 | > 0 = writer faster than disk |
angarabase_storage_backpressure_commit_rejected_total | counter | Commit rejected by backpressure | 0 | I/O performance is insufficient |
angarabase_storage_flush_bytes_total | counter | Bytes flushed to disk | — | I/O write throughput |
PromQL — Buffer pool hit ratio:
rate(angarabase_buffer_pool_hit_total[5m]) /
(rate(angarabase_buffer_pool_hit_total[5m]) + rate(angarabase_buffer_pool_miss_total[5m]))
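The same ratio can be checked by hand from counter increases over one window. The sketch below is an illustration under assumptions: the function names are hypothetical, and the 20% limit is taken from the miss-rate guidance in the table above.

```python
from typing import Optional


def hit_ratio(hits_delta: float, misses_delta: float) -> Optional[float]:
    """hits / (hits + misses) for one window; None when there were no lookups."""
    total = hits_delta + misses_delta
    if total <= 0:
        return None
    return hits_delta / total


def buffer_pool_too_small(hits_delta: float, misses_delta: float,
                          max_miss_ratio: float = 0.20) -> bool:
    """True when the miss ratio in the window exceeds max_miss_ratio."""
    total = hits_delta + misses_delta
    if total <= 0:
        return False  # no traffic, nothing to conclude
    return misses_delta / total > max_miss_ratio


# Hypothetical increases over a 5m window.
print(hit_ratio(9_000, 1_000))              # 0.9
print(buffer_pool_too_small(7_000, 3_000))  # True
```

Returning None for an idle window avoids reporting a misleading 0% or 100% hit ratio when no pages were requested.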
Checkpoint and bgwriter
| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
angarabase_checkpoint_total | counter | Successful checkpoints | > 0 in 5 min | = 0 for 10 min = checkpoint stopped |
angarabase_checkpoint_errors_total | counter | Checkpoint errors | 0 | Inspect logs immediately |
angarabase_checkpoint_dirty_pages | gauge | Dirty pages at checkpoint time | < 5,000 | High value = checkpoint cannot keep up |
angarabase_checkpoint_duration_ms_sum | counter | Total checkpoint time (ms) | — | avg = sum/count |
angarabase_checkpoint_aborted_total | counter | Aborted checkpoints | 0 | > 0 = cancellations; check reason |
angarabase_checkpoint_per_db_timeout_total | counter | Per-DB checkpoint timeouts | 0 | timeout = disk too slow |
angarabase_angarabase_wal_forced_checkpoints_total | counter | Forced checkpoints due to backpressure | 0 | > 0 = write pressure is critical |
SQL — bgwriter state:
SELECT * FROM angara_stat_bgwriter;
PromQL — checkpoint avg duration:
rate(angarabase_checkpoint_duration_ms_sum[5m]) / rate(angarabase_checkpoint_duration_ms_count[5m])
Query execution
| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
angarabase_query_exec_total_ok_select | counter | SELECT queries OK | — | QPS baseline |
angarabase_query_exec_total_ok_write | counter | Write queries OK | — | Write TPS |
angarabase_query_exec_total_err_select | counter | SELECT errors | close to 0 | rate grows = bugs or overload |
angarabase_query_exec_duration_ms_bucket | histogram | Latency distribution | P99 < 100ms | P99 > 500ms = degradation |
angarabase_slow_query_total | counter | Slow queries (> threshold) | 0 | > 0 = EXPLAIN slow queries needed |
angarabase_sql_routing_not_supported_total | counter | Unsupported SQL routes | 0 | > 0 = application uses unsupported SQL |
angarabase_legacy_fallback_triggered_total | counter | Legacy path fallbacks | 0 | > 0 = unsupported query plan |
angarabase_simd_agg_fallback_total | counter | SIMD aggregation fallback to scalar path | 0 | > 0 = AVX2/NEON support missing or type incompatibility |
angarabase_adaptive_probe_swap_total | counter | Number of adaptive Hash Join side swaps | — | Shows optimizer activity under table-size skew |
PromQL — P99 latency:
histogram_quantile(0.99,
rate(angarabase_query_exec_duration_ms_bucket[5m])
)
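For readers who want to see what histogram_quantile does with the bucket counters, the following Python sketch reproduces the linear interpolation used for classic Prometheus histograms. It is a simplified illustration, not the Prometheus implementation: it assumes cumulative counts sorted by upper bound with a final +Inf bucket, and the bucket counts in the example are invented.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must be sorted by upper_bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies above the highest finite bound,
                # so only that bound can be reported.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Hypothetical cumulative counts for the documented bucket bounds (ms).
sample = [(1, 0), (5, 100), (10, 400), (50, 900), (100, 980), (500, 995),
          (1000, 1000), (2500, 1000), (5000, 1000), (10000, 1000),
          (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample)
print(f"p99 = {p99:.1f} ms")
```

In this example the rank 990 falls into the 100..500 ms bucket, so the estimate is interpolated between those two bounds; the true P99 can be anywhere inside that bucket.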
SQL — slow queries:
SELECT query, calls, mean_exec_time_ms, max_exec_time_ms
FROM angara_stat_statements
ORDER BY mean_exec_time_ms DESC LIMIT 10;
Memory
| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
angarabase_memory_rss_bytes | gauge | Process RSS (bytes) | < soft_limit * 0.9 | OOM risk; check query patterns |
angarabase_memory_soft_limit_exceeded_total | counter | soft_limit_mb crossings | 0 | > 0 = memory under pressure |
angarabase_tx_overlay_dataset_bytes_total | gauge | In-memory tx overlay size | < 512 MB | Large txns keep much data in memory |
QoS Scheduler
| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
angarabase_qos_rejected_critical_total | counter | CRITICAL queue rejections | 0 | Incident candidate — immediate triage |
angarabase_qos_rejected_interactive_total | counter | INTERACTIVE queue rejections | 0 | User-facing degradation |
angarabase_qos_rejected_background_total | counter | BACKGROUND queue rejections | — | Reduce background concurrency |
angarabase_qos_blocking_inflight | gauge | Blocking tasks | < max_blocking | scheduler saturation |
angarabase_spawn_blocking_active | gauge | Active spawn_blocking | < max_blocking | — |
Troubleshooting by Dashboard
Route 1: High P99 latency
angarabase_query_exec_duration_ms P99 > 500ms?
│
├─ Yes → angara_stat_activity: any waiting sessions?
│ │
│ ├─ Yes (wait_event != '') → Lock contention or WAL sync wait
│ │ → check angarabase_txn_commit_conflicts_total
│ │ → check angarabase_wal_sync_wait_total
│ │
│ └─ No → angara_stat_statements: top queries by max_exec_time_ms
│ → EXPLAIN the top query
│ → check buffer_pool_miss_total rate (I/O bound?)
│
└─ No → baseline normal, false alarm
SQL:
SELECT query, calls, max_exec_time_ms, mean_exec_time_ms
FROM angara_stat_statements
ORDER BY max_exec_time_ms DESC LIMIT 5;
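The Route 1 tree can also be written down as a small function, which is handy for a triage script or a chat-ops bot. This is a sketch only: the function name and its inputs are hypothetical, and the caller is assumed to have already read the P99 value and counted sessions with a non-empty wait_event in angara_stat_activity.

```python
from typing import List


def route_high_p99(p99_ms: float, waiting_sessions: int,
                   threshold_ms: float = 500.0) -> List[str]:
    """Return the next checks for Route 1 (high P99 latency)."""
    if p99_ms <= threshold_ms:
        return ["baseline normal, false alarm"]
    if waiting_sessions > 0:
        # Sessions are waiting: lock contention or WAL sync wait.
        return [
            "check angarabase_txn_commit_conflicts_total",
            "check angarabase_wal_sync_wait_total",
        ]
    # Nobody is waiting: look for expensive or I/O-bound queries.
    return [
        "list top queries in angara_stat_statements by max_exec_time_ms",
        "EXPLAIN the top query",
        "check angarabase_buffer_pool_miss_total rate",
    ]


for step in route_high_p99(p99_ms=820.0, waiting_sessions=3):
    print(step)
```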
Route 2: QPS Drop (sudden SELECT rate drop)
rate(angarabase_query_exec_total_ok_select[1m]) dropped sharply?
│
├─ connections_active also dropped → process restarted? uptime < 60s?
│ → check logs for panic / OOM / segfault
│
├─ connections_active high, QPS low → scheduler saturation?
│ → qos_rejected_* > 0?
│ → qos_blocking_inflight high?
│ → spawn_blocking_active ≈ spawn_blocking_max?
│
└─ Connections normal → long transaction blocking?
→ angara_stat_activity WHERE state = 'idle in transaction'
→ txn_oldest_snapshot_age_seconds > 60s?
Route 3: GC Pressure / MVCC bloat
mvcc_history_versions_total grows monotonically without decrease?
│
├─ txn_oldest_snapshot_age_seconds > 120s → long open snapshot
│ → find pid from angara_stat_activity ORDER BY query_start ASC
│ → terminate or wait for completion
│
├─ columnar_pending_deleted_rows > 1M → compaction lagging
│ → check Background Compactor in angara_stat_activity
│ → temporarily SET angarabase.compaction_enabled = true
│
└─ memory_rss_bytes grows together → GC bloat + memory pressure
→ see mvcc-gc.md runbook
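The entry condition of Route 3, "grows monotonically without decrease", can be tested mechanically on a series of scraped values. The sketch below shows one possible definition; the function name and the sample series are assumptions for illustration.

```python
from typing import Sequence


def grows_without_decrease(samples: Sequence[float]) -> bool:
    """True when the series never drops and ends higher than it started."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] > samples[0]


# Hypothetical mvcc_history_versions_total values, one per scrape.
print(grows_without_decrease([100, 120, 150, 150, 200]))  # True
print(grows_without_decrease([100, 150, 120, 200]))       # False: GC ran
```

A single drop in the series means GC reclaimed versions at least once in the window, so the route does not apply.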
Route 4: Checkpoint Issues
checkpoint_errors_total changed?
│
├─ Yes → inspect logs immediately (disk full? I/O error?)
│ → storage_backpressure_events_total > 0?
│ → df -h on data directory
│
└─ No, but dirty_pages_total high (> 10,000)?
→ checkpoint cannot keep up with writes
→ lower checkpoint_interval_ms
→ or limit write throughput
→ SQL: SELECT * FROM angara_stat_bgwriter;
Goal
Keep the minimum sufficient signal set for:
- durability;
- concurrency/locks;
- storage/checkpoint;
- recovery.
Metrics source
- ANGARABASE_METRICS_ADDR=host:port
- endpoint: GET /metrics (Prometheus format)
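Because the endpoint returns the Prometheus text format, a scrape can be inspected with a few lines of Python. The parser below is a deliberately small sketch: it assumes one sample per line without trailing timestamps, keeps the label set as part of the key, and the HELP text in the sample payload is invented for the example.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map 'name{labels}' to its value, skipping comments and blank lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token (no timestamps assumed).
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hypothetical scrape body; in practice it comes from GET /metrics.
payload = """\
# HELP angarabase_connections_active Active client connections
# TYPE angarabase_connections_active gauge
angarabase_connections_active 42
angarabase_wait_events_total{event="wal_sync"} 7
"""
metrics = parse_metrics(payload)
print(metrics["angarabase_connections_active"])
```

For anything beyond ad-hoc checks, a maintained Prometheus client or parser library is the safer choice, since the full format also allows timestamps and escaped label values.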
Must-have groups
- Transactions / concurrency
- Transaction log / durability
- Locks
- Storage / writeback / checkpoint
- Query diagnostics / stats
- Recovery / replay outcomes
Memory and Buffer Pool Metrics (RM-0.6.5.8)
| Metric | Type | Meaning |
|---|---|---|
angarabase_memory_rss_bytes | gauge | Resident Set Size of the server process in bytes. Updated every 5s. |
angarabase_memory_soft_limit_exceeded_total | counter | Number of soft_limit_mb threshold crossings (edge-trigger). |
angarabase_buffer_pool_warmup_evictions_during_warmup_total | counter | Number of page evictions from buffer pool during warmup (warmup cap enforcement). |
angarabase_buffer_pool_warmup_completed_pages | counter | Number of pages loaded during warmup. |
angarabase_buffer_pool_warmup_aborted_at_cap_total | counter | Warmup aborted because cap was exceeded (>95%). |
PromQL — Alert when approaching soft limit:
# Replace <soft_limit_bytes> with soft_limit_mb * 1024 * 1024
# For example, for soft_limit_mb = 4096: threshold = 4294967296
angarabase_memory_rss_bytes > <soft_limit_bytes> * 0.9
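The threshold substitution described in the comments can be computed rather than typed by hand. A minimal sketch, with hypothetical function names and RSS values:

```python
def soft_limit_bytes(soft_limit_mb: int) -> int:
    """Convert soft_limit_mb to bytes (1 MB = 1024 * 1024 bytes here)."""
    return soft_limit_mb * 1024 * 1024


def approaching_soft_limit(rss_bytes: int, soft_limit_mb: int,
                           fraction: float = 0.9) -> bool:
    """True when RSS is above the given fraction of the soft limit."""
    return rss_bytes > soft_limit_bytes(soft_limit_mb) * fraction


print(soft_limit_bytes(4096))                       # 4294967296
print(approaching_soft_limit(3_900_000_000, 4096))  # True
```

For soft_limit_mb = 4096 the alert boundary is 4294967296 * 0.9, a little under 3.87 GB of RSS.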
Storage and Checkpoint Metrics (RM-0.6.5.8)
| Metric | Type | Meaning |
|---|---|---|
angarabase_checkpoint_total | counter | Total number of completed checkpoints. > 0 after 5 min uptime confirms auto-checkpoint is working. |
Visibility Map and Index-Only Scan (RM-0.6.4.3)
| Metric | Type | Meaning |
|---|---|---|
angarabase_visibility_map_all_visible_fraction | gauge | Share of all-visible pages (planner signal). |
angarabase_index_only_scan_hits_total | counter | Successful Index-Only Scan (without Heap access). |
angarabase_index_only_scan_heap_fetches_total | counter | Fallback to Heap during Index-Only Scan (VM bit=0). |
angarabase_visibility_map_rebuild_pages_remaining | gauge | Remaining pages for background VM rebuild. |
angarabase_visibility_map_corrupt_total | counter | Detected VM corruptions (rebuild trigger). |
Specific metric names linked to dashboard panels: see the table. The full name contract is pinned by a test; see “Contract pinning” below.
New RM-0.6.4.0 Metrics (WAL Commit Path + Durability)
Added in Sprint 2/3 RM-0.6.4.0 (RFC-2026-090). Cover the new sync_at_commit
mode and the durability barrier group.
curl -sf http://127.0.0.1:9898/metrics | rg "wal_(sync_wait|group_commit_wait)|wait_events_total\\{event=\"wal_"
| Metric | Type | Meaning |
|---|---|---|
angarabase_wal_sync_wait_total | counter | Number of commit-wait events on the IO::WalSync path (strict durability). |
angarabase_wal_group_commit_wait_total | counter | Number of commit-wait events on the IO::WalGroupCommit path (batched durability wait). |
angarabase_wait_events_total{event="wal_sync"} | counter | Unified wait-event counter for the WAL sync path. |
angarabase_wait_events_total{event="wal_group_commit"} | counter | Unified wait-event counter for the group-commit path. |
Diagnostics by mode
- relaxed: wal_sync_wait_total and wal_group_commit_wait_total are close to 0.
- group_commit: wal_group_commit_wait_total grows; wal_sync_wait_total is usually noticeably lower.
- sync_at_commit / strict: wal_sync_wait_total grows; wait_events_total{event="wal_sync"} reflects long-term sync-path load.
Durability mode is checked through env ANGARABASE_TRANSACTION_LOG_DURABILITY.
The SQL forms SET durability and COMMIT WITH DURABILITY are reserved for v0.6.5 and currently return SQLSTATE 0A000 (feature not supported).
Details: WAL writer contract spec (wal_writer_contract_v0.md) and RFC-2026-090.
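The per-mode expectations above can be turned into a quick sanity check that compares the configured durability mode with the observed counter rates. This is a rough sketch: the function name is hypothetical, and the idle threshold that stands in for "close to 0" is an assumption that should be tuned per deployment.

```python
def rates_match_mode(mode: str, sync_rate: float, group_rate: float,
                     idle: float = 0.01) -> bool:
    """Check observed WAL wait rates against the pattern expected for a mode.

    sync_rate  : rate of angarabase_wal_sync_wait_total (events/s)
    group_rate : rate of angarabase_wal_group_commit_wait_total (events/s)
    idle       : assumed cut-off for "close to 0"
    """
    if mode == "relaxed":
        return sync_rate <= idle and group_rate <= idle
    if mode == "group_commit":
        return group_rate > idle and sync_rate < group_rate
    if mode in ("sync_at_commit", "strict"):
        return sync_rate > idle
    raise ValueError(f"unknown durability mode: {mode}")


print(rates_match_mode("group_commit", sync_rate=0.2, group_rate=35.0))  # True
print(rates_match_mode("relaxed", sync_rate=12.0, group_rate=0.0))       # False
```

A False result is only a hint to compare ANGARABASE_TRANSACTION_LOG_DURABILITY with what the server actually does; it does not by itself prove a misconfiguration.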
HTAP / Vector Execution Metrics (RM-0.6.4.13 / RM-0.6.4.14 / RM-0.6.6.9)
HTAP-specific metrics for diagnosing vector and stream execution paths.
The label contract is stable starting with v0.6.x.
curl -sf http://127.0.0.1:9898/metrics | grep -E "scan_stream|vector_fallback|vector_memory|columnar_manifest|vector_columnar_native|columnar_batched_scan|segments_pruned|parallel_agg"
| Metric | Type | Meaning |
|---|---|---|
angarabase_scan_stream_materialize_total{reason="batch_to_rows"} | counter | Materialization at batch→rows boundary. |
angarabase_scan_stream_materialize_total{reason="drain_rows_default"} | counter | Materialization through drain_rows (fallback default). |
angarabase_scan_stream_materialize_total{reason="stream_to_relation_boundary"} | counter | Materialization at stream→relation boundary. |
angarabase_scan_stream_fallback_total | counter | Stream-plan fallback to legacy executor. |
angarabase_vector_fallback_total | counter | Vector-path fallback to row path (unsupported plan or type error). |
angarabase_vector_columnar_native_total | counter | Successful native vector-path activations for columnar tables. |
angarabase_columnar_batched_scan_batches_total | counter | Total processed columnar batches in native path. |
angarabase_columnar_segments_pruned_total | counter | Number of segments pruned by metadata (zone-map pruning). |
angarabase_parallel_agg_total | counter | Number of parallel aggregator runs. |
angarabase_vector_memory_budget_exceeded_total | counter | Vector budget allocation refusal (SQLSTATE 53100). |
angarabase_columnar_manifest_init_failed_total | counter | SegmentManifest init error during CREATE TABLE USING COLUMNAR. |
Note:
reason= labels on angarabase_scan_stream_materialize_total are a stable operator-facing contract within v0.6.x.
Columnar DV Pressure (RM-0.6.4.19 Track C C2)
angarabase_columnar_pending_deleted_rows — signed gauge showing the total
number of logically deleted rows in live segments that have not yet been reclaimed by compaction.
- Increment on AttachDeleteVector (on every columnar DELETE): +row_count from the DV op.
- Decrement on compact_l0_to_l1: -rows_reclaimed, by the number of rows not included in the L1 pack.
Normally, the gauge grows after DELETE and decreases after a Background Compactor run. If the gauge grows monotonically, compaction is lagging or fully disabled.
curl -sf http://127.0.0.1:9898/metrics | rg "pending_deleted_rows"
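The bookkeeping behind the signed gauge can be modelled in a few lines. The class below is a toy model for intuition, not the server code; the method names simply echo the two operations named above, and the idea that a decrement may be applied before its matching increment during replay is an assumption used to show why a negative value can appear.

```python
class PendingDeletedRows:
    """Toy model of the signed pending-deleted-rows gauge."""

    def __init__(self) -> None:
        self.value = 0

    def attach_delete_vector(self, row_count: int) -> None:
        # Every columnar DELETE adds its row count.
        self.value += row_count

    def compact_l0_to_l1(self, rows_reclaimed: int) -> None:
        # Compaction subtracts the rows it dropped from the L1 pack.
        self.value -= rows_reclaimed


gauge = PendingDeletedRows()
gauge.attach_delete_vector(1_000)
gauge.attach_delete_vector(500)
gauge.compact_l0_to_l1(1_200)
print(gauge.value)  # 300

# Assumed replay ordering: the decrement arrives first, so the value dips below 0.
replayed = PendingDeletedRows()
replayed.compact_l0_to_l1(1_200)
print(replayed.value)  # -1200
replayed.attach_delete_vector(1_500)
print(replayed.value)  # 300
```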
Alert rule (DV fragmentation)
# Alert if accumulated DV pressure > 5 million rows.
angarabase_columnar_pending_deleted_rows > 5_000_000
Recommended severity:
- warning when > 1M rows: compaction is likely lagging;
- critical when > 10M rows: scan performance degradation is possible.
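The two recommended levels map directly onto a small classifier, which may be useful when the gauge is checked from a script instead of an alerting rule. The function name and the "ok" label are hypothetical.

```python
def dv_pressure_severity(pending_deleted_rows: int) -> str:
    """Classify the pending-deleted-rows gauge using the recommended thresholds."""
    if pending_deleted_rows > 10_000_000:
        return "critical"  # scan performance degradation is possible
    if pending_deleted_rows > 1_000_000:
        return "warning"   # compaction is likely lagging
    return "ok"


for value in (250_000, 3_000_000, 12_000_000):
    print(value, dv_pressure_severity(value))
```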
Interpretation:
- gauge ≤ 0: normal (all DV reclaimed, possibly a small transient underflow during replay);
- gauge grows without decrease for > 30 minutes: check Background Compactor (angara_stat_activity, angarabase_columnar_compaction_total).
| Metric | Type | Meaning |
|---|---|---|
angarabase_columnar_pending_deleted_rows | gauge (signed) | Net pending-deleted rows across all columnar segments. |
Heap fetch fallback reason metrics (RM-0.6.5.6)
- angarabase_heap_point_fetch_fallback_reason_stale_tid_index_total: fallback due to a stale tid index
- angarabase_heap_point_fetch_fallback_reason_not_found_total: fallback due to row not found
Quick check (curl):
curl -s http://localhost:8080/metrics | grep "fallback_reason"
# angarabase_heap_point_fetch_fallback_reason_stale_tid_index_total 0
# angarabase_heap_point_fetch_fallback_reason_not_found_total 0
PromQL — fallback rate by reason:
rate(angarabase_heap_point_fetch_fallback_reason_stale_tid_index_total[5m])
rate(angarabase_heap_point_fetch_fallback_reason_not_found_total[5m])
If stale_tid_index grows, there may be an issue with the V3 chain path or index rebuild. If not_found grows, data loss or an MVCC visibility bug is possible.
QoS Scheduler and spawn_blocking (RM-0.6.4.10 / RM-0.6.4.19)
RM-0.6.4.10 adds runtime signals for the QoS scheduler and the blocking path. They
help distinguish SQL contention from scheduler saturation: if QoS
rejections or qos_blocking grow, the problem is in the execution queues, not in
row/table locks.
curl -sf http://127.0.0.1:9898/metrics | rg "qos_(queued|rejected|blocking)|spawn_blocking"
| Metric | Type | Meaning |
|---|---|---|
angarabase_qos_queued_critical_total | counter | Total tasks placed in QoS CRITICAL queue. |
angarabase_qos_queued_interactive_total | counter | Total tasks placed in QoS INTERACTIVE queue. |
angarabase_qos_queued_background_total | counter | Total tasks placed in QoS BACKGROUND queue. |
angarabase_qos_rejected_critical_total | counter | CRITICAL queue rejections with SQLSTATE 53600. |
angarabase_qos_rejected_interactive_total | counter | INTERACTIVE queue rejections with SQLSTATE 53600. |
angarabase_qos_rejected_background_total | counter | BACKGROUND queue rejections with SQLSTATE 53600. |
angarabase_qos_blocking_inflight | gauge | Current blocking tasks across QoS shards. |
angarabase_spawn_blocking_max | gauge | spawn_blocking thread limit from max_blocking_threads; 0 before startup init. |
angarabase_spawn_blocking_active | gauge | Active spawn_blocking tasks. Incremented at start, decremented on completion via SpawnBlockingGuard (RM-0.6.4.19 Track C C2). |
QoS queues by level:
rate({__name__=~"angarabase_qos_queued_.*_total"}[5m])
QoS rejections by level:
rate({__name__=~"angarabase_qos_rejected_.*_total"}[5m])
Alert on any scheduler rejection:
sum(rate({__name__=~"angarabase_qos_rejected_.*_total"}[5m])) > 0
Blocking pressure:
angarabase_qos_blocking_inflight > 0
Blocking budget headroom:
angarabase_spawn_blocking_max - angarabase_spawn_blocking_active
Interpretation:
- queued_background_total grows but no rejected_*: the scheduler accepts the batch workload; usually normal;
- rejected_background_total grows: batch/ETL is too aggressive; lower concurrency or raise ANGARABASE_QOS_MAX_QUEUED;
- rejected_critical_total grows: production incident candidate; CRITICAL workload should not regularly hit the queue cap;
- qos_blocking_inflight > 0 together with growth in the qos_blocking wait event means pressure in the blocking runtime path.
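The interpretation rules can be combined with the headroom query into one helper that turns current readings into findings. This is a sketch under assumptions: the function names and the wording of the findings are invented, and the inputs are expected to be per-second rates and gauge values already fetched from Prometheus.

```python
from typing import List


def blocking_headroom(spawn_blocking_max: int, spawn_blocking_active: int) -> int:
    """Free spawn_blocking slots, mirroring the headroom query above."""
    return spawn_blocking_max - spawn_blocking_active


def qos_findings(queued_background_rate: float,
                 rejected_background_rate: float,
                 rejected_critical_rate: float,
                 blocking_inflight: int) -> List[str]:
    """Translate QoS readings into findings, most severe first."""
    findings: List[str] = []
    if rejected_critical_rate > 0:
        findings.append("incident candidate: CRITICAL work hits the queue cap")
    if rejected_background_rate > 0:
        findings.append("batch/ETL too aggressive: lower concurrency "
                        "or raise ANGARABASE_QOS_MAX_QUEUED")
    if blocking_inflight > 0:
        findings.append("blocking path busy: compare with the qos_blocking wait event")
    if not findings and queued_background_rate > 0:
        findings.append("background work queued without rejections: usually normal")
    return findings


print(blocking_headroom(512, 500))    # 12
print(qos_findings(4.0, 0.0, 0.0, 0))
```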
Query Execution Duration Histogram (RM-0.6.5.10)
angarabase_query_exec_duration_ms — histogram of SQL query execution latency.
Note (RM-0.6.5.10 S6):
histogram_quantile(0.99) is correct only if the value is < 10,000 ms. At p99 ≥ 10,000 ms, inspect the share in the +Inf bucket. Buckets: [1, 5, 10, 50, 100, 500, 1000, 2500, 5000, 10000, +Inf] ms.
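The reason for this caveat is that histogram_quantile cannot interpolate inside the +Inf bucket: when the requested quantile falls there, it reports the highest finite bound (10,000 ms here) regardless of how slow the queries really were. The sketch below shows how to measure the share of observations beyond the last finite bucket and how to tell whether a quantile is saturated; the function names and counts are illustrative assumptions.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper_bound_ms, cumulative_count)


def overflow_share(buckets: List[Bucket]) -> float:
    """Fraction of observations above the highest finite bucket bound."""
    total = buckets[-1][1]           # the +Inf bucket holds the total count
    if total == 0:
        return 0.0
    highest_finite = buckets[-2][1]
    return (total - highest_finite) / total


def quantile_saturated(buckets: List[Bucket], q: float) -> bool:
    """True when the q-quantile falls into the +Inf bucket."""
    total = buckets[-1][1]
    return buckets[-2][1] < q * total


# Hypothetical tail of the histogram: 3% of queries took longer than 10 s.
tail = [(5000, 940), (10000, 970), (math.inf, 1000)]
print(f"share above 10000 ms: {overflow_share(tail):.2%}")
print(quantile_saturated(tail, 0.99))  # True: reported p99 is only a lower bound
```

When the quantile is saturated, the reported 10,000 ms should be read as "at least 10 s", and the overflow share is the more informative number.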
SLO-oriented usage
- Latency: histogram_quantile() over *_bucket (p95/p99)
- Throughput: rate() over counters
- Errors/contention: conflict/timeout/deadlock rates
- Saturation: backpressure counters and queue depth
Contract pinning
Must-have metric names are considered part of the operability contract and are protected by a test:
- file: crates/angarabase/src/metrics.rs
- test: prometheus_export_contains_must_have_metrics_names
Next
- Performance tuning guide — which metrics to read first during degradation.
- Parallel runtime observability runbook — narrow metrics for the parallel runtime.
- MVCC and GC operator minimum — separate MVCC/GC metrics and alerts package.