Observability Metrics Reference

Full AngaraBase metrics reference with diagnostic routes and a quick reference card. Canonical source: this runbook in angarabook/src/operations/.


Quick Reference Card (Top-10 for wallboard)

Print this and keep it near the on-call desk. These 10 metrics cover 80% of production incidents.

| # | Metric | Type | Normal range | What crossing the boundary means |
|---|---|---|---|---|
| 1 | angarabase_connections_active | gauge | < 80% of max_pool | Connection leak or missing PgBouncer; check angara_stat_activity |
| 2 | angarabase_txn_rollback_total (rate 1m) | counter rate | < 5% of commit rate | Abnormal rollback rate: MVCC conflicts, deadlock, or application bugs |
| 3 | angarabase_storage_dirty_pages_total | gauge | < 10,000 pages | Checkpoint cannot keep up; lower write rate or reduce checkpoint interval |
| 4 | angarabase_checkpoint_errors_total (change) | counter | 0 | Checkpoint error = critical incident; inspect logs immediately |
| 5 | angarabase_transaction_log_flush_lsn vs durable_lsn (delta) | gauge | < 1 MB | Large gap = WAL durability lag; data-loss risk on crash |
| 6 | angarabase_query_exec_duration_ms_bucket P99 | histogram | < 100 ms | P99 degradation; check angara_stat_activity + EXPLAIN |
| 7 | angarabase_buffer_pool_miss_total (rate) | counter rate | miss ratio < 20% | Low cache hit ratio; increase buffer_pool_size_mb |
| 8 | angarabase_memory_rss_bytes | gauge | < soft_limit * 0.9 | Approaching soft limit, OOM risk; check query patterns + GC |
| 9 | angarabase_qos_rejected_critical_total (rate) | counter rate | 0 | Any CRITICAL rejections = production incident candidate |
| 10 | angarabase_uptime_seconds | gauge | monotonically increasing | Value < 60 after a pause = unexpected restart / crash |

Full Metrics Reference

Connections and Sessions

| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
| angarabase_connections_active | gauge | Active client connections | < max_pool * 0.8 | Check pool config, connection leaks |
| angarabase_connections_accepted_total | counter | Total connections since startup | monotonic | Sudden rate spike: DDoS or reconnect storm |
| angarabase_pgwire_active_tasks | gauge | Active pgwire spawn_blocking tasks | ≤ max_blocking_threads | Saturation of blocking runtime path |
| angarabase_session_claims_set_total | counter | Session claims set operations (app.*) | | Used for audit trail |

Connection diagnostics:

SELECT pid, state, consumer_id, wait_event FROM angara_stat_activity;

Transactions and MVCC

| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
| angarabase_txn_begin_total | counter | Total BEGIN | | Throughput baseline |
| angarabase_txn_commit_total | counter | Total COMMIT | | rate(1m) = TPS |
| angarabase_txn_rollback_total | counter | Total ROLLBACK | < 5% of commit | Conflicts, application errors |
| angarabase_txn_active_count | gauge | Transactions in flight | < 100 (OLTP) | Long txns; check txn_oldest_snapshot_age_seconds |
| angarabase_txn_commit_conflicts_total | counter | MVCC conflicts | close to 0 | High rate = competing writes to the same rows |
| angarabase_txn_oldest_snapshot_age_seconds | gauge | Age of oldest snapshot | < 60s | Long snapshot blocks GC → GC bloat |
| angarabase_mvcc_history_versions_total | gauge | Versions in MVCC store | grows slowly | Fast growth = GC cannot keep up (see MVCC GC runbook) |
| angarabase_txn_commit_epoch_current | gauge | Current commit epoch | monotonic | Does not change for > 30s under load = WAL issue |

PromQL — TPS:

rate(angarabase_txn_commit_total[1m])

PromQL — Conflict ratio:

rate(angarabase_txn_commit_conflicts_total[5m]) / rate(angarabase_txn_commit_total[5m])
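
As a rough illustration of what the two rate() expressions above compute, the sketch below derives a per-second rate from two raw counter samples and then the conflict ratio. The helper names and sample values are hypothetical, and the reset handling is a simplification of Prometheus behaviour: it looks at only two samples and does no extrapolation.

```python
def counter_rate(prev: float, curr: float, seconds: float) -> float:
    """Per-second increase of a monotonic counter between two samples.

    A drop in the value is treated as a counter reset (process restart),
    in which case the counter is assumed to have restarted from zero.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    increase = curr - prev if curr >= prev else curr
    return increase / seconds


def conflict_ratio(conflict_rate: float, commit_rate: float) -> float:
    """Conflicts per commit; 0.0 when nothing was committed in the window."""
    return conflict_rate / commit_rate if commit_rate > 0 else 0.0


# Two scrapes taken 60 s apart (made-up values for illustration).
commits = counter_rate(prev=120_000, curr=126_000, seconds=60)
conflicts = counter_rate(prev=300, curr=360, seconds=60)
ratio = conflict_ratio(conflicts, commits)
```

With these samples the commit rate is 100 per second and the conflict ratio is 1%, which would count as "close to 0" in the table above.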

WAL and durability

| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
| angarabase_transaction_log_flush_lsn | gauge | LSN of last flush | monotonic | Growth stops = WAL writer hung |
| angarabase_transaction_log_durable_lsn | gauge | LSN of last fsync | ≤ flush_lsn | gap > 1 MB = durability lag |
| angarabase_transaction_log_last_checkpoint_id | gauge | ID of last checkpoint | monotonic | |
| angarabase_transaction_log_checkpoint_end_valid_total | counter | Successful checkpoint ends | monotonic | |
| angarabase_transaction_log_checkpoint_end_invalid_total | counter | Invalid checkpoint ends | 0 | > 0 = WAL corruption |
| angarabase_wal_sync_wait_total | counter | WAL sync waits (strict mode) | | rate grows = I/O latency |
| angarabase_wal_group_commit_wait_total | counter | WAL group commit waits | | rate grows = group commit backlog |
| angarabase_transaction_log_bytes_appended_total | counter | Bytes written to WAL | | WAL write throughput |

PromQL — WAL durability gap (bytes):

angarabase_transaction_log_flush_lsn - angarabase_transaction_log_durable_lsn
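
The same subtraction can be sketched outside PromQL, for example in a health-check script. The function names are hypothetical, and reading the "1 MB" boundary as 1 MiB is an assumption.

```python
MIB = 1024 * 1024  # assumption: "1 MB" in the table is read as 1 MiB


def wal_durability_gap(flush_lsn: int, durable_lsn: int) -> int:
    """Bytes flushed to the WAL but not yet fsynced; never negative."""
    return max(flush_lsn - durable_lsn, 0)


def is_durability_lagging(flush_lsn: int, durable_lsn: int, limit: int = MIB) -> bool:
    """True when the gap exceeds the durability-lag boundary."""
    return wal_durability_gap(flush_lsn, durable_lsn) > limit


# Example: 3 MiB flushed ahead of the last fsync.
lagging = is_durability_lagging(flush_lsn=10 * MIB, durable_lsn=7 * MIB)
```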

Storage and buffer pool

| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
| angarabase_storage_dirty_pages_total | gauge | Dirty pages in memory | < 10,000 | Checkpoint lag; reduce write rate or checkpoint_interval |
| angarabase_storage_cached_pages_total | gauge | Cached pages | grows up to bp size | Sudden drop = eviction storm |
| angarabase_buffer_pool_hit_total | counter | Cache hits | | hit rate = hits / (hits + misses) |
| angarabase_buffer_pool_miss_total | counter | Cache misses | | miss rate > 20% = larger buffer pool needed |
| angarabase_buffer_pool_warmup_pages_total | counter | Pages loaded during warmup | | After restart |
| angarabase_storage_flush_ok_total | counter | Successful flushes | monotonic | |
| angarabase_storage_backpressure_events_total | counter | Backpressure events | 0 | > 0 = writer faster than disk |
| angarabase_storage_backpressure_commit_rejected_total | counter | Commit rejected by backpressure | 0 | I/O performance is insufficient |
| angarabase_storage_flush_bytes_total | counter | Bytes flushed to disk | | I/O write throughput |

PromQL — Buffer pool hit ratio:

rate(angarabase_buffer_pool_hit_total[5m]) /
  (rate(angarabase_buffer_pool_hit_total[5m]) + rate(angarabase_buffer_pool_miss_total[5m]))

Checkpoint and bgwriter

| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
| angarabase_checkpoint_total | counter | Successful checkpoints | > 0 in 5 min | 0 for 10 min = checkpoint stopped |
| angarabase_checkpoint_errors_total | counter | Checkpoint errors | 0 | Inspect logs immediately |
| angarabase_checkpoint_dirty_pages | gauge | Dirty pages at checkpoint time | < 5,000 | High value = checkpoint cannot keep up |
| angarabase_checkpoint_duration_ms_sum | counter | Total checkpoint time (ms) | | avg = sum/count |
| angarabase_checkpoint_aborted_total | counter | Aborted checkpoints | 0 | > 0 = cancellations; check reason |
| angarabase_checkpoint_per_db_timeout_total | counter | Per-DB checkpoint timeouts | 0 | timeout = disk too slow |
| angarabase_angarabase_wal_forced_checkpoints_total | counter | Forced checkpoints due to backpressure | 0 | > 0 = write pressure is critical |

SQL — bgwriter state:

SELECT * FROM angara_stat_bgwriter;

PromQL — checkpoint avg duration:

rate(angarabase_checkpoint_duration_ms_sum[5m]) / rate(angarabase_checkpoint_duration_ms_count[5m])

Query execution

| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
| angarabase_query_exec_total_ok_select | counter | SELECT queries OK | | QPS baseline |
| angarabase_query_exec_total_ok_write | counter | Write queries OK | | Write TPS |
| angarabase_query_exec_total_err_select | counter | SELECT errors | close to 0 | rate grows = bugs or overload |
| angarabase_query_exec_duration_ms_bucket | histogram | Latency distribution | P99 < 100ms | P99 > 500ms = degradation |
| angarabase_slow_query_total | counter | Slow queries (> threshold) | 0 | > 0 = run EXPLAIN on the slow queries |
| angarabase_sql_routing_not_supported_total | counter | Unsupported SQL routes | 0 | > 0 = application uses unsupported SQL |
| angarabase_legacy_fallback_triggered_total | counter | Legacy path fallbacks | 0 | > 0 = unsupported query plan |
| angarabase_simd_agg_fallback_total | counter | SIMD aggregation fallback to scalar path | 0 | > 0 = AVX2/NEON support missing or type incompatibility |
| angarabase_adaptive_probe_swap_total | counter | Number of adaptive Hash Join side swaps | | Shows optimizer activity under table-size skew |

PromQL — P99 latency:

histogram_quantile(0.99,
  rate(angarabase_query_exec_duration_ms_bucket[5m])
)

SQL — slow queries:

SELECT query, calls, mean_exec_time_ms, max_exec_time_ms
FROM angara_stat_statements
ORDER BY mean_exec_time_ms DESC LIMIT 10;

Memory

| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
| angarabase_memory_rss_bytes | gauge | Process RSS (bytes) | < soft_limit * 0.9 | OOM risk; check query patterns |
| angarabase_memory_soft_limit_exceeded_total | counter | soft_limit_mb crossings | 0 | > 0 = memory under pressure |
| angarabase_tx_overlay_dataset_bytes_total | gauge | In-memory tx overlay size | < 512 MB | Large txns keep much data in memory |

QoS Scheduler

| Metric | Type | What it measures | Normal | Crossing the boundary |
|---|---|---|---|---|
| angarabase_qos_rejected_critical_total | counter | CRITICAL queue rejections | 0 | Incident candidate; immediate triage |
| angarabase_qos_rejected_interactive_total | counter | INTERACTIVE queue rejections | 0 | User-facing degradation |
| angarabase_qos_rejected_background_total | counter | BACKGROUND queue rejections | | Reduce background concurrency |
| angarabase_qos_blocking_inflight | gauge | Blocking tasks | < max_blocking | Scheduler saturation |
| angarabase_spawn_blocking_active | gauge | Active spawn_blocking | < max_blocking | |

Troubleshooting by Dashboard

Route 1: High P99 latency

angarabase_query_exec_duration_ms P99 > 500ms?
  │
  ├─ Yes → angara_stat_activity: any waiting sessions?
  │        │
  │        ├─ Yes (wait_event != '') → Lock contention or WAL sync wait
  │        │   → check angarabase_txn_commit_conflicts_total
  │        │   → check angarabase_wal_sync_wait_total
  │        │
  │        └─ No → angara_stat_statements: top queries by max_exec_time_ms
  │            → EXPLAIN the top query
  │            → check buffer_pool_miss_total rate (I/O bound?)
  │
  └─ No → baseline normal, false alarm

SQL:

SELECT query, calls, max_exec_time_ms, mean_exec_time_ms
FROM angara_stat_statements
ORDER BY max_exec_time_ms DESC LIMIT 5;
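
For teams that script their triage, Route 1 can be written down as a small decision function. This is only a sketch of the tree above: the function name is hypothetical, and the inputs are assumed to be gathered separately (the P99 from the PromQL query, the wait_event values from angara_stat_activity).

```python
def triage_high_p99(p99_ms, wait_events):
    """Return the next checks for Route 1.

    wait_events: the wait_event column of angara_stat_activity, one entry
    per session; an empty string or None means the session is not waiting.
    """
    if p99_ms <= 500:
        return ["baseline normal, false alarm"]
    if any(wait_events):
        # At least one session is waiting: lock contention or WAL sync wait.
        return [
            "check angarabase_txn_commit_conflicts_total",
            "check angarabase_wal_sync_wait_total",
        ]
    # Nobody is waiting: look at the queries themselves.
    return [
        "angara_stat_statements: top queries by max_exec_time_ms",
        "EXPLAIN the top query",
        "check buffer_pool_miss_total rate (I/O bound?)",
    ]
```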

Route 2: QPS Drop (sudden SELECT rate drop)

rate(angarabase_query_exec_total_ok_select[1m]) dropped sharply?
  │
  ├─ connections_active also dropped → process restarted? uptime < 60s?
  │   → check logs for panic / OOM / segfault
  │
  ├─ connections_active high, QPS low → scheduler saturation?
  │   → qos_rejected_* > 0?
  │   → qos_blocking_inflight high?
  │   → spawn_blocking_active ≈ spawn_blocking_max?
  │
  └─ Connections normal → long transaction blocking?
      → angara_stat_activity WHERE state = 'idle in transaction'
      → txn_oldest_snapshot_age_seconds > 60s?

Route 3: GC Pressure / MVCC bloat

mvcc_history_versions_total grows monotonically without decrease?
  │
  ├─ txn_oldest_snapshot_age_seconds > 120s → long open snapshot
  │   → find pid from angara_stat_activity ORDER BY query_start ASC
  │   → terminate or wait for completion
  │
  ├─ columnar_pending_deleted_rows > 1M → compaction lagging
  │   → check Background Compactor in angara_stat_activity
  │   → temporarily SET angarabase.compaction_enabled = true
  │
  └─ memory_rss_bytes grows together → GC bloat + memory pressure
      → see mvcc-gc.md runbook

Route 4: Checkpoint Issues

checkpoint_errors_total changed?
  │
  ├─ Yes → inspect logs immediately (disk full? I/O error?)
  │   → storage_backpressure_events_total > 0?
  │   → df -h on data directory
  │
  └─ No, but dirty_pages_total high (> 10,000)?
      → checkpoint cannot keep up with writes
      → lower checkpoint_interval_ms
      → or limit write throughput
      → SQL: SELECT * FROM angara_stat_bgwriter;
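
Route 4 translates into a similar sketch. The function name is hypothetical; the 10,000-page limit is the one quoted in the route, and the counter values are assumed to come from two consecutive scrapes.

```python
DIRTY_PAGES_LIMIT = 10_000  # threshold quoted in Route 4


def triage_checkpoint(errors_before, errors_now, dirty_pages):
    """Return the next checks for Route 4, or an empty list if all is well."""
    if errors_now != errors_before:
        # checkpoint_errors_total changed between scrapes.
        return [
            "inspect logs immediately (disk full? I/O error?)",
            "storage_backpressure_events_total > 0?",
            "df -h on data directory",
        ]
    if dirty_pages > DIRTY_PAGES_LIMIT:
        # No errors, but the checkpoint cannot keep up with writes.
        return [
            "lower checkpoint_interval_ms",
            "or limit write throughput",
            "SELECT * FROM angara_stat_bgwriter;",
        ]
    return []
```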

Goal

Keep the minimum sufficient signal set for:

  • durability;
  • concurrency/locks;
  • storage/checkpoint;
  • recovery.

Metrics source

  • ANGARABASE_METRICS_ADDR=host:port
  • endpoint: GET /metrics (Prometheus format)
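
To work with the endpoint from a script rather than from Prometheus, the text exposition format can be parsed with a few lines of standard-library Python. The following is a minimal sketch, not a complete parser: the function name is hypothetical, timestamps are ignored, and label values that contain a closing brace are not handled.

```python
import re

# Metric name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text: str) -> dict:
    """Map 'name' or 'name{labels}' to its sample value as a float."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples


# Made-up payload in the shape GET /metrics returns.
example = """\
# TYPE angarabase_connections_active gauge
angarabase_connections_active 42
angarabase_wait_events_total{event="wal_sync"} 7
"""
parsed = parse_metrics(example)
```

In practice the text would come from an HTTP GET against the address configured in ANGARABASE_METRICS_ADDR, for instance via urllib.request.urlopen.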

Must-have groups

  • Transactions / concurrency
  • Transaction log / durability
  • Locks
  • Storage / writeback / checkpoint
  • Query diagnostics / stats
  • Recovery / replay outcomes

Memory and Buffer Pool Metrics (RM-0.6.5.8)

| Metric | Type | Meaning |
|---|---|---|
| angarabase_memory_rss_bytes | gauge | Resident Set Size of the server process in bytes. Updated every 5s. |
| angarabase_memory_soft_limit_exceeded_total | counter | Number of soft_limit_mb threshold crossings (edge-trigger). |
| angarabase_buffer_pool_warmup_evictions_during_warmup_total | counter | Number of page evictions from buffer pool during warmup (warmup cap enforcement). |
| angarabase_buffer_pool_warmup_completed_pages | counter | Number of pages loaded during warmup. |
| angarabase_buffer_pool_warmup_aborted_at_cap_total | counter | Warmup aborted because cap was exceeded (>95%). |

PromQL — Alert when approaching soft limit:

# Replace <soft_limit_bytes> with soft_limit_mb * 1024 * 1024
# For example, for soft_limit_mb = 4096: threshold = 4294967296
angarabase_memory_rss_bytes > <soft_limit_bytes> * 0.9
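
The arithmetic behind the placeholder can be checked with a short calculation. The function name is hypothetical; the 0.9 factor and the 4096 MB example come from the rule above.

```python
def soft_limit_alert_threshold(soft_limit_mb: int, fraction: float = 0.9) -> float:
    """RSS value in bytes above which the alert fires."""
    soft_limit_bytes = soft_limit_mb * 1024 * 1024
    return soft_limit_bytes * fraction


limit_bytes = soft_limit_alert_threshold(4096, fraction=1.0)  # the full soft limit
alert_bytes = soft_limit_alert_threshold(4096)                # 90% of it
```

For soft_limit_mb = 4096 the full limit is 4294967296 bytes, so the alert fires at roughly 3.87 GB of RSS.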

Storage and Checkpoint Metrics (RM-0.6.5.8)

| Metric | Type | Meaning |
|---|---|---|
| angarabase_checkpoint_total | counter | Total number of completed checkpoints. > 0 after 5 min uptime confirms auto-checkpoint is working. |

Visibility Map and Index-Only Scan (RM-0.6.4.3)

| Metric | Type | Meaning |
|---|---|---|
| angarabase_visibility_map_all_visible_fraction | gauge | Share of all-visible pages (planner signal). |
| angarabase_index_only_scan_hits_total | counter | Successful Index-Only Scan (without Heap access). |
| angarabase_index_only_scan_heap_fetches_total | counter | Fallback to Heap during Index-Only Scan (VM bit=0). |
| angarabase_visibility_map_rebuild_pages_remaining | gauge | Remaining pages for background VM rebuild. |
| angarabase_visibility_map_corrupt_total | counter | Detected VM corruptions (rebuild trigger). |

The specific metric names behind each dashboard panel are listed in the table. The full name contract is pinned by a test; see “Contract pinning” below.

New RM-0.6.4.0 Metrics (WAL Commit Path + Durability)

Added in Sprint 2/3 RM-0.6.4.0 (RFC-2026-090). Cover the new sync_at_commit mode and the durability barrier group.

curl -sf http://127.0.0.1:9898/metrics | rg "wal_(sync_wait|group_commit_wait)|wait_events_total\\{event=\"wal_"

| Metric | Type | Meaning |
|---|---|---|
| angarabase_wal_sync_wait_total | counter | Number of commit-wait events on the IO::WalSync path (strict durability). |
| angarabase_wal_group_commit_wait_total | counter | Number of commit-wait events on the IO::WalGroupCommit path (batched durability wait). |
| angarabase_wait_events_total{event="wal_sync"} | counter | Unified wait-event counter for the WAL sync path. |
| angarabase_wait_events_total{event="wal_group_commit"} | counter | Unified wait-event counter for the group-commit path. |

Diagnostics by mode

  • relaxed: wal_sync_wait_total and wal_group_commit_wait_total are close to 0.
  • group_commit: wal_group_commit_wait_total grows; wal_sync_wait_total is usually noticeably lower.
  • sync_at_commit / strict: wal_sync_wait_total grows; wait_events_total{event="wal_sync"} reflects long-term sync-path load.

To check the durability mode, inspect the environment variable ANGARABASE_TRANSACTION_LOG_DURABILITY. The SQL statements SET durability and COMMIT WITH DURABILITY are reserved for v0.6.5; until then they return SQLSTATE 0A000. Details: WAL writer contract spec (wal_writer_contract_v0.md) and RFC-2026-090.

HTAP / Vector Execution Metrics (RM-0.6.4.13 / RM-0.6.4.14 / RM-0.6.6.9)

HTAP-specific metrics for diagnosing vector and stream execution paths. The label contract is stable starting with v0.6.x.

curl -sf http://127.0.0.1:9898/metrics | grep -E "scan_stream|vector_fallback|vector_memory|columnar_manifest|vector_columnar_native|columnar_batched_scan|segments_pruned|parallel_agg"

| Metric | Type | Meaning |
|---|---|---|
| angarabase_scan_stream_materialize_total{reason="batch_to_rows"} | counter | Materialization at batch→rows boundary. |
| angarabase_scan_stream_materialize_total{reason="drain_rows_default"} | counter | Materialization through drain_rows (fallback default). |
| angarabase_scan_stream_materialize_total{reason="stream_to_relation_boundary"} | counter | Materialization at stream→relation boundary. |
| angarabase_scan_stream_fallback_total | counter | Stream-plan fallback to legacy executor. |
| angarabase_vector_fallback_total | counter | Vector-path fallback to row path (unsupported plan or type error). |
| angarabase_vector_columnar_native_total | counter | Successful native vector-path activations for columnar tables. |
| angarabase_columnar_batched_scan_batches_total | counter | Total processed columnar batches in native path. |
| angarabase_columnar_segments_pruned_total | counter | Number of segments pruned by metadata (zone-map pruning). |
| angarabase_parallel_agg_total | counter | Number of parallel aggregator runs. |
| angarabase_vector_memory_budget_exceeded_total | counter | Vector budget allocation refusal (SQLSTATE 53100). |
| angarabase_columnar_manifest_init_failed_total | counter | SegmentManifest init error during CREATE TABLE USING COLUMNAR. |

Note: reason= labels on angarabase_scan_stream_materialize_total are a stable operator-facing contract within v0.6.x.

Columnar DV Pressure (RM-0.6.4.19 Track C C2)

angarabase_columnar_pending_deleted_rows — signed gauge showing the total number of logically deleted rows in live segments that have not yet been reclaimed by compaction.

  • Increment on AttachDeleteVector (on every columnar DELETE): +row_count from the DV op.
  • Decrement on compact_l0_to_l1: -rows_reclaimed, that is, the number of rows not included in the L1 pack.

Normally, the gauge grows after DELETE and decreases after a Background Compactor run. If the gauge grows monotonically, compaction is lagging or fully disabled.

curl -sf http://127.0.0.1:9898/metrics | rg "pending_deleted_rows"

Alert rule (DV fragmentation)

# Alert if accumulated DV pressure > 5 million rows.
angarabase_columnar_pending_deleted_rows > 5000000

Recommended severity:

  • warning when >1M rows — compaction is likely lagging;
  • critical when >10M rows — scan performance degradation is possible.
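
The two severity bands can be expressed as a small classifier, for example when the gauge is checked from a script instead of an alert rule. The function name and the "ok" label are hypothetical; the 1M and 10M boundaries are the ones recommended above.

```python
WARNING_ROWS = 1_000_000    # compaction is likely lagging
CRITICAL_ROWS = 10_000_000  # scan performance degradation is possible


def dv_pressure_severity(pending_deleted_rows: int) -> str:
    """Classify angarabase_columnar_pending_deleted_rows into a severity."""
    if pending_deleted_rows > CRITICAL_ROWS:
        return "critical"
    if pending_deleted_rows > WARNING_ROWS:
        return "warning"
    return "ok"  # includes zero and small negative values of the signed gauge
```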

Interpretation:

  • gauge ≤ 0 — normal (all DV reclaimed, possibly a small transient underflow during replay);
  • gauge grows without decrease for > 30 minutes — check Background Compactor (angara_stat_activity, angarabase_columnar_compaction_total).

| Metric | Type | Meaning |
|---|---|---|
| angarabase_columnar_pending_deleted_rows | gauge (signed) | Net pending-deleted rows across all columnar segments. |

Heap fetch fallback reason metrics (RM-0.6.5.6)

  • angarabase_heap_point_fetch_fallback_reason_stale_tid_index_total — fallback due to stale tid index
  • angarabase_heap_point_fetch_fallback_reason_not_found_total — fallback due to row not found

Quick check (curl):

curl -sf http://127.0.0.1:9898/metrics | grep "fallback_reason"
# angarabase_heap_point_fetch_fallback_reason_stale_tid_index_total 0
# angarabase_heap_point_fetch_fallback_reason_not_found_total 0

PromQL — fallback rate by reason:

rate(angarabase_heap_point_fetch_fallback_reason_stale_tid_index_total[5m])
rate(angarabase_heap_point_fetch_fallback_reason_not_found_total[5m])

If stale_tid_index grows, there may be an issue with the V3 chain path or index rebuild. If not_found grows, data loss or an MVCC visibility bug is possible.

QoS Scheduler and spawn_blocking (RM-0.6.4.10 / RM-0.6.4.19)

RM-0.6.4.10 adds runtime signals for QoS scheduler and blocking path. They help distinguish SQL contention from scheduler saturation: if QoS rejections or qos_blocking grow, the problem is in execution queues, not in row/table locks.

curl -sf http://127.0.0.1:9898/metrics | rg "qos_(queued|rejected|blocking)|spawn_blocking"

| Metric | Type | Meaning |
|---|---|---|
| angarabase_qos_queued_critical_total | counter | Total tasks placed in QoS CRITICAL queue. |
| angarabase_qos_queued_interactive_total | counter | Total tasks placed in QoS INTERACTIVE queue. |
| angarabase_qos_queued_background_total | counter | Total tasks placed in QoS BACKGROUND queue. |
| angarabase_qos_rejected_critical_total | counter | CRITICAL queue rejections with SQLSTATE 53600. |
| angarabase_qos_rejected_interactive_total | counter | INTERACTIVE queue rejections with SQLSTATE 53600. |
| angarabase_qos_rejected_background_total | counter | BACKGROUND queue rejections with SQLSTATE 53600. |
| angarabase_qos_blocking_inflight | gauge | Current blocking tasks across QoS shards. |
| angarabase_spawn_blocking_max | gauge | spawn_blocking thread limit from max_blocking_threads; 0 before startup init. |
| angarabase_spawn_blocking_active | gauge | Active spawn_blocking tasks. Incremented at start, decremented on completion via SpawnBlockingGuard (RM-0.6.4.19 Track C C2). |

QoS queues by level:

rate({__name__=~"angarabase_qos_queued_.*_total"}[5m])

QoS rejections by level:

rate({__name__=~"angarabase_qos_rejected_.*_total"}[5m])

Alert on any scheduler rejection:

sum(rate({__name__=~"angarabase_qos_rejected_.*_total"}[5m])) > 0

Blocking pressure:

angarabase_qos_blocking_inflight > 0

Blocking budget headroom:

angarabase_spawn_blocking_max - angarabase_spawn_blocking_active

Interpretation:

  • queued_background_total grows but no rejected_* — scheduler accepts batch workload; usually normal;
  • rejected_background_total grows — batch/ETL is too aggressive; lower concurrency or raise ANGARABASE_QOS_MAX_QUEUED;
  • rejected_critical_total grows — production incident candidate: CRITICAL workload should not regularly hit the queue cap;
  • qos_blocking_inflight > 0 together with growth in qos_blocking wait event means pressure in the blocking runtime path.
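
The name-regex queries and the headroom expression above can be mirrored on two raw scrapes. This is a sketch under stated assumptions: the function names are hypothetical, and each scrape is assumed to be a plain dict from metric name to value.

```python
import re

REJECTED_RE = re.compile(r"^angarabase_qos_rejected_(\w+)_total$")


def rejections_by_level(before: dict, after: dict) -> dict:
    """Increase of each rejected counter between two scrapes, keyed by level."""
    increases = {}
    for name, value in after.items():
        match = REJECTED_RE.match(name)
        if match:
            increases[match.group(1)] = value - before.get(name, 0.0)
    return increases


def blocking_headroom(samples: dict) -> float:
    """spawn_blocking_max minus spawn_blocking_active."""
    return (samples["angarabase_spawn_blocking_max"]
            - samples["angarabase_spawn_blocking_active"])


# Made-up scrapes for illustration.
before = {"angarabase_qos_rejected_background_total": 10.0,
          "angarabase_qos_rejected_critical_total": 0.0}
after = {"angarabase_qos_rejected_background_total": 25.0,
         "angarabase_qos_rejected_critical_total": 0.0,
         "angarabase_spawn_blocking_max": 64.0,
         "angarabase_spawn_blocking_active": 60.0}
increases = rejections_by_level(before, after)
headroom = blocking_headroom(after)
```

Here only the background level shows new rejections, which matches the "batch/ETL is too aggressive" case, while the headroom of 4 threads suggests the blocking path is close to its limit.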

Query Execution Duration Histogram (RM-0.6.5.10)

angarabase_query_exec_duration_ms — histogram of SQL query execution latency.

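Note (RM-0.6.5.10 S6): histogram_quantile(0.99) is accurate only while the result is below 10,000 ms, the highest finite bucket bound. When the quantile falls into the +Inf bucket, Prometheus returns that bound (10,000 ms) rather than the real latency, so at p99 ≥ 10,000 ms inspect the share of observations in the +Inf bucket instead. Buckets: [1, 5, 10, 50, 100, 500, 1000, 2500, 5000, 10000, +Inf] ms.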
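
To see why the estimate saturates at 10,000 ms, the sketch below re-implements the linear interpolation that histogram_quantile applies to classic (cumulative) buckets, using the bucket bounds listed above. It is an illustrative approximation, not the Prometheus source; the bucket counts are made up.

```python
import math

BOUNDS = [1, 5, 10, 50, 100, 500, 1000, 2500, 5000, 10000, math.inf]


def histogram_quantile(q: float, cumulative: list) -> float:
    """Estimate quantile q from cumulative bucket counts aligned with BOUNDS."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    for i, count in enumerate(cumulative):
        if count >= rank:
            break
    if math.isinf(BOUNDS[i]):
        # Quantile lies in +Inf: the highest finite bound is returned.
        return BOUNDS[i - 1]
    lower = BOUNDS[i - 1] if i > 0 else 0.0
    prev = cumulative[i - 1] if i > 0 else 0.0
    return lower + (BOUNDS[i] - lower) * (rank - prev) / (count - prev)


# 1000 observations, 20 of them slower than 10,000 ms.
counts = [100, 300, 450, 700, 900, 950, 960, 970, 975, 980, 1000]
p50 = histogram_quantile(0.5, counts)
p99 = histogram_quantile(0.99, counts)
inf_share = (counts[-1] - counts[-2]) / counts[-1]
```

With these counts the median interpolates to 18 ms inside the (10, 50] bucket, while p99 is reported as exactly 10,000 ms even though 2% of the queries took longer. The +Inf share is the more informative number in that situation.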

SLO-oriented usage

  • Latency: histogram_quantile() over *_bucket (p95/p99)
  • Throughput: rate() over counters
  • Errors/contention: conflict/timeout/deadlock rates
  • Saturation: backpressure counters and queue depth

Contract pinning

Must-have metric names are considered part of the operability contract and are protected by a test:

  • crates/angarabase/src/metrics.rs
  • prometheus_export_contains_must_have_metrics_names
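
The idea behind such a contract test can be sketched in a few lines. This is not the test from metrics.rs: the function name is hypothetical, and the MUST_HAVE tuple is an illustrative subset of names taken from this page, not the pinned list.

```python
# Illustrative subset only; the authoritative list lives in the pinned test.
MUST_HAVE = (
    "angarabase_txn_commit_total",
    "angarabase_checkpoint_total",
    "angarabase_memory_rss_bytes",
)


def missing_metric_names(exposition: str, required=MUST_HAVE) -> list:
    """Return required metric names that do not appear in a /metrics payload."""
    present = set()
    for line in exposition.splitlines():
        if line and not line.startswith("#"):
            # The metric name ends at the first '{' or space.
            present.add(line.split("{", 1)[0].split(" ", 1)[0])
    return [name for name in required if name not in present]


# Made-up payload that lacks one of the required names.
payload = "angarabase_txn_commit_total 10\nangarabase_memory_rss_bytes 123456\n"
missing = missing_metric_names(payload)
```

A non-empty result means a must-have name was renamed or dropped, which is exactly the regression the pinned test is meant to catch.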
