Instance Lifecycle

This document explains the conceptual model of AngaraBase instance identity, lifecycle, and the Instance Lease system that enables safe crash recovery and storage portability.

Instance Identity

Each AngaraBase instance has a unique identity established during initialization:

Core Identity Components

  • cluster_id: UUID identifying the logical database cluster
  • instance_id: UUID identifying this specific instance
  • Data directory: Physical location of database files
  • Transaction log directory: Physical location of WAL files

Identity Persistence

Identity is stored in two places:

  1. VERSION marker: Binary file with format version and IDs
  2. System catalog pages: In base.adb reserved pages with full metadata

Instance Lease System

The Instance Lease prevents multiple instances from accessing the same data files simultaneously, which would cause corruption.

Lease Structure

pub struct InstanceLeaseV0 {
    pub holder_id: String,        // UUID of owning instance
    pub acquired_at_unix_s: u64,  // When lease was taken
    pub expires_at_unix_s: u64,   // When lease expires (TTL)
    pub holder_pid: u32,          // Process ID (diagnostic)
    pub holder_hostname: String,  // Hostname (diagnostic)
}
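
To make the role of the timestamp fields concrete, here is a minimal sketch of how expiry and renewal could be computed from them. The `new`, `is_expired`, `renew`, and `now_unix_s` helpers are hypothetical and are not part of the documented AngaraBase API; the sketch also assumes a lease counts as expired as soon as the current time reaches `expires_at_unix_s`.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Mirror of the lease fields documented above.
#[derive(Debug, Clone, PartialEq)]
pub struct InstanceLeaseV0 {
    pub holder_id: String,
    pub acquired_at_unix_s: u64,
    pub expires_at_unix_s: u64,
    pub holder_pid: u32,
    pub holder_hostname: String,
}

impl InstanceLeaseV0 {
    /// Hypothetical constructor: expiry is the acquisition time plus the TTL.
    pub fn new(holder_id: &str, now_unix_s: u64, ttl_s: u64, pid: u32, hostname: &str) -> Self {
        Self {
            holder_id: holder_id.to_string(),
            acquired_at_unix_s: now_unix_s,
            expires_at_unix_s: now_unix_s.saturating_add(ttl_s),
            holder_pid: pid,
            holder_hostname: hostname.to_string(),
        }
    }

    /// Assumption: the lease is expired once `now` reaches `expires_at_unix_s`.
    pub fn is_expired(&self, now_unix_s: u64) -> bool {
        now_unix_s >= self.expires_at_unix_s
    }

    /// Hypothetical heartbeat: push the expiry forward, keep the acquisition time.
    pub fn renew(&mut self, now_unix_s: u64, ttl_s: u64) {
        self.expires_at_unix_s = now_unix_s.saturating_add(ttl_s);
    }
}

/// Current wall-clock time in whole seconds since the Unix epoch.
pub fn now_unix_s() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```

Passing the current time in as a parameter, rather than reading the clock inside `is_expired`, keeps the logic easy to test with fixed timestamps.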

Lease State Machine

 [None] ──acquire──> [Held] ──heartbeat──> [Held]
    ↑                  │                      │
    └──expired/release─┘                      │
                                              │
 [Expired] <──────────timeout─────────────────┘
    │
    └──takeover──> [Held by new instance]

State Transitions

  1. None → Held: First instance startup or after graceful shutdown
  2. Held → Held: Periodic heartbeat updates (every 10s by default)
  3. Held → None: Graceful shutdown releases lease immediately
  4. Held → Expired: Heartbeat stops (crash, network partition)
  5. Expired → Held: New instance takes over after TTL expiration
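
The five transitions listed above can be written as a small transition function. This is only an illustrative sketch: the `LeaseState` and `LeaseEvent` enums and the `next_state` function are hypothetical names, and any state/event pair not in the list is treated as invalid.

```rust
/// Lease states from the state machine above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LeaseState {
    None,
    Held,
    Expired,
}

/// Events that drive the transitions (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LeaseEvent {
    Acquire,
    Heartbeat,
    Release,
    Timeout,
    Takeover,
}

/// Returns the next state, or `None` if the pair is not one of the
/// five documented transitions.
pub fn next_state(state: LeaseState, event: LeaseEvent) -> Option<LeaseState> {
    match (state, event) {
        // 1. None -> Held: first startup or after graceful shutdown
        (LeaseState::None, LeaseEvent::Acquire) => Some(LeaseState::Held),
        // 2. Held -> Held: periodic heartbeat
        (LeaseState::Held, LeaseEvent::Heartbeat) => Some(LeaseState::Held),
        // 3. Held -> None: graceful shutdown releases the lease
        (LeaseState::Held, LeaseEvent::Release) => Some(LeaseState::None),
        // 4. Held -> Expired: heartbeat stopped and the TTL ran out
        (LeaseState::Held, LeaseEvent::Timeout) => Some(LeaseState::Expired),
        // 5. Expired -> Held: a new instance takes over
        (LeaseState::Expired, LeaseEvent::Takeover) => Some(LeaseState::Held),
        // Everything else is outside the documented state machine.
        _ => None,
    }
}
```

For example, `next_state(LeaseState::Held, LeaseEvent::Timeout)` yields `Some(LeaseState::Expired)`, while a heartbeat with no lease held yields `None`.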

Lease Storage

  • Location: Stored in SysCatalogMetaV0 within base.adb pages
  • Persistence: Atomic updates with full page images
  • Reliability: Works on NFS/SAN where flock() is unreliable

Startup Sequence

Phase 1: Pre-flight Checks

  1. Verify data directory exists and is initialized
  2. Check VERSION marker compatibility
  3. Validate page size matches compiled binary

Phase 2: Lease Acquisition

  1. Load system catalog from base.adb
  2. Check existing lease status:
     • No lease: Acquire immediately
     • Expired lease: Take over with warning
     • Active lease: Fail with informative error
     • Force takeover: Override active lease (dangerous)
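
The four outcomes of the lease check can be sketched as a pure decision function. The `ExistingLease` and `LeaseDecision` types and the `decide` function are hypothetical names used for illustration; the order of the checks (expiry first, then the force flag) is an assumption, not documented behavior.

```rust
/// Minimal view of a lease found in the catalog; only the fields
/// this decision needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExistingLease {
    pub holder_id: String,
    pub expires_at_unix_s: u64,
}

/// Outcome of the lease check during startup (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LeaseDecision {
    /// No lease present: acquire immediately.
    Acquire,
    /// Lease expired: take over and log a warning.
    TakeOverExpired { previous_holder: String },
    /// Lease still active, but the operator forced a takeover (dangerous).
    ForceTakeover { previous_holder: String },
    /// Lease still active: fail startup with an informative error.
    Refuse { holder: String, remaining_s: u64 },
}

pub fn decide(existing: Option<&ExistingLease>, now_unix_s: u64, force: bool) -> LeaseDecision {
    match existing {
        None => LeaseDecision::Acquire,
        Some(l) if now_unix_s >= l.expires_at_unix_s => LeaseDecision::TakeOverExpired {
            previous_holder: l.holder_id.clone(),
        },
        Some(l) if force => LeaseDecision::ForceTakeover {
            previous_holder: l.holder_id.clone(),
        },
        // Here now < expires_at, so the subtraction cannot underflow.
        Some(l) => LeaseDecision::Refuse {
            holder: l.holder_id.clone(),
            remaining_s: l.expires_at_unix_s - now_unix_s,
        },
    }
}
```

Reporting the remaining seconds in the refusal case is one way to make the startup error informative, since it tells the operator how long to wait before retrying.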

Phase 3: Recovery

  1. WAL Recovery: Replay transaction log (file_bin backend)
  2. MVCC Recovery: Restore in-memory transaction state
  3. Heartbeat Start: Begin periodic lease renewal

Phase 4: Ready for Connections

  1. Start protocol listeners (pgwire, admin)
  2. Begin accepting client connections
  3. Continue heartbeat until shutdown

Recovery Modes

AngaraBase tracks the recovery mode for operational visibility:

Normal Startup

  • Clean start on existing, properly shut down data
  • No WAL replay required
  • recovery_mode = "normal"

Crash Recovery

  • Previous instance terminated unexpectedly
  • WAL replay recovers committed transactions
  • MVCC state rebuilt from transaction log
  • recovery_mode = "crash_recovery"

Forced Takeover

  • Operator used ANGARABASE_FORCE_LEASE_TAKEOVER=1
  • May indicate emergency recovery scenario
  • recovery_mode = "forced_takeover"
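
As a rough illustration of how the three modes relate, the sketch below maps two startup observations to a `recovery_mode` value. The `RecoveryMode` enum, the `classify` function, and its two boolean inputs are hypothetical; in particular, letting a forced takeover take precedence over crash recovery is an assumption and may not match the actual implementation.

```rust
/// The three recovery modes described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryMode {
    Normal,
    CrashRecovery,
    ForcedTakeover,
}

impl RecoveryMode {
    /// String form as documented for `recovery_mode`.
    pub fn as_str(self) -> &'static str {
        match self {
            RecoveryMode::Normal => "normal",
            RecoveryMode::CrashRecovery => "crash_recovery",
            RecoveryMode::ForcedTakeover => "forced_takeover",
        }
    }
}

/// Hypothetical classification from two startup observations.
/// Assumption: a forced takeover is reported even if the previous
/// instance also terminated uncleanly.
pub fn classify(forced_takeover: bool, clean_shutdown: bool) -> RecoveryMode {
    if forced_takeover {
        RecoveryMode::ForcedTakeover
    } else if !clean_shutdown {
        RecoveryMode::CrashRecovery
    } else {
        RecoveryMode::Normal
    }
}
```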

Shared Storage Scenarios

The Instance Lease system enables AngaraBase to work correctly on shared storage where multiple hosts can access the same files.

NFS/SAN Deployment

Host A ──┐
         ├── NFS/SAN ──> [data/] [txlog/]
Host B ──┘               [base.adb with lease]

Benefits

  • Failover: Host B can take over if Host A crashes
  • Maintenance: Move instance between hosts without dump/restore
  • Testing: Run against production data copies safely

Limitations

  • Single writer: Only one instance can write at a time
  • Network partitions: A partition that blocks heartbeat updates can let the lease expire while its holder is still running (false expiration)
  • Performance: Network storage latency affects throughput

File Copy Scenarios

For non-shared storage, manual file copy enables:

  • Backup testing: Verify backup integrity on different host
  • Development: Use production data copy for debugging
  • Migration: Move to new hardware without downtime

Configuration

Lease Timing

  • ANGARABASE_LEASE_TTL_S: How long lease lasts (default: 30s)
  • ANGARABASE_LEASE_HEARTBEAT_S: Renewal frequency (default: 10s)

Safety Controls

  • ANGARABASE_FORCE_LEASE_TAKEOVER: Emergency override (default: false; set to 1 to force a takeover)

Example lease timing settings:

# Production: Longer TTL for network stability
export ANGARABASE_LEASE_TTL_S=60
export ANGARABASE_LEASE_HEARTBEAT_S=20

# Development: Shorter TTL for faster iteration 
export ANGARABASE_LEASE_TTL_S=15
export ANGARABASE_LEASE_HEARTBEAT_S=5
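
For illustration, the two timing variables could be read and sanity-checked as follows. This is a sketch, not AngaraBase's actual configuration code: `LeaseTiming`, `parse_lease_timing`, and `lease_timing_from_env` are hypothetical names, and the rule that the heartbeat interval must be non-zero and shorter than the TTL is an assumption (both example profiles above use a heartbeat of one third of the TTL).

```rust
use std::env;

/// Lease timing in seconds (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LeaseTiming {
    pub ttl_s: u64,
    pub heartbeat_s: u64,
}

/// Parse an optional raw value, falling back to the documented default.
fn parse_or_default(name: &str, raw: Option<&str>, default: u64) -> Result<u64, String> {
    match raw {
        None => Ok(default),
        Some(s) => s
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("{name}={s:?} is not a valid number of seconds: {e}")),
    }
}

/// Build the timing from raw strings; defaults are 30s TTL and 10s heartbeat.
pub fn parse_lease_timing(ttl: Option<&str>, heartbeat: Option<&str>) -> Result<LeaseTiming, String> {
    let ttl_s = parse_or_default("ANGARABASE_LEASE_TTL_S", ttl, 30)?;
    let heartbeat_s = parse_or_default("ANGARABASE_LEASE_HEARTBEAT_S", heartbeat, 10)?;
    // Assumption: a heartbeat that is zero or not shorter than the TTL
    // could never renew the lease in time, so reject it.
    if heartbeat_s == 0 || heartbeat_s >= ttl_s {
        return Err(format!(
            "heartbeat ({heartbeat_s}s) must be non-zero and shorter than TTL ({ttl_s}s)"
        ));
    }
    Ok(LeaseTiming { ttl_s, heartbeat_s })
}

/// Read both variables from the process environment.
pub fn lease_timing_from_env() -> Result<LeaseTiming, String> {
    let ttl = env::var("ANGARABASE_LEASE_TTL_S").ok();
    let heartbeat = env::var("ANGARABASE_LEASE_HEARTBEAT_S").ok();
    parse_lease_timing(ttl.as_deref(), heartbeat.as_deref())
}
```

Separating the parsing from the environment lookup lets the validation be exercised with plain strings, without mutating process-wide state.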

Monitoring and Observability

Instance Status

-- Check current lease holder
SELECT lease_holder_id, lease_holder_hostname, 
 lease_expires_at, recovery_mode 
FROM sys.identity;

-- Check system health
SELECT uptime_seconds, txn_commit_epoch_current 
FROM sys.health;

Lease Events

AngaraBase logs lease events to stderr:

Instance lease acquired: holder=abc123...
Instance lease taken over: holder=def456...
Warning: lease heartbeat failed: I/O error
Instance lease released: holder=abc123...

Metrics Integration

Future versions will expose lease metrics via:

  • Prometheus metrics endpoint
  • sys.metrics virtual table
  • Structured logging output

Security Considerations

Access Control

  • Lease system does NOT provide authentication
  • File system permissions still required
  • Network access controls recommended for shared storage

Audit Trail

  • Lease changes logged with timestamps
  • Instance identity tracked in sys.identity
  • Recovery mode visible for forensics

Troubleshooting

Common Issues

“Cannot start: database files are owned by another instance”

  • Diagnosis: Active lease prevents startup
  • Resolution: Wait for the lease to expire, or verify that the other instance is dead before forcing a takeover

Frequent lease takeovers

  • Diagnosis: Network instability or resource contention
  • Resolution: Increase TTL, check network/disk performance

“MVCC recovery failed”

  • Diagnosis: Corrupted transaction log
  • Resolution: Check filesystem, restore from backup if needed

Debug Information

-- Instance identity and lease
SELECT * FROM sys.identity;

-- Recent recovery statistics 
SELECT * FROM sys.health;

-- Transaction log status
SELECT * FROM sys.settings WHERE name LIKE 'transaction_log.%';

Related Sections

Concepts (what to read next)

How-to (what to do)

Reference