Query Processing

Goal

Explain how AngaraBase processes SQL queries: from text to result. Useful for understanding EXPLAIN plans and performance diagnostics.

Pipeline Overview

Each SQL query goes through four stages:

SQL text ──▸ Parsing ──▸ Planning ──▸ Optimization ──▸ Execution ──▸ Result

1. Parsing: SQL → AST

The parser converts the query text into an Abstract Syntax Tree (AST). AngaraBase uses a PostgreSQL-compatible SQL dialect.

If the syntax is not supported, the server returns SQLSTATE 0A000 (feature_not_supported) with a description of the unsupported construct.

2. Planning: AST → Logical Plan

During the planning stage:

Name resolution — matching table, column, and function names to catalog objects.
Type checking — verifying types and applying automatic coercion if needed.

The result is a logical plan describing what needs to be done, but not how.

3. Optimization (AngaraPlan)

The AngaraPlan optimizer converts the logical plan into a physical one, choosing the most efficient execution strategy.

Cost-based optimizer (CBO): decisions are made based on statistics (AngaraStat) — row counts, value distributions, index availability.

Key optimizer decisions:

Decision	Options
Access path	Full table scan, B+tree index scan, BRIN scan
Join method	Hash join, nested loop join
Join order	Reordering tables to minimize intermediate results

Robust planning: the optimizer is resilient to estimation errors — with a significant discrepancy between estimated and actual rows, the plan remains viable (avoids worst-case behavior).

LEO (Learning Optimizer): feedback loop — after query execution, actual statistics are used to improve future estimates.

4. Execution (AngaraFlow)

The AngaraFlow executor runs the physical plan in an iterator/streaming model (Volcano): each operator requests the next row from its child operator.

Core Operators:

Operator	Description
Scan	Reading rows from the heap (full scan) or an index (index scan)
Filter	Applying `WHERE` predicates
Hash Join	Joining via a hash table; Grace hash join for large datasets
Nested Loop	Joining via nested loops (for small tables or index lookups)
Group By	Aggregation (`GROUP BY`, `HAVING`)
Sort	Sorting; external sort for data that doesn’t fit in memory
Limit	Limiting the number of rows

AngaraStat (Statistics)

The optimizer uses statistics from system tables to estimate plan costs.

sys.table_stats

Column	Description
`row_count`	Estimated number of rows in the table
`mutation_epoch`	Mutation counter (to identify stale statistics)

sys.column_stats

Column	Description
`ndv`	Number of distinct values (HyperLogLog)
`min_value` / `max_value`	Value range boundaries
`null_count`	Number of NULL values
`histogram`	Value distribution (equi-height histogram)
`mcv`	Most common values (and their frequencies)

Managing Statistics Level

ALTER TABLE t SET (stats_level_max = 2);

Level	What is gathered
0	Only row_count and mutation_epoch
1	+ NDV, min/max, null_count
2	+ Histograms, MCV (reservoir sampling)
3	+ Extended statistics (reserved)

A higher level gives the optimizer more information but increases statistics gathering time.

Query Diagnostics

EXPLAIN

Viewing the execution plan without executing the query:

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

Variants:

EXPLAIN ANALYZE SELECT ...; -- executes the query, shows actual rows/time
EXPLAIN (BUFFERS) SELECT ...; -- + page read statistics
EXPLAIN (FORMAT JSON) SELECT ...; -- JSON output

System Views

View	Description
`angara_stat_activity`	Currently active queries
`angara_stat_statements`	Aggregated statistics by query type
`angara_top_queries`	Top queries by execution time

Slow Query Log

Enabled via environment variables:

ANGARABASE_LOG_MIN_DURATION_MS=100 # log queries taking longer than 100 ms
ANGARABASE_LOG_QUERY_TEXT=1 # include the query text in the log

More details — Diagnostics.

Future

Planned query processing improvements:

Component	Description
AngaraVector	Vectorized/SIMD execution — processing batches by column instead of row-at-a-time
AngaraParallel	Morsel-driven parallelism — parallel execution across multiple cores
AngaraAdapt	Adaptive processing — switching strategies during execution

Concepts (What to read next)

Indexes — how the optimizer chooses an index for SELECT/UPDATE.
Catalog and Metadata — where statistics used by the planner are stored.
Transactions and MVCC — how the isolation level affects the execution plan.

How-to (What to do)

Diagnostics — EXPLAIN ANALYZE, slow-query log, how to read a plan.
Tracing — distributed tracing of parse → plan → execute phases.
Logging — ANGARABASE_LOG_QUERY_TEXT, ANGARABASE_LOG_MIN_DURATION_MS.

Reference

SQL Overview — what parts of the standard are supported at the planner level.
System Views sys.* — sys.queries, sys.column_stats, sys.health.

Keyboard shortcuts

AngaraBook