Statistics and ANALYZE
This section describes how to work with the statistics collection subsystem in AngaraBase.
ANALYZE
The ANALYZE command collects statistics about data distribution in tables, which the optimizer uses to build efficient query execution plans.
Drift Detection
When running ANALYZE, AngaraBase uses a drift detection mechanism to
minimize unnecessary writes to the system catalog. If the new
distinct_estimate value changed by less than 10% compared with the current
stored value, the statistics update for that column is skipped.
This avoids unnecessary I/O and cached-plan invalidation for minor data changes.
Column Statistics and distinct_estimate
The main metric for estimating selectivity is distinct_estimate (analogous to
n_distinct in PostgreSQL). It estimates the number of unique values in a column.
- If
distinct_estimateequals the total number of rows, the column is unique. - If
distinct_estimateis small compared with the row count, the column has low cardinality.
The optimizer (CBO) uses this data to calculate predicate selectivity.
Cardinality-Aware Index Scan
Starting with version 0.6.5.2, AngaraBase uses an improved algorithm for choosing between
IndexScan and SeqScan that accounts for column cardinality.
Previously, the planner could incorrectly choose an index for columns with few unique values because of strict minimum-selectivity limits. That limit has now been removed, and CBO correctly computes cost for low-cardinality columns.
How it works:
- The planner computes filter selectivity from
distinct_estimate. - If selectivity exceeds
[execution].index_cardinality_threshold(default 0.15), the planner prefersSeqScanfor the corresponding gate (“low cardinality” inEXPLAIN; see alsoindex_scan_selectivity_thresholdfor the “low selectivity” reason). EXPLAINprints the selection reason:seq scan chosen: low cardinality (0.1328).
Example:
If a 1,000,000-row table has only 3 values in the status column,
the selectivity of the status = 'ACTIVE' filter is about 0.33. Because 0.33 > 0.15,
the database chooses a sequential scan, since reading a third of the table through
an index would be slower because of random I/O.
Multicolumn Statistics
Multicolumn statistics allow the optimizer to account for correlation between several columns, which is critical for complex predicates.
New in 0.6.4.18: Multicolumn statistics collected by
ANALYZEare now persisted to disk (sys_catalog snapshot protocol v4). Statistics survive server restarts. To force a fresh collection, runANALYZE <table>again.
Viewing Statistics
Statistics are available through system views (sys_catalog).