Statistics and ANALYZE

This section describes how to work with the statistics collection subsystem in AngaraBase.

ANALYZE

The ANALYZE command collects statistics about data distribution in tables, which the optimizer uses to build efficient query execution plans.

When running ANALYZE, AngaraBase uses a drift detection mechanism to minimize unnecessary writes to the system catalog. If the new distinct_estimate value changed by less than 10% compared with the current stored value, the statistics update for that column is skipped.

This avoids unnecessary I/O and cached-plan invalidation for minor data changes.

Column Statistics and distinct_estimate

The main metric for estimating selectivity is distinct_estimate (analogous to n_distinct in PostgreSQL). It estimates the number of unique values in a column.

If distinct_estimate equals the total number of rows, the column is unique.
If distinct_estimate is small compared with the row count, the column has low cardinality.

The optimizer (CBO) uses this data to calculate predicate selectivity.

Cardinality-Aware Index Scan

Starting with version 0.6.5.2, AngaraBase uses an improved algorithm for choosing between IndexScan and SeqScan that accounts for column cardinality.

Previously, the planner could incorrectly choose an index for columns with few unique values because of strict minimum-selectivity limits. That limit has now been removed, and CBO correctly computes cost for low-cardinality columns.

How it works:

The planner computes filter selectivity from distinct_estimate.
If selectivity exceeds [execution].index_cardinality_threshold (default 0.15), the planner prefers SeqScan for the corresponding gate (“low cardinality” in EXPLAIN; see also index_scan_selectivity_threshold for the “low selectivity” reason).
EXPLAIN prints the selection reason: seq scan chosen: low cardinality (0.1328).

Example: If a 1,000,000-row table has only 3 values in the status column, the selectivity of the status = 'ACTIVE' filter is about 0.33. Because 0.33 > 0.15, the database chooses a sequential scan, since reading a third of the table through an index would be slower because of random I/O.

Multicolumn Statistics

Multicolumn statistics allow the optimizer to account for correlation between several columns, which is critical for complex predicates.

New in 0.6.4.18: Multicolumn statistics collected by ANALYZE are now persisted to disk (sys_catalog snapshot protocol v4). Statistics survive server restarts. To force a fresh collection, run ANALYZE <table> again.

Viewing Statistics

Statistics are available through system views (sys_catalog).

AngaraBook

Statistics and ANALYZE

ANALYZE

Drift Detection

Column Statistics and distinct_estimate

Cardinality-Aware Index Scan

Multicolumn Statistics

Viewing Statistics

Keyboard shortcuts

AngaraBook

Statistics and ANALYZE

ANALYZE

Drift Detection

Column Statistics and distinct_estimate

Cardinality-Aware Index Scan

Multicolumn Statistics

Viewing Statistics