Index-resident, incrementally-maintained summaries (BRIN-style); remove side table + REINDEX #2

Merged: bitner merged 7 commits into main from summary-perf on Jun 23, 2026

bitner commented on Jun 22, 2026

Replaces the side-table + mark-stale + REINDEX maintenance model with self-maintaining, index-resident summaries — the way BRIN works. Follow-up to #1.

What changed

  • Summaries live in the index's own metapage (src/index_storage.rs): page I/O via the Generic WAL API + a compact versioned (de)serialization. No side table.
  • ambuild scans the leaf and writes the summary to the metapage.
  • The planner reads each partition's summary directly from its index page (cached per plan), instead of an SPI load of a side table.
  • aminsert widens the summary in place as rows arrive — like BRIN. Because the summary only ever has to be over-inclusive, these updates need no MVCC: an insert within range writes nothing; one that extends it grows min/max (scalars, in-memory) or the extent (range/geometry, via the type's union). Pruning stays correct and active across inserts with no REINDEX.
  • Removed: the table_range_summary table, the stale flag + per-transaction memo, and the sql_drop cleanup event trigger (DROP INDEX frees the summary with the index). Deletes leave the summary conservatively wide (safe); VACUUM/REINDEX re-tighten.
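
The widening rule for a scalar column can be sketched as follows. This is a minimal illustration, not the extension's code: `MinMax`, `widen`, and `may_contain` are hypothetical names, and `i64` stands in for whatever scalar type the column has. It shows why an in-range insert writes nothing and why an over-inclusive summary is still safe for pruning.

```rust
/// Hypothetical min/max summary for one scalar column.
/// (Illustrative only; not the extension's actual types.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

impl MinMax {
    /// Widen the summary so it covers `v`.
    /// Returns true only when the summary changed, i.e. when the
    /// metapage would actually need to be rewritten.
    pub fn widen(&mut self, v: i64) -> bool {
        let mut changed = false;
        if v < self.min {
            self.min = v;
            changed = true;
        }
        if v > self.max {
            self.max = v;
            changed = true;
        }
        changed
    }

    /// Pruning check for a predicate `col = v`: the partition may be
    /// skipped only when `v` lies outside the summary.
    pub fn may_contain(&self, v: i64) -> bool {
        self.min <= v && v <= self.max
    }
}
```

Under these assumptions, a delete never touches the summary: `may_contain` can then return true for a value that is no longer present, which costs a scan of that partition but never loses a row.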

Load-bearing dependency discovered

The planner only puts indisvalid indexes into rel->indexlist, which is where the new read path looks. So a table_range index must stay valid for pruning to engage. This never mattered on main (it read a side table by relid) but matters now; it's documented in read_index_summary. (Surfaced while benchmarking against a stale dev DB carrying a pre-main "hide indexes" trigger that marked indexes invalid.)

Planning-cost optimization

The pathlist hook runs once per partition and was re-resolving the btree compare proc (3 syscache lookups) and re-parsing the query constant for every partition. Both are identical across a column's partitions, so they're now memoized per top-level plan. Warm planning for a non-key predicate at 2,000 partitions dropped ~139 ms → ~80 ms (our per-partition overhead roughly halved).
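
A rough sketch of the memoization idea, under stated assumptions: `PlanMemo`, `Resolved`, and `resolve` are hypothetical names, the "compare proc" is a placeholder number rather than a real syscache result, and the constant is parsed as `i64` purely for illustration. The point is only that the expensive work runs once per (column, constant) pair instead of once per partition.

```rust
use std::collections::HashMap;

/// Stand-in for what the hook needs per column predicate (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct Resolved {
    pub compare_proc: u32, // placeholder for the resolved btree compare proc
    pub constant: i64,     // placeholder for the parsed query constant
}

/// Per-plan memo, keyed by (column attnum, constant text). Hypothetical.
#[derive(Default)]
pub struct PlanMemo {
    cache: HashMap<(i16, String), Resolved>,
    /// Counts how often the expensive path ran; for demonstration only.
    pub expensive_calls: usize,
}

impl PlanMemo {
    /// Return the memoized result, doing the expensive work on a miss.
    pub fn resolve(&mut self, attnum: i16, literal: &str) -> Option<Resolved> {
        let key = (attnum, literal.to_string());
        if let Some(hit) = self.cache.get(&key) {
            return Some(hit.clone());
        }
        self.expensive_calls += 1;
        // Placeholder for "parse the query constant".
        let constant: i64 = literal.trim().parse().ok()?;
        // Placeholder for "three syscache lookups".
        let resolved = Resolved { compare_proc: 1000 + attnum as u32, constant };
        self.cache.insert(key, resolved.clone());
        Some(resolved)
    }

    /// Drop everything at the end of the top-level plan.
    pub fn clear(&mut self) {
        self.cache.clear();
    }
}
```

Calling `resolve(2, "42")` for each of 2,000 partitions would take the expensive path once; clearing the memo between plans keeps stale entries from leaking across queries.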

Benchmarks (warm, EXPLAIN ANALYZE, pg18)

vs the old side-table design (main) — no regression. Reading each partition's summary from its index page is as cheap as main's batched side-table load; planning within ~1–3%, execution identical. Pruning's win remains execution: e.g. ~0.4 ms vs ~100 ms at 300×8k-row partitions.

vs native declarative pruning (two identical columns: pk = partition key, nk = same values, non-key + our index):

| Partitions | Native (pk) planning | Ours (nk) planning |
|---|---|---|
| 2,000 | 0.15 ms | ~80 ms (post-optimization) |
| 10,000 | 0.29 ms | ERROR: out of shared memory |

Scaling limitations (documented honestly)

Native pruning is roughly constant-time because it prunes on the partition key's sorted bounds before partitions are locked or opened. A non-key predicate forces PG to expand and lock every partition, which is O(n) and exhausts the lock table at around 10k partitions (raising max_locks_per_transaction pushes that limit out). This is inherent to PG's planner: non-key pruning can't be made sub-linear from public hooks. table_range targets the sweet spot of hundreds to low thousands of sizeable partitions queried with a non-key predicate, where it wins big on execution. Scaling to tens of thousands of partitions would need pre-expansion pruning (a core-PG hook), which is tracked separately.
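
As a rough guide to the lock-table limit, PostgreSQL documents the shared lock table as sized for about max_locks_per_transaction * (max_connections + max_prepared_transactions) objects. The sketch below only does that arithmetic. The function names are hypothetical, and the figure of two locks per partition (the partition plus one index) is an assumption for illustration; the real ceiling also depends on what other sessions hold and on slack in shared memory.

```rust
/// Nominal number of lockable objects in the shared lock table.
pub fn lock_table_slots(
    max_locks_per_transaction: u64,
    max_connections: u64,
    max_prepared_transactions: u64,
) -> u64 {
    max_locks_per_transaction * (max_connections + max_prepared_transactions)
}

/// Smallest max_locks_per_transaction that nominally lets a single query
/// lock `partitions` partitions at `locks_per_partition` locks each.
/// `backends` is max_connections + max_prepared_transactions and must be > 0.
pub fn min_max_locks(partitions: u64, locks_per_partition: u64, backends: u64) -> u64 {
    let needed = partitions * locks_per_partition;
    (needed + backends - 1) / backends // ceiling division
}
```

With the stock settings (64 locks per transaction, 100 connections, no prepared transactions) the nominal capacity is 6,400 objects; under the two-locks-per-partition assumption, 10,000 partitions would call for a setting of at least 200.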

Tests

29 tests pass on pg18 (insert-correctness tests verify out-of-range inserts widen the page summary so new rows are found and pruning stays active). Production build, clippy -D warnings, and fmt all clean. New tests cover the raw page round-trip and that ambuild persists the summary.

bitner and others added 7 commits June 22, 2026 17:07
First step toward BRIN-style, in-index incremental summaries (no side table, no
stale flag, no REINDEX). Adds index_storage.rs: write/read a length-prefixed byte
blob in the index's own metapage (block 0), updated in place and WAL-logged via the
Generic WAL API. Because a table_range summary only needs to be over-inclusive,
these page updates need no MVCC/transactionality.

Round-trip is proven by a pg_test (caught and fixed the Generic-WAL page-hole
zeroing by setting pd_lower = pd_upper). Existing 26 tests unaffected.
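
A length-prefixed blob in a page-sized buffer can be sketched like this. It is a simplified stand-in, not the contents of index_storage.rs: it works on a plain byte slice, ignores the page header fields and WAL entirely, and `write_blob`, `read_blob`, and the little-endian `u32` prefix are assumptions made for the example.

```rust
pub const PAGE_SIZE: usize = 8192; // PostgreSQL's default block size
const LEN_PREFIX: usize = 4; // assumed: u32 little-endian length prefix

/// Store `blob` at `offset` as a 4-byte length followed by the bytes.
/// Fails if the blob would not fit inside the page.
pub fn write_blob(page: &mut [u8], offset: usize, blob: &[u8]) -> Result<(), String> {
    if blob.len() > u32::MAX as usize {
        return Err("blob too large for a u32 length prefix".to_string());
    }
    let end = offset
        .checked_add(LEN_PREFIX)
        .and_then(|p| p.checked_add(blob.len()))
        .ok_or_else(|| "offset overflow".to_string())?;
    if end > page.len() {
        return Err("blob does not fit in the page".to_string());
    }
    let len = blob.len() as u32;
    page[offset..offset + LEN_PREFIX].copy_from_slice(&len.to_le_bytes());
    page[offset + LEN_PREFIX..end].copy_from_slice(blob);
    Ok(())
}

/// Read the blob back; returns None if the prefix or payload would run
/// past the end of the page (for example, on a corrupted length).
pub fn read_blob(page: &[u8], offset: usize) -> Option<&[u8]> {
    let start = offset.checked_add(LEN_PREFIX)?;
    let prefix = page.get(offset..start)?;
    let mut raw = [0u8; LEN_PREFIX];
    raw.copy_from_slice(prefix);
    let len = u32::from_le_bytes(raw) as usize;
    page.get(start..start.checked_add(len)?)
}
```

Bounds-checking the stored length on read matters because the bytes come from disk: a bad prefix should yield "no summary" rather than an out-of-range access.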

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the IndexSummary/ColSummary types (one entry per indexed column: attnum,
minmax-vs-overlap kind, type name, min/max text, null flags) and a compact,
versioned byte format persisted into the index metapage via the page-I/O layer.
Pure-Rust round-trip + bad-input tests pass on the host.

This is the shared currency for the next stages: ambuild builds an IndexSummary
and writes it; the planner reads it per partition; aminsert reads/widens/writes it.
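
To make "compact, versioned byte format" concrete, here is one possible encoding of a single column entry. The field set loosely follows the description above, but the layout, the single `has_nulls` flag, the version constant, and every function name are assumptions for illustration; the real format may differ.

```rust
const FORMAT_VERSION: u8 = 1; // assumed version tag

/// Hypothetical per-column summary entry.
#[derive(Debug, Clone, PartialEq)]
pub struct ColSummary {
    pub attnum: i16,
    pub overlap: bool, // false = min/max kind, true = overlap kind
    pub has_nulls: bool,
    pub type_name: String,
    pub min: String,
    pub max: String,
}

fn put_str(out: &mut Vec<u8>, s: &str) {
    // Sketch only: assumes strings are far shorter than u32::MAX bytes.
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

pub fn encode(c: &ColSummary) -> Vec<u8> {
    let mut out = vec![FORMAT_VERSION];
    out.extend_from_slice(&c.attnum.to_le_bytes());
    out.push(c.overlap as u8);
    out.push(c.has_nulls as u8);
    put_str(&mut out, &c.type_name);
    put_str(&mut out, &c.min);
    put_str(&mut out, &c.max);
    out
}

/// Split `n` bytes off the front of `buf`, or None if too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let data: &'a [u8] = *buf;
    if data.len() < n {
        return None;
    }
    let (head, tail) = data.split_at(n);
    *buf = tail;
    Some(head)
}

fn take_str(buf: &mut &[u8]) -> Option<String> {
    let l = take(buf, 4)?;
    let len = u32::from_le_bytes([l[0], l[1], l[2], l[3]]) as usize;
    String::from_utf8(take(buf, len)?.to_vec()).ok()
}

/// Decode one entry; rejects unknown versions, truncation, and trailing bytes.
pub fn decode(mut buf: &[u8]) -> Option<ColSummary> {
    if take(&mut buf, 1)?[0] != FORMAT_VERSION {
        return None;
    }
    let a = take(&mut buf, 2)?;
    let attnum = i16::from_le_bytes([a[0], a[1]]);
    let flags = take(&mut buf, 2)?;
    let (overlap, has_nulls) = (flags[0] != 0, flags[1] != 0);
    let type_name = take_str(&mut buf)?;
    let min = take_str(&mut buf)?;
    let max = take_str(&mut buf)?;
    if !buf.is_empty() {
        return None;
    }
    Some(ColSummary { attnum, overlap, has_nulls, type_name, min, max })
}
```

Leading with a version byte lets a reader refuse a layout it does not understand, which is the kind of bad-input case the host-side tests are described as covering.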

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… (BRIN-style)

Replace the side table + mark-stale + REINDEX model with self-maintaining,
index-resident summaries:

- ambuild now writes the per-partition summary to the index's own metapage
  (index_storage.rs), keyed by nothing but the index itself.
- The planner reads each partition's summary from its index page (cached per plan)
  instead of an SPI load of a side table.
- aminsert widens the metapage summary in place as rows arrive — no MVCC needed
  because the summary only has to be over-inclusive. An insert within range writes
  nothing; one that extends it grows min/max (scalars, in memory) or the extent
  (range/geometry, via the type's union). Pruning stays correct AND active across
  inserts with no REINDEX.
- Remove the table_range_summary side table, the stale flag + per-txn memo, and the
  sql_drop cleanup event trigger (DROP INDEX frees the summary with the index).
  Deletes leave the summary conservatively wide (safe); VACUUM/REINDEX re-tighten.

29 tests pass on pg18; production build, clippy -D warnings, and fmt all clean.
README and module docs updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…on build)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The planner only puts valid indexes into rel->indexlist, which is where the
pathlist hook reads the per-partition summary from. Note this load-bearing
dependency so a future change (or an external DDL hook) that invalidates the
index doesn't silently disable pruning.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pathlist hook runs once per partition, and for a column predicate it was
re-resolving the btree compare proc (three syscache lookups) and re-parsing the
query constant for every partition. Both are identical across all partitions of
a column, so memoize them per top-level plan (cleared in clear_cache).

Cuts our per-partition planning overhead roughly in half: at 2000 partitions,
warm planning for a non-key-column predicate drops from ~139ms to ~80ms. This
does not change the O(partitions) scaling (PG still expands every partition for
a non-key predicate) but materially widens the range of partition counts where
pruning's execution win outweighs its planning cost.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
pg_sys::TupleDescAttr is only bound on PG18 (where it became an inline C
function); on PG13-17 it is a macro bindgen does not surface, and PG18 also
moved attributes to compact_attrs. Add a version-gated att_typid() helper:
TupleDescAttr on pg18, direct .attrs access on earlier versions. Fixes the
pg16/pg17/postgis CI build failures (only pg18 was exercised locally).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bitner merged commit 5e26e77 into main on Jun 23, 2026
6 checks passed