Index-resident, incrementally-maintained summaries (BRIN-style); remove side table + REINDEX #2
Merged
First step toward BRIN-style, in-index incremental summaries (no side table, no stale flag, no REINDEX). Adds index_storage.rs: write/read a length-prefixed byte blob in the index's own metapage (block 0), updated in place and WAL-logged via the Generic WAL API. Because a table_range summary only needs to be over-inclusive, these page updates need no MVCC/transactionality. Round-trip is proven by a pg_test (caught and fixed the Generic-WAL page-hole zeroing by setting pd_lower = pd_upper). Existing 26 tests unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
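The length-prefixed blob described in this commit can be sketched in plain Rust over an in-memory page buffer. This is a minimal illustration, not the code in index_storage.rs: `PAGE_SIZE`, `HEADER_SIZE`, `write_blob`, and `read_blob` are hypothetical names, the 4-byte little-endian length prefix is an assumed layout, and the sketch does not model the Generic WAL API or the pd_lower/pd_upper handling mentioned above.

```rust
// Hypothetical constants: 8192 is PostgreSQL's default block size; HEADER_SIZE
// stands in for the page-header bytes the blob must not overwrite.
const PAGE_SIZE: usize = 8192;
const HEADER_SIZE: usize = 24;

/// Store `blob` right after the header as: u32 little-endian length, then the bytes.
fn write_blob(page: &mut [u8; PAGE_SIZE], blob: &[u8]) -> Result<(), String> {
    let end = HEADER_SIZE + 4 + blob.len();
    if end > PAGE_SIZE {
        return Err(format!("blob of {} bytes does not fit in one page", blob.len()));
    }
    let len = blob.len() as u32; // safe: blob.len() < PAGE_SIZE here
    page[HEADER_SIZE..HEADER_SIZE + 4].copy_from_slice(&len.to_le_bytes());
    page[HEADER_SIZE + 4..end].copy_from_slice(blob);
    Ok(())
}

/// Read the blob back, rejecting a length prefix that points past the page.
fn read_blob(page: &[u8; PAGE_SIZE]) -> Result<Vec<u8>, String> {
    let p = &page[HEADER_SIZE..HEADER_SIZE + 4];
    let len = u32::from_le_bytes([p[0], p[1], p[2], p[3]]) as usize;
    let end = (HEADER_SIZE + 4)
        .checked_add(len)
        .filter(|e| *e <= PAGE_SIZE)
        .ok_or_else(|| "corrupt length prefix".to_string())?;
    Ok(page[HEADER_SIZE + 4..end].to_vec())
}
```

A zeroed page reads back as an empty blob under this layout, which is one reason a real format would also carry a version or magic byte.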
Adds the IndexSummary/ColSummary types (one entry per indexed column: attnum, minmax-vs-overlap kind, type name, min/max text, null flags) and a compact, versioned byte format persisted into the index metapage via the page-I/O layer. Pure-Rust round-trip + bad-input tests pass on the host. This is the shared currency for the next stages: ambuild builds an IndexSummary and writes it; the planner reads it per partition; aminsert reads/widens/writes it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
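A versioned byte format of the kind this commit describes might look roughly like the following. The type names IndexSummary and ColSummary and the field list come from the commit message; everything else (the `Kind` enum, the single `has_nulls` flag, the exact byte layout, `encode`, `decode`, `Reader`) is an assumption made for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Kind { MinMax, Overlap }

#[derive(Debug, Clone, PartialEq)]
struct ColSummary {
    attnum: i16,
    kind: Kind,
    type_name: String,
    min: Option<String>,
    max: Option<String>,
    has_nulls: bool, // assumed: one flag; the real format may carry more
}

#[derive(Debug, Clone, PartialEq)]
struct IndexSummary { cols: Vec<ColSummary> }

const FORMAT_VERSION: u8 = 1; // hypothetical version byte

fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

fn put_opt(out: &mut Vec<u8>, s: &Option<String>) {
    match s {
        Some(v) => { out.push(1); put_str(out, v); }
        None => out.push(0),
    }
}

fn encode(s: &IndexSummary) -> Vec<u8> {
    let mut out = vec![FORMAT_VERSION];
    out.extend_from_slice(&(s.cols.len() as u16).to_le_bytes());
    for c in &s.cols {
        out.extend_from_slice(&c.attnum.to_le_bytes());
        out.push(match c.kind { Kind::MinMax => 0, Kind::Overlap => 1 });
        out.push(c.has_nulls as u8);
        put_str(&mut out, &c.type_name);
        put_opt(&mut out, &c.min);
        put_opt(&mut out, &c.max);
    }
    out
}

/// Bounds-checked cursor so truncated or corrupt input yields Err, not a panic.
struct Reader<'a> { buf: &'a [u8], pos: usize }

impl<'a> Reader<'a> {
    fn take(&mut self, n: usize) -> Result<&'a [u8], String> {
        let end = self.pos.checked_add(n).filter(|e| *e <= self.buf.len())
            .ok_or_else(|| "truncated input".to_string())?;
        let s = &self.buf[self.pos..end];
        self.pos = end;
        Ok(s)
    }
    fn read_u8(&mut self) -> Result<u8, String> { Ok(self.take(1)?[0]) }
    fn read_u16(&mut self) -> Result<u16, String> {
        let b = self.take(2)?;
        Ok(u16::from_le_bytes([b[0], b[1]]))
    }
    fn read_str(&mut self) -> Result<String, String> {
        let b = self.take(4)?;
        let n = u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize;
        String::from_utf8(self.take(n)?.to_vec()).map_err(|e| e.to_string())
    }
    fn read_opt(&mut self) -> Result<Option<String>, String> {
        match self.read_u8()? {
            0 => Ok(None),
            1 => Ok(Some(self.read_str()?)),
            t => Err(format!("bad option tag {}", t)),
        }
    }
}

fn decode(buf: &[u8]) -> Result<IndexSummary, String> {
    let mut r = Reader { buf, pos: 0 };
    let version = r.read_u8()?;
    if version != FORMAT_VERSION {
        return Err(format!("unsupported version {}", version));
    }
    let n = r.read_u16()?;
    let mut cols = Vec::new();
    for _ in 0..n {
        let attnum = r.read_u16()? as i16;
        let kind = match r.read_u8()? {
            0 => Kind::MinMax,
            1 => Kind::Overlap,
            k => return Err(format!("bad kind {}", k)),
        };
        let has_nulls = r.read_u8()? != 0;
        let type_name = r.read_str()?;
        let min = r.read_opt()?;
        let max = r.read_opt()?;
        cols.push(ColSummary { attnum, kind, type_name, min, max, has_nulls });
    }
    Ok(IndexSummary { cols })
}
```

Leading with a version byte lets a reader refuse a layout it does not understand instead of misparsing it, which matters once the bytes live in an on-disk page.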
… (BRIN-style) Replace the side table + mark-stale + REINDEX model with self-maintaining, index-resident summaries:
- ambuild now writes the per-partition summary to the index's own metapage (index_storage.rs), keyed by nothing but the index itself.
- The planner reads each partition's summary from its index page (cached per plan) instead of an SPI load of a side table.
- aminsert widens the metapage summary in place as rows arrive. No MVCC is needed because the summary only has to be over-inclusive. An insert within range writes nothing; one that extends it grows min/max (scalars, in memory) or the extent (range/geometry, via the type's union). Pruning stays correct AND active across inserts with no REINDEX.
- Remove the table_range_summary side table, the stale flag + per-txn memo, and the sql_drop cleanup event trigger (DROP INDEX frees the summary with the index). Deletes leave the summary conservatively wide (safe); VACUUM/REINDEX re-tighten.
29 tests pass on pg18; production build, clippy -D warnings, and fmt all clean. README and module docs updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
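The widening rule for a scalar min/max summary, and why over-inclusiveness keeps pruning safe, can be sketched as follows. This is a simplified model with an `i64` column; `MinMax`, `widen`, and `can_prune_eq` are hypothetical names, and the real aminsert path works on the metapage summary rather than an in-memory `Option`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct MinMax { min: i64, max: i64 }

/// Widen the summary to cover `value`. Returns true only when the summary
/// changed, i.e. when the page would need to be rewritten.
fn widen(summary: &mut Option<MinMax>, value: i64) -> bool {
    match summary {
        None => {
            *summary = Some(MinMax { min: value, max: value });
            true
        }
        Some(s) => {
            let changed = value < s.min || value > s.max;
            s.min = s.min.min(value);
            s.max = s.max.max(value);
            changed
        }
    }
}

/// A partition may be skipped for `col = value` only when `value` lies outside
/// the summarized range. A summary that is too wide merely prunes less; it can
/// never hide a row, which is why deletes may leave it untouched.
fn can_prune_eq(summary: &Option<MinMax>, value: i64) -> bool {
    match summary {
        None => false, // assumed conservative choice: no summary, no pruning
        Some(s) => value < s.min || value > s.max,
    }
}
```

An insert inside the current range returns `false` from `widen`, matching the "writes nothing" case in the commit message.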
…on build) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The planner only puts valid indexes into rel->indexlist, which is where the pathlist hook reads the per-partition summary from. Note this load-bearing dependency so a future change (or an external DDL hook) that invalidates the index doesn't silently disable pruning. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pathlist hook runs once per partition, and for a column predicate it was re-resolving the btree compare proc (three syscache lookups) and re-parsing the query constant for every partition. Both are identical across all partitions of a column, so memoize them per top-level plan (cleared in clear_cache). Cuts our per-partition planning overhead roughly in half: at 2000 partitions, warm planning for a non-key-column predicate drops from ~139ms to ~80ms. This does not change the O(partitions) scaling (PG still expands every partition for a non-key predicate) but materially widens the range of partition counts where pruning's execution win outweighs its planning cost. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
pg_sys::TupleDescAttr is only bound on PG18 (where it became an inline C function); on PG13-17 it is a macro bindgen does not surface, and PG18 also moved attributes to compact_attrs. Add a version-gated att_typid() helper: TupleDescAttr on pg18, direct .attrs access on earlier versions. Fixes the pg16/pg17/postgis CI build failures (only pg18 was exercised locally). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaces the side-table + mark-stale + REINDEX maintenance model with self-maintaining, index-resident summaries — the way BRIN works. Follow-up to #1.
What changed
- src/index_storage.rs: page I/O via the Generic WAL API plus a compact, versioned (de)serialization. No side table.
- ambuild scans the leaf and writes the summary to the metapage.
- aminsert widens the summary in place as rows arrive, like BRIN. Because the summary only ever has to be over-inclusive, these updates need no MVCC: an insert within range writes nothing; one that extends it grows min/max (scalars, in-memory) or the extent (range/geometry, via the type's union). Pruning stays correct and active across inserts with no REINDEX.
- Removed the table_range_summary table, the stale flag + per-transaction memo, and the sql_drop cleanup event trigger (DROP INDEX frees the summary with the index). Deletes leave the summary conservatively wide (safe); VACUUM/REINDEX re-tighten.

Load-bearing dependency discovered
The planner only puts indisvalid indexes into rel->indexlist, which is where the new read path looks. So a table_range index must stay valid for pruning to engage. This never mattered on main (it read a side table by relid) but matters now; it's documented in read_index_summary. (Surfaced while benchmarking against a stale dev DB carrying a pre-main "hide indexes" trigger that marked indexes invalid.)

Planning-cost optimization
The pathlist hook runs once per partition and was re-resolving the btree compare proc (3 syscache lookups) and re-parsing the query constant for every partition. Both are identical across a column's partitions, so they're now memoized per top-level plan. Warm planning for a non-key predicate at 2,000 partitions dropped ~139 ms → ~80 ms (our per-partition overhead roughly halved).
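The memoization pattern described here can be illustrated with a small per-plan cache. This sketch is not the extension's code: `PlanCache`, `Resolved`, `resolve`, and `expensive_proc_lookup` are hypothetical names, the compare-proc lookup is a stand-in for the syscache lookups, and constants are parsed as `i64` purely for simplicity.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Resolved {
    compare_proc: u32, // stand-in for the resolved btree compare function
    constant: i64,     // the parsed query constant
}

#[derive(Default)]
struct PlanCache {
    entries: HashMap<(i16, String), Resolved>,
    lookups: usize, // how many times the expensive path actually ran
}

impl PlanCache {
    /// Resolve once per (column, literal); later partitions reuse the result.
    fn resolve(&mut self, attnum: i16, literal: &str) -> Option<Resolved> {
        let key = (attnum, literal.to_string());
        if let Some(hit) = self.entries.get(&key) {
            return Some(hit.clone());
        }
        self.lookups += 1;
        let resolved = Resolved {
            compare_proc: expensive_proc_lookup(attnum),
            constant: literal.parse().ok()?,
        };
        self.entries.insert(key, resolved.clone());
        Some(resolved)
    }

    /// Called at the start of each top-level plan so entries never outlive it.
    fn clear(&mut self) {
        self.entries.clear();
        self.lookups = 0;
    }
}

/// Placeholder for the per-column work that is identical across partitions.
fn expensive_proc_lookup(_attnum: i16) -> u32 {
    42 // made-up value, not a real function OID
}
```

With this shape, a hook invoked once per partition pays the resolution cost once per column predicate, while the per-partition summary comparison still runs for every partition, consistent with the O(partitions) scaling noted in the commit message.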
Benchmarks (warm, EXPLAIN ANALYZE, pg18)

vs the old side-table design (main): no regression. Reading each partition's summary from its index page is as cheap as main's batched side-table load; planning within ~1–3%, execution identical. Pruning's win remains execution: e.g. ~0.4 ms vs ~100 ms at 300×8k-row partitions.

vs native declarative pruning (two identical columns: pk = partition key, nk = same values, non-key + our index):

[Results table not recoverable; the one surviving cell reads: ERROR: out of shared memory]

Scaling limitations (documented honestly)
Native pruning is ~constant-time because it prunes on the partition key's sorted bounds before partitions are locked/opened. A non-key predicate forces PG to expand and lock every partition, which is O(n) and exhausts the lock table around ~10k partitions (raise max_locks_per_transaction to push that out). This is inherent to PG's planner: non-key pruning can't be made sub-linear from public hooks. table_range targets the sweet spot of hundreds to low thousands of sizeable partitions queried with non-key predicates, where it wins big on execution. True tens-of-thousands scaling would need pre-expansion pruning (a core-PG hook), tracked separately.

Tests
29 tests pass on pg18 (insert-correctness tests verify out-of-range inserts widen the page summary so new rows are found and pruning stays active). Production build, clippy -D warnings, and fmt all clean. New tests cover the raw page round-trip and that ambuild persists the summary.