
Match CHECK constraint exclusion planning speed (per-plan compilation + backend summary cache)#3

Merged: bitner merged 2 commits into main from prune-perf on Jun 23, 2026


@bitner (Owner) commented Jun 23, 2026

Goal of this branch: make non-key pruning as cheap to plan as PostgreSQL's own constraint exclusion. Done: table_range now plans on par with CHECK constraint exclusion (and slightly faster when warm), via two optimizations grounded in a review of the Postgres source.

How constraint exclusion works (the target)

relation_excluded_by_constraints → get_relation_constraints (plancat.c) reads each partition's CHECK from the relcache (rd_att->constr->check) via table_open(relid, NoLock). The partition is already locked and cached, so this does zero extra I/O per partition. It then proves contradiction with predicate_refuted_by (predtest.c) on in-memory expression trees. Per-partition cost: ~5 µs.
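To make "proves contradiction" concrete, here is a toy sketch of refutation for the simplest case: a partition whose CHECK bounds a column to a closed range, and a query predicate comparing that column to a constant. The types `RangeCheck` and `Pred` and the function `refuted` are hypothetical names for illustration only; the real predicate_refuted_by works on arbitrary expression trees and operator classes, not on this simplified shape.

```rust
/// Hypothetical stand-in for a partition's data-range CHECK:
/// `lo <= col AND col <= hi` (assumes lo <= hi).
#[derive(Debug, Clone, Copy)]
struct RangeCheck {
    lo: i64,
    hi: i64,
}

/// Hypothetical query predicate on the same column: `col <op> constant`.
#[derive(Debug, Clone, Copy)]
enum Pred {
    Lt(i64),
    Le(i64),
    Eq(i64),
    Ge(i64),
    Gt(i64),
}

/// Returns true when no value inside the CHECK range can satisfy the
/// predicate, so the partition can be skipped without reading it.
fn refuted(check: RangeCheck, pred: Pred) -> bool {
    match pred {
        Pred::Lt(c) => check.lo >= c, // smallest possible value is already >= c
        Pred::Le(c) => check.lo > c,
        Pred::Eq(c) => c < check.lo || c > check.hi,
        Pred::Ge(c) => check.hi < c, // largest possible value is still < c
        Pred::Gt(c) => check.hi <= c,
    }
}
```

For a partition constrained to 10..=20, a query for `col > 20` is refuted (the partition is pruned), while `col >= 20` is not.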

Where our time went (measured)

A diagnostic attribution at 2,000 partitions split our original ~31 µs/partition: the index-page read was only ~7 µs; evaluation was ~23 µs, dominated by work identical across all partitions but redone per partition: btree_strategy catalog lookups, getTypeInputInfo/fmgr_info setup inside every parse/compare, and constant rendering.

Two optimizations

1. Per-plan compilation (commit 1). Resolve the compare function, type-input function, and operator strategy once per plan (cached FmgrInfos + memos) and reuse them across partitions. Planner-only; the aminsert path keeps the uncached helpers. Result: planning ~88 ms → ~43 ms; eval becomes effectively free.

2. Backend summary cache (commit 2, src/summary_cache.rs). Cache each index's deserialized summary for the life of the backend, so warm/repeated plans skip the per-partition index open + metapage read + deserialize entirely (shared via Rc with the per-plan cache). Coherence rests on the over-inclusive invariant: a cached summary is safe unless it is narrower than reality. Only aminsert widens; when it does, it sends a relcache invalidation, and a registered callback drops the cached copy everywhere: locally at the next command boundary, and in other backends at the widening transaction's commit (matching row visibility). Narrowing operations (delete, vacuum re-tighten) need no invalidation. Widen-invalidations are coalesced to one per index per transaction so bulk loads don't thrash. Result: planning ~43 ms → ~34 ms.
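The cache-plus-invalidation shape described in item 2 can be sketched roughly as follows. This is a minimal, standalone illustration and not the contents of src/summary_cache.rs: `Summary`, `SummaryCache`, `get_or_load`, and `invalidate` are hypothetical names, a `u32` stands in for the index OID, and the `load` closure stands in for the index open + metapage read + deserialize.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical deserialized summary: the value range an index covers.
#[derive(Debug, PartialEq)]
struct Summary {
    min: i64,
    max: i64,
}

/// Backend-lifetime cache keyed by index OID (u32 is a stand-in here).
#[derive(Default)]
struct SummaryCache {
    entries: RefCell<HashMap<u32, Rc<Summary>>>,
}

impl SummaryCache {
    /// Return the cached summary, or run `load` once and remember the result.
    /// The returned Rc can be shared with a per-plan cache without copying.
    fn get_or_load(&self, index_oid: u32, load: impl FnOnce() -> Summary) -> Rc<Summary> {
        // Clone the Rc out so the shared borrow ends before any mutation.
        let cached = self.entries.borrow().get(&index_oid).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        let fresh = Rc::new(load());
        self.entries
            .borrow_mut()
            .insert(index_oid, Rc::clone(&fresh));
        fresh
    }

    /// What an invalidation callback would do after a widening insert:
    /// drop the cached copy so the next plan reloads it.
    fn invalidate(&self, index_oid: u32) {
        self.entries.borrow_mut().remove(&index_oid);
    }
}
```

Because only widening can make a cached summary unsafe, `invalidate` needs to run only on that path; a stale entry after a delete is merely wider than necessary, which costs a missed pruning opportunity but never a wrong result.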

Result (2,000 partitions, warm, same session)

| | Planning |
|---|---|
| CHECK constraint exclusion | ~37 ms |
| table_range before this branch | ~84 ms |
| table_range after | ~34 ms |
| no pruning (baseline expansion) | ~26 ms |

Per-partition planning cost fell from ~31 µs to ~3–4 µs, which is on par with constraint exclusion; warm, we actually beat it (we serve a cached summary; constraint exclusion re-parses each CHECK every plan). At 300 partitions, pruning-on planning (~4 ms) is now roughly equal to pruning-off. A cold first plan still reads each page; every plan after is cached.
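One way to relate the per-partition figures to the planning times is sketched below. It assumes per-partition cost is the planning time above the no-pruning baseline, spread evenly over the partitions; the PR does not state how its figures were derived, and `per_partition_us` is a hypothetical helper.

```rust
/// Per-partition planning overhead in microseconds, ASSUMING it equals the
/// planning time above the no-pruning baseline divided by partition count.
fn per_partition_us(planning_ms: f64, baseline_ms: f64, partitions: u32) -> f64 {
    // Convert milliseconds to microseconds before dividing.
    (planning_ms - baseline_ms) * 1000.0 / f64::from(partitions)
}
```

Under that assumption, at 2,000 partitions with the ~26 ms baseline, ~88 ms of planning works out to 31 µs per partition, ~34 ms to 4 µs, and ~37 ms (CHECK constraint exclusion) to 5.5 µs, which is consistent with the figures quoted in this description.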

29 tests pass (including the insert-correctness tests, which exercise widen→invalidate); production build, clippy, and fmt clean. README performance/scaling sections and bench/benchmark.sql updated, including the CHECK constraint-exclusion comparison.

Tradeoff (documented)

The cache benefits read-mostly/repeated planning. For append-heavy workloads where nearly every insert widens the summary, the per-widen relcache invalidation is real cost (coalesced per transaction). Correctness is unconditional; the perf benefit is workload-dependent.
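The per-transaction coalescing that bounds this cost can be sketched as a set of indexes that have already been invalidated in the current transaction. This is an illustrative assumption about the shape of the mechanism, not the branch's code: `WidenCoalescer` and its methods are hypothetical names, and a `u32` stands in for the index OID.

```rust
use std::collections::HashSet;

/// Tracks which indexes have already sent a widen-invalidation in the
/// current transaction (hypothetical type for illustration).
#[derive(Default)]
struct WidenCoalescer {
    sent_this_txn: HashSet<u32>,
}

impl WidenCoalescer {
    /// Returns true only for the first widening insert on this index in the
    /// transaction; later widens on the same index are coalesced away.
    fn should_invalidate(&mut self, index_oid: u32) -> bool {
        // HashSet::insert returns true when the value was not already present.
        self.sent_this_txn.insert(index_oid)
    }

    /// Reset at transaction end so the next transaction can invalidate again.
    fn end_transaction(&mut self) {
        self.sent_this_txn.clear();
    }
}
```

With this shape, a bulk load of 10,000 widening inserts into one index triggers a single invalidation per transaction, while an append-heavy workload of many small transactions still pays one invalidation each.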

bitner and others added 2 commits June 23, 2026 10:48
… constraint exclusion

Investigation into why table_range planning is slower than PostgreSQL's own
constraint exclusion (the built-in way to prune on a non-key column via a
data-range CHECK per partition).

How constraint exclusion works (src/backend/optimizer/util/plancat.c):
relation_excluded_by_constraints -> get_relation_constraints reads each
partition's CHECK expressions straight from the relcache
(relation->rd_att->constr->check) via table_open(..., NoLock) -- the partition
is already locked and cached from planning, so it does zero extra I/O and zero
extra locking per partition, then predicate_refuted_by proves contradiction on
the in-memory expression trees.

Attribution of our ~31us/partition overhead at 2,000 partitions (via a temporary
diagnostic) was surprising: the index-page read+deserialize is only ~7us; the
*evaluation* was ~23us -- dominated by work that is identical across every
partition but was being redone for each one: btree_strategy (3 syscache lookups),
getTypeInputInfo and fmgr_info setup inside every text_to_datum / OidFunctionCall,
and constant rendering.

Fix: resolve those once per top-level plan and reuse across partitions --
- FMGR_MEMO: a palloc'd FmgrInfo per function, so each compare / input-function
  call skips fmgr_info's syscache lookup (FunctionCall2Coll / InputFunctionCall);
- INPUT_INFO_MEMO: getTypeInputInfo result per type;
- STRATEGY_MEMO: btree strategy per (operator, left type).
These caches are planner-only and cleared per plan; the aminsert path keeps using
the uncached datum_cmp / text_to_datum (no cross-statement cache to invalidate).
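As a rough, standalone illustration of the memo pattern (not the actual STRATEGY_MEMO code), the sketch below memoizes a strategy lookup per (operator, left type) and clears it per plan. `StrategyMemo`, the `lookup` closure standing in for the syscache lookups, and the `lookups_done` counter are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for Postgres OIDs and btree strategy numbers.
type Oid = u32;
type Strategy = u16;

/// Per-plan memo of the operator strategy, keyed by (operator, left type).
struct StrategyMemo<F: Fn(Oid, Oid) -> Strategy> {
    lookup: F, // the expensive catalog lookup being memoized
    memo: HashMap<(Oid, Oid), Strategy>,
    lookups_done: usize, // counts real lookups, for demonstration only
}

impl<F: Fn(Oid, Oid) -> Strategy> StrategyMemo<F> {
    fn new(lookup: F) -> Self {
        StrategyMemo {
            lookup,
            memo: HashMap::new(),
            lookups_done: 0,
        }
    }

    /// Resolve once, then serve every later partition from the memo.
    fn strategy(&mut self, operator: Oid, left_type: Oid) -> Strategy {
        if let Some(&cached) = self.memo.get(&(operator, left_type)) {
            return cached;
        }
        let resolved = (self.lookup)(operator, left_type);
        self.lookups_done += 1;
        self.memo.insert((operator, left_type), resolved);
        resolved
    }

    /// Called at the start of each top-level plan, so nothing outlives a plan.
    fn clear(&mut self) {
        self.memo.clear();
    }
}
```

Evaluating the same operator against 2,000 partitions then performs one real lookup instead of 2,000, and clearing per plan means there is no cross-statement state to invalidate.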

Result at 2,000 partitions (warm, same session): full planning ~88ms -> ~43ms;
eval is now effectively free (full ~= read-only ~= traversal-only). Versus CHECK
constraint exclusion (~33ms) we went from ~2.6x to ~1.3x; the residual ~5us/part
is the per-partition index-page read+deserialize. 29 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…traint exclusion

Backend-lifetime cache (src/summary_cache.rs) keyed by index OID, so warm/repeated
plans skip the per-partition index open + metapage read + deserialize entirely and
serve the summary from memory (shared via Rc with the per-plan cache).

Coherence rests on the over-inclusive invariant: a cached summary is safe as long as
it is never *narrower* than the data. Only aminsert widens a summary; when it does it
calls CacheInvalidateRelcacheByRelid on the index, and a registered relcache callback
drops the cached copy in every backend (locally at the next command boundary, in other
backends at the widening txn's commit -- matching row visibility). Operations that only
narrow (delete, vacuum re-tighten) need no invalidation. Widen-invalidations are
coalesced to one per index per transaction so bulk loads don't thrash.

Result at 2,000 partitions (warm, same session): planning ~43ms -> ~34ms, which now
matches -- and slightly beats -- CHECK constraint exclusion (~37ms), because we serve
a cached summary while constraint exclusion re-parses each CHECK every plan. At 300
partitions, pruning-on planning (~4ms) is now ~equal to pruning-off. Combined with the
earlier per-plan compilation, per-partition planning cost fell from ~31us to ~3-4us.
A cold first plan still reads each page; every plan after is cached. 29 tests pass.

README performance/scaling sections and benchmark numbers updated accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bitner changed the title from "Compile predicate evaluation per plan (~2x faster planning; close gap to CHECK exclusion)" to "Match CHECK constraint exclusion planning speed (per-plan compilation + backend summary cache)" on Jun 23, 2026
@bitner merged commit 5210c50 into main on Jun 23, 2026
6 checks passed