Match CHECK constraint exclusion planning speed (per-plan compilation + backend summary cache) #3
Merged
… constraint exclusion

Investigation into why table_range planning is slower than PostgreSQL's own constraint exclusion (the built-in way to prune on a non-key column via a data-range CHECK per partition).

How constraint exclusion works (src/backend/optimizer/util/plancat.c): relation_excluded_by_constraints -> get_relation_constraints reads each partition's CHECK expressions straight from the relcache (relation->rd_att->constr->check) via table_open(..., NoLock) -- the partition is already locked and cached from planning, so it does zero extra I/O and zero extra locking per partition, then predicate_refuted_by proves contradiction on the in-memory expression trees.

Attribution of our ~31us/partition overhead at 2,000 partitions (via a temporary diagnostic) was surprising: the index-page read+deserialize is only ~7us; the *evaluation* was ~23us -- dominated by work that is identical across every partition but was being redone for each one: btree_strategy (3 syscache lookups), getTypeInputInfo and fmgr_info setup inside every text_to_datum / OidFunctionCall, and constant rendering.

Fix: resolve those once per top-level plan and reuse across partitions --
- FMGR_MEMO: a palloc'd FmgrInfo per function, so each compare / input-function call skips fmgr_info's syscache lookup (FunctionCall2Coll / InputFunctionCall);
- INPUT_INFO_MEMO: getTypeInputInfo result per type;
- STRATEGY_MEMO: btree strategy per (operator, left type).

These caches are planner-only and cleared per plan; the aminsert path keeps using the uncached datum_cmp / text_to_datum (no cross-statement cache to invalidate).

Result at 2,000 partitions (warm, same session): full planning ~88ms -> ~43ms; eval is now effectively free (full ~= read-only ~= traversal-only). Versus CHECK constraint exclusion (~33ms) we went from ~2.6x to ~1.3x; the residual ~5us/part is the per-partition index-page read+deserialize. 29 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
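The per-plan memo idea described in this commit can be illustrated with a minimal, standalone sketch. It is not the code from this PR: `StrategyMemo`, `Strategy`, `plan_partitions`, and the OID values are hypothetical names chosen for illustration, and the `resolve` closure stands in for the real syscache lookups. The sketch only shows the shape of the technique: resolve once per (operator, left type), reuse across partitions, and clear at the end of the plan.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a resolved btree strategy number.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Strategy(pub u16);

/// Per-plan memo keyed by (operator OID, left type OID).
/// Assumption: OIDs are modelled as plain u32 values.
pub struct StrategyMemo {
    resolved: HashMap<(u32, u32), Strategy>,
    /// How many times the slow path ran; kept only to make the saving visible.
    pub slow_lookups: usize,
}

impl StrategyMemo {
    pub fn new() -> Self {
        StrategyMemo { resolved: HashMap::new(), slow_lookups: 0 }
    }

    /// Return the memoized strategy, calling `resolve` only on the first
    /// request for a given (operator, left type) pair.
    pub fn get_or_resolve<F>(&mut self, op: u32, left_type: u32, resolve: F) -> Strategy
    where
        F: FnOnce(u32, u32) -> Strategy,
    {
        if let Some(s) = self.resolved.get(&(op, left_type)) {
            return *s;
        }
        self.slow_lookups += 1;
        let s = resolve(op, left_type);
        self.resolved.insert((op, left_type), s);
        s
    }

    /// Drop all entries; in this sketch it is called once per top-level plan.
    pub fn clear(&mut self) {
        self.resolved.clear();
    }
}

/// Simulate evaluating the same predicate against `partitions` partitions.
/// The operator and type OIDs (97, 23) are arbitrary illustrative values.
pub fn plan_partitions(memo: &mut StrategyMemo, partitions: usize) -> usize {
    let mut evaluated = 0;
    for _ in 0..partitions {
        let _strategy = memo.get_or_resolve(97, 23, |_op, _ty| Strategy(1));
        evaluated += 1;
    }
    evaluated
}
```

In this sketch, planning over 2,000 partitions runs the slow resolver once instead of 2,000 times. Clearing the memo at the end of each plan means no cross-statement state needs invalidating, which is the reason the commit gives for keeping these caches planner-only.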
…traint exclusion

Backend-lifetime cache (src/summary_cache.rs) keyed by index OID, so warm/repeated plans skip the per-partition index open + metapage read + deserialize entirely and serve the summary from memory (shared via Rc with the per-plan cache).

Coherence rests on the over-inclusive invariant: a cached summary is safe as long as it is never *narrower* than the data. Only aminsert widens a summary; when it does it calls CacheInvalidateRelcacheByRelid on the index, and a registered relcache callback drops the cached copy in every backend (locally at the next command boundary, in other backends at the widening txn's commit -- matching row visibility). Operations that only narrow (delete, vacuum re-tighten) need no invalidation. Widen-invalidations are coalesced to one per index per transaction so bulk loads don't thrash.

Result at 2,000 partitions (warm, same session): planning ~43ms -> ~34ms, which now matches -- and slightly beats -- CHECK constraint exclusion (~37ms), because we serve a cached summary while constraint exclusion re-parses each CHECK every plan. At 300 partitions, pruning-on planning (~4ms) is now ~equal to pruning-off. Combined with the earlier per-plan compilation, per-partition planning cost fell from ~31us to ~3-4us. A cold first plan still reads each page; every plan after is cached. 29 tests pass. README performance/scaling sections and benchmark numbers updated accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
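A rough sketch of the cache shape described in this commit follows. It is a standalone model, not the contents of src/summary_cache.rs: `Summary`, `SummaryCache`, and their methods are hypothetical names. The summary is assumed to be a simple min/max range, and the `load` closure stands in for the index open + metapage read + deserialize. `invalidate` plays the role of the registered relcache callback.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical summary: an inclusive value range that must cover every row.
#[derive(Debug, Clone, PartialEq)]
pub struct Summary {
    pub min: i64,
    pub max: i64,
}

impl Summary {
    /// True when no row can fall in [lo, hi]. This is only sound while the
    /// summary is at least as wide as the data (the over-inclusive invariant).
    pub fn excludes(&self, lo: i64, hi: i64) -> bool {
        hi < self.min || lo > self.max
    }
}

/// Backend-lifetime cache keyed by index OID (modelled here as u32).
pub struct SummaryCache {
    entries: HashMap<u32, Rc<Summary>>,
    /// Counts simulated page reads so the saving is observable.
    pub page_reads: usize,
}

impl SummaryCache {
    pub fn new() -> Self {
        SummaryCache { entries: HashMap::new(), page_reads: 0 }
    }

    /// Serve from memory when possible; otherwise run `load` (the stand-in
    /// for reading and deserializing the index page) and remember the result.
    pub fn get_or_load<F>(&mut self, index_oid: u32, load: F) -> Rc<Summary>
    where
        F: FnOnce() -> Summary,
    {
        if let Some(s) = self.entries.get(&index_oid) {
            return Rc::clone(s);
        }
        self.page_reads += 1;
        let s = Rc::new(load());
        self.entries.insert(index_oid, Rc::clone(&s));
        s
    }

    /// Drop the cached copy; in this sketch it is what the invalidation
    /// callback would call after an insert widens the on-disk summary.
    pub fn invalidate(&mut self, index_oid: u32) {
        self.entries.remove(&index_oid);
    }
}
```

The design point the sketch tries to capture is asymmetry: a stale summary that is too wide only costs a missed pruning opportunity, while one that is too narrow could wrongly exclude a partition. That is why only widening needs to drop the entry.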
Goal of this branch: make non-key pruning as cheap to plan as PostgreSQL's own constraint exclusion. Done: table_range now plans on par with (warm, slightly faster than) CHECK constraint exclusion, via two optimizations grounded in a review of the Postgres source.

How constraint exclusion works (the target)
relation_excluded_by_constraints → get_relation_constraints (plancat.c) reads each partition's CHECK from the relcache (rd_att->constr->check) via table_open(relid, NoLock). The partition is already locked and cached, so this does zero extra I/O per partition. It then proves contradiction with predicate_refuted_by (predtest.c) on in-memory expression trees. Per-partition cost ~5 µs.

Where our time went (measured)
A diagnostic attribution at 2,000 partitions split our original ~31 µs/partition: the index-page read was only ~7 µs; evaluation was ~23 µs, dominated by work identical across all partitions but redone per partition: btree_strategy catalog lookups, getTypeInputInfo / fmgr_info setup inside every parse/compare, and constant rendering.

Two optimizations
1. Per-plan compilation (commit 1). Resolve the compare function, type-input function, and operator strategy once per plan (cached FmgrInfos + memos), reused across partitions. Planner-only; the aminsert path keeps the uncached helpers. Planning: ~88 ms → ~43 ms; eval becomes effectively free.
2. Backend summary cache (commit 2, src/summary_cache.rs). Cache each index's deserialized summary for the life of the backend, so warm/repeated plans skip the per-partition index open + metapage read + deserialize entirely (shared via Rc with the per-plan cache). Coherence rests on the over-inclusive invariant: a cached summary is safe unless it is narrower than reality. Only aminsert widens; when it does it sends a relcache invalidation, and a registered callback drops the cached copy everywhere: locally at the next command boundary, in other backends at the widening transaction's commit (matching row visibility). Narrowing operations (delete, vacuum re-tighten) need no invalidation. Widen-invalidations are coalesced to one per index per transaction so bulk loads don't thrash. Planning: ~43 ms → ~34 ms.

Result (2,000 partitions, warm, same session)
Planning time: ~88 ms originally, ~43 ms after commit 1, ~34 ms after commit 2; CHECK constraint exclusion: ~37 ms.

Per-partition planning cost fell from ~31 µs to ~3–4 µs, on par with constraint exclusion, and warm we actually beat it (we serve a cached summary; constraint exclusion re-parses each CHECK every plan). At 300 partitions, pruning-on planning (~4 ms) is now ~equal to pruning-off. A cold first plan still reads each page; every plan after is cached.

29 tests pass (including the insert-correctness tests, which exercise widen→invalidate); production build, clippy, and fmt clean. README performance/scaling sections and bench/benchmark.sql updated, including the CHECK constraint-exclusion comparison.

Tradeoff (documented)
The cache benefits read-mostly/repeated planning. For append-heavy workloads where nearly every insert widens the summary, the per-widen relcache invalidation is real cost (coalesced per transaction). Correctness is unconditional; the perf benefit is workload-dependent.
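
To make the "coalesced per transaction" cost concrete, here is a small standalone sketch of one way to coalesce widen-invalidations. It is an assumption about the shape of the technique, not the code in this PR: `WidenCoalescer`, `on_widen`, and `on_txn_end` are hypothetical names, and the counter stands in for actually sending a relcache invalidation.

```rust
use std::collections::HashSet;

/// Tracks which indexes have already had an invalidation queued in the
/// current transaction. Index OIDs are modelled as u32.
pub struct WidenCoalescer {
    sent_this_txn: HashSet<u32>,
    /// Stand-in for invalidation messages actually sent.
    pub invalidations_sent: usize,
}

impl WidenCoalescer {
    pub fn new() -> Self {
        WidenCoalescer { sent_this_txn: HashSet::new(), invalidations_sent: 0 }
    }

    /// Called when an insert widens the summary of `index_oid`.
    /// Returns true only for the first widen of that index in this
    /// transaction, which is the only one that sends an invalidation.
    pub fn on_widen(&mut self, index_oid: u32) -> bool {
        if self.sent_this_txn.insert(index_oid) {
            self.invalidations_sent += 1;
            true
        } else {
            false
        }
    }

    /// Called at commit or abort; the next transaction starts fresh.
    pub fn on_txn_end(&mut self) {
        self.sent_this_txn.clear();
    }
}
```

Under this model, a bulk load that widens one index a thousand times inside a single transaction sends one invalidation, while a workload of many small widening transactions pays one per transaction. That second case is the append-heavy cost described above.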