fix: cap UploadPartCopy pump chunk at MAX_BUFFER_SIZE (concurrent-backup OOM) #101
Open
ServerSideHannes wants to merge 1 commit into
Root cause of the concurrent-backup OOM, found via memray on a live pod: the top resident allocations were copy.py:398 (64MB) + crypto.py:354 (64MB) -- the scylla dedup UploadPartCopy path. _pump_copy_chunks sized its buffer from calculate_optimal_part_size, which returns up to 64MB for large sources. So each copy of a large SSTable buffered a 64MB chunk, copied it (bytes(buf[:chunk_size])) and re-encrypted it (~150-190MB resident) while the request limiter only reserved copy_pipeline_peak (~32MB, it assumes 8MB streaming). A handful of concurrent dedup copies of >=80MB parts therefore blew past the pod memory limit even though the governor read well under budget -- exactly the prod signature (RSS 512, governed ~48). Cap the pump chunk at MAX_BUFFER_SIZE so copies stream in 8MB chunks like every other path and stay matched to their reservation. Reproduced locally at prod config (512Mi/64MB): concurrent UploadPartCopy of 90-120MB sources pinned RSS at 511.9MiB (the wall); with the cap the same load peaks 321MiB. Streaming-copy round-trip tests still pass.
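The mismatch between what a copy held resident and what the limiter reserved can be put in rough numbers. The following is a back-of-envelope sketch only: the constant and function names are illustrative and not taken from the codebase, and the "three copies of the chunk" model (read buffer, bytes(...) copy, re-encrypted output) is an assumption that approximates the ~150-190MB figure above.

```python
MB = 1024 * 1024

# Figures quoted in the commit message; the names are illustrative only.
UNCAPPED_CHUNK = 64 * MB     # largest chunk the pump buffered before the fix
CAPPED_CHUNK = 8 * MB        # MAX_BUFFER_SIZE
RESERVED_PER_COPY = 32 * MB  # approximate copy_pipeline_peak reservation
POD_LIMIT = 512 * MB


def resident_per_copy(chunk_size: int, live_copies: int = 3) -> int:
    """Rough resident bytes for one in-flight copy.

    Assumes three chunk-sized allocations are alive at once: the read
    buffer, the bytes(...) copy, and the re-encrypted output.
    """
    return chunk_size * live_copies


def copies_that_fit(per_copy: int, limit: int = POD_LIMIT) -> int:
    """How many concurrent copies fit under the limit, ignoring all other memory."""
    return limit // per_copy


if __name__ == "__main__":
    for label, chunk in (("uncapped", UNCAPPED_CHUNK), ("capped", CAPPED_CHUNK)):
        per_copy = resident_per_copy(chunk)
        print(
            f"{label}: ~{per_copy // MB}MB resident per copy, "
            f"{RESERVED_PER_COPY // MB}MB reserved, "
            f"{copies_that_fit(per_copy)} copies fit in {POD_LIMIT // MB}MB"
        )
```

Under this model an uncapped copy holds several times its reservation, so the governor undercounts by a wide margin, while a capped copy stays within the reservation.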
What
Cap the UploadPartCopy pump chunk at MAX_BUFFER_SIZE (8MB). This is the actual root cause of the concurrent-backup OOM, found by running memray on a live prod pod.

Root cause (memray on prod, then reproduced locally)
The top resident allocations on a live pod under backup load were:
- copy.py:398 (64MB)
- crypto.py:354 (64MB)

i.e. the scylla dedup UploadPartCopy path, not uploads or GETs (which earlier fixes covered). _pump_copy_chunks sized its buffer from calculate_optimal_part_size, which returns up to 64MB for large sources. So each copy of a large SSTable:
- buffers a 64MB chunk,
- data = bytes(buf[:chunk_size]) copies it (another 64MB), then re-encrypts it (another 64MB) → ~150–190MB resident per copy,
- while the request limiter reserved only copy_pipeline_peak (~32MB; it assumes 8MB streaming).

So a handful of concurrent dedup copies of ≥80MB parts blew past the pod limit even though the governor read well under budget. That is exactly the prod signature (RSS ~512, governed_active ~48, few in-flight).

Fix
chunk_size = min(calculate_optimal_part_size(...), MAX_BUFFER_SIZE): copies now stream in 8MB chunks like every other path, matched to their reservation.

Proof (local, prod config 512Mi / 64MB budget)
Concurrent UploadPartCopy of 90–120MB sources + GETs:
- without the cap: RSS pinned at 511.9MiB (the wall)
- with the cap: the same load peaks at 321MiB

Streaming-copy round-trip tests pass; new test_copy_chunk_bounded.py pins the invariant.

Why this was missed until now
Earlier repros used CopyObject (whole-object path) and single-load-type floods, which the governor bounds to ~300MB locally. The prod driver is UploadPartCopy of large SSTables under dedup, which only became visible once memray attributed native/resident memory on a live pod. Stacks on #97 (copy govern), #98 (GET), #99/#100 (concurrency cap + debug mode).
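For readers without the diff open, the one-line cap from the Fix section and the invariant it establishes can be sketched roughly as follows. This is a minimal illustration, not the project's code: calculate_optimal_part_size here is a hypothetical stand-in (the real signature and sizing logic are not shown in this PR; the only property borrowed is that it returns up to 64MB for large sources), and capped_chunk_size and chunk_lengths are helper names invented for the example.

```python
from typing import Iterator

MB = 1024 * 1024
MAX_BUFFER_SIZE = 8 * MB  # value quoted in this PR


def calculate_optimal_part_size(source_size: int) -> int:
    """Hypothetical stand-in for the real helper.

    Only mirrors the property described in the PR: large sources can
    yield a part size of up to 64MB.
    """
    return min(source_size, 64 * MB)


def capped_chunk_size(source_size: int) -> int:
    """The fix: never let the pump chunk exceed MAX_BUFFER_SIZE."""
    return min(calculate_optimal_part_size(source_size), MAX_BUFFER_SIZE)


def chunk_lengths(total: int, chunk_size: int) -> Iterator[int]:
    """Yield the length of each successive chunk when streaming `total` bytes."""
    offset = 0
    while offset < total:
        length = min(chunk_size, total - offset)
        yield length
        offset += length


if __name__ == "__main__":
    for source_size in (90 * MB, 120 * MB):
        lengths = list(chunk_lengths(source_size, capped_chunk_size(source_size)))
        # Invariant: no chunk is larger than MAX_BUFFER_SIZE, and nothing is lost.
        assert max(lengths) <= MAX_BUFFER_SIZE
        assert sum(lengths) == source_size
        print(f"{source_size // MB}MB source -> {len(lengths)} chunks")
```

Taking the min rather than changing calculate_optimal_part_size keeps the cap local to the pump, so other callers of the sizing helper are unaffected under this sketch's assumptions.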