fix: cap UploadPartCopy pump chunk at MAX_BUFFER_SIZE (concurrent-backup OOM)#101

Open
ServerSideHannes wants to merge 1 commit into main from fix/copy-chunk-memory

@ServerSideHannes

Owner

What

Cap the UploadPartCopy pump chunk at MAX_BUFFER_SIZE (8MB). This is the actual concurrent-backup OOM root cause — found by running memray on a live prod pod.

Root cause (memray on prod, then reproduced locally)

The top resident allocations on a live pod under backup load were:

64.0 MB  crypto.py:354   (re-encrypt)
64.0 MB  copy.py:398     (chunk buffer)   ← handlers/multipart/copy.py
25.7 MB  copy.py:393

i.e. the scylla dedup UploadPartCopy path, not uploads or GETs (which earlier fixes covered). _pump_copy_chunks sized its buffer from calculate_optimal_part_size, which returns up to 64MB for large sources. So each copy of a large SSTable:

  • buffers a 64MB chunk, then data = bytes(buf[:chunk_size]) copies it (another 64MB), then re-encrypts it (another 64MB) → ~150–190MB resident per copy,
  • while the request limiter reserved only copy_pipeline_peak (~32MB — it assumes 8MB streaming).
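To illustrate the second bullet's copy step: `bytes(buf[:chunk_size])` duplicates the chunk rather than referencing it, so the buffer and `data` are both resident at once. The sketch below uses a 1 MiB buffer instead of 64MB to keep it quick; the `memoryview` line is only there for contrast and is not something this PR changes.

```python
CHUNK = 1024 * 1024  # 1 MiB for a quick demo; the PR describes 64MB chunks

buf = bytearray(CHUNK)

# Slicing a bytearray copies the bytes, and bytes(...) then builds an
# independent immutable object from that slice, so `data` owns its own storage.
data = bytes(buf[:CHUNK])

# A memoryview slice, by contrast, is a window onto the same storage.
view = memoryview(buf)[:CHUNK]

buf[0] = 0xFF
copy_is_independent = data[0] == 0       # the copy did not see the write
view_shares_storage = view[0] == 0xFF    # the view did
view.release()
```

With a 64MB chunk, the same statement keeps a second 64MB object alive for as long as `data` is referenced.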

So a handful of concurrent dedup copies of ≥80MB parts blew past the pod limit even though the governor read well under budget. That is exactly the prod signature: RSS ~512 MiB (the pod limit), governed_active ~48 MB of the 64MB budget, and only a few in-flight requests.
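A back-of-the-envelope version of that mismatch, as a sketch. It assumes a simplified three-buffer model (chunk buffer, `bytes()` copy, re-encrypted output), which gives the upper end of the ~150–190MB range; `per_copy_resident` and `unaccounted` are hypothetical helpers, and MB is taken as 1024 * 1024 bytes here.

```python
MB = 1024 * 1024

# ~32MB reserved per copy (copy_pipeline_peak, per the description above).
RESERVED_PER_COPY = 32 * MB


def per_copy_resident(chunk_size: int, buffers: int = 3) -> int:
    """Approximate resident bytes for one copy under the three-buffer model."""
    return buffers * chunk_size


def unaccounted(n_copies: int, chunk_size: int) -> int:
    """Bytes resident beyond what the limiter reserved for n concurrent copies."""
    over = max(0, per_copy_resident(chunk_size) - RESERVED_PER_COPY)
    return n_copies * over


uncapped = per_copy_resident(64 * MB)  # 192 MB per copy
capped = per_copy_resident(8 * MB)     # 24 MB per copy, inside the reservation
```

Under this model three concurrent uncapped copies hold 576 MB, of which 480 MB is invisible to the limiter; with 8MB chunks the same three copies fit inside their reservations.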

Fix

chunk_size = min(calculate_optimal_part_size(...), MAX_BUFFER_SIZE) — copies now stream in 8MB chunks like every other path, matched to their reservation.
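A minimal sketch of the capped pump, to show the shape of the change rather than the actual code in handlers/multipart/copy.py. The body of `calculate_optimal_part_size` is a hypothetical stand-in (the real sizing logic is not shown in this PR text), and `capped_chunk_size` and `pump_chunks` are illustrative names.

```python
import io
from typing import BinaryIO, Iterator

MB = 1024 * 1024
MAX_BUFFER_SIZE = 8 * MB  # 8MB, per the description above


def calculate_optimal_part_size(source_size: int) -> int:
    """Hypothetical stand-in: grows with the source, up to 64MB."""
    return min(64 * MB, max(5 * MB, source_size // 4))


def capped_chunk_size(source_size: int) -> int:
    """The fix: never pump more than MAX_BUFFER_SIZE at a time."""
    return min(calculate_optimal_part_size(source_size), MAX_BUFFER_SIZE)


def pump_chunks(src: BinaryIO, source_size: int) -> Iterator[bytes]:
    """Yield the source in chunks no larger than the capped chunk size."""
    chunk_size = capped_chunk_size(source_size)
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return
        yield chunk


# Usage: a 40MB source would get a 10MB chunk from the stand-in sizing,
# but the cap holds every chunk to 8MB.
sizes = [len(c) for c in pump_chunks(io.BytesIO(bytes(40 * MB)), 40 * MB)]
```

The cap only changes the chunk size, so the total bytes pumped are the same; the source is just read in more, smaller pieces.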

Proof (local, prod config 512Mi / 64MB budget)

Concurrent UploadPartCopy of 90–120MB sources + GETs:

          peak RSS
before    511.9 MiB (the wall)
after     321 MiB

Streaming-copy round-trip tests pass; new test_copy_chunk_bounded.py pins the invariant.

Why this was missed until now

Earlier repros used CopyObject (whole-object path) and single-load-type floods, which the governor bounds to ~300MB locally. The prod driver is UploadPartCopy of large SSTables under dedup — only visible once memray attributed native/resident memory on a live pod. Stacks on #97 (copy govern), #98 (GET), #99/#100 (concurrency cap + debug mode).

Root cause of the concurrent-backup OOM, found via memray on a live pod: the top
resident allocations were copy.py:398 (64MB) + crypto.py:354 (64MB) -- the
scylla dedup UploadPartCopy path.

_pump_copy_chunks sized its buffer from calculate_optimal_part_size, which
returns up to 64MB for large sources. So each copy of a large SSTable buffered a
64MB chunk, copied it (bytes(buf[:chunk_size])) and re-encrypted it (~150-190MB
resident) while the request limiter only reserved copy_pipeline_peak (~32MB, it
assumes 8MB streaming). A handful of concurrent dedup copies of >=80MB parts
therefore blew past the pod memory limit even though the governor read well
under budget -- exactly the prod signature (RSS 512, governed ~48).

Cap the pump chunk at MAX_BUFFER_SIZE so copies stream in 8MB chunks like every
other path and stay matched to their reservation.

Reproduced locally at prod config (512Mi/64MB): concurrent UploadPartCopy of
90-120MB sources pinned RSS at 511.9MiB (the wall); with the cap the same load
peaks 321MiB. Streaming-copy round-trip tests still pass.