Releases: ServerSideHannes/s3proxy-python
2026.7.1
Fixes
- COPY of large multipart-encrypted objects failed with InvalidTag (#104). _iter_multipart_plaintext decrypted each whole client part as a single AES-GCM seal, but a client part expands into multiple internal parts, each a sequence of independent frames. Any source whose parts held more than one frame (internal parts >8MB, e.g. ScyllaDB backups) failed to copy. The reader now walks internal parts → frames and decrypts one frame at a time, matching the GET path. This also bounds copy source-read peak memory to O(frame).
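A rough sketch of the internal parts → frames walk, for illustration only. The names iter_plaintext and open_frame and the list-of-lists layout are assumptions, not the proxy's actual reader or storage format; open_frame stands in for a single AES-GCM open of one frame.

```python
from typing import Callable, Iterable, Iterator


def iter_plaintext(
    internal_parts: Iterable[Iterable[bytes]],
    open_frame: Callable[[bytes], bytes],
) -> Iterator[bytes]:
    """Yield plaintext one frame at a time.

    internal_parts: hypothetical layout where each internal part is a
    sequence of independently sealed frames.
    open_frame: hypothetical callable that authenticates and decrypts ONE frame.
    """
    for part in internal_parts:
        for sealed_frame in part:
            # Each frame is opened on its own, so only one frame's
            # plaintext is held at a time: peak memory stays O(frame).
            yield open_frame(sealed_frame)
```

Opening a whole client part in one call would be the failure mode described above: the authentication tag of one frame cannot validate a buffer that spans several frames.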
2026.6.16
feat: memory debug mode (RSS vs tracked heap + top allocations) (#100)
Diagnostic to pin the s3proxy OOM root cause. Gated by S3PROXY_MEMORY_DEBUG (alias S3PROXY_TRACEMALLOC), zero overhead when unset. Every interval logs real RSS vs Python-tracked heap vs untracked gap vs governor active bytes, then the top live allocations by call site.
One dump settles which world the OOM is in:
- large untracked gap -> C-level transport buffers (uvicorn/httptools), fix at HTTP/LB layer
- small gap -> Python, top list names the exact line
Usage: set extraConfig { S3PROXY_MEMORY_DEBUG: "1" } and raise pod memory to ~1-2Gi so the pod survives long enough to dump; read MEMORY_DEBUG / MEMORY_DEBUG_TOP under real backup load; then revert.
No behavior change unless enabled.
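The RSS-vs-tracked-heap comparison can be approximated with the standard library. This is a minimal sketch, not the proxy's debug mode: memory_report and read_rss_bytes are hypothetical names, and the RSS read assumes a Linux /proc filesystem (it returns 0 elsewhere).

```python
import os
import tracemalloc


def read_rss_bytes() -> int:
    """Current resident set size from /proc (Linux only); 0 if unavailable."""
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
    except (OSError, ValueError, IndexError):
        return 0
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def memory_report(top_n: int = 5) -> dict:
    """Compare real RSS with the Python-tracked heap and list top call sites."""
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    tracked, _peak = tracemalloc.get_traced_memory()
    rss = read_rss_bytes()
    top = tracemalloc.take_snapshot().statistics("lineno")[:top_n]
    return {
        "rss": rss,
        "tracked": tracked,
        # Memory the interpreter cannot account for (e.g. C-level buffers).
        "untracked_gap": max(rss - tracked, 0),
        "top": [(str(stat.traceback), stat.size) for stat in top],
    }
```

A large untracked_gap points away from Python allocations; a small one means the top list is the place to look.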
2026.6.15
fix(chart): cap per-pod backend concurrency at the frontproxy (maxconn) (#99)
Stops the upload-side concurrent-backup OOM (dominant cause on 2026.6.14). uvicorn buffers each in-flight request body off the socket before the app's memory limiter runs, so a backup flood piles up bodies in the HTTP server's C-level buffers (governor reads ~64MB while RSS hits 512Mi+ -> OOMKilled). That memory is ungovernable from the app layer.
Fix: haproxy now caps in-flight requests PER pod (maxconn, default 40) and queues the excess (timeout queue) instead of overrunning a pod. Chart values: frontproxy.maxConnPerPod, frontproxy.timeouts.queue.
Verified locally at prod config (512Mi/64MB, 2026.6.14 app): direct 128x16MB PUT flood OOM-killed the pod (exit 137); via haproxy maxconn 40 -> 256/256 ok, peak 335MiB, no OOM. haproxy queues rather than rejects, so clients see success.
Completes the OOM fix set: 2026.6.13 (#97 copy), 2026.6.14 (#98 streaming-GET), 2026.6.15 (#99 upload concurrency cap).
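The cap itself lives in haproxy, not in application code. The sketch below only illustrates the queue-instead-of-reject behaviour using an asyncio.Semaphore; run_capped is a hypothetical name and the sleep is a stand-in for real request work.

```python
import asyncio

MAX_CONN = 40  # mirrors the documented frontproxy.maxConnPerPod default


async def run_capped(n_requests: int, max_conn: int = MAX_CONN) -> tuple[int, int]:
    """Run n_requests fake requests; return (completed, peak_in_flight)."""
    slots = asyncio.Semaphore(max_conn)
    in_flight = 0
    peak = 0
    completed = 0

    async def handle() -> None:
        nonlocal in_flight, peak, completed
        async with slots:  # excess requests wait here instead of failing
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the real request work
            in_flight -= 1
            completed += 1

    await asyncio.gather(*(handle() for _ in range(n_requests)))
    return completed, peak
```

Every request eventually completes, and the number in flight never exceeds max_conn, which is the property the haproxy cap provides per pod.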
2026.6.14
fix: hold GET memory reservation for the whole streaming-response lifetime (#98)
The dominant concurrent-backup OOM. Streaming GET responses released their memory reservation before the body was sent, so concurrent downloads ran ungoverned: each held an 8MB decrypted frame in the send buffer (N×8MB → OOMKill, exit 137) while the limiter read ~budget.
Fix: hold the reservation for the whole stream lifetime (admission control); drop the now-redundant per-frame acquires.
Verified at prod config (512Mi/64MB): the 90-concurrent multipart GET flood that OOM-killed the pod (0/180) now completes 180/180 at ~325MiB; realistic upload+GET mix 106/106 at ~305MiB. Profiler: tracked memory 812MB→112MB, live frames 90→11.
Stacks on 2026.6.13 (#97, copy crash + copy-OOM).
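One way to picture "hold the reservation for the whole stream lifetime" is an async generator that reserves once and releases in a finally block. ByteBudget and stream_frames are hypothetical stand-ins, not the proxy's governor or GET handler.

```python
import asyncio
from typing import AsyncIterator, Iterable


class ByteBudget:
    """Hypothetical byte-budget limiter (not the proxy's actual governor)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.reserved = 0
        self.peak = 0
        self._cond = asyncio.Condition()

    async def acquire(self, n: int) -> None:
        async with self._cond:
            # Wait until the reservation fits (callers must keep n <= limit).
            await self._cond.wait_for(lambda: self.reserved + n <= self.limit)
            self.reserved += n
            self.peak = max(self.peak, self.reserved)

    async def release(self, n: int) -> None:
        async with self._cond:
            self.reserved -= n
            self._cond.notify_all()


async def stream_frames(
    budget: ByteBudget, frames: Iterable[bytes], frame_size: int
) -> AsyncIterator[bytes]:
    """Reserve once, hold while the body is sent, release when the stream ends."""
    await budget.acquire(frame_size)
    try:
        for frame in frames:
            yield frame
            await asyncio.sleep(0)  # stand-in for the client draining the socket
    finally:
        await budget.release(frame_size)
```

Because admission happens before the first frame and release happens after the last, the number of concurrent streams is bounded by the budget rather than by how quickly frames are handed to the send buffer.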
2026.6.13
fix: govern copy memory + fix passthrough-copy ClientResponse.read crash (#97)
- Gate server-side copies (CopyObject / UploadPartCopy) through the memory limiter so a Scylla dedup flood can't OOM the pod (was: ungoverned concurrent decrypt+re-encrypt → exit 137).
- Fix _iter_copy_source: body.content.read(n) instead of body.read(n) (aiohttp ClientResponse.read() takes no size arg → every passthrough copy 500'd with TypeError).
Verified locally: 64-concurrent copy flood at a 256MiB cap OOM-killed the pod before (0/64 ok), now peaks ~195MiB with 64/64 ok.
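The read fix can be sketched as a chunked reader over an aiohttp-style response. iter_source_chunks and the 1MiB default are illustrative assumptions; the relevant fact is that sized reads go through resp.content (a StreamReader), since ClientResponse.read() takes no size argument.

```python
from typing import Any, AsyncIterator


async def iter_source_chunks(
    resp: Any, chunk_size: int = 1024 * 1024
) -> AsyncIterator[bytes]:
    """Yield the response body in bounded chunks.

    resp is expected to look like aiohttp.ClientResponse: resp.content.read(n)
    returns up to n bytes, and an empty bytes object at end of stream.
    """
    while True:
        chunk = await resp.content.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        yield chunk
```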
2026.6.12
Make the s3proxy container's startup/liveness/readiness probes configurable via .Values (defaults unchanged). Lets a deployment raise the liveness timeout so a busy single-event-loop worker is not restarted under upload load (the kill -> retry -> crashloop cascade). App code identical to 2026.6.11.
2026.6.11
2026.6.10
Fixes
- Route V1 ListObjects to the list handler instead of raw-forwarding (#93). Completes the V1 fix from 2026.6.9.
_dispatch_bucket was raw-forwarding any bucket GET without list-type/delete/uploads/location straight to the backend, so a V1 ListObjects (?prefix&delimiter&max-keys&encoding-type, no list-type=2) was sent verbatim to Hetzner → HTTP 400, never reaching the V1→V2 translation added in #92. A bucket GET whose query is only listing params now falls through to the list handler; genuine sub-resource GETs (acl, versioning, …) still forward.
This is the fix that actually unblocks Scylla backups and Postgres retention against Hetzner.
Image: ghcr.io/serversidehannes/s3proxy-python:2026.6.10
Chain: 2026.6.8 (#91 V2 token, #88 parallel HEAD) → 2026.6.9 (#92 V1→V2 in handler) → 2026.6.10 (#93 route V1 to handler).
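The routing rule can be pictured as a small predicate over the query keys. This is a simplified sketch, not the real _dispatch_bucket: route_bucket_get and the handler labels are hypothetical, marker is added here as a standard V1 listing parameter, and treating an empty query as a V1 listing is an assumption.

```python
LISTING_PARAMS = {"prefix", "delimiter", "max-keys", "encoding-type", "marker"}


def route_bucket_get(query: dict[str, str]) -> str:
    """Return which (hypothetical) handler a bucket GET should reach."""
    keys = set(query)
    if "list-type" in keys:
        return "list_v2"
    if keys <= LISTING_PARAMS:
        # Only listing params (or none): treat as a V1 ListObjects request.
        return "list_v1"
    # Anything else (acl, versioning, uploads, location, ...) is a sub-resource.
    return "forward"
```

For example, route_bucket_get({"prefix": "a/", "delimiter": "/"}) selects the V1 list path, while route_bucket_get({"acl": ""}) is still forwarded.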
2026.6.9
Fixes
- Serve V1 ListObjects via the backend's V2 API (#92). Hetzner Object Storage only implements ListObjectsV2 and rejects legacy V1 ListObjects with HTTP 400. The proxy forwarded V1 client requests as V1, breaking every V1 client: scylla-manager's bundled rclone 1.51.0 (all Scylla backups failed at the list step) and barman-cloud-backup-delete (CNPG retention failed with BadRequest, so Postgres backups completed but old ones were never pruned). handle_list_objects_v1 now calls the backend's list_objects_v2, mapping the client's V1 marker → V2 StartAfter (stateless, lossless for recursive listings) and synthesizing NextMarker from the largest raw backend key when truncated.
Image: ghcr.io/serversidehannes/s3proxy-python:2026.6.9
Builds on 2026.6.8 (V2 continuation-token fix #91, parallel list HEAD #88).
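A minimal sketch of the marker → StartAfter mapping for the recursive (no delimiter) case. list_v1_via_v2 is a hypothetical name, and the backend is passed in as a callable that accepts boto3-style list_objects_v2 keyword arguments; the real handler_list_objects_v1 logic may differ.

```python
from typing import Any, Callable


def list_v1_via_v2(
    list_objects_v2: Callable[..., dict[str, Any]],
    bucket: str,
    prefix: str = "",
    marker: str = "",
    max_keys: int = 1000,
) -> dict[str, Any]:
    """Answer a V1 listing using a V2-only backend (illustrative only)."""
    kwargs: dict[str, Any] = {"Bucket": bucket, "Prefix": prefix, "MaxKeys": max_keys}
    if marker:
        kwargs["StartAfter"] = marker  # V1 marker -> V2 StartAfter
    page = list_objects_v2(**kwargs)

    contents = page.get("Contents", [])
    result: dict[str, Any] = {
        "Contents": contents,
        "IsTruncated": page.get("IsTruncated", False),
        "Marker": marker,
    }
    if result["IsTruncated"] and contents:
        # Synthesize NextMarker from the largest key on this page.
        result["NextMarker"] = max(obj["Key"] for obj in contents)
    return result
```

Because the client's next request carries NextMarker back as marker, no continuation token has to be stored between requests, which is what makes the mapping stateless.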
2026.6.8
Fixes
- Don't URL-encode V2 continuation tokens under encoding-type=url (#91). ListObjectsV2 continuation tokens are opaque cursors, not keys: the S3 spec only URL-encodes Key/Prefix/Delimiter/StartAfter, and clients never URL-decode the token. The serializer was running NextContinuationToken/ContinuationToken through _encode_key(), so under encoding-type=url a key-shaped backend token (…/data_0007.tar → …%2F…) could not round-trip: the backend never advanced, the same page repeated, and clients aborted with "the same next token was received twice." This wedged CNPG barman-cloud base backups and retention on multi-page catalogs. Tokens are now emitted XML-escaped only; V1 NextMarker (a real key) is still URL-encoded.
- Parallelize per-object HEAD on list-objects (#88). Resolving the SSE plaintext size/etag did one sequential head_object per key; a recursive list of up to max-keys objects stacked into a multi-second stall that tripped client timeouts and hung ClickHouse/Postgres backups at the S3 list step. HEADs now run concurrently, bounded by LIST_HEAD_CONCURRENCY (50), preserving output order and the per-object fallback to the listed size/etag.
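The token rule from #91 can be sketched as a per-field serializer. render_field is a hypothetical helper, not the proxy's serializer, and encoding "/" (safe="") is an assumption chosen to match the %2F example above.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Key-like fields that are URL-encoded under encoding-type=url.
URL_ENCODED_FIELDS = {"Key", "Prefix", "Delimiter", "StartAfter", "NextMarker"}


def render_field(name: str, value: str, encoding_type: str = "") -> str:
    """Render one list-response XML element (hypothetical helper)."""
    if encoding_type == "url" and name in URL_ENCODED_FIELDS:
        value = quote(value, safe="")
    # Every value is XML-escaped; continuation tokens get this and nothing else.
    return f"<{name}>{escape(value)}</{name}>"
```

With this split, a token such as a/data_0007.tar is returned to the client byte-for-byte (apart from XML escaping), so sending it back as ContinuationToken advances the listing.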
Image: ghcr.io/serversidehannes/s3proxy-python:2026.6.8
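The bounded, order-preserving HEAD fan-out from #88 might look roughly like this. resolve_sizes and the head_object callable signature are assumptions for illustration; only the LIST_HEAD_CONCURRENCY value of 50 comes from the notes above.

```python
import asyncio
from typing import Any, Awaitable, Callable

LIST_HEAD_CONCURRENCY = 50  # value stated in the release notes


async def resolve_sizes(
    listed: list[dict[str, Any]],
    head_object: Callable[[str], Awaitable[dict[str, Any]]],
    limit: int = LIST_HEAD_CONCURRENCY,
) -> list[dict[str, Any]]:
    """HEAD every listed key concurrently, at most `limit` at a time."""
    gate = asyncio.Semaphore(limit)

    async def resolve(entry: dict[str, Any]) -> dict[str, Any]:
        async with gate:
            try:
                head = await head_object(entry["Key"])
            except Exception:
                return entry  # fall back to the listed size/etag
        return {**entry, **head}

    # gather() returns results in input order, regardless of completion order.
    return list(await asyncio.gather(*(resolve(e) for e in listed)))
```

The semaphore bounds backend load, while gather keeps the listing order stable, so the response looks the same as the sequential version apart from latency.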