Releases: ServerSideHannes/s3proxy-python

2026.7.1

@ServerSideHannes ServerSideHannes released this 01 Jul 17:19
bc0bd36

Fixes

  • COPY of large multipart-encrypted objects failed with InvalidTag (#104). _iter_multipart_plaintext decrypted each whole client part as a single AES-GCM seal, but a client part expands into multiple internal parts, each a sequence of independent frames. Any source whose parts held more than one frame (internal parts >8MB, e.g. ScyllaDB backups) failed to copy. The reader now walks internal parts → frames and decrypts one frame at a time, matching the GET path. Also bounds copy source-read peak memory to O(frame).
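The shape of the corrected reader can be sketched as below. This is a minimal illustration, not the project's code: the nesting (internal parts, each a sequence of independently sealed frames) comes from the note above, while `iter_plaintext`, the `decrypt_frame` callable, and the in-memory lists are hypothetical stand-ins for the real storage reads and AES-GCM calls.

```python
from typing import Callable, Iterable, Iterator


def iter_plaintext(
    internal_parts: Iterable[Iterable[bytes]],
    decrypt_frame: Callable[[bytes], bytes],
) -> Iterator[bytes]:
    """Yield plaintext one frame at a time (hypothetical helper).

    Each frame is opened on its own, so only one frame's plaintext is
    alive at once and no multi-frame part is treated as a single seal.
    """
    for frames in internal_parts:      # walk internal parts in order
        for sealed in frames:          # walk the frames inside each part
            yield decrypt_frame(sealed)
```

Because the generator hands back one frame per step, a consumer that writes each frame before asking for the next keeps peak memory proportional to the frame size rather than the part size.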

2026.6.16

@ServerSideHannes ServerSideHannes released this 01 Jul 05:54
953bcac

feat: memory debug mode (RSS vs tracked heap + top allocations) (#100)

Diagnostic to pin the s3proxy OOM root cause. Gated by S3PROXY_MEMORY_DEBUG (alias S3PROXY_TRACEMALLOC), zero overhead when unset. Every interval logs real RSS vs Python-tracked heap vs untracked gap vs governor active bytes, then the top live allocations by call site.

One dump settles which world the OOM is in:

  • large untracked gap -> C-level transport buffers (uvicorn/httptools), fix at HTTP/LB layer
  • small gap -> Python, top list names the exact line

Usage: set extraConfig { S3PROXY_MEMORY_DEBUG: "1" } and raise pod memory to ~1-2Gi so the pod survives long enough to dump; read MEMORY_DEBUG / MEMORY_DEBUG_TOP under real backup load; then revert.

No behavior change unless enabled.
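The comparison this mode reports can be approximated with the standard library. The sketch below is an assumption-laden illustration rather than the proxy's implementation: `memory_report` is a hypothetical name, the governor figure is omitted, and the caller is expected to supply the RSS value (for example from the operating system's process accounting).

```python
import tracemalloc


def memory_report(rss_bytes: int, top_n: int = 3) -> dict:
    """Compare process RSS with the Python-tracked heap (illustrative only).

    Requires tracemalloc.start() to have been called beforehand.
    """
    tracked, _peak = tracemalloc.get_traced_memory()
    snapshot = tracemalloc.take_snapshot()
    top = snapshot.statistics("lineno")[:top_n]
    return {
        "rss": rss_bytes,
        "tracked": tracked,
        # Memory the interpreter cannot account for, e.g. C-level buffers.
        "untracked_gap": max(rss_bytes - tracked, 0),
        # Largest live allocations as ("file:line", size_in_bytes) pairs.
        "top": [(str(stat.traceback[0]), stat.size) for stat in top],
    }
```

A large `untracked_gap` points away from Python code, while a small one means the `top` list is the place to look.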

2026.6.15

@ServerSideHannes ServerSideHannes released this 30 Jun 18:14
98235b5

fix(chart): cap per-pod backend concurrency at the frontproxy (maxconn) (#99)

Stops the upload-side concurrent-backup OOM (dominant cause on 2026.6.14). uvicorn buffers each in-flight request body off the socket before the app's memory limiter runs, so a backup flood piles up bodies in the HTTP server's C-level buffers (governor reads ~64MB while RSS hits 512Mi+ -> OOMKilled). That memory is ungovernable from the app layer.

Fix: haproxy now caps in-flight requests PER pod (maxconn, default 40) and queues the excess (timeout queue) instead of overrunning a pod. Chart values: frontproxy.maxConnPerPod, frontproxy.timeouts.queue.

Verified locally at prod config (512Mi/64MB, 2026.6.14 app): direct 128x16MB PUT flood OOM-killed the pod (exit 137); via haproxy maxconn 40 -> 256/256 ok, peak 335MiB, no OOM. haproxy queues rather than rejects, so clients see success.

Completes the OOM fix set: 2026.6.13 (#97 copy), 2026.6.14 (#98 streaming-GET), 2026.6.15 (#99 upload concurrency cap).
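The queue-instead-of-reject behaviour can be illustrated in Python, purely as an analogy: the actual cap lives in haproxy, not in application code. `run_capped` and its counters are hypothetical names, and the `asyncio.sleep(0)` stands in for receiving a request body.

```python
import asyncio
from typing import Tuple


async def run_capped(n_requests: int, max_conn: int) -> Tuple[int, int]:
    """Run n_requests with at most max_conn in flight; return (completed, peak)."""
    gate = asyncio.Semaphore(max_conn)
    in_flight = peak = completed = 0

    async def handle() -> None:
        nonlocal in_flight, peak, completed
        async with gate:  # excess callers wait here instead of being rejected
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for receiving the request body
            in_flight -= 1
            completed += 1

    await asyncio.gather(*(handle() for _ in range(n_requests)))
    return completed, peak
```

Calling `asyncio.run(run_capped(256, 40))` completes every request while never exceeding the cap, which is the property the haproxy queue provides in front of each pod.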

2026.6.14

@ServerSideHannes ServerSideHannes released this 30 Jun 17:33
33e4d6c

fix: hold GET memory reservation for the whole streaming-response lifetime (#98)

The dominant concurrent-backup OOM. Streaming GET responses released their memory reservation before the body was sent, so concurrent downloads ran ungoverned: each held an 8MB decrypted frame in the send buffer (N×8MB → OOMKill, exit 137) while the limiter's own reading stayed at roughly its budget.

Fix: hold the reservation for the whole stream lifetime (admission control); drop the now-redundant per-frame acquires.

Verified at prod config (512Mi/64MB): the 90-concurrent multipart GET flood that OOM-killed the pod (0/180) now completes 180/180 at ~325MiB; realistic upload+GET mix 106/106 at ~305MiB. Profiler: tracked memory 812MB→112MB, live frames 90→11.

Stacks on 2026.6.13 (#97, copy crash + copy-OOM).
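Holding a reservation for the full lifetime of a streamed response can be sketched with an async generator. Everything named here is hypothetical (`Limiter`, `stream_frames`, the byte counts); it only shows the pattern of acquiring once up front and releasing in `finally`, under the assumption of a simple byte-budget limiter.

```python
import asyncio


class Limiter:
    """Toy byte-budget limiter (hypothetical, not the project's governor)."""

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.used = 0
        self.peak = 0
        self._cond = asyncio.Condition()

    async def acquire(self, n: int) -> None:
        async with self._cond:
            await self._cond.wait_for(lambda: self.used + n <= self.budget)
            self.used += n
            self.peak = max(self.peak, self.used)

    async def release(self, n: int) -> None:
        async with self._cond:
            self.used -= n
            self._cond.notify_all()


async def stream_frames(limiter: Limiter, frames, frame_size: int):
    """Yield frames while holding one reservation until the stream ends."""
    await limiter.acquire(frame_size)  # admission control: wait before streaming
    try:
        for frame in frames:
            yield frame
            await asyncio.sleep(0)  # stand-in for the socket write
    finally:
        await limiter.release(frame_size)  # released only after the last frame
```

Create the `Limiter` inside the running event loop. With a budget of three frames, ten concurrent streams proceed three at a time and the limiter's recorded peak never exceeds the budget.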

2026.6.13

@ServerSideHannes ServerSideHannes released this 30 Jun 16:24
fab3774

fix: govern copy memory + fix passthrough-copy ClientResponse.read crash (#97)

  • Gate server-side copies (CopyObject / UploadPartCopy) through the memory limiter so a Scylla dedup flood can't OOM the pod (was: ungoverned concurrent decrypt+re-encrypt → exit 137).
  • Fix _iter_copy_source: body.content.read(n) instead of body.read(n) (aiohttp ClientResponse.read() takes no size arg → every passthrough copy 500'd with TypeError).

Verified locally: 64-concurrent copy flood at a 256MiB cap OOM-killed the pod before (0/64 ok), now peaks ~195MiB with 64/64 ok.
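The second fix comes down to which object accepts a size argument: in aiohttp, `ClientResponse.read()` returns the whole body and takes no size, while the `content` stream's `read(n)` returns at most `n` bytes. The sketch below is duck-typed so it runs without aiohttp; `iter_copy_source` here is an illustrative reimplementation, and `FakeContent` / `FakeResponse` are hypothetical stand-ins for a real response.

```python
import asyncio
import io


async def iter_copy_source(resp, chunk_size: int = 64 * 1024):
    """Yield the body in bounded chunks via resp.content.read(n)."""
    while True:
        chunk = await resp.content.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        yield chunk


class FakeContent:
    """Stand-in for the response's stream reader (assumption for the demo)."""

    def __init__(self, data: bytes) -> None:
        self._buf = io.BytesIO(data)

    async def read(self, n: int = -1) -> bytes:
        return self._buf.read(n)


class FakeResponse:
    def __init__(self, data: bytes) -> None:
        self.content = FakeContent(data)


async def collect(data: bytes, chunk_size: int) -> list:
    return [c async for c in iter_copy_source(FakeResponse(data), chunk_size)]
```

For example, `asyncio.run(collect(b"abcdefghij", 4))` yields three chunks, none larger than four bytes.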

2026.6.12

@ServerSideHannes ServerSideHannes released this 30 Jun 12:13
381ac2c

Make the s3proxy container's startup/liveness/readiness probes configurable via .Values (defaults unchanged). Lets a deployment raise the liveness timeout so a busy single-event-loop worker is not restarted under upload load (the kill -> retry -> crashloop cascade). App code identical to 2026.6.11.

2026.6.11

@ServerSideHannes ServerSideHannes released this 30 Jun 11:02
b2343b0

Fix: list responses emit LastModified as RFC3339 Z (millisecond) instead of +00:00. rclone 1.51.0 (scylla-manager-agent) rejected +00:00 with 'cannot parse "+00:00" as "Z"', failing every Scylla backup list. Completes the V1-list fix chain (#91 to #94).
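One way to produce that timestamp shape with the standard library is sketched below. `format_last_modified` is a hypothetical helper name, and the approach (normalise to UTC, then swap the offset suffix) is an assumption about how such a formatter could look, not a copy of the project's serializer.

```python
from datetime import datetime, timezone


def format_last_modified(dt: datetime) -> str:
    """Render an aware datetime as UTC with milliseconds and a literal 'Z'."""
    utc = dt.astimezone(timezone.utc)
    # isoformat() ends with "+00:00" for UTC; strict parsers expect "Z".
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")
```

For `datetime(2026, 6, 30, 11, 2, 0, 123456, tzinfo=timezone.utc)` this returns `2026-06-30T11:02:00.123Z`.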

2026.6.10

@ServerSideHannes ServerSideHannes released this 30 Jun 10:14
0cddafc

Fixes

  • Route V1 ListObjects to the list handler instead of raw-forwarding (#93). Completes the V1 fix from 2026.6.9. _dispatch_bucket was raw-forwarding any bucket GET without list-type/delete/uploads/location straight to the backend, so a V1 ListObjects (?prefix&delimiter&max-keys&encoding-type, no list-type=2) was sent verbatim to Hetzner → HTTP 400, never reaching the V1→V2 translation added in #92. A bucket GET whose query is only listing params now falls through to the list handler; genuine sub-resource GETs (acl, versioning, …) still forward.

This is the fix that actually unblocks Scylla backups and Postgres retention against Hetzner.

Image: ghcr.io/serversidehannes/s3proxy-python:2026.6.10

Chain: 2026.6.8 (#91 V2 token, #88 parallel HEAD) → 2026.6.9 (#92 V1→V2 in handler) → 2026.6.10 (#93 route V1 to handler).
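The routing rule from #93 can be sketched as a small decision function. The names `route_bucket_get` and `LISTING_PARAMS` are hypothetical, the parameter set is an assumption based on the query shown above plus the standard V1 `marker`, and the other explicitly handled sub-resources (delete, uploads, location) are left out for brevity. Treating an empty query as a V1 listing is also an assumption.

```python
# Query parameters that only shape a listing (assumed set for illustration).
LISTING_PARAMS = {"prefix", "delimiter", "max-keys", "encoding-type", "marker"}


def route_bucket_get(query: dict) -> str:
    """Pick a handler for a bucket-level GET from its query parameters."""
    keys = set(query)
    if "list-type" in keys:
        return "list_v2"
    if keys <= LISTING_PARAMS:
        # Only listing params (or none): a V1 ListObjects, handled locally.
        return "list_v1"
    # Anything else (acl, versioning, ...) is a genuine sub-resource GET.
    return "forward"
```

A request with `?prefix&delimiter&max-keys&encoding-type` now resolves to the list handler, while `?acl` still forwards to the backend.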

2026.6.9

@ServerSideHannes ServerSideHannes released this 30 Jun 09:52
cb04a8b

Fixes

  • Serve V1 ListObjects via the backend's V2 API (#92). Hetzner Object Storage only implements ListObjectsV2 and rejects legacy V1 ListObjects with HTTP 400. The proxy forwarded V1 client requests as V1, breaking every V1 client: scylla-manager's bundled rclone 1.51.0 (all Scylla backups failed at the list step) and barman-cloud-backup-delete (CNPG retention failed with BadRequest, so Postgres backups completed but old ones were never pruned). handle_list_objects_v1 now calls the backend's list_objects_v2, mapping the client's V1 marker → V2 StartAfter (stateless, lossless for recursive listings) and synthesizing NextMarker from the largest raw backend key when truncated.

Image: ghcr.io/serversidehannes/s3proxy-python:2026.6.9

Builds on 2026.6.8 (V2 continuation-token fix #91, parallel list HEAD #88).
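The V1-over-V2 translation can be sketched as follows, for recursive listings only. `list_objects_v1_via_v2` is a hypothetical name and `list_v2` is any callable that accepts boto3-style `list_objects_v2` keyword arguments and returns a dict in that style; delimiter handling and the project's own response types are not modelled.

```python
def list_objects_v1_via_v2(list_v2, bucket, prefix="", marker="", max_keys=1000):
    """Answer a V1 listing using a V2-only backend (illustrative)."""
    kwargs = {"Bucket": bucket, "Prefix": prefix, "MaxKeys": max_keys}
    if marker:
        kwargs["StartAfter"] = marker  # V1 marker maps to V2 StartAfter
    page = list_v2(**kwargs)

    contents = page.get("Contents", [])
    result = {
        "Contents": contents,
        "IsTruncated": page.get("IsTruncated", False),
    }
    if result["IsTruncated"] and contents:
        # No V2 token is kept: the next V1 marker is the largest key returned.
        result["NextMarker"] = max(obj["Key"] for obj in contents)
    return result
```

Because the next page is requested with `StartAfter` set to the previous `NextMarker`, the proxy needs no per-client state between pages.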

2026.6.8

@ServerSideHannes ServerSideHannes released this 30 Jun 07:38
86077c6

Fixes

  • Don't URL-encode V2 continuation tokens under encoding-type=url (#91). ListObjectsV2 continuation tokens are opaque cursors, not keys — the S3 spec only URL-encodes Key/Prefix/Delimiter/StartAfter, and clients never URL-decode the token. The serializer was running NextContinuationToken/ContinuationToken through _encode_key(), so under encoding-type=url a key-shaped backend token (…/data_0007.tar…%2F…) could not round-trip: the backend never advanced, the same page repeated, and clients aborted with "the same next token was received twice." This wedged CNPG barman-cloud base backups and retention on multi-page catalogs. Now emitted XML-escaped only; V1 NextMarker (a real key) is still URL-encoded.

  • Parallelize per-object HEAD on list-objects (#88). Resolving the SSE plaintext size/etag did one sequential head_object per key; a recursive list of up to max-keys objects stacked into a multi-second stall that tripped client timeouts and hung ClickHouse/Postgres backups at the S3 list step. HEADs now run concurrently, bounded by LIST_HEAD_CONCURRENCY (50), preserving output order and the per-object fallback to the listed size/etag.
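The bounded-concurrency HEAD pattern from #88 can be sketched with a semaphore and `asyncio.gather`, which returns results in input order. `resolve_heads` and the `head_object` callable are hypothetical; the default of 50 mirrors the LIST_HEAD_CONCURRENCY value quoted above, and the dict shapes are assumptions.

```python
import asyncio


async def resolve_heads(listed, head_object, concurrency: int = 50):
    """HEAD every listed object concurrently, keeping the listing order."""
    gate = asyncio.Semaphore(concurrency)

    async def one(entry):
        async with gate:  # at most `concurrency` HEADs in flight
            try:
                return await head_object(entry["Key"])
            except Exception:
                # Per-object fallback: keep the listed size/etag as-is.
                return entry

    # gather() preserves the order of its arguments in the result list.
    return await asyncio.gather(*(one(entry) for entry in listed))
```

A failed HEAD degrades only that entry, so one bad object does not fail the whole page.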

Image: ghcr.io/serversidehannes/s3proxy-python:2026.6.8
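The round-trip failure fixed in #91 is easy to reproduce with the standard library. In this sketch `encode_key` and `encode_token` are hypothetical stand-ins for the serializer's helpers, and leaving `/` unescaped in keys is an assumption; the point is only that a token already containing `%2F` changes when URL-encoded again.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def encode_key(key: str, encoding_type: str = "") -> str:
    """Keys are URL-encoded under encoding-type=url, then XML-escaped."""
    if encoding_type == "url":
        key = quote(key, safe="/")
    return escape(key)


def encode_token(token: str) -> str:
    """Continuation tokens are opaque: XML-escape only, never URL-encode."""
    return escape(token)
```

Passing a key-shaped token such as `dir/data_0007.tar%2Fx` through `encode_key(..., "url")` turns `%2F` into `%252F`, so a client that echoes it back unchanged sends the backend a cursor it never issued. `encode_token` returns the token intact.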