HYPERFLEET-1306 - fix: add preStop hook and rollout strategy to API deployment by tirthct · Pull Request #268 · openshift-hyperfleet/hyperfleet-api

tirthct · 2026-07-01T04:00:33Z

Summary

Add preStop lifecycle hook (sleep 5) to the API deployment to prevent EOF errors during rolling updates
Add explicit rollout strategy (maxUnavailable: 0, maxSurge: 1) for zero-downtime upgrades
Add terminationGracePeriodSeconds: 30 to ensure graceful shutdown completes before SIGKILL

Root Cause

During helm upgrade --wait (e.g., when E2E tests add/remove adapters), the old API pod receives SIGTERM and starts httpServer.Shutdown() immediately — before the GKE LoadBalancer removes it from its backend pool. Requests routed to the old pod during this window receive EOF.

Confirmed by /e2e-debug investigation of tier2-nightly run 2071949524003196928: the UpgradeAPIRequiredAdapters helper's readiness check (ListClusters GET on port 8000) passed, but the very next request (CreateCluster POST, 62ms later) got EOF — the LoadBalancer routed it to the old pod mid-shutdown.

What changed

File	Change
`charts/templates/deployment.yaml`	Added `strategy` block (from values), `terminationGracePeriodSeconds` (from values), `lifecycle` block (from values)
`charts/values.yaml`	Added `lifecycle.preStop` (sleep 5, enabled by default), `strategy` (maxUnavailable: 0), `terminationGracePeriodSeconds` (30s)
`charts/README.md`	Auto-regenerated by helm-docs

Design decisions

preStop: sleep 5 — gives the LoadBalancer 5 seconds to drain the old pod before SIGTERM is delivered. Follows the adapter chart's commented-out pattern but enabled by default for the API (most critical component).
maxUnavailable: 0, maxSurge: 1 — ensures the new pod is fully Ready before the old one starts terminating. Matches the adapter chart's strategy exactly.
terminationGracePeriodSeconds: 30 — must be > preStop (5s) + API server shutdown (10s hardcoded in api_server.go:127) + buffer. Matches the adapter chart and the graceful shutdown standard.
Enabled by default — unlike the adapter chart which uses lifecycle: {}, the API chart ships with the preStop hook active. The API is the single entry point for all clients; any EOF during rollout is a user-visible error.
All three values are configurable and can be overridden or disabled via Helm values.

Test plan

make test-helm — all 18 Helm chart tests pass (lint + kubeconform across 18 value combinations)
Verify rendered template contains preStop, strategy, and terminationGracePeriodSeconds blocks
Deploy to dev cluster and run helm upgrade — verify zero EOF errors during rollout
Run tier2-nightly E2E (stuck_deletion test) — should pass without EOF on cluster create after API upgrade

…eployment

openshift-ci · 2026-07-01T04:00:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign crizzo71 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-07-01T04:00:49Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added configurable rollout, lifecycle, and shutdown settings for the Kubernetes deployment.
- Enabled zero-downtime rolling updates by default and added support for graceful pod termination.
Documentation
- Updated the Helm chart documentation to describe the new rollout, lifecycle, and termination options.

Walkthrough

CWE-400 (Uncontrolled Resource Consumption via improper shutdown) mitigation attempt: Helm chart now templates optional lifecycle.preStop, strategy, and terminationGracePeriodSeconds fields in the Deployment spec/container spec, gated by if conditionals on .Values. Default values.yaml sets a 5s sleep preStop hook, RollingUpdate strategy (maxUnavailable: 0, maxSurge: 1), and terminationGracePeriodSeconds: 30. README values table updated to document these three new keys.

Estimated code review effort: 2 (Simple) | ~10 minutes

Sequence Diagram(s)

Not applicable — no multi-component control-flow interaction; this is static Helm manifest/config templating.

Related Issues: None cited in diff.

Related PRs: None cited in diff.

Suggested labels: helm, deployment, supply-chain-review

Suggested reviewers: Chart maintainers with Kubernetes rollout/lifecycle-hook review authority. Verify preStop sleep command isn't shell-injectable (CWE-78) — confirm it's a static literal, not templated from unsanitized .Values input, since exec.command accepts arbitrary shell strings.

Poem:
Thirty seconds now to drain the pipe,
A sleep of five before SIGTERM's swipe,
maxUnavailable zero, no downtime gripe,
But check that exec.command's not a hidden snipe (CWE-78) —
static strings only, no injected type.

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
No Hardcoded Secrets	⚠️ Warning	FAIL: charts/values.yaml hardcodes `database.postgresql.password: "hyperfleet-dev-password"` (also surfaced in README), a CWE-798 hard-coded credential.	Move the PostgreSQL password out of chart defaults: use a placeholder like `CHANGE_ME`, or require a Secret/external secret and keep helm-docs from rendering a real password.

✅ Passed checks (10 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title matches the main change: adding a preStop hook and rollout strategy to the API deployment.
Description check	✅ Passed	The description directly describes the same Helm deployment rollout and shutdown changes in the diff.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Sec-02: Secrets In Log Output	✅ Passed	No slog/log/zap/fmt.Print statements were added in the touched Helm files; secret/token/password strings are config/docs only, so no CWE-532 log leak.
No Weak Cryptography	✅ Passed	No CWE-327/328 weak-crypto use: no md5/des/rc4/SHA1, ECB, custom crypto, or secret compares; only sha256sum checksum annotations.
No Injection Vectors	✅ Passed	No CWE-89/78/79/502 sinks added: touched chart files contain only hardcoded Helm values/YAML; searches found no exec.Command, fmt.Sprintf-in-query, template.HTML, or yaml.Unmarshal.
No Privileged Containers	✅ Passed	No privileged=true, hostPID/Network/IPC, SYS_ADMIN, allowPrivilegeEscalation:true, or root USER/runAsUser:0 were added in the touched chart templates; defaults remain non-root.
No Pii Or Sensitive Data In Logs	✅ Passed	CWE-532: the PR only changes Helm rollout values/docs; no slog/logr/zap/fmt.Print* logging added in modified files.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

✨ Simplify code

Create PR with simplified code

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@charts/values.yaml`:
- Around line 281-288: The `strategy` values block is documented with an
incorrect `@default` comment that contradicts the committed `rollingUpdate`
settings. Update the `# --`/`@default` annotation near `strategy` in values.yaml
so it matches the actual default shipped by the chart (`rollingUpdate.maxSurge`
and `rollingUpdate.maxUnavailable`), ensuring helm-docs and `verify-helm-docs`
render the README accurately.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 155ed27a-904f-498c-b98c-f080b9d6aded

📥 Commits

Reviewing files that changed from the base of the PR and between 8f20481 and 8acf20f.

📒 Files selected for processing (3)

charts/README.md
charts/templates/deployment.yaml
charts/values.yaml

🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

openshift-hyperfleet/architecture (manual)
openshift-hyperfleet/hyperfleet-api (manual)
openshift-hyperfleet/hyperfleet-sentinel (manual)
openshift-hyperfleet/hyperfleet-adapter (manual)
openshift-hyperfleet/hyperfleet-broker (manual)

coderabbitai · 2026-07-01T04:04:54Z

+# -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime
+# during rolling updates — the old pod stays until the new one is Ready.
+# @default -- Kubernetes default (25% maxUnavailable, 25% maxSurge)
+strategy:
+  rollingUpdate:
+    maxSurge: 1
+    maxUnavailable: 0
+  type: RollingUpdate


📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

@default annotation misdocuments the actual shipped default.

Line 283's @default -- Kubernetes default (25% maxUnavailable, 25% maxSurge) will render into the README values table via helm-docs, but the actual committed default (lines 285-288) is maxSurge: 1, maxUnavailable: 0 — the opposite of the annotation's claim. Anyone reading the README will believe strategy defaults to k8s's built-in 25/25 behavior when it doesn't.

As per path instructions, "values.yaml has sensible defaults" and helm-chart-conventions.md requires the # --/@default comment to accurately describe the value so verify-helm-docs/make test-helm output stays trustworthy.

📝 Proposed fix

# -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime # during rolling updates — the old pod stays until the new one is Ready. -# `@default` -- Kubernetes default (25% maxUnavailable, 25% maxSurge) strategy: rollingUpdate: maxSurge: 1 maxUnavailable: 0 type: RollingUpdate

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime

# during rolling updates — the old pod stays until the new one is Ready.

# @default -- Kubernetes default (25% maxUnavailable, 25% maxSurge)

strategy:

rollingUpdate:

maxSurge: 1

maxUnavailable: 0

type: RollingUpdate

# -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime

# during rolling updates — the old pod stays until the new one is Ready.

strategy:

rollingUpdate:

maxSurge: 1

maxUnavailable: 0

type: RollingUpdate

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@charts/values.yaml` around lines 281 - 288, The `strategy` values block is documented with an incorrect `@default` comment that contradicts the committed `rollingUpdate` settings. Update the `# --`/`@default` annotation near `strategy` in values.yaml so it matches the actual default shipped by the chart (`rollingUpdate.maxSurge` and `rollingUpdate.maxUnavailable`), ensuring helm-docs and `verify-helm-docs` render the README accurately.

Source: Path instructions

hyperfleet-ci-bot · 2026-07-01T04:25:48Z

Risk Score: 0 — `risk/low`

Signal	Detail	Points
PR size	39 lines	+0
Sensitive paths	none	+0

_{Computed by hyperfleet-risk-scorer}

HYPERFLEET-1306 - fix: add preStop hook and rollout strategy to API d…

8acf20f

…eployment

openshift-ci Bot requested review from pnguyen44 and rh-amarin July 1, 2026 04:00

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

hyperfleet-ci-bot Bot added the risk/low label Jul 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HYPERFLEET-1306 - fix: add preStop hook and rollout strategy to API deployment#268

HYPERFLEET-1306 - fix: add preStop hook and rollout strategy to API deployment#268
tirthct wants to merge 1 commit into
openshift-hyperfleet:mainfrom
tirthct:hyperfleet-1306

tirthct commented Jul 1, 2026

Uh oh!

openshift-ci Bot commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

Summary by CodeRabbit

Walkthrough

Sequence Diagram(s)

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Uh oh!

hyperfleet-ci-bot Bot commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tirthct commented Jul 1, 2026

Summary

Root Cause

What changed

Design decisions

Test plan

Uh oh!

openshift-ci Bot commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Sequence Diagram(s)

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

hyperfleet-ci-bot Bot commented Jul 1, 2026

Risk Score: 0 — risk/low

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

Risk Score: 0 — `risk/low`