Skip to content

HYPERFLEET-1306 - fix: add preStop hook and rollout strategy to API deployment#268

Open
tirthct wants to merge 1 commit into
openshift-hyperfleet:mainfrom
tirthct:hyperfleet-1306
Open

HYPERFLEET-1306 - fix: add preStop hook and rollout strategy to API deployment#268
tirthct wants to merge 1 commit into
openshift-hyperfleet:mainfrom
tirthct:hyperfleet-1306

Conversation

@tirthct

@tirthct tirthct commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add preStop lifecycle hook (sleep 5) to the API deployment to prevent EOF errors during rolling updates
  • Add explicit rollout strategy (maxUnavailable: 0, maxSurge: 1) for zero-downtime upgrades
  • Add terminationGracePeriodSeconds: 30 to ensure graceful shutdown completes before SIGKILL

Root Cause

During helm upgrade --wait (e.g., when E2E tests add/remove adapters), the old API pod receives SIGTERM and starts httpServer.Shutdown() immediately — before the GKE LoadBalancer removes it from its backend pool. Requests routed to the old pod during this window receive EOF.

Confirmed by /e2e-debug investigation of tier2-nightly run 2071949524003196928: the UpgradeAPIRequiredAdapters helper's readiness check (ListClusters GET on port 8000) passed, but the very next request (CreateCluster POST, 62ms later) got EOF — the LoadBalancer routed it to the old pod mid-shutdown.

What changed

File Change
charts/templates/deployment.yaml Added strategy block (from values), terminationGracePeriodSeconds (from values), lifecycle block (from values)
charts/values.yaml Added lifecycle.preStop (sleep 5, enabled by default), strategy (maxUnavailable: 0), terminationGracePeriodSeconds (30s)
charts/README.md Auto-regenerated by helm-docs

Design decisions

  • preStop: sleep 5 — gives the LoadBalancer 5 seconds to drain the old pod before SIGTERM is delivered. Follows the adapter chart's commented-out pattern but enabled by default for the API (most critical component).
  • maxUnavailable: 0, maxSurge: 1 — ensures the new pod is fully Ready before the old one starts terminating. Matches the adapter chart's strategy exactly.
  • terminationGracePeriodSeconds: 30 — must be > preStop (5s) + API server shutdown (10s hardcoded in api_server.go:127) + buffer. Matches the adapter chart and the graceful shutdown standard.
  • Enabled by default — unlike the adapter chart which uses lifecycle: {}, the API chart ships with the preStop hook active. The API is the single entry point for all clients; any EOF during rollout is a user-visible error.
  • All three values are configurable and can be overridden or disabled via Helm values.

Test plan

  • make test-helm — all 18 Helm chart tests pass (lint + kubeconform across 18 value combinations)
  • Verify rendered template contains preStop, strategy, and terminationGracePeriodSeconds blocks
  • Deploy to dev cluster and run helm upgrade — verify zero EOF errors during rollout
  • Run tier2-nightly E2E (stuck_deletion test) — should pass without EOF on cluster create after API upgrade

@openshift-ci openshift-ci Bot requested review from pnguyen44 and rh-amarin July 1, 2026 04:00
@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign crizzo71 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added configurable rollout, lifecycle, and shutdown settings for the Kubernetes deployment.
    • Enabled zero-downtime rolling updates by default and added support for graceful pod termination.
  • Documentation
    • Updated the Helm chart documentation to describe the new rollout, lifecycle, and termination options.

Walkthrough

CWE-400 (Uncontrolled Resource Consumption via improper shutdown) mitigation attempt: Helm chart now templates optional lifecycle.preStop, strategy, and terminationGracePeriodSeconds fields in the Deployment spec/container spec, gated by if conditionals on .Values. Default values.yaml sets a 5s sleep preStop hook, RollingUpdate strategy (maxUnavailable: 0, maxSurge: 1), and terminationGracePeriodSeconds: 30. README values table updated to document these three new keys.

Estimated code review effort: 2 (Simple) | ~10 minutes

Sequence Diagram(s)

Not applicable — no multi-component control-flow interaction; this is static Helm manifest/config templating.

Related Issues: None cited in diff.

Related PRs: None cited in diff.

Suggested labels: helm, deployment, supply-chain-review

Suggested reviewers: Chart maintainers with Kubernetes rollout/lifecycle-hook review authority. Verify preStop sleep command isn't shell-injectable (CWE-78) — confirm it's a static literal, not templated from unsanitized .Values input, since exec.command accepts arbitrary shell strings.

Poem:
Thirty seconds now to drain the pipe,
A sleep of five before SIGTERM's swipe,
maxUnavailable zero, no downtime gripe,
But check that exec.command's not a hidden snipe (CWE-78) —
static strings only, no injected type.

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
No Hardcoded Secrets ⚠️ Warning FAIL: charts/values.yaml hardcodes database.postgresql.password: "hyperfleet-dev-password" (also surfaced in README), a CWE-798 hard-coded credential. Move the PostgreSQL password out of chart defaults: use a placeholder like CHANGE_ME, or require a Secret/external secret and keep helm-docs from rendering a real password.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the main change: adding a preStop hook and rollout strategy to the API deployment.
Description check ✅ Passed The description directly describes the same Helm deployment rollout and shutdown changes in the diff.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Sec-02: Secrets In Log Output ✅ Passed No slog/log/zap/fmt.Print statements were added in the touched Helm files; secret/token/password strings are config/docs only, so no CWE-532 log leak.
No Weak Cryptography ✅ Passed No CWE-327/328 weak-crypto use: no md5/des/rc4/SHA1, ECB, custom crypto, or secret compares; only sha256sum checksum annotations.
No Injection Vectors ✅ Passed No CWE-89/78/79/502 sinks added: touched chart files contain only hardcoded Helm values/YAML; searches found no exec.Command, fmt.Sprintf-in-query, template.HTML, or yaml.Unmarshal.
No Privileged Containers ✅ Passed No privileged=true, hostPID/Network/IPC, SYS_ADMIN, allowPrivilegeEscalation:true, or root USER/runAsUser:0 were added in the touched chart templates; defaults remain non-root.
No Pii Or Sensitive Data In Logs ✅ Passed CWE-532: the PR only changes Helm rollout values/docs; no slog/logr/zap/fmt.Print* logging added in modified files.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@charts/values.yaml`:
- Around line 281-288: The `strategy` values block is documented with an
incorrect `@default` comment that contradicts the committed `rollingUpdate`
settings. Update the `# --`/`@default` annotation near `strategy` in values.yaml
so it matches the actual default shipped by the chart (`rollingUpdate.maxSurge`
and `rollingUpdate.maxUnavailable`), ensuring helm-docs and `verify-helm-docs`
render the README accurately.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 155ed27a-904f-498c-b98c-f080b9d6aded

📥 Commits

Reviewing files that changed from the base of the PR and between 8f20481 and 8acf20f.

📒 Files selected for processing (3)
  • charts/README.md
  • charts/templates/deployment.yaml
  • charts/values.yaml
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

Comment thread charts/values.yaml
Comment on lines +281 to +288
# -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime
# during rolling updates — the old pod stays until the new one is Ready.
# @default -- Kubernetes default (25% maxUnavailable, 25% maxSurge)
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

@default annotation misdocuments the actual shipped default.

Line 283's @default -- Kubernetes default (25% maxUnavailable, 25% maxSurge) will render into the README values table via helm-docs, but the actual committed default (lines 285-288) is maxSurge: 1, maxUnavailable: 0 — the opposite of the annotation's claim. Anyone reading the README will believe strategy defaults to k8s's built-in 25/25 behavior when it doesn't.

As per path instructions, "values.yaml has sensible defaults" and helm-chart-conventions.md requires the # --/@default comment to accurately describe the value so verify-helm-docs/make test-helm output stays trustworthy.

📝 Proposed fix
 # -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime
 # during rolling updates — the old pod stays until the new one is Ready.
-# `@default` -- Kubernetes default (25% maxUnavailable, 25% maxSurge)
 strategy:
   rollingUpdate:
     maxSurge: 1
     maxUnavailable: 0
   type: RollingUpdate
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime
# during rolling updates — the old pod stays until the new one is Ready.
# @default -- Kubernetes default (25% maxUnavailable, 25% maxSurge)
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate
# -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime
# during rolling updates — the old pod stays until the new one is Ready.
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@charts/values.yaml` around lines 281 - 288, The `strategy` values block is
documented with an incorrect `@default` comment that contradicts the committed
`rollingUpdate` settings. Update the `# --`/`@default` annotation near
`strategy` in values.yaml so it matches the actual default shipped by the chart
(`rollingUpdate.maxSurge` and `rollingUpdate.maxUnavailable`), ensuring
helm-docs and `verify-helm-docs` render the README accurately.

Source: Path instructions

@hyperfleet-ci-bot

Copy link
Copy Markdown

Risk Score: 0 — risk/low

Signal Detail Points
PR size 39 lines +0
Sensitive paths none +0

Computed by hyperfleet-risk-scorer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant