HYPERFLEET-1306 - fix: add preStop hook and rollout strategy to API deployment#268
HYPERFLEET-1306 - fix: add preStop hook and rollout strategy to API deployment#268tirthct wants to merge 1 commit into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughSummary by CodeRabbit
WalkthroughCWE-400 (Uncontrolled Resource Consumption via improper shutdown) mitigation attempt: Helm chart now templates optional Estimated code review effort: 2 (Simple) | ~10 minutes Sequence Diagram(s)Not applicable — no multi-component control-flow interaction; this is static Helm manifest/config templating. Related Issues: None cited in diff. Related PRs: None cited in diff. Suggested labels: helm, deployment, supply-chain-review Suggested reviewers: Chart maintainers with Kubernetes rollout/lifecycle-hook review authority. Verify Poem: 🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@charts/values.yaml`:
- Around line 281-288: The `strategy` values block is documented with an
incorrect `@default` comment that contradicts the committed `rollingUpdate`
settings. Update the `# --`/`@default` annotation near `strategy` in values.yaml
so it matches the actual default shipped by the chart (`rollingUpdate.maxSurge`
and `rollingUpdate.maxUnavailable`), ensuring helm-docs and `verify-helm-docs`
render the README accurately.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 155ed27a-904f-498c-b98c-f080b9d6aded
📒 Files selected for processing (3)
charts/README.mdcharts/templates/deployment.yamlcharts/values.yaml
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
openshift-hyperfleet/architecture(manual)openshift-hyperfleet/hyperfleet-api(manual)openshift-hyperfleet/hyperfleet-sentinel(manual)openshift-hyperfleet/hyperfleet-adapter(manual)openshift-hyperfleet/hyperfleet-broker(manual)
| # -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime | ||
| # during rolling updates — the old pod stays until the new one is Ready. | ||
| # @default -- Kubernetes default (25% maxUnavailable, 25% maxSurge) | ||
| strategy: | ||
| rollingUpdate: | ||
| maxSurge: 1 | ||
| maxUnavailable: 0 | ||
| type: RollingUpdate |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
@default annotation misdocuments the actual shipped default.
Line 283's @default -- Kubernetes default (25% maxUnavailable, 25% maxSurge) will render into the README values table via helm-docs, but the actual committed default (lines 285-288) is maxSurge: 1, maxUnavailable: 0 — the opposite of the annotation's claim. Anyone reading the README will believe strategy defaults to k8s's built-in 25/25 behavior when it doesn't.
As per path instructions, "values.yaml has sensible defaults" and helm-chart-conventions.md requires the # --/@default comment to accurately describe the value so verify-helm-docs/make test-helm output stays trustworthy.
📝 Proposed fix
# -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime
# during rolling updates — the old pod stays until the new one is Ready.
-# `@default` -- Kubernetes default (25% maxUnavailable, 25% maxSurge)
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime | |
| # during rolling updates — the old pod stays until the new one is Ready. | |
| # @default -- Kubernetes default (25% maxUnavailable, 25% maxSurge) | |
| strategy: | |
| rollingUpdate: | |
| maxSurge: 1 | |
| maxUnavailable: 0 | |
| type: RollingUpdate | |
| # -- Deployment rollout strategy. `maxUnavailable: 0` ensures zero-downtime | |
| # during rolling updates — the old pod stays until the new one is Ready. | |
| strategy: | |
| rollingUpdate: | |
| maxSurge: 1 | |
| maxUnavailable: 0 | |
| type: RollingUpdate |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@charts/values.yaml` around lines 281 - 288, The `strategy` values block is
documented with an incorrect `@default` comment that contradicts the committed
`rollingUpdate` settings. Update the `# --`/`@default` annotation near
`strategy` in values.yaml so it matches the actual default shipped by the chart
(`rollingUpdate.maxSurge` and `rollingUpdate.maxUnavailable`), ensuring
helm-docs and `verify-helm-docs` render the README accurately.
Source: Path instructions
Risk Score: 0 —
|
| Signal | Detail | Points |
|---|---|---|
| PR size | 39 lines | +0 |
| Sensitive paths | none | +0 |
Computed by hyperfleet-risk-scorer
Summary
preStoplifecycle hook (sleep 5) to the API deployment to prevent EOF errors during rolling updatesmaxUnavailable: 0, maxSurge: 1) for zero-downtime upgradesterminationGracePeriodSeconds: 30to ensure graceful shutdown completes before SIGKILLRoot Cause
During
helm upgrade --wait(e.g., when E2E tests add/remove adapters), the old API pod receives SIGTERM and startshttpServer.Shutdown()immediately — before the GKE LoadBalancer removes it from its backend pool. Requests routed to the old pod during this window receive EOF.Confirmed by
/e2e-debuginvestigation of tier2-nightly run2071949524003196928: theUpgradeAPIRequiredAdaptershelper's readiness check (ListClustersGET on port 8000) passed, but the very next request (CreateClusterPOST, 62ms later) got EOF — the LoadBalancer routed it to the old pod mid-shutdown.What changed
charts/templates/deployment.yamlstrategyblock (from values),terminationGracePeriodSeconds(from values),lifecycleblock (from values)charts/values.yamllifecycle.preStop(sleep 5, enabled by default),strategy(maxUnavailable: 0),terminationGracePeriodSeconds(30s)charts/README.mdDesign decisions
preStop: sleep 5— gives the LoadBalancer 5 seconds to drain the old pod before SIGTERM is delivered. Follows the adapter chart's commented-out pattern but enabled by default for the API (most critical component).maxUnavailable: 0, maxSurge: 1— ensures the new pod is fully Ready before the old one starts terminating. Matches the adapter chart's strategy exactly.terminationGracePeriodSeconds: 30— must be > preStop (5s) + API server shutdown (10s hardcoded inapi_server.go:127) + buffer. Matches the adapter chart and the graceful shutdown standard.lifecycle: {}, the API chart ships with the preStop hook active. The API is the single entry point for all clients; any EOF during rollout is a user-visible error.Test plan
make test-helm— all 18 Helm chart tests pass (lint + kubeconform across 18 value combinations)preStop,strategy, andterminationGracePeriodSecondsblockshelm upgrade— verify zero EOF errors during rollout