feat(vmpool): add VirtualMachinePool for group VM management by fl64 · Pull Request #2572 · deckhouse/virtualization

fl64 · 2026-07-02T09:00:02Z

Description

DVP has no primitive for managing a group of identical virtual machines whose count changes over time. Every "I need N identical VMs and the number varies" scenario (CI runner fleets, VDI desktop pools) is solved with orchestration outside the platform: users write their own controllers or scripts that create and delete VirtualMachines, track their count, recreate lost ones and clean up after them. This duplicates logic and is error-prone around races and node failures.

This PR introduces VirtualMachinePool (paid editions only, EE/SE+): a namespaced resource that declaratively keeps a requested number of identical VMs and integrates with kubectl scale, HPA and KEDA through the standard scale subresource. Its template is an ordinary VirtualMachineSpec, so a replica is no different from a manually created VM.

This is a draft. The feature is delivered incrementally within this single PR; phases land as separate commits. Already implemented:

CRD VirtualMachinePool with the scale and status subresources, gated behind the VirtualMachinePool module feature gate (default off, locked off in CE).

Controller that maintains the replica count: creates replicas from the template, replaces disappeared ones, scales down (youngest-first for now), and reports status (replicas, readyReplicas, selector, Available/Progressing). It is cache-lag-safe via a ReplicaSet-style expectations tracker, so a lagging informer cache cannot double-create anonymous replicas.

Planned in later phases of this PR: scaleDownPolicy + a /scale guard webhook, addressed scale-down (scaleDownWith), in-place template propagation, and reusable disks.

One implementation note: the controller ships only in paid editions (compiled under the EE build tag), while the CRD/API is installed in every edition; the feature gate stays locked off in CE, so the resource simply does nothing there.

Why do we need it, and what problem does it solve?

Two widespread scenarios suffer most: CI/CD runners (GitLab Runner autoscaling expects a backend that can "give me N more" and reclaim idle ones) and VDI pools (warm desktops that self-heal on node failure). Without a group primitive, DVP cannot serve these natively, so each team reinvents the orchestration, usually with bugs in race and failure handling. VirtualMachinePool gives users a native, declarative backend for autoscaling fleets of VMs without writing their own replica controller.

What is the expected result?

With the VirtualMachinePool feature gate enabled (EE/SE+):

Create a VirtualMachinePool with spec.replicas: N and a spec.virtualMachineTemplate — the controller converges the number of VirtualMachines to N.
kubectl scale virtualmachinepool/<name> --replicas=M (or HPA/KEDA) scales the pool to M.
Deleting or losing a replica triggers a replacement once the old object is gone; a member in Stopped is kept, not duplicated.
kubectl get virtualmachinepool and .status report replicas / readyReplicas and the Available / Progressing conditions.

Checklist

The code is covered by unit tests.
e2e tests passed.
Documentation updated according to the changes.
Changes were tested in the Kubernetes cluster manually.

Changelog entries

section: vmpool
type: feature
summary: "Add VirtualMachinePool (EE/SE+) for declarative group management of virtual machines, scalable via the standard scale subresource, HPA and KEDA."
impact_level: low

Introduce the VirtualMachinePool API type (namespaced, group virtualization.deckhouse.io/v1alpha2) with the scale and status subresources, generated deepcopy/client/lister/informer code and the CRD manifest. Gate the resource behind the VirtualMachinePool module feature gate (EE/SE+, default off; locked off in CE). No controller behaviour yet — the type and gate are the scaffold for the pool controller. Part of the VirtualMachinePool implementation (ADR: architecture-decision-records dvp/2026-06-29-vmpool.md). Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add the VirtualMachinePool controller skeleton behind the EE build tag (//go:build EE) and the VirtualMachinePool feature gate: handler-chain reconciler with an empty chain and a primary watch on the resource. It is wired into the controller manager through build-tagged enterprise shims (setup_enterprise_{ee,ce}.go); the CE build compiles a no-op. No reconcile behaviour yet — replica maintenance, template propagation and reusable disks land in the follow-up slices. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

… tag EE is the default shipped edition (werf.inc.yaml builds with -tags $MODULE_EDITION, default EE), but the unit-test task ran ginkgo without a build tag, so //go:build EE code was never exercised by the unit suite. Run ginkgo with --tags EE so enterprise code and its tests are covered. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add an in-memory, thread-safe expectations tracker (EE) modelled on the Kubernetes ReplicaSet UIDTrackingControllerExpectations: creations are counted, deletions tracked by UID, with a TTL safety valve. The pool reconciler will use it to avoid double-creating anonymous replicas while the informer cache lags behind a Create/Delete. Covered by unit tests (race-clean). Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
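The idea behind the tracker can be illustrated with a minimal sketch. It is not the code from this PR: the `Tracker` type, its method names, `fakeClock` and `demoTTL` are hypothetical, and the sketch tracks a single pool, whereas a real tracker would presumably keep one record per pool.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoTTL is an arbitrary safety-valve duration for this example.
const demoTTL = 5 * time.Minute

// fakeClock is a manually advanced clock that keeps the example deterministic.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// Tracker is a hypothetical, simplified expectations tracker for one pool.
type Tracker struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time
	creates int                 // creations issued but not yet observed
	deletes map[string]struct{} // UIDs whose deletion is not yet observed
	stamped time.Time           // when an expectation was last raised
}

func NewTracker(ttl time.Duration, now func() time.Time) *Tracker {
	return &Tracker{ttl: ttl, now: now, deletes: map[string]struct{}{}}
}

// ExpectCreations records n creations that are about to be issued.
func (t *Tracker) ExpectCreations(n int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.creates += n
	t.stamped = t.now()
}

// ExpectDeletions records the UIDs that are about to be deleted.
func (t *Tracker) ExpectDeletions(uids ...string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, uid := range uids {
		t.deletes[uid] = struct{}{}
	}
	t.stamped = t.now()
}

// CreationObserved lowers the pending creation count, for example when the
// watcher sees a new replica or when a Create call fails.
func (t *Tracker) CreationObserved() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.creates > 0 {
		t.creates--
	}
}

// DeletionObserved clears a pending deletion by UID.
func (t *Tracker) DeletionObserved(uid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.deletes, uid)
}

// Satisfied reports whether the cached replica list can be trusted: nothing
// is in flight, or the expectations are older than the TTL (safety valve).
func (t *Tracker) Satisfied() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.creates == 0 && len(t.deletes) == 0 {
		return true
	}
	return t.now().Sub(t.stamped) > t.ttl
}

func main() {
	clock := &fakeClock{}
	tr := NewTracker(demoTTL, clock.Now)
	tr.ExpectCreations(2)
	fmt.Println(tr.Satisfied()) // false: two creations are still in flight
	tr.CreationObserved()
	tr.CreationObserved()
	fmt.Println(tr.Satisfied()) // true: the cache has caught up
}
```

A reconciler built on such a tracker would skip creating or deleting replicas while `Satisfied` returns false, and only act on the cached list once it returns true.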

Implement the pool's core reconcile: list members by the managed pool-uid label + controllerRef, create missing replicas from the template (managed labels + controller ownerReference, GenerateName naming) and remove surplus ones, then publish status (replicas, readyReplicas, selector, Available/Progressing conditions). Every create/delete is guarded by the expectations tracker, and a member VirtualMachine watcher re-enqueues the owning pool and records observed creations/deletions — so a lagging informer cache cannot double-create anonymous replicas. Terminating members count toward a scale-down (invariant 2), so a replica already leaving is not over-replaced. Covered by unit tests (fake client, race-clean). The controller stays behind //go:build EE and the feature gate. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
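One possible reading of the replica accounting above, as a small sketch. The function names `plan` and `desiredReplicas` are hypothetical, and the exact rules are an assumption drawn from this description: a Terminating member still counts as a member until its object is gone (so it is not replaced early), and deletions already in progress count toward a requested scale-down.

```go
package main

import "fmt"

// desiredReplicas treats an unset replica count as zero.
func desiredReplicas(replicas *int32) int {
	if replicas == nil {
		return 0
	}
	return int(*replicas)
}

// plan returns how many replicas to create and how many live ones to delete.
// live counts members that are not being deleted; terminating counts members
// that already carry a deletion timestamp.
func plan(desired, live, terminating int) (create, remove int) {
	total := live + terminating
	switch {
	case total < desired:
		// Members are missing even when counting those on their way out.
		return desired - total, 0
	case live > desired:
		// Terminating members are already leaving; only live surplus is removed.
		return 0, live - desired
	default:
		return 0, 0
	}
}

func main() {
	three := int32(3)
	want := desiredReplicas(&three)

	create, remove := plan(want, 2, 0)
	fmt.Println(create, remove) // 1 0: one replica is missing

	create, remove = plan(want, 3, 2)
	fmt.Println(create, remove) // 0 0: the two surplus members are already terminating

	create, remove = plan(want, 2, 1)
	fmt.Println(create, remove) // 0 0: replacement waits until the old object is gone
}
```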

Add the required spec.scaleDownPolicy enum (NewestFirst / OldestFirst / Explicit) and honour it when the pool is scaled down anonymously via the scale subresource: NewestFirst removes the youngest replicas first, OldestFirst the oldest, and Explicit removes nothing anonymously (such pools shrink only by addressed removal). The scale-subresource guard that rejects anonymous shrink under Explicit is added next. Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
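The three policies can be sketched as a victim-selection function. This is an illustration under assumptions, not the PR's code: the `Replica` type, the `victims` function, the `at` helper and the name-based tie-break are hypothetical, and "youngest" is taken to mean the latest creation timestamp.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type ScaleDownPolicy string

const (
	NewestFirst ScaleDownPolicy = "NewestFirst"
	OldestFirst ScaleDownPolicy = "OldestFirst"
	Explicit    ScaleDownPolicy = "Explicit"
)

// Replica is a hypothetical, reduced view of a pool member.
type Replica struct {
	Name    string
	Created time.Time
}

// referenceTime is an arbitrary fixed instant; only offsets from it matter.
var referenceTime = time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)

// at returns referenceTime plus the given number of minutes.
func at(minutes int) time.Time {
	return referenceTime.Add(time.Duration(minutes) * time.Minute)
}

// victims returns the names of up to surplus replicas to remove anonymously.
// Under Explicit it returns nothing: such pools shrink only by address.
func victims(policy ScaleDownPolicy, replicas []Replica, surplus int) []string {
	if policy == Explicit || surplus <= 0 {
		return nil
	}
	sorted := append([]Replica(nil), replicas...) // do not reorder the input
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.Created.Equal(b.Created) {
			return a.Name < b.Name // hypothetical tie-break for determinism
		}
		if policy == NewestFirst {
			return a.Created.After(b.Created)
		}
		return a.Created.Before(b.Created)
	})
	if surplus > len(sorted) {
		surplus = len(sorted)
	}
	names := make([]string, 0, surplus)
	for _, r := range sorted[:surplus] {
		names = append(names, r.Name)
	}
	return names
}

func main() {
	pool := []Replica{
		{Name: "vm-a", Created: at(0)},
		{Name: "vm-b", Created: at(10)},
		{Name: "vm-c", Created: at(5)},
	}
	fmt.Println(victims(NewestFirst, pool, 2)) // [vm-b vm-c]
	fmt.Println(victims(OldestFirst, pool, 2)) // [vm-a vm-c]
	fmt.Println(victims(Explicit, pool, 2))    // []
}
```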

Add a validating webhook on the virtualmachinepools/scale subresource that rejects a replicas decrease when the pool's scaleDownPolicy is Explicit, pointing the user to scaleDownWith for addressed removal. Growth and no-op scale updates are always allowed. The webhook is registered only in EE builds and self-gates on the VirtualMachinePool feature gate; its ValidatingWebhookConfiguration entry is rendered only when the gate is enabled. Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
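The guard's decision rule is small enough to sketch as a pure function. The function name `validateScale` and the error text are hypothetical; the real webhook also handles admission request decoding and the feature gate, which are omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

type ScaleDownPolicy string

const (
	NewestFirst ScaleDownPolicy = "NewestFirst"
	OldestFirst ScaleDownPolicy = "OldestFirst"
	Explicit    ScaleDownPolicy = "Explicit"
)

// errExplicitShrink is a hypothetical error pointing the user to scaleDownWith.
var errExplicitShrink = errors.New(
	"scaleDownPolicy is Explicit: use scaleDownWith to remove specific virtual machines")

// validateScale rejects a replicas decrease for Explicit pools.
// Growth and no-op updates are always allowed.
func validateScale(policy ScaleDownPolicy, oldReplicas, newReplicas int32) error {
	if policy == Explicit && newReplicas < oldReplicas {
		return fmt.Errorf("scale from %d to %d rejected: %w",
			oldReplicas, newReplicas, errExplicitShrink)
	}
	return nil
}

func main() {
	fmt.Println(validateScale(Explicit, 3, 5))    // <nil>: growth is allowed
	fmt.Println(validateScale(NewestFirst, 3, 1)) // <nil>: anonymous shrink is allowed
	fmt.Println(validateScale(Explicit, 3, 2) != nil)
}
```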

Add the VirtualMachinePool meta object and the VirtualMachinePoolScaleDownWith body type (targets to remove) to the subresources.virtualization.deckhouse.io API group, with generated deepcopy/conversion/openapi. This is the type surface for the addressed scale-down handle; the aggregated-apiserver REST storage and wiring follow. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Register the virtualmachinepools resource and its scaleDownWith subresource in the existing aggregated apiserver (group subresources.virtualization.deckhouse.io). The handler validates that every target belongs to the pool, deletes them and atomically decrements spec.replicas on the main resource — bypassing the /scale guard, which is what lets Explicit pools shrink by address. The meta-object itself is not served (Get returns NotFound). Enterprise-only: the REST/storage live under //go:build EE and are wired into the apiserver group through a build-tagged hook; the CE build adds nothing. A write-capable client is threaded from the apiserver config. Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
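A rough in-memory sketch of the validate-then-apply order described above. The `Pool` type and the `scaleDownWith` function are hypothetical stand-ins, duplicate targets are silently collapsed as an assumption, and the sketch ignores the API calls and concurrency control that the real handler needs.

```go
package main

import "fmt"

// Pool is a hypothetical in-memory stand-in for a VirtualMachinePool.
type Pool struct {
	Replicas int
	Members  map[string]bool // names of VirtualMachines owned by the pool
}

// scaleDownWith removes the named members and lowers Replicas by the same
// amount. All targets are validated before anything is changed, so a single
// foreign name rejects the whole request.
func scaleDownWith(p *Pool, targets []string) error {
	unique := map[string]struct{}{}
	for _, name := range targets {
		if !p.Members[name] {
			return fmt.Errorf("virtual machine %q does not belong to the pool", name)
		}
		unique[name] = struct{}{}
	}
	for name := range unique {
		delete(p.Members, name)
	}
	p.Replicas -= len(unique)
	if p.Replicas < 0 {
		p.Replicas = 0
	}
	return nil
}

func main() {
	pool := &Pool{Replicas: 3, Members: map[string]bool{"vm-a": true, "vm-b": true, "vm-c": true}}

	err := scaleDownWith(pool, []string{"vm-b", "vm-x"})
	fmt.Println(err != nil, pool.Replicas, len(pool.Members)) // true 3 3

	err = scaleDownWith(pool, []string{"vm-b"})
	fmt.Println(err, pool.Replicas, len(pool.Members)) // <nil> 2 2
}
```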

Let the aggregated apiserver's service account get/update VirtualMachinePool (the scaleDownWith handler decrements spec.replicas) and reach the pool subresources. Grant the Editor cluster role management of VirtualMachinePool, its scale subresource (kubectl scale / HPA) and the scaleDownWith handle for addressed removal. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add the template-hash label (revision marker, not part of the member selector) stamped on every created replica, and report the rollout in status: desiredTemplateHash, updatedReplicas and the Synced condition (True once all live replicas are on the current virtualMachineTemplate). This makes the rollout observable at pool level. In-place patching of existing replicas on a template change follows. Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
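How a revision marker and the derived status fields might fit together, as a sketch. The `Template` struct, the SHA-256 over JSON and the 10-character truncation are all assumptions made for illustration; the PR does not state which hash function or encoding the controller uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Template is a hypothetical, heavily reduced stand-in for virtualMachineTemplate.
type Template struct {
	Cores  int               `json:"cores"`
	Memory string            `json:"memory"`
	Labels map[string]string `json:"labels,omitempty"`
}

// templateHash derives a short, deterministic revision marker.
// encoding/json sorts map keys, so equal templates give equal hashes.
func templateHash(t Template) (string, error) {
	raw, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:])[:10], nil
}

// rollout counts replicas already on the desired hash and reports whether
// every live replica is on it (vacuously true for an empty pool).
func rollout(desiredHash string, replicaHashes []string) (updated int, synced bool) {
	for _, h := range replicaHashes {
		if h == desiredHash {
			updated++
		}
	}
	return updated, updated == len(replicaHashes)
}

func main() {
	oldHash, _ := templateHash(Template{Cores: 2, Memory: "4Gi"})
	newHash, _ := templateHash(Template{Cores: 4, Memory: "4Gi"})

	updated, synced := rollout(newHash, []string{newHash, oldHash, newHash})
	fmt.Println(updated, synced) // 2 false
}
```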

Add a template handler that patches each live replica's spec to the current virtualMachineTemplate and marks it on the new revision once applied. Re-patching is avoided with a patched-template-hash annotation (not a spec diff, which the apiserver mutates by defaulting), and the template-hash label is advanced only when the replica is not awaiting a restart, so status.updatedReplicas / restartPendingReplicas and the Synced condition (RolloutInProgress vs RestartPendingApproval) reflect what has effectively landed. Hot/cold is decided by the VM layer. Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Replace time.Unix(1_700_000_000, 0) with time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC) in the pool tests — same deterministic clock, but self-explanatory. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Replace the inline dates with a single documented package-level referenceTime var per test package, and drop the clock/when aliases. A comment states the value is arbitrary — tests use only relative offsets and never read the wall clock — so the real-world date is irrelevant. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add spec.virtualDiskTemplates: each entry describes a per-replica disk with a reclaim policy — Delete (default; the disk belongs to its VirtualMachine and is removed with it) or Retain (the disk belongs to the pool, outlives the replica and is reused on scale-up), plus keep (warm buffer) and ttl for Retain disks. This is the schema for reusable disks; the reconcile behaviour (creation, reuse selection, GC) follows. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Add an idempotent, self-healing disks handler: for every live member it ensures each Delete-policy virtualDiskTemplate disk exists (owned by the VirtualMachine, named <vm>-<template>, so it cascades away with the replica) and is referenced in the member's blockDeviceRefs. Also fix the template handler to merge block device refs when it patches a member's spec, so per-replica disk refs the pool attached are not wiped by a template change. Retain (reusable) disks come next. Covered by unit tests, including that a template patch keeps disk refs. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Extend the disks handler to Retain-policy templates: a member reuses a free pool-owned disk of the template (Ready and referenced by no live member) or, if none is free, gets a newly created pool-owned disk (named <pool>-<template>-<rand>) that outlives the replica. A per-pass guard prevents handing the same free disk to two members in one reconcile; the authoritative in-use signal is the members' blockDeviceRefs, not the platform InUse condition. Covered by unit tests (create, reuse-free, skip-busy). Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

The disks handler now ages free Retain disks: it stamps a free-since annotation when a disk leaves every member's blockDeviceRefs (the authoritative free signal; the platform InUse condition is unreliable because it flips on Stop) and clears it on reuse. Disks outside the warm buffer (keep newest) and older than the ttl are deleted with a resourceVersion precondition. free-since is persisted on the disk so the ttl survives controller restarts (in-memory timing would reset on every restart and leak disks). Covered by unit tests. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
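The keep/ttl rule can be sketched as a selection over free disks. The `FreeDisk` type, the `expired` function and the helpers are hypothetical, and "newest" is assumed here to mean most recently freed; the resourceVersion precondition on the actual delete is not modelled.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// demoTTL is an arbitrary ttl for this example.
const demoTTL = 30 * time.Minute

// referenceTime is an arbitrary fixed instant; only offsets from it matter.
var referenceTime = time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)

// at returns referenceTime plus the given number of minutes.
func at(minutes int) time.Time {
	return referenceTime.Add(time.Duration(minutes) * time.Minute)
}

// FreeDisk is a hypothetical view of a pool-owned disk that no member references.
type FreeDisk struct {
	Name      string
	FreeSince time.Time // value of the persisted free-since annotation
}

// expired returns the free disks that fall outside the warm buffer of keep
// disks and have been free for longer than ttl.
func expired(free []FreeDisk, keep int, ttl time.Duration, now time.Time) []string {
	sorted := append([]FreeDisk(nil), free...) // do not reorder the input
	// Assumption: the warm buffer holds the most recently freed disks.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].FreeSince.After(sorted[j].FreeSince)
	})
	var names []string
	for i, d := range sorted {
		if i < keep {
			continue // inside the warm buffer, never aged out
		}
		if now.Sub(d.FreeSince) > ttl {
			names = append(names, d.Name)
		}
	}
	return names
}

func main() {
	free := []FreeDisk{
		{Name: "disk-a", FreeSince: at(0)},
		{Name: "disk-b", FreeSince: at(50)},
		{Name: "disk-c", FreeSince: at(20)},
		{Name: "disk-d", FreeSince: at(60)},
	}
	// At minute 70 with keep=1: disk-d is buffered, disk-b is only 20 minutes free.
	fmt.Println(expired(free, 1, demoTTL, at(70))) // [disk-c disk-a]
}
```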

Add the fallback for reuse-disk collisions: if two live members reference the same pool-owned disk (a cross-pass race after a controller restart), detach it from all but the keeper (the member with BlockDevicesReady, or the lexicographically smallest name) so the others get a fresh disk on the next reconcile — the in-pass guard already prevents the common case. Also add edge-case tests: a Stopped member is counted and neither replaced nor duplicated (invariant 4); nil replicas mean zero; a non-Ready free disk is not reused; free-since is cleared on reuse; disks are not managed for a Terminating member. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
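The keeper rule for a contested disk can be sketched as follows. The `Member` type and the `keeper` function are hypothetical, and picking the smallest name among several ready members is an assumption the description does not spell out.

```go
package main

import "fmt"

// Member is a hypothetical view of a pool member that references a contested disk.
type Member struct {
	Name              string
	BlockDevicesReady bool
}

// keeper picks the member that keeps the disk: a member with
// BlockDevicesReady wins, otherwise the lexicographically smallest name.
// It returns false when there are no members.
func keeper(members []Member) (string, bool) {
	var best *Member
	for i := range members {
		m := &members[i]
		if best == nil ||
			(m.BlockDevicesReady && !best.BlockDevicesReady) ||
			(m.BlockDevicesReady == best.BlockDevicesReady && m.Name < best.Name) {
			best = m
		}
	}
	if best == nil {
		return "", false
	}
	return best.Name, true
}

func main() {
	name, _ := keeper([]Member{{Name: "vm-b"}, {Name: "vm-a"}})
	fmt.Println(name) // vm-a: nobody is ready, smallest name wins

	name, _ = keeper([]Member{{Name: "vm-a"}, {Name: "vm-b", BlockDevicesReady: true}})
	fmt.Println(name) // vm-b: the ready member wins
}
```

Every other member in the collision would then have the disk detached and receive a fresh one on the next reconcile.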

The virtualization-controller service account could not list/watch VirtualMachinePool, so the pool controller failed to start its watch and never reconciled. Add virtualmachinepools (+ status, + finalizers) to the controller ClusterRole. Found by in-cluster testing. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

The virtualization-api binary was built without -tags $MODULE_EDITION, so the EE-only aggregated-apiserver registration (compiled under //go:build EE) was dropped and the virtualmachinepools/scaleDownWith subresource returned 404. Build the apiserver with the edition tag like the controller, so the enterprise subresource is served in EE builds. Found by in-cluster testing. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Reuse-disk selection required Ready, so a freshly created disk (still WaitForFirstConsumer / provisioning) was never considered free and a new one was created on every reconcile until the first one bound, producing a burst of surplus disks. Reuse any free pool-owned disk, preferring a Ready one but otherwise attaching a still-provisioning one (attaching is what makes a WaitForFirstConsumer disk bind), and create a new disk only when none is free. Failed/terminating disks are skipped. Found by in-cluster testing. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
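The corrected selection order, combined with the per-pass guard from the earlier commit, might look like the sketch below. The `Disk` type, the `Phase` values and the `pickDisk` function are hypothetical simplifications; the real handler works from the members' blockDeviceRefs and the disk objects themselves.

```go
package main

import "fmt"

// Phase is a hypothetical, simplified disk state.
type Phase string

const (
	Ready        Phase = "Ready"
	Provisioning Phase = "Provisioning" // includes WaitForFirstConsumer
	Failed       Phase = "Failed"
	Terminating  Phase = "Terminating"
)

// Disk is a hypothetical view of a pool-owned disk.
type Disk struct {
	Name       string
	Phase      Phase
	Referenced bool // true when a live member lists it in blockDeviceRefs
}

// pickDisk returns a free disk to attach, preferring a Ready one and falling
// back to a still-provisioning one. It returns false when none is free, in
// which case the caller would create a new disk. claimed is the per-pass
// guard: a disk handed out once is not handed out again in the same pass.
func pickDisk(disks []Disk, claimed map[string]bool) (string, bool) {
	fallback := ""
	for _, d := range disks {
		if d.Referenced || claimed[d.Name] || d.Phase == Failed || d.Phase == Terminating {
			continue
		}
		if d.Phase == Ready {
			claimed[d.Name] = true
			return d.Name, true
		}
		if fallback == "" {
			fallback = d.Name
		}
	}
	if fallback == "" {
		return "", false
	}
	claimed[fallback] = true
	return fallback, true
}

func main() {
	disks := []Disk{
		{Name: "pool-data-x1", Phase: Provisioning},
		{Name: "pool-data-x2", Phase: Failed},
		{Name: "pool-data-x3", Phase: Ready},
	}
	claimed := map[string]bool{}
	for i := 0; i < 3; i++ {
		name, ok := pickDisk(disks, claimed)
		fmt.Printf("%q %v\n", name, ok)
	}
	// "pool-data-x3" true
	// "pool-data-x1" true
	// "" false
}
```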

…data The template metadata embedded metav1.ObjectMeta, which controller-gen renders as an opaque object, so setting template.metadata.labels was rejected by strict decoding. Use a curated metadata struct with labels and annotations so the CRD schema exposes them. Found by in-cluster testing. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Emit ReplicaSet-style events on the VirtualMachinePool so scaling is visible in kubectl describe / kubectl get events: SuccessfulCreate / FailedCreate on replica creation and SuccessfulDelete / FailedDelete on removal. FailedCreate surfaces admission errors (e.g. an invalid template) directly on the pool instead of only in controller logs. Messages follow the user-facing text conventions (English, full resource names, no internals). Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

Assert that SuccessfulCreate is emitted per created replica, and that a failed creation emits FailedCreate and undoes the expectation (via an interceptor client that rejects Create) so the pool is not wedged. Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>

fl64 added 5 commits July 2, 2026 11:14

github-actions Bot assigned fl64 Jul 2, 2026

fl64 added 20 commits July 2, 2026 12:07

test(vmpool): use a readable fixed date instead of a raw unix timestamp

84b1d2b



fl64 wants to merge 25 commits into main from feat/vmpool/implementation
