bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh#338
bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh#338michael-chou359 wants to merge 2 commits into
Conversation
| if old_client is not None and old_client is not new_client: | ||
| task = asyncio.create_task( | ||
| self._close_mongodb_client_after_delay(old_client) | ||
| ) | ||
| # Keep a strong reference until done so the task is not GC'd mid-flight. | ||
| self._mongodb_close_tasks.add(task) | ||
| task.add_done_callback(self._mongodb_close_tasks.discard) |
There was a problem hiding this comment.
This refresh swaps only GlobalDependencies.mongodb_database, but long-lived consumers can keep using AsyncDatabase and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and MongoDBCRUDRepository.__init__ stores self.collection = db[collection_name]. After the first refresh, those activities still use collections bound to old_client; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.
Artifacts
Repro: pytest harness for cached Mongo collection across OIDC refresh
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: verbose pytest output showing cached collection uses the closed superseded client
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/config/dependencies.py
Line: 280-286
Comment:
**Old clients still used**
This refresh swaps only `GlobalDependencies.mongodb_database`, but long-lived consumers can keep using `AsyncDatabase` and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and `MongoDBCRUDRepository.__init__` stores `self.collection = db[collection_name]`. After the first refresh, those activities still use collections bound to `old_client`; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.
How can I resolve this? If you propose a fix, please make it concise.
Issue:
There is a legacy cron job that restarts the agentex pod every 30 minutes in order to refresh the OIDC token for GCP. Because this cron job restarts the entire prod, while the pod is down, Compass workflows can lose connectivity and record polling errors.
Fix:
We will instead rebuild only the mongo client in process every 45 min instead of the chron job. Originally, the chron job was created as a legacy process for refreshing oidc token, but killing the whole pod was overkill. The overall process is kept alive by refreshing only the mongo client which houses the token
We have a drain and close strategy. After we create our new mongo client, new requests are redirected there and we let in flight queries in the old client finish naturally so nothing gets dropped.
Greptile Summary
This PR replaces a blunt 30-minute cron job that restarted the entire agentex pod (for GCP OIDC token refresh) with an in-process MongoDB client swap on a 45-minute cycle. The fix builds and validates a fresh
AsyncMongoClient, atomically swaps it intoGlobalDependencies, and closes the superseded client after a 60-second drain so in-flight operations are never dropped.GlobalDependenciesgainsrefresh_mongodb_client(), a background_mongodb_oidc_refresh_loop, and proper cleanup hooks inforce_reload/async_shutdown. The gate onMONGODB-OIDCin the URI means non-OIDC deployments (DocumentDB, plain auth) are never affected.RetentionCleanupActivitiesswitches from startup-captured repository/use-case instances to a per-run factory, so activities always resolvemongodb_databasefrom the live singleton after each refresh rather than holding a reference to a collection bound to a closed client.Confidence Score: 5/5
Safe to merge — the core refresh path is well-gated (OIDC-only, positive interval, non-None client), the build-validate-swap-drain sequence is correct, and the factory pattern in RetentionCleanupActivities properly addresses stale-collection exposure.
Both issues flagged by previous reviewers (failed-ping client leak, stale collection references) are addressed cleanly in this revision. The remaining observations are all minor operational improvements and do not affect correctness or data safety.
agentex/src/config/dependencies.py — the async_shutdown path leaves pending drain tasks to be cleaned up by event-loop cancellation rather than explicitly awaited; worth a second read if the deployment does graceful drains under load.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant RefreshLoop as OIDC Refresh Loop participant Deps as GlobalDependencies participant NewClient as New MongoClient participant OldClient as Old MongoClient participant Activities as RetentionCleanupActivities RefreshLoop->>Deps: refresh_mongodb_client() Deps->>NewClient: _build_mongodb_client(uri) Deps->>NewClient: admin.command("ping") alt ping fails Deps->>NewClient: close() to prevent leak NewClient-->>Deps: exception re-raised, old client kept else ping succeeds Deps->>Deps: swap mongodb_client and mongodb_database Deps->>OldClient: schedule close after 60s drain OldClient->>OldClient: close() end Activities->>Deps: use_case_factory() per activity run Deps-->>Activities: TaskRetentionUseCase with current mongodb_database Activities->>NewClient: Mongo ops always use refreshed client%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant RefreshLoop as OIDC Refresh Loop participant Deps as GlobalDependencies participant NewClient as New MongoClient participant OldClient as Old MongoClient participant Activities as RetentionCleanupActivities RefreshLoop->>Deps: refresh_mongodb_client() Deps->>NewClient: _build_mongodb_client(uri) Deps->>NewClient: admin.command("ping") alt ping fails Deps->>NewClient: close() to prevent leak NewClient-->>Deps: exception re-raised, old client kept else ping succeeds Deps->>Deps: swap mongodb_client and mongodb_database Deps->>OldClient: schedule close after 60s drain OldClient->>OldClient: close() end Activities->>Deps: use_case_factory() per activity run Deps-->>Activities: TaskRetentionUseCase with current mongodb_database Activities->>NewClient: Mongo ops always use refreshed clientReviews (2): Last reviewed commit: "comment fixes" | Re-trigger Greptile
Context used:
Learned From
scaleapi/scaleapi#126388