Skip to content

bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh#338

Open
michael-chou359 wants to merge 2 commits into
mainfrom
mc/oidc-refresh
Open

bug: replace cron job restarting agentex pod for oidc refresh with mongoclient refresh#338
michael-chou359 wants to merge 2 commits into
mainfrom
mc/oidc-refresh

Conversation

@michael-chou359

@michael-chou359 michael-chou359 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Issue:

There is a legacy cron job that restarts the agentex pod every 30 minutes in order to refresh the OIDC token for GCP. Because this cron job restarts the entire prod, while the pod is down, Compass workflows can lose connectivity and record polling errors.

Fix:

We will instead rebuild only the mongo client in process every 45 min instead of the chron job. Originally, the chron job was created as a legacy process for refreshing oidc token, but killing the whole pod was overkill. The overall process is kept alive by refreshing only the mongo client which houses the token

We have a drain and close strategy. After we create our new mongo client, new requests are redirected there and we let in flight queries in the old client finish naturally so nothing gets dropped.

Greptile Summary

This PR replaces a blunt 30-minute cron job that restarted the entire agentex pod (for GCP OIDC token refresh) with an in-process MongoDB client swap on a 45-minute cycle. The fix builds and validates a fresh AsyncMongoClient, atomically swaps it into GlobalDependencies, and closes the superseded client after a 60-second drain so in-flight operations are never dropped.

  • GlobalDependencies gains refresh_mongodb_client(), a background _mongodb_oidc_refresh_loop, and proper cleanup hooks in force_reload/async_shutdown. The gate on MONGODB-OIDC in the URI means non-OIDC deployments (DocumentDB, plain auth) are never affected.
  • RetentionCleanupActivities switches from startup-captured repository/use-case instances to a per-run factory, so activities always resolve mongodb_database from the live singleton after each refresh rather than holding a reference to a collection bound to a closed client.
  • New unit and integration tests cover the build-validate-swap-drain ordering, the failed-ping cleanup path, loop lifecycle gating, and end-to-end round-trips against a real Mongo container.

Confidence Score: 5/5

Safe to merge — the core refresh path is well-gated (OIDC-only, positive interval, non-None client), the build-validate-swap-drain sequence is correct, and the factory pattern in RetentionCleanupActivities properly addresses stale-collection exposure.

Both issues flagged by previous reviewers (failed-ping client leak, stale collection references) are addressed cleanly in this revision. The remaining observations are all minor operational improvements and do not affect correctness or data safety.

agentex/src/config/dependencies.py — the async_shutdown path leaves pending drain tasks to be cleaned up by event-loop cancellation rather than explicitly awaited; worth a second read if the deployment does graceful drains under load.

Important Files Changed

Filename Overview
agentex/src/config/dependencies.py Core of the fix: adds refresh_mongodb_client(), a background OIDC refresh loop, and graceful drain-and-close of superseded clients; previous reviewer concerns about client leaks on failed ping and stale collections are both addressed in this version
agentex/src/temporal/activities/retention_cleanup_activities.py Switches from startup-captured use case/repository to per-activity factory invocation, ensuring activities always hold collections bound to the current Mongo client after a refresh
agentex/src/temporal/run_worker.py Passes a lambda factory to RetentionCleanupActivities so build_task_retention_use_case is called at activity runtime against the singleton GlobalDependencies, picking up the refreshed mongodb_database
agentex/src/config/environment_variables.py Adds MONGODB_OIDC_REFRESH_INTERVAL_SECONDS env var with a 2700s (45 min) default; value is duplicated between the Pydantic default and the os.environ.get fallback string, consistent with the existing pattern in this file
agentex/tests/unit/config/test_mongodb_oidc_refresh.py Comprehensive unit tests covering the gate check, swap ordering, failed-ping cleanup, loop lifecycle, and env var parsing — all without a real MongoDB
agentex/tests/integration/config/test_mongodb_oidc_refresh_integration.py Integration test that proves the build-validate-swap-drain cycle works end-to-end against a real Mongo container, verifying data written before the swap is readable after it and that the old client is closed
agentex/tests/unit/temporal/test_retention_cleanup_activities.py Updates all existing tests to the new factory-based constructor and adds a dedicated test verifying that the use case is resolved fresh on each activity invocation
agentex/tests/integration/config/init.py Empty init.py to make the new integration/config directory a Python package

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant RefreshLoop as OIDC Refresh Loop
    participant Deps as GlobalDependencies
    participant NewClient as New MongoClient
    participant OldClient as Old MongoClient
    participant Activities as RetentionCleanupActivities

    RefreshLoop->>Deps: refresh_mongodb_client()
    Deps->>NewClient: _build_mongodb_client(uri)
    Deps->>NewClient: admin.command("ping")
    alt ping fails
        Deps->>NewClient: close() to prevent leak
        NewClient-->>Deps: exception re-raised, old client kept
    else ping succeeds
        Deps->>Deps: swap mongodb_client and mongodb_database
        Deps->>OldClient: schedule close after 60s drain
        OldClient->>OldClient: close()
    end

    Activities->>Deps: use_case_factory() per activity run
    Deps-->>Activities: TaskRetentionUseCase with current mongodb_database
    Activities->>NewClient: Mongo ops always use refreshed client
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant RefreshLoop as OIDC Refresh Loop
    participant Deps as GlobalDependencies
    participant NewClient as New MongoClient
    participant OldClient as Old MongoClient
    participant Activities as RetentionCleanupActivities

    RefreshLoop->>Deps: refresh_mongodb_client()
    Deps->>NewClient: _build_mongodb_client(uri)
    Deps->>NewClient: admin.command("ping")
    alt ping fails
        Deps->>NewClient: close() to prevent leak
        NewClient-->>Deps: exception re-raised, old client kept
    else ping succeeds
        Deps->>Deps: swap mongodb_client and mongodb_database
        Deps->>OldClient: schedule close after 60s drain
        OldClient->>OldClient: close()
    end

    Activities->>Deps: use_case_factory() per activity run
    Deps-->>Activities: TaskRetentionUseCase with current mongodb_database
    Activities->>NewClient: Mongo ops always use refreshed client
Loading

Reviews (2): Last reviewed commit: "comment fixes" | Re-trigger Greptile

Context used:

  • Rule used - Store magic numbers as class or instance variables... (source)

Learned From
scaleapi/scaleapi#126388

@michael-chou359 michael-chou359 requested a review from a team as a code owner June 24, 2026 16:59
Comment on lines +280 to +286
if old_client is not None and old_client is not new_client:
task = asyncio.create_task(
self._close_mongodb_client_after_delay(old_client)
)
# Keep a strong reference until done so the task is not GC'd mid-flight.
self._mongodb_close_tasks.add(task)
task.add_done_callback(self._mongodb_close_tasks.discard)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Old clients still used

This refresh swaps only GlobalDependencies.mongodb_database, but long-lived consumers can keep using AsyncDatabase and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and MongoDBCRUDRepository.__init__ stores self.collection = db[collection_name]. After the first refresh, those activities still use collections bound to old_client; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.

Artifacts

Repro: pytest harness for cached Mongo collection across OIDC refresh

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: verbose pytest output showing cached collection uses the closed superseded client

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/config/dependencies.py
Line: 280-286

Comment:
**Old clients still used**

This refresh swaps only `GlobalDependencies.mongodb_database`, but long-lived consumers can keep using `AsyncDatabase` and collection objects from the old client. For example, the Temporal worker builds retention repositories once at startup, and `MongoDBCRUDRepository.__init__` stores `self.collection = db[collection_name]`. After the first refresh, those activities still use collections bound to `old_client`; once this delayed close runs, retention cleanup Mongo operations can fail even though the global dependency points at a fresh client.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Comment thread agentex/src/config/dependencies.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant