Skip to content

HDDS-13434. Wait for replicas in TestOzoneDebugShell chunk-info check#10600

Open
chihsuan wants to merge 1 commit into
apache:masterfrom
chihsuan:HDDS-13434
Open

HDDS-13434. Wait for replicas in TestOzoneDebugShell chunk-info check#10600
chihsuan wants to merge 1 commit into
apache:masterfrom
chihsuan:HDDS-13434

Conversation

@chihsuan

@chihsuan chihsuan commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Test-only change.

TestOzoneDebugShell.testChunkInfoVerifyPathsAreDifferent intermittently failed with expected: <3> but was: <2>. A RATIS THREE write is acknowledged on a Ratis majority, so right after the key is written one replica may not have applied the write yet. When ozone debug replicas chunk-info runs in that window, ReadContainer on the lagging datanode fails, the datanode is dropped from the response, and the test sees only 2 distinct block file paths instead of 3.

Fix. Wait until all three datanodes report the block (three distinct paths) before asserting, using GenericTestUtils.waitFor, instead of asserting on a single read taken immediately after the write. A persistent error or an unexpected path count is reported via the wait's last observed state, so genuine regressions still surface with context rather than an opaque timeout.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13434

How was this patch tested?

  • Ran the previously-flaky test locally; passes for all BucketLayout values.
  • Confirmed the wait is not vacuous: with the target temporarily set to an unreachable path count, the test times out as expected, proving the wait actually gates the assertion.
  • flaky-test-check on the fork, TestOzoneIntegrationNonHA ALL, 10x10: the OzoneDebugShell cases (including this test) passed in all 100 iterations. https://github.com/chihsuan/ozone/actions/runs/28101143264
    • One unrelated pre-existing flake in TestListKeysWithFSO (BUCKET_ALREADY_EXISTS in test setup) appeared in a single split; it is not touched by this change.

@chihsuan chihsuan marked this pull request as ready for review June 24, 2026 14:25
Comment on lines +213 to +215
AtomicInteger exitCode = new AtomicInteger(1);
AtomicInteger lastPathCount = new AtomicInteger(-1);
AtomicReference<Throwable> lastError = new AtomicReference<>();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chihsuan for working on this. I don't think these three variables need to be atomic. Could you walk me through the reasoning behind using atomic here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! @chungen0126 Yeah, these aren't for thread-safety. I picked Atomic* specifically because the lambda has to write exitCode/diagnostics back out, and a plain int/Throwable can't be reassigned inside a lambda so they need to be mutable holders.

Besides, I observed that Atomic* is the existing idiom for waitFor holders in the repo (e.g. TestSnapshotBackgroundServices).

Open to changing it if you have any suggestions. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants