Skip to content

IGNITE-28805 Fix flaky ContinuousQueryMarshallerTest#testRemoteFilterFactoryServer#13265

Open
chesnokoff wants to merge 1 commit into
apache:masterfrom
chesnokoff:ignite-28805-cq-testRemoteFilterFactoryServer
Open

IGNITE-28805 Fix flaky ContinuousQueryMarshallerTest#testRemoteFilterFactoryServer#13265
chesnokoff wants to merge 1 commit into
apache:masterfrom
chesnokoff:ignite-28805-cq-testRemoteFilterFactoryServer

Conversation

@chesnokoff

@chesnokoff chesnokoff commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Thank you for submitting the pull request to the Apache Ignite.

In order to streamline the review of the contribution
we ask you to ensure the following steps have been taken:

The Contribution Checklist

  • There is a single JIRA ticket related to the pull request.
  • The web-link to the pull request is attached to the JIRA ticket.
  • The JIRA ticket has the Patch Available state.
  • The pull request body describes changes that have been made.
    The description explains WHAT and WHY was made instead of HOW.
  • The pull request title is treated as the final commit message.
    The following pattern must be used: IGNITE-XXXX Change summary where XXXX - number of JIRA issue.
  • A reviewer has been mentioned through the JIRA comments
    (see the Maintainers list)
  • The pull request has been checked by the Teamcity Bot and
    the green visa attached to the JIRA ticket (see tab PR Check at TC.Bot - Instance 1 or TC.Bot - Instance 2)

Notes

If you need any help, please email dev@ignite.apache.org or ask anу advice on http://asf.slack.com #ignite channel.

@chesnokoff chesnokoff changed the title IGNITE-28805 Fix flaky ContinuousQueryMarshallerTest#testRemoteFilter… IGNITE-28805 Fix flaky ContinuousQueryMarshallerTest#testRemoteFilterFactoryServer Jun 22, 2026
@ignitetcbot

Copy link
Copy Markdown
Contributor

TCBot Test Analysis

Possible Blockers (0)

No blockers found.

New Tests (0)

No new tests found.


final Ignite node2 = "client".equals(node2Name) ? startClientGrid(node2Name) : startGrid(node2Name);

awaitPartitionMapExchange();

@chesnokoff chesnokoff Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test started server2 and immediately executed CQ with an initial scan query. At this moment partition map exchange could still be in progress and the initial scan sometimes did not see all existing entries.

In failed run the initial query returned only 2 entries instead of 5. Then all update events arrived, but the latch was still not fully counted down and the test failed with AssertionError: 4

The fix is to wait for pme after starting server2 and before executing the continuous query

@anton-vinogradov

Copy link
Copy Markdown
Contributor

LGTM — root cause and fix are both correct.

A couple of notes for the record:

Why only the server variant flaked: both @Tests share check(), but starting a second server triggers rebalancing/PME while starting a client moves no data. So the initial ScanQuery racing an in-progress PME cleanly explains why testRemoteFilterFactoryServer misses pre-loaded entries and the client variant doesn't. awaitPartitionMapExchange() blocks on a deterministic condition (no MOVING partitions), so this is a real fix rather than a sleep-style band-aid, and it matches the idiom already used by ~10 other tests in this package. Keeping the wait in the shared check() also hardens the client variant at no cost. 👍

Latch math checks out: 5 initial (even keys 0,2,4,6,8) + 10 updates (keys 10–19; DummyEventFilterFactory accepts all) = 15.

Minor: the numbers in the explanation don't quite reconcile — a 2-entry initial scan leaves the latch at 3 (15−2−10), but the message was AssertionError: 4, which implies either a 1-entry initial scan or that one update event was also missed in that run. awaitPartitionMapExchange() covers both, just worth confirming the missed-event path is the initial scan only and not also CQ registration.

Residual: the 5s latch.await for async CQ delivery is the only remaining timing dependency — standard and fine, flagging for completeness.


final Ignite node2 = "client".equals(node2Name) ? startClientGrid(node2Name) : startGrid(node2Name);

awaitPartitionMapExchange();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nit: a one-line note on why the wait is here would protect it from being removed during a future "cleanup", since the call looks redundant out of context. Up to you — most of the other awaitPartitionMapExchange() calls in this package aren't commented either.

Suggested change
awaitPartitionMapExchange();
// Wait for rebalance to finish so the initial scan query observes all pre-loaded entries.
awaitPartitionMapExchange();

@chesnokoff chesnokoff Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to add this comment. It mostly repeats the method name becauseawaitPartitionMapExchange() already says that we wait for PME

Also, "rebalance" is a bit too narrow here. The main point is that the initial scan should not start while the topology and partition map is still changing after server2 joins

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants