Skip to content

Evaluate isDelimiter once in nextToken whitespace skip#618

Merged
garydgregory merged 2 commits into
apache:masterfrom
rootvector2:delimiter-double-consume
Jun 25, 2026
Merged

Evaluate isDelimiter once in nextToken whitespace skip#618
garydgregory merged 2 commits into
apache:masterfrom
rootvector2:delimiter-double-consume

Conversation

@rootvector2

Copy link
Copy Markdown
Contributor

nextToken evaluates the side-effecting isDelimiter(c) twice on the same character when ignoreSurroundingSpaces is on: once in the leading-whitespace skip loop and again in the start-of-token check. isDelimiter consumes the trailing characters of a multi-character delimiter as it matches, so for a delimiter whose first character is whitespace the loop eats the delimiter and the second call then peeks past it, mismatches, and the empty field at that boundary is dropped. With delimiter " |" and ignoreSurroundingSpaces enabled, a | |b parsed to two fields (a, " b") instead of three (a, empty, b), and |a to one field (" a") instead of two (empty, a). Fixed by evaluating isDelimiter once in the whitespace loop and reusing that result for the start-of-token decision.

Found while auditing the multi-character delimiter look-ahead against ignoreSurroundingSpaces.

  • Read the contribution guidelines for this project.
  • Read the ASF Generative Tooling Guidance if you use Artificial Intelligence (AI).
  • I used AI to create any part of, or all of, this pull request. Which AI tool was used to create this pull request, and to what extent did it contribute?
  • Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.
  • Write unit tests that match behavioral changes, where the tests fail if the changes to the runtime are not applied. This may not always be possible, but it is a best practice.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Each commit in the pull request should have a meaningful subject line and body.

@garydgregory garydgregory changed the title evaluate isDelimiter once in nextToken whitespace skip Evaluate isDelimiter once in nextToken whitespace skip Jun 25, 2026

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @rootvector2
Thank you for the PR.
Would you please add a test that demonstrates why this is an issue at public API level?
TY!

exercises the empty-field-dropped regression through CSVParser, not just the lexer.
@rootvector2

Copy link
Copy Markdown
Contributor Author

Added testEmptyFieldBeforeWhitespacePrefixedMultiCharacterDelimiter in CSVParserTest. With delimiter | and ignoreSurroundingSpaces on, |a parses to ["", "a"] and a | |b to ["a", "", "b"]. Without the fix both lose the empty field (the test fails with array lengths differ, expected: <2> but was: <1>). Full mvn is green.

@garydgregory

garydgregory commented Jun 25, 2026

Copy link
Copy Markdown
Member

Merged, TY @rootvector2. Do you see any other issues with whitespace in any other configuration parameters?

@garydgregory garydgregory merged commit 85d26fc into apache:master Jun 25, 2026
16 checks passed
@rootvector2

Copy link
Copy Markdown
Contributor Author

Took a look. The double-consume was specific to nextToken's leading-whitespace skip, since that's the only place that called isDelimiter twice on the same char. parseSimpleToken and parseEncapsulatedToken each call it once per char, so they were never affected.

I also ran a whitespace-prefixed multi-char delimiter through the other whitespace-sensitive configs: trim, a tab-prefixed delimiter (\t|), a quoted token followed by such a delimiter, and a trailing-whitespace delimiter (| ). All of them parse correctly now. Nothing else stood out, but I'll keep poking and open a separate PR if something falls out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants