Fix AssertionError when parsing malformed HTML (closes #568)#599
Open
gaoflow wants to merge 2 commits into
Open
Fix AssertionError when parsing malformed HTML (closes #568)#599gaoflow wants to merge 2 commits into
gaoflow wants to merge 2 commits into
Conversation
Two separate assertions in the parser incorrectly assumed that certain
conditions can only occur during fragment parsing (innerHTML mode), but
real-world malformed markup can trigger them in a full parse:
1. ``resetInsertionMode``: when a ``<select>`` element appears in the
open-elements stack inside foreign content (e.g. inside ``<math>``),
the subsequent ``resetInsertionMode`` call encountered the element
name in the ("select", "colgroup", ...) guard and raised
``AssertionError``. Per the WHATWG spec, ``<select>`` is valid in
that position during ordinary parsing; the correct mode is
"inSelect". Remove ``select`` from the guarded set so it falls
through to the existing ``newModes`` lookup. For ``colgroup``,
``head``, and ``html``, replace the hard assert with a ``continue``
so the loop finds a better ancestor rather than crashing.
2. ``InTablePhase.processEOF``: malformed markup such as
``<table><svg><html>`` can leave ``<html>`` as the current node
while the parser is in "in table" mode without being in innerHTML
mode. Replace the assertion with a ``parseError`` call so the
parser reports the condition and stops cleanly.
Reproduces crashes reported in issue html5lib#568 (oss-fuzz / Beautiful Soup
test cases):
* ``b'-<math><sElect><mi><sElect><sElect>'`` → AssertionError
* ``b'\xc3\xb1<table><svg><html>'`` → AssertionError
Author
|
Pushed Verification run locally:
I also checked the current AppVeyor failure. It looks unrelated to this parser change: the Python 3.7 jobs fail before tests at |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes two
AssertionErrorcrashes triggered by malformed HTML during a regular (non-fragment) parse, originally reported in #568 via Google's oss-fuzz / Beautiful Soup.Bug 1 —
resetInsertionModecrashes on<select>inside foreign contentWhen a
<select>element appears in the open-elements stack inside a foreign-content element (e.g. inside<math>), the subsequent call toresetInsertionModeencountered the element name in the("select", "colgroup", ...)guard on line 369–370 and raised anAssertionError:Per the WHATWG HTML5 parsing spec (§8.2.8.1 "reset the insertion mode appropriately"),
<select>is not restricted to the innerHTML case — the correct mode is"inSelect"(or"inSelectInTable"if a table ancestor is present). The fix removes"select"from the asserted set so it falls through to the existingnewModeslookup.For
"colgroup","head", and"html"(which genuinely should only appear during fragment parsing), the hardassertis replaced with acontinueso the loop seeks a better ancestor instead of crashing.Bug 2 —
InTablePhase.processEOFcrashes when<html>is current nodeMalformed markup such as
ñ<table><svg><html>can leave<html>as the current node while the parser is in "in table" mode outside of an innerHTML context. The old code assertedself.parser.innerHTMLat that point:The fix emits a
parseErrorfor the non-innerHTML path and then lets the parser stop cleanly, consistent with what the spec requires on EOF.Before / After
Test plan
html5lib.parse(b'-<math><sElect><mi><sElect><sElect>')— wasAssertionError, now returns documenthtml5lib.parse(b'\xc3\xb1<table><svg><html>')— wasAssertionError, now returns document<select>forms,<table>, and MathML documents continue to parse correctlyThis pull request was prepared with the assistance of AI, under my direction and review.