Fix AssertionError when parsing malformed HTML (closes #568) #599

Open
gaoflow wants to merge 2 commits into html5lib:master from gaoflow:fix/malformed-html-assertion-crash

gaoflow commented Jun 25, 2026


Summary

Fixes two AssertionError crashes triggered by malformed HTML during a regular (non-fragment) parse, originally reported in #568 via Google's oss-fuzz / Beautiful Soup.

Bug 1 — resetInsertionMode crashes on <select> inside foreign content

When a <select> element appears in the open-elements stack inside a foreign-content element (e.g. inside <math>), the subsequent call to resetInsertionMode hits the ("select", "colgroup", ...) guard on lines 369–370 and raises an AssertionError:

b'-<math><sElect><mi><sElect><sElect>'
# → AssertionError (html5parser.py resetInsertionMode line 370)

Per the WHATWG HTML5 parsing spec (§8.2.8.1 "reset the insertion mode appropriately"), <select> is not restricted to the innerHTML case — the correct mode is "inSelect" (or "inSelectInTable" if a table ancestor is present). The fix removes "select" from the asserted set so it falls through to the existing newModes lookup.

For "colgroup", "head", and "html" (which genuinely should only appear during fragment parsing), the hard assert is replaced with a continue, so the loop moves on to the next ancestor in the stack instead of crashing.
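
The walk over the open-elements stack described above can be sketched as a standalone function. This is a simplified model written for illustration, not the code in html5parser.py: pick_insertion_mode, FRAGMENT_ONLY, the (name, namespace) tuple representation, and the final "inBody" fallback are all assumptions, and the "inSelectInTable" case is omitted.

```python
# Simplified, hypothetical model of the insertion-mode reset; not html5lib's code.
HTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Element name -> insertion mode (illustrative subset of a newModes-style table).
NEW_MODES = {
    "select": "inSelect",
    "td": "inCell",
    "th": "inCell",
    "tr": "inRow",
    "colgroup": "inColumnGroup",
    "table": "inTable",
    "head": "inBody",
    "body": "inBody",
    "frameset": "inFrameset",
    "html": "beforeHead",
}

# Names treated as fragment-only; "select" is deliberately absent.
FRAGMENT_ONLY = ("colgroup", "head", "html")


def pick_insertion_mode(open_elements, fragment_context=None):
    """Return a mode name for a stack of (name, namespace) tuples, root first.

    fragment_context is the context element name for a fragment parse,
    or None for a regular parse.
    """
    for index in range(len(open_elements) - 1, -1, -1):
        name, namespace = open_elements[index]
        last = index == 0
        if last and fragment_context is not None:
            # Fragment case: the root stands in for the context element.
            name = fragment_context
        if fragment_context is None and name in FRAGMENT_ONLY:
            # Regular parse: skip instead of asserting.
            continue
        if not last and namespace != HTML_NS:
            # Foreign-content elements never select a mode.
            continue
        if name in NEW_MODES:
            return NEW_MODES[name]
    # Assumed fallback when no ancestor selects a mode.
    return "inBody"
```

In this model a stack such as html, body, math (MathML namespace), select resolves to "inSelect" because the walk starts from the innermost element, while a stray head in a regular parse is skipped in favour of its nearest usable ancestor.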

Bug 2 — InTablePhase.processEOF crashes when <html> is current node

Malformed markup such as ñ<table><svg><html> can leave <html> as the current node while the parser is in "in table" mode outside of an innerHTML context. The old code asserted self.parser.innerHTML at that point:

b'\xc3\xb1<table><svg><html>'
# → AssertionError (html5parser.py InTablePhase.processEOF line 1699)

The fix emits a parseError for the non-innerHTML path and then lets the parser stop cleanly, consistent with what the spec requires on EOF.
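
The EOF decision can be modelled in isolation as follows. This is a hedged sketch rather than the patched method: in_table_eof_errors is a hypothetical helper, and the error code used for the new non-innerHTML branch is a placeholder, since the description does not say which code the patch reports.

```python
# Hypothetical placeholder; the PR description does not name the code it emits.
HTML_CURRENT_NODE_AT_EOF = "placeholder-html-current-node-at-eof"


def in_table_eof_errors(current_node_name, is_fragment):
    """Return the parse errors to record at EOF while in "in table" mode.

    current_node_name: name of the element at the top of the open-elements stack.
    is_fragment: True when parsing a fragment (innerHTML mode).
    """
    errors = []
    if current_node_name != "html":
        # Unclosed table content at EOF.
        errors.append("eof-in-table")
    elif not is_fragment:
        # Previously `assert innerHTML`; now reported as a parse error instead.
        errors.append(HTML_CURRENT_NODE_AT_EOF)
    # Fragment parse with <html> as the current node: nothing to report.
    return errors
```

Whichever branch is taken, the function returns normally, which mirrors the intent that the parser records the condition and stops instead of raising.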

Before / After

import html5lib

# Both raise AssertionError before this PR:
html5lib.parse(b'-<math><sElect><mi><sElect><sElect>')
html5lib.parse(b'\xc3\xb1<table><svg><html>')

# Both return a document tree after this PR (with appropriate parse errors recorded).

Test plan

  • html5lib.parse(b'-<math><sElect><mi><sElect><sElect>') — was AssertionError, now returns document
  • html5lib.parse(b'\xc3\xb1<table><svg><html>') — was AssertionError, now returns document
  • Normal HTML, <select> forms, <table>, and MathML documents continue to parse correctly
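
A small harness along these lines could drive the first two checks. It is a sketch, not the regression tests added in this PR: crashes_with_assertion and crashing_inputs are hypothetical helper names, and the parse callable is passed in so the harness does not depend on any particular parser build.

```python
# The two malformed inputs from the PR description.
MALFORMED_INPUTS = [
    b'-<math><sElect><mi><sElect><sElect>',
    b'\xc3\xb1<table><svg><html>',
]


def crashes_with_assertion(parse, data):
    """Return True if parse(data) raises AssertionError, False otherwise."""
    try:
        parse(data)
    except AssertionError:
        return True
    return False


def crashing_inputs(parse, inputs=MALFORMED_INPUTS):
    """Return the subset of inputs that still trigger an AssertionError."""
    return [data for data in inputs if crashes_with_assertion(parse, data)]
```

With html5lib importable, crashing_inputs(html5lib.parse) would be expected to return an empty list on this branch and both inputs on master, going by the description above.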

This pull request was prepared with the assistance of AI, under my direction and review.

gaoflow added 2 commits June 25, 2026 14:12
Two separate assertions in the parser incorrectly assumed that certain
conditions can only occur during fragment parsing (innerHTML mode), but
real-world malformed markup can trigger them in a full parse:

1. ``resetInsertionMode``: when a ``<select>`` element appears in the
   open-elements stack inside foreign content (e.g. inside ``<math>``),
   the subsequent ``resetInsertionMode`` call encountered the element
   name in the ("select", "colgroup", ...) guard and raised
   ``AssertionError``.  Per the WHATWG spec, ``<select>`` is valid in
   that position during ordinary parsing; the correct mode is
   "inSelect".  Remove ``select`` from the guarded set so it falls
   through to the existing ``newModes`` lookup.  For ``colgroup``,
   ``head``, and ``html``, replace the hard assert with a ``continue``
   so the loop finds a better ancestor rather than crashing.

2. ``InTablePhase.processEOF``: malformed markup such as
   ``<table><svg><html>`` can leave ``<html>`` as the current node
   while the parser is in "in table" mode without being in innerHTML
   mode.  Replace the assertion with a ``parseError`` call so the
   parser reports the condition and stops cleanly.

Reproduces crashes reported in issue html5lib#568 (oss-fuzz / Beautiful Soup
test cases):
  * ``b'-<math><sElect><mi><sElect><sElect>'``  → AssertionError
  * ``b'\xc3\xb1<table><svg><html>'``           → AssertionError

gaoflow commented Jun 25, 2026

Author

Pushed d9ac326, which adds focused regression coverage for both malformed inputs from the PR description.

Verification run locally:

  • PYTHONPATH=$PWD uv run --with-requirements requirements-test.txt --with 'setuptools<80' python -m pytest html5lib/tests/test_parser2.py -q
  • PYTHONPATH=$PWD uv run --with-requirements requirements-test.txt python -m flake8 html5lib/html5parser.py html5lib/tests/test_parser2.py
  • git diff --check

I also checked the current AppVeyor failure. It looks unrelated to this parser change: the Python 3.7 jobs fail before tests at py -VV with “No suitable Python runtime found”, and the Python 2.7 optional job fails while installing lxml 5.0.2 because Windows cannot build it without VC++ 9.0.


1 participant