
Adsk Contrib - Fix various Actions problems #2324

Open
doug-walker wants to merge 13 commits into AcademySoftwareFoundation:main from autodesk-forks:walker/rtd-fix

doug-walker commented Jun 28, 2026

Collaborator

This PR attempts to get our Actions back into the green. Numerous changes were required.

  1. The CI Action was failing because RTD no longer supports ubuntu-20.04.
  • Update build OS to ubuntu-lts-latest.
  • Update requested Python from 3.11 to 3.14.
  2. The Dependencies-latest Action was failing because the windows-2025-vs2026 runner image apparently no longer has the DirectX headers available.
  • Have the action install the headers via vcpkg.
  • Adjust share/cmake/modules/FindDirectX-Headers.cmake based on the following suggestion from claude:

When find_package(directx-headers CONFIG QUIET) finds the vcpkg-installed package, it creates the Microsoft::DirectX-Headers imported target with INTERFACE_INCLUDE_DIRECTORIES set. We now pull that path into DirectX-Headers_INCLUDE_DIR before find_package_handle_standard_args runs — so the required-variable check passes and DirectX-Headers_FOUND stays TRUE.

This also means the target-creation block at line 69 (if(DirectX-Headers_FOUND AND NOT TARGET Microsoft::DirectX-Headers AND DirectX-Headers_INCLUDE_DIR)) will correctly skip when the vcpkg config already created Microsoft::DirectX-Headers.

@num3ric, please review the DirectX fix, thank you!

  3. Removed the pin of the Linux VFX CY2023 CI to the 2023.2 container. All Linux runs now use the new containers that JF released on June 29th, which are based on Conan 2.

  4. The new Linux containers broke numerous builds and exposed several long-standing, undetected bugs in the CI. Although numerous CI jobs set use-oiio: 'ON', apparently none of them actually built with OIIO. On Linux, this was because OpenImageIO, although present, was not found, and the build silently fell back to the OpenEXR path. With the new containers, OIIO is found, but the link fails for the OIIO apps because OIIO requires a dynamic OCIO library with a different version and namespace from the one being compiled. This is due to an intentional decision in the containers to delete the version of OCIO that was used to compile OIIO, to avoid the possibility of it causing other problems. I'm not sure that is necessary, but claude was able to work around it by downloading that library from the public/anonymous ASWF Conan remote. (A new aswf-ocio-version variable is added when use-oiio is on to support this.) The older OCIO is only downloaded if use-oiio is on, so at least with the other builds there is no risk of having two OCIO libraries on the system. This new install step is in share/ci/scripts/linux/dnf/install_aswf_opencolorio.sh.

Unfortunately, this occasionally fails with:

  /usr/bin/docker pull aswf/ci-ocio:2024
  Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  5. On Mac, install_oiio.sh was never being run, so the build ignored use-oiio and just used OpenEXR. I removed the ability to choose an OIIO version from that installer, since brew doesn't support that for OIIO other than by building from source, which we don't have time for in the CI. I modified ci_workflow.yml so that it now does the brew install if use-oiio is on. I added a timer and verified that this only takes about 30 extra seconds.

  6. On Windows, there is no install_oiio.sh script at all and the runners don't have OIIO installed, so I turned use-oiio off for all builds and added a comment. In the future, someone could investigate installing OIIO via vcpkg, though that may be more trouble than it's worth, based on experience with the .bat scripts we already provide, which often break because of vcpkg issues.

  7. I modified share/cmake/modules/FindExtPackages.cmake so that if OCIO_USE_OIIO_FOR_APPS is ON but OIIO is not found, it fails immediately with an error message rather than falling back to OpenEXR. Going forward, this should prevent us from thinking we're testing OIIO when we're actually not.

  8. The dependencies_latest.yml workflow is crashing sporadically in OpOptimizers_test.cpp:invlut_pair_identities in the CPU tests that use AVX512. I've discovered that the windows-2025-vs2026 image only occasionally runs on a CPU that supports AVX512, so these tests don't often run. I have not yet determined whether it crashes every time it runs on one of the AVX512 machines. I added a test to tests/cpu/ops/lut3d/Lut3DOpCPU_tests.cpp because claude suspects that there is a problem with the remainder path in LUT3D tetrahedral interpolation. The test should provide more data than the simple illegal instruction crash we're currently getting. Tagging @markreidvfx for visibility.
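
To illustrate the kind of check a remainder-path test performs, here is a minimal, portable sketch. It is not the test added to Lut3DOpCPU_tests.cpp: the names (kBlock, applyScalar, applyBlocked, remainderPathIsClean) are hypothetical, the per-pixel work is a trivial stand-in for tetrahedral interpolation, and a canary region after the output buffer is used as a simpler substitute for guard pages.

```cpp
#include <cstddef>
#include <vector>

// Assumed block width: 16 floats, i.e. one 512-bit register's worth.
constexpr std::size_t kBlock = 16;

// Hypothetical stand-in for the per-pixel work (the real code interpolates a LUT).
inline float applyScalar(float v)
{
    return v * 0.5f + 0.25f;
}

// Blocked processing: full blocks first, then the remainder ("tail") path.
inline void applyBlocked(const float * in, float * out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + kBlock <= n; i += kBlock)
    {
        for (std::size_t lane = 0; lane < kBlock; ++lane)
        {
            out[i + lane] = applyScalar(in[i + lane]);
        }
    }
    for (; i < n; ++i) // remainder path: fewer than kBlock pixels left
    {
        out[i] = applyScalar(in[i]);
    }
}

// Returns true if the blocked result matches the scalar reference for n pixels
// and nothing was written past the end of the n-element output region.
inline bool remainderPathIsClean(std::size_t n)
{
    const float kCanary = -12345.0f;
    std::vector<float> in(n);
    std::vector<float> out(n + kBlock, kCanary); // canary padding after the output

    for (std::size_t i = 0; i < n; ++i)
    {
        in[i] = static_cast<float>(i) / 64.0f;
    }

    applyBlocked(in.data(), out.data(), n);

    for (std::size_t i = 0; i < n; ++i)
    {
        if (out[i] != applyScalar(in[i])) return false;
    }
    for (std::size_t i = n; i < out.size(); ++i)
    {
        if (out[i] != kCanary) return false; // overrun detected
    }
    return true;
}
```

Running the check for every pixel count from 0 up to a few multiples of the block width covers both the full-block path and every possible tail length. A canary only catches writes that land inside the padding; guard pages, as used in the investigation, also catch stray reads.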

The new test disproved the out-of-bounds memory theory and points to a known issue with certain AMD processors, including the one used in the runner. From claude:

AMD Erratum 1485 affects exactly this hardware class — Zen4-based EPYC 8004/9004 series (your runner's EPYC 9V45 is a Zen4 Genoa-family EPYC chip). The bug: when SMT is enabled and STIBP is not properly enabled, Zen4 can corrupt its own instruction stream during speculative execution, causing a genuinely spurious #UD (illegal instruction) exception — on code that is completely correct. It was originally discovered via "random illegal instruction" crashes during ordinary compilation (nothing to do with AVX512 specifically), which is exactly why it can hit a totally different, much simpler instruction sequence than your first crash (dim=2, numPixels=1, the very first call) instead of something tied to a specific opcode like gather.

This lines up with every piece of evidence we've gathered:

  • "Illegal", not access violation — matches #UD exactly, not a memory-safety fault (consistent with my guard pages finding nothing).
  • Fails only on this specific cloud runner, passes on your local machine — this is a speculative-execution/microcode-state bug tied to whether STIBP is correctly enabled for this specific CPU+OS+hypervisor combination, not your code. Your local machine is almost certainly a different CPU (not this exact Zen4 EPYC SKU under this exact mitigation misconfiguration).
  • Reproduces in complete isolation, immediately, on the very first minimal call — consistent with a hardware-state bug that can trigger on essentially any dense instruction sequence once the precondition (SMT on, STIBP off) is met, rather than something specific to LUT interpolation logic.

The upstream fix (Linux kernel commit f454b18e (https://patchew.org/linux/169701622768.3135.17489375930381616520.tip-bot2@tip-bot2/), Oct 2023) is a kernel-level workaround that force-enables STIBP on affected Zen4 chips — it's not something fixable in OCIO's source at all. The open question is whether Windows Server 2025 (or Azure's Hyper-V layer, on whichever host this GH Actions runner landed on) has an equivalent mitigation, and whether it's actually engaged in this VM configuration. That's infrastructure outside OCIO's or this PR's control.

I found conflicting/imprecise data on the exact Zen4 model-number ranges (they're scattered — 0x11 for Genoa, 0x60 for Raphael, 0x70 for Phoenix, 0xA0-0xAF also Zen4 — from wiki-level sources I can't fully verify), and family alone (0x19) doesn't distinguish Zen4 from Zen3. Rather than risk a wrong or incomplete family/model rule from low-confidence data, I'll match on the exact brand string we have first-hand evidence for from the actual crash log — precise, defensible, and easy to extend later if this shows up on other Zen4 SKUs.

Given that the illegal instruction crash has happened on multiple runs that I've done, I decided to turn off AVX512 support for the EPYC 9V45 chip. This applies even outside the context of CI. My reasoning is that I'd rather give up AVX512 performance than crash applications. One issue is that other chips have this problem too, but apparently there is no simple way of flagging just those chips. We should keep our ears open for reports of crashes on AMD chips and may need to widen the set of chips we flag. As claude wrote above, it's a shame we need to do this, since STIBP may be properly enabled in many cases. But again, the priority needs to be on avoiding crashes.
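
As a rough illustration of the brand-string approach, the sketch below masks out an AVX512 bit when the CPU brand string contains a denylisted model. Everything here is an assumption made for the example: the CpuFeature bits and the isAvx512Denylisted and filterFeatures helpers are hypothetical and are not OCIO's actual CPU-detection API, and the brand string would in practice come from CPUID rather than being passed in by hand.

```cpp
#include <cstdint>
#include <string>

// Hypothetical feature bits, for illustration only.
enum CpuFeature : std::uint32_t
{
    FEATURE_AVX2   = 1u << 0,
    FEATURE_AVX512 = 1u << 1
};

// Returns true when the brand string names a chip for which AVX512 is turned off.
// The list holds only the model reported in this thread and can be extended later.
inline bool isAvx512Denylisted(const std::string & brand)
{
    static const char * const kDenylist[] = { "EPYC 9V45" };
    for (const char * entry : kDenylist)
    {
        if (brand.find(entry) != std::string::npos)
        {
            return true;
        }
    }
    return false;
}

// Clears the AVX512 bit for denylisted chips and leaves all other bits untouched.
inline std::uint32_t filterFeatures(std::uint32_t detected, const std::string & brand)
{
    if (isAvx512Denylisted(brand))
    {
        detected &= ~static_cast<std::uint32_t>(FEATURE_AVX512);
    }
    return detected;
}
```

Matching on a substring of the brand string keeps the rule narrow: a chip that is not listed keeps AVX512, and widening the rule later only means adding another entry to the list.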

Signed-off-by: Doug Walker <doug.walker@autodesk.com>
doug-walker changed the title from "Update ReadTheDocs build OS" to "Adsk Contrib - Update ReadTheDocs build OS" on Jun 28, 2026
Signed-off-by: Doug Walker <doug.walker@autodesk.com>
Signed-off-by: Doug Walker <doug.walker@autodesk.com>
doug-walker changed the title from "Adsk Contrib - Update ReadTheDocs build OS" to "Adsk Contrib - Update ReadTheDocs build OS and DirectX header install" on Jun 28, 2026
cozdas commented Jun 29, 2026

Collaborator

There is a warning on the Read the Docs page about using the ubuntu-lts-latest tag. I think it's fine, but I still wanted to mention it.

Warning
Using ubuntu-lts-latest may break your builds unexpectedly if your project isn’t compatible with the newest Ubuntu LTS version when it’s updated by Read the Docs.

Signed-off-by: Doug Walker <doug.walker@autodesk.com>
Signed-off-by: Doug Walker <doug.walker@autodesk.com>
The aswf/ci-ocio container images intentionally omit OpenColorIO since
the whole point of the image is to build it from source. But the
container's prebuilt OpenImageIO is dynamically linked against a
specific OpenColorIO release that isn't present, causing link
failures for ociolutimage/ocioconvert/ociodisplay whenever
OCIO_USE_OIIO_FOR_APPS=ON. Fetch that matching release from the ASWF
Conan remote and stage it on the linker path before configuring, so
apps link against both the from-source build (used directly) and the
container's expected release (used transitively via OpenImageIO).

Signed-off-by: Doug Walker <doug.walker@autodesk.com>
This reverts commit fa79c58.

Signed-off-by: Doug Walker <doug.walker@autodesk.com>
Previously, requesting OCIO_USE_OIIO_FOR_APPS=ON while OpenImageIO
could not be found silently fell back to OpenEXR (or skipped building
ociolutimage/ocioconvert/ociodisplay entirely), with only a WARNING.
This let CI silently stop exercising the OIIO code path in those apps
without failing, masking the fact that OIIO wasn't actually being
tested. Turn this into a hard configure-time error instead, so a
missing OpenImageIO is caught immediately when it was explicitly
requested.

Signed-off-by: Doug Walker <doug.walker@autodesk.com>
Signed-off-by: Doug Walker <doug.walker@autodesk.com>
Signed-off-by: Doug Walker <doug.walker@autodesk.com>
doug-walker changed the title from "Adsk Contrib - Update ReadTheDocs build OS and DirectX header install" to "Adsk Contrib - Fix various Actions problems" on Jul 2, 2026
doug-walker requested a review from cozdas on July 2, 2026 03:45
Signed-off-by: Doug Walker <doug.walker@autodesk.com>
Signed-off-by: Doug Walker <doug.walker@autodesk.com>
Signed-off-by: Doug Walker <doug.walker@autodesk.com>