gui/rename: fix script error that sometimes caused the script to malfunction when started from the Launcher.#1595
gui/rename: fix script error that sometimes caused the script to malfunction when started from the Launcher.#1595SilasD wants to merge 2 commits into
Conversation
resolved conflict in changelog.txt
|
good job on the fix, i was looking at this into why this was crashing, and i thought id share what i found because its quite a fascinating c++ edge case it turns out this isnt actually a lua race condition or an interpreter conflict. in Lua 5.3 (which DFHack uses), variables defined in a the real culprit is c++ implementation defined behavior for enums. if an when when the GUI rendered the list, it executed the closure and accessed your fix to change the logic to just wanted to share the root cause for future reference, thanks for fixing this |
|
that, hmmm. |
This error apparently was an interpreter context conflict. The overlay manager runs scripts in core context but the Launcher does not, so if the script was started from the Launcher, there is a race condition with a variable.
I was able to trigger this bug in four or five different sessions, so it's very much worth fixing.
The script abort was caused by attempting to index into a vector[-1].
I fixed this by adding a limit check before the vector access, around line 650 in the new version.
I preemptively added a limit check to one other suspicious vector access, around line 550 in the new version.
I also did very minor code cleanup, mostly changing some integer constants into symbolic constants.
I added LuaLS annotations added to many functions so I could figure out what the code is doing.
The LuaLS DFHack definitions are from a manually-augmented copy of vallode's definitions, which I intend to properly fork and make available Real Soon Now.
The error presented as the
gui/renamewindow not working or locking up, and the overlay manager printing stack traces over and over.One copy of a stack trace:
the faulting code:
Analysis: The code faulted in the anonymous callback function, so we don't have an accurate line number, only the line number of the start of the function definition.
The fault was indexing an int32_t vector by -1. Thus we can rule out:
language.words[word], which is a vector ofdf.language_word..forms[...], which is a container of typedf.language_words_flags.self.target.name.parts_of_speech[val], which is a container of typedf.part_of_speech.The remaining possibility is
self.target.name.words[val].self.target.name.wordsis of typedf.language_name.int32_t[7], and is indexed by enum typedf.language_name_component.valmust have been set todf.language_name_component.NONE, which is-1.It is not clear how this happened. The code attempts to deal with this case with this line of code:
However, the faulting code is in a callback. There is apparently a codepath where the callback can be entered while
valis-1.This makes some sense, because the anonymous function won't be recompiled every time the outer function runs, and it also binds to
valas an upvalue.However, I don't see how the callback could run while the outer function is executing the
forloop. They should be in the same interpreter context, the core context.At any rate, I added a limit check:
(Later:) Oh. Scripts started from the Launcher do not run in core context:
So there very much can be race conditions and the like. This may need further study.