HANG in linux.sigmask test every few hundred runs
PR #5482 hit a hang in the linux.sigmask test once. Xref #2112 static_signal hang.
https://github.com/DynamoRIO/dynamorio/runs/6299579362?check_suite_focus=true
154/426 Test #89: code_api|linux.sigmask .......................................***Failed Required regular expression not found. Regex=[^sending 10
in handler for signal 10
sending 28 with value
in handler for signal 28 from -1 value 0xdeadbeef
in handler for signal 12
init thread now inside handler: setting up itimer
done with itimer; exiting
all done
$
] 90.00 sec
sending 10
in handler for signal 10
sending 28 with value
I managed to reproduce it about once every few hundred runs in a debug build in an Ubuntu20 VM. In gdb it's hard to tell what happened: the aux thread is at post-do_syscall with rax==-514, which is -ERESTARTNOHAND. That may just be an artifact of gdb's ptrace attach. gdb is not doing well with symbols; I did not figure out much else, but I also didn't spend a lot of time on it.
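For reference when reading raw register values like the one above, here is a minimal sketch that maps a syscall return register to the kernel-internal restart codes. The constants are copied from the Linux kernel's include/linux/errno.h (userspace <errno.h> does not export them); the helper name restart_code_name is hypothetical and is not part of DR or the test.

```c
#include <stddef.h>

/* Kernel-internal restart codes, copied from the Linux kernel's
 * include/linux/errno.h. Userspace <errno.h> does not export them,
 * so they are redefined here for illustration only. */
#define ERESTARTSYS           512
#define ERESTARTNOINTR        513
#define ERESTARTNOHAND        514
#define ERESTART_RESTARTBLOCK 516

/* Hypothetical helper: returns the symbolic name if rax holds a negated
 * kernel restart code, or NULL for any other value. */
const char *
restart_code_name(long rax)
{
    switch (rax) {
    case -ERESTARTSYS: return "ERESTARTSYS";
    case -ERESTARTNOINTR: return "ERESTARTNOINTR";
    case -ERESTARTNOHAND: return "ERESTARTNOHAND";
    case -ERESTART_RESTARTBLOCK: return "ERESTART_RESTARTBLOCK";
    default: return NULL;
    }
}
```

Calling restart_code_name(-514) yields "ERESTARTNOHAND". The kernel normally turns that code into either a syscall restart or -EINTR before returning to user mode, so seeing it raw is consistent with, though not proof of, a ptrace stop in the middle of an interrupted blocking syscall.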
Can't repro in a debug build on glaptop. Reproduced it on glaptop in a release build after ~300 runs. It also reproduces at HEAD, so it is not limited to that PR:
00:41|bruening@bruening:~/dr/git/build_x64_rel_tests
$ ninja && ctest --repeat-until-fail 5000 -R sigmask\$
<...>
Test #86: code_api|linux.sigmask ........... Passed 0.03 sec
Start 86: code_api|linux.sigmask
Test #86: code_api|linux.sigmask ........... Passed 0.03 sec
Start 86: code_api|linux.sigmask
Test #86: code_api|linux.sigmask ...........***Failed Required regular expression not found. Regex=[^sending 10
in handler for signal 10
sending 28 with value
in handler for signal 28 from -1 value 0xdeadbeef
in handler for signal 12
all done
$
] 90.00 sec
0% tests passed, 1 tests failed out of 1
Total Test time (real) = 134.47 sec
The following tests FAILED:
86 - code_api|linux.sigmask (Failed)
Errors while running CTest
Output from these tests are in: /home/bruening/dr/git/build_x64_rel_tests/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
It could be a bug in the test, as it grabs locks in its signal handler -- or a bug in DR.
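To illustrate the test-side hypothesis: a lock taken in a signal handler can deadlock if the signal arrives while the interrupted thread already holds that lock, and pthread mutex calls are not on the POSIX async-signal-safe list. The following is a generic sketch of the usual lock-free alternative, not the sigmask test's actual code; the names got_signal, flag_handler, install_flag_handler, and consume_signal_flag are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Written only by the handler; sig_atomic_t makes the store safe with
 * respect to interruption by a signal. */
static volatile sig_atomic_t got_signal = 0;

/* Handler that takes no locks: it only records that the signal arrived. */
static void
flag_handler(int sig)
{
    (void)sig;
    got_signal = 1;
}

/* Installs flag_handler for sig. Returns 0 on success, -1 on failure
 * (same convention as sigaction). */
int
install_flag_handler(int sig)
{
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_handler = flag_handler;
    sigemptyset(&act.sa_mask);
    return sigaction(sig, &act, NULL);
}

/* Called from normal (non-handler) code, where taking locks is safe.
 * Returns 1 if a signal was recorded since the last call, else 0. */
int
consume_signal_flag(void)
{
    int seen = got_signal;
    got_signal = 0;
    return seen;
}
```

With this shape, any work that needs a lock runs in the main flow after consume_signal_flag() returns 1, so the handler can never block on a lock held by the code it interrupted. Whether the hang here actually comes from the test's locking or from DR is still open.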