I was banging my head against the wall trying to figure out why my supposedly “robust” mutexes were not, in fact, acting very robust. Since it took me so long to figure it out I thought I would write down what happened in case it saves someone else some time. Also I learned about a case where mutex-robustness won’t save you (which hopefully won’t matter much in practice).
I’ve been experimenting with communicating across processes with shared-memory on Linux, and trying to protect access to the shared-memory with a pthread_mutex.
A cool thing you can do is flag the mutex as “robust” - which means that if the process or thread holding the mutex doesn’t unlock the mutex (because the process crashed or was terminated) the next acquirer of the mutex will be told about the situation (instead of hanging forever on the lock) and be given a chance to clean up whatever was left inconsistent.
However this wasn’t working at all for me in my application - and what was more maddening was every “toy” program I wrote to try out the “robust” mutex APIs worked perfectly.
My application consisted of a mutex plus some data structures, all living in a segment of shared memory. The basic steps of every operation in my application were:
mmap the shared memory
My robustness test skipped steps 4 and 5 - meaning that the mutex should have been left orphaned, and the next process to come along should have been told what happened.
What happened instead was that the next process to come along would hang on step (3) forever - waiting for the dead process to exit the mutex.
It turns out that mutex-robustness is implemented in Linux by tracking a per-thread list of the mutexes currently held by that thread. Then when a thread (or process) dies, the kernel goes through the list and marks each mutex as dead.
This linked list is kept in user memory, in thread-local storage, which means that if your program unmaps the memory the mutex lives in before (erroneously) exiting (i.e. step 6 in my test above!), the entry in this linked list is bogus and your mutex never gets marked dead [1].
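The failure can be reproduced in a few lines. The sketch below is a reconstruction under my own assumptions, not my original test: the function name `lock_after_owner_exit` is hypothetical, it uses an anonymous shared mapping and `fork` instead of a named shared-memory segment, and it uses `pthread_mutex_timedlock` with a one-second deadline so the parent gives up instead of hanging forever.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: a child locks a robust mutex in shared memory and
 * exits without unlocking, optionally unmapping the memory first. Returns
 * the result of the parent's timed lock attempt, or -1 on setup failure. */
int lock_after_owner_exit(int unmap_before_exit) {
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED) return -1;

    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc == 0) {
        rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        if (rc == 0) rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
        if (rc == 0) rc = pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    if (rc != 0) { munmap(m, sizeof *m); return -1; }

    pid_t pid = fork();
    if (pid < 0) { munmap(m, sizeof *m); return -1; }
    if (pid == 0) {
        /* Child: take the lock and die while still holding it. */
        pthread_mutex_lock(m);
        if (unmap_before_exit) munmap(m, sizeof *m); /* the problematic step */
        _exit(0);
    }
    waitpid(pid, NULL, 0); /* the owner is definitely dead after this */

    /* Parent: try to take the lock, but give up after one second. */
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;
    rc = pthread_mutex_timedlock(m, &deadline);

    if (rc == EOWNERDEAD) pthread_mutex_consistent(m);
    if (rc == 0 || rc == EOWNERDEAD) pthread_mutex_unlock(m);
    munmap(m, sizeof *m);
    return rc;
}
```

On the Linux/glibc systems I would expect this to run on, `lock_after_owner_exit(0)` should return `EOWNERDEAD` (the robust mutex doing its job), while `lock_after_owner_exit(1)` should return `ETIMEDOUT`, because the kernel could not reach the unmapped mutex when the child exited.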
So! This won’t matter much in practice; however, it does (slightly) limit the effectiveness of mutex-robustness in guarding against programmer errors, where your own code forgot to call unlock.
[1] It’s slightly more complex than this - a mutex in the Linux implementation