Minix is designed for fault tolerance; however, its fault tolerance is difficult to debug, test and evaluate. Previously, faults were injected only into drivers to demonstrate that Minix (unlike Linux, Windows, etc.) can survive crashing drivers. We are currently extending fault tolerance to other parts of the system. For instance, we are able to survive crashes in parts of the networking stack. How transparent recovery from a crash is to the user and to applications depends on which part crashes: a crash in the packet filter is almost unnoticeable, whereas a crash in TCP breaks all established connections.
To evaluate the robustness of the system, we need to inject thousands to millions of faults, which cannot be done manually. The fault injection process is also getting more complex. The fault injection tools need to inspect running binaries, change them, and report whether the faults were executed and whether they led to a crash. Since a fault may lead not only to a crash but also to a hang in an infinite loop, we detect such behaviour by means of periodic heartbeats. However, the monitor cannot distinguish a component that is stuck in such a loop from a component that is stopped for fault injection (which takes a while).
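The ambiguity between a hung component and one stopped for fault injection can be illustrated with a small model. The sketch below is not Minix code and does not describe how Minix does or will solve the problem. It assumes, purely for illustration, that the injector can notify the monitor before stopping a component and again when it resumes it. All names (`HeartbeatMonitor`, `pause_for_injection`, `resume_after_injection`) are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy heartbeat monitor; every name here is hypothetical, not a Minix API."""
    timeout: float                                   # max seconds allowed between heartbeats
    last_beat: dict = field(default_factory=dict)    # component name -> time of last heartbeat
    paused: set = field(default_factory=set)         # components stopped by the injector

    def heartbeat(self, name: str, now: float) -> None:
        # Record that the component reported in at time `now`.
        self.last_beat[name] = now

    def pause_for_injection(self, name: str) -> None:
        # Assumption: the injector announces the stop before it happens.
        self.paused.add(name)

    def resume_after_injection(self, name: str, now: float) -> None:
        # Restart the timeout window so the time spent stopped is not counted.
        self.paused.discard(name)
        self.last_beat[name] = now

    def status(self, name: str, now: float) -> str:
        if name in self.paused:
            return "paused"                          # silent, but on purpose
        if now - self.last_beat[name] > self.timeout:
            return "hung"                            # silent for too long
        return "alive"


monitor = HeartbeatMonitor(timeout=2.0)
monitor.heartbeat("tcp", now=0.0)
monitor.pause_for_injection("tcp")
print(monitor.status("tcp", now=10.0))   # paused
monitor.resume_after_injection("tcp", now=10.0)
print(monitor.status("tcp", now=11.0))   # alive
print(monitor.status("tcp", now=13.0))   # hung
```

Without the `paused` set, the first query at time 10.0 would report the component as hung even though it was only stopped for injection, which is the misclassification described above.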
The goal of this project is to integrate fault injection into the core system and to evaluate it. Since Minix is a multiserver system whose functionality is distributed across various servers, there are many implementation challenges.