Computer Science Department
School of Computer Science, Carnegie Mellon University
Improving Device Driver Reliability through
Olatunji O. Ruwase
Device drivers are Operating Systems (OS) extensions that enable the use of I/O devices in computing systems. However, studies have identified drivers as an Achilles' heel of system reliability, their high fault rate accounting for a significant portion of system failures. Consequently, significant effort has been directed towards improving system robustness by protecting system components (e.g., OS kernel, I/O devices, etc.) from the harmful effects of driver faults. In contrast to prior techniques which focused on preventing unsafe driver interactions (e.g., with the OS kernel), my thesis is that checking a driver’s execution for correctness violations results in the detection and mitigation of more faults.
To validate this thesis, I present Guardrail, a flexible and powerful framework that enables instruction-grained dynamic analysis (e.g., data race detection) of unmodified kernel-mode driver binaries to safeguard I/O operations and devices from driver faults. Guardrail decouples the analysis tool from driver execution to improve performance, and runs it in user-space to simplify the deployment of new tools. Moreover, Guardrail leverages virtualization to be transparent to both the driver and device, and enable support for arbitrary driver/device combinations.
To demonstrate Guardrail's generality, I implemented three novel dynamic checking tools within the framework for detecting memory faults, data races and DMA faults in drivers. These tools found 25 serious bugs, including previously unknown bugs, in Linux storage and network drivers. Some of the bugs existed in several Linux (and driver) releases, suggesting their elusiveness to existing approaches. Guardrail easily detected these bugs using common driver workloads.
Finally, I present an evaluation of using Guardrail to protect network and storage I/O operations from memory faults, data races and DMA faults in drivers. The results show that with hardware-assisted logging for decoupling the heavyweight analyses from driver execution, standard I/O workloads generally experienced negligible slowdown on their end-to-end performance.
In conclusion, Guardrail's high fidelity fault detection and efficient monitoring performance makes it a promising approach for improving the resilience of computing systems to the wide variety of driver faults.