Computer Science Department
School of Computer Science, Carnegie Mellon University
The Blind Men and the Elephant:
Google's MapReduce framework enables distributed, data-intensive, parallel applications by decomposing a massive job into smaller (Map and Reduce) tasks and a massive data set into smaller partitions, such that each task processes a different partition in parallel. However, performance problems in a distributed MapReduce system can be hard to diagnose and to localize to a specific node or set of nodes. On the other hand, the structure of a large number of nodes performing similar tasks naturally affords us opportunities for observing the system from multiple viewpoints.
We present a "Blind Men and the Elephant" (BliMeE) framework in which we exploit this structure, and demonstrate how problems in a MapReduce system can be diagnosed by corroborating the multiple viewpoints. More specifically, we present algorithms within the BliMeE framework based on OS-level performance counters, on white-box metrics extracted from logs, and on application-level heartbeats. We show that our BliMeE algorithms are able to capture a variety of faults, including resource hogs and application hangs, and to localize each fault to a subset of slave nodes in the MapReduce system.
In addition, we discuss how the diagnostic algorithms' outcomes can be further synthesized in a repeated application of the BliMeE approach. We present a simple supervised learning technique which allows us to identify a fault if it has been previously observed.
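The idea of corroborating viewpoints across nodes that perform similar tasks can be illustrated with a minimal sketch. It is not the BliMeE algorithms themselves: the function name `localize_outliers`, the median-absolute-deviation rule, the threshold of 3.0, and the node names and CPU figures are all assumptions chosen for illustration. The sketch compares one metric across peer nodes and reports the nodes that disagree with the majority.

```python
from statistics import median


def localize_outliers(metric_by_node, threshold=3.0):
    """Return the nodes whose metric deviates strongly from their peers.

    Hypothetical helper: peers are assumed to run similar tasks, so the
    median across nodes serves as the reference "healthy" value.
    """
    values = list(metric_by_node.values())
    med = median(values)
    # Median absolute deviation: a robust estimate of the normal spread.
    mad = median([abs(v - med) for v in values])
    # Guard against a zero spread when most peers report identical values.
    scale = mad if mad > 0 else 1e-9
    return sorted(
        node
        for node, value in metric_by_node.items()
        if abs(value - med) / scale > threshold
    )


# Made-up CPU utilization (percent) for five slave nodes.
cpu = {"slave1": 22.0, "slave2": 25.0, "slave3": 24.0, "slave4": 91.0, "slave5": 23.0}
suspects = localize_outliers(cpu)
print(suspects)
```

A robust statistic such as the median is used here so that a single misbehaving node does not shift the reference value that the other nodes are compared against.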