Computer Science Department
School of Computer Science, Carnegie Mellon University
CPU Performance Counter-Based
Keith A. Bare
Faults that occur in distributed software systems, such as e-commerce applications, are often costly in terms of lost revenue, but can be difficult to discover manually. Problem diagnosis tools attempt to detect, and often localize, such faults. The goal is to detect problems soon after they occur, and then either notify an operator or automatically fix the issue as quickly as possible.
Designing such tools involves trade-offs. A good tool should be accurate and should impose low overhead, to minimize adverse effects on the monitored system, but these two goals often conflict. Application-level data can lead to very accurate, fine-grained diagnoses, but at a high cost in terms of reduced system performance. Metrics collected from the operating system are less expensive to collect, but are usually suitable only for coarse fault localization, typically to a specific machine.
This thesis explores a data source that has seen only limited use in problem diagnosis tools: CPU performance counters. Instrumentation based on these performance counters can be collected with very low overhead, and provides information with expressive power similar to that of data collected from the operating system. This data source is evaluated experimentally, in conjunction with a variety of simple analysis algorithms, via synthetic fault-injection experiments against a realistic 3-tier auction web application.
Experimental results indicate that CPU performance counter-based approaches are able to consistently detect, and in some instances localize, faults, even when only simple analyses are performed. Given the low cost of collecting data from the counters, and the fact that this data need not be tied to a specific application or operating system, this work demonstrates the viability of such approaches to problem diagnosis.