HP DCPI tool

Continuous Profiling

(It's 10:43; Do You Know Where Your Cycles Are?)

Jennifer Anderson, Lance Berc, Jeff Dean,
Sanjay Ghemawat, Monika Henzinger, Shun-Tak Leung, Dick Sites,
Mark Vandevoorde, Carl Waldspurger, and William E. Weihl


Digital Systems Research Center
Palo Alto, CA 94301 USA

Processors are getting faster (600 MHz and climbing) and issue widths are increasing (4- and 8-way becoming common), yet application performance is not keeping pace. On large commercial applications, average CPI (cycles-per-instruction) numbers may be as high as 4 or 5. With 8-way issue, a CPI of 5 means that only 1 issue slot in every 40 is being put to good use!
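
To make the arithmetic concrete, here is a minimal sketch in C, using the illustrative numbers from the paragraph above:

    #include <stdio.h>

    /* Issue-slot utilization: with CPI = 5 on an 8-way machine, each
     * retired instruction occupies 5 cycles x 8 slots = 40 issue
     * slots, so only 1 slot in 40 does useful work. */
    int main(void)
    {
        double cpi = 5.0;       /* cycles per instruction */
        double width = 8.0;     /* issue slots per cycle */
        double slots = cpi * width;
        printf("1 useful slot in %.0f (%.1f%% utilization)\n",
               slots, 100.0 / slots);
        return 0;
    }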

It is common to blame such problems on memory, and in fact many applications spend many cycles waiting for memory. But other problems -- e.g., branch mispredicts -- also waste cycles, and whatever the general causes, improving the performance of a particular application requires knowing which instructions are stalling and why.

The Digital Continuous Profiling Infrastructure provides an efficient and accurate way of answering such questions. It uses the Alpha hardware performance counters to obtain high-frequency samples of various events (cycles, imisses, branch mispredicts, etc.). Samples are then processed by a suite of analysis tools that accurately characterize where the time is being spent in a complex workload, from the fraction of cycles spent in each executable image to the CPI for each instruction and the reasons for any static or dynamic stalls.

Both the data collection subsystem and the analysis tools have interesting novel features. The data collection subsystem uses the hardware performance counters to sample program counters periodically, recording the samples in on-disk profile files. The system is designed to run continuously in the background on production systems; for this to be practical, the overhead must be very low. The system as currently implemented imposes an average overhead ranging from 1 to 3% depending on the workload, yet sustains a high sampling rate (about one sample every 62K cycles on average when monitoring cycles, or about 5200 samples per second on a 333-MHz processor). This permits continuous operation and improves the quality of the profiles by minimizing the perturbation of the system induced by profiling.
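
As a rough sketch of why the per-sample cost can be kept so low, consider the kind of per-processor aggregation an overflow interrupt handler might perform. The following C fragment is purely illustrative (the hash-table size, field layout, and function name are assumptions, not DCPI's actual driver code); the point is that the common case costs only a few memory references per sample, with the aggregated counts drained periodically into the on-disk profile files.

    #include <stdint.h>

    #define HASH_SLOTS 4096    /* illustrative size */

    struct sample_entry {
        uint64_t pc;      /* sampled program counter */
        uint32_t pid;     /* process that was executing */
        uint32_t event;   /* which performance counter overflowed */
        uint64_t count;   /* aggregated sample count */
    };

    static struct sample_entry table[HASH_SLOTS];

    /* Called from the counter-overflow interrupt handler. */
    void record_sample(uint64_t pc, uint32_t pid, uint32_t event)
    {
        unsigned h = (unsigned)((pc >> 2) ^ pid ^ event) % HASH_SLOTS;
        struct sample_entry *e = &table[h];
        if (e->count == 0 ||
            (e->pc == pc && e->pid == pid && e->event == event)) {
            /* Common case: empty slot or repeat sample -- just count. */
            e->pc = pc; e->pid = pid; e->event = event;
            e->count++;
        } else {
            /* Collision: a real driver would evict the old entry to an
             * overflow buffer for later flushing to disk; this sketch
             * simply replaces it. */
            e->pc = pc; e->pid = pid; e->event = event;
            e->count = 1;
        }
    }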

The data collection system is transparent: it works with unmodified executables, with no need to recompile, relink, or make any other changes. It is also comprehensive: it collects profiles for all code that runs on the system, including applications, shared libraries, and the kernel. (PALcode on the Alpha is uninterruptible; events that occur in PALcode are still counted, but the samples show up elsewhere, adding a small amount of noise to the sample data.)

Identifying and classifying processor stalls at the level of individual instructions is also a major challenge. The Alpha performance counters, like those in many other modern processors, can count a variety of events. However, the interrupt signalling a performance-counter overflow is delivered several cycles after the event that caused it (6 cycles on the 21164), so the sampled program counter points at an instruction some distance past the one responsible for the event. This makes the samples for most events less useful for the kind of fine-grained analysis we want to produce.
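
A toy example of the effect, assuming (purely for illustration) a single-issue machine that retires one 4-byte instruction per cycle from straight-line code:

    #include <stdio.h>

    #define SKID 6   /* interrupt delivery delay in cycles (21164) */

    int main(void)
    {
        unsigned long long base = 0x120001000ULL; /* hypothetical code address */
        int culprit = 3;                  /* instruction causing the event */
        /* The PC captured when the interrupt finally arrives: */
        unsigned long long sampled = base + 4ULL * (culprit + SKID);
        printf("event at %#llx, sample attributed to %#llx\n",
               base + 4ULL * culprit, sampled);
        return 0;
    }

On a real multi-issue machine the displacement also varies with stalls and issue width, so the samples cannot simply be shifted back by a constant; that variability is what makes fine-grained attribution hard.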

Fortunately, the counter-overflow interrupts for a few events (e.g., instruction-cache misses) do land on the relevant instruction. In particular, counting cycles yields sample counts that give a reasonable statistical picture of the total time each instruction spent waiting to issue: the sample count for an individual instruction when monitoring cycles is proportional to the total time that instruction spent at the head of the issue queue. These time-biased samples alone are useful in pinpointing which instructions in a workload consume the most time, but they do not directly tell why. A suite of analysis tools uses a detailed machine model and a set of heuristics to convert time-biased samples into the average CPI for each individual instruction, the number of times the instruction was executed, and explanations for any static or dynamic stalls.
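
To convey the flavor of that conversion, here is a much-simplified sketch in C. It assumes that every instruction in a basic block executes the same number of times, that cycle samples arrive on average once every 62K cycles, and that a machine model supplies each instruction's stall-free issue time; the data structure and heuristic here are illustrative simplifications, not the tools' actual algorithm.

    #include <stdio.h>
    #include <stdint.h>

    #define PERIOD 62000.0   /* average cycles between samples (from text) */

    struct insn {
        uint64_t samples;    /* cycle samples landing on this instruction */
        double   min_cycles; /* model's stall-free issue time for it */
    };

    /* For one basic block: estimate execution frequency from the
     * instructions the model says did not stall (smallest
     * samples-per-modelled-cycle ratio), then derive per-instruction
     * CPI and attribute any excess over the modelled minimum to stalls. */
    void analyze_block(const struct insn *block, int n)
    {
        double freq = -1.0;
        for (int i = 0; i < n; i++) {
            double f = block[i].samples * PERIOD / block[i].min_cycles;
            if (freq < 0.0 || f < freq)
                freq = f;
        }
        for (int i = 0; i < n; i++) {
            double cpi = block[i].samples * PERIOD / freq;
            printf("insn %d: cpi %.2f, stall %.2f cycles\n",
                   i, cpi, cpi - block[i].min_cycles);
        }
    }

    int main(void)
    {
        /* Hypothetical three-instruction block; the middle instruction
         * collects four times the samples of its neighbors, i.e. it
         * stalls for roughly three extra cycles per execution. */
        struct insn block[] = { {100, 1.0}, {420, 1.0}, {105, 1.0} };
        analyze_block(block, 3);
        return 0;
    }

Instructions that the model says should issue without stalling anchor the frequency estimate; once the frequency is known, each sample count converts directly into an average CPI, and the excess over the modelled minimum is what remains to be explained as a stall.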

Other tools -- e.g., Intel's VTune and SGI's SpeedShop -- use performance counters to sample the occurrences of various events. However, they suffer from the same problem as the performance counters on the Alpha: samples for most events land on nearby instructions, not the ones that caused the events. As a result, they cannot give an accurate picture of the CPI for each instruction, the number of times each instruction was executed, or the reasons for stalls. Such information is available from simulators, but simulators have serious limitations for analyzing the performance of real systems, not least of which is their massive overhead.

Our profiling system has been running on Digital Alpha processors under Tru64 UNIX since September 1996, and was publicly released in December 1996. A port has been done for Alpha/NT and is in progress for OpenVMS. The system has already been used to analyze and improve the performance of a wide range of complex commercial applications, including graphics systems, databases, industry benchmark suites, and compilers. For example, our tools pinpointed a performance problem in a commercial database system; fixing the problem reduced the response time of an SQL query from 180 to 14 hours. In another example, our tools' fine-grained instruction-level analyses identified opportunities to improve optimized code produced by Digital's compiler, speeding up the mgrid SPECfp benchmark by 15%.

Our tools can be used directly by programmers; they are also intended to be used to drive profile-based optimizations in compilers, linkers, post-linkers, and run-time optimization tools. Work is underway to feed the output of our tools into Digital's optimizing backend and into the OM post-linker optimization framework. In addition, we are beginning to explore new optimizations that leverage the detailed instruction-level information provided by our tools.