dcpicalc - Analyze performance on Alpha 21064/EV4 and 21164/EV5
dcpicalc [<options>] -procedures procedure-name-list -- image-file
dcpicalc [<options>] procedure-name image-file
Dcpicalc generates the control flow graph of the specified procedure(s)
in the specified image file. Using profiles collected by dcpid(1) and
stored in the specified profile files, dcpicalc augments the graph
with estimated execution counts of basic blocks, cycles-per-instruction for
instructions, possible explanations for stalls, and other useful information.
The resulting flow graph is printed to standard output. The output can be
converted to PostScript by dcpi2ps(1).
The first command syntax allows you to specify multiple procedures. dcpicalc
concatenates the outputs for the individual procedures, starting each with
a line of the form
; PROC procedure-name
Analyzing multiple procedures at a time is typically much more efficient than
invoking the command once per procedure, although dcpicalc reports
exactly the same information in both cases. The -procedures option can
be mixed with the other options. The list of procedures is terminated by "--" or
another option. The second command syntax can name only one procedure.
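Because each procedure's output begins with a "; PROC" line, the concatenated output can be split back into per-procedure sections. The following is a minimal Python sketch, not part of DCPI: the function name split_by_procedure is hypothetical, and it assumes the marker starts in the first column exactly as shown above.

```python
from typing import Dict, List

PROC_MARKER = "; PROC "  # assumed to start in column one, per the text above


def split_by_procedure(output: str) -> Dict[str, List[str]]:
    """Group dcpicalc output lines under the procedure named by each marker."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in output.splitlines():
        if line.startswith(PROC_MARKER):
            # A new procedure section begins; remember its name.
            current = line[len(PROC_MARKER):].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections
```

Lines that precede the first marker are dropped, which suits the multi-procedure syntax, where every section starts with a marker.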
- Print information about options.
- Output the machine code, in hex, for each instruction.
- -cutoff n
- Omit basic blocks taking less than n% of the time spent
in the procedure. The instructions of these basic blocks are not
printed. When the output is piped through dcpi2ps(1),
these basic blocks appear as tiny boxes with only block names. Note
that n is a floating point number between 0 and 100 (inclusive).
The default value is 0: no blocks are omitted.
- -procedures procedure-name-list
- Analyze the specified procedures. The list is terminated by "--" or
another option.
- Print program version information.
PROFILE SELECTION FLAGS
By default, this command automatically finds all of the relevant profile
files. The following options can be used to guide the search for profile files:
- -db <directory name>
- Search for profile files in the specified profile database directory.
The directory name should be the same name as the one specified when dcpid was
started. If this option is not specified, the directory name is obtained
from the DCPIDB environment variable. If neither this option nor
the DCPIDB environment variable is set, the name of the directory
used by the last invocation of dcpid on this machine is used.
If none of these methods succeeds in finding the appropriate directory,
and no explicit set of profile files is provided via the -profiles option,
then the command fails.
- -epoch latest
- Search for profile files in the latest epoch. This is the default.
- -epoch latest-k
- Search for profile files in the "k+1"th most recent epoch. For example,
search in the third most recent epoch if -epoch latest-2 is specified.
- -epoch <name>
- Search for profile files in the named epoch. The epoch name should
be the name of a subdirectory corresponding to a single epoch within
the profile database directory. Epoch subdirectory names usually take
the form YYYYMMDDHHMM (year-month-day-hours-minutes). For example,
an epoch started on June 11, 2002 at 22:33 would be named 200206112233.
If an epoch is given a symbolic name by creating a symbolic link to the
actual epoch directory, then the symbolic name can also be used as an
argument to the -epoch option.
- -epoch all
- Search for profile files in all epochs.
- -ihost <hostnames...> --
- Include just those profile files associated with the specified
host names. The list of host names must be terminated either
via -- or by the end of the option list. The command prints
an error message and fails if both the -ihost and -ehost
options are specified.
- -ehost <hostnames...> --
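- Exclude the profile files associated with the specified
host names. The list of host names must be terminated either
via -- or by the end of the option list.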
- -label <label>
- -profiles <file
the -profiles option
the -profile option;
STATISTIC SELECTION FLAGS
Different kinds of performance counter statistics are available on various
models of Alpha CPUs. Alpha 21064/EV4, 21164/EV5 and 21264/EV6 CPUs have
traditional aggregate event counters. Alpha 21264A/EV67 and later processors
have a mix of some traditional aggregate event counters and newer ProfileMe
counters which allow accurate and precise instruction execution profiles
on out-of-order processors. (See dcpiprofileme(1) for
more information on ProfileMe statistics.)
The default statistic selection on an aggregate counter machine is to select
all the aggregate events. The default on a ProfileMe machine is to select
ProfileMe retire delay, retire count, !retired (i.e. aborted) count, !notrap
(i.e. trap) count, and aggregate cycles.
The options below can be used to select various statistics when available.
Use -event for aggregate statistics and -pm for ProfileMe
statistics. Note: there can be multiple, mixed -event and -pm specifications.
You can also specify the ratio of two statistics (written as stat1::stat2).
- -pm pm_stat(+pm_stat)
- Select the specified ProfileMe statistic plus any added in by optional +pm_stat specifications.
For example, select various trap statistics by specifying the option -pm
- -pm default(+pm_stat)
- Select the default set of ProfileMe statistics plus those added in by +pm_stat specifications.
At least one additional statistic is mandatory; -pm default without
modifications is extraneous and not allowed. The additional ProfileMe statistics
will take the place of the aggregate cycles statistic which is selected
- -pm all(-pm_stat)
- Select all ProfileMe statistics less those subtracted out. You can
repeat the optional -pm_stat specification to deselect multiple
ProfileMe statistics. Note: there are a lot of ProfileMe statistics.
Unless you deselect a bunch of them, this will select more statistics
than are appropriate for human consumption.
- -event ag_stat(+ag_stat)
- Select the specified aggregate statistic plus any added in by optional +ag_stat specifications.
For example, select cycles, icache misses, and data cache misses when
the option -event cycles+imiss+dmiss is specified.
- -event all(-ag_stat)
- Select all aggregate statistics less those subtracted out. You can
repeat the optional -ag_stat specification to deselect multiple
aggregate statistics.
- Select profile events corresponding to all event types, both aggregate
and ProfileMe. However, if there are ProfileMe events, this will produce
a large number of statistics, which in most cases will not be useful.
EXECUTION COUNT AND STALL ANALYSIS FLAGS
The following options can be used to control the heuristics for estimating
execution counts and identifying the causes of stalls.
- Generate low, medium, and high confidence data.
- Generate medium and high confidence data. (default)
- Generate only high confidence data.
- -cross_procedure [optimistic | pessimistic | selective]
- Choose what assumption to make when a procedure call boundary
is encountered while looking for reasons to explain dynamic
stalls. A procedure call boundary is either a call made by
the procedure being analyzed or the beginning or end of that
procedure. With pessimistic, assume that whatever
happens outside the analyzed procedure can cause a dynamic
stall inside it. With optimistic, assume that it cannot.
With selective, the assumption is based on standard
procedure call convention. (The default is optimistic.)
- Use a (non-linear time) constraint solver to exploit global
flow constraints when estimating execution counts. The estimates
may still violate flow constraints.
- -tab foo.tab
- Get execution counts from output of dcpix(1) instead
of making estimates, which may be inaccurate.
Requires a .xct file.
- -xct foo.xct
- Get execution counts from output of dcpix(1) instead
of making estimates, which may be inaccurate.
Requires a .tab file.
under dcpix(1) but
under dcpid(1) to
Dcpicalc provides information at the instruction, basic block, and procedure
level. Dcpicalc is sometimes unable to estimate the cycle-to-sample ratio
for a block. Such blocks are excluded from all summary information except
the instruction count. Dcpicalc makes no attempt to identify stalls (static
or dynamic) in such blocks. Therefore, most of the following discussion pertains
only to blocks with known cycle-to-sample ratios.
Apart from the instruction, block, and procedure level information described
below, the output also contains lines that encode the procedure's control
flow graph for use by other analysis tools (notably dcpi2ps(1),
which prints the graph in PostScript). These lines start with "B" and "A" in
the first column. They are not intended for users.
Instruction Level Information
At the instruction level, dcpicalc inserts "bubbles" into the
instruction listings to identify points where the processor stalls because
it is unable to issue an instruction. Bubbles are inserted before the
stalled instruction. Here is an example.
588584 318:2e4c0000 ldq_u a2, 0(s3) 1558 1
588588 318:a79d2d70 ldq at, 11632(gp) 191855 0 1.5cy
58858c 318:4a4c00d2 extbl a2, s3, a2 164109 2 1.5cy 8584
588590 318:43920412 addq at, a2, a2 428395 1 4.0cy 8588
588594 318:2c320000 ldq_u t0, 0(a2) 227783 1 2.0cy 8590
588598 318:22520001 lda a2, 1(a2) 121068 1 1.0cy
58859c 318:48320f41 extqh t0, a2, t0 336123 1 3.0cy 8598 8594
5885a0 318:48271781 sra t0, 0x38, t0 123408 1 1.0cy
5885a4 318:41810402 addq s3, t0, t1 127442 1 1.0cy 85a0
5885a8 318:2c620000 ldq_u t2, 0(t1) 123021 1 1.0cy
5885ac 318:47ff041f bis zero, zero, zero 0 0 nop
5885b0 318:486200c4 extbl t2, t1, t3 658189 2 6.0cy 85a8
5885b4 318:47ff0403 bis zero, zero, t2 0 0
5885b8 318:48807630 zapnot t3, 0x3, a0 122504 1 1.0cy
5885bc 318:47ff041f bis zero, zero, zero 0 0 nop
5885c0 318:421fd9b1 cmplt a0, 0xfe, a1 155841 1 1.5cy
5885c4 318:e6200002 beq a1, 0x1205885d0 0 0
Each line of assembly code shows, from left to right,
- the instruction's address (hexadecimal),
- the source line number (decimal),
- the instruction's 32-bit machine code in hexadecimal (if -print_opcode is specified),
- the instruction in mnemonics,
- the number of PC samples falling at this instruction address (decimal),
- the minimum number of cycles the instruction is predicted to spend at the
head of the issue queue (actual schedule may vary),
- (optionally) the average number of cycles spent at this instruction,
- (optionally) the other instructions that may have caused this instruction
to stall (see details below).
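The field order above can be illustrated by parsing one line of the sample listing. This is a rough Python sketch under stated assumptions: it handles only the whitespace-collapsed form shown in this page with machine code printed, it ignores the bubble columns that real output carries, and the names INSN_RE and parse_insn_line and the dictionary keys are hypothetical.

```python
import re
from typing import Optional

# Pattern for the flattened sample lines shown above. Real dcpicalc
# output also carries bubble columns, which this sketch does not handle.
INSN_RE = re.compile(
    r"^(?P<addr>[0-9a-f]+)\s+"
    r"(?P<line>\d+):(?P<opcode>[0-9a-f]{8})\s+"
    r"(?P<insn>.+?)\s+"
    r"(?P<samples>\d+)\s+(?P<min_cycles>\d+)"
    r"(?:\s+(?P<avg_cycles>\d+(?:\.\d+)?)cy)?"
    r"(?P<rest>(?:\s+\S+)*)\s*$"
)


def parse_insn_line(text: str) -> Optional[dict]:
    """Return the fields of one instruction line, or None if it does not match."""
    m = INSN_RE.match(text.strip())
    if m is None:
        return None
    rest = m.group("rest").split()
    avg = m.group("avg_cycles")
    return {
        "addr": int(m.group("addr"), 16),
        "source_line": int(m.group("line")),
        "opcode": int(m.group("opcode"), 16),
        "insn": m.group("insn"),
        "samples": int(m.group("samples")),
        "min_cycles": int(m.group("min_cycles")),
        "avg_cycles": float(avg) if avg is not None else None,
        "is_nop": "nop" in rest,
        # Remaining tokens are the addresses of possible culprit instructions.
        "culprits": [tok for tok in rest if tok != "nop"],
    }
```

For the extqh line in the listing, this yields 336123 samples, a minimum of 1 cycle, an average of 3.0 cycles, and the culprit addresses 8598 and 8594.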
Each line in the listing represents a half-cycle, which makes it easy to
see whether instructions are being dual-issued. To avoid excessively long
listings, however, dcpicalc represents a very long stall with a
large but limited number of bubbles. The actual number of stall cycles is
shown as a number along with the bubbles.
Stall cycles are either static or dynamic. Static stall cycles are those
that the processor would suffer even if there were no dynamic stalls (e.g.,
if all memory loads hit in the D-cache and all conditional branches are predicted
correctly). The rest are dynamic. The bubbles for the static and dynamic
stall cycles are shown in different columns.
In the static column (the leftmost column), bubbles have the following meanings:
- s refers to stall cycles resulting from static resource conflicts among
the instructions within the same "window" (consisting of two instructions
for Alpha 21064 and four for 21164) that the processor considers for issue
in any given cycle.
- a/b/c refer to stall cycles caused by register dependencies on previous
instructions involving, respectively, Ra/Rb/Rc of the stalled instruction.
- f refers to stall cycles caused by competition for the function units
and other internal resources in the processor.
In the dynamic column(s), there may be multiple possible explanations for
the same stall cycles; sometimes there may be none. Each explanation is represented
by a column of bubbles. In some cases, dcpicalc can compute the
maximum number of stall cycles that a particular reason can account for.
If this is less than the number of stall cycles, the column for that reason
may not extend all the way down to the stalled instruction.
The bubbles have the meanings below.
- d - D-cache miss
- D - DTB miss
- I - I-cache or ITB miss
- i - I-cache miss (but not ITB miss)
- w - write buffer overflow
- y - synchronization of memory operations (using memory barriers)
- p - branch misprediction
- f - busy function unit
- o - other (currently TRAPB, EXCB, or load-after-store replay trap)
- ? - unexplained
Several points are worth mentioning here. First, notice that there is no
symbol for ITB miss alone because an I-cache miss is possible whenever an
ITB miss is possible. Second, "other" means miscellaneous other reasons that
typically account for only a tiny percentage of stalls. Currently it includes
stalls at TRAPB or EXCB instructions, which are not issued until all previous
instructions are guaranteed to complete without traps or both traps and exceptions,
respectively. Third, the symbol "f" may appear in both the static and dynamic
columns because competition for function units may explain both static and
dynamic stalls. For example, the stall caused by a floating-point division
may be partly static, because part of it can be predicted by scheduling the
instructions, and partly dynamic, because part of it is data dependent. An "f" in
the dynamic column typically means a busy integer multiply or floating-point divide unit.
For each stalled instruction, dcpicalc also lists instructions
that may have caused the stalls. This list appears at the end of the line
showing the stalled instruction. A four-digit hexadecimal address indicates
an instruction in the same basic block as the stalled instruction; a full
block name with a four-digit hexadecimal address indicates an instruction
in another basic block; a full block name without an address indicates that
the instruction potentially causing the stall is assumed to be in
another procedure, which can be a callee or the caller of the current procedure.
Note that the lists of instructions and explanations are not always exhaustive,
in part because longer stalls may hide shorter ones.
If an instruction is a nop, dcpicalc will indicate it by appending "nop" to
the line showing the instruction.
Block Level Information
At the beginning of a block, dcpicalc displays summary information
for the block. For example,
*** One cycle = 714428 samples
*** Executed 4.83 times/invocation
*** Best-case 8/9 = 0.89CPI, Actual 22/9 = 2.44CPI
*** (36% execution without dynamic stalls)
The first line is the cycle-to-sample ratio for the block; this is dcpicalc's
estimate of how many PC samples in the profiling data correspond to one cycle.
The next line is the average number of times the block is executed relative
to the number of times the entry and/or exit blocks are executed. The third
line displays the best-case and actual cycles per instruction (CPI) for the
block. The best-case scenario includes all stalls statically predictable from
the instruction stream (e.g., an Alpha 21164 cannot dual-issue consecutive
store instructions) but assumes that there are no dynamic stalls (e.g., all
load instructions hit in the D-cache). The last line above displays the best-case
cycles per instruction as a percentage of the actual.
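The arithmetic behind the third and fourth lines of the example can be reproduced as follows. This is only an illustrative Python sketch: it reads 8/9 and 22/9 as cycles divided by instructions, which is an interpretation of the example rather than something the output states, and the function name block_cpi is hypothetical.

```python
def block_cpi(best_cycles: int, actual_cycles: int, instructions: int):
    """Return (best-case CPI, actual CPI, best-case as a percentage of actual)."""
    best_cpi = best_cycles / instructions
    actual_cpi = actual_cycles / instructions
    pct_without_dynamic_stalls = 100.0 * best_cycles / actual_cycles
    return best_cpi, actual_cpi, pct_without_dynamic_stalls


# Values taken from the example block summary above.
best, actual, pct = block_cpi(best_cycles=8, actual_cycles=22, instructions=9)
print(f"Best-case {best:.2f}CPI, Actual {actual:.2f}CPI")
print(f"({pct:.0f}% execution without dynamic stalls)")
```

With the example's numbers, the sketch reproduces 0.89 CPI, 2.44 CPI, and 36%.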
Procedure Level Information
At the procedure level, dcpicalc displays summary information
in the entry block. This information includes the number of instructions
in the procedure, averages of the best-case and actual cycles per instruction
(computed from the per-block values weighted by block execution counts),
and a sorted list of blocks accounting for 90% of the stalls in the procedure.
Moreover, dcpicalc summarizes how the cycles are spent. Here is
a sample summary followed by line-by-line explanations.
Line 1 I-cache (not ITB) 3.5% to 7.4%
Line 2 ITB/I-cache miss 3.7% to 3.7%
Line 3 D-cache miss 25.2% to 27.2%
Line 4 DTB miss 0.0% to 1.7%
Line 5 Write buffer 0.0% to 0.0%
Line 6 Synchronization 0.0% to 0.0%
Line 7 Branch mispredict 0.7% to 2.6%
Line 8 IMUL busy 0.0% to 0.0%
Line 9 FDIV busy 0.0% to 0.0%
Line 10 Other 0.0% to 0.0%
Line 11 Unexplained stall 1.9% to 1.9%
Line 12 Unexplained gain -0.8% to -0.8%
Line 13 Subtotal dynamic 38.4%
Line 14 Slotting 6.4%
Line 15 Ra dependency 10.0%
Line 16 Rb dependency 2.9%
Line 17 Rc dependency 0.0%
Line 18 FU dependency 1.9%
Line 19 Subtotal static 21.2%
Line 20 Total stall 59.6%
Line 21 Useful 39.4%
Line 22 Nops 1.2%
Line 23 Execution 40.6%
Line 24 Net sampling error -0.2%
Line 25 Total tallied 100.0%
Line 26 (114504716, 88.8% of all samples)
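The subtotals and totals in this sample are consistent with one another, which can be checked with a few additions. The Python sketch below is illustrative only: the variable and function names are hypothetical, the values are copied from the sample, and the dynamic subtotal (line 13) is taken as printed rather than derived from the ranges.

```python
# Values copied from the sample summary above (percent of cycles tallied).
static_stalls = {
    "Slotting": 6.4,
    "Ra dependency": 10.0,
    "Rb dependency": 2.9,
    "Rc dependency": 0.0,
    "FU dependency": 1.9,
}
subtotal_dynamic = 38.4      # line 13, taken as printed
useful, nops = 39.4, 1.2     # lines 21 and 22
net_sampling_error = -0.2    # line 24


def check_summary():
    """Recompute lines 19, 20, 23, and 25 from the other lines."""
    subtotal_static = round(sum(static_stalls.values()), 1)            # line 19
    total_stall = round(subtotal_dynamic + subtotal_static, 1)         # line 20
    execution = round(useful + nops, 1)                                # line 23
    total = round(total_stall + execution + net_sampling_error, 1)     # line 25
    return subtotal_static, total_stall, execution, total


print(check_summary())
```

The recomputed values are 21.2, 59.6, 40.6, and 100.0, matching lines 19, 20, 23, and 25 of the sample.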
- Lines 1 to 13
- show all dynamic stall cycles. See previous discussion of instruction
level information for the meanings of these categories.
For the difference between "I-cache (not ITB)" and "ITB/I-cache
miss" (lines 1 and 2), please see the earlier discussion on the corresponding
bubbles `i' and `I'.
Unexplained stall (line 11) represents stall cycles for which dcpicalc cannot
offer any plausible explanation.
Unexplained gain (line 12) occurs when instructions take fewer
cycles than even the ideal assumption. For example, since we take
dual-issue as the ideal case, if in fact three instructions are issued
(two to the integer pipelines and one to a floating point pipeline),
half a cycle would be attributed to unexplained gain. The "gain" is
shown as a negative number because all other numbers in the same
table represent cycles that are in some sense "lost" (e.g., to D-cache
misses) or "spent" (e.g., on executing instructions). These positive
numbers and the unexplained gain always add up to 100%, with the
gain shown as negative to indicate that its contribution is in an
Dcpicalc shows a range of stall cycles (as a percentage of total
cycles tallied) that could have been caused by each reason listed.
Some of the ranges may be wide if major stalls can be explained by
more than one reason. Generally, the accuracy of the analysis can
be improved using profiles for non-cycles events. Currently, dcpicalc takes
advantage of imiss, itbmiss, and dtbmiss profiles if they are specified
on the command line. Although the contributions of individual stall
reasons are reported as ranges, the subtotal for all dynamic stalls
(line 13) is not. It represents the cycles attributed to any one
or more of the reasons. Therefore, it does not depend on how stall
cycles are apportioned among alternative reasons for the same stall.
- Lines 14 to 19
- show the static stall cycles. These are stall cycles that
the processor would suffer even if there were no dynamic
stalls. For example, this assumes that a load from memory
takes only two cycles, which corresponds to a D-cache hit.
Additional stall cycles due to a cache miss are considered
dynamic. If an instruction is stalled for multiple reasons,
the static stall cycles are attributed to the last reason
preventing instruction issue. Thus, shorter stalls are hidden
by longer ones.
- Slotting (line 14)
- refers to stall cycles resulting from static resource
conflicts among the instructions within the same "window" that
the processor considers for issue in any given cycle.
- Ra/Rb/Rc dependencies (lines 15-17)
- refer to stall cycles caused by register dependencies
on previous instructions involving, respectively, Ra/Rb/Rc
of the stalled instruction.
- FU dependency (line 18)
- refers to stall cycles caused by competition for
function units and other internal resources in the processor.
- Lines 21 to 23
- are the
Line 23 includes
line 22 includes
21 includes "useful" instructions
of them is
(of the respective
to be the
this is that
to two integer
Since dcpicalc assumes
to be the
to 100% execution),
due to sampling
Note that the time spent on "nops" is not necessarily wasted. These
operations are often inserted deliberately by the compiler's instruction
scheduler to improve instruction execution by the processor's pipeline.
If they were removed, fewer instructions would be executed, but execution
time would not necessarily decrease.
- Line 24
Typically, dcpicalc, dcpisource(1),
and dcpi2ps(1) are used together as follows:
dcpicalc -db db idle_thread /vmunix | \
dcpisource -f /src/kernel/kern/sched_prim.c | \
dcpi2ps -o idle_thread.ps
It is also possible to read the ASCII output of dcpicalc directly.
PROBLEMS WITH JUMP TABLE TARGETS
During the construction of the control flow graph, dcpicalc tries
to determine the targets of all computed jumps. If it fails to do so for a
jump, it prints an error message saying that it could not compute jump table
targets. In such cases, the user can guide dcpicalc by supplying
lower and upper bounds on the value of the index register used in the jump. dcpicalc then
uses these bounds to determine all possible targets of the computed jump.
To supply a lower and an upper bound on the value
of the index register of a jump in an image, create a file called ".dcpijumps" in
the current directory or in the home directory. This file should contain
lines of the form:
0x<image_id in hex> 0x<jump address in hex> <lower> <upper>
The file should contain one line for each image/computed-jump pair for which
dcpicalc could not automatically determine the targets.
Use dcpiscan to determine the image_id for an image.
When we run dcpicalc on a particular procedure, it prints the following
message to stderr:
% dcpicalc ...
0x12004bb10: could not compute jump table targets
The next step is to examine the disassembled code in the neighborhood of the
computed jump at address 0x12004bb10. (The output of either dcpilist
or dcpicalc can be used for this purpose.)
04baf8 244:41da53b6 cmpult s5, 0xd2, t8 0
04bafc 244:e6c009d9 beq t8, 0x12004e264 0
04bb00 244:a79d82c0 ldq at, -32064(gp) 0
04bb04 244:41dc0459 s4addq s5, at, t11 0
04bb08 244:a3390000 ldl t11, 0(t11) 0
04bb0c 244:433d0419 addq t11, gp, t11 0
04bb10 244:6bf903e5 jmp zero, (t11), 0x12004caa8 0
The pair of instructions cmpult/beq at 0x12004baf8 branches
away from the jump instruction if s5 is not in the range [0..0xd1].
The ldq instruction loads into register at the base address
of the jump table associated with this computed jump. The s4addq multiplies
the index register s5 by 4, adds it to the base of the jump table
to get a pointer into the jump table, and stores the resulting pointer in register t11.
The ldl instruction loads the corresponding jump table entry into t11 and
the following addq adds the gp to the value of t11 since
the jump entries are offsets from the contents of gp.
Because of the cmpult/beq instruction pair, we know that the jmp instruction
is reachable only when the index register s5 has a value in the
range [0..0xd1], that is, 0 to 209 decimal. Therefore, the following entry should be placed
in the .dcpijumps file:
0x3249774100393048 0x12004bb10 0 209
(The image-id 0x3249774100393048 was determined by dcpiscan.)
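The entry above can be derived mechanically from the cmpult immediate. The following Python sketch is illustrative only and not part of DCPI: the function names are hypothetical, and it assumes the cmpult/beq guard pattern of this example, where the jump is reached only when the index is below the immediate as an unsigned value.

```python
def bounds_from_cmpult(limit: int):
    """Bounds on the index when the jump is guarded by `cmpult idx, limit; beq`.

    The beq branches away when the unsigned comparison idx < limit is false,
    so the jump is reached only for 0 <= idx <= limit - 1.
    """
    return 0, limit - 1


def dcpijumps_entry(image_id: int, jump_addr: int, lower: int, upper: int) -> str:
    """Format one line in the form described above: hex ids, decimal bounds."""
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    return f"0x{image_id:x} 0x{jump_addr:x} {lower} {upper}"


lower, upper = bounds_from_cmpult(0xD2)
print(dcpijumps_entry(0x3249774100393048, 0x12004BB10, lower, upper))
```

For this example the sketch prints the same line as the entry shown above.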
Dcpicalc works only on Alpha 21064/EV4 and 21164/EV5 processors. For Alpha
21264A/EV67 and later processors, use insights gained from the ProfileMe
statistics instead. See dcpiprofileme(1).
dcpi(1), dcpi2bb(1), dcpi2pix(1), dcpi2ps(1), dcpicat(1), dcpicc(1), dcpicoverage(1), dcpictl(1), dcpid(1), dcpidiff(1), dcpidis(1), dcpiepoch(1), dcpiflow(1), dcpiflush(1), dcpikdiff(1), dcpilabel(1), dcpildlatency(1), dcpilist(1), dcpiprof(1), dcpiprofileme(1), dcpiquit(1), dcpiscan(1), dcpisource(1), dcpistats(1), dcpisumxct(1), dcpitar(1), dcpitopcounts(1), dcpitopstalls(1), dcpiuninstall(1), dcpiupcalls(1), dcpivarg(1), dcpivcat(1), dcpiversion(1), dcpivlst(1), dcpivprofiler(1), dcpiwhatcg(1), dcpix(1), dcpiformat(4), dcpiexclusions(4)
For more information, see the DCPI project home page http://h30097.www3.hp.com/dcpi.
Copyright 1996-2004, Hewlett-Packard Company.
All rights reserved.