HP DCPI tool

dcpicalc(1)

NAME

dcpicalc - Analyze performance on Alpha 21064/EV4 and 21164/EV5 systems

SYNOPSIS

dcpicalc [<options>] -procedures procedure-name-list -- image-file

dcpicalc [<options>] procedure-name image-file

DESCRIPTION

Dcpicalc generates the control flow graph of the specified procedure(s) in the specified image file. Using profiles collected by dcpid(1) and stored in the selected profile files, dcpicalc augments the graph with estimated execution counts of basic blocks, cycles-per-instruction for instructions, possible explanations for stalls, and other useful information. The resulting flow graph is printed to standard output. The output can be converted to PostScript by dcpi2ps(1).

The first command syntax allows you to specify multiple procedures. dcpicalc concatenates the outputs for the individual procedures, starting each with a line of the form

; PROC procedure-name

Analyzing multiple procedures at a time is typically much more efficient than invoking the command once per procedure, although dcpicalc reports exactly the same information in both cases. The -procedures option can be mixed with the other options. The list of procedures is terminated by "--" or another option. The second command syntax can name only one procedure.
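
The concatenated output can be separated again by scanning for the "; PROC" header lines. The following Python sketch is only an illustration: the function name is hypothetical, and it assumes the header is written exactly as "; PROC " followed by the procedure name at the start of a line.

```python
def split_by_procedure(output):
    """Group dcpicalc output lines under the procedure named by each '; PROC' header.

    Assumption: headers look exactly like '; PROC <name>' at the start of a line.
    Lines that appear before the first header are ignored.
    """
    prefix = "; PROC "
    sections = {}
    current = None
    for line in output.splitlines():
        if line.startswith(prefix):
            current = line[len(prefix):].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections
```

The returned dictionary preserves the order in which the procedures appear in the output.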

FLAGS

-help
Print information about options.

-print_opcode
Output the machine code, in hex, for each instruction.

-cutoff n
Omit basic blocks taking less than n% of the time spent in the procedure. The instructions of these basic blocks are not printed. When the output is piped through dcpi2ps(1), these basic blocks appear as tiny boxes with only block names. Note that n is a floating point number between 0 and 100 (inclusive). The default value is 0: no blocks are omitted.

-procedures procedure-name-list
Analyze the specified procedures. The list is terminated by "--" or another option.

-version
Print program version information.
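
As an illustration of how the first command syntax and the flags above combine, here is a small Python sketch that assembles an argument list. The helper name and its parameters are hypothetical; only the flags themselves (-print_opcode, -cutoff, -procedures, and the "--" terminator) come from this page.

```python
def build_dcpicalc_argv(procedures, image_file, cutoff=None, print_opcode=False):
    """Build an argv list for: dcpicalc [options] -procedures name... -- image-file."""
    if not procedures:
        raise ValueError("at least one procedure name is required")
    argv = ["dcpicalc"]
    if print_opcode:
        argv.append("-print_opcode")
    if cutoff is not None:
        # -cutoff takes a floating point percentage between 0 and 100 inclusive.
        if not 0.0 <= cutoff <= 100.0:
            raise ValueError("cutoff must be between 0 and 100")
        argv += ["-cutoff", str(cutoff)]
    # The procedure list is terminated by "--" before the image file.
    argv += ["-procedures", *procedures, "--", image_file]
    return argv
```

The resulting list could be handed to subprocess.run on a system where dcpicalc is installed.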

PROFILE SELECTION FLAGS

By default, this command automatically finds all of the relevant profile files. The following options can be used to guide the search for the profile files.

-db <directory name>
Search for profile files in the specified profile database directory. The directory name should be the same as the one specified when dcpid was started. If this option is not specified, the directory name is obtained from the DCPIDB environment variable. If neither this option nor the DCPIDB environment variable is set, the name of the directory used by the last invocation of dcpid on this machine is used. If none of these methods succeeds in finding the appropriate directory, and no explicit set of profile files is provided via the -profiles option, then the command fails.

-epoch latest
Search for profile files in the latest epoch. This is the default.

-epoch latest-k
Search for profile files in the "k+1"th most recent epoch. For example, search in the third most recent epoch if -epoch latest-2 is specified.

-epoch <name>
Search for profile files in the named epoch. The epoch name should be the name of a subdirectory corresponding to a single epoch within the profile database directory. Epoch subdirectory names usually take the form YYYYMMDDHHMM (year-month-day-hours-minutes). For example, an epoch started on June 11, 2002 at 22:33 would be named 200206112233. If an epoch is given a symbolic name by creating a symbolic link to the actual epoch directory, then the symbolic name can also be used as an argument to the -epoch option.

-epoch all
Search for profile files in all epochs.

-ihost <hostnames...> --
Include just those profile files associated with the specified host names. The list of host names must be terminated either via -- or by the end of the option list. The command prints an error message and fails if both the -ihost and -ehost options are specified.

-ehost <hostnames...> --
Exclude any profile files associated with the specified host names. The list of host names must be terminated either via -- or by the end of the option list. The command prints an error message and fails if both the -ihost and -ehost options are specified.

-label <label>
Search for profile files with the specified label(s) (see dcpilabel(1)). This option can be repeated multiple times. If no labels are specified on the command line, profile file labels are ignored entirely. If any labels are specified on the command line, only profile files that have one of the specified labels are used.

-profiles <file names...> --
Use just the profile files named by the specified file names. The list of profile file names can be terminated either via --, or by the end of the option list. The command prints an error message and fails if the -profiles option is used in conjunction with any of the earlier automatic profile finding options. (Use the automatic profile lookup mechanism, or explicitly name the profile files with the -profiles option; but don't do both.)
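
The lookup order described for -db and the usual epoch naming scheme can be sketched in Python as follows. This is not dcpicalc's own code. The function names are hypothetical, and because this page does not say how the directory used by the last dcpid invocation is recorded, that value is simply passed in as a parameter.

```python
import os
from datetime import datetime


def resolve_db_dir(db_option=None, environ=None, last_dcpid_dir=None):
    """Pick the profile database directory in the documented order:
    the -db option, then the DCPIDB environment variable, then the directory
    used by the last dcpid invocation (supplied by the caller; hypothetical).
    """
    environ = os.environ if environ is None else environ
    for candidate in (db_option, environ.get("DCPIDB"), last_dcpid_dir):
        if candidate:
            return candidate
    # The real command fails here unless -profiles names the files explicitly.
    raise RuntimeError("no profile database directory found")


def epoch_name(started):
    """Format an epoch start time in the usual YYYYMMDDHHMM form."""
    return started.strftime("%Y%m%d%H%M")
```

For example, epoch_name(datetime(2002, 6, 11, 22, 33)) gives the name used in the -epoch example above.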

STATISTIC SELECTION FLAGS

Different kinds of performance counter statistics are available on various models of Alpha CPUs. Alpha 21064/EV4, 21164/EV5 and 21264/EV6 CPUs have traditional aggregate event counters. Alpha 21264A/EV67 and later processors have a mix of some traditional aggregate event counters and newer ProfileMe counters which allow accurate and precise instruction execution profiles on out-of-order processors. (See dcpiprofileme(1) for more information on ProfileMe statistics.)

The default statistic selection on an aggregate counter machine is to select all the aggregate events. The default on a ProfileMe machine is to select ProfileMe retire delay, retire count, !retired (i.e. aborted) count, !notrap (i.e. trap) count, and aggregate cycles.

The options below can be used to select various statistics when available. Use -event for aggregate statistics and -pm for ProfileMe statistics. Note: there can be multiple, mixed -event and -pm specifications. You can also specify the ratio of two statistics (written as stat1::stat2).

-pm pm_stat(+pm_stat)
Select the specified ProfileMe statistic plus any added in by optional +pm_stat specifications. For example, select various trap statistics by specifying the option -pm trap+replays+ldstorder+mispredict.

-pm default(+pm_stat)
Select the default set of ProfileMe statistics plus those added in by +pm_stat specifications. At least one additional statistic is mandatory; -pm default without modifications is extraneous and not allowed. The additional ProfileMe statistics will take the place of the aggregate cycles statistic which is selected by default.

-pm all(-pm_stat)
Select all ProfileMe statistics less those subtracted out. You can repeat the optional -pm_stat specification to deselect multiple ProfileMe statistics. Note: there are a lot of ProfileMe statistics. Unless you deselect a bunch of them, this will select more statistics than are appropriate for human consumption.

-event ag_stat(+ag_stat)
Select the specified aggregate statistic plus any added in by optional +ag_stat specifications. For example, select cycles, icache misses, and data cache misses when the option -event cycles+imiss+dmiss is specified.

-event all(-ag_stat)
Select all aggregate statistics less those subtracted out. You can repeat the optional -ag_stat specification to deselect multiple aggregate statistics.

-allevents
Select profile events corresponding to all event types, both aggregate and ProfileMe. However, if there are ProfileMe events, this will produce a large number of statistics, which in most cases will not be useful.
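
To make the selection syntax concrete, the Python sketch below expands a "+"-joined list or an "all" specification against a list of available statistics. It is only an illustration of the notation, not dcpicalc's parser. It assumes that deselection is written as all-stat1-stat2 and that statistic names contain neither "+" nor "-"; the "default" form and the stat1::stat2 ratio form are not handled.

```python
def select_stats(spec, available):
    """Expand a statistic specification into a list of statistic names.

    Assumptions (not confirmed by this page): 'all-x-y' removes x and y from
    the available set, and names contain neither '+' nor '-'.
    """
    if spec == "all" or spec.startswith("all-"):
        removed = set(spec.split("-")[1:])
        return [name for name in available if name not in removed]
    names = spec.split("+")
    unknown = [name for name in names if name not in available]
    if unknown:
        raise ValueError("unknown statistics: " + ", ".join(unknown))
    return names
```

With the aggregate events named on this page, select_stats("cycles+imiss+dmiss", ...) keeps those three in the order given, and select_stats("all-dmiss", ...) keeps everything except dmiss.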

EXECUTION COUNT AND STALL ANALYSIS FLAGS

The following options can be used to control the heuristics for estimating execution counts and identifying the causes of stalls.

-conf_low
Generate low, medium, and high confidence data.

-conf_med
Generate medium and high confidence data. (default)

-conf_high
Generate only high confidence data.

-cross_procedure [optimistic | pessimistic | selective]
Choose what assumption to make when a procedure call boundary is encountered while looking for reasons to explain dynamic stalls. A procedure call boundary is either a call made by the procedure being analyzed or the beginning or end of that procedure. With pessimistic, assume that whatever happens outside the analyzed procedure can cause a dynamic stall inside it. With optimistic, assume that it cannot. With selective, the assumption is based on standard procedure call convention. (The default is optimistic.)

-do_gp
Use a (non-linear time) constraint solver to exploit global flow constraints when estimating execution counts. The estimates may still violate flow constraints.

-tab foo.tab
Get execution counts from the output of dcpix(1) instead of estimating them (estimates may be inaccurate). Also requires a .xct file (see -xct).

-xct foo.xct
Get execution counts from the output of dcpix(1) instead of estimating them (estimates may be inaccurate). Also requires a .tab file (see -tab).

-xct_factor num
Scales counts from .xct files by num. Useful when you run a program once under dcpix(1) but multiple (num) times under dcpid(1) to get more samples. Used in conjunction with -tab and -xct.

INTERPRETING OUTPUT

Dcpicalc provides information at the instruction, basic block, and procedure level. Dcpicalc is sometimes unable to estimate the cycle-to-sample ratio for a block. Such blocks are excluded from all summary information except the instruction count. Dcpicalc makes no attempt to identify stalls (static or dynamic) in such blocks. Therefore, most of the following discussion pertains only to blocks with known cycle-to-sample ratios.

Apart from the instruction, block, and procedure level information described below, the output also contains lines that encode the procedure's control flow graph for use by other analysis tools (notably dcpi2ps(1), which prints the graph in PostScript). These lines start with "B" and "A" in the first column. They are not intended for users.

Instruction Level Information

At the instruction level, dcpicalc inserts "bubbles" into the instruction listings to identify points where the processor stalls because it is unable to issue an instruction. Bubbles are inserted before the stalled instruction. Here is an example.

 588584  318:2e4c0000 ldq_u    a2, 0(s3)          1558  1
 588588  318:a79d2d70 ldq      at, 11632(gp)    191855  0  1.5cy
   a
   a
 58858c  318:4a4c00d2 extbl    a2, s3, a2       164109  2  1.5cy   8584
   s
     d
     d
     d
     d
     d
     d
 588590  318:43920412 addq     at, a2, a2       428395  1  4.0cy   8588
   b
     ?
     ?
 588594  318:2c320000 ldq_u    t0, 0(a2)        227783  1  2.0cy   8590
   s
 588598  318:22520001 lda      a2, 1(a2)        121068  1  1.0cy  
   b
     d
     d
     d
     d
 58859c  318:48320f41 extqh    t0, a2, t0       336123  1  3.0cy   8598 8594
   s
 5885a0  318:48271781 sra      t0, 0x38, t0     123408  1  1.0cy  
   b
 5885a4  318:41810402 addq     s3, t0, t1       127442  1  1.0cy   85a0
   s
 5885a8  318:2c620000 ldq_u    t2, 0(t1)        123021  1  1.0cy  
 5885ac  318:47ff041f bis      zero, zero, zero      0  0  nop
   a
   a
     d
     d
     d
     d
     d
     d
     d
     d
 5885b0  318:486200c4 extbl    t2, t1, t3       658189  2 6.0cy   85a8
 5885b4  318:47ff0403 bis      zero, zero, t2        0  0
 5885b8  318:48807630 zapnot   t3, 0x3, a0      122504  1 1.0cy
 5885bc  318:47ff041f bis      zero, zero, zero      0  0 nop
     i
 5885c0  318:421fd9b1 cmplt    a0, 0xfe, a1     155841  1 1.5cy  
 5885c4  318:e6200002 beq      a1, 0x1205885d0       0  0 
Each line of assembly code shows, from left to right:
  • the instruction's address (hexadecimal),
  • the source line number (decimal),
  • the instruction's 32-bit machine code in hexadecimal (only if -print_opcode is specified),
  • the instruction mnemonic and operands,
  • the number of PC samples falling at this instruction address (decimal),
  • the minimum number of cycles the instruction is predicted to spend at the head of the issue queue (the actual schedule may vary),
  • (optionally) the average number of cycles spent at this instruction address,
  • (optionally) the other instructions that may have caused this instruction to stall (see details below).
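
For readers who want to post-process the listing, the fields above can be pulled apart with a regular expression. The Python sketch below is based only on the sample listing shown earlier and assumes the -print_opcode layout (line number, colon, eight hex digits). It is not an official grammar for the output, and the field names are made up for illustration.

```python
import re

# Pattern derived from the sample listing; bubble lines do not match it.
LINE_RE = re.compile(
    r"^\s*(?P<addr>[0-9a-f]+)\s+"
    r"(?P<line>\d+):(?P<opcode>[0-9a-f]{8})\s+"
    r"(?P<mnemonic>\S+)\s+"
    r"(?P<operands>.*?)\s+"
    r"(?P<samples>\d+)\s+"
    r"(?P<min_cycles>\d+)"
    r"(?:\s+(?P<avg_cycles>\d+\.\d+)cy)?"
    r"(?P<rest>(?:\s+\S+)*)\s*$"
)


def parse_instruction_line(text):
    """Return a dict of fields for an instruction line, or None for other lines."""
    m = LINE_RE.match(text)
    if m is None:
        return None
    rest = m.group("rest").split()
    avg = m.group("avg_cycles")
    return {
        "addr": int(m.group("addr"), 16),
        "source_line": int(m.group("line")),
        "opcode": m.group("opcode"),
        "mnemonic": m.group("mnemonic"),
        "operands": m.group("operands"),
        "samples": int(m.group("samples")),
        "min_cycles": int(m.group("min_cycles")),
        "avg_cycles": float(avg) if avg is not None else None,
        "is_nop": "nop" in rest,
        # Remaining tokens are the possible culprit instructions.
        "culprits": [tok for tok in rest if tok != "nop"],
    }
```

Bubble lines such as "   a" or "     d" return None, so a caller can treat them separately.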

Each line in the listing represents a half-cycle, which makes it easy to see whether instructions are being dual-issued. To avoid excessively long listings, however, dcpicalc represents a very long stall with a large but limited number of bubbles. The actual number of stall cycles is shown as a number along with the bubbles.

Stall cycles are either static or dynamic. Static stall cycles are those that the processor would suffer even if there were no dynamic stalls (e.g., if all memory loads hit in the D-cache and all conditional branches are predicted correctly). The rest are dynamic. The bubbles for the static and dynamic stall cycles are shown in different columns.

In the static column (the leftmost column), bubbles have the following meanings:

  • s refers to stall cycles resulting from static resource conflicts among the instructions within the same "window" (consisting of two instructions for Alpha 21064 and four for 21164) that the processor considers for issue in any given cycle.

  • a/b/c refer to stall cycles caused by register dependencies on previous instructions involving, respectively, Ra/Rb/Rc of the stalled instruction.

  • f refers to stall cycles caused by competition for the function units and other internal resources in the processor.

In the dynamic column(s), there may be multiple possible explanations for the same stall cycles; sometimes there may be none. Each explanation is represented by a column of bubbles. In some cases, dcpicalc can compute the maximum number of stall cycles that a particular reason can account for. If this is less than the number of stall cycles, the column for that reason may not extend all the way down to the stalled instruction.

The bubbles have the meanings below.

  • d - D-cache miss
  • D - DTB miss
  • I - I-cache or ITB miss
  • i - I-cache miss (but not ITB miss)
  • w - write buffer overflow
  • y - synchronization of memory operations (using memory barriers)
  • p - branch misprediction
  • f - busy function unit
  • o - other (currently TRAPB, EXCB, or load-after-store replay trap)
  • ? - unexplained
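
The two legends can be kept in a pair of lookup tables when reading the listing programmatically. The following Python sketch simply restates the meanings given above. The function name is hypothetical, and the caller is assumed to know already whether a bubble sits in the static (leftmost) column or in a dynamic column.

```python
STATIC_BUBBLES = {
    "s": "slotting (static resource conflict within the issue window)",
    "a": "Ra register dependency",
    "b": "Rb register dependency",
    "c": "Rc register dependency",
    "f": "function unit or other internal resource conflict",
}

DYNAMIC_BUBBLES = {
    "d": "D-cache miss",
    "D": "DTB miss",
    "I": "I-cache or ITB miss",
    "i": "I-cache miss (but not ITB miss)",
    "w": "write buffer overflow",
    "y": "synchronization of memory operations",
    "p": "branch misprediction",
    "f": "busy function unit",
    "o": "other (TRAPB, EXCB, or load-after-store replay trap)",
    "?": "unexplained",
}


def describe_bubble(symbol, static_column):
    """Look up a bubble symbol; 'f' differs between the static and dynamic columns."""
    table = STATIC_BUBBLES if static_column else DYNAMIC_BUBBLES
    return table.get(symbol, "unknown symbol")
```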

Several points are worth mentioning here. First, there is no symbol for an ITB miss alone because an I-cache miss is possible whenever an ITB miss is possible. Second, "other" means miscellaneous other reasons that typically account for only a tiny percentage of stalls. Currently it includes stalls at TRAPB and EXCB instructions, which are not issued until all previous instructions are guaranteed to complete without traps (TRAPB) or without traps and exceptions (EXCB). Third, the symbol "f" may appear in both the static and dynamic columns because competition for function units may explain both static and dynamic stalls. For example, the stall caused by a floating-point division may be partly static, because part of it can be predicted by scheduling the instructions, and partly dynamic, because part of it is data dependent. An "f" in the dynamic column typically means a busy integer multiply or floating-point divide unit.

For each stalled instruction, dcpicalc also lists instructions that may have caused the stalls. This list appears at the end of the line showing the stalled instruction. A four-digit hexadecimal address indicates an instruction in the same basic block as the stalled instruction; a full block name with a four-digit hexadecimal address indicates an instruction in another basic block; a full block name without an address indicates that the instruction potentially causing the stall is assumed to be in another procedure, which can be a callee or the caller of the current procedure. Note that the lists of instructions and explanations are not always exhaustive, in part because longer stalls may hide shorter ones.
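
In the sample listing, the four-digit culprit addresses match the low 16 bits of earlier instruction addresses in the same block (for example, 8584 next to the instruction at 58858c refers to 588584). Under that assumption, which is an inference from the example rather than a documented rule, a full address can be recovered as in this Python sketch. The function name is hypothetical, and a block that straddles a 64 KB boundary is not handled.

```python
def resolve_culprit(stalled_addr, short_hex):
    """Combine the high bits of the stalled instruction's address with a
    four-digit culprit address taken from the end of the listing line.

    Assumption: both instructions share the same upper address bits.
    """
    return (stalled_addr & ~0xFFFF) | int(short_hex, 16)
```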

If an instruction is a nop, dcpicalc will indicate it by appending "nop" to the line showing the instruction.

Block Level Information

At the beginning of a block, dcpicalc displays summary information for the block. For example,

 *** One cycle = 714428 samples
 *** Executed 4.83 times/invocation
 *** Best-case 8/9 =  0.89CPI, Actual 22/9 =   2.44CPI
 *** (36% execution without dynamic stalls)
The first line is the cycle-to-sample ratio for the block: dcpicalc's estimate of how many PC samples in the profiling data correspond to one cycle. The next line is the average number of times the block is executed relative to the number of times the entry and/or exit blocks are executed. The third line displays the best-case and actual cycles per instruction (CPI) for the block. The best-case scenario includes all stalls statically predictable from the instruction stream (e.g., an Alpha 21164 cannot dual-issue consecutive store instructions) but assumes that there are no dynamic stalls (e.g., all load instructions hit in the D-cache). The last line displays the best-case cycles per instruction as a percentage of the actual (here 8/22, or 36%).
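
The arithmetic behind the third and fourth summary lines can be reproduced with a few divisions. The Python sketch below uses the numbers from the example (8 best-case cycles, 22 actual cycles, 9 instructions). The function name is hypothetical.

```python
def block_cpi(best_cycles, actual_cycles, instructions):
    """Return (best-case CPI, actual CPI, percent of execution without dynamic stalls)."""
    best_cpi = best_cycles / instructions
    actual_cpi = actual_cycles / instructions
    # Best-case cycles as a percentage of actual cycles.
    percent_without_dynamic = 100.0 * best_cycles / actual_cycles
    return best_cpi, actual_cpi, percent_without_dynamic


best, actual, pct = block_cpi(8, 22, 9)
print(f"Best-case {best:.2f}CPI, Actual {actual:.2f}CPI, {pct:.0f}% without dynamic stalls")
```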

Procedure Level Information

At the procedure level, dcpicalc displays summary information in the entry block. This information includes the number of instructions in the procedure, averages of the best-case and actual cycles per instruction (computed from the per-block values weighted by block execution counts), and a sorted list of blocks accounting for 90% of the stalls in the procedure.

Moreover, dcpicalc summarizes how the cycles are spent. Here is a sample summary followed by line-by-line explanations.


  Line  1    I-cache (not ITB)   3.5% to  7.4%
  Line  2     ITB/I-cache miss   3.7% to  3.7%
  Line  3         D-cache miss  25.2% to 27.2%
  Line  4             DTB miss   0.0% to  1.7%
  Line  5         Write buffer   0.0% to  0.0%
  Line  6      Synchronization   0.0% to  0.0%

  Line  7    Branch mispredict   0.7% to  2.6%
  Line  8            IMUL busy   0.0% to  0.0%
  Line  9            FDIV busy   0.0% to  0.0%
  Line 10                Other   0.0% to  0.0%

  Line 11    Unexplained stall   1.9% to  1.9%
  Line 12     Unexplained gain  -0.8% to -0.8%
            ----------------------------------------
  Line 13     Subtotal dynamic                 38.4%
    
  Line 14             Slotting       6.4%
  Line 15        Ra dependency      10.0%
  Line 16        Rb dependency       2.9%
  Line 17        Rc dependency       0.0%
  Line 18        FU dependency       1.9%
            ----------------------------------------
  Line 19      Subtotal static                 21.2%

            ----------------------------------------
  Line 20          Total stall                 59.6%
    
  Line 21               Useful      39.4%
  Line 22                 Nops       1.2%
            ----------------------------------------
  Line 23            Execution                 40.6%
    
  Line 24   Net sampling error                 -0.2%
            ----------------------------------------
  Line 25        Total tallied                100.0%
  Line 26   (114504716, 88.8% of all samples)

Lines 1 to 13

show all dynamic stall cycles. See previous discussion of instruction level information for the meanings of these categories.

For the difference between "I-cache (not ITB)" and "ITB/I-cache miss" (lines 1 and 2), please see the earlier discussion on the corresponding bubbles `i' and `I'.

Unexplained stall (line 11) represents stall cycles for which dcpicalc cannot offer any plausible explanation.

Unexplained gain (line 12) occurs when instructions take fewer cycles than even the ideal assumption. For example, since we take dual-issue as the ideal case, if in fact three instructions are issued (two to the integer pipelines and one to a floating point pipeline), half a cycle would be attributed to unexplained gain. The "gain" is shown as a negative number because all other numbers in the same table represent cycles that are in some sense "lost" (e.g., to D-cache misses) or "spent" (e.g., on executing instructions). These positive numbers and the unexplained gain always add up to 100%, with the gain shown as negative to indicate that its contribution is in an opposite direction.

Dcpicalc shows a range of stall cycles (as a percentage of total cycles tallied) that could have been caused by each reason listed. Some of the ranges may be wide if major stalls can be explained by more than one reason. Generally, the accuracy of the analysis can be improved using profiles for non-cycles events. Currently, dcpicalc takes advantage of imiss, itbmiss, and dtbmiss profiles if they are specified on the command line. Although the contributions of individual stall reasons are reported as ranges, the subtotal for all dynamic stalls (line 13) is not. It represents the cycles attributed to any one or more of the reasons. Therefore, it does not depend on how stall cycles are apportioned among alternative reasons for the same stall.

Lines 14 to 19

show the static stall cycles. These are stall cycles that the processor would suffer even if there were no dynamic stalls. For example, this assumes that a load from memory takes only two cycles, which corresponds to a D-cache hit. Additional stall cycles due to a cache miss are considered dynamic. If an instruction is stalled for multiple reasons, the static stall cycles are attributed to the last reason preventing instruction issue. Thus, shorter stalls are hidden by longer ones.

Slotting (line 14)
refers to stall cycles resulting from static resource conflicts among the instructions within the same "window" that the processor considers for issue in any given cycle.

Ra/Rb/Rc dependencies (lines 15-17)
refer to stall cycles caused by register dependencies on previous instructions involving, respectively, Ra/Rb/Rc of the stalled instruction.

FU dependency (line 18)
refers to stall cycles caused by competition for function units and other internal resources in the processor.

Lines 21 to 23

are the numbers of cycles spent on executing instructions. Line 23 includes all instructions; line 22 includes nops; line 21 includes "useful" instructions (i.e., instructions other than nops). Each of them is simply half the number of executed instructions (of the respective type) since we assume dual-issue to be the ideal case. Note: This percentage may exceed 100%. One reason for this is that the Alpha 21164 may issue floating point instructions in addition to two integer instructions per cycle. Since dcpicalc assumes dual issue to be the ideal case (corresponding to 100% execution), the extra instructions would cause this percentage to exceed 100%. Another possible cause is the existence of discrepancies due to sampling error in rarely executed code.

Note that the time spent on "nops" is not necessarily wasted. These operations are often inserted deliberately by the compiler's instruction scheduler to improve instruction execution by the processor's pipeline. If they were removed, fewer instructions would be executed, but execution time would not necessarily decrease.

Line 24

is the net discrepancy due to sampling error and inaccuracy in execution count estimates. This can give some indication of how noisy the sample data are, but since it is net discrepancy, two discrepancies of opposite signs may cancel out each other, giving a small error term. However, significant discrepancies are attributed to unexplained stall and gain (lines 11 and 12); they do not cancel out.

Line 25

is simply the sum of the subtotals. It should always be 100%.

Line 26

shows the total number of samples tallied for this summary, and its ratio to the number of all samples for this procedure. We tally only the samples falling in basic blocks whose execution counts have been determined by dcpicalc. All previous percentages in the summary are computed relative to the number of tallied samples.
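
The subtotals in the sample summary can be checked by hand or with a short script. The Python sketch below recomputes lines 19, 20, 23, and 25 from the values shown above. The dynamic subtotal (line 13) is taken as given because it is not the sum of the ranges on lines 1 to 12. The function and key names are hypothetical.

```python
def check_summary(subtotal_dynamic, static_parts, useful, nops, sampling_error):
    """Recompute the subtotals of the procedure summary (all values in percent)."""
    subtotal_static = round(sum(static_parts), 1)                      # line 19
    total_stall = round(subtotal_dynamic + subtotal_static, 1)         # line 20
    execution = round(useful + nops, 1)                                # line 23
    total = round(total_stall + execution + sampling_error, 1)         # line 25
    return {
        "subtotal_static": subtotal_static,
        "total_stall": total_stall,
        "execution": execution,
        "total_tallied": total,
    }


# Values from the sample: slotting, Ra, Rb, Rc, and FU dependency stalls.
summary = check_summary(38.4, [6.4, 10.0, 2.9, 0.0, 1.9], 39.4, 1.2, -0.2)
print(summary)
```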

TYPICAL USAGE

Typically, dcpicalc, dcpisource(1), and dcpi2ps(1) are used together as follows:

dcpicalc -db db idle_thread /vmunix | \
dcpisource -f /src/kernel/kern/sched_prim.c | \
dcpi2ps -o idle_thread.ps

It is also possible to read the ASCII output of dcpicalc directly.

PROBLEMS WITH JUMP TABLE TARGETS

During the construction of the control flow graph, dcpicalc tries to determine the targets of all computed jumps. If it fails to do so for a jump, it prints an error message saying that it could not compute jump table targets. In such cases, the user can guide the operation of dcpicalc by telling dcpicalc an upper and lower bound on the value of the index register used in the jump. dcpicalc then uses the upper and lower bounds to determine all possible targets of the computed jump.

To tell dcpicalc an upper bound and a lower bound on the value of the index register of a jump in an image, create a file called ".dcpijumps" in the current directory or in the home directory. This file should contain lines of the form:

    0x<image_id in hex>  0x<jump address in hex>  <lower>  <upper>
The file should contain one line for each image/computed-jump pair for which dcpicalc could not automatically determine the targets.

Use dcpiscan to determine the image_id for an image.

Example:

When we run dcpicalc on a particular procedure, it prints the following message to stderr:

   % dcpicalc ...
   0x12004bb10: could not compute jump table targets
The next step is to examine the disassembled code in the neighborhood of the computed jump at address 0x12004bb10. (The output of either dcpilist or dcpicalc can be used for this purpose.)
 ...
 04baf8  244:41da53b6 cmpult   s5, 0xd2, t8     0
 04bafc  244:e6c009d9 beq      t8, 0x12004e264  0
 04bb00  244:a79d82c0 ldq      at, -32064(gp)   0
 04bb04  244:41dc0459 s4addq   s5, at, t11      0
 04bb08  244:a3390000 ldl      t11, 0(t11)      0
 04bb0c  244:433d0419 addq     t11, gp, t11     0
 04bb10  244:6bf903e5 jmp      zero, (t11), 0x12004caa8 0
 ...
The pair of instructions cmpult/beq at 0x12004baf8 branch away from the jump instruction if s5 is not in the range [0..0xd1]. The ldq instruction loads into register at the base address of the jump table associated with this computed jump. The s4addq multiplies the index register s5 by 4, adds it to the base of the jump table to get a pointer into the jump table, and stores the resulting pointer in register t11. The ldl instruction loads the corresponding jump table entry into t11 and the following addq adds the gp to the value of t11 since the jump entries are offsets from the contents of gp.
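
The address computation performed by the s4addq/ldl/addq/jmp sequence can be modeled in a few lines of Python. This is a simplified sketch of the instruction sequence described above, not of dcpicalc's internal analysis. The function names, the read_ldl callback, and any table contents supplied to it are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def jump_target(index, table_base, gp, read_ldl):
    """Model the sequence for one value of the index register s5.

    read_ldl(addr) must return the sign-extended 32-bit table entry at addr.
    """
    entry_addr = table_base + 4 * index   # s4addq s5, at, t11
    offset = read_ldl(entry_addr)         # ldl    t11, 0(t11)
    return (offset + gp) & MASK64         # addq   t11, gp, t11


def all_targets(lower, upper, table_base, gp, read_ldl):
    """Enumerate the distinct jump targets for index values in [lower, upper]."""
    return sorted({jump_target(i, table_base, gp, read_ldl)
                   for i in range(lower, upper + 1)})
```

Given bounds on the index register, enumerating every index in the range yields the full set of candidate targets, which is why supplying the bounds is enough for dcpicalc to complete the control flow graph.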

Because of the cmpult/beq instruction pair, we know that the jmp instruction is reachable only when the index register s5 has a value in the range [0..0xd1]. Therefore, the following entry should be placed in .dcpijumps:

     0x3249774100393048 0x12004bb10 0 209
(The image-id 0x3249774100393048 was determined by dcpiscan.)
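
A .dcpijumps entry can also be generated mechanically once the bounds are known. The Python sketch below formats one line in the layout given earlier and derives the upper bound from the cmpult immediate (0xd2 - 1 = 209). The function name is hypothetical.

```python
def dcpijumps_line(image_id, jump_addr, lower, upper):
    """Format one .dcpijumps entry: hex image id, hex jump address, decimal bounds."""
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    return f"0x{image_id:x} 0x{jump_addr:x} {lower} {upper}"


# cmpult s5, 0xd2 lets the jump execute only when s5 < 0xd2.
print(dcpijumps_line(0x3249774100393048, 0x12004bb10, 0, 0xd2 - 1))
```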

LIMITATIONS

Dcpicalc works only on Alpha 21064/EV4 and 21164/EV5 processors. For Alpha 21264A/EV67 and later processors, use insights gained from the ProfileMe statistics instead. See dcpiprofileme(1).

SEE ALSO

dcpi(1), dcpi2bb(1), dcpi2pix(1), dcpi2ps(1), dcpicat(1), dcpicc(1), dcpicoverage(1), dcpictl(1), dcpid(1), dcpidiff(1), dcpidis(1), dcpiepoch(1), dcpiflow(1), dcpiflush(1), dcpikdiff(1), dcpilabel(1), dcpildlatency(1), dcpilist(1), dcpiprof(1), dcpiprofileme(1), dcpiquit(1), dcpiscan(1), dcpisource(1), dcpistats(1), dcpisumxct(1), dcpitar(1), dcpitopcounts(1), dcpitopstalls(1), dcpiuninstall(1), dcpiupcalls(1), dcpivarg(1), dcpivcat(1), dcpiversion(1), dcpivlst(1), dcpivprofiler(1), dcpiwhatcg(1), dcpix(1), dcpiformat(4), dcpiexclusions(4)

For more information, see the DCPI project home page http://h30097.www3.hp.com/dcpi.

COPYRIGHT

Copyright 1996-2004, Hewlett-Packard Company. All rights reserved.