ARSC HPC Users' Newsletter 211, January 12, 2001

Afternoon Tea with Don Morton

When: Tuesday, Jan 23rd at 4pm. Where: Meet in Butrovich 108

Come discuss your cluster questions and parallel teaching challenges. Professor Morton's specialty is building and using Unix clusters for high performance computing research and education.

Don Morton
Department of Computer Science
The University of Montana
Missoula, Montana
morton@cs.umt.edu

SGI Performance Monitoring with perfex

The perfex command on SGI R10000 and R12000 systems gives statistics from the hardware event counters, much as the hpm command does on Cray PVP platforms. It's available on the ARSC Onyx2, "video2"; the Octanes; and the ERDC/ARSC Origin 3800.

Basic use of perfex is trivial. As with hpm, there's no need to recompile. Simply launch your program using the perfex command. In the following example, -a -y -mp -o px are all arguments to perfex, and ./a.out is the name of the executable:

video2$ perfex -a -y -mp -o px ./a.out

See "man perfex" for details and more options, but here's what these do:

  -a:    Report estimates for all possible events by sampling all counters.

  -y:    Estimate actual costs for various interesting events.

  -mp:   Report per-thread counts for multiprocessing programs, in
         addition to the (default) overall totals.

  -o px: Store the per-thread counts and overall counts in separate
         files, using "px" when naming the files.  "px" will be the
         name of the file containing overall counts; "px.<PID>" will
         be the name of the file for the thread with process ID <PID>.

Thus, if a.out is a multithreaded program that runs with 8 threads, the above command will produce nine files: eight for the individual threads and a ninth for the overall totals.

Below is an example of the overall output from such a run. There are two tables, the first with event counter values and time estimates, the second with some interpreted rates. Note that to make the first table fit this newsletter format, I removed two columns. They were:


   Minimum      Maximum
  Time (sec)   Time (sec)
  =========== ===========

These columns quantify the uncertainty in the Time estimates, inherent in the sampling method. The values in these columns always bracket the "Typical Time (sec)" values. For example, the values I removed for the first row, "0 Cycles," were actually identical to the "Typical Time" values: "67.763404" and "67.763404".

Here are the contents of the file "px":


      WARNING: Multiplexing events to project totals--
      inaccuracy possible
      Summary for execution of ./a.out

                            Based on 250 MHz IP27
                               MIPS R10000 CPU
                              CPU revision 3.x 

                                                                Typical
   Event Counter Name                          Counter Value   Time (sec)
=========================================================================
 0 Cycles......................................... 16940851040  67.763404
16 Cycles......................................... 16940851040  67.763404
14 ALU/FPU progress cycles........................  3061778400  12.247114
 2 Issued loads...................................  1355207200   5.420829
18 Graduated loads................................  1335813152   5.343253
25 Primary data cache misses......................    94368960   3.401057
26 Secondary data cache misses....................     5245648   1.584186
 6 Decoded branches...............................   348844000   1.395376
22 Quadwords written back from primary data cache.    80468512   1.239215
21 Graduated floating point instructions..........   209756976   0.839028
23 TLB misses.....................................     2661104   0.724778
 3 Issued stores..................................   160580672   0.642323
 7 Quadwords written back from scache.............    24940944   0.638488
19 Graduated stores...............................   157762576   0.631050
 9 Primary instruction cache misses...............       98448   0.007096
31 Store/prefetch exclusive to shared block in scache...
                                                       1379072   0.005516
24 Mispredicted branches..........................      320112   0.001818
10 Secondary instruction cache misses.............        1328   0.000401
 4 Issued store conditionals......................        8336   0.000033
20 Graduated store conditionals...................        2944   0.000012
30 Store/prefetch exclusive to clean block in scache...
                                                          1696   0.000007
 5 Failed store conditionals......................         144   0.000001
 1 Issued instructions............................  6220435040   0.000000
 8 Correctable scache data array ECC errors.......           0   0.000000
11 Instruction misprediction from scache way prediction table..
                                                       11024     0.000000
12 External interventions.........................     2920944   0.000000
13 External invalidations.........................     6162768   0.000000
15 Graduated instructions.........................  6241925184   0.000000
17 Graduated instructions.........................  6246041168   0.000000
27 Data misprediction from scache way prediction table...
                                                      30389376   0.000000
28 External intervention hits in scache...........     2913088   0.000000
29 External invalidation hits in scache...........     3438496   0.000000



Statistics
=========================================================================
Graduated instructions/cycle..................................   0.368454
Graduated floating point instructions/cycle...................   0.012382
Graduated loads & stores/cycle................................   0.088164
Graduated loads & stores/floating point instruction...........   7.120506
Mispredicted branches/Decoded branches........................   0.000918
Graduated loads/Issued loads..................................   0.985689
Graduated stores/Issued stores................................   0.982451
Data mispredict/Data scache hits..............................   0.340981
Instruction mispredict/Instruction scache hits................   0.113509
L1 Cache Line Reuse...........................................  14.826981
L2 Cache Line Reuse...........................................  16.989953
L1 Data Cache Hit Rate........................................   0.936817
L2 Data Cache Hit Rate........................................   0.944413
Time accessing memory/Total time..............................   0.172428
Time not making progress (probably waiting on memory) / Total time...     
                                                                 0.819267
L1--L2 bandwidth used (MB/s, average per process).............  63.563851
Memory bandwidth used (MB/s, average per process).............  15.797584
MFLOPS (average per process)..................................   3.095432

To try to understand all this, read SGI's on-line manual, "Origin2000 and Onyx2 Performance Tuning and Optimization Guide", Chapter 4, "Profiling and Analyzing Program Behavior". Like all SGI manuals, it is readily available at:

http://techpubs.sgi.com/

Here's an excerpt which describes every statistic. This is fascinating stuff for those into performance programming:

Table 4-1: Derived Statistics Reported by perfex -y
Graduated instructions per cycle:


            When the R10000 is used to best advantage,
            this exceeds 1.0. When it is below 1.0, the
            CPU is idling some of the time.

  
Graduated floating point instructions per cycle:


            Relative density of floating-point operations
            in the program.

  
Graduated loads & stores per cycle:


            Relative density of memory-access in the
            program.

  
Graduated loads & stores per floating point instruction:


            Helps characterize the program as data
            processing versus mathematical.

  
Mispredicted branches / Decoded branches:


            Important measure of the effectiveness of
            branch prediction, and of code quality.

  
Graduated loads / Issued loads:


            When less than 1.0, shows that loads are
            being reissued because of cache misses.

  
Graduated stores / Issued stores:


            When less than 1.0, shows that stores are
            being reissued because of cache misses or
            contention between threads or between
            CPUs.

  
Data mispredictions / Data scache hits:


            The count of data misprediction from scache
            way prediction, as a fraction of all secondary
            data cache misses.

  
Instruction mispredictions / Instruction scache hits:


            The count of instruction misprediction from
            scache way prediction, as a fraction of all
            secondary instruction cache misses.

  
L1 Cache Line Reuse:


            The average number of times that a primary
            data cache line is used after it has been
            moved into the cache. Calculated as
            graduated loads plus graduated stores minus
            primary data cache misses, divided by
            primary data cache misses.

  
L2 Cache Line Reuse:


            The average number of times that a
            secondary data cache line is used after it has
            been moved into the cache. Calculated as
            primary data cache misses minus secondary
            data cache misses, divided by secondary data
            cache misses. 

  
L1 Data Cache Hit Rate:


            The fraction of data accesses satisfied from
            the L1 data cache. Calculated as 1.0 - (L1
            data cache misses / (graduated loads +
            graduated stores)). 

  
L2 Data Cache Hit Rate:


            The fraction of data accesses satisfied from
            the L2 cache. Calculated as 1.0 - (L2 data
            cache misses / primary data cache misses). 

  
Time accessing memory / Total time:


            A key measure of time spent idling, waiting
            for operands. Calculated as the sum of the
            typical costs of graduated loads and stores, L1
            data cache misses, L2 data cache misses, and
            TLB misses, all divided by the total run time
            in cycles.

  
L1-L2 bandwidth used (MBps, average per process):


            The amount of data moved between the L1
            and L2 data caches, divided by the total run
            time. The amount of data is taken as: L1 data
            cache misses times L1 cache line size, plus
            quadwords written back from L1 data cache
            times the size of a quadword (16 bytes). For
            parallel programs, the counts are aggregates
            over all threads, divided by number of
            threads. Multiply by the number of threads
            for total program bandwidth. 

  
Memory bandwidth used (MBps, average per process):


            The amount of data moved between L2 cache
            and main memory, divided by the total run
            time. The amount of data is taken as: L2 data
            cache misses times L2 cache line size, plus
            quadwords written back from L2 cache times
            the size of a quadword (16 bytes). For parallel
            programs, the counts are aggregates over all
            threads, divided by number of threads.
            Multiply by the number of threads to get the
            total program bandwidth.

  
MFLOPS (average per process):


            The ratio of graduated floating-point
            instructions and total run time. Note that a
            multiply-add carries out two operations, but
            counts as only one instruction, so this statistic
            can be an underestimate. For parallel
            programs, the counts are aggregates over all
            threads, divided by number of threads.
            Multiply by the number of threads to get the
            total program rate. 
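
These derived statistics can be reproduced from the raw counter values
in the first table. For example, L1 Cache Line Reuse is graduated loads
plus graduated stores minus primary data cache misses, all divided by
primary data cache misses:

  (1335813152 + 157762576 - 94368960) / 94368960 = 14.83

which agrees with the reported value of 14.826981.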

The perfex "-mp" option gives you a quick way to check load balance. You can grep any interesting statistic out of the per-thread output files and compare it thread by thread. Using output from the same a.out run shown above, here's an example:


  video2$ egrep -i "Graduated instructions/cycle" px*

  px:           Graduated instructions/cycle.................  0.368454
  px.0000547100:Graduated instructions/cycle.................  0.273813
  px.0000554480:Graduated instructions/cycle.................  0.281495
  px.0000554482:Graduated instructions/cycle.................  0.320702
  px.0000554495:Graduated instructions/cycle.................  0.314730
  px.0000554496:Graduated instructions/cycle.................  0.340351
  px.0000554497:Graduated instructions/cycle.................  0.874657
  px.0000554529:Graduated instructions/cycle.................  0.262788
  px.0000554551:Graduated instructions/cycle.................  0.274474

From this output, PID 0000554497 was 2-3 times busier than any other thread--and was presumably making the others wait. We can do another grep to see whether it was doing "useful" computations, which I'll define as floating point operations, as opposed to loads/stores and other bookkeeping:


  video2$ grep "Graduated floating point instructions/cycle" px*  
  
  px:           Graduated floating point instructions/cycle..  0.012382
  px.0000547100:Graduated floating point instructions/cycle..  0.013118
  px.0000554480:Graduated floating point instructions/cycle..  0.011290
  px.0000554482:Graduated floating point instructions/cycle..  0.012554
  px.0000554495:Graduated floating point instructions/cycle..  0.013405
  px.0000554496:Graduated floating point instructions/cycle..  0.006219
  px.0000554497:Graduated floating point instructions/cycle..  0.013367
  px.0000554529:Graduated floating point instructions/cycle..  0.016233
  px.0000554551:Graduated floating point instructions/cycle..  0.013081

From these results, it looks like the useful computations are pretty evenly distributed. The imbalance must be due to something else: communication, memory access patterns, I/O, etc.

If you need to restrict event counting to a specific portion of code, like a computational kernel, you can instrument your code manually using the routines in "libperfex".
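
Here's a minimal sketch of such instrumentation, assuming the Fortran
interface described in "man libperfex" (check that man page for the
exact routine names and arguments on your system). The event numbers
are those from the counter table above: event 0 is cycles, and event
21 is graduated floating point instructions.


      program kernel_counts
      integer :: start_counters, read_counters   ! libperfex functions
      integer :: ierr
      integer*8 :: c0, c1

      ! Start counting events 0 (cycles) and 21 (graduated floating
      ! point instructions) just before the section of interest.
      ierr = start_counters(0, 21)

      call compute_kernel()

      ! Read the counts accumulated since start_counters.
      ierr = read_counters(0, c0, 21, c1)
      print *, 'cycles:           ', c0
      print *, 'graduated FP ops: ', c1
      end program kernel_counts

      subroutine compute_kernel()
      integer :: i
      real*8 :: s
      s = 0.0d0
      do i = 1, 1000000
         s = s + dble(i)*dble(i)
      end do
      print *, 'sum = ', s
      end subroutine compute_kernel

Link with the library, e.g., "f90 -o a.out prog.f -lperfex".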

To learn more, see "man perfex", "man libperfex", and:

http://techpubs.sgi.com/

OpenMP Fortran 2.0 Specification Released

We were glad to see the following announcement on:

http://www.openmp.org/

11/03/2000 The Fortran 2.0 Application Program Interface is ready! Kudos to the OpenMP Fortran sub-committee for achieving this major milestone. After over two years of effort, the OpenMP ARB is pleased to present a major enhancement to the Fortran API. Major new features in the 2.0 specification include:

  • Array reductions
  • Parallelization of F90 array syntax via the WORKSHARE directive
  • COPYPRIVATE for broadcast of sequential reads
  • Privatization of module data
  • Privatization of deferred shape and assumed shape objects
  • Portable timing routines
  • Nested locks
  • Control of the number of threads for multi-level parallelism
  • Relaxed reprivatization rules

In addition to the above, several minor features have been added and many interpretations addressed and incorporated into the text.

It will be a lot easier to parallelize Fortran 90 code once the vendors implement these changes in their compilers.
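
For instance, the new WORKSHARE directive should let a team of threads
divide up the work of ordinary F90 array assignments. Here's a sketch
of the sort of code it's meant to handle (the program itself is made
up for illustration, and untested until the compilers catch up):


      program ws_demo
      real :: a(1000), b(1000), c(1000)
      b = 1.0
      c = 2.0
!$omp parallel workshare
      ! Each array assignment is divided among the threads.
      a = b + c
      c = 2.0 * a
!$omp end parallel workshare
      print *, a(1), c(1000)
      end program ws_demo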

Quick-Tip Q & A



A:[[ I use "mpirun" to launch jobs on my cluster and it's a hassle. 
  [[ The command lines get elaborate and long, I make typos and forget
  [[ various options and flags.  Any suggestions?


  From "man mpirun" on the SV1:
  -----------------------------

   Using a File for mpirun Arguments

     Because the full specification of a complex job can be lengthy, you
     can enter mpirun arguments in a file and use the -f option to specify
     the file on the mpirun command line, as in the following example:

          mpirun -f my_arguments

     The arguments file is a text file that contains argument segments.
     White space is ignored in the arguments file, so you can include
     spaces and newline characters for readability. An arguments file
     can also contain additional -f options.
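
  For example, you might put the specification for a routine job in a
  file, "my_arguments" (the process count and executable name below
  are made up for illustration):

       -np 8 ./mymodel

  and then launch it with nothing more than:

       mpirun -f my_arguments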



Q: I don't understand the difference between the OpenMP constructs: 

       !$omp master
       ...
       !$omp end master

   and

       !$omp single
       ...
       !$omp end single

   Aren't they rather redundant?  Why use "single"?

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.