ARSC HPC Users' Newsletter 387, May 30, 2008

Measuring Program Performance Using PAPI and TAU

[ By: Oralee Nudson and Ed Kornkven ]

When profiling and optimizing code, it is valuable to know just how efficiently your program is running at the hardware level. One tool available on midnight for measuring hardware statistics is the Performance Application Programming Interface (PAPI), which can be accessed through TAU. The TAU interface to PAPI gives you access to statistics such as the number of data cache misses, vector/SIMD instructions executed, and other hardware counters. The following steps describe how to run an instrumented Fortran MPI program to capture hardware statistics using TAU and PAPI. An example program can also be found on midnight in $SAMPLES_HOME/parallelEnvironment/tau_papi_counters.
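
If you'd like to experiment with the sample program before instrumenting your own code, you might first copy it into a directory of your own (the destination shown here is just an example):


% cp -r $SAMPLES_HOME/parallelEnvironment/tau_papi_counters ~/tau_papi_counters
% cd ~/tau_papi_counters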

1) Set the TAU+PAPI environment variables. If you plan on using TAU+PAPI frequently, you may want to add these environment variables to your ~/.profile. Otherwise, be sure to set these in each open shell you'll be using.

Bash and ksh users:


% export TAU_MAKEFILE=$PET_HOME/tau/x86_64/lib/Makefile.tau-multiplecounters-pathcc-mpi-papi-pdt
% export PATH=$PATH:$PET_HOME/tau/x86_64/bin
- or -

Csh and tcsh users:


% setenv TAU_MAKEFILE $PET_HOME/tau/x86_64/lib/Makefile.tau-multiplecounters-pathcc-mpi-papi-pdt
% setenv PATH $PATH:$PET_HOME/tau/x86_64/bin

2) Compile and link your program with the appropriate TAU+PAPI scripts. Compile the Fortran code using the TAU "tau_f90.sh" wrapper script, whose location was added to your $PATH in step 1. You'll also want to include a command-line option that embeds the location of the PAPI shared library in the executable: -Wl,-R/u2/wes/PET_HOME/pkgs/papi-3.5.0/lib64. Putting it all together, your compile command should look something like this:


tau_f90.sh myfile.f90 -o myfile -Wl,-R/u2/wes/PET_HOME/pkgs/papi-3.5.0/lib64  
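
For a program split across several source files, the tau_f90.sh wrapper should accept the usual separate compile-and-link flags, just like an ordinary compiler. The extra file name below is hypothetical:


% tau_f90.sh -c my_module.f90
% tau_f90.sh -c myfile.f90
% tau_f90.sh my_module.o myfile.o -o myfile -Wl,-R/u2/wes/PET_HOME/pkgs/papi-3.5.0/lib64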

3) Set the PAPI counter environment variables and execute the program. There are four hardware counters available on the AMD Opteron processors on midnight. To obtain a list of all hardware events available on the machine, run the "papi_avail" utility on a compute node. This can be done by starting an interactive job and running the executable from there:


% qsub -I
% /u2/wes/PET_HOME/pkgs/papi-3.5.0/bin/papi_avail
% exit
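
The full papi_avail listing is long. To narrow it down to, say, cache-related events, you can pipe the output through grep while still inside the interactive job:


% /u2/wes/PET_HOME/pkgs/papi-3.5.0/bin/papi_avail | grep -i cache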

Once you decide on a list of events to be measured (remember, four hardware counters is the maximum available on the midnight processors), you'll need to set the COUNTER[1-n] environment variables at run time. In this example we will be measuring "total L1 cache accesses" and "L1 data cache misses". These variables can be set in either of the following two ways:

3.1) Inside your PBS job submission script, set the environment variables on the mpirun execution command line itself:


mpirun -np 4 COUNTER1=GET_TIME_OF_DAY COUNTER2=PAPI_L1_TCA COUNTER3=PAPI_L1_DCM ./myfile
- or -

3.2) Write a script that sets the counter values and runs the executable, then launch that script from the mpirun execution command line within the PBS job submission script (here we create a "launch.sh" script):


% cat > launch.sh
    export COUNTER1=GET_TIME_OF_DAY
    export COUNTER2=PAPI_L1_TCA
    export COUNTER3=PAPI_L1_DCM
    ./myfile
    <ctrl-D>
% chmod u+x launch.sh

(Add the following to the PBS job submission script)


 mpirun -np 4 ./launch.sh

Regardless of the method used, make sure COUNTER1 is always set to GET_TIME_OF_DAY. This allows TAU to synchronize timestamps across MPI tasks.
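
Putting method 3.2 together, a PBS job submission script might look something like the sketch below. The queue name and walltime are placeholders, and the node/CPU resource request is omitted because it depends on your allocation; consult the midnight documentation for the exact directives.


#!/bin/sh
#PBS -q standard
#PBS -l walltime=1:00:00
#PBS -j oe
# (insert the node/CPU resource request appropriate for your job here)

cd $PBS_O_WORKDIR
mpirun -np 4 ./launch.sh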

4) Examine the output for each counter. For example, for the counters measured above:


pprof -f MULTI__GET_TIME_OF_DAY/profile
pprof -f MULTI__PAPI_L1_TCA/profile
pprof -f MULTI__PAPI_L1_DCM/profile
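
If you measured several counters, a small shell loop over the MULTI__* directories saves some typing (a sketch, assuming the profile directories are in the current working directory):


for dir in MULTI__*; do
    echo "===== $dir ====="
    pprof -f "$dir"/profile
done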

Portland Group 7.1.6 Compiler Suite Available on Midnight

Version 7.1.6 of the Portland Group compiler suite is now available on midnight. The new version can be used by loading either the "PrgEnv.pgi.new" module or the "PrgEnv.pgi-7.1.6" module.
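
For example, to use the new version from an interactive shell (if a different PrgEnv module is already loaded, unload or swap it out first):


% module load PrgEnv.pgi-7.1.6
% pgf90 -V

The second command should confirm that the 7.1.6 compilers are now in your path.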

Release notes for this version of the PGI compiler suite are available on the Portland Group website: http://www.pgroup.com/doc/pgiwsrn716.pdf

Note that midnight does not use the version of MPI included with the PGI workstation release, so sections of the release notes pertaining to MPI are not relevant to midnight.

Quick-Tip Q & A


A:[[ ARSC's data archive system works best when directory trees are
  [[ stored as one tar file rather than many constituent files.  
  [[ However, that makes comparing individual files in my working 
  [[ directory with my archived copy a pain.  Is there any way to 
  [[ easily synchronize my working directory with my archive directory 
  [[ when my archive is stored as a tar file?
  [[

#
# We didn't receive any solutions for this question. If you have any
# ideas on how to handle this, let us know!
#


Q: I frequently write scripts to build input files.  The way I 
   normally handle this is by using "echo" commands in the script
   to send the input file to stdout and then redirect stdout to the
   filename I would like to use.
   
   e.g.
   ./build_input > namelist.input
   
   Rather than redirecting stdout, I would like to have the output from
   echo go to a specific file (e.g. namelist.input).  Is there a way to 
   redirect stdout for a script within the script itself?  I really hate 
   having to redirect each "echo" statement individually.
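
   For context, the kind of script described might look something like
   this minimal sketch (the namelist contents are made up), which is
   then run as "./build_input > namelist.input":


   #!/bin/sh
   # build_input: write a namelist file to stdout
   echo "&run_params"
   echo "  nsteps = 100,"
   echo "/"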

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.