ARSC HPC Users' Newsletter 385, April 25, 2008



TAU Profiling on Midnight - Part II

[ By: Ed Kornkven, with thanks to Sameer Shende for much helpful feedback on this article ]


In issue 381, we showed the steps required to profile an MPI program using automatically generated TAU profiler calls. In this article we will first discuss the automatic profiling of loops and then show how to manually add profiler calls to instrument arbitrary sections of code. We will finish with an introduction to TAU instrumentation overhead.

Automatic Loop Instrumentation

We saw previously how easy it is to instrument function calls with TAU. Instrumenting loops is nearly as easy using TAU's "select file" option. To tell TAU that we want to instrument loops in the main() function of our source file cpi.c, we create a configuration file (we'll call it select.tau) that looks like this:

    loops file="cpi.c" routine="main"

To build a program so instrumented, the basic idea remains the same: set the Makefile variable TAUMAKEFILE to the name of the TAU stub Makefile corresponding to the kind of instrumentation we want to do. We are going to use Makefile.tau-pathcc-mpi-pdt again. This time we also have to tell the TAU compiler to use our configuration file by adding the option "-optTauSelectFile=select.tau". Our TAUMAKEFLAGS variable now includes the new option:

    TAUMAKEFLAGS = -optTauSelectFile=select.tau

This example is set up in the code samples directory on Midnight. Let's fetch the sample code into our work directory, compile and run it, and view the profile:

   % cp -r $SAMPLES_HOME/parallelEnvironment/autoloops_tau_pi $WORKDIR
   % cd $WORKDIR/autoloops_tau_pi
   % make
   % qsub -I -l select=1:ncpus=4:node_type=4way -l walltime=30:00 \
          -q standard
   # That placed us on an X2200 compute node, once one is available.
   %% cd $PBS_O_WORKDIR
   %% mpirun -np 4 ./cpi
   %% $PET_HOME/tau/x86_64/bin/pprof

The profile displayed by pprof looks a lot like the one we created in the last installment, but with the addition of three entries corresponding to the three loops in file cpi.c. For example, one of the loop entries is labeled:

    Loop: int main(int, char **) C [{cpi.c} {58,5}-{62,5}]

This is the entry for the loop on lines 58-62 in cpi.c. Pretty straightforward -- we can match timings for loops to the source code, just as we did for function calls. The profile shows the same information as for calls, including the number of times the loop was executed, the number of subroutine calls made from the loop, and the inclusive and exclusive time the loop used.

Manual Instrumentation

If TAU will insert profiling calls automatically, why would anyone want to insert them manually? After all, inserting profiling means changing the code, which is time-consuming and error-prone. Furthermore, once those instructions are inserted into the code, they will have to be removed later (or otherwise dealt with) in order to run the program without profiling. Automatic instrumentation has a limitation, however: you only get the profiling that the automatic instrumentor knows how to provide. To instrument arbitrary code sections, we will manually tell TAU what to instrument.

An Example

We have another version of our trusty pi program, this time with explicit calls to TAU timer routines. Again, there is a copy in the code samples directory on Midnight which we will run from our work directory:

   % cp -r $SAMPLES_HOME/parallelEnvironment/manual_tau_pi $WORKDIR
   % cd $WORKDIR/manual_tau_pi

The Makefile has a few changes from the previous example, but let's first take a look at one (simplified) API for doing manual TAU instrumentation in an excerpt from our new version of cpi.c:

#include <TAU.h>

double f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int main(int argc, char* argv[]) {
    int i, n, myid, numprocs, namelen;
    double mySum, h, sum, x;
    double startwtime, timePi, timeE, time1;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    TAU_START("main-init()");    /* <<<<<<<<<<<<<<<<<<<<< */

    if (argc > 1) {
        sscanf(argv[1], "%d", &n);
    } else {
        n = 1000000;
    }

    TAU_STOP("main-init()");    /* <<<<<<<<<<<<<<<<<<<<< */

    /* ... */

    /* Calculate pi by integrating 4/(1 + x^2) from 0 to 1. */
    h   = 1.0 / (double) n;
    sum = 0.0;
    for (i = myid + 1; i <= n; i += numprocs) {
        x = h * ((double)i - 0.5);
        sum += f(x);
    }
    mySum = h * sum;
    MPI_Reduce(&mySum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    /* ... */

There are three added lines that we use to manually time an arbitrary section of code. We first #include TAU.h. Then, we add calls to macros called TAU_START() and TAU_STOP() to explicitly start and stop the timer for the section of code to be instrumented. We can have multiple (possibly nested, but non-overlapping) timers, each one identified by the string parameter passed to TAU_START() and TAU_STOP(). That same string also labels the timer output in the resulting profile.

Running the Program

We build and execute the program in the usual way:

   % make
   % qsub -I -l select=1:ncpus=4:node_type=4way -l walltime=30:00 \
          -q standard
   # Now we're on a compute node...
   %% cd $PBS_O_WORKDIR
   %% mpirun -np 4 ./cpi

There are a couple of changes to the Makefile we introduced in Part I. Again, we set the variable TAUMAKEFILE to Makefile.tau-pathcc-mpi-pdt. Now we need to add options to appropriately use the TAU calls at compile and link times. First, we specify the path to TAU.h using the "-I" compiler option in the variable CFLAGS. Second, we move the reference to TAUMAKEFLAGS to the definition of CC so that when we disable TAU profiling (discussed below), those flags aren't used.

The Profile

After running the program, we again display the resulting profile with pprof.

   %% $PET_HOME/tau/x86_64/bin/pprof

The meaning of the columns in the pprof output was given in Part I (issue 381). The difference between the profile of the automatically instrumented code and this one with explicit TAU calls is the additional profile entry for main-init(), our explicitly instrumented initialization code.


There is a potential problem with profiling short routines that are called many times -- the instrumentation instructions themselves can take a relatively large portion of the total execution time, possibly confusing the profile results. The function f() in the pi code is an example of this. Our profile shows that f() was called 250,000 times and consumed approximately 1.5 seconds. Its caller, main(), used about the same amount of time, presumably by making all those calls. Starting and stopping the (automatic) timer for all those calls is proving to be expensive. TAU has a mechanism for dampening the overhead effect of many calls which is activated by the environment variable TAU_THROTTLE. With TAU_THROTTLE turned on, if a function executes more than 100,000 times and has an inclusive time per call of less than 10 microseconds, profiling of that function will be disabled after the 100,000-call threshold is reached. (These thresholds can be modified using the environment variables TAU_THROTTLE_NUMCALLS and TAU_THROTTLE_PERCALL -- see the TAU User's Guide.)

Let's run the experiment again, this time with TAU_THROTTLE.

    %% mpirun -np 4 TAU_THROTTLE=1 ./cpi
    %% pprof

We see now that the time reported for f() has dropped to 0.6 seconds, the time for main() is down to 0.6 seconds, and the total time to run the program is reduced from 3.2 seconds to 1.4 seconds. Notice too that the number of calls to f() is reported as 100,001 -- TAU_THROTTLE is doing its job, reducing the overhead of TAU. This isn't a particularly good example of the benefit of TAU_THROTTLE: f() is very simple and is called relatively many times, so we still see a lot of overhead. TAU has other, more advanced strategies for compensating for instrumentation overhead that we aren't going to cover here, but the point for now is that overhead can be an issue when instrumenting real codes and TAU has methods for dealing with it.

Disabling TAU Instrumentation

Suppose that we now want to run this manually instrumented program without TAU instrumentation. We could edit the code to remove the three lines we inserted. But in a real application, we might have instrumented the code in several places. Finding those locations and removing code is an error-prone bother. Fortunately there is a much easier way. Going back to the Makefile, we see these two lines:

    # Uncomment the following line to compile with TAU disabled.
    #CC     = mpicc -DTAU_DISABLE_API

Try it. Edit the Makefile, or use the copy in Makefile.disable_TAU, do a "make clean" to get rid of the previous executable and object files, and rebuild the program:

    % make clean
    % make -f Makefile.disable_TAU

Notice the commands that were executed:

    mpicc -DTAU_DISABLE_API -I/u2/wes/PET_HOME/tau/include -c cpi.c
    mpicc -DTAU_DISABLE_API  cpi.o -o cpi -lstdc++

By uncommenting that line in the Makefile, we changed the compiler from the TAU wrapper script to the ordinary "mpicc", and we defined the preprocessor symbol "TAU_DISABLE_API". This definition causes the macros TAU_START() and TAU_STOP() to be #defined as null statements, so they don't have to be edited out of the code in order to run the uninstrumented program.

In Closing

These two articles have attempted to distill the power and complexity of TAU into a simple and repeatable approach for getting basic profile information from MPI-based codes. There is plenty more to say about TAU, and I plan future articles covering topics such as instrumenting OpenMP codes, using TAU with PAPI counters, doing execution traces and more discussion of the pprof and paraprof reporting tools. Feedback from readers, sharing stories of success or otherwise, is welcomed.


Using PAPI with PMPI for Performance Measurement

[By: Don Bahls]

We have done a few articles on PAPI and PMPI in the past. This week we have a short example code that combines the two libraries to provide a basic performance analysis tool for MPI applications.

The idea is pretty simple -- most MPI applications spend the majority of their runtime between the MPI_Init and MPI_Finalize calls. So if we start a PAPI timer when MPI_Init is called and stop it when MPI_Finalize is called, we get a ball-park estimate of the floating point performance of each MPI task.

In issue 373, we showed how to get the FLOP/s rate for a section of code using the PAPI_flops routine. In this example the same basic code is used to get the performance of each task.

Here's the code:

mg56 % cat mpi_timers.c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <papi.h>

/* function prototypes */
void toggle_timers(int first);

int MPI_Init(int *argc, char ***argv)
{
  toggle_timers(1);    /* start the timer */
  return PMPI_Init(argc,argv);
}

int MPI_Finalize()
{
  toggle_timers(0);    /* stop the timer and print out performance
                          stats */
  return PMPI_Finalize();
}

void toggle_timers(int first)
{
  int retval;               /* return value for PAPI calls */
  static float rtime;       /* total real time since the first
                               PAPI_flops() call */
  static float ptime;       /* total process time since the first
                               PAPI_flops() call */
  static long long flpops;  /* total floating point instructions or
                               operations since the first call */
  static float mflops;      /* Mflop/s achieved since the previous call */

  if ( first == 1 ) {
    retval=PAPI_flops(&rtime, &ptime, &flpops, &mflops);
  } else {
    int size, rank, clock_rate, ii;
    double flops_per_clock=2;
    retval=PAPI_flops(&rtime, &ptime, &flpops, &mflops);
    clock_rate=PAPI_get_opt(PAPI_CLOCKRATE, NULL);   /* clock rate, MHz */
    retval=PMPI_Comm_size(MPI_COMM_WORLD, &size);
    retval=PMPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for(ii=0; ii<size; ++ii) {
      if ( rank == ii ) {
        double util;
        util=100.0 * mflops / ( (flops_per_clock) * (double) clock_rate );
        if ( rank == 0 )
          fprintf(stderr, "PAPI: task; real time; mflops; utilization %%\n");
        fprintf(stderr, "PAPI: %d; %f; %f; %f %%\n",
                        rank, rtime, mflops, util);
      }
      PMPI_Barrier(MPI_COMM_WORLD);   /* keep the output in task order */
    }
  }
}

Here the toggle_timers routine does nearly all of the work. It initializes the PAPI timers when MPI_Init is called and prints out the results when MPI_Finalize is called. Rather than use global variables to store the timer values, static variables are used for rtime, ptime, flpops and mflops.

This version improves on the PAPI example in issue 373 by including a floating point utilization rate. The utilization rate is calculated by dividing the measured Mflop/s rate by the peak rate, i.e., the clock rate multiplied by the number of floating point operations that can be performed per clock cycle (denoted by the variable flops_per_clock). In the case of a dual-core Opteron this value is 2, while newer AMD quad-core processors can perform 4 floating point operations per clock cycle. Check your processor documentation if you aren't sure about this value. We also need to know the clock rate to determine the floating point utilization rate. Fortunately, the PAPI C API has a call which returns the clock rate of a processor, so there is no need to hard-code that value.



The Makefile is similar to the PAPI example from issue 373, but in this case we generate a shared library as well.

mg54 % cat Makefile 
CFLAGS=-I$(PAPI)/include -L$(PAPI)/lib64 -Wl,-R $(PAPI)/lib64 \
   -lpapi -lperfctr -Wall

mpi_timers.so: mpi_timers.o
        $(CC) $(CFLAGS) $^ -shared -o $@

mpi_timers.o: mpi_timers.c
        $(CC) $^ $(CFLAGS) -c

Running "make" will yield a shared library called "mpi_timers.so". If you happen to be using an MPI stack which uses shared libraries, you can get performance numbers without recompiling simply by setting the LD_PRELOAD environment variable on a Linux-based system.


mg54 % mpirun LD_PRELOAD=$PWD/mpi_timers.so ./a.out
PAPI: task; real time; mflops; utilization %
PAPI: 0; 1.919473; 1399.003174; 26.770057 %
PAPI: 1; 1.918528; 1417.304199; 27.120249 %
PAPI: 2; 1.918332; 1417.622192; 27.126334 %
PAPI: 3; 1.919480; 1402.978394; 26.846123 %
Cleaning up all processes ...

ARSC Summer Tours start June 4th

We will be conducting tours in the Discovery Lab once again this summer. This is a great introduction to ARSC. Reservations are not required, but the tour size is limited to about 20 people on a first-come, first-served basis.

ARSC Summer Tours, 2008: June 4-August 27: Wednesdays, 1 PM

For more information contact or call 450-8600.


Quick-Tip Q & A

A:[[ I just generated a bunch of image files that are sequentially named:
  [[ image1.png, image2.png, ..., image100.png.  Unfortunately when I 
  [[ do an "ls" on the directory, the files aren't sorted the way I 
  [[ would like.
  [[ mg56 % ls -1
  [[ image100.png
  [[ image10.png
  [[ image11.png
  [[ image12.png
  [[ image13.png
  [[ image14.png
  [[ ...
  [[ ...
  [[ How can I rename the files with padded zeros so the files show up
  [[ in numerical order?  For example, I would like the file image1.png
  [[ to become image001.png and image33.png to become image033.png, etc.
  [[ Show me how to do this, shell gurus!

# The most popular answer was the rename utility.  Thanks to Bill 
# Homer, Jed Brown, Anton Kulchitsky, and Alec Bennett for sharing 
# this solution.

Paraphrasing the man page for rename only slightly:

   For example, given the files image1.png, ..., image9.png,
   image10.png, ..., image278.png, the commands

          rename image image0 image?.png
          rename image image0 image??.png

   will turn them into image001.png, ..., image009.png,
   image010.png, ..., image278.png.

# Bill Homer also shared this Perl solution.

If the names aren't quite so regular, here's a perl script solution:

  ./my_rename *.png


% cat my_rename
#!/usr/bin/perl
use English;
map {rename $_, $PREMATCH.sprintf("%03d",$MATCH).$POSTMATCH
     if /\d+/;} @ARGV;

# Thanks to Martin Luthi for this python based solution which will work
# on Macs, Unix and Windows.

Here comes the inevitable Python solution. It's not in its tersest form,
for the sake of readability, but you should get the idea. As an added
benefit, the script runs on any platform (Mac, Windows, Unix, ...).
For that reason the libraries os.path and shutil were used instead of
the usual string operations.

#!/usr/bin/env python

import glob, os, shutil

for filename in glob.glob('image*.jpg'):
    path, fname = os.path.split(filename)
    newname = os.path.join(path, 'image%03d.jpg' % (int(fname[5:-4])) )
    shutil.move(filename, newname)

The hard-to-parse 'image%03d.jpg' contains a C-format template '%03d',
meaning a three-digit integer '%3d' padded with leading zeros.

For my personal use I rename all images with the EXIF date/time which
I obtain using jhead (one could also use PIL, the Python Imaging
library).  The images are then distributed into directories named
after year/month, like 2008/03.

# Dale Clark came up with this perl one-liner.

perl -e 'for (`ls i*`) { /(\D+)(\d+)(\S+)/ && rename "$1$2$3",sprintf "%s%03d%s",$1,$2,$3}'

# Jed Brown suggested two alternatives to rename.

A more flexible tool is `prename', which is packaged with Debian's perl
but is available on other platforms as well.  The following replaces the
first decimal number in each name with its 3-digit zero-padded
representation.

$ prename 's/\d+/sprintf("%03d",$&)/e' *.png

Tip: Use prename with `-n' to see what changes it will make without
actually making them.

Yet another method is to load the directory into emacs using
wdired-mode, then use interactive search-and-replace as well as macros
to make your changes.

# Ryan Czerwiec shared this solution

Here's a fairly generic solution:

#!/bin/csh -f
set root = $1
set ext = $2
set nf = `ls -1 ${root}*${ext} | wc -l`
set ndigits = `echo "scale = 0 ; l($nf)/l(10) + 1" | bc -l`
set index = 1
while ( $index <= $nf )
  set newnum = `printf "%${ndigits}i" $index | tr ' ' '0'`
  if ( $%index < $ndigits ) mv $root$index.$ext $root$newnum.$ext
  @ index++
end

The script takes the file name root and extension as its arguments,
i.e., "./script.csh image png" in this case.
I've had it compute the number of digits to use rather than have it be
another argument, and a lot more could be automated based on your
specific situation: whether the incremented number is the only numeric
part of the file name, whether there's only one dot (this example
assumes one, before the extension), whether the current directory
contains any files other than those that should be processed, etc.
But just typing in two filename parts isn't much bother, and further
automation quickly has diminishing returns.  You might need to
substitute \mv for plain mv, which may behave like the commonly
aliased 'mv -i' and bog you down with questions.

# Robert Linzell suggested the following.

I do this sort of thing routinely.  In this case, awk would probably be
the best way to extract the numerals and print a formatted file name.
The choice of shell determines the looping syntax.  In bash:

for f in image*; do
   n=`echo $f | awk '{i1=6;i2=index($1,".")-1;printf("image%3.3d.jpg", substr($1,i1,(i2-i1)+1))}'`
   echo $f to $n
   mv $f $n
done

image101.jpg to image101.jpg
mv: `image101.jpg' and `image101.jpg' are the same file
image10.jpg to image010.jpg
image1.jpg to image001.jpg
image2.jpg to image002.jpg
image3.jpg to image003.jpg

# Kurt Carlson shared this ksh based solution.

A ksh method:

typeset -R3 X
integer J; integer I=1
while [ I -lt 100 ]; do
   J=1000+$I; X=$J
   if [ -f image$I.png ]; then
      mv image$I.png image$X.png
   fi
   I=I+1
done

# Ed Kornkven shared an alternative ksh version.

Here is a ksh solution which has been generalized a little to work
for any file names with embedded numbers that begin with an alphabetic
character:

typeset -Z3 num
for x in [a-zA-Z]*[0-9]*
do
   head=${x%%[0-9]*}
   tail=${x##*[0-9]}
   num=${x#$head}; num=${num%$tail}
   y=$head$num$tail
   #echo "head = $head, tail = $tail, num = $num"
   echo "mv $x $y"
done

Use a for loop to process all the files that begin with a letter and
contain a digit.  The typeset command declares the variable "num" with
a width of 3 digits and zero padding.  The rest of the assignments use
Korn shell pattern operators to extract the parts of a file name and
then reconstitute them into variable "y" with the number part padded
with zeros.

Q: #
   # Thanks to Ryan Czerwiec for this week's question!

   I often use a driver script to set up and run FORTRAN executables
   since each is better at doing particular things.  Usually the script
   will just execute things before and after the executable, but
   sometimes there's something I need done in the middle of the FORTRAN
   execution that is more easily done by a script, so I'll use the "call
   system" syntax.  On one occasion, I needed the FORTRAN to do some I/O
   through both standard methods and system calls, and it needed to be
   done in a particular order.  However, the compiled executable
   insisted on doing all the system call read/writes before all of the
   standard ones (or vice versa, it's been long enough that I don't
   remember) regardless of the order they were written, or any stall
   tactics I inserted to try and let these statements complete (dummy do
   loops, sleep commands, etc.).  Is there a way to enforce a desired
   order to these things?

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.