ARSC HPC Users' Newsletter 381, February 22, 2008

Introduction to Using TAU Profiling on Midnight

[ By: Ed Kornkven ]


TAU is a versatile and portable performance analysis toolkit. It is installed on all HPCMP machines including Midnight. In this article, we will show how TAU can be used to gather and display performance profile information for a parallel program. You might ask, "What about gprof?" The answer is that gprof is useful only for serial codes. For programs using MPI, we need another tool.

In this article, we will automatically instrument the code by recompiling using TAU tools. By "instrumenting", we mean inserting calls to TAU library routines that start and stop timers and output the timings. When the instrumented code is then executed, the profile output is stored for post-analysis. Readers who have used the standard Unix profilers prof or gprof will be familiar with this process.

An Example

For an example, we will take a slightly modified code from the samples that come with TAU -- the C program that calculates pi in parallel using MPI. The example is modified to remove explicit TAU profiling calls in order to illustrate the automatic instrumentation feature of TAU. There is a copy in the code samples directory on Midnight which is accessible through the environment variable $SAMPLES_HOME. Let's copy the whole directory containing the example to our $WORKDIR and cd to that directory:

   % cp -r $SAMPLES_HOME/parallelEnvironment/auto_tau_pi $WORKDIR
   % cd $WORKDIR/auto_tau_pi

Now let's take a look at the Makefile. With comment lines removed for brevity, here it is:

    TARGET      = cpi

    TAUMAKEFILE = Makefile.tau-pathcc-mpi-pdt
    TAUMAKEFLAGS = -tau_makefile=$(TAUROOTDIR)/x86_64/lib/$(TAUMAKEFILE)

    CC          = $(TAUROOTDIR)/x86_64/bin/tau_cc.sh
    CFLAGS      = 
    LDFLAGS     = 
    CXX         = $(TAUROOTDIR)/x86_64/bin/tau_cxx.sh
    F90         = $(TAUROOTDIR)/x86_64/bin/tau_f90.sh


    all:                $(TARGET)       

    $(TARGET):  $(TARGET).o
        $(CC) $(LDFLAGS) $(TAUMAKEFLAGS) $< -o $@ -lstdc++

    $(TARGET).o : $(TARGET).c
        $(CC) $(CFLAGS) $(TAUMAKEFLAGS) -c $<

    clean:
        $(RM) $(TARGET).o $(TARGET)


At the bottom of the Makefile you will see standard make rules for building an executable named by $(TARGET). Note that $(CC) is set to the TAU wrapper for the C compiler, "tau_cc.sh". This replaces the usual "pathcc" or "mpicc" compiler. I have added the wrappers for C++ ("tau_cxx.sh") and Fortran 90 ("tau_f90.sh") for illustration, though they aren't used to build this program. Of course, if you put the path to the wrappers in your PATH, you can dispense with specifying their full paths in the Makefile. Note also the addition of -lstdc++ on the link line. The TAU libraries themselves require C++ support, so either use the C++ compiler wrapper to link or, as we have here, add the library to the link command.

The other Makefile change to be aware of is the variable TAUMAKEFILE. This variable specifies the TAU "experiment" which determines the data that will be collected and displayed in the profile. More on experiments below. To build the program, just type "make":

   % make

Now let's execute the code. Since it is so short, we'll run it interactively instead of with a batch job. In this example we will request an X2200 node (although we could just as well choose an X4600 node) and run "mpirun" from there. After the qsub, don't forget to cd to the directory containing the executable, just as you would within a PBS script:

   % qsub -I -l select=1:ncpus=4:node_type=4way -l walltime=30:00 \
     -q standard
   % cd $PBS_O_WORKDIR
   % mpirun -np 4 ./cpi

The Profile

If all has gone well up to this point, you will see the output of the program including the approximation for pi that was computed and the "Total time" of the program. You should also see four new files (if you ran with four MPI tasks): profile.0.0.0, profile.1.0.0, profile.2.0.0, and profile.3.0.0. These will be the inputs to the profile viewer "pprof". Since pprof expects its inputs to be named this way by default, to see the resulting profile simply invoke pprof:

   % $PET_HOME/tau/x86_64/bin/pprof

With no options, pprof will display six charts: one for each MPI task, then one for the sum of all times for all tasks, and finally one for the mean time of all four tasks. There is a line for each function in the program, including MPI routines, that lists the profile data for that function:

  • %Time - cumulative percent of total execution time taken by this routine and all others below it in the table
  • Exclusive msec - time (in milliseconds) spent in this routine, excluding the time spent in its child routines
  • Inclusive msec - time spent in this routine, including the time spent in its child routines
  • #Call - number of times this routine was called
  • #Subrs - number of child routines called
  • Inclusive usec/call - inclusive time per call, in microseconds
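To make the relationships among these columns concrete, here is a toy shell calculation. The numbers are hypothetical, not from the cpi run; they simply illustrate that a routine's inclusive time is its exclusive time plus the inclusive time of everything it calls, and that usec/call is derived from the inclusive time and the call count:

```shell
#!/bin/sh
# Hypothetical numbers, just to show how the pprof columns relate.
exclusive=120        # msec spent in the routine's own body
child_inclusive=300  # msec spent in the routines it calls
calls=4              # value of the #Call column

# Inclusive time = exclusive time + children's inclusive time
inclusive=$((exclusive + child_inclusive))
echo "Inclusive msec:      $inclusive"

# Inclusive usec/call = inclusive time (in usec) / number of calls
echo "Inclusive usec/call: $((inclusive * 1000 / calls))"
```

Running this prints an inclusive time of 420 msec and 105000 usec per call.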

TAU Profiling Experiments

The particular experiment (kind of profile) that we ran in this example is indicated by the name of the TAUMAKEFILE that was specified in the Makefile: Makefile.tau-pathcc-mpi-pdt. The features of this experiment are:

  • pathcc - the PathScale C compiler
  • mpi - profile MPI calls
  • pdt - use the Program Database Toolkit for automatic profiling

Other experiments available can be found by listing the Makefiles in $PET_HOME/tau/x86_64/lib/Makefile*.

Additional TAU Capabilities

I mentioned that TAU is versatile. There are many capabilities that TAU offers that are beyond the scope of this article including:

  • Numerous options and environment variables for tailoring the behavior of TAU
  • Paraprof, an X-based viewer of profile output (to use in place of pprof)
  • The ability to add explicit profiler calls to your code to profile blocks of code, not just function calls
  • Execution tracing (instead of profiling)
  • Memory utilization reporting
  • A database and web portal for storing and analyzing multiple performance experiments.

The place to start for more information is the documentation page at the TAU web site.

There is also an example of manually inserting TAU library calls into this pi program in $SAMPLES_HOME/parallelEnvironment/manual_tau_pi.

Final Notes

  • TAU does add overhead to the execution time of the program. Expect the instrumented version to run 3 or 4 times slower than the original. There are options for factoring the overhead out of the profile.
  • If you instrument a serial code with TAU using the PAPI counters, make sure you execute the code on a Midnight compute node, not one of the login nodes, because the PAPI counters aren't available on the login nodes.

Matlab Functions on the Command Line

[ By: Don Bahls ]

I've met a number of people using Matlab to do what could be categorized as parameter sweeps. This type of work involves testing a whole bunch of different input values to determine which values work best, or are the most interesting. Matlab can take input from stdin, which can be a nice way to handle this type of operation without writing a unique Matlab script for each run in the sweep.

Here's a simple and boring matlab function that will be used to demonstrate this technique:

   mg15 % more add_them.m 
   function val=add_them(a, b)
       val=a + b

If you had to use this function on a large set of values, you could either write a matlab file to run each job, or you could move some of the complexity to a script. Here's a simple example showing how to use a here-document to call a matlab function:

   mg15 % matlab -nodisplay << EOF
   add_them(1.2, 3.4)
   EOF
                                 < M A T L A B >
                     Copyright 1984-2007 The MathWorks, Inc.
   val =
       4.6000
   ans =
       4.6000

If Matlab is run serially, there really isn't any benefit to moving this complexity from a Matlab script to a shell script; however, if you run multiple copies of Matlab at the same time (as was shown in issue 378), this method becomes more useful. An adapted version of the script from issue 378 is below:


   #!/bin/bash
   #PBS -q standard
   #PBS -l select=1:ncpus=4:node_type=4way
   #PBS -l walltime=6:00:00
   #PBS -j oe

   # The values below could be read from a file instead.
   # Here a bash array is used to make it easy to index values.
   A[0]=1.2;   B[0]=2.0
   A[1]=2.4;   B[1]=4.0
   A[2]=6.4;   B[2]=8.0
   A[3]=2.4;   B[3]=16.0

   # You could also use this syntax if you prefer:
   # A=(1.2  2.4  6.4  2.4 )
   # B=(2.0  4.0  8.0  16.0 )

   # Load the matlab module
   . /usr/share/modules/init/bash
   module load matlab

   # start 4 matlab processes in the background,
   # one for each processor on the node.
   for p in 0 1 2 3; do 
       # start matlab in the background, passing it a here-document
       # with the parameters for this processor.
       matlab -nodisplay << EOF > output$p.out 2>&1 &
       add_them( ${A[$p]}, ${B[$p]} )
   EOF
   done

   # issue the "wait" command so that the shell will pause until
   # all the background processes have completed.
   wait
   # end of script

The shell will expand the shell variables within the here-document before Matlab ever sees it. For instance, in the example above, when p is 0, the function call:

   add_them( ${A[$p]}, ${B[$p]} )

will be expanded to:

   add_them( 1.2, 2.0 )
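If you want to verify what Matlab will actually receive, one quick trick is to substitute "cat" for the matlab command: the shell performs exactly the same expansion on the here-document, and cat simply echoes the result. A minimal sketch (the array values are the same as in the script above):

```shell
#!/bin/bash
# Same here-document expansion as in the PBS script, but fed to cat
# instead of matlab so we can see the text matlab would receive.
A=(1.2 2.4 6.4 2.4)
B=(2.0 4.0 8.0 16.0)
p=0
cat << EOF
    add_them( ${A[$p]}, ${B[$p]} )
EOF
```

Running this prints "    add_them( 1.2, 2.0 )", confirming that the expansion happens before the program on the other end of the here-document ever runs.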

You can take this one step further and write a wrapper so you can simply pass input parameters on the command line:


   mg15 % cat run_model.bash
   #!/bin/bash

   # prints a usage message to stderr.
   function usage()
   {
       echo "run_model.bash  arg1 arg2" 1>&2
       echo "   arg1 is the first parameter to add_them" 1>&2
       echo "   arg2 is the second parameter to add_them" 1>&2
   }

   # Very simple error checking (are there 2 command line args?)
   if [ $# != 2 ]; then
      usage
      exit 1;
   fi

   # This starts matlab passing the first and second command line 
   # arguments to the add_them function.
   matlab -nodisplay << EOF
       add_them( $1, $2 )
   EOF
   # NOTE: This example does not redirect stdout or stderr.

The PBS script above can be modified to use the new version by simply changing the for loop.

   for p in 0 1 2 3; do 
       ./run_model.bash ${A[$p]} ${B[$p]} > output$p.out 2>&1 &
   done
   wait

As was mentioned in issue 378, it's not a good idea to run more tasks than there are processors, so keep that in mind if you decide to try this technique.
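One way to honor that advice when there are more parameter sets than processors is to launch the work in batches and "wait" between batches. Here is a generic sketch, with "echo"/"sleep" standing in for a real command such as run_model.bash, that keeps at most four background tasks in flight at once:

```shell
#!/bin/bash
# Sketch: process 8 parameter sets on a hypothetical 4-processor node,
# running at most 4 background tasks at a time.
NPROC=4
running=0
for p in 0 1 2 3 4 5 6 7; do
    # A placeholder task; a real script would run something like
    # ./run_model.bash here.
    ( sleep 0.1; echo "task $p done" ) &
    running=$((running + 1))
    # Once NPROC tasks are in flight, wait for the whole batch.
    if [ $running -ge $NPROC ]; then
        wait
        running=0
    fi
done
wait   # catch any remaining background tasks
```

This batch-wise approach is simple but coarse: a batch finishes only when its slowest member does. It is good enough when the tasks take roughly similar time.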

Long Term Storage Best Practices

If you've used ARSC systems for a while, you have probably used your $ARCHIVE_HOME directory (also known as $ARCHIVE) to store something. You may have also noticed that while long term storage has significant capacity, it doesn't always act like a standard filesystem. The fact that files can be offline (i.e. only on tape) changes the rules a bit.

Towards that end, we have come up with a list of best practices to consider when using long term storage:


Quick-Tip Q & A

A:[[ I am currently debugging my code by submitting jobs to the "debug"
  [[ queue on midnight.  To properly test my PBS scripts, I need to 
  [[ submit a handful of jobs at a time to see if each job will pick up 
  [[ where the previous job left off.  I often find a bug early in the 
  [[ job sequence, resulting in a handful of test jobs still queued 
  [[ without any real purpose.  Rather than waiting for each of these 
  [[ jobs to finish, I have been using the "qdel" command to cancel 
  [[ them.  
  [[ For example:
  [[   % qdel 123456 123459 123479 123480
  [[ Depending on the number of debug jobs still queued, it can become
  [[ tedious to type or copy and paste each of these job IDs for every
  [[ iteration of my debugging efforts.  Is there an easier way to 
  [[ delete all of my jobs in the debug queue?

# Thanks to Ed Kornkven for this solution.

I like one-line solutions that I can turn into aliases if I end up
performing the operation frequently.  Here is one for the sh family
of shells:

qdel $(qstat -u $USER | grep debug | cut -d ' ' -f 1 | tr '\n' ' ')

The commands inside the $( ) are getting all my PBS job IDs and
reformatting them to appear on a single line.  The "cut" relies
on the fact that "qstat" on Midnight outputs job IDs in the form
"123456.mpbs1" at the beginning of the line.  If using this on another
machine, make sure your "qstat" behaves the same.  You could also use
a period as a delimiter and lop off the ".mpbs1".  The "grep" is for
filtering out the header that "qstat" outputs and for choosing only
the jobs in the debug queue.  Finally, "tr" turns the one-per-line
"qstat" output into a single line and the resulting list is then fed
to qdel.   If I don't have any jobs in the debug queue, the list is
going to be empty, which "qdel" is going to interpret as an error,
giving the message:

   usage: qdel [-W delay|force] job_identifier...

which is fine with me.
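A variant that sidesteps the empty-list complaint is to let awk do both the filtering and the field extraction, then hand the IDs to qdel via xargs; with GNU xargs' -r flag, nothing is run at all when the list is empty:

   qdel-style pipeline:  qstat -u $USER | awk '/debug/ {print $1}' | xargs -r qdel

The sketch below demonstrates just the filtering stage on canned qstat-style output (the job IDs and queue names are made up for illustration):

```shell
#!/bin/bash
# Simulated qstat output; the real command would be "qstat -u $USER".
# awk keeps only lines mentioning the debug queue and prints field 1,
# the job ID.
printf '%s\n' \
  '123456.mpbs1  user  debug     testjob1' \
  '123457.mpbs1  user  standard  bigjob' \
  '123458.mpbs1  user  debug     testjob2' |
awk '/debug/ {print $1}'
```

This prints only the two debug-queue IDs, 123456.mpbs1 and 123458.mpbs1. As with the cut-based version, verify your machine's qstat column layout before relying on it.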

Q: I have an MPI code and would like to find out how long each MPI_Send
   call is taking.  The first way I thought of doing this was adding
   MPI_Wtime() calls before and after the MPI_Send and printing out the
   difference.  Is there a better way to do this?  A method that doesn't
   involve me changing my code a whole lot would be preferable!!

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions and Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.