ARSC HPC Users' Newsletter 275, August 22, 2003

IBM xprofiler and Optimization

[ Thanks to Kate Hedstrom of ARSC. ]

During a code tuning exercise with IBM ACTC visitors, I tried xprofiler, a GUI version of gprof. To use xprofiler, compile your code with the "-g -pg -qfullpath" options. The -g flag does not conflict with optimization on the IBM. Run your code as normal and it will write a gmon.out file. I was running a serial code on ferry, the interactive part of iceflyer, and my initial xlf flags were:

 -O3 -qstrict -qarch=pwr4 -qtune=pwr4 -qdpc         \
 -qflttrap=enable:invalid:imprecise -u -qzerosize 

These flags generate Power4 instructions, force a core dump on NaN, and promote constants to double precision.

Start up xprofiler and ask it to load your executable and the gmon.out file [File->Load Files]. You should see a lot of little graphs. Go to [Filter->Uncluster Functions] and [Filter->Hide All Library Calls]. Now you should have a call graph of your program, with the height and width of the boxes representing the time spent in the functions. The width is the time in the routine plus all children; the height is the time in that routine alone. Go to [View->Zoom In] to see the names of the boxes. A right-click on a box will bring up a menu, giving you the option of looking at the source code. You will be able to see the relative counts for each line of code (the higher the number, the more time spent there). You can also do [Report->Flat Profile] to get a gprof-like listing.

I started with a 2-D setup of FVCOM (Finite Volume Coastal Ocean Model) taking 240 seconds to run. Using xprofiler, we noticed that pow, sin, and cos were taking a lot of time. The first optimization was to use the MASS library, which provides faster, less accurate versions of many intrinsic functions, including these. Simply add -lmass to the load step when compiling. That alone sped up the code to 130 seconds. There is also a vector version of MASS (-lmassv), which requires some rewriting of the function calls in the code for an even greater speedup.

Before taking that step, we looked at the calls to determine in more detail what was happening.

The pow calls were being generated by (term)**0.5, which can be rewritten as sqrt(term). That change alone got us down to 100 seconds since sqrt is a hardware instruction on the Power4 processor. Looking at the calls to sin and cos, we found that they were being performed every timestep on angles which remain constant. Pre-computing the sin/cos values dropped the execution time to 80 seconds.
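The two rewrites described here can be sketched in a few lines of Fortran. This is only an illustration of the pattern, not FVCOM code: the module, the routine and array names, the sizes, and the update formula are all made up for the example.

```fortran
! Hypothetical illustration only: names, sizes, and the update formula
! are assumptions, not taken from FVCOM.
module hoist_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Before: **0.5 (which xlf turned into a pow call), and sin/cos of
  ! constant angles recomputed on every timestep.
  subroutine run_before(n, nsteps, angle, term, u)
    integer, intent(in) :: n, nsteps
    real(dp), intent(in) :: angle(n), term(n)
    real(dp), intent(out) :: u(n)
    integer :: i, k
    u = 0.0_dp
    do k = 1, nsteps
      do i = 1, n
        u(i) = u(i) + term(i)**0.5_dp * (cos(angle(i)) + sin(angle(i)))
      end do
    end do
  end subroutine run_before

  ! After: sqrt replaces **0.5, and sin/cos are computed once before
  ! the time loop because the angles never change.
  subroutine run_after(n, nsteps, angle, term, u)
    integer, intent(in) :: n, nsteps
    real(dp), intent(in) :: angle(n), term(n)
    real(dp), intent(out) :: u(n)
    real(dp) :: trig(n)
    integer :: i, k
    do i = 1, n
      trig(i) = cos(angle(i)) + sin(angle(i))
    end do
    u = 0.0_dp
    do k = 1, nsteps
      do i = 1, n
        u(i) = u(i) + sqrt(term(i)) * trig(i)
      end do
    end do
  end subroutine run_after

end module hoist_demo

program hoist_check
  use hoist_demo
  implicit none
  integer, parameter :: n = 4, nsteps = 100
  real(dp) :: angle(n), term(n), u_before(n), u_after(n)
  integer :: i
  do i = 1, n
    angle(i) = 0.1_dp * real(i, dp)
    term(i) = real(i, dp)
  end do
  call run_before(n, nsteps, angle, term, u_before)
  call run_after(n, nsteps, angle, term, u_after)
  ! The two versions should agree to within rounding error.
  print *, 'max abs difference = ', maxval(abs(u_before - u_after))
end program hoist_check
```

The driver prints the largest difference between the two versions, which is a quick way to confirm that a rewrite of this kind has not changed the answers beyond rounding.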

When I described this optimization to the FVCOM development team, they replied that the sin/cos computations were not necessary at all, and a simpler form could be used. Cleaning that up and converting term/dz to term*(stored 1/dz) got us down to 75 seconds.
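The term/dz change follows the same idea of moving work out of the time loop. A minimal sketch, assuming dz does not change between timesteps; the routine and array names (run_divide, run_multiply, rdz, and so on) are hypothetical and not from FVCOM:

```fortran
! Hypothetical illustration only: names and sizes are assumptions,
! not taken from FVCOM.
module recip_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Before: one divide per point per timestep.
  subroutine run_divide(n, nsteps, dz, term, f)
    integer, intent(in) :: n, nsteps
    real(dp), intent(in) :: dz(n), term(n)
    real(dp), intent(out) :: f(n)
    integer :: i, k
    f = 0.0_dp
    do k = 1, nsteps
      do i = 1, n
        f(i) = f(i) + term(i) / dz(i)
      end do
    end do
  end subroutine run_divide

  ! After: 1/dz is stored once, and the time loop only multiplies.
  subroutine run_multiply(n, nsteps, dz, term, f)
    integer, intent(in) :: n, nsteps
    real(dp), intent(in) :: dz(n), term(n)
    real(dp), intent(out) :: f(n)
    real(dp) :: rdz(n)
    integer :: i, k
    rdz = 1.0_dp / dz
    f = 0.0_dp
    do k = 1, nsteps
      do i = 1, n
        f(i) = f(i) + term(i) * rdz(i)
      end do
    end do
  end subroutine run_multiply

end module recip_demo

program recip_check
  use recip_demo
  implicit none
  integer, parameter :: n = 4, nsteps = 100
  real(dp) :: dz(n), term(n), f_div(n), f_mul(n)
  integer :: i
  do i = 1, n
    dz(i) = 0.5_dp * real(i, dp)
    term(i) = real(i, dp)
  end do
  call run_divide(n, nsteps, dz, term, f_div)
  call run_multiply(n, nsteps, dz, term, f_mul)
  ! Multiplying by a stored reciprocal can differ from a true divide
  ! in the last bits, so compare rather than assume equality.
  print *, 'max abs difference = ', maxval(abs(f_div - f_mul))
end program recip_check
```

Because term*(1/dz) is not guaranteed to round the same way as term/dz, it is worth comparing old and new output after a change like this.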

The code is now three times faster than the original, and the next step will be to profile it with a 3-D problem.

The results look equivalent to the original results using the "eyeball norm". A diff of old and new shows some differences at about 1.0e-6, which could be explained by getting rid of the extra sin/cos nonsense.

All in all, this effort was well worth the time. It took a day to learn xprofiler and make these changes. Note that I didn't have to try the vector MASS library at all, since the transcendental library calls were removed.

However, if you have a significant number of these calls, it could be worth looking into - a good example of when to use conditional compilation:

  #ifdef MASSV
        call vsin(b, a, 100)
  #else
        do i=1,100
          b(i) = sin(a(i))
        end do
  #endif

# Editor's note:
#
# xlf will automatically convert loops like that given above to vector
# intrinsic function calls, if you give it the -qhot option. When using
# aggressive optimization on any compiler, be sure to validate results.
#
# For a similar case study, see:
#   "Optimizing with IBM Vector Intrinsics and xlf -qhot"
# in issue #250.

Redbook on IBM Performance and Optimization Tools

[ From Jim Long of ARSC: ]

The "AIX 5L Performance Tools Handbook" just came out last week. See:

This IBM Redbook takes an insightful look at the performance monitoring and tuning tools that are provided with AIX 5L. It discusses the use of the tools as well as the interpretation of the results in many examples.

This book is meant as a reference for system administrators and AIX technical support professionals so they can use the performance tools efficiently and interpret the outputs when analyzing AIX system performance.

A general concept and introduction to the tools is presented to introduce the reader to the process of AIX performance analysis.

The individual performance tools discussed in this book fall into these categories:

  • Multi-resource monitoring and tuning tools
  • CPU-related performance tools
  • Memory-related performance tools
  • Disk I/O-related performance tools
  • Network-related performance tools
  • Performance tracing tools
  • Additional performance topics, including performance monitoring API, Workload Manager tools, and performance toolbox for AIX.

Conditional Compilation: Part II

[ Second in a 2 part series, contributed by Kate Hedstrom of ARSC. ]

Last time, we covered conditional compilation with the C preprocessor, cpp. This time, we're going to cover the new coco part of the latest Fortran standard. The draft description of coco in the standard is on the web at:

as chunk N1306 under the electronic archives. They explicitly want this to be conditional compilation only, not a macro processor such as cpp or m4. Like cpp, coco has an include statement.

Coco style:

A coco command has a '??' for the first two characters on the line. The rest of the syntax is meant to be Fortran-like:

  call xx

?? integer, parameter :: XX = 1
?? if (XX == 1) then
  call xx
?? end if

The coco commands can be in either upper or lower case and the space in "END IF" and "ELSE IF" is optional. Coco variables are either integer or logical, and can be declared either as constants (parameters) or as ordinary variables.

In cpp, the lines that aren't used are turned into blanks or are deleted. In coco, you get to choose one of five options: delete, blank, and three styles of Fortran comments (starting with !). For instance:

?? integer, parameter :: CRAY = 1, IBM = 2, SGI = 3
?? integer :: system = IBM
?? if (system == IBM) then
   use ibm_mod
?? else if (system == CRAY) then
   use cray_mod
?? else
   use sgi_mod
?? end if

will produce by default:

!?>?? integer, parameter :: CRAY = 1, IBM = 2, SGI = 3
!?>?? integer :: system = IBM
!?>?? if (system == IBM) then
   use ibm_mod
!?>?? else if (system == CRAY) then
!?>   use cray_mod
!?>?? else
!?>   use sgi_mod
!?>?? end if

Set file:

A coco program will recognize a set file, a separate file which can be used to set the values of coco variables. For the above case, we can have a set file containing:

?? alter: delete
?? integer, parameter :: SGI = 3
?? integer :: system = SGI

producing this output:

   use sgi_mod

As you can see, the set file overrides any values set inside the coco program. Each coco program can have at most one set file and the set file can be shared by all the routines that make up a program.

Running coco:

The goal is that eventually, coco will be a part of the Fortran 2000 compiler system and you won't have to do anything. Right now, the major free implementation is by Purple Sage:

There is a claim that another is at:

but this site pops up a bunch of ads, then causes a core dump of my old Netscape. Let's concentrate on the Purple Sage version. If you type:

% coco model

it will look for model.fpp as the input, produce model.f90 as the output, and look for model.set as the set file. If model.set doesn't exist, it will look for coco.set. Obviously, we need to be invoking coco in our Makefile for now:

.SUFFIXES: .o .fpp .f90

.fpp.o:
        $(COCO) $(COCOFLAGS) $<
        $(F90) -c $(FFLAGS) $*.f90

.f90.o:
        $(F90) -c $(FFLAGS) $<

.fpp.f90:
        $(COCO) $(COCOFLAGS) $<

Building the Purple Sage coco is a multi-step process and they provide some example input files for PC compilers. To be perfectly honest, I haven't had any luck yet building it on our Unix platforms. Still, it is wonderful that they are willing to provide the source code, which means that it can and will be fixed. In the long run, coco will make the Fortran purists feel good about conditional compilation. Meanwhile, the rest of us will continue to get by with cpp and similar tools.

Quick-Tip Q & A

A:[[ I changed the optimization level for one compiler optimization option
  [[ in my makefile, remade everything, and now my program is getting
  [[ different results.  There are over 75 source files.
  [[ Any suggestions how I might find where this compiler option is causing
  [[ a difference?

  # Thanks to Brad Chamberlain:

  Well, the good news is that most optimizing compilers treat source files
  independently, so you can probably factor out interplay between the 75
  source files, which reduces the combinatorics somewhat.  A brute force way
  would be to compile 75 times, turning optimizations on for only one file
  at a time to determine where the problem is.  Or you could use a binary
  search (compile half with optimizations, half without; depending on the
  result, try the opposite half) -- this assumes there's only one problem.

  Once you have the file in question, I tend to use printfs to determine
  where answers differ, painful as they are.

  Doing relative debugging between the two programs (optimized and
  unoptimized) would be the ideal way to approach this problem, but I don't
  think any of the relative debuggers have made their way far enough out of
  research-land to make their use worthwhile.

  Another more indirect approach would be to see if turning on additional
  warnings in the compiler, bounds checking, pointer checking, efence,
  whatever features are available to you will reveal any problems in your
  code that are changing the meaning of the code with optimizations
  (incorrect code is the most likely cause of optimizations changing
  [results]).

  # From Guy Robinson:

  Sometimes I've just compared the sizes of the object and other files
  output by the compiler.  If it is only a small difference you are
  looking for, this works well.  A typical case is trying to see if inlining
  has been done.

  Also, the IBM and Cray compilers can both be asked to output
  intermediate, semi-readable listings.  These, and other listings like
  loopmarks, can be diff'ed from one compile to the next.

Q: I like "mget" and "mput" in ftp, but I'm sick of answering "y", "y",
   "y", "y", "y", "y", "y", "y", "y", "y", "y"... when I know I want ALL
   the files!  You may have experienced it.  It goes like this:

    ftp> mget *.f
    mget adpott.f? y
    227 Entering Passive Mode (199,165,85,37,4,128)
    150 Opening BINARY mode data connection for adpott.f (500 bytes).
    226 Transfer complete.
    500 bytes received in 0.0022 seconds (2.2e+02 Kbytes/s)
    mget at.f? y
    227 Entering Passive Mode (199,165,85,37,4,129)
    150 Opening BINARY mode data connection for at.f (15682 bytes).
    226 Transfer complete.
    15682 bytes received in 0.009 seconds (1.7e+03 Kbytes/s)
    mget badolb.f? y
    227 Entering Passive Mode (199,165,85,37,4,130)
    150 Opening BINARY mode data connection for badolb.f (4543 bytes).
    226 Transfer complete.
    4543 bytes received in 0.012 seconds (3.6e+02 Kbytes/s)
    mget bccc.f? 

  So I often log onto the remote system (when the files are in my own
  account, of course), make a tar file, and just "get" the tar file.  Is
  there another way?

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.