ARSC HPC Users' Newsletter 373, November 02, 2007

Introduction to PAPI

Modern processors usually have hardware counters that let an end user collect performance statistics from the processor with minimal overhead. The Performance Application Programming Interface (PAPI) library attempts to provide a common interface to these counters across different processor architectures. PAPI offers both low-level and high-level routines, including high-level routines to calculate MFlop rates.

The PAPI_flops routine queries the floating point instruction counters and outputs a MFlop/s value. It bases this value on the operations performed and the elapsed time since the previous call to PAPI_flops. Here's the C declaration for PAPI_flops (there is also a Fortran interface to this routine).

   int PAPI_flops (float *rtime, float *ptime, long_long *flpops, float *mflops);

Where the arguments, all of which are filled in by the routine, are:

    rtime   total realtime since first PAPI_flops() call
    ptime   total process time since the first PAPI_flops() call
    flpops  total floating point operations since the first call
    mflops  Mflop/s achieved since the previous call

For more details see:     http://icl.cs.utk.edu/projects/papi/files/html_man3/papi_flops.html

Here's a simple example that uses the PAPI_flops routine to get floating point statistics for a program. This also does some basic sanity checking to ensure there isn't a PAPI version mismatch and to ensure that the PAPI library is functional.


mg56 % cat measure_flops.c
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSIZE (1024 * 100)

int main(void)
{
    int retval;        /* return value for PAPI calls */
    float rtime;       /* total realtime since first PAPI_flops() 
                          call */
    float ptime;       /* total process time since the first 
                          PAPI_flops() call */
    long long flpops;  /* total floating point instructions or 
                          operations since the first call */
    float mflops;      /* Mflop/s achieved since the previous call */
    int ii;
    int jj;

    float * array1;
    float * array2;

    /* Initialize the PAPI library */
    retval = PAPI_library_init(PAPI_VER_CURRENT);

    /* Verify there isn't a version mismatch */
    if (retval != PAPI_VER_CURRENT && retval > 0)
    {
        fprintf(stderr,"PAPI library version mismatch!\n");
        exit(1);
    }

    /* verify that there wasn't a different PAPI error */
    if (retval < 0)
    {
        fprintf(stderr, "PAPI Initialization error!\n");
        exit(1);
    }

    array1=(float *) malloc( (size_t)MSIZE * sizeof(float) );
    if ( array1 == NULL ) { printf("ERROR: array1==NULL!\n"); exit(1); }
    array2=(float *) malloc( (size_t)MSIZE * sizeof(float) );
    if ( array2 == NULL ) { printf("ERROR: array2==NULL!\n"); exit(1); }

    /* Initialize the arrays */    
    for(ii=0;ii<MSIZE;++ii)
    {
        array1[ii]=.1234 * (float)ii;
        array2[ii]=.5678 * (float)ii;
    }

    /* Initialize the counters */
    retval=PAPI_flops(&rtime, &ptime, &flpops, &mflops);
    if ( retval != PAPI_OK )
    {
        fprintf(stderr,"Error running PAPI_flops!\n");
        exit(1);
    }

    /* Do some floating point work */
    for(jj=0;jj<MSIZE/2;++jj)
    {
        for(ii=0;ii<MSIZE;++ii)
        {
            array1[ii]+=array2[ii];
        }
    }

    printf("array1[0]=%f\n", array1[0]);

    /* Reread the counters */
    retval=PAPI_flops(&rtime, &ptime, &flpops, &mflops);
    if ( retval != PAPI_OK )
    {
        fprintf(stderr,"Error running PAPI_flops!\n");
        exit(1);
    }
    
    /* Print statistics */
    fprintf(stdout, "real time          = %f\n",  rtime);
    fprintf(stdout, "processing time    = %f\n",  ptime);
    fprintf(stdout, "floating point ops = %lld\n", flpops);
    fprintf(stdout, "MFlop/s            = %f\n",  mflops);

    return 0;
}

PAPI is available in the $PET_HOME directory on iceberg and midnight. Below is an example of how the above code can be compiled on midnight.


mg56 % PAPI=${PET_HOME}/pkgs/papi-3.5.0
mg56 % pathcc -I${PAPI}/include -Ofast \
       -L${PAPI}/lib64 \
       -Wl,-R ${PAPI}/lib64 \
       -lpapi -lperfctr \
       measure_flops.c -o measure_flops

When run on a compute node, the "measure_flops" code will produce the following output:


mt257 % ./measure_flops 
array1[0]=0.000000
real time          = 1.177192
processing time    = 1.176266
floating point ops = 2705502720
MFlop/s            = 2300.077393

NOTE: PAPI is available only on compute nodes on midnight.

Similarly on iceberg:


iceberg2 % PAPI=${PET_HOME}/pkgs/papi-3.5.0-64bit
iceberg2 % xlc -q64 -I${PAPI}/include -O5 \
       -L${PAPI}/lib \
       -lpapi64 -lpmapi \
       measure_flops.c -o measure_flops

iceberg2 % ./measure_flops 
array1[0]=0.000000
real time          = 4.140038
processing time    = 4.133397
floating point ops = 5242883584
MFlop/s            = 1268.420166

For more details on PAPI check out the PAPI documentation online:     http://icl.cs.utk.edu/projects/papi/files/html_man3/papi.html

There are tools built on top of PAPI that provide access to PAPI counters without code modification. One of these tools, TAU, is also installed in $PET_HOME and will be featured in a future issue of this newsletter. For more information on TAU, see:     http://www.cs.uoregon.edu/research/tau

PathScale Debugging Flags

The default compiler suite on midnight is PathScale. This compiler has a few handy options which can aid in the debugging process.

-trapuv

When the -trapuv flag is used, uninitialized floating point variables are initialized to NaN and the CPU is set to detect floating point exceptions. When an uninitialized variable is used, a core dump will be produced. This option only applies to local scalar and array variables and memory allocated via the "alloca" call. This specifically does not apply to memory allocated via "malloc" (C), "new" (C++) or "allocate" (Fortran 90), nor will this option detect uninitialized integer data.

Works with:

pathcc, pathCC, and pathf90 (as well as MPI compilers using PathScale)

Example Use:

Here's a simple example using -trapuv to catch an uninitialized REAL*4 variable (b). It is a good idea to include -g when using -trapuv to ensure the debugger can make sense of the core file.


   mg56 % pathf90 test.f90 -trapuv -g -o test
   mg56 % ./test
   Floating point exception (core dumped)
   mg56 % gdb ./test core 
   ...
   ...
   #0  0x0000000000400be4 in MAIN__ () at test.f90:7
   7           b=b*a
   (gdb)

-zerouv

The -zerouv option will set uninitialized variables to zero at runtime rather than NaN. This is an easy option to try when a code misbehaves. This option works with local scalar and array variables and memory allocated via the "alloca" call. There is a slight performance overhead associated with this option as variables are set to zero at run-time.

-C

The -C option will enable array bounds checking for Fortran 90 codes. This can be a quick way to track down out-of-bounds array accesses.

Works With:

pathf90

Example Use:

   mg56 % pathf90 test.f90 -C -g -o test
   mg56 % ./test
   
   lib-4964 : WARNING 
     Subscript is out of range for dimension 1 for array
     'C' at line 11 in file '/lustre/wrkdir/bahls/debug/test.f90',
     diagnosed in routine '__f90_bounds_check'.
    c(10)= 1.

You can set the environment variable F90_BOUNDS_CHECK_ABORT to YES to have the first out-of-bounds array access cause the application to abort.


    mg56 % export  F90_BOUNDS_CHECK_ABORT=YES
    mg56 % ./test

    lib-4964 : UNRECOVERABLE library error 
      Subscript is out of range for dimension 1 for array
     'C' at line 11 in file '/lustre/wrkdir/bahls/debug/test.f90',
     diagnosed in routine '__f90_bounds_check'.
    Aborted (core dumped)

As with most debugging techniques, it is a good idea to include the -g option when compiling. More importantly, it is a good idea not to use these techniques when running production work as there may be a significant performance hit.

Other Tips

The MVAPICH MPI stack used on midnight has a few nuances that can cause confusion.

  1. Limits set within a PBS script are not propagated to the MPI environment. The workaround is to set the limits within the rc file for your shell. For example, if your login shell on midnight is bash, you can increase the "core" limit for MPI applications using the ~/.bashrc file:

    # for bash users
    mg56 % grep ulimit ~/.bashrc
    ulimit -Sc unlimited

    # for csh/tcsh users
    mg56 % grep limit ~/.cshrc
    limit coredumpsize unlimited
    limit stacksize unlimited
  2. Also, environment variables are not propagated to the MPI environment by default. You need to specify the environment variables within the mpirun command, e.g.:

    mpirun -np 4 F90_BOUNDS_CHECK_ABORT=YES ./a.out

Quick-Tip Q & A


A:[[ I have a code written in C++ and would like to append a floating
  [[ point value to the end of a string.  It's pretty easy to do this
  [[    in C using a character array, e.g.:
  [[ 
  [[      #include <stdio.h>
  [[      #include <stdlib.h>
  [[ 
  [[      int main()
  [[      {
  [[          char buf[1024];
  [[          float v=1.23;
  [[          sprintf(buf,"Val= %.2f\n", v);
  [[          printf("%s", buf);
  [[      }
  [[ 
  [[ 
  [[    But, as I said, I need to do this in C++.  Is there some slick C++
  [[    way to do this?  The following doesn't work!
  [[ 
  [[      #include <string>
  [[      #include <iostream>
  [[ 
  [[      int main()
  [[      {
  [[          std::string buf;
  [[          float val=1.23;
  [[          buf="Val= " + val;
  [[          std::cout << buf << std::endl;
  [[      } 
  [[
  
#
# Thanks to Greg Newby, Lorin Hochstein, Rich Griswold and Sean 
# Ziegeler for sharing solutions using the stringstream class.
#
# Here's Rich's response:
#

The stringstream class is an input/output stream with an associated
string object.  You can use the standard insertion and extraction
operators, and you can get and set the associated string with the
str() method.

#include <sstream>
#include <iostream>

int main()
{
    std::stringstream buf;
    float val=1.23;
    buf << "Val= " << val;
    std::cout << buf.str() << std::endl;
}

To clear the contents of a stringstream, don't use clear(),
since it only clears the status flags.  Instead, call str()
with an empty string:  buf.str("").  For more information, see

http://www.cplusplus.com/reference/iostream/stringstream/


#
# Thanks to Greg Newby for sharing this style recommendation.
#

Note that good style is to capture the output to the ostringstream in
some sort of test for success.  Otherwise, you might get unexpected
results if you try to use the class with a non-intrinsic data type.
Substitute something like this, or throw an exception:

          if (! (o << buf << val << endl) ) 
                cerr << "Oops!" << endl;

                
# 
# Last but not least thanks to Sean Ziegeler for sharing the
# following  
#

C is a valid subset of C++.  Just use the C approach (though you
might consider using the more secure snprintf() function).  If you
really need a C++ string after that, you can create one from the
character array:
        std::string str(buf);
        
C++ fanatics might not like you, but if it ain't broke...


Q: My application creates a number of postscript files that I need
   to convert to the png image format so I can put them on my webpage.
   Currently I open each file in an image editor and save the file
   to the new name.  This seems like a waste of my time!  Is there
   a way to automate this process?
       

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.