ARSC HPC Users' Newsletter 315, May 6, 2005

GNU Make, Part II of III


[ Thanks to Kate Hedstrom for another installment in this series! ]

Last time I mentioned that there is a new book out about GNU Make (Managing Projects with GNU Make, Robert Mecklenburg, O'Reilly, 2005). In the past, I was generating Makefiles for a variety of computers by using imake, a tool that comes with the X11 Window System. The key to imake is separating out the parts of the Makefile that depend on what computer you are on, the parts that depend on the project you are building, and the parts that don't change between your Fortran projects (or X11 projects).

This was not too ungainly as long as our ocean model was a serial code. Then we started generating different Makefiles for the MPI, OpenMP, and serial versions on each system we were using, and before long we had 20-30 Makefiles! It was time to tame that beast. In a previous article we talked about using automake/autoconf to generate the Makefile on each system, but somehow that solution was never a good enough fit for our needs and it never replaced imake. Using GNU make, I think we finally have a better way.

include

The include directive is fairly well supported across make variants and not surprisingly works with GNU make. It can be used to clean up the Makefile by putting bits in an include file. The syntax is simply:


   include file

We are using it to include the list of sources to compile and the list of dependencies. This is how we are now keeping the project information neat. We can also include a file with the system/compiler information in it, assuming we have some way of deciding which file to include. Last time I showed the example from the book, using uname to generate a file name. I decided to modify this example slightly for our needs.
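
As an illustration of the first use, the project information might be pulled in with include files something like this sketch (the file names here are hypothetical, not our actual layout):


   # sources.mk defines the "sources" variable; depend.mk holds the
   # dependency lines for the .F files (file names are illustrative).
   include sources.mk
   include depend.mk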

I have a make variable called FORTRAN, which is set by the user of the Makefile. This value is combined with the result of "uname -s" to provide a machine and compiler combination. For instance, ftn on Linux is the Cray cross-compiler. This would link to a different copy of the NetCDF library and use different compiler flags than the Intel compiler on Linux.


# The user sets FORTRAN:
  FORTRAN := ftn

  MACHINE := $(shell uname -s)
  MACHINE += $(FORTRAN)
  MACHINE := $(shell echo $(MACHINE) | sed 's/[\/ ]/-/g')
  include $(MACHINE).mk

Now, instead of having 30 Makefiles, we have about 12 include files and we pick one at compile time. In this example, we will pick Linux-ftn.mk, containing the Cray cross-compiler information. The value Linux comes from the uname command, the ftn comes from the user, the two are concatenated, then the space is converted to a dash by the sed command. The sed command will also turn the slash in UNICOS/mp into a dash; the native Cray include file is UNICOS-mp-ftn.mk.

The other tricky system is CYGWIN, which puts a version number in the uname output, such as CYGWIN_NT-5.1. GNU make has quite a few built-in functions and also allows the user to define their own. One of the built-in functions lets us do string substitution:


MACHINE := $(patsubst CYGWIN_%,CYGWIN,$(MACHINE))

In make, the % symbol is a sort of wild card, much like * in the shell. Here, we match CYGWIN followed by an underscore and anything else, replacing the whole with simply CYGWIN. Another example of a built-in function is the substitution we do in:


   objects = $(subst .F,.o,$(sources))

where we build our list of objects from the list of sources. There are quite a few other functions - see the book or the GNU make manual for a complete list.
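
As a taste of the user-defined functions mentioned above, here is a small sketch (not taken from our build system) that defines a function and invokes it with the built-in call function:


   # make-objects turns a list of .F sources into the matching .o names,
   # stripping any leading directories; $(1) is the function's argument.
   make-objects = $(patsubst %.F,%.o,$(notdir $(1)))

   # objects becomes "main.o utils.o"
   objects := $(call make-objects,src/main.F src/utils.F)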

Conditionals

One reason we had so many Makefiles was having a separate one for each of the serial/MPI/OpenMP versions on each system (if supported). For instance, the name of the IBM compiler changes when using MPI, and the options change for OpenMP. The compiler options also change when using 64-bit addressing or for debugging, which we had been setting by hand. A better way is to have the user select 64-bit or not, MPI or not, etc., and then do the right thing later.

GNU make supports two kinds of if tests, ifdef and ifeq (plus the negative versions ifndef, ifneq). The example from the book is:


ifdef COMSPEC
   # We are running Windows
else
   # We are not on Windows
endif

An example from the IBM include file is:


               FC := xlf95_r
           FFLAGS := -qsuffix=f=f90 -qmaxmem=-1 -qarch=pwr4 -qtune=pwr4
          ARFLAGS := -r -v

ifdef LARGE
           FFLAGS += -q64
          ARFLAGS += -X 64
          LDFLAGS += -bmaxdata:0x200000000
    NETCDF_INCDIR := /usr/local/pkg/netcdf/netcdf-3.5.0_64/include
    NETCDF_LIBDIR := /usr/local/pkg/netcdf/netcdf-3.5.0_64/lib
else
          LDFLAGS += -bmaxdata:0x70000000
    NETCDF_INCDIR := /usr/local/include
    NETCDF_LIBDIR := /usr/local/lib
endif

ifdef DEBUG
           FFLAGS += -g -qfullpath
else
           FFLAGS += -O3 -qstrict
endif

We also test for MPI and OpenMP and change things accordingly. To test for equality, an example is:


ifeq ($(MPI),on)
   # Do MPI things
endif

or


ifeq "$(MPI)" "on"
   # Do MPI things
endif
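
The MPI and OpenMP tests in the IBM include file look roughly like the following sketch; the OMP switch name, the mpxlf95_r wrapper, and the -qsmp=omp flag are shown as an illustration rather than copied from our actual file:


# MPI and OMP are switches the user sets near the top of the Makefile.
ifdef MPI
               FC := mpxlf95_r
endif

ifdef OMP
           FFLAGS += -qsmp=omp
endif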

The user has to set values for the MPI, DEBUG, and LARGE switches in the Makefile *before* the compiler-dependent piece is included:


    MPI := on
    DEBUG :=
    LARGE :=

Be sure to use the immediate assignment operator ":=" for these.

The conditional features supported by GNU make are very handy and have replaced the need for so many Makefiles in our build system. We have further tidied it up by putting all the include files in their own subdirectory.
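
Putting the pieces together, the top of the Makefile now reads roughly like this (the Compilers subdirectory name is just an example):


    FORTRAN := ftn
        MPI := on
      DEBUG :=
      LARGE :=

    MACHINE := $(shell uname -s)
    MACHINE += $(FORTRAN)
    MACHINE := $(shell echo $(MACHINE) | sed 's/[\/ ]/-/g')

    include Compilers/$(MACHINE).mk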

X1: PE 5.4 Available for Testing

A new programming environment is available on klondike as the non-default "PrgEnv.new". Users are encouraged to test it and let us know if it gives you trouble or if it improves the performance of your code.

C/C++ users may be especially interested, as PE5.4 implements automatic inlining of your codes, which can have a profound effect on performance (by allowing previously unvectorizable loops to vectorize).

To test PE 5.4, switch to the new environment with the command "module switch PrgEnv PrgEnv.new" and then recompile. We advise adding the "-V" option when you invoke the compiler (Cray cc, CC, or ftn); it prints the compiler version, so you can be certain you're getting 5.4.


    $  module switch PrgEnv PrgEnv.new
    $  cc -V                          
      Cray Standard C: Version 5.4.0.0  Mon May 02, 2005  14:51:22
    $  CC -V
      Cray Standard C: Version 5.4.0.0  Mon May 02, 2005  14:51:26
    $  ftn -V
      Cray Fortran: Version 5.4.0.0  Mon May 02, 2005  14:51:29

The Cray Programming Environment 5.4 releases provide the following new features:

  • Tail Recursion Optimization (C/C++)
  • Automatic Inlining (C/C++)
  • tracebk Function Added (C/C++ & CFTN)
  • Implementation of selected Fortran 2003 Features (CFTN):
    • The ERRMSG and SOURCE specifiers on the ALLOCATE statement
    • The IOMSG specifier on OPEN, CLOSE, INQUIRE, READ, WRITE, BACKSPACE, ENDFILE, REWIND, and FLUSH statements
    • The SIZE specifier on the INQUIRE statement
    • The ABSTRACT INTERFACE statement
    • Procedure pointers
  • New API Functions (CrayTools)
  • New Environment Variables PAT_RT_REGION_MAX and PAT_RT_REGION_STKSZ (CrayTools)
  • Environment Variable PAT_ROOT (CrayTools)
  • Complex-to-complex FFTs contain butterflies for radices 7, 11, and 13 (LibSci)
  • LAPACK Built with Inlined BLAS Routines (LibSci)

Performance Statistics with libhpm

Recently, IBM's High Performance Computing Toolkit was installed on iceberg. This kit includes numerous analysis tools for serial, threaded, and MPI applications, among them libhpm. This library provides access to the same hardware counters as hpmcount, but lets the user start and stop polling the counters from within their code via a simple interface.

Here is a template C code:


iceberg2 1% cat template.c
#include <stdio.h>
#include <stdlib.h>
#include <libhpm.h>  /* include file for libhpm */

int main(int argc, char ** argv)
{
    /* ... initialize variables etc. ... */

    /*
    hpmInit needs to be called before any other libhpm calls.

    The number "0" in the call is an ID, while the string
    "template" is a label which will appear in the output files.
    */

    hpmInit(0, "template");

    /*
    A call to hpmStart will start polling.  As with hpmInit,
    the first argument is an ID and the second a string label.
    */
    hpmStart(1, "everything");

    hpmStart(2, "initialize");      /* starts timer 2 */

    /* ... do some computation ... */

    hpmStop(2);                     /* stop counter 2 */

    hpmStart(3, "something else");        /* start another timer */

    /* ... do some more computation ... */

    hpmStop(3);                     /* stop timer 3 */
    hpmStop(1);                     /* stop timer 1 */

    hpmTerminate(0);                /* call hpmTerminate last. */

    return 0;
}

iceberg2 2% export IHPCT_BASE=/usr/local/pkg/ihpct/current
iceberg2 3% xlc -q64 -I$IHPCT_BASE/include \
           -L$IHPCT_BASE/lib/pwr4 -lhpm -lpmapi \
           -lm template.c -o template

NOTE: When using the thread-safe compilers, be sure to link with -lhpm_r rather than -lhpm.

The Fortran interface is similar to the C interface, with "f_" preceding each subroutine name (e.g., hpmInit in the C interface becomes f_hpminit). The Fortran interface uses C-style preprocessor directives, so the header file must be pulled in with a "#include" rather than a standard Fortran "include" statement.

Template Fortran code:


iceberg2 4% cat template.f 
program template
  #include "f_hpm.h"

  call f_hpminit(0, "template")
  call f_hpmstart(1,"sample 1")

  ! do some computation

  call f_hpmstop(1)
  call f_hpmterminate(0)

end program

Be sure to include the flag "-qsuffix=cpp=f" when compiling so that the source is run through the C preprocessor before the Fortran compiler.


iceberg2 2% export IHPCT_BASE=/usr/local/pkg/ihpct/current
iceberg2 3% xlf90_r -q64 -qsuffix=cpp=f \
           -I$IHPCT_BASE/include \
           -L$IHPCT_BASE/lib/pwr4 -lhpm_r -lpmapi \
           -lm template.f -o template

When the executable is run, it will produce two HPM output files: a flat text file and a file suitable for use with other High Performance Computing Toolkit applications (such as peekperf). In this article we will look at the flat text file output.

The output for a program performing a matrix multiply is below.


iceberg2 4% ./matrix_mult 
 adding: PM_FPU_FDIV - FPU executed FDIV instruction
 adding: PM_FPU_FMA - FPU executed multiply-add instruction
 adding: PM_FPU0_FIN - FPU0 produced a result
 adding: PM_FPU1_FIN - FPU1 produced a result
 adding: PM_CYC - Processor cycles
 adding: PM_FPU_STF - FPU executed store instruction
 adding: PM_INST_CMPL - Instructions completed
 adding: PM_LSU_LDF - LSU executed Floating Point load instruction

4092.000000

HPM output in perfhpm0000.434402

Notice there are three labeled sections in the output, "everything", "initialize", and "multiply", showing the performance statistics for each region of the code.


iceberg2 6% cat perfhpm0000.434402 

 libhpm (Version 2.5.4) summary - running on POWER4

 Total execution time of instrumented code (wall time): 33.695981 seconds

 ########  Resource Usage Statistics  ########  

 Total amount of time in user mode            : 33.680000 seconds
 Total amount of time in system mode          : 0.030000 seconds
 Maximum resident set size                    : 14072 Kbytes
 Average shared memory use in text segment    : 2695 Kbytes*sec
 Average unshared memory use in data segment  : 470219 Kbytes*sec
 Number of page faults without I/O activity   : 3571
 Number of page faults with I/O activity      : 44
 Number of times process was swapped out      : 0
 Number of times file system performed INPUT  : 0
 Number of times file system performed OUTPUT : 0
 Number of IPC messages sent                  : 0
 Number of IPC messages received              : 0
 Number of signals delivered                  : 0
 Number of voluntary context switches         : 29
 Number of involuntary context switches       : 45

 #######  End of Resource Statistics  ########

 Instrumented section: 1 - Label: everything - process: 0
 file: matrix_mult.c, lines: 22 <--> 74
  Count: 1
  Wall Clock Time: 33.695669 seconds
  Total time in user mode: 33.627955656555 seconds
  Exclusive duration: 2.6e-05 seconds

  PM_FPU_FDIV (FPU executed FDIV instruction)          :               0
  PM_FPU_FMA (FPU executed multiply-add instruction)   :      1073743046
  PM_FPU0_FIN (FPU0 produced a result)                 :       705318816
  PM_FPU1_FIN (FPU1 produced a result)                 :      1450604311
  PM_CYC (Processor cycles)                            :     36989497303
  PM_FPU_STF (FPU executed store instruction)          :      1079025247
  PM_INST_CMPL (Instructions completed)                :     34434242623
  PM_LSU_LDF (LSU executed Floating Point load instr.. :      3230276263

  Utilization rate                           :          99.799 %
  Total load and store operations            :        4309.302 M
  Instructions per load/store                :           7.991
  MIPS                                       :        1021.919
  Instructions per cycle                     :           0.931
  HW Float point instructions per Cycle      :           0.058
  Total Floating point instructions + FMAs.. :        2150.641 M
  Flip rate (flips / WCT)                    :          63.825 Mflip/sec
  Flips / user time                          :          63.954 Mflip/sec
  FMA percentage                             :          99.853 %
  Computation intensity                      :           0.499


 Instrumented section: 2 - Label: initialize - process: 0
 file: matrix_mult.c, lines: 23 <--> 56
  Count: 1
  Wall Clock Time: 0.136736 seconds
  Total time in user mode: 0.112122804395573 seconds

...
...

 Instrumented section: 3 - Label: multiply - process: 0
 file: matrix_mult.c, lines: 61 <--> 73
  Count: 1
  Wall Clock Time: 33.558907 seconds
  Total time in user mode: 33.5158270637814 seconds

...
...

The HPM library and hpmcount allow a number of different hardware counters to be polled, though only one counter set can be used at a time. The output from the default group (60) is shown above. The counter group for an application is set via the environment variable HPM_EVENT_SET. In particular, the following groups provide information useful for performance analysis:

  • 60: (default) cycles, instructions, and floating point operations
  • 5: cache and memory statistics, including L2, L3, and memory
  • 59: cycles, instructions, TLB misses, and L1 cache activity
  • 53: floating point operations (including fdiv, fsqrt, and fma) and processor cycles

To get a complete listing of all available counter sets run "hpmcount -l".
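
For example, to collect the cache and memory statistics of group 5 instead of the default group, set the variable before rerunning the instrumented executable (a sketch following the matrix multiply example above):


    iceberg2 7% export HPM_EVENT_SET=5
    iceberg2 8% ./matrix_mult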

Quick-Tip Q & A


A: [[ When my job completes, I want to get the output back to my local
   [[ machine in an automated fashion for post processing, etc.  Is 
   [[ there a user friendly and secure way of doing this?


#
# Editor Response
# 

We didn't get any answers to the question from the last newsletter, so
let's try a new question.



Q: I am using Loadleveler and want to get the number of nodes and
   processors that my job requests to set two environment variables:
   NODES and PROCS.  After some searching, I found that LoadLeveler
   sets the environment variable LOADL_PROCESSOR_LIST when my job is
   run.  This lists the node that each task is running on.  Below is an
   example of what I got when running a 9 task, 3 node job.

   LOADL_PROCESSOR_LIST=b7n1 b7n1 b7n1 b7n4 b7n4 b7n4 b7n2 b7n2 b7n2

   There must be a way to get the information I want from this list.
   Can you help me figure out a way to set the value of NODES and PROCS
   from the value of this environment variable?
   

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.