ARSC HPC Users' Newsletter 376, December 14, 2007

Yet Another COMCOT Optimization Story

[By: Tom Logan]

Part I: Background

The imminent retirement of iceflyer and nelchina means that some ARSC users must again start migrating codes to the latest platform. I recently ported a couple of tsunami models to midnight, and thought it might be helpful to document the experience.

The first task was to establish baseline run times and output files from the current platform - the power4 nodes of iceberg. The codes were compiled using,


  -O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64  -qsmp=auto

and run using 8 threads. Two different benchmark scenarios were tested for comparison:


IBM POWER4 Numbers:
                      UAF Tsunami           COMCOT
                      Wallclock (sec)       Wallclock (sec)
------------------------------------------------------------------
  Benchmark 1         3348                  1252
  Benchmark 2         3413                  1240

Part II: The Initial Port, Validation, & Compiler Optimizations

Since the codes use portable Fortran 90 syntax, the actual port was fairly simple. The makefiles were modified to use the pathf90 compiler and the command line switches were modified for the new environment. Most notably, the -byteswapio switch kept the endianness of all I/O the same.

Several different complier optimizations were attempted and all outputs were verified as correct. Performance results follow (since the two benchmarks run in nearly the same amount of time, only the first is reported here):


                      UAF Tsunami Code      COMCOT
  Switches            Wallclock (sec)       Wallclock (sec)
------------------------------------------------------------------
  -O3                 4762                  2243
  -Ofast              5057                  2328
  -O3 -apo, 4way      1980                  2220
  -O3 -apo, 16way     1395                  2509

From these results, we can see a few things:

  1. Using -Ofast is NOT always faster than using -O3. Always try different optimization levels and compare timings as well as RESULTS. (Faster code that gives the wrong answer is never an acceptable trade.)
  2. The UAF Tsunami code takes well to the auto-parallelization, and, as a result, its run times are sufficient.
  3. The COMCOT model did not respond to the auto-parallelization attempts. Thus, further optimization needed to be pursued.

PART III: Auto-Parallelization of COMCOT using manual means

As always, the first thing to do when optimizing a code is to profile it by recompiling with the -pg switch, running the model, and, provided it's a serial code, using gprof on the resulting gmon.out file:


mg56% gprof comcot

  Flat profile:

  Each sample counts as 0.01 seconds.
    %   cumulative   self              self     total           
   time   seconds   seconds    calls  Ks/call  Ks/call  name    
   75.69  10186.48 10186.48    16203     0.00     0.00  momt_s__
   13.62  12018.98  1832.49    16203     0.00     0.00  mass_s__
    3.72  12519.05   500.07     5401     0.00     0.00  jnz_
    2.69  12881.36   362.31    10802     0.00     0.00  minmax_
    1.50  13083.84   202.48     5401     0.00     0.00  change_
    .
    .
    .

Since one routine is taking 75% of the time, that's the obvious place to focus the optimization efforts.

Long-term readers of the ARSC Newletter may notice that this is not the first optimization of COMCOT. Five years ago, the code was ported and optimized for chilkoot, the SV1ex system, October 2002, > issue #255 . Later the code had to be optimized for iceflyer/iceberg, the IBMs, September 2006, issues > #347 & > 348 Re-reading the more recent articles reminded me that momt_s was the troubled routine on iceberg as well. The solution that time was to simply add a single directive (4 times) telling the compiler the loops truly were parallel in nature.

After an overview of compiler options and e-mails with Pathscale, it turns out that no feedback on auto-parallelization is readily available for that compiler suite. In addition no compiler directives to control auto-parallelization are provided. This left me with applying OpenMP directives by hand. The original code:


!IBM* INDEPENDENT
      do i=is,ixm1
        ip1 = i+1
!IBM* INDEPENDENT
        do j=js,l%jy
          if ((l%h(i,j).gt.zero) .and. (l%h(ip1,j).gt.zero)) then
            if (j .le. js) then
              jm1 = js
            else
              jm1 = j-1
            endif
            if (j .ge. l%jy) then
              jp1 = l%jy
            else
              jp1 = j+1
            endif
            tot_n = l%n(i,j,1)+l%n(ip1,j,1)+l%n(i,jm1,1)+l%n(ip1,jm1,1)
            xm = l%m(i,j,1)-l%r2(i,j)*(l%z(ip1,j,2)-l%z(i,j,2))+l%r3(i,j)*tot_n-l%r2(i,j)*
                 twlvth*((l%z(ip1,jp1,2)-2*l%z(ip1,j,2)+l%z(ip1,jm1,2))-(l%z(i,jp1,2)-2*
                 l%z(i,j,2)+l%z(i,jm1,2)))
            if (abs(xm) .lt. eps) xm = zero
            l%m(i,j,2) = xm
          else
            l%m(i,j,2) = 0.0
          end if
        end do
      end do
Became:

!$OMP PARALLEL DO PRIVATE(ip1,jm1,jp1,tot_n,xm)
      do i=is,ixm1
        ip1 = i+1
        do j=js,l%jy
          if ((l%h(i,j).gt.zero) .and. (l%h(ip1,j).gt.zero)) then
            if (j .le. js) then
              jm1 = js
            else
              jm1 = j-1
            endif
            if (j .ge. l%jy) then
              jp1 = l%jy
            else
              jp1 = j+1
            endif
            tot_n = l%n(i,j,1)+l%n(ip1,j,1)+l%n(i,jm1,1)+l%n(ip1,jm1,1)
            xm = l%m(i,j,1)-l%r2(i,j)*(l%z(ip1,j,2)-l%z(i,j,2))+l%r3(i,j)*tot_n-l%r2(i,j)*
                 twlvth*((l%z(ip1,jp1,2)-2*l%z(ip1,j,2)+l%z(ip1,jm1,2))-(l%z(i,jp1,2)-2*
                 l%z(i,j,2)+l%z(i,jm1,2)))
            if (abs(xm) .lt. eps) xm = zero
            l%m(i,j,2) = xm
          else
            l%m(i,j,2) = 0.0
          end if
        end do
      end do

Rerunning the code showed that my quick OpenMP fix worked well:


                               COMCOT
  Switches                     Wallclock (sec)
  ------------------------------------------------------------------
  -O3                          2243
  -Ofast                       2328
  -O3 -apo, 4way               2220
  -O3 -apo, 16way              2509

  -O3 -apo -mp, 4way           1045  [With OpenMP directive inserted]
  -O3 -apo -mp, 16way          789   [With OpenMP directive inserted]

The final version of the code executes in 63% of the time it took on iceberg. This is sufficient for standard runs of the model.

PART IV: Why do I have to keep doing this?

This is all great, but one may ask why couldn't either the XLF or the Pathscale compilers auto-parallelize these loops?

Although the pathscale compiler does not give an auto-parallelization report, one can get a vectorization report using the "-LNO:simd_verbose" switch. This reports:


  Non-contiguous array "L(L.0.456)" references exist, Loop was not vectorized.

But, we can easily see that the calculation of tot_n references several non-contiguous values of l%n. However, since the loop can be parallelized, this does not provide an answer to the question.

The IBM XLF compiler stated:


  "Loop cannot be automatically parallelized. A dependency is
  carried by variable aliasing or function call."

Now, this does help out. If we remove the intermediate variable xm, we see that


  l%m(i,j,2) = l%m(i,j,1)-...

So, one of two possibile explanations come to mind. First, the use of the user defined fortran type (l) might be confusing the compiler. After all, parts of l are referenced 12 times in just the "xm=" equation. Alternately, the fact that the third dimension of the arrays really signify old and new values may be lost to the compiler and the simple fact that l%m(i,j,2) relies on values of l%m(i,j,1) may be the cause of confusion.

In either case, with the portability of OpenMP, I won't have to do this again, right? One can only hope...

The Joys of Segmentation Faults

[ By: Lee Higbie ]

On Midnight, as on many machines, the "segmentation fault" is not very informative. There are many possible causes, but that which the diagnostic offers, "This is likely to have been caused by a stack overflow. Use your shell's ulimit or limit command to see if your stack size limit is too small," is likely to be incorrect because stack defaults to unlimited most of the time. (One can check this value with the "ulimit -s" command.)

Another likely "segmentation fault" cause is a subscript out of bounds. You can at least set your program to catch this type of error by compiling with the -C option, as,


  pathf95 -C ....  or   mpif90 -C ....    or    pathf90 -C ....

This will flag subscripts that are too large or too small so you can check the errant subscript or dimensioning.

The pathscale compiler manual says that if you set the environment variable:


  export F90_BOUNDS_CHECK_ABORT=YES

the program will stop at the first out of range subscript (see: > http://www.arsc.edu/support/howtos/usingsun.html#mpi for instructions on passing environment variables to MPI programs), but this doesn't work (at least not all the time). If you use it anyway, you will catch the errors, even if you get a ton of output along with the bit you need.

For example,


   do i = 1, 1000
      do j = 0, i
        x(i, j) = 0
      enddo
   enddo

would generate a thousand error messages if j's lower index bound is declared as the default of 1. The compiler should warn that the subscript j goes out of range, but I've never seen a compiler that does. Not in Fortran or in any other language.

Fortran Floating Point Traps for Linux

[ By: Don Bahls ]

For this week's quick tip question we were looking for a way to trap floating point exceptions (i.e. produce a core dump when a floating point exception occurs). The question was limited to Linux since the IBM XL compilers on AIX have some great built-in features for handling these exceptions. The best solution I've found is the feenableexcept routine which gcc provides as an extension to the C99 standard. This routine lets you enable particular IEEE exceptions so that a core dump results when the exception occurs.

I didn't find a standard way to do this in Fortran, that's not to say there isn't one, but I didn't find one. In any case, when I couldn't find a Fortran solution for this, I decided to try the C99 solution for Fortran. As it turns out this method worked out nicely. It works for every Fortran compiler I tried. The list included PGI, PathScale, Sun Studio, g77, and gfortran on SLES 9 and Fedora Core 7.

Note this technique really only works on Linux systems and may not work on Linux systems using non-Intel family processors.

The simple solution

This quick solution to this is to simply write a void C function that takes no parameters and enables all of the needed exceptions. Once compiled, this C code can be called from the Fortran program.

The following C code demonstrates a simple solution:


    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <fenv.h>

    void enable_exceptions_()
    {
         int retval;
         /* feenableexcept returns the previous exceptions that were  enabled
            on success, otherwise it returns -1
         */
         retval=feenableexcept( FE_DIVBYZERO 
 FE_INVALID 
  FE_OVERFLOW 
 FE_UNDERFLOW );
         if ( retval == -1 )
         {
             fprintf(stderr, "Warning: call to feenableexcept() failed \n");
         }
    }

    /* This second routine is for Fortran compilers such as g77 and  pathf90
       which follow the f2c name mangling style
    */
    void enable_exceptions__()
    {
        enable_exceptions_();
    }

To compile this code use:


    % gcc fpe.c -Wall -g -c
      # results in fpe.o

Note since feenableexcept is a gcc extension I'm using the GNU compiler to ensure all of the corresponding header files will be present.

To use this routine from a Fortran code simply call the enable_exceptions routine. Here's a simple Fortran code which has a floating point exception (division by zero).


    program exception
         real*4 a
         real*4 b
         call enable_exceptions()
         a=1
         b=0
         a=a/b             !  with exceptions enabled the code should crash here.

         print *,"a=", a
    end program

To compile:


    % pathf90 -g exceptions.f90 fpe.o -lm -o exception
      # results in exceptions

When the code is run it will now produce a core dump as soon as the divide by zero is encountered.


    mg56 % ./exceptions
    Floating point exception (core dumped)

Which makes the bug quite easy to track down with a debugger.


    mg56 % gdb ./exceptions
    GNU gdb 6.3
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License,  and you are
    welcome to change it and/or distribute copies of it under certain  conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty"  for details.
    This GDB was configured as "x86_64-suse-linux"...Using host  
libthread_db library "/lib64/tls/libthread_db.so.1".

    (gdb) r
    Starting program: ./exceptions

    Program received signal SIGFPE, Arithmetic exception.
    0x0000000000400c5e in MAIN__ () at exceptions.f90:8
    8             a=a/b             !  with exceptions enabled the code should crash here.
    Current language:  auto; currently fortran

General Solution

The more general solution to this problem would be to make the feenableexcept callable directly from Fortran. This lets you selectively enable floating point exceptions from the Fortran code. There is one minor complication to this design. That is making the exception types available to Fortran, since they are defined in a C header file. The way I got around this was by writing a simple C program that generates a Fortran header file. I could have done this manually, but it's probably better to have an automated way to do this in case the constants vary from system to system.

Here's the code:


    mg56 % more build_fortran_header.c
    #include <fenv.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main()
    {
    printf("       INTEGER*4   FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW, FE_ALL_EXCEPT\n");
    printf("       parameter ( FE_DIVBYZERO = %d )\n", FE_DIVBYZERO);
    printf("       parameter ( FE_INEXACT = %d )\n", FE_INEXACT);
    printf("       parameter ( FE_INVALID = %d )\n", FE_INVALID);
    printf("       parameter ( FE_OVERFLOW = %d)\n", FE_OVERFLOW);
    printf("       parameter ( FE_UNDERFLOW = %d )\n", FE_UNDERFLOW);
    printf("       parameter ( FE_ALL_EXCEPT = %d )\n", FE_ALL_EXCEPT);
    return 0;
    }

    mg56 % gcc build_fortran_header.c -o build_fortran_header

When the resultant executable is run, it will produce a Fortran header file.


    mg56 % ./build_fortran_header > fenv.fh

As shown here:


    mg56 % more fenv.fh
           INTEGER*4   FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW, FE_ALL_EXCEPT
           parameter ( FE_DIVBYZERO = 4 )
           parameter ( FE_INEXACT = 32 )
           parameter ( FE_INVALID = 1 )
           parameter ( FE_OVERFLOW = 8)
           parameter ( FE_UNDERFLOW = 16 )
           parameter ( FE_ALL_EXCEPT = 61 )

Here the constants used by the feenableexcept were named the same as the corresponding C constants for consistency sake, but this isn't a requirement.

The new Fortran interface to the C routine looks like this:


    mg56 % more fenv.c
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <fenv.h>

    void feenableexcept_(int * except)
    {

         int retval;
         /* feenableexcept returns the previous exceptions that were enabled
            on success, otherwise it returns -1
         */
         retval=feenableexcept( *except );
         if ( retval == -1 )
         {
             fprintf(stderr, "Warning: call to feenableexcept() failed \n");
         }
    }

    /* This second routine is for Fortran compilers such as g77 and pathf90
       which follow the f2c name mangling style
    */
    void feenableexcept__(int * except)
    {
        feenableexcept_(except);
    }

The pointer arguments are to handle the call by reference calling convention used by Fortran. There are also two versions of the routine, to cover two common Fortran name mangling styles.


    mg56 % gcc fenv.c -Wall -g -c

Optionally you can produce a library from the object file.


    mg56 % ar -r libfenv.a fenv.o
    ar: creating libfenv.a

The Fortran code can now directly enable traps for the selected floating point exceptions. The following code enables the traps divide by zero, overflow, and underflow. E.g.:


    mg56 % more exceptions2.f90
    program exception
          include 'fenv.fh'
          real*4 a
          real*4 b
          !  or is the bitwise or operator
          call feenableexcept( or(FE_DIVBYZERO, or(FE_OVERFLOW,  FE_UNDERFLOW)) )
          a=1
          b=0
          a=a/b             !  with exceptions enabled the code  should crash here.
        
          print *,"a=", a
    end program

The compile statement becomes:


    mg56 % pathf90 exceptions2.f90 fenv.o -g -o exceptions2

Alternatively if you opted to use the library method it would be:


    mg56 % pathf90 exceptions2.f90 -L . -lfenv -g -o exceptions2

Further Exercises

The GNU C compiler also includes a second routine fedisableexcept which disables floating point exceptions. This routine could be wrapped similarly to feenableexcept to provide a way to disable exceptions.

That task is left as an exercise.

Happy Holidays -- See You Next Year

      The bits were still switching,       While all about ARSC,       The cubicles felt empty,       And echoed like karst,              With no one to watch them,       The bits became nibbles,       And bytes and then blocks,       And then ran out in dribbles,              They poured on the floor,       Filled the room to the ceiling,       A tsunami of data,       Filled me with such feeling!              ...              But it just had to wait...

--

The newsletter editors wish you Happy Holidays. The letters from North Pole have been mailed. We won't bother you again till January 11th.

Quick-Tip Q & A



A:[[ The IBM compilers have some nice options for tracking down floating
  [[ exceptions.  I took a look at the man page for both gcc and PathScale
  [[ and don't see anything similar.  Is there a easy way to trap floating
  [[ point exceptions with compilers on a Linux system?  Basically I want
  [[ the code to produce a core dump as soon as a floating point exception
  [[ occurs.


  #
  # Editor's response
  #

  GNU compilers on Linux include the routine, "feenableexcept" as an
  extension to the C99 standard.  This lets you trap particular floating
  point exceptions.  This is very much a Linux only solution from what
  I have seen...  it won't work on AIX or Mac OS X.  See the article
  "Fortran Floating Point Traps for Linux" (above) for an example of how
  to use feenableexcept.  The man page is also useful.




Q: I downloaded and untarred a mondo source code, spanning many
   sub-directories.  I've made changes here and there, to about
   25 files.  Now I need to figure out which files I changed, copy
   them out, and back them up.

   My problem is that at least 10 of the updated files have the same
   name ("makefile"). Thus, the complete PATH to each file is critical
   information.

   How can I identify the changed files, extract them, and associate
   them with their PATHs so I'll know later where they came from?

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
Back to Top