| Newsletter Index | Quick-Tip Index | Search Newsletters |
[By: Tom Logan]
Part I: Background
The imminent retirement of iceflyer and nelchina means that some ARSC users must again start migrating codes to the latest platform. I recently ported a couple of tsunami models to midnight, and thought it might be helpful to document the experience.
The first task was to establish baseline run times and output files from the current platform - the power4 nodes of iceberg. The codes were compiled using,
-O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64 -qsmp=auto
and run using 8 threads. Two different benchmark scenarios were tested for comparison:
IBM POWER4 Numbers:
UAF Tsunami COMCOT
Wallclock (sec) Wallclock (sec)
------------------------------------------------------------------
Benchmark 1 3348 1252
Benchmark 2 3413 1240
Part II: The Initial Port, Validation, & Compiler Optimizations
Since the codes use portable Fortran 90 syntax, the actual port was fairly simple. The makefiles were modified to use the pathf90 compiler and the command line switches were modified for the new environment. Most notably, the -byteswapio switch kept the endianness of all I/O the same.
Several different complier optimizations were attempted and all outputs were verified as correct. Performance results follow (since the two benchmarks run in nearly the same amount of time, only the first is reported here):
UAF Tsunami Code COMCOT
Switches Wallclock (sec) Wallclock (sec)
------------------------------------------------------------------
-O3 4762 2243
-Ofast 5057 2328
-O3 -apo, 4way 1980 2220
-O3 -apo, 16way 1395 2509
From these results, we can see a few things:
PART III: Auto-Parallelization of COMCOT using manual means
As always, the first thing to do when optimizing a code is to profile it by recompiling with the -pg switch, running the model, and, provided it's a serial code, using gprof on the resulting gmon.out file:
mg56% gprof comcot
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ks/call Ks/call name
75.69 10186.48 10186.48 16203 0.00 0.00 momt_s__
13.62 12018.98 1832.49 16203 0.00 0.00 mass_s__
3.72 12519.05 500.07 5401 0.00 0.00 jnz_
2.69 12881.36 362.31 10802 0.00 0.00 minmax_
1.50 13083.84 202.48 5401 0.00 0.00 change_
.
.
.
Since one routine is taking 75% of the time, that's the obvious place to focus the optimization efforts.
Long-term readers of the ARSC Newletter may notice that this is not the first optimization of COMCOT. Five years ago, the code was ported and optimized for chilkoot, the SV1ex system, October 2002, issue #255. Later the code had to be optimized for iceflyer/iceberg, the IBMs, September 2006, issues #347 & 348 Re-reading the more recent articles reminded me that momt_s was the troubled routine on iceberg as well. The solution that time was to simply add a single directive (4 times) telling the compiler the loops truly were parallel in nature.
After an overview of compiler options and e-mails with Pathscale, it turns out that no feedback on auto-parallelization is readily available for that compiler suite. In addition no compiler directives to control auto-parallelization are provided. This left me with applying OpenMP directives by hand. The original code:
!IBM* INDEPENDENT
do i=is,ixm1
ip1 = i+1
!IBM* INDEPENDENT
do j=js,l%jy
if ((l%h(i,j).gt.zero) .and. (l%h(ip1,j).gt.zero)) then
if (j .le. js) then
jm1 = js
else
jm1 = j-1
endif
if (j .ge. l%jy) then
jp1 = l%jy
else
jp1 = j+1
endif
tot_n = l%n(i,j,1)+l%n(ip1,j,1)+l%n(i,jm1,1)+l%n(ip1,jm1,1)
xm = l%m(i,j,1)-l%r2(i,j)*(l%z(ip1,j,2)-l%z(i,j,2))+l%r3(i,j)*tot_n-l%r2(i,j)*
twlvth*((l%z(ip1,jp1,2)-2*l%z(ip1,j,2)+l%z(ip1,jm1,2))-(l%z(i,jp1,2)-2*
l%z(i,j,2)+l%z(i,jm1,2)))
if (abs(xm) .lt. eps) xm = zero
l%m(i,j,2) = xm
else
l%m(i,j,2) = 0.0
end if
end do
end do
Became:
!$OMP PARALLEL DO PRIVATE(ip1,jm1,jp1,tot_n,xm)
do i=is,ixm1
ip1 = i+1
do j=js,l%jy
if ((l%h(i,j).gt.zero) .and. (l%h(ip1,j).gt.zero)) then
if (j .le. js) then
jm1 = js
else
jm1 = j-1
endif
if (j .ge. l%jy) then
jp1 = l%jy
else
jp1 = j+1
endif
tot_n = l%n(i,j,1)+l%n(ip1,j,1)+l%n(i,jm1,1)+l%n(ip1,jm1,1)
xm = l%m(i,j,1)-l%r2(i,j)*(l%z(ip1,j,2)-l%z(i,j,2))+l%r3(i,j)*tot_n-l%r2(i,j)*
twlvth*((l%z(ip1,jp1,2)-2*l%z(ip1,j,2)+l%z(ip1,jm1,2))-(l%z(i,jp1,2)-2*
l%z(i,j,2)+l%z(i,jm1,2)))
if (abs(xm) .lt. eps) xm = zero
l%m(i,j,2) = xm
else
l%m(i,j,2) = 0.0
end if
end do
end do
Rerunning the code showed that my quick OpenMP fix worked well:
COMCOT
Switches Wallclock (sec)
------------------------------------------------------------------
-O3 2243
-Ofast 2328
-O3 -apo, 4way 2220
-O3 -apo, 16way 2509
-O3 -apo -mp, 4way 1045 [With OpenMP directive inserted]
-O3 -apo -mp, 16way 789 [With OpenMP directive inserted]
The final version of the code executes in 63% of the time it took on iceberg. This is sufficient for standard runs of the model.
PART IV: Why do I have to keep doing this?
This is all great, but one may ask why couldn't either the XLF or the Pathscale compilers auto-parallelize these loops?
Although the pathscale compiler does not give an auto-parallelization report, one can get a vectorization report using the "-LNO:simd_verbose" switch. This reports:
Non-contiguous array "L(L.0.456)" references exist, Loop was not vectorized.
But, we can easily see that the calculation of tot_n references several non-contiguous values of l%n. However, since the loop can be parallelized, this does not provide an answer to the question.
The IBM XLF compiler stated:
"Loop cannot be automatically parallelized. A dependency is carried by variable aliasing or function call."
Now, this does help out. If we remove the intermediate variable xm, we see that
l%m(i,j,2) = l%m(i,j,1)-...
So, one of two possibile explanations come to mind. First, the use of the user defined fortran type (l) might be confusing the compiler. After all, parts of l are referenced 12 times in just the "xm=" equation. Alternately, the fact that the third dimension of the arrays really signify old and new values may be lost to the compiler and the simple fact that l%m(i,j,2) relies on values of l%m(i,j,1) may be the cause of confusion.
In either case, with the portability of OpenMP, I won't have to do this again, right? One can only hope...
[ By: Lee Higbie ]
On Midnight, as on many machines, the "segmentation fault" is not very informative. There are many possible causes, but that which the diagnostic offers, "This is likely to have been caused by a stack overflow. Use your shell's ulimit or limit command to see if your stack size limit is too small," is likely to be incorrect because stack defaults to unlimited most of the time. (One can check this value with the "ulimit -s" command.)
Another likely "segmentation fault" cause is a subscript out of bounds. You can at least set your program to catch this type of error by compiling with the -C option, as,
pathf95 -C .... or mpif90 -C .... or pathf90 -C ....
This will flag subscripts that are too large or too small so you can check the errant subscript or dimensioning.
The pathscale compiler manual says that if you set the environment variable:
export F90_BOUNDS_CHECK_ABORT=YES
the program will stop at the first out of range subscript (see: http://www.arsc.edu/support/howtos/usingsun.html#mpi for instructions on passing environment variables to MPI programs), but this doesn't work (at least not all the time). If you use it anyway, you will catch the errors, even if you get a ton of output along with the bit you need.
For example,
do i = 1, 1000
do j = 0, i
x(i, j) = 0
enddo
enddo
would generate a thousand error messages if j's lower index bound is declared as the default of 1. The compiler should warn that the subscript j goes out of range, but I've never seen a compiler that does. Not in Fortran or in any other language.
[ By: Don Bahls ]
For this week's quick tip question we were looking for a way to trap floating point exceptions (i.e. produce a core dump when a floating point exception occurs). The question was limited to Linux since the IBM XL compilers on AIX have some great built-in features for handling these exceptions. The best solution I've found is the feenableexcept routine which gcc provides as an extension to the C99 standard. This routine lets you enable particular IEEE exceptions so that a core dump results when the exception occurs.
I didn't find a standard way to do this in Fortran, that's not to say there isn't one, but I didn't find one. In any case, when I couldn't find a Fortran solution for this, I decided to try the C99 solution for Fortran. As it turns out this method worked out nicely. It works for every Fortran compiler I tried. The list included PGI, PathScale, Sun Studio, g77, and gfortran on SLES 9 and Fedora Core 7.
Note this technique really only works on Linux systems and may not work on Linux systems using non-Intel family processors.
The simple solution
This quick solution to this is to simply write a void C function that takes no parameters and enables all of the needed exceptions. Once compiled, this C code can be called from the Fortran program.
The following C code demonstrates a simple solution:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fenv.h>
void enable_exceptions_()
{
int retval;
/* feenableexcept returns the previous exceptions that were enabled
on success, otherwise it returns -1
*/
retval=feenableexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW | FE_UNDERFLOW );
if ( retval == -1 )
{
fprintf(stderr, "Warning: call to feenableexcept() failed \n");
}
}
/* This second routine is for Fortran compilers such as g77 and pathf90
which follow the f2c name mangling style
*/
void enable_exceptions__()
{
enable_exceptions_();
}
To compile this code use:
% gcc fpe.c -Wall -g -c
# results in fpe.o
Note since feenableexcept is a gcc extension I'm using the GNU compiler to ensure all of the corresponding header files will be present.
To use this routine from a Fortran code simply call the enable_exceptions routine. Here's a simple Fortran code which has a floating point exception (division by zero).
program exception
real*4 a
real*4 b
call enable_exceptions()
a=1
b=0
a=a/b ! with exceptions enabled the code should crash here.
print *,"a=", a
end program
To compile:
% pathf90 -g exceptions.f90 fpe.o -lm -o exception
# results in exceptions
When the code is run it will now produce a core dump as soon as the divide by zero is encountered.
mg56 % ./exceptions
Floating point exception (core dumped)
Which makes the bug quite easy to track down with a debugger.
mg56 % gdb ./exceptions
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host
libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) r
Starting program: ./exceptions
Program received signal SIGFPE, Arithmetic exception.
0x0000000000400c5e in MAIN__ () at exceptions.f90:8
8 a=a/b ! with exceptions enabled the code should crash here.
Current language: auto; currently fortran
General Solution
The more general solution to this problem would be to make the feenableexcept callable directly from Fortran. This lets you selectively enable floating point exceptions from the Fortran code. There is one minor complication to this design. That is making the exception types available to Fortran, since they are defined in a C header file. The way I got around this was by writing a simple C program that generates a Fortran header file. I could have done this manually, but it's probably better to have an automated way to do this in case the constants vary from system to system.
Here's the code:
mg56 % more build_fortran_header.c
#include <fenv.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
printf(" INTEGER*4 FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW, FE_ALL_EXCEPT\n");
printf(" parameter ( FE_DIVBYZERO = %d )\n", FE_DIVBYZERO);
printf(" parameter ( FE_INEXACT = %d )\n", FE_INEXACT);
printf(" parameter ( FE_INVALID = %d )\n", FE_INVALID);
printf(" parameter ( FE_OVERFLOW = %d)\n", FE_OVERFLOW);
printf(" parameter ( FE_UNDERFLOW = %d )\n", FE_UNDERFLOW);
printf(" parameter ( FE_ALL_EXCEPT = %d )\n", FE_ALL_EXCEPT);
return 0;
}
mg56 % gcc build_fortran_header.c -o build_fortran_header
When the resultant executable is run, it will produce a Fortran header file.
mg56 % ./build_fortran_header > fenv.fh
As shown here:
mg56 % more fenv.fh
INTEGER*4 FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW, FE_ALL_EXCEPT
parameter ( FE_DIVBYZERO = 4 )
parameter ( FE_INEXACT = 32 )
parameter ( FE_INVALID = 1 )
parameter ( FE_OVERFLOW = 8)
parameter ( FE_UNDERFLOW = 16 )
parameter ( FE_ALL_EXCEPT = 61 )
Here the constants used by the feenableexcept were named the same as the corresponding C constants for consistency sake, but this isn't a requirement.
The new Fortran interface to the C routine looks like this:
mg56 % more fenv.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fenv.h>
void feenableexcept_(int * except)
{
int retval;
/* feenableexcept returns the previous exceptions that were enabled
on success, otherwise it returns -1
*/
retval=feenableexcept( *except );
if ( retval == -1 )
{
fprintf(stderr, "Warning: call to feenableexcept() failed \n");
}
}
/* This second routine is for Fortran compilers such as g77 and pathf90
which follow the f2c name mangling style
*/
void feenableexcept__(int * except)
{
feenableexcept_(except);
}
The pointer arguments are to handle the call by reference calling convention used by Fortran. There are also two versions of the routine, to cover two common Fortran name mangling styles.
mg56 % gcc fenv.c -Wall -g -c
Optionally you can produce a library from the object file.
mg56 % ar -r libfenv.a fenv.o
ar: creating libfenv.a
The Fortran code can now directly enable traps for the selected floating point exceptions. The following code enables the traps divide by zero, overflow, and underflow. E.g.:
mg56 % more exceptions2.f90
program exception
include 'fenv.fh'
real*4 a
real*4 b
! or is the bitwise or operator
call feenableexcept( or(FE_DIVBYZERO, or(FE_OVERFLOW, FE_UNDERFLOW)) )
a=1
b=0
a=a/b ! with exceptions enabled the code should crash here.
print *,"a=", a
end program
The compile statement becomes:
mg56 % pathf90 exceptions2.f90 fenv.o -g -o exceptions2
Alternatively if you opted to use the library method it would be:
mg56 % pathf90 exceptions2.f90 -L . -lfenv -g -o exceptions2
Further Exercises
The GNU C compiler also includes a second routine fedisableexcept which disables floating point exceptions. This routine could be wrapped similarly to feenableexcept to provide a way to disable exceptions.
That task is left as an exercise.
The bits were still switching,
While all about ARSC,
The cubicles felt empty,
And echoed like karst,
With no one to watch them,
The bits became nibbles,
And bytes and then blocks,
And then ran out in dribbles,
They poured on the floor,
Filled the room to the ceiling,
A tsunami of data,
Filled me with such feeling!
...
But it just had to wait...
--
The newsletter editors wish you Happy Holidays. The letters from North Pole have been mailed. We won't bother you again till January 11th.
A:[[ The IBM compilers have some nice options for tracking down floating
[[ exceptions. I took a look at the man page for both gcc and PathScale
[[ and don't see anything similar. Is there a easy way to trap floating
[[ point exceptions with compilers on a Linux system? Basically I want
[[ the code to produce a core dump as soon as a floating point exception
[[ occurs.
#
# Editor's response
#
GNU compilers on Linux include the routine, "feenableexcept" as an
extension to the C99 standard. This lets you trap particular floating
point exceptions. This is very much a Linux only solution from what
I have seen... it won't work on AIX or Mac OS X. See the article
"Fortran Floating Point Traps for Linux" (above) for an example of how
to use feenableexcept. The man page is also useful.
Q: I downloaded and untarred a mondo source code, spanning many
sub-directories. I've made changes here and there, to about
25 files. Now I need to figure out which files I changed, copy
them out, and back them up.
My problem is that at least 10 of the updated files have the same
name ("makefile"). Thus, the complete PATH to each file is critical
information.
How can I identify the changed files, extract them, and associate
them with their PATHs so I'll know later where they came from?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Contact:
Thomas J. Baring ARSC Web Specialist ph: 907-450-8619 Donald Bahls ARSC User Consultant ph: 907-450-8674 Arctic Region Supercomputing Center University of Alaska Fairbanks PO Box 756020 Fairbanks AK 99775-6020
Send comments and questions to the current editors using this Contact Form.Email Subscriptions:
| Newsletter Index | Quick-Tip Index | Search Newsletters |
Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775 |
voice: 907-474-6935 |
email:
home | search | about | support | news | science | resources