ARSC HPC Users' Newsletter 347, September 01, 2006
Fall 2006: ARSC User Training
Once again, ARSC users have a tremendous variety and depth of training opportunities. Here are the topics this fall:
- Introduction to ARSC & HPC
- Introduction to Unix
- Data Management at ARSC
- Writing Batch Scripts
- Data Visualization
- Introduction to Fortran
- Debugging
- Validation & Verification
- Performance Programming
- Parallel Shared Memory Programming
- Parallel Distributed Memory Programming
- Space Plasma Physics Applications
ARSC user training is offered in conjunction with "Core Skills For Computational Science," taught jointly by the UAF Physics Department and ARSC.
This *IS* ARSC's Fall User Training. You are encouraged to drop in on any lecture of interest. Here's the complete training schedule:
http://people.arsc.edu/~cskills/schedule.shtml
And here's the primary training web site:
http://people.arsc.edu/~cskills/
Contact Tom Logan (logan AT arsc.edu) with questions.
How to Beat Moore's Law: An Optimization Story in 6 Acts--Part I
[ Many thanks to Tom Logan of ARSC for this two-part thriller! ]
| ACT I | Analysis Shows I'm Doomed |
| ACT II | Depression Sets In |
| ACT III | Relying on Moore |
| ACT IV | Take Your Own Advice |
| ACT V | Directives Save The Day |
| ACT VI | Conclusions of the Super-Linear Kind |
ACT I: Analysis Shows I'm Doomed
I was recently faced with a problem that often comes up in the scientific computing realm: the tsunami model that I was working with was too slow. I needed to run a job for 24780 iterations (time steps). Not realizing this was about 10 times longer than any of the test runs previously made, I started the job up on one p655 node on Iceberg (the code is serial) and waited for the results.
What I got was very discouraging. In the eight hours allowed in the standard queue on Iceberg, the code only completed 4050 iterations. This worked out to about 8.4 iterations per minute. A quick calculation showed me that I was most certainly doomed, since the full run would take roughly 49 hours to complete and, while the "single" queues at ARSC would permit such a long run, it would be impractical for the desired test and production work.
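The estimate above can be reproduced with a few lines of shell. This is only a sketch of the arithmetic, not part of the original workflow: the function name estimate_hours and its argument order are made up for illustration, and the 480 minutes is simply the eight-hour queue limit expressed in minutes.

```shell
#!/bin/sh
# Hypothetical helper: project total wall-clock time from a partial run.
# Arguments: iterations completed, minutes elapsed, total iterations wanted.
estimate_hours() {
    awk -v completed="$1" -v mins="$2" -v total="$3" 'BEGIN {
        rate = completed / mins              # iterations per minute
        printf "%.1f it/min, %.1f hours\n", rate, total / rate / 60
    }'
}

# 4050 iterations in 8 hours, 24780 iterations wanted
estimate_hours 4050 480 24780    # prints: 8.4 it/min, 48.9 hours
```

Called with a total of 1000 iterations instead, the same function reports about 2.0 hours at this rate.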
ACT II: Depression Sets In
Since I wanted to get these runs done in a timely fashion, I ruled out any significant code changes. For instance, modifying the code to write a restart file would be too time-consuming. Writing a parallel version of the code using MPI would be a serious time sink, not to mention that these types of codes (many, many iterations on relatively small grids) are not the best candidates for message-passing algorithms.
I thus turned to the compiler to help with my dilemma. At this point, my compiler flags looked like this:
LDFLAGS = -O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64
FFLAGS = -O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64 -qmaxmem=-1
FFLAGS_1 = -O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64 -qmaxmem=-1 -qsuffix=f=f90
Here FFLAGS applies to the .f files and FFLAGS_1 to the .f90 files. I already had the code tuned for the architecture and at the highest level of optimization provided.
My next alternative was the IBM compiler's built-in auto-parallelization. Not having had much luck with this in the past, I was not optimistic. Sure enough, simply adding the -qsmp=auto switch to my compiler flags and setting the environment variable OMP_NUM_THREADS=8 in my LoadLeveler script bought me nothing. I was still getting roughly 8.3 iterations per minute.
To facilitate testing, I reduced the run to only 1000 iterations, or approximately 2 hours of run time.
ACT III: Relying on Moore
Next I had what I thought was a brilliant idea! We've got these new POWER5 nodes on Iceflyer. Maybe that'll work: make Moore's law work for me by using a bigger/better/faster machine! So, that's what I did. I moved the code to Iceflyer and compiled it with some slightly modified flags:
LDFLAGS = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64
FFLAGS = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1
FFLAGS_1 = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1 -qsuffix=f=f90
Well, I got what Moore said I should expect: a bit less than a 2 times speedup, bringing the time down to 66 minutes for 1000 iterations. Taking advantage of the 16-hour limit in the p5 queue, I re-ran the entire simulation only to have the job time out after completing 15000 iterations. Close, but no cigar.
What followed were many failed attempts at slight variations. I tried auto-parallelization using 4 or 8 threads. Iterations per minute improved from 15.15 (serial) to 15.38 (4 threads) to 15.63 (8 threads). Once again, virtually no gains were realized from auto-parallelization.
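To put those rates in perspective, the sketch below computes speedup and parallel efficiency (speedup divided by thread count) from the figures quoted above. The function name speedup is hypothetical, and efficiency is a standard measure brought in here for illustration rather than one used in the original analysis.

```shell
#!/bin/sh
# Hypothetical helper: compare a threaded rate against the serial rate.
# Arguments: serial rate (it/min), threaded rate (it/min), thread count.
speedup() {
    awk -v base="$1" -v rate="$2" -v threads="$3" 'BEGIN {
        s = rate / base                      # speedup over the serial run
        printf "%d threads: speedup %.2f, efficiency %.0f%%\n", threads, s, 100 * s / threads
    }'
}

speedup 15.15 15.38 4    # prints: 4 threads: speedup 1.02, efficiency 25%
speedup 15.15 15.63 8    # prints: 8 threads: speedup 1.03, efficiency 13%
```

An efficiency near 100% would mean every thread is doing useful work; values this low are consistent with the loops that matter not being parallelized at all.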
I also tried AIX 5.3. Since it has support for simultaneous multi-threading, I could use up to 16 threads on a single 8-processor node. Alas, the times were pretty much exactly the same as on the AIX 5.2 nodes.
ACT IV: Take Your Own Advice
Finally, I took the advice that I give in all of my classes: start by profiling your code, and see what kind of optimizations are possible. I changed my compilation flags a bit, adding -pg -g to turn on profiling with symbol tables and adding -qreport -qsource -qlist to get full compilation reports for the code:
LDFLAGS = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qsmp=auto -qreport \
-qsource -qlist -pg -g -qfullpath
FFLAGS = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1 -qsmp=auto \
-qreport -qsource -qlist -pg -g -qfullpath
FFLAGS_1 = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1 \
-qsuffix=f=f90 -qsmp=auto -qreport -qsource -qlist -pg \
-g -qfullpath
When the run of this build was complete, I had my gmon.out trace file, which I readily processed through gprof using:
% gprof > gprof_pre_opt.out
Wading through the nearly 2000 lines of output, I found (somewhere near the bottom of the file) the following report:
Time: 3891.28 seconds

  %   cumulative    self                 self     total
 time   seconds   seconds      calls   ms/call   ms/call  name
 90.9   3536.56   3536.56       1001   3533.03   3533.03  .momt_s [3]
  5.2   3738.03    201.47       1001    201.27    201.27  .mass_s [4]
  1.1   3782.05     44.02       1001     43.98     43.98  .change [5]
  0.7   3810.05     28.00       1001     27.97     27.97  .minmax [6]
  0.4   3825.69     15.64  144108020      0.00      0.00  ._log [11]
  0.3   3836.97     11.28  172929624      0.00      0.00  .scalb [12]
  0.3   3847.70     10.72                                 .__mcount [13]
  0.2   3856.30      8.60   86464812      0.00      0.00  ._atan2 [10]
  0.1   3860.81      4.51                                 ._power_logb [15]
  0.1   3865.25      4.44   28821604      0.00      0.00  .udcal [14]
  0.1   3868.66      3.41          1   3410.00  25943.51  .deform_smylie [8]
  0.1   3871.34      2.68       1001      2.68      2.68  .open [16]
  0.1   3873.37      2.03                                 .memmove [17]
  0.1   3875.37      2.00                                 ._cosh [18]
  .
  .
  .
So, 90.9% of the execution time was being spent in the routine momt_s, which calculates momentum in spherical coordinates. I next looked at the .lst file created during compilation. Since this file had over 98,000 lines in it, I searched for momt_s and found:
>>>>> SOURCE SECTION <<<<<
  686 |
  687 |
  688 | !-----------------------------------------------------------------
  689 | subroutine momt_s (l,layid)
  690 | ! ....Solve momentum equation (linear) in spherical coord.
  691 | ! layid = 1, outest layer
  692 | ! otherwise, inner layer
  693 | !-----------------------------------------------------------------
  694 | use layer_params
  695 | type (layer) :: l
  696 | integer :: layid
  697 | ! real z(ix,jy,2),m(ix,jy,2),n(ix,jy,2),h(ix,jy)
  698 | ! real r2(ix,jy),r3(ix,jy),r4(ix,jy),r5(ix,jy)
  699 | data eps/1.0e-6/, zero/0.0/, twlvth/0.08333333333333/
  700 | !
  701 | ixm1 = l%ix-1
  702 | jym1 = l%jy-1
  703 | is = 2
  704 | js = 2
  705 | if (layid .eq. 1) then
  706 |   is = 1
  707 |   js = 1
  708 | end if
  709 | do i=is,ixm1
  710 |   ip1 = i+1
  711 |   do j=js,l%jy
  712 |     if ((l%h(i,j).gt.zero) .and. (l%h(ip1,j).gt.zero)) then
  713 |       if (j .le. js) then
  714 |         jm1 = js
  715 |       else
  716 |         jm1 = j-1
  717 |       endif
  718 |       if (j .ge. l%jy) then
  719 |         jp1 = l%jy
  720 |       else
  721 |         jp1 = j+1
  722 |       endif
  723 |       tot_n = l%n(i,j,1)+l%n(ip1,j,1)+l%n(i,jm1,1)+ &
      |               l%n(ip1,jm1,1)
  724 |       xm = l%m(i,j,1)-l%r2(i,j)*(l%z(ip1,j,2)-l%z(i,j,2))+ &
      |            l%r3(i,j)*tot_n-l%r2(i,j)*twlvth*((l%z(ip1,jp1,2)- &
      |            2*l%z(ip1,j,2)+l%z(ip1,jm1,2))-(l%z(i,jp1,2)-2* &
      |            l%z(i,j,2)+l%z(i,jm1,2)))
  725 |       if (abs(xm) .lt. eps) xm = zero
  726 |       l%m(i,j,2) = xm
  727 |     else
  728 |       l%m(i,j,2) = 0.0
  729 |     end if
  730 |   end do
  731 | end do
  732 | !
  733 | do j=js,jym1
  734 |   jp1 = j+1
  735 |   do i=is,l%ix
  736 |     if ((l%h(i,j).gt.zero) .and. (l%h(i,jp1).gt.zero)) then
  737 |       if (i .le. is) then
  738 |         im1 = is
  739 |       else
  740 |         im1 = i-1
  741 |       endif
  742 |       if (i .ge. l%ix) then
  743 |         ip1 = l%ix
  744 |       else
  745 |         ip1 = i+1
  746 |       endif
  747 |       tot_m = l%m(im1,j,1)+l%m(im1,jp1,1)+l%m(i,j,1)+ &
      |               l%m(i,jp1,1)
  748 |       xn = l%n(i,j,1)-l%r4(i,j)*(l%z(i,jp1,2)-l%z(i,j,2))- &
      |            l%r5(i,j)*tot_m-l%r5(i,j)*twlvth*((l%z(ip1,jp1,2)- &
      |            2*l%z(i,jp1,2)+l%z(im1,jp1,2))-(l%z(ip1,j,2)-2* &
      |            l%z(i,j,2)+l%z(im1,j,2)))
  749 |       if (abs(xn) .lt. eps) xn = zero
  750 |       l%n(i,j,2) = xn
  751 |     else
  752 |       l%n(i,j,2) = 0.0
  753 |     end if
  754 |   end do
  755 | end do
  756 | !
  757 | return
  758 | end
** momt_s === End of Compilation 12 ===
Source   Source   Loop Id  Action / Information
File     Line
-------- -------- -------  ----------------------------------------------
       0      709       1  Loop cannot be automatically parallelized. A
                           dependency is carried by variable aliasing or
                           function call.
       0      711       2  Loop cannot be automatically parallelized. A
                           dependency is carried by variable aliasing or
                           function call.
       0      733       3  Loop cannot be automatically parallelized. A
                           dependency is carried by variable aliasing or
                           function call.
       0      735       4  Loop cannot be automatically parallelized. A
                           dependency is carried by variable aliasing or
                           function call.
So 90.9% of my run time is spent in a routine that the compiler will not automatically parallelize for me. What to do...
...don't miss the thrilling conclusion in the next newsletter:
| ACT V | Directives Save The Day |
| ACT VI | Conclusions of the Super-Linear Kind |
Java for Fortran Programmers: Part I
[[ Thanks to Lee Higbie of ARSC for this tutorial. ]]
This is the first in a series of articles, presented as a tutorial, for scientists and engineers. Some knowledge of C is useful, but I will not assume that you know C++ or any other object oriented language.
My planned tutorial outline is:
- Java's Uses for the Scientific and Engineering Community
- Object Oriented Programming (OOP)
- How the OOP mindset differs from that usual for Fortran programmers
- How the OOP syntax differs from that of Fortran and C
- Interfacing Java and Fortran programs
- Creating Java programs
- Example
How far and deep I go will depend on feedback. If this topic interests you, let me or one of the editors know!
This initial part of the tutorial is expected to interest new scientific and engineering programmers or programming managers, those considering a new project and wondering if Java might be a good choice. After this initial background, the material will become more technical and should interest programmers who are starting to learn Java or have picked up a little in the past.
Java's Uses for the Scientific and Engineering Community
Java is easy to use, but it has a steep learning curve if you've never used an object oriented programming language. OOPs require a different mindset from that for imperative languages (like Fortran and C). Unlike C++, where it is easy to write a conventional (imperative) program by only using C, Java is more aggressively object oriented--even HelloWorld uses an object.
In our world Java is especially suited for GUIs and support programs and I doubt I'll see it used for a major, computation-intensive application. Though unsuited for heavy computational work, Java is a well designed OO language with many good features. Some are:
- It includes an automatic documentation system. Stylized comments can be used to describe parts of a code and the documentation is automatically generated from them and the code.
- There are several large libraries of GUI widgets that allow control programs to interact visually with users.
- It is highly portable. With minimal care applications can be written that will run on most platforms.
- It was designed from the beginning for applets, programs that run in web browsers. An applet allows the user to safely run a program from a workstation.
- It has a built-in structure for creating and vetting exceptional conditions. A method can create an exception and force its users to deal with the exception.
- It has built in functionality and syntax to eliminate many of the problems that crop up in C++ programs (memory leaks, wandering pointers, weak typing, implicit type conversions, ...)
(I've measured a 4:1 slowdown when doing simple array computations in Java instead of Fortran. Carbon-based systems[*] react slowly so it works well for interacting with them.)
Object Oriented Programming (OOP)
So what is an object oriented programming language? The four defining characteristics of OOPs are:
- Encapsulation. A single block of code, called a class, defines a data structure and the procedures for operating on it, called methods. Classes often include methods and variables that are hidden from users, which facilitates changing algorithms or code without users of the class knowing about it.
- Inheritance. A class can include the methods and variables of a parent class. This is especially useful for libraries and is an important concept to understand. For example, PopupMenu extends Menu extends MenuItem extends MenuComponent extends Object. This means that the methods for adding items to a PopupMenu are not recoded but are taken exactly from Menu, the event handling methods of MenuItem are directly inherited by any PopupMenu, and so on.
- Polymorphism. Methods (functions) can be called with a variety of arguments. The number and type of arguments are not constrained. In object-oriented languages it is common for:
  - a method to set some default parameters, then call the general version of the method
  - inherited methods (methods from a class being extended) to provide variants that accept different arguments
  - a method to convert the argument types and call the general version of the method
- The basic unit of code is a class, which encapsulates a data structure and the methods for working with it. For example, the String class includes almost two score methods of its own, it inherits another half dozen from Object, the ultimate parent of all classes, and has polymorphic variants on many of these methods. The emphasis of an OOP is on the classes and their data, not on the flow of logic or control.
There is one more bit of basic OOP terminology that is needed to discuss OOP programs. As mentioned, a class is the code that describes a data structure and includes the methods (functions) for operating on it. The actual data structure is called an object, but don't confuse this with the class Object (upper case oh), which is the ultimate parent of all Java classes. Just as you might have dozens of strings in a program, in an OOP you may have dozens of string objects, each of which is an instance of the String class for a Java application.
So, how does Java measure up? It has all of these characteristics but also has basic, non-object data. Logical, integer (of several sizes), floating-point, and character data are available, which facilitates basic imperative programming. In Part II, I will provide an example to illustrate the basic parts of code.
This article has described some of the places where scientific and engineering programmers might apply Java in their work. I have introduced the top level of OOP terminology. I'll recap with a dictionary translating the Fortran terminology used here to Java.
| Fortran term | Java term | Explanation |
|---|---|---|
| function | method | parameters passed by value, polymorphism rampant |
| structure declaration | class | class includes code, usually one to a file |
| structured variable | object (small oh) | object also owns all its class's methods |
| subroutine | method with void type | (no returned value) |
| type conversion | cast | syntax--use type in parens: x = (float) i; |
This article has covered the first two tutorial topics. We'll pick it up again with:
- How the OOP mindset differs from that usual for Fortran programmers
--
[*] Footnote: "Carbon-based systems": a euphemism for people. Those unfamiliar with this term are referred to Star Trek, where, I think, the Borg referred to the astronauts as a carbon-based infestation.
Quick-Tip Q & A
A:[[ I am writing a script which looks at the extension of a file.
[[ So far I'm not too committed to a particular scripting language.
[[ Is there an easy way to get the extension of a file without
[[ using sed?
#
# Lorin Hochstein
#
In tcsh, the ":e" variable modifier will extract the extension of a
file. Also useful: the ":r" modifier will extract the name without the
extension.
$ set x="filename.txt"
$ echo $x:e
txt
$ echo $x:r
filename
#
# Harper Simmons
#
using csh/tcsh (I know, I know, uncool)
set a = roo.dat
set ext = $a:e
echo $ext
produces "dat"
#
# Ryan Czerwiec
#
For csh/tcsh this will work (there will be a similar answer for
sh/bash/ksh):
If your filename is stored in the variable "file,"
then the extension "ext" can be obtained with:
set ext = `echo $file | tr "." " "`
This will create an array where the extension is the last element,
or ext[$#ext]. This can also be useful if you need to reassemble the
filename with a different extension, for example.
This version uses less memory (it doesn't create an array), but it's
a little slower:
set ext = `echo $file | tr "." "\n" | tail -1`
You can do it a little more simply if you happen to know that all of
your filenames will have the same number of "." characters in them:
set ext = `echo $file | cut -d'.' -f2`
where the example of -f2 is for a file with one "." character. Use a
number one higher than the number of dots as long as that number is
fixed (you can use a variable for it, too, as in -f$num).
#
# One Editor:
#
You can use "expr" regular expressions. E.g.:
$ expr this.is.a.test : ".*\.\(.*\)"
test
#
# Other Editor:
#
I would use one of the bash pattern matching operators to do this.
${val##pattern}
This operator does the following: If pattern matches the beginning
of the variable $val it deletes the longest part that matches then
returns the rest of the string.
So the following pattern will return the extension as long as there
is at least one dot in the filename.
${val##*.}
If the filename might not have a dot in it, we can check for that
using grep:
for f in *; do
  if [ ! -z "$(echo $f | grep "\." )" ]; then
    echo ${f##*.};
  fi
done
Alternately, we can eliminate the grep by ensuring there is a dot
in the filename. E.g.:
for f in *.*; do
echo ${f##*.};
done
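A further variation, offered as a sketch rather than as one of the submitted answers: a case statement can test for the dot without calling grep and without restricting the glob. The function name get_ext is made up for illustration. It relies only on the ${var##pattern} operator described above and should behave the same in sh, ksh, and bash.

```shell
#!/bin/sh
# Hypothetical helper: print the extension of a filename,
# or an empty line when the name contains no dot.
get_ext() {
    case "$1" in
        *.*) printf '%s\n' "${1##*.}" ;;   # strip through the last dot
        *)   printf '\n' ;;                # no dot, so no extension
    esac
}

get_ext this.is.a.test    # prints: test
get_ext Makefile          # prints an empty line
```

Note that a name beginning with a dot, such as .cshrc, is treated as having the extension "cshrc"; add another case pattern if that is not what you want.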
Q: Here's a conditional statement grabbed from the (/bin/sh)
configure script for mysql. There are many like this:
if test X"$mysql_cv_compress" != Xyes; then
# ...do stuff...
fi
For my scripts, the following style has always worked:
if [[ $mysql_cv_compress != yes ]]; then
# ...do stuff...
fi
So, two questions:
1) Why would the experts use "test" rather than the square bracket
syntax?
2) Why bother with that "X" ???
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
