Once again, ARSC users have a tremendous variety and depth of training opportunities. Here are the topics this fall:
ARSC user training is offered in conjunction with "Core Skills For Computational Science," taught jointly by the UAF Physics Department and ARSC.
This *IS* ARSC's Fall User Training. You are encouraged to drop in on any lecture of interest. Here's the complete training schedule:
http://people.arsc.edu/~cskills/schedule.shtml
And here's the primary training web site:
http://people.arsc.edu/~cskills/
Contact Tom Logan (logan AT arsc.edu) with questions.
[ Many thanks to Tom Logan of ARSC for this two-part thriller! ]
ACT I     Analysis Shows I'm Doomed
ACT II    Depression Sets In
ACT III   Relying on Moore
ACT IV    Take Your Own Advice
ACT V     Directives Save The Day
ACT VI    Conclusions of the Super-Linear Kind
ACT I: Analysis Shows I'm Doomed
I was recently faced with a problem that often comes up in the scientific computing realm: the tsunami model that I was working with was too slow. I needed to run a job for 24780 iterations (time steps). Not realizing this was about 10 times longer than any of the test runs previously made, I started the job up on one p655 node on Iceberg (the code is serial) and waited for the results.
What I got was very discouraging. In the eight hours allowed in the standard queue on Iceberg, the code only completed 4050 iterations. This worked out to about 8.4 iterations per minute. A quick calculation showed me that I was most certainly doomed, since the full run would take roughly 49 hours to complete and, while the "single" queues at ARSC would permit such a long run, it would be impractical for the desired test and production work.
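The "quick calculation" can be reproduced with a short shell sketch. The function name project_hours is ours and purely illustrative; awk does the arithmetic because plain shell arithmetic is integer-only. The inputs are the figures quoted above: 4050 iterations in 480 minutes, with 24780 iterations needed.

```shell
#!/bin/sh
# Sketch: project the full run time from a partial run.
# project_hours is an illustrative name, not part of the tsunami model.

# project_hours ITERS_DONE MINUTES_USED ITERS_NEEDED
# Prints "<iterations per minute> <projected hours for the full run>".
project_hours() {
    awk -v d="$1" -v m="$2" -v n="$3" 'BEGIN {
        rate = d / m                      # iterations per minute
        printf "%.1f %.0f\n", rate, n / rate / 60
    }'
}

project_hours 4050 480 24780   # prints "8.4 49"
```

The same function can be reused whenever a test run times out: feed it the iterations completed, the wall-clock minutes used, and the iterations required.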
ACT II: Depression Sets In
Since I wanted to get these runs done in a timely fashion, I ruled out any significant code changes. For instance, trying to modify the code to write a restart file would be too time consuming. Writing a parallel version of the code using MPI would be a serious time sink, not to mention that these types of codes (many many iterations on relatively small grids) are not the best candidates for message passing algorithms.
I thus turned to the compiler to help with my dilemma. At this point, my compiler flags looked like this:
LDFLAGS  = -O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64
FFLAGS   = -O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64 -qmaxmem=-1
FFLAGS_1 = -O5 -qarch=pwr4 -qtune=pwr4 -qstrict -q64 -qmaxmem=-1 -qsuffix=f=f90
Here FFLAGS applies to the .f files and FFLAGS_1 to the .f90 files. The code was already tuned for the architecture and compiled at the highest optimization level the compiler provides.
My next alternative was to try the IBM compiler's built-in auto-parallelization. Not having had much luck with this in the past, I was not optimistic. Sure enough, simply adding the -qsmp=auto switch to my compiler flags and setting the environment variable OMP_NUM_THREADS=8 in my LoadLeveler script bought me nothing. I was still getting roughly 8.3 iterations per minute.
To facilitate testing, I reduced the run to only 1000 iterations, or approximately 2 hours of run time.
ACT III: Relying on Moore
Next I had what I thought was a brilliant idea! We've got these new POWER5 nodes on Iceflyer. Maybe that would work: make Moore's law work for me by using a bigger/better/faster machine! So, that's what I did. I moved the code to Iceflyer and compiled it with some slightly modified flags:
LDFLAGS  = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64
FFLAGS   = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1
FFLAGS_1 = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1 -qsuffix=f=f90
Well, I got what Moore said I should expect. A bit less than a 2 times speedup, bringing the time down to 66 minutes for 1,000 iterations. Taking advantage of the 16 hour queue time in the p5 queue, I re-ran the entire simulation only to have the job time out after completing 15000 iterations. Close, but no cigar.
What followed were many failed attempts at slight variations. I tried auto-parallelization using 4 or 8 threads. Iterations per minute improved from 15.15 (serial) to 15.38 (4 threads) to 15.63 (8 threads). Once again, virtually no gains were realized from auto-parallelization.
I also tried AIX 5.3. Since it has support for simultaneous multi-threading, I could use up to 16 threads on a single 8-processor node. Alas, the times were pretty much exactly the same as on the AIX 5.2 nodes.
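To put "virtually no gains" in numbers, here is a small sketch that turns the iteration rates quoted above into speedup and parallel efficiency (speedup divided by thread count). The function name speedup is ours; the rates are the measured values from the text.

```shell
#!/bin/sh
# Sketch: speedup and parallel efficiency from measured iteration rates.
# speedup is an illustrative helper name.

# speedup SERIAL_RATE PARALLEL_RATE THREADS
# Prints "<speedup> <parallel efficiency in percent>".
speedup() {
    awk -v s="$1" -v p="$2" -v t="$3" 'BEGIN {
        sp = p / s                        # ratio of parallel to serial rate
        printf "%.2f %.0f\n", sp, 100 * sp / t
    }'
}

speedup 15.15 15.38 4   # prints "1.02 25"
speedup 15.15 15.63 8   # prints "1.03 13"
```

An efficiency near 100/THREADS percent, as here, means the extra threads are contributing almost nothing.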
ACT IV: Take Your Own Advice
Finally, I took the advice that I give in all of my classes: start by profiling your code, and see what kinds of optimizations are possible. I changed my compilation flags a bit, adding -pg -g to turn on profiling with symbol tables and adding -qreport -qsource -qlist to get full compilation reports for the code:
LDFLAGS = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qsmp=auto -qreport \
-qsource -qlist -pg -g -qfullpath
FFLAGS = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1 -qsmp=auto \
-qreport -qsource -qlist -pg -g -qfullpath
FFLAGS_1 = -O5 -qarch=pwr5 -qtune=pwr5 -qstrict -q64 -qmaxmem=-1 \
-qsuffix=f=f90 -qsmp=auto -qreport -qsource -qlist -pg \
-g -qfullpath
When the run of this compilation was complete, I had my gmon.out trace file, which I readily processed through gprof using:
% gprof > gprof_pre_opt.out
Wading through the nearly 2000 lines of output, I found (somewhere near the bottom of the file) the following report:
Time: 3891.28 seconds

    % cumulative     self                 self     total
 time    seconds  seconds      calls   ms/call   ms/call  name
 90.9    3536.56  3536.56       1001   3533.03   3533.03  .momt_s [3]
  5.2    3738.03   201.47       1001    201.27    201.27  .mass_s [4]
  1.1    3782.05    44.02       1001     43.98     43.98  .change [5]
  0.7    3810.05    28.00       1001     27.97     27.97  .minmax [6]
  0.4    3825.69    15.64  144108020      0.00      0.00  ._log [11]
  0.3    3836.97    11.28  172929624      0.00      0.00  .scalb [12]
  0.3    3847.70    10.72                                 .__mcount [13]
  0.2    3856.30     8.60   86464812      0.00      0.00  ._atan2 [10]
  0.1    3860.81     4.51                                 ._power_logb [15]
  0.1    3865.25     4.44   28821604      0.00      0.00  .udcal [14]
  0.1    3868.66     3.41          1   3410.00  25943.51  .deform_smylie [8]
  0.1    3871.34     2.68       1001      2.68      2.68  .open [16]
  0.1    3873.37     2.03                                 .memmove [17]
  0.1    3875.37     2.00                                 ._cosh [18]
  . . .
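Rather than wading through the whole report by eye, the flat-profile rows can be filtered with awk. The sketch below is ours and makes two assumptions: rows look like the ones in the report above (a numeric first field and a trailing "[index]" field), and only the flat-profile section is fed in, since call-graph rows can have a similar shape. The name hot_routines is hypothetical.

```shell
#!/bin/sh
# Sketch: list routines whose share of run time is at least MIN_PERCENT.
# Assumes stdin holds only the flat-profile section of the gprof output.

# hot_routines MIN_PERCENT
# Prints "<name> <percent>" for each qualifying row.
hot_routines() {
    awk -v min="$1" '
        # A data row has a numeric %time first and "[n]" last;
        # the routine name is the field just before "[n]".
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ && $NF ~ /^\[[0-9]+\]$/ && $1 + 0 >= min {
            print $(NF - 1), $1
        }'
}

# Sample rows copied from the report above.
hot_routines 1.0 <<'EOF'
Time: 3891.28 seconds
 90.9 3536.56 3536.56 1001 3533.03 3533.03 .momt_s [3]
  5.2 3738.03  201.47 1001  201.27  201.27 .mass_s [4]
  1.1 3782.05   44.02 1001   43.98   43.98 .change [5]
  0.7 3810.05   28.00 1001   27.97   27.97 .minmax [6]
  0.3 3847.70   10.72 .__mcount [13]
EOF
```

With a 1.0 percent threshold the sample prints .momt_s, .mass_s, and .change with their percentages, one per line.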
So, 90.9% of the execution time was being spent in the routine momt_s, which calculates momentum in spherical coordinates. I next looked at the .lst file created during compilation. Since this file had over 98,000 lines in it, I searched for momt_s and found:
>>>>> SOURCE SECTION <<<<<
686 |
687 |
688 |!-----------------------------------------------------------------
689 | subroutine momt_s (l,layid)
690 |! ....Solve momentum equation (linear) in spherical coord.
691 |! layid = 1, outest layer
692 |! otherwise, inner layer
693 |!-----------------------------------------------------------------
694 | use layer_params
695 | type (layer) :: l
696 | integer :: layid
697 |! real z(ix,jy,2),m(ix,jy,2),n(ix,jy,2),h(ix,jy)
698 |! real r2(ix,jy),r3(ix,jy),r4(ix,jy),r5(ix,jy)
699 | data eps/1.0e-6/, zero/0.0/, twlvth/0.08333333333333/
700 |!
701 | ixm1 = l%ix-1
702 | jym1 = l%jy-1
703 | is = 2
704 | js = 2
705 | if (layid .eq. 1) then
706 | is = 1
707 | js = 1
708 | end if
709 | do i=is,ixm1
710 | ip1 = i+1
711 | do j=js,l%jy
712 | if ((l%h(i,j).gt.zero) .and. (l%h(ip1,j).gt.zero)) then
713 | if (j .le. js) then
714 | jm1 = js
715 | else
716 | jm1 = j-1
717 | endif
718 | if (j .ge. l%jy) then
719 | jp1 = l%jy
720 | else
721 | jp1 = j+1
722 | endif
723 | tot_n = l%n(i,j,1)+l%n(ip1,j,1)+l%n(i,jm1,1)+ &
l%n(ip1,jm1,1)
724 | xm = l%m(i,j,1)-l%r2(i,j)*(l%z(ip1,j,2)-l%z(i,j,2))+ &
l%r3(i,j)*tot_n-l%r2(i,j)*twlvth*((l%z(ip1,jp1,2)- &
2*l%z(ip1,j,2)+l%z(ip1,jm1,2))-(l%z(i,jp1,2)-2* &
l%z(i,j,2)+l%z(i,jm1,2)))
725 | if (abs(xm) .lt. eps) xm = zero
726 | l%m(i,j,2) = xm
727 | else
728 | l%m(i,j,2) = 0.0
729 | end if
730 | end do
731 | end do
732 |!
733 | do j=js,jym1
734 | jp1 = j+1
735 | do i=is,l%ix
736 | if ((l%h(i,j).gt.zero) .and. (l%h(i,jp1).gt.zero)) then
737 | if (i .le. is) then
738 | im1 = is
739 | else
740 | im1 = i-1
741 | endif
742 | if (i .ge. l%ix) then
743 | ip1 = l%ix
744 | else
745 | ip1 = i+1
746 | endif
747 | tot_m = l%m(im1,j,1)+l%m(im1,jp1,1)+l%m(i,j,1)+ &
l%m(i,jp1,1)
748 | xn = l%n(i,j,1)-l%r4(i,j)*(l%z(i,jp1,2)-l%z(i,j,2))- &
l%r5(i,j)*tot_m-l%r5(i,j)*twlvth*((l%z(ip1,jp1,2)- &
2*l%z(i,jp1,2)+l%z(im1,jp1,2))-(l%z(ip1,j,2)-2* &
l%z(i,j,2)+l%z(im1,j,2)))
749 | if (abs(xn) .lt. eps) xn = zero
750 | l%n(i,j,2) = xn
751 | else
752 | l%n(i,j,2) = 0.0
753 | end if
754 | end do
755 | end do
756 |!
757 | return
758 | end
** momt_s === End of Compilation 12 ===
Source Source Loop Id Action / Information
File Line
-------- -------- ------- ----------------------------------------------
0 709 1 Loop cannot be automatically parallelized. A
dependency is carried by variable aliasing or
function call.
0 711 2 Loop cannot be automatically parallelized. A
dependency is carried by variable aliasing or
function call.
0 733 3 Loop cannot be automatically parallelized. A
dependency is carried by variable aliasing or
function call.
0 735 4 Loop cannot be automatically parallelized. A
dependency is carried by variable aliasing or
function call.
So 90.9% of my run time is spent in a routine that the compiler will not automatically parallelize for me. What to do...
...don't miss the thrilling conclusion in the next newsletter:
ACT V     Directives Save The Day
ACT VI    Conclusions of the Super-Linear Kind
[[ Thanks to Lee Higbie of ARSC for this tutorial. ]]
This is the first in a series of articles, presented as a tutorial, for scientists and engineers. Some knowledge of C is useful, but I will not assume that you know C++ or any other object oriented language.
My planned tutorial outline is:
How far and deep I go will depend on feedback. If this topic interests you, let me or one of the editors know!
This initial part of the tutorial is expected to interest new scientific and engineering programmers or programming managers, those considering a new project and wondering if Java might be a good choice. After this initial background, the material will become more technical and should interest programmers who are starting to learn Java or have picked up a little in the past.
Java's Uses for the Scientific and Engineering Community
Java is easy to use, but it has a steep learning curve if you've never used an object oriented programming language. OOPs require a different mindset from that for imperative languages (like Fortran and C). Unlike C++, where it is easy to write a conventional (imperative) program by only using C, Java is more aggressively object oriented--even HelloWorld uses an object.
In our world Java is especially suited for GUIs and support programs and I doubt I'll see it used for a major, computation-intensive application. Though unsuited for heavy computational work, Java is a well designed OO language with many good features. Some are:
(I've measured a 4:1 slowdown when doing simple array computations in Java instead of Fortran. Carbon-based systems[*] react slowly so it works well for interacting with them.)
Object Oriented Programming (OOP)
So what is an object oriented programming language? The four defining characteristics of OOPs are:
There is one more bit of basic OOP terminology that is needed to discuss OOP programs. As mentioned, a class is the code that describes a data structure and includes the methods (functions) for operating on it. The actual data structure is called an object, but don't confuse this with the class Object (upper case oh), which is the ultimate parent of all Java classes. Just as you might have dozens of strings in a program, in an OOP you may have dozens of string objects, each of which is an instance of the String class for a Java application.
So, how does Java measure up? It has all of these characteristics but also has basic, non-object data. Logical, integer (in several sizes), floating-point, and character data are available, which makes basic imperative programming straightforward. In Part II, I will provide an example to illustrate the basic parts of code.
This article has described some of the places where scientific and engineering programmers might apply Java in their work. I have introduced the top level of OOP terminology. I'll recap with a dictionary translating the Fortran terminology used here to Java.
This article has covered the first two tutorial topics. We'll pick it up again with:
Fortran term           Java term              Explanation
function               method                 parameters passed by value, polymorphism rampant
structure declaration  class                  class includes code, usually one to a file
structured variable    object (small oh)      object also owns all its class's methods
subroutine             method with void type  (no returned value)
type conversion        cast                   syntax--use type in parens: x = (double) i;
--
[*] Footnote: "Carbon-based systems": a euphemism for people. Those unfamiliar with this term are referred to Star Trek, where, I think, the Borg referred to the astronauts as a carbon-based infestation.
A:[[ I am writing a script which looks at the extension of a file.
[[ So far I'm not too committed to a particular scripting language.
[[ Is there an easy way to get the extension of a file without
[[ using sed?
#
# Lorin Hochstein
#
In tcsh, the ":e" variable modifier will extract the extension of a
file. Also useful: the ":r" modifier will extract the name without the
extension.
$ set x="filename.txt"
$ echo $x:e
txt
$ echo $x:r
filename
#
# Harper Simmons
#
using csh/tcsh (I know, I know, uncool)
set a = roo.dat
set ext = $a:e
echo $ext
produces "dat"
#
# Ryan Czerwiec
#
For csh/tcsh this will work (there will be a similar answer for
sh/bash/ksh):
If your filename is stored in the variable "file,"
then the extension "ext" can be obtained with:
set ext = `echo $file | tr "." " "`
This will create an array where the extension is the last element,
or ext[$#ext]. This can also be useful if you need to reassemble the
filename with a different extension, for example.
This version uses less memory (it doesn't create an array), but it's
a little slower:
set ext = `echo $file | tr "." "\n" | tail -1`
You can do it a little more simply if you happen to know that all of
your filenames will have the same number of "." characters in them:
set ext = `echo $file | cut -d'.' -f2`
where the example of -f2 is for a file with one "." character. Use a
number one higher than the number of dots as long as that number is
fixed (you can use a variable for it, too, as in -f$num).
#
# One Editor:
#
You can use "expr" regular expressions. E.g.:
$ expr this.is.a.test : ".*\.\(.*\)"
test
#
# Other Editor:
#
I would use one of the bash pattern matching operators to do this.
${val##pattern}
This operator does the following: If pattern matches the beginning
of the variable $val it deletes the longest part that matches then
returns the rest of the string.
So the following pattern will return the extension as long as there
is at least one dot in the filename.
${val##*.}
If the filename might not have a dot in it, we can check for that
using grep:
for f in *; do
  if [ ! -z "$(echo "$f" | grep "\." )" ]; then
    echo "${f##*.}";
  fi
done
Alternately, we can eliminate the grep by ensuring there is a dot
in the filename. E.g.:
for f in *.*; do
  echo "${f##*.}";
done
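As a further variation, the dot check and the extraction can be wrapped in one function using a case pattern, with no external commands. This is a sketch of ours, not one of the submitted answers; the name file_ext is hypothetical. It relies only on the ${val##pattern} operator described above, which is also available in POSIX sh and ksh.

```shell
#!/bin/sh
# Sketch: print a filename's extension, or an empty line if it has no dot.
# file_ext is an illustrative name.

# file_ext NAME
file_ext() {
    case "$1" in
        *.*) printf '%s\n' "${1##*.}" ;;   # strip everything through the last dot
        *)   printf '\n' ;;                # no dot, so no extension
    esac
}

file_ext this.is.a.test   # prints "test"
file_ext roo.dat          # prints "dat"
file_ext Makefile         # prints an empty line
```

Note that a dot anywhere in the argument counts: a hidden file such as .cshrc reports "cshrc", and a dot in a directory name will confuse it, so pass the bare filename rather than a full path.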
Q: Here's a conditional statement grabbed from the (/bin/sh)
configure script for mysql. There are many like this:
if test X"$mysql_cv_compress" != Xyes; then
# ...do stuff...
fi
For my scripts, the following style has always worked:
if [[ $mysql_cv_compress != yes ]]; then
# ...do stuff...
fi
So, two questions:
1) Why would the experts use "test" rather than the square bracket
syntax?
2) Why bother with that "X" ???
[[ Answers, Questions, and Tips Graciously Accepted ]]
Contact:
Ed Kornkven
ARSC HPC Specialist
ph: 907-450-8669

Craig Stephenson
ARSC User Consultant
ph: 907-450-8653

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Send comments and questions to the current editors using this Contact Form.