As you may have noticed, this is the 250th issue of the ARSC Users' Newsletter! We dispensed our first issue nearly 8 years ago, on August 25, 1994. All past issues are on-line, at:
http://www.arsc.edu/support/news/HPCnews.shtml
Past "Quick-Tips" are all indexed at:
http://www.arsc.edu/support/news/qtindex.xml
However... This might be YOUR first issue, since I just updated our mailing list with new ARSC users. (If you'd rather not receive this newsletter, unsubscribing is described at the bottom.)
IBM's massv library is available on icehawk for hand-tuning of vectorizable math intrinsics. XLF (with the options -qhot -O3) does a great job of detecting vectorizable operations and may actually do all the work for you. In either case, the vector intrinsics can really speed things up.
As a quick example, this loop:
      do ni = 1,nint
        loghaz(ni) = dlog(haz(ni))
      enddo
can (and should) be replaced by:
      call vlog (loghaz, haz, nint)
The vector intrinsics library is called "massv". ("V" for "vector" and "MASS" for "Mathematical Acceleration SubSystem".) There's also "massvp3," tuned for the Power3, and the basic "mass" library, for scalar operations.
On icehawk, these libraries are available in:
/usr/local/pkg/mass/current/lib
and linked to:
/usr/local/lib
A README is in:
/usr/local/pkg/mass/current/wrk
Here are excerpts from the README file: the list of vector math intrinsics, performance data, and explanations/caveats.
----------------------------------------------------------------------
                          Math Library Performance
                  (cycles per evaluation, length 1000 loop)
                       604E                   630                    P2SC
function range    libm  mass massv    libm  mass massv   vp3    libm  mass massv
vrec      D        32*          10      9*           6     4      8*           4
vsrec     D        18*           8      7*           5     3      8*           4
vdiv      D        32*          12      9*           7     5      9*           5
vsdiv     D        18*          10      7*           6     3      9*           4
vsqrt     C         67    48    16     11*           9     6     13*           7
vssqrt    C         70    48    10      7*           8     5     13*           5
vrsqrt    C         79    49    16     22*           9     6     22*           7
vsrsqrt   C         83    51     9     16*           7     4     22*           5
vexp      D         83    45    16      64    33     6            53    21     7
vsexp     E         85    44    13      68    36     5            58    21     6
vlog      C         99    56    20      83    53     8            67    35     8
vslog     C        102    56    17      86    57     7            66    37     7
vsin      B         50    29    11      36    16     5            34    17     5
vsin      D         79    59    27      60    43    12            50    37    12
vssin     B         51    26     8      39    18     4            40    16     4
vssin     D         79    58    20      62    46     9            56    38     9
vcos      B         51    26     9      37    16     4            34    17     4
vcos      D         75    59    27      58    43    12            51    36    11
vscos     B         52    26     7      39    18     3            40    16     3
vscos     D         76    59    20      61    46     9            56    37     9
vsincos   B        100    53    19      80    33     8            80    38     8
vsincos   D        151   116    29     123    92    12           111    81    12
vssincos  B        107    55    15      79    38     6            78    36     7
vssincos  D        159   118    24     125    98    10           110    80    10
vcosisin  B        104    55    19      78    34     8            79    37     8
vcosisin  D        156   118    29     123    93    12           111    81    12
vscosisin B        108    55    15      79    36     6            78    36     6
vscosisin D        160   119    23     125    95     9           110    79    10
vtan      D        136    74    32     111    52    19            90    38    13
vstan     D        136    74    32     113    56    19            95    39    12
vatan2    D        545   104    40     413    87    25           555    73    17
vsatan2   D        545   104    40     418    89    25           558    71    17
vdnint    D         37    22     7      24    12   3.4            23    13   2.7
vdint     D         36           6      22         2.8            21         2.6
massvp2
vidint D 4.0 2.7
vasin B 48 17
vacos B 49 17
vdfloat D 3.0 1.8
vdsign D 9 3.5
* indicates inline instructions timed (not a subroutine call)
Range Key           Processor   Cycle time        Dcache size
A =    0,   1       604E        3.0 nanoseconds    32 kilobytes
B =   -1,   1       630         5.0 nanoseconds    64 kilobytes
C =    0, 100       P2SC        7.4 nanoseconds   128 kilobytes
D = -100, 100
E =  -10,  10
"[This] data should be considered approximate. It was obtained by
timing many repetitions of a loop over 1,000 random arguments and
includes all overhead. Timing in this way will bring the input and
output vectors into the on-chip cache (the loop is short enough for
them to fit in cache). Performance may deteriorate seriously when the
input and output vectors are not in cache. Performance may also
deteriorate for arguments at or near the end-points of the valid
argument ranges."
----------------------------------------------------------------------
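To put the cycle counts in wall-clock terms, multiply by the processor's cycle time from the key above. The shell sketch below does this for vlog on the 630, reading that row as libm = 83 and massv = 8 cycles per evaluation (that is my reading of the columns, so treat it as an assumption). The helper name cycles_to_ns is made up for illustration.

```sh
#!/bin/sh
# cycles_to_ns CYCLES CYCLE_TIME_NS: time per evaluation in nanoseconds.
# (Hypothetical helper; the inputs below are read off the README table.)
cycles_to_ns() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f\n", c * t }'
}

# vlog on the 630 (5.0 ns cycle), assuming libm = 83 and massv = 8 cycles.
echo "libm  dlog: $(cycles_to_ns 83 5.0) ns per evaluation"
echo "massv vlog: $(cycles_to_ns 8 5.0) ns per evaluation"
```

As the README warns, these are in-cache figures, so real codes may see less.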
Example
For a more informative example, here's a routine from a simulation code
under evaluation at ARSC. (A driver for this routine was constructed
from the original code, and used for the timings given below.) The array
length of 2000 is a do-not-exceed size... in the current data set,
"nint" is only 352.
      integer ni,nint
      double precision beta,cumhaz,haps,haz,prit,rate,
     &                 total
      double precision time(2000)
      prit = 0d0
      total = 0d0
      do ni = 1,nint
        haps = dble(nint)-dble(ni)+2d0
        rate = haps*(haps-1d0)/2d0
        haz = rate*dexp(beta*(total+time(ni)))
        cumhaz = rate*dexp(beta*total)*
     &           (dexp(beta*time(ni))-1d0)/beta
        prit = prit+dlog(haz)-cumhaz
        total = total+time(ni)
      enddo
Basically, this loop iterates over the array "time", performing multiple
dexp's and a dlog operation, accumulating and storing intermediate
results to the scalar variables.
The loop can be restructured by hand to replace dexp with vexp. (dexp
is double precision and vexp works on doubles, so this is
appropriate.)
In the rewritten code, temporary arrays store entire sequences of scalar
values, allowing these sequences to be processed all at once by calls to
vexp. Obviously, the temporary arrays consume memory, and if the
default size in this example were 2,000,000 rather than 2,000, we'd have
a problem. On the other hand, I've given them self-documenting names
and not attempted to reuse them, so this could be tightened up.
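For a rough sense of the memory cost, here is a back-of-the-envelope shell sketch. It assumes the ten v_* temporary arrays declared in the hand-optimized listing that follows and 8 bytes per double precision value; the helper name temp_bytes is hypothetical.

```sh
#!/bin/sh
# temp_bytes N: bytes used by the temporary arrays when each has N elements.
# Assumes 10 temporaries (the v_* arrays) and 8-byte double precision values.
temp_bytes() {
    echo $(( 10 * $1 * 8 ))
}

echo "length 2000:    $(temp_bytes 2000) bytes"
echo "length 2000000: $(temp_bytes 2000000) bytes"
```

That is about 160 KB at the current size, and about 160 MB if the arrays grew to 2,000,000 elements.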
Here's the hand-optimized version:
      integer ni,nint
      double precision beta,cumhaz,haps,haz,prit,rate,
     &                 total
      double precision time(2000)
      double precision :: v_haz(2000)
      double precision :: v_cumhaz(2000)
      double precision :: v_beta_total_time(2000)
      double precision :: v_dexp_beta_total_time(2000)
      double precision :: v_total(2000)
      double precision :: v_beta_time(2000)
      double precision :: v_dexp_beta_time(2000)
      double precision :: v_dlog_haz(2000)
      double precision :: v_beta_total(2000)
      double precision :: v_dexp_beta_total(2000)
      prit = 0d0
! v_total(ni) holds the value "total" had at the top of iteration ni
      v_total(1) = 0d0
      do ni = 2,nint
        v_total(ni) = v_total(ni-1) + time(ni-1)
      enddo
! Build the argument vectors for the three exponentials
      do ni = 1,nint
        v_beta_total_time(ni) = beta * (v_total(ni) + time(ni))
        v_beta_time(ni) = beta * (time(ni))
        v_beta_total(ni) = beta * v_total(ni)
      enddo
! Three vexp calls replace 3*nint scalar dexp calls
      call vexp (v_dexp_beta_total_time, v_beta_total_time, nint)
      call vexp (v_dexp_beta_time, v_beta_time, nint)
      call vexp (v_dexp_beta_total, v_beta_total, nint)
! Same arithmetic as the original loop, on the precomputed exponentials
      do ni = 1,nint
        haps = dble(nint)-dble(ni)+2d0
        rate = haps*(haps-1d0)/2d0
        v_haz(ni) = rate*v_dexp_beta_total_time(ni)
        v_cumhaz(ni) = rate*v_dexp_beta_total(ni)*
     &                 (v_dexp_beta_time(ni)-1d0)/beta
      enddo
! One vlog call replaces nint scalar dlog calls
      call vlog (v_dlog_haz, v_haz, nint)
! Final reduction into prit
      do ni = 1,nint
        prit = prit+v_dlog_haz(ni)-v_cumhaz(ni)
      enddo
This is a tedious procedure, but it yielded a speedup of over 3x. Here
are wallclock times (average of four runs) for the two versions of the
routine, as run out of the driver:
                   Version of Code
               =======================
               Original       Hand
                           vectorized
xlf options     (secs)       (secs)
-----------    --------     --------
<default>        3.39         0.92
-O3              3.18         0.83
-O3 -qhot        0.93         0.86
It's somewhat reassuring to see that the hand-coded version managed to
beat the best compiler time in all cases. Even better, since it's a lot
of work to do this by hand, is the observation that with the "high order
transformation option", the compiler does almost as well. (Different
codes will, of course, respond differently.)
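The ratios behind these statements are easy to check. This is a small shell sketch using the wallclock times from the table above; the helper name speedup is hypothetical.

```sh
#!/bin/sh
# speedup OLD NEW: ratio of two wallclock times, printed to two decimals.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "default,   hand vs. original:   $(speedup 3.39 0.92)x"
echo "-O3,       hand vs. original:   $(speedup 3.18 0.83)x"
echo "-O3 -qhot, hand vs. original:   $(speedup 0.93 0.86)x"
echo "original,  -O3 -qhot vs. default: $(speedup 3.39 0.93)x"
```

With -O3 -qhot the hand-coded version is only about 8% faster than the compiler's own effort, which is the point of the paragraph above.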
Combining manual and compiler optimization
Here's an approach to optimizing a code with massv:
1. compile with -O3 -qhot,
2. verify that program output is unchanged or acceptable--reordering
   execution can change results,
3. profile the code,
4. identify math intrinsics (if any) which take significant time,
5. find the routine(s) where such intrinsics are called,
6. determine if XLF has transformed them into vector intrinsics,
7. if not, attempt to hand-optimize.
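The verification step can be partly automated. Here is a minimal shell sketch, assuming each run writes one number per line to a text file in a notation awk can read (for example 1.5E+00, not 1.5D+00). The helper name outputs_match, the file contents, and the tolerance are all hypothetical.

```sh
#!/bin/sh
# outputs_match REF NEW TOL: succeed if every pair of values agrees to
# within TOL, relative to the reference value (absolute when |ref| < 1).
outputs_match() {
    paste "$1" "$2" | awk -v tol="$3" '
        { d = $1 - $2; if (d < 0) d = -d
          m = ($1 < 0 ? -$1 : $1)
          if (d > tol * (m > 1 ? m : 1)) bad = 1 }
        END { exit bad + 0 }'
}

# Made-up sample outputs standing in for the two builds of a code.
ref=$(mktemp); new=$(mktemp)
printf '%s\n' 1.0 -352.25 3.39 > "$ref"
printf '%s\n' 1.0000000000000002 -352.25 3.39 > "$new"
if outputs_match "$ref" "$new" 1e-12; then
    echo "outputs agree"
else
    echo "outputs differ"
fi
rm -f "$ref" "$new"
```

What counts as "acceptable" depends on the code, so pick the tolerance with the code's owner rather than taking 1e-12 as a recommendation.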
Steps 3 and 6 require additional tools, as follows.
Profiling codes on the SP
- Recompile with the -pg option:
xlf -pg -O3 -qhot ...
- Run the executable as usual
- This produces the trace file: gmon.out
- Use gprof to view the profile:
gprof executable_name gmon.out
[ Note: this type of profiling adds overhead. Recompile without -pg
before doing any production runs! ]
Here's a snippet of gprof output, from the "call graph profile:"
                                  called/total      parents
index  %time    self descendents  called+self    name           index
                                  called/total      children
                0.05        0.00  515084/4435146     .endpolpr [17]
                0.05        0.00  553902/4435146     .endpolpl [15]
                0.26        0.00 2836611/4435146     .endpolp2 [3]
[13]     4.7    0.40        0.00 4435146         ._log [13]
                0.00        0.00      12/23          ._Errno [193]
This calling tree tells us that log is called by endpolpr, endpolpl,
and endpolp2, and that log takes 4.7% of the code's total run
time. (4.7% may not be worth worrying about.) It also tells us that
log calls Errno.
From this, we know to examine the potential of subroutine endpolp2 for
the replacement of log with vlog. To avoid wasting time, we must
first determine if XLF has already done this, as described next.
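Why endpolp2 first? The called/total column shows that it makes most of the calls to log. The shell sketch below works out each caller's share from the "parents" lines of the snippet. It assumes whitespace-separated fields in the order self, descendents, called/total, name, index; real gprof layouts vary, and the helper name caller_share is hypothetical.

```sh
#!/bin/sh
# caller_share: read gprof "parents" lines on stdin and print each caller's
# percentage of the total calls, taken from the called/total field.
caller_share() {
    awk '$3 ~ /^[0-9]+\/[0-9]+$/ {
             split($3, c, "/")
             printf "%s %.1f\n", $4, 100 * c[1] / c[2]
         }'
}

# Parent lines copied from the call graph snippet above.
caller_share <<'EOF'
0.05 0.00 515084/4435146 .endpolpr [17]
0.05 0.00 553902/4435146 .endpolpl [15]
0.26 0.00 2836611/4435146 .endpolp2 [3]
EOF
```

Feed it only the parent lines; child lines such as the ._Errno entry have the same shape and would be reported too.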
Locate vector intrinsics added/missed by the compiler
- Obtain a compiler report by passing XLF the "-qreport" option:
xlf -qreport -O3 -qhot ...
- View the ".lst" file produced by the compiler.
Here's a snippet from a report. It ain't perty, but you can find
occurrences of both "_exp" and "CALL __vexp" in this transformed code.
Remember, in the source code, "vexp" didn't exist... the compiler has
rearranged the loops to replace exp with vexp.
@CSE23 = dr(1)
@CSE24 = _exp(-(rrate * %VAL(@CSE23)))
prob[].off0 = p1 * ( 1.0000000000000000E+000 - @CSE24)
@CSE25 = _exp(-(rrate2 * %VAL(@CSE23)))
prob2[].off0 = p1 * ( 1.0000000000000000E+000 - @CSE25)
temp2 = @CSE24
temp4 = @CSE25
@MARKSTK0 = __getstack()
GOTO lab_83
2913| lab_83
2886| IF ((@ICM6 > 0)) THEN
2893| @NumElements0 = int(int(@ICM6))
CALL __vexp((@addr.split6 + () + (8)*(0)),(@addr.split6 + ()&
& + (8)*(0)),@NumElements0)
2894| @NumElements1 = int(int(@ICM6))
CALL __vexp((@addr.split7 + () + (8)*(0)),(@addr.split7 + ()&
& + (8)*(0)),@NumElements1)
2886| @CIV4 = 0
Id=11 DO @CIV4 = @CIV4, int(@ICM6)-1
2898| @CSE31 = @addr.split6%@split6(@CIV4)
temp2 = temp2 * @CSE31
The compiler report also provides some annotation in English, as shown
in the following snippet:
Source   Source   Loop Id   Action / Information
File     Line
------   ------   -------   -----------------------------------------
     1     2893             Vectorization applied to statement.
     1     2894             Vectorization applied to statement.
     1     2886        11   The loop on line 2886 was created by the
                            distribution of the loop on line 2886.
Given this information, the programmer's goal is to find occurrences of
_exp within the (transformed) loops. If found, we know that XLF was
unable to "vectorize" those loops, and thus, the corresponding loops in
the original source might possibly be hand-optimized. Loops are
identified by the source code line numbers given in column 3 of the
transformed code in the report.
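Searching the report by eye gets old quickly. Here is a minimal shell sketch that counts scalar and vector exp calls with grep. It assumes the report spells them "_exp(" and "__vexp(" as in the snippet above; the helper name count_calls and the temporary sample file are hypothetical stand-ins for your own .lst file.

```sh
#!/bin/sh
# count_calls PATTERN FILE: number of lines in FILE containing the fixed
# string PATTERN.  (Hypothetical helper, not part of XLF.)
count_calls() {
    grep -F -c -- "$1" "$2" || true
}

# Sample lines copied from the report excerpt; use your own .lst file instead.
lst=$(mktemp)
cat > "$lst" <<'EOF'
@CSE24 = _exp(-(rrate * %VAL(@CSE23)))
@CSE25 = _exp(-(rrate2 * %VAL(@CSE23)))
CALL __vexp((@addr.split6 + () + (8)*(0)),(@addr.split6 + ()&
CALL __vexp((@addr.split7 + () + (8)*(0)),(@addr.split7 + ()&
EOF

echo "scalar _exp calls:   $(count_calls '_exp(' "$lst")"
echo "vector __vexp calls: $(count_calls '__vexp(' "$lst")"
grep -F -n -- '_exp(' "$lst"    # list the lines holding scalar calls
rm -f "$lst"
```

A nonzero scalar count is only a starting point: check whether each hit actually falls inside a DO loop in the transformed code before reaching for vexp.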
In this example, and, in fact, for the complete application code, XLF
"vectorized" every occurrence and left nothing to do by hand.
BLUI in SIGGRAPH Studio
ARSC/UAF's "Body Language User Interface", or BLUI, project will be
featured in the "Studio" at SIGGRAPH, next week.
Here's the blurb from the Studio web page (under the new category, "VR,"
at the bottom):
http://www.siggraph.org/s2002/conference/studio/index.html
VR
"New for SIGGRAPH 2002, this area features a system for immersive
display configured for 3D solid modeling. Bill Brody of the University
of Alaska at Fairbanks demonstrates his "BLUIsculpt" system, in which
fully 3D objects can be created and output as .stl files for rapid
prototyping."
DOE Benchmarking
Interesting work from DOE on benchmarking early systems:
http://www.csm.ornl.gov/evaluation/index.html
Evaluation of Early Systems
Computational requirements for many large-scale simulations and ensemble
studies of vital interest to the Department of Energy (DOE) exceed what
is currently offered by any U.S. computer vendor. Examples are numerous,
ranging from global change research to combustion to informatics. It is
incumbent on DOE to be aware of the performance of new or beta systems
from high performance computing vendors that will determine the
performance of future production-class offerings. It is equally
important that DOE work with vendors in finding solutions that will
fulfill DOE's computational requirements.
In support of this mission, Oak Ridge National Laboratory (ORNL) is
currently performing in-depth evaluations of a number of high
performance computer systems,
Fortran Information
Learn about all things Fortran from Michael Metcalf's Fortran 90/95/HPF
Information File, at:
http://www.fortran.com/metcalf.htm
Document headings:
A:[[ June is mosquito month in Fairbanks, and 2002 has been impressive by
  [[ all accounts. Send us your favorite (short) mosquito story, remedy,
  [[ or advice. Any luck with mosquito traps, DEET-free dope, or personal
  [[ concoctions?
# The Gadget Award goes to Kate Hedstrom:
We got a "SonicWeb" trap, shown here. It has a heartbeat sound that is
loud enough to hear. It also uses heat and Octenol to attract critters.
Ours has trapped all sorts of insects, mostly flies, wasps and
itty-bitty things. We also caught a dragonfly and a few mosquitos.
# From Brad Chamberlain:
My favorite mosquito story was when I was camping in college with a
Buddhist friend of mine. Angrily slapping mosquitos left and right, she
implored me, "Don't kill them, just brush them away. They just want a
drop of your blood, and you want to take their life." I pondered this a
few days and then asked her if she was reincarnated as a mosquito,
whether she might appreciate being sent on to her next life all the
sooner. She had to agree that that sounded attractive... :)
But in spite of this funny exchange, she truly believed that if you
brush mosquitos away rather than swat at them, then they will leave you
alone. And ever since then, I have brushed mosquitos away, and don't
remember the last time I was plagued with as many bites as I was when I
was a kid.
Interesting fact that I did not know: According to Webster on-line, the
plural of mosquito is either -os or -oes. Lucky Dan Quayle.
# From one of the editors:
The EPA says DEET products are safe ("when used as directed"), but a
quick search on "DEET" && "Gulf War Syndrome" may give you pause. I do
my best to avoid it. Head nets and long sleeves are the best.
On the other hand, I'd *never* go hiking, fishing, etc., without some
strong bug dope in my pack. Swarming mosquitos can drive a person crazy,
and make you do things more dangerous than just wearing DEET. This
summer, I've used DEET just to work in the yard, wanting to avoid
smacking myself in the head with a shovel in an attempt to swat some
bug.
# Tom Logan deserves some award for this...
You probably know the mosquitos in Alaska are big, but the other night I
overheard this from two that were buzzing around my bed:
Mosquito 1: "I'm tired of eating out, let's pick him up and take him
back to the swamp"
Mosquito 2: "NO! When we get back, the big ones will take him away from
us!"
Q: I received a "not enough memory" error when trying to compile a large
subroutine (part of a big code) on the T3E with -O3,unroll2 . Is
there any way to increase the memory allocation to f90 or am I stuck
compiling that subroutine with -O2 ? Thanks!
[[ Answers, Questions, and Tips Graciously Accepted ]]
Contact:
Donald Bahls, ARSC User Consultant, ph: 907-450-8674
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Send comments and questions to the current editors using this Contact Form.
E-mail Subscriptions: