ARSC HPC Users' Newsletter 250, July 22, 2002
Bicenquinquagennial Issue
As you may have noticed, this is the 250th issue of the ARSC Users' Newsletter! We dispensed our first issue nearly 8 years ago, on August 25, 1994. All past issues are on-line, at:
http://www.arsc.edu/support/news/HPCnews.shtml
Past "Quick-Tips" are all indexed at:
http://www.arsc.edu/support/news/qtindex.xml
However... This might be YOUR first issue, since I just updated our mailing list with new ARSC users. (If you'd rather not receive this newsletter, unsubscribing is described at the bottom.)
Optimizing with IBM Vector Intrinsics and "xlf -qhot"
IBM's massv library is available on icehawk for hand-tuning of vectorizable math intrinsics. XLF (with the options -qhot -O3) does a great job of detecting vectorizable operations and may actually do all the work for you. In either case, the vector intrinsics can really speed things up.
As a quick example, this loop:
      do ni = 1,nint
        loghaz(ni) = dlog(haz(ni))
      enddo
can (and should) be replaced by:
      call vlog (loghaz, haz, nint)
The vector intrinsics library is called "massv". ("V" for "vector" and "MASS" for "Mathematical Acceleration SubSystem".) There's also "massvp3," tuned for the Power3, and the basic "mass" library, for scalar operations.
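To make the calling convention concrete, here is a minimal free-form Fortran sketch. So that it runs without MASSV installed, it uses a portable stand-in, vlog_standin (a hypothetical name, not part of MASSV), with the same argument order as the vlog call above: output array, input array, length. On a system with MASSV you would call vlog itself and link the library. With the real library, the two results may differ in the last few bits, so the comparison uses a tolerance (an assumed value) rather than exact equality.

```fortran
program vlog_sketch
  implicit none
  integer, parameter :: nint = 5
  double precision :: haz(nint), loghaz_loop(nint), loghaz_vec(nint)
  integer :: ni

  ! Arbitrary positive test data (log needs arguments > 0).
  do ni = 1, nint
     haz(ni) = 0.5d0 * dble(ni)
  enddo

  ! Scalar form: one dlog call per element.
  do ni = 1, nint
     loghaz_loop(ni) = dlog(haz(ni))
  enddo

  ! Vector form: a single call for the whole array.
  call vlog_standin(loghaz_vec, haz, nint)

  ! Compare with a tolerance (1d-12 is an assumption, not a MASSV spec).
  if (maxval(abs(loghaz_vec - loghaz_loop)) > 1d-12) then
     print *, 'MISMATCH between scalar loop and vector call'
     stop 1
  endif
  print *, 'scalar loop and vector call agree'

contains

  ! Hypothetical stand-in using vlog's argument order (y, x, n):
  ! y(i) = log(x(i)) for i = 1..n. This is NOT the MASSV implementation.
  subroutine vlog_standin(y, x, n)
    integer, intent(in) :: n
    double precision, intent(in) :: x(n)
    double precision, intent(out) :: y(n)
    y = log(x)
  end subroutine vlog_standin

end program vlog_sketch
```

Swapping the stand-in for the library routine only changes the call line; the surrounding loop structure stays the same.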
On icehawk, these libraries are available in:
/usr/local/pkg/mass/current/lib
and linked to:
/usr/local/lib
A README is in:
/usr/local/pkg/mass/current/wrk
Here are excerpts from the README file: the list of vector math intrinsics, performance data, and explanations/caveats.
----------------------------------------------------------------------
Math Library Performance
(cycles per evaluation, length 1000 loop)
604E 630 P2SC
function range libm mass massv libm mass massv vp3 libm mass massv
vrec D 32* 10 9* 6 4 8* 4
vsrec D 18* 8 7* 5 3 8* 4
vdiv D 32* 12 9* 7 5 9* 5
vsdiv D 18* 10 7* 6 3 9* 4
vsqrt C 67 48 16 11* 9 6 13* 7
vssqrt C 70 48 10 7* 8 5 13* 5
vrsqrt C 79 49 16 22* 9 6 22* 7
vsrsqrt C 83 51 9 16* 7 4 22* 5
vexp D 83 45 16 64 33 6 53 21 7
vsexp E 85 44 13 68 36 5 58 21 6
vlog C 99 56 20 83 53 8 67 35 8
vslog C 102 56 17 86 57 7 66 37 7
vsin B 50 29 11 36 16 5 34 17 5
vsin D 79 59 27 60 43 12 50 37 12
vssin B 51 26 8 39 18 4 40 16 4
vssin D 79 58 20 62 46 9 56 38 9
vcos B 51 26 9 37 16 4 34 17 4
vcos D 75 59 27 58 43 12 51 36 11
vscos B 52 26 7 39 18 3 40 16 3
vscos D 76 59 20 61 46 9 56 37 9
vsincos B 100 53 19 80 33 8 80 38 8
vsincos D 151 116 29 123 92 12 111 81 12
vssincos B 107 55 15 79 38 6 78 36 7
vssincos D 159 118 24 125 98 10 110 80 10
vcosisin B 104 55 19 78 34 8 79 37 8
vcosisin D 156 118 29 123 93 12 111 81 12
vscosisin B 108 55 15 79 36 6 78 36 6
vscosisin D 160 119 23 125 95 9 110 79 10
vtan D 136 74 32 111 52 19 90 38 13
vstan D 136 74 32 113 56 19 95 39 12
vatan2 D 545 104 40 413 87 25 555 73 17
vsatan2 D 545 104 40 418 89 25 558 71 17
vdnint D 37 22 7 24 12 3.4 23 13 2.7
vdint D 36 6 22 2.8 21 2.6
massvp2
vidint D 4.0 2.7
vasin B 48 17
vacos B 49 17
vdfloat D 3.0 1.8
vdsign D 9 3.5
* indicates inline instructions timed (not a subroutine call)
Range Key           Processor   Cycle time        Dcache size
A =    0,   1       604E        3.0 nanoseconds    32 kilobytes
B =   -1,   1       630         5.0 nanoseconds    64 kilobytes
C =    0, 100       P2SC        7.4 nanoseconds   128 kilobytes
D = -100, 100
E =  -10,  10
"[This] data should be considered approximate. It was obtained by
timing many repetitions of a loop over 1,000 random arguments and
includes all overhead. Timing in this way will bring the input and
output vectors into the on-chip cache (the loop is short enough for
them to fit in cache). Performance may deteriorate seriously when the
input and output vectors are not in cache. Performance may also
deteriorate for arguments at or near the end-points of the valid
argument ranges."
----------------------------------------------------------------------
Example
For a more informative example, here's a routine from a simulation code
under evaluation at ARSC. (A driver for this routine was constructed
from the original code, and used for the timings given below.) The array
length of 2000 is a do-not-exceed size... in the current data set,
"nint" is only 352.
      integer ni,nint
      double precision beta,cumhaz,haps,haz,prit,rate,
     &     total
      double precision time(2000)

      prit = 0d0
      total = 0d0
      do ni = 1,nint
        haps = dble(nint)-dble(ni)+2d0
        rate = haps*(haps-1d0)/2d0
        haz = rate*dexp(beta*(total+time(ni)))
        cumhaz = rate*dexp(beta*total)*
     &       (dexp(beta*time(ni))-1d0)/beta
        prit = prit+dlog(haz)-cumhaz
        total = total+time(ni)
      enddo
Basically, this loop iterates over the array "time", performing several dexp's and a dlog operation per iteration, and accumulating intermediate results in scalar variables.
The loop can be restructured by hand to replace dexp with vexp. (dexp is double precision and vexp works on doubles, so this is appropriate.)
In the rewritten code, temporary arrays store entire sequences of scalar values, allowing these sequences to be processed all at once by calls to vexp. Obviously, the temporary arrays consume memory, and if the default size in this example were 2,000,000 rather than 2,000, we'd have a problem. On the other hand, I've given them self-documenting names and not attempted to reuse them, so this could be tightened up.
Here's the hand-optimized version:
      integer ni,nint
      double precision beta,cumhaz,haps,haz,prit,rate,
     &     total
      double precision time(2000)

      double precision :: v_haz(2000)
      double precision :: v_cumhaz(2000)
      double precision :: v_beta_total_time(2000)
      double precision :: v_dexp_beta_total_time(2000)
      double precision :: v_total(2000)
      double precision :: v_beta_time(2000)
      double precision :: v_dexp_beta_time(2000)
      double precision :: v_dlog_haz(2000)
      double precision :: v_beta_total(2000)
      double precision :: v_dexp_beta_total(2000)

      prit = 0d0
      v_total(1) = 0d0
      do ni = 2,nint
        v_total(ni) = v_total(ni-1) + time(ni-1)
      enddo

      do ni = 1,nint
        v_beta_total_time(ni) = beta * (v_total(ni) + time(ni))
        v_beta_time(ni) = beta * (time(ni))
        v_beta_total(ni) = beta * v_total(ni)
      enddo

      call vexp (v_dexp_beta_total_time, v_beta_total_time, nint)
      call vexp (v_dexp_beta_time, v_beta_time, nint)
      call vexp (v_dexp_beta_total, v_beta_total, nint)

      do ni = 1,nint
        haps = dble(nint)-dble(ni)+2d0
        rate = haps*(haps-1d0)/2d0
        v_haz(ni) = rate*v_dexp_beta_total_time(ni)
        v_cumhaz(ni) = rate*v_dexp_beta_total(ni)*
     &       (v_dexp_beta_time(ni)-1d0)/beta
      enddo

      call vlog (v_dlog_haz, v_haz, nint)

      do ni = 1,nint
        prit = prit+v_dlog_haz(ni)-v_cumhaz(ni)
      enddo
This is a tedious procedure, but it yielded a speedup of over 3x. Here are wallclock times (average of four runs) for the two versions of the routine, as run out of the driver:
                   Version of Code
               =======================
               Original      Hand
                             vectorized
xlf options    (secs)        (secs)
-----------    --------      --------
<default>      3.39          0.92
-O3            3.18          0.83
-O3 -qhot      0.93          0.86
It's somewhat reassuring to see that the hand-coded version managed to beat the best compiler time in all cases. Even better, since it's a lot of work to do this by hand, is the observation that with the "high order transformation" option (-qhot), the compiler does almost as well. (Different codes will, of course, respond differently.)
Combining manual and compiler optimization
Here's an approach to optimizing a code with massv:
1. compile with -O3 -qhot,
2. verify that the program output is unchanged or acceptable; reordering execution can change results,
3. profile the code,
4. identify math intrinsics (if any) which take significant time,
5. find the routine(s) where such intrinsics are called,
6. determine if XLF has transformed them into vector intrinsics,
7. if not, attempt to hand-optimize.
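The verification step can be sketched as a small driver that computes "prit" both ways and compares the results against a tolerance. This is an illustration under stated assumptions, not the ARSC driver: the inputs are synthetic, the function names prit_original and prit_restructured are hypothetical, the tolerance is an assumed value, and whole-array exp and log stand in for the vexp and vlog calls so that the sketch runs without MASSV.

```fortran
program compare_versions
  implicit none
  integer, parameter :: n = 352               ! size of the data set mentioned above
  double precision, parameter :: tol = 1d-10  ! assumed acceptable relative difference
  double precision :: t(n), b, p_orig, p_vec, relerr
  integer :: i

  ! Synthetic inputs; the real code takes "time" and "beta" from its data.
  b = 0.01d0
  do i = 1, n
     t(i) = 1d-3 * dble(1 + mod(i, 7))
  enddo

  p_orig = prit_original(t, n, b)
  p_vec  = prit_restructured(t, n, b)
  relerr = abs(p_vec - p_orig) / abs(p_orig)

  print *, 'original     prit =', p_orig
  print *, 'restructured prit =', p_vec
  print *, 'relative difference =', relerr
  if (relerr > tol) then
     print *, 'results differ by more than the tolerance'
     stop 1
  endif

contains

  ! The original scalar loop, one exp/log call at a time.
  function prit_original(time, nint, beta) result(prit)
    integer, intent(in) :: nint
    double precision, intent(in) :: time(nint), beta
    double precision :: prit, total, haps, rate, haz, cumhaz
    integer :: ni
    prit = 0d0
    total = 0d0
    do ni = 1, nint
       haps = dble(nint) - dble(ni) + 2d0
       rate = haps*(haps - 1d0)/2d0
       haz = rate*exp(beta*(total + time(ni)))
       cumhaz = rate*exp(beta*total)*(exp(beta*time(ni)) - 1d0)/beta
       prit = prit + log(haz) - cumhaz
       total = total + time(ni)
    enddo
  end function prit_original

  ! The restructured form: temporaries hold whole sequences, and
  ! whole-array exp/log stand in for the vexp/vlog calls.
  function prit_restructured(time, nint, beta) result(prit)
    integer, intent(in) :: nint
    double precision, intent(in) :: time(nint), beta
    double precision :: prit, haps
    double precision :: total(nint), rate(nint), haz(nint), cumhaz(nint)
    integer :: ni
    total(1) = 0d0
    do ni = 2, nint
       total(ni) = total(ni-1) + time(ni-1)
    enddo
    do ni = 1, nint
       haps = dble(nint) - dble(ni) + 2d0
       rate(ni) = haps*(haps - 1d0)/2d0
    enddo
    haz = rate*exp(beta*(total + time))
    cumhaz = rate*exp(beta*total)*(exp(beta*time) - 1d0)/beta
    prit = sum(log(haz) - cumhaz)
  end function prit_restructured

end program compare_versions
```

A relative tolerance is used because the restructured form accumulates the sum in a different order, so bit-for-bit agreement should not be expected.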
Steps 3 and 6 require additional tools, as follows.
Profiling codes on the SP
- Recompile with the -pg option: xlf -pg -O3 -qhot ...
- Run the executable as usual
- This produces the trace file: gmon.out
- Use gprof to view the profile: gprof executable_name gmon.out
[ Note: this type of profiling adds overhead. Recompile without -pg before doing any production runs! ]
Here's a snippet of gprof output, from the "call graph profile":
                                  called/total       parents
index  %time    self descendents  called+self    name           index
                                  called/total       children

                0.05        0.00  515084/4435146     .endpolpr [17]
                0.05        0.00  553902/4435146     .endpolpl [15]
                0.26        0.00 2836611/4435146     .endpolp2 [3]
[13]     4.7    0.40        0.00 4435146         ._log [13]
                0.00        0.00      12/23          ._Errno [193]
This calling tree tells us that log is called by endpolpr, endpolpl, and endpolp2, and that log takes 4.7% of the code's total run time. (4.7% may not be worth worrying about.) It also tells us that log calls Errno.
From this, we know to examine subroutine endpolp2, which makes most of the calls to log (2836611 of 4435146), for the potential replacement of log with vlog. To avoid wasting time, we must first determine if XLF has already done this, as described next.
Locate vector intrinsics added/missed by the compiler
- Obtain a compiler report by passing XLF the "-qreport" option: xlf -qreport -O3 -qhot ...
- View the ".lst" file produced by the compiler.
Here's a snippet from a report. It ain't perty, but you can find occurrences of both "_exp" and "CALL __vexp" in this transformed code. Remember, in the source code, "vexp" didn't exist... the compiler has rearranged the loops to replace exp with vexp.
@CSE23 = dr(1)
@CSE24 = _exp(-(rrate * %VAL(@CSE23)))
prob[].off0 = p1 * ( 1.0000000000000000E+000 - @CSE24)
@CSE25 = _exp(-(rrate2 * %VAL(@CSE23)))
prob2[].off0 = p1 * ( 1.0000000000000000E+000 - @CSE25)
temp2 = @CSE24
temp4 = @CSE25
@MARKSTK0 = __getstack()
GOTO lab_83
2913
lab_83
2886
IF ((@ICM6 > 0)) THEN
2893
@NumElements0 = int(int(@ICM6))
CALL __vexp((@addr.split6 + () + (8)*(0)),(@addr.split6 + ()&
& + (8)*(0)),@NumElements0)
2894
@NumElements1 = int(int(@ICM6))
CALL __vexp((@addr.split7 + () + (8)*(0)),(@addr.split7 + ()&
& + (8)*(0)),@NumElements1)
2886
@CIV4 = 0
Id=11 DO @CIV4 = @CIV4, int(@ICM6)-1
2898
@CSE31 = @addr.split6%@split6(@CIV4)
temp2 = temp2 * @CSE31
The compiler report also provides some annotation in English, as shown in the following snippet:
Source  Source  Loop Id  Action / Information
File    Line
------  ------  -------  -----------------------------------------
1       2893             Vectorization applied to statement.
1       2894             Vectorization applied to statement.
1       2886    11       The loop on line 2886 was created by the
                         distribution of the loop on line 2886.
Given this information, the programmer's goal is to find occurrences of _exp within the (transformed) loops. If found, we know that XLF was unable to "vectorize" those loops, and thus, the corresponding loops in the original source might possibly be hand-optimized. Loops are identified by the source code line numbers given in column 3 of the transformed code in the report.
In this example, and, in fact, for the complete application code, XLF "vectorized" every occurrence and left nothing to do by hand.
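The search described above amounts to telling scalar "_exp(" calls apart from vector "__vexp(" calls. Here is a minimal sketch of that test, applied to a few lines copied from the report excerpt and embedded in the program so it is self-contained. The program name and the idea of counting matches are illustrative assumptions; in practice you would read the ".lst" file (or simply search it in an editor), and you still need to read the report to decide whether a scalar call sits inside a loop.

```fortran
program scan_report
  implicit none
  integer, parameter :: nlines = 5
  character(len=80) :: report(nlines)
  integer :: i, n_scalar, n_vector

  ! A few lines copied from the report excerpt above.
  report(1) = '@CSE24 = _exp(-(rrate * %VAL(@CSE23)))'
  report(2) = 'prob[].off0 = p1 * ( 1.0000000000000000E+000 - @CSE24)'
  report(3) = '@CSE25 = _exp(-(rrate2 * %VAL(@CSE23)))'
  report(4) = 'CALL __vexp((@addr.split6 + () + (8)*(0)),(@addr.split6 + ()&'
  report(5) = 'CALL __vexp((@addr.split7 + () + (8)*(0)),(@addr.split7 + ()&'

  n_scalar = 0
  n_vector = 0
  do i = 1, nlines
     ! Test for the vector call first; "__vexp(" does not contain
     ! "_exp(" (the "v" intervenes), so the two tests never overlap.
     if (index(report(i), '__vexp(') > 0) then
        n_vector = n_vector + 1
     else if (index(report(i), '_exp(') > 0) then
        n_scalar = n_scalar + 1
     endif
  enddo

  print *, 'lines with scalar _exp calls:  ', n_scalar
  print *, 'lines with vector __vexp calls:', n_vector
end program scan_report
```

For these five lines the counts are two scalar and two vector calls, matching what the eye finds in the excerpt.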
BLUI in SIGGRAPH Studio
ARSC/UAF's "Body Language User Interface", or BLUI, project will be featured in the "Studio" at SIGGRAPH, next week.
Here's the blurb from the Studio web page (under the new category, "VR," at the bottom):
http://www.siggraph.org/s2002/conference/studio/index.html
VR "New for SIGGRAPH 2002, this area features a system for immersive display configured for 3D solid modeling. Bill Brody of the University of Alaska at Fairbanks demonstrates his "BLUIsculpt" system, in which fully 3D objects can be created and output as .stl files for rapid prototyping."
DOE Benchmarking
Interesting work on DOE's benchmarking of early systems:
http://www.csm.ornl.gov/evaluation/index.html
Evaluation of Early Systems
Computational requirements for many large-scale simulations and ensemble studies of vital interest to the Department of Energy (DOE) exceed what is currently offered by any U.S. computer vendor. Examples are numerous, ranging from global change research to combustion to informatics. It is incumbent on DOE to be aware of the performance of new or beta systems from high performance computing vendors that will determine the performance of future production-class offerings. It is equally important that DOE work with vendors in finding solutions that will fulfill DOE's computational requirements.
In support of this mission, Oak Ridge National Laboratory (ORNL) is currently performing in-depth evaluations of a number of high performance computer systems,
Fortran Information
Learn about all things Fortran from Michael Metcalf's Fortran 90/95/HPF Information File, at:
http://www.fortran.com/metcalf.htm
Document headings:
- WHERE CAN I OBTAIN A FORTRAN 95 COMPILER?
- OTHER USEFUL PRODUCTS
- WHAT BOOKS ARE AVAILABLE?
  In these languages:
  - Chinese
  - Danish
  - Dutch
  - English
  - Finnish
  - French
  - German
  - Italian
  - Japanese
  - Russian
  - Swedish
- WHERE CAN I OBTAIN COURSES, COURSE MATERIAL OR CONSULTANCY?
- WHERE CAN I FIND THE FORTRAN AND HPF STANDARDS?
Quick-Tip Q & A
A:[[ June is mosquito month in Fairbanks, and 2002 has been impressive by
  [[ all accounts. Send us your favorite (short) mosquito story, remedy,
  [[ or advice. Any luck with mosquito traps, DEET-free dope, or personal
  [[ concoctions?
# The Gadget Award goes to Kate Hedstrom:
We got a "SonicWeb" trap, shown here. It has a heartbeat sound that is loud enough to hear. It also uses heat and Octenol to attract critters. Ours has trapped all sorts of insects, mostly flies, wasps and itty-bitty things. We also caught a dragonfly and a few mosquitos.
# From Brad Chamberlain
My favorite mosquito story was when I was camping in college with a Buddhist friend of mine. Angrily slapping mosquitos left and right, she implored me, "Don't kill them, just brush them away. They just want a drop of your blood, and you want to take their life." I pondered this a few days and then asked her if she was reincarnated as a mosquito, whether she might appreciate being sent on to her next life all the sooner. She had to agree that that sounded attractive... :)
But in spite of this funny exchange, she truly believed that if you brush mosquitos away rather than swat at them, then they will leave you alone. And ever since then, I have brushed mosquitos away, and don't remember the last time I was plagued with as many bites as I was when I was a kid.
Interesting fact that I did not know: According to Webster on-line, the plural of mosquito is either -os or -oes. Lucky Dan Quayle.
# From one of the editors:
The EPA says DEET products are safe ("when used as directed"), but a quick search on "DEET" && "Gulf War Syndrome" may give you pause. I do my best to avoid it. Head nets and long sleeves are the best. On the other hand, I'd *never* go hiking, fishing, etc., without some strong bug dope in my pack. Swarming mosquitos can drive a person crazy, and make you do things more dangerous than just wearing DEET. This summer, I've used DEET just to work in the yard, wanting to avoid smacking myself in the head with a shovel in an attempt to swat some bug.
# Tom Logan deserves some award for this...
You probably know the mosquitos in Alaska are big, but the other night I overheard this from two that were buzzing around my bed:
Mosquito 1: "I'm tired of eating out, let's pick him up and take him back to the swamp"
Mosquito 2: "NO! When we get back, the big ones will take him away from us!"

Q: I received a "not enough memory" error when trying to compile a large subroutine (part of a big code) on the T3E with -O3,unroll2. Is there any way to increase the memory allocation to f90 or am I stuck compiling that subroutine with -O2? Thanks!
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
