ARSC HPC Users' Newsletter 250, July 22, 2002

Bicenquinquagennial Issue

As you may have noticed, this is the 250th issue of the ARSC Users' Newsletter! We dispensed our first issue nearly 8 years ago, on August 25, 1994. All past issues are on-line, at:

http://www.arsc.edu/support/news/HPCnews.shtml

Past "Quick-Tips" are all indexed at:

http://www.arsc.edu/support/news/qtindex.xml

However... This might be YOUR first issue, since I just updated our mailing list with new ARSC users. (If you'd rather not receive this newsletter, unsubscribing is described at the bottom.)

Optimizing with IBM Vector Intrinsics and "xlf -qhot"

Intro

IBM's massv library is available on icehawk for hand-tuning of vectorizable math intrinsics. XLF (with the options -qhot -O3) does a great job of detecting vectorizable operations and may actually do all the work for you. In either case, the vector intrinsics can really speed things up.

As a quick example, this loop:

      do ni = 1,nint
        loghaz(ni) = dlog(haz(ni))
      enddo

can (and should) be replaced by:

      call vlog (loghaz, haz, nint)

The vector intrinsics library is called "massv". ("V" for "vector" and "MASS" for "Mathematical Acceleration SubSystem".) There's also "massvp3," tuned for the Power3, and the basic "mass" library, for scalar operations.
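
To make the calling convention concrete, here is a minimal sketch that runs the scalar loop and a single vector-style call side by side. The routine vlog_ref is a hypothetical stand-in, written here only so the program runs without MASS; it takes its arguments in the same order as the vlog call above (output array, input array, count). With MASS available, the call would go to vlog itself and the program would be linked against massv.

```fortran
      program vlog_demo
      implicit none
      integer, parameter :: n = 5
      double precision :: haz(n), loop_out(n), vec_out(n)
      double precision :: diff
      integer :: ni

      ! Sample positive inputs (log is undefined for haz <= 0).
      do ni = 1, n
        haz(ni) = 0.5d0 * dble(ni)
      enddo

      ! Scalar form: one intrinsic call per element.
      do ni = 1, n
        loop_out(ni) = dlog(haz(ni))
      enddo

      ! Vector form: one call for the whole array.
      call vlog_ref (vec_out, haz, n)

      diff = maxval(abs(vec_out - loop_out))
      print *, 'max abs difference:', diff

      contains

      ! Hypothetical stand-in, NOT the MASS routine.
      ! Argument order assumed: output, input, element count.
      subroutine vlog_ref (y, x, m)
      integer, intent(in) :: m
      double precision, intent(in) :: x(m)
      double precision, intent(out) :: y(m)
      y = log(x)
      end subroutine vlog_ref

      end program vlog_demo
```

The printed difference should be zero or within rounding. Against the real library, small differences in the last digits are possible, which is one reason to check results after switching.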

On icehawk, these libraries are available in:

/usr/local/pkg/mass/current/lib

and linked to:

/usr/local/lib

A README is in:

/usr/local/pkg/mass/current/wrk

Here are excerpts from the README file: the list of vector math intrinsics, performance data, and explanations/caveats.

----------------------------------------------------------------------

                      Math Library Performance 
              (cycles per evaluation, length 1000 loop)


                     604E                630               P2SC
  function range libm mass massv libm mass massv vp3  libm mass massv
     vrec    D     32*       10     9*        6    4     8*        4
    vsrec    D     18*        8     7*        5    3     8*        4
     vdiv    D     32*       12     9*        7    5     9*        5
    vsdiv    D     18*       10     7*        6    3     9*        4
    vsqrt    C     67   48   16    11*        9    6    13*        7
   vssqrt    C     70   48   10     7*        8    5    13*        5
   vrsqrt    C     79   49   16    22*        9    6    22*        7
  vsrsqrt    C     83   51    9    16*        7    4    22*        5
     vexp    D     83   45   16    64   33    6         53   21    7
    vsexp    E     85   44   13    68   36    5         58   21    6
     vlog    C     99   56   20    83   53    8         67   35    8
    vslog    C    102   56   17    86   57    7         66   37    7
     vsin    B     50   29   11    36   16    5         34   17    5
     vsin    D     79   59   27    60   43   12         50   37   12
    vssin    B     51   26    8    39   18    4         40   16    4
    vssin    D     79   58   20    62   46    9         56   38    9
     vcos    B     51   26    9    37   16    4         34   17    4
     vcos    D     75   59   27    58   43   12         51   36   11
    vscos    B     52   26    7    39   18    3         40   16    3
    vscos    D     76   59   20    61   46    9         56   37    9
  vsincos    B    100   53   19    80   33    8         80   38    8
  vsincos    D    151  116   29   123   92   12        111   81   12
 vssincos    B    107   55   15    79   38    6         78   36    7
 vssincos    D    159  118   24   125   98   10        110   80   10
 vcosisin    B    104   55   19    78   34    8         79   37    8
 vcosisin    D    156  118   29   123   93   12        111   81   12
vscosisin    B    108   55   15    79   36    6         78   36    6
vscosisin    D    160  119   23   125   95    9        110   79   10
     vtan    D    136   74   32   111   52   19         90   38   13
    vstan    D    136   74   32   113   56   19         95   39   12
   vatan2    D    545  104   40   413   87   25        555   73   17
  vsatan2    D    545  104   40   418   89   25        558   71   17
   vdnint    D     37   22    7    24   12  3.4         23   13  2.7
    vdint    D     36         6    22       2.8         21       2.6
                                                                massvp2
   vidint    D                                          4.0      2.7
    vasin    B                                           48       17
    vacos    B                                           49       17
  vdfloat    D                                          3.0      1.8
   vdsign    D                                            9      3.5

  * indicates inline instructions timed (not a subroutine call)

  Range Key     Processor     Cycle time         Dcache size
  A =    0,  1     604E    3.0 nanoseconds      32 kilobytes
  B =   -1,  1      630    5.0 nanoseconds      64 kilobytes
  C =    0,100     P2SC    7.4 nanoseconds     128 kilobytes
  D = -100,100
  E =  -10, 10

  "[This] data should be considered approximate. It was obtained by
  timing many repetitions of a loop over 1,000 random arguments and
  includes all overhead. Timing in this way will bring the input and
  output vectors into the on-chip cache (the loop is short enough for
  them to fit in cache). Performance may deteriorate seriously when the
  input and output vectors are not in cache. Performance may also
  deteriorate for arguments at or near the end-points of the valid
  argument ranges."
----------------------------------------------------------------------



Example

For a more informative example, here's a routine from a simulation code under evaluation at ARSC. (A driver for this routine was constructed from the original code, and used for the timings given below.) The array length of 2000 is a do-not-exceed size... in the current data set, "nint" is only 352.


      integer ni,nint
      double precision beta,cumhaz,haps,haz,prit,rate,
     &  total
      double precision time(2000)

      prit = 0d0
      total = 0d0

      do ni = 1,nint
        haps = dble(nint)-dble(ni)+2d0
        rate = haps*(haps-1d0)/2d0
        haz = rate*dexp(beta*(total+time(ni)))
        cumhaz = rate*dexp(beta*total)*
     &      (dexp(beta*time(ni))-1d0)/beta
        prit = prit+dlog(haz)-cumhaz
        total = total+time(ni)
      enddo

Basically, this loop iterates over the array "time", performing multiple dexp's and a dlog operation, accumulating and storing intermediate results to the scalar variables.

The loop can be restructured by hand to replace dexp with vexp. (dexp is double precision and vexp works on doubles, so this is appropriate.)

In the rewritten code, temporary arrays store entire sequences of scalar values, allowing these sequences to be processed all at once by calls to vexp and vlog. The running sum "total" carries a dependence from one iteration to the next, so it is computed first, into its own array. Obviously, the temporary arrays consume memory, and if the default size in this example were 2,000,000 rather than 2,000, we'd have a problem. On the other hand, I've given them self-documenting names and not attempted to reuse them, so this could be tightened up.

Here's the hand-optimized version:


      integer ni,nint
      double precision beta,cumhaz,haps,haz,prit,rate,
     &  total
      double precision time(2000)

      double precision :: v_haz(2000)
      double precision :: v_cumhaz(2000)
      double precision :: v_beta_total_time(2000)
      double precision :: v_dexp_beta_total_time(2000)
      double precision :: v_total(2000)
      double precision :: v_beta_time(2000)
      double precision :: v_dexp_beta_time(2000)
      double precision :: v_dlog_haz(2000)
      double precision :: v_beta_total(2000) 
      double precision :: v_dexp_beta_total(2000) 

      prit = 0d0
      v_total(1) = 0d0

      do ni = 2,nint
        v_total(ni) = v_total(ni-1) + time(ni-1)
      enddo 

      do ni = 1,nint
        v_beta_total_time(ni) = beta * (v_total(ni) + time(ni))
        v_beta_time(ni) = beta * (time(ni))
        v_beta_total(ni) = beta * v_total(ni)
      enddo 

      call vexp (v_dexp_beta_total_time, v_beta_total_time, nint)
      call vexp (v_dexp_beta_time, v_beta_time, nint)
      call vexp (v_dexp_beta_total, v_beta_total, nint)

      do ni = 1,nint
        haps = dble(nint)-dble(ni)+2d0
        rate = haps*(haps-1d0)/2d0
        v_haz(ni) = rate*v_dexp_beta_total_time(ni)
        v_cumhaz(ni) = rate*v_dexp_beta_total(ni)*
     &                  (v_dexp_beta_time(ni)-1d0)/beta
      enddo
     
      call vlog (v_dlog_haz, v_haz, nint)

      do ni = 1,nint
        prit = prit+v_dlog_haz(ni)-v_cumhaz(ni)
      enddo
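
The remark about reusing temporaries can be made concrete. The following is a sketch only, not the version that was timed: it keeps the same sequence of operations but gets by with four work arrays plus v_total, by refilling one argument array and overwriting results that are no longer needed. The names w_arg, w_a, w_b and w_c, the test data, and the stand-in routines vexp_ref and vlog_ref are all assumptions, made so the program runs without MASS; with MASS available, the calls would go to vexp and vlog instead.

```fortran
      program reuse_demo
      implicit none
      integer, parameter :: nmax = 2000
      integer :: ni, nint
      double precision :: beta, haps, rate, total, haz, cumhaz
      double precision :: prit_ref, prit_vec
      double precision :: time(nmax), v_total(nmax)
      double precision :: w_arg(nmax), w_a(nmax)
      double precision :: w_b(nmax), w_c(nmax)

      ! Made-up test data (the real data set is not shown).
      nint = 352
      beta = 0.01d0
      do ni = 1, nint
        time(ni) = 1d-3 * dble(ni)
      enddo

      ! Reference: the original scalar loop.
      prit_ref = 0d0
      total = 0d0
      do ni = 1, nint
        haps = dble(nint) - dble(ni) + 2d0
        rate = haps * (haps - 1d0) / 2d0
        haz = rate * dexp(beta * (total + time(ni)))
        cumhaz = rate * dexp(beta * total)
        cumhaz = cumhaz * (dexp(beta * time(ni)) - 1d0) / beta
        prit_ref = prit_ref + dlog(haz) - cumhaz
        total = total + time(ni)
      enddo

      ! Running totals, as in the hand-optimized version.
      v_total(1) = 0d0
      do ni = 2, nint
        v_total(ni) = v_total(ni-1) + time(ni-1)
      enddo

      ! w_arg is refilled before each call; results go to
      ! w_a, w_b and w_c.
      do ni = 1, nint
        w_arg(ni) = beta * (v_total(ni) + time(ni))
      enddo
      call vexp_ref (w_a, w_arg, nint)
      do ni = 1, nint
        w_arg(ni) = beta * time(ni)
      enddo
      call vexp_ref (w_b, w_arg, nint)
      do ni = 1, nint
        w_arg(ni) = beta * v_total(ni)
      enddo
      call vexp_ref (w_c, w_arg, nint)

      ! Overwrite w_arg with haz and w_c with cumhaz.
      do ni = 1, nint
        haps = dble(nint) - dble(ni) + 2d0
        rate = haps * (haps - 1d0) / 2d0
        w_arg(ni) = rate * w_a(ni)
        w_c(ni) = rate * w_c(ni) * (w_b(ni) - 1d0) / beta
      enddo

      ! w_a is free again, so it receives log(haz).
      call vlog_ref (w_a, w_arg, nint)

      prit_vec = 0d0
      do ni = 1, nint
        prit_vec = prit_vec + w_a(ni) - w_c(ni)
      enddo

      print *, 'scalar loop :', prit_ref
      print *, 'reused temps:', prit_vec

      contains

      ! Hypothetical stand-ins, NOT the MASS routines.
      ! Argument order assumed: output, input, element count.
      subroutine vexp_ref (y, x, m)
      integer, intent(in) :: m
      double precision, intent(in) :: x(m)
      double precision, intent(out) :: y(m)
      y = exp(x)
      end subroutine vexp_ref

      subroutine vlog_ref (y, x, m)
      integer, intent(in) :: m
      double precision, intent(in) :: x(m)
      double precision, intent(out) :: y(m)
      y = log(x)
      end subroutine vlog_ref

      end program reuse_demo
```

The two printed values should agree to within rounding. This variant is untimed, so whether the smaller memory footprint helps or hurts speed is an open question.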

Restructuring by hand like this is tedious, but it yielded a speedup of over 3x. Here are wallclock times (average of four runs) for the two versions of the routine, as run from the driver:


                   Version of Code
               =======================
               Original       Hand
                            vectorized
  xlf options   (secs)       (secs)
  -----------  --------     --------
    <default>    3.39         0.92
    -O3          3.18         0.83
    -O3 -qhot    0.93         0.86
    

It's somewhat reassuring to see that the hand-coded version managed to beat the best compiler time in all cases. Even better, since it's a lot of work to do this by hand, is the observation that with the "high order transformation" option (-qhot), the compiler does almost as well. (Different codes will, of course, respond differently.)

Combining manual and compiler optimization

Here's an approach to optimizing a code with massv:
  1. compile with -O3 -qhot,
  2. verify that program output is unchanged or acceptable--reordering execution can change results,
  3. profile the code,
  4. identify math intrinsics (if any) which take significant time,
  5. find the routine(s) where such intrinsics are called,
  6. determine if XLF has transformed them into vector intrinsics,
  7. if not, attempt to hand-optimize.
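
Step 2 in the list above can be partly automated with a relative-tolerance comparison. This is a minimal sketch: the function close_enough, the sample values, and the tolerance of 1d-12 are all assumptions chosen for illustration, and what counts as "acceptable" depends on the application.

```fortran
      program check_demo
      implicit none
      double precision :: ref, opt

      ! Hypothetical values standing in for one output quantity
      ! from the baseline build and from the -O3 -qhot build.
      ref = 1234.5678d0
      opt = ref * (1d0 + 1d-15)

      if (close_enough(ref, opt, 1d-12)) then
        print *, 'results agree within tolerance'
      else
        print *, 'results differ'
      endif

      contains

      ! True when a and b agree to relative tolerance rtol.
      logical function close_enough (a, b, rtol)
      double precision, intent(in) :: a, b, rtol
      double precision :: big
      big = max(abs(a), abs(b))
      close_enough = abs(a - b) .le. rtol * big
      end function close_enough

      end program check_demo
```

A relative test is used because reordered floating-point operations typically perturb the last few digits, whatever the magnitude of the result.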

Steps 3 and 6 require additional tools, as follows.

Profiling codes on the SP

  1. Recompile with the -pg option: xlf -pg -O3 -qhot ...
  2. Run the executable as usual
  3. This produces the trace file: gmon.out
  4. Use gprof to view the profile: gprof executable_name gmon.out

[ Note: this type of profiling adds overhead. Recompile without -pg before doing any production runs! ]

Here's a snippet of gprof output, from the "call graph profile":


                                  called/total       parents
index  %time    self descendents  called+self    name           index
                                  called/total       children

                0.05        0.00  515084/4435146     .endpolpr [17]
                0.05        0.00  553902/4435146     .endpolpl [15]
                0.26        0.00 2836611/4435146     .endpolp2 [3]
[13]     4.7    0.40        0.00 4435146         ._log [13]
                0.00        0.00      12/23          ._Errno [193]

This calling tree tells us that log is called by endpolpr, endpolpl, and endpolp2, and that log takes 4.7% of the code's total run time. (4.7% may not be worth worrying about.) It also tells us that log calls Errno.

From this, we know to examine the potential of subroutine endpolp2, which makes 2836611 of the 4435146 calls, for the replacement of log with vlog. To avoid wasting time, we must first determine if XLF has already done this, as described next.

Locate vector intrinsics added/missed by the compiler

  1. Obtain a compiler report by passing XLF the "-qreport" option: xlf -qreport -O3 -qhot ...
  2. View the ".lst" file produced by the compiler.

Here's a snippet from a report. It ain't perty, but you can find occurrences of both "_exp" and "CALL __vexp" in this transformed code. Remember, in the source code, "vexp" didn't exist... the compiler has rearranged the loops to replace exp with vexp.



                  @CSE23 = dr(1)
                  @CSE24 = _exp(-(rrate * %VAL(@CSE23)))
                  prob[].off0 = p1 * ( 1.0000000000000000E+000 - @CSE24)
                  @CSE25 = _exp(-(rrate2 * %VAL(@CSE23)))
                  prob2[].off0 = p1 * ( 1.0000000000000000E+000 - @CSE25)
                  temp2 = @CSE24
                  temp4 = @CSE25
                  @MARKSTK0 = __getstack()
                  GOTO lab_83
  2913
           lab_83
  2886
           IF ((@ICM6 > 0)) THEN
  2893
             @NumElements0 = int(int(@ICM6))
                    CALL __vexp((@addr.split6 + () + (8)*(0)),(@addr.split6 + ()&
                &      + (8)*(0)),@NumElements0)
  2894
             @NumElements1 = int(int(@ICM6))
                    CALL __vexp((@addr.split7 + () + (8)*(0)),(@addr.split7 + ()&
                &      + (8)*(0)),@NumElements1)
  2886
             @CIV4 = 0
       Id=11        DO @CIV4 = @CIV4, int(@ICM6)-1
  2898
               @CSE31 = @addr.split6%@split6(@CIV4)
                      temp2 = temp2 * @CSE31
                      

The compiler report also provides some annotation in English, as shown in the following snippet:


Source  Source  Loop Id  Action / Information
File    Line
-----   ------- ------- -----------------------------------------
    1    2893            Vectorization applied to statement.
    1    2894            Vectorization applied to statement.
    1    2886     11     The loop on line 2886 was created by the
                               distribution of the loop on line 2886.

Given this information, the programmer's goal is to find occurrences of _exp within the (transformed) loops. If found, we know that XLF was unable to "vectorize" those loops, and thus, the corresponding loops in the original source might possibly be hand-optimized. Loops are identified by the source code line numbers given in column 3 of the transformed code in the report.

In this example, and, in fact, for the complete application code, XLF "vectorized" every occurrence and left nothing to do by hand.

BLUI in SIGGRAPH Studio


ARSC/UAF's "Body Language User Interface", or BLUI, project will be featured in the "Studio" at SIGGRAPH, next week.
Here's the blurb from the Studio web page (under the new category, "VR," at the bottom):

http://www.siggraph.org/s2002/conference/studio/index.html

VR: "New for SIGGRAPH 2002, this area features a system for immersive display configured for 3D solid modeling. Bill Brody of the University of Alaska at Fairbanks demonstrates his 'BLUIsculpt' system, in which fully 3D objects can be created and output as .stl files for rapid prototyping."


DOE Benchmarking


Interesting work on DOE's benchmarking of early systems:

http://www.csm.ornl.gov/evaluation/index.html



Evaluation of Early Systems




Computational requirements for many large-scale simulations and ensemble studies of vital interest to the Department of Energy (DOE) exceed what is currently offered by any U.S. computer vendor. Examples are numerous, ranging from global change research to combustion to informatics. It is incumbent on DOE to be aware of the performance of new or beta systems from high performance computing vendors that will determine the performance of future production-class offerings. It is equally important that DOE work with vendors in finding solutions that will fulfill DOE's computational requirements.
In support of this mission, Oak Ridge National Laboratory (ORNL) is currently performing in-depth evaluations of a number of high performance computer systems,

Fortran Information



Learn about all things Fortran from Michael Metcalf's Fortran 90/95/HPF
Information File, at:

http://www.fortran.com/metcalf.htm




Document headings:
  • WHERE CAN I OBTAIN A FORTRAN 95 COMPILER?
  • OTHER USEFUL PRODUCTS
  • WHAT BOOKS ARE AVAILABLE? In these languages:
    • Chinese
    • Danish
    • Dutch
    • English
    • Finnish
    • French
    • German
    • Italian
    • Japanese
    • Russian
    • Swedish
  • WHERE CAN I OBTAIN COURSES, COURSE MATERIAL OR CONSULTANCY?
  • WHERE CAN I FIND THE FORTRAN AND HPF STANDARDS?

Quick-Tip Q & A



A:[[ June is mosquito month in Fairbanks, and 2002 has been impressive by
  [[ all accounts.  Send us your favorite (short) mosquito story, remedy,
  [[ or advice.  Any luck with mosquito traps, DEET-free dope, or personal
  [[ concoctions?

# The Gadget Award goes to Kate Hedstrom:

We got a "SonicWeb" trap, shown here. It has a heartbeat sound that is loud enough to hear. It also uses heat and Octenol to attract critters. Ours has trapped all sorts of insects, mostly flies, wasps and itty-bitty things. We also caught a dragonfly and a few mosquitos.

[Image: mosquito trap]

# From Brad Chamberlain:


My favorite mosquito story was when I was camping in college with a
Buddhist friend of mine.  As I angrily slapped mosquitos left and right, she
implored me, "Don't kill them, just brush them away.  They just want a
drop of your blood, and you want to take their life."  I pondered this a
few days and then asked her if she was reincarnated as a mosquito,
whether she might appreciate being sent on to her next life all the
sooner.  She had to agree that that sounded attractive... :)

But in spite of this funny exchange, she truly believed that if you
brush mosquitos away rather than swat at them, then they will leave you
alone.  And ever since then, I have brushed mosquitos away, and don't
remember the last time I was plagued with as many bites as I was when I
was a kid.

Interesting fact that I did not know:  According to Webster on-line, the
plural of mosquito is either -os or -oes.  Lucky Dan Quayle.



# From one of the editors:
The EPA says DEET products are safe ("when used as directed"), but a quick search on "DEET" && "Gulf War Syndrome" may give you pause. I do my best to avoid it. Head nets and long sleeves are the best. On the other hand, I'd *never* go hiking, fishing, etc., without some strong bug dope in my pack. Swarming mosquitos can drive a person crazy, and make you do things more dangerous than just wearing DEET. This summer, I've used DEET just to work in the yard, wanting to avoid smacking myself in the head with a shovel in an attempt to swat some bug.

# Tom Logan deserves some award for this...

You probably know the mosquitos in Alaska are big, but the other night I overheard this from two that were buzzing around my bed:

Mosquito 1: "I'm tired of eating out, let's pick him up and take him back to the swamp."
Mosquito 2: "NO! When we get back, the big ones will take him away from us!"

Q: I received a "not enough memory" error when trying to compile a large subroutine (part of a big code) on the T3E with -O3,unroll2. Is there any way to increase the memory allocation to f90, or am I stuck compiling that subroutine with -O2? Thanks!

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.