ARSC HPC Users' Newsletter 264, February 21, 2003

ARSC HPC Technology Watch

Keeping up with all the developments in high performance computing presents a challenge. New systems, rapidly evolving and maturing software, different approaches to solving old problems, and entirely new areas of HPC application present a flurry of activity to follow.

An informal group is forming which will be known as the "ARSC HPC Technology Watch."

This group will hold its first meeting February 26 in Butrovich 212 and will serve as a forum in which interested, like-minded individuals can gather to discuss the latest developments in HPC and share experiences in an informal setting. Each meeting will have a focus topic, discussed for the first half hour, followed by a general session of open discussion, questions, and shared experiences. Any interested parties are encouraged to attend.

Meetings will be held on the last Wednesday of the month, noon to 1 p.m., on the UAF campus, with the possibility of later meetings including contributions over the Access Grid. Starter topics for the first few meetings are described below:

February 26th: Room 212a, Butrovich

What is HPC technology watch? ARSC Director, Frank Williams, and Associate Director, Barbara Horner-Miller, will describe current activities at ARSC and answer questions.

March 26th: Room 106a, Butrovich

Q+A on storage issues: how to share data between systems, visualize data sets, and keep track of all your results.

April 30th: Room 204/Sherman Carter Conf Room, Butrovich

First experiences with recent arrivals at ARSC. Some benchmarks from new systems at ARSC over the past six months, new libraries, experiences.

Ideas for topics for future meetings are welcome. Please contact Guy Robinson, robinson@arsc.edu, if you plan to attend.

Halting (Or Not) on Numerical Exceptions

How to, or not to, stop on x/0.0 or (-1)**0.5?

ARSC now has a variety of different systems and they have different behaviors when encountering some of the common numerical errors which can occur even in the best of programs.

The sample sessions below use a simple test code (given at the end of the article) that generates two errors: the classic divide by zero and the more interesting case of a negative number raised to a fractional power. They show how to control what the code does when it encounters each error.

Naturally, users should respect such errors and write code that does not divide by zero or perform illegal operations. In practice, however, one often relies on libraries and other shared codes, and parts of an array may generate these errors without ever being used or referenced in your program. Sometimes, too, the error is a numerical aspect of the problem being solved rather than a simple coding mistake. The techniques below let you work around such problems, but use them with caution.

Cray SV1ex

On the SV1ex, the code stops and reports the error.


  chilkoot% f90 -o demo demo.f
  chilkoot% ./demo
    int  512,  0.9980506822612085,  2*0.E+0
  Floating point exception
  
   Beginning of Traceback:
    Interrupt at address 465d in routine 'DEMO_ZEROS'.
    Called from line 350 (address 22655c) in routine '$START$'.
   End of Traceback.
  Floating exception (core dumped)

How can we get things past the stop if we know it isn't serious?

The "cpu" command lets us control much of the environment in which an executable runs. On the SV1 it controls several hardware options, such as cache and memory access, and it can also disable exceptions. (See "man cpu" for full details.)

For example, we can allow the program to continue even after a floating point exception occurs:


  chilkoot% /etc/cpu -m fpeoff ./demo
    int  512,  0.9980506822612085,  2*0.E+0
    loop1  512,  0.9980506822612085,  2*0.E+0
  libm-335 : UNRECOVERABLE
    Vector Real raised by a scalar or vector Real has negative base.
  Abort

   Beginning of Traceback:
    [ ... cut ... ]
    Called from line 35 (address 526a) in routine 'DEMO_ZEROS'.
    Called from line 350 (address 22655c) in routine '$START$'.
   End of Traceback.
  Abort (core dumped)

Trapping the fractional power error is a more complicated process and we'll discuss this in a later item.

Cray SX-6

On the Cray SX-6 the code runs but reports errors. The default limit for the number of warnings before halting is 49 so we get many messages before the code halts.


 rime> f90 -o demo demo.f
  demo.f:
  
  f90: vec(1): demo.f, line 10: Vectorized loop.
  f90: vec(1): demo.f, line 23: Vectorized loop.
  f90: vec(1): demo.f, line 34: Vectorized loop.
  f90: demo.f, demo_zeros: There are 3 diagnoses.
  
 rime > ./demo
  int   512  0.9980507  0.000000E+00  0.000000E+00
    * 252 Floating-point zero divide PROG=demo_zeros ELN=25(400001580)
  loop1   512  0.9980507  0.000000E+00  inf
    * 274 RPWR -> R1<0 in R1**R2 PROG=demo_zeros ELN=36(4000017e8)

    [ ... 48 identical lines cut ... ]

 ****  99 Execution suspended PROG=demo_zeros ELN=36(4000017e8)
 

We can change the number of error messages by setting the "F_ERRCNT" environment variable to another value. This doesn't help if we aren't interested in the messages and don't want to see any of them. However, not stopping on the first error can sometimes be helpful: we get to see more of the numerical behavior of the code, which may be enlightening as the scientist tries to understand the algorithm and what is causing the bad numerical values.

On the SX-6, the handling of zero divide errors is through the compiler. (Compiler flags control not only zero divide, but underflow, overflow, and inexact.) We recompile with the appropriate flag:


  rime > f90 -o demo -Wf"-M nozdiv " demo.f
  demo.f:
  
  f90: vec(1): demo.f, line 10: Vectorized loop.
  f90: vec(1): demo.f, line 23: Vectorized loop.
  f90: vec(1): demo.f, line 34: Vectorized loop.
  f90: demo.f, demo_zeros: There are 3 diagnoses.

When run, the zero divide error is passed, but the program is halted on the other error:


   rime > ./demo
    int   512  0.9980507  0.000000E+00  0.000000E+00
    loop1   512  0.9980507  0.000000E+00  inf
      * 274 RPWR -> R1<0 in R1**R2 PROG=demo_zeros ELN=36(4000017e8)
      [ ... 48 identical lines cut ... ]
 ****  99 Execution suspended PROG=demo_zeros ELN=36(4000017e8)

Getting past the second error requires a little more attention to detail. This error is specific, a negative number raised to a fractional power, and has its very own code, 274 (see the error diagnostics above). We can tell the system to ignore any specific error by setting the environment variable "F_ERROPTn". Details, from the SX-6 online Fortran manual:


  2.4.1.5  F_ERROPTn
  
  If an error is detected during execution, an error message is issued or
  the program is terminated according to the error processing control data
  set for each error (see Section 8.3). This option changes the error
  processing control data, thereby changing error processing.
  
     For sh:
            F_ERROPTn = n1, n2, alt, err, m, t, a, cnt
            export F_ERROPTn
     For csh:
            setenv F_ERROPTn n1, n2, alt, err, m, t, a, cnt

  Option values are:

      n The priority of handling error-processing control. Select from
         numbers 1 to 9.
      n1 The first error number in the range for which error-processing
         control data is changed.
      n2 The last error number in the range for which error-processing
         control data is changed.
      alt Determines whether a user-defined error-processing routine is
          executed when an error is detected.
               0 = Not changed.
               1 = A user-defined error-processing routine is executed.
               2 = A user-defined error-processing routine is not executed.

      err Determines whether control is passed to the specified statement
          when an error specifier is included in an input/output statement.
               0 = Not changed.
               1 = Control is passed to the specified statement
               2 = Control is not passed to the specified statement

      m Determines if error messages are issued.
               0 = Not changed.
               1 = Error messages are issued
               2 = Error messages are not issued

      t Determines if a trace back message is issued.
               0 = Not changed.
               1 = Trace-back messages are issued
               2 = Trace-back messages are not issued

      a Determines if the program is terminated when an error is
        detected.
               0 = Not changed.
               1 = The program is terminated abnormally
               2 = The program continues processing

      cnt Determines whether the number of errors that occurred is counted
          when an error is detected.
               0 = Not changed.
               1 = The number of errors is counted
               2 = The number of errors is not counted

We decide on the following: to change the behavior for error 274 only, to leave the settings for user-defined routines and I/O error specifiers unchanged, not to issue error messages, not to trace back, not to terminate, and finally, not to count the errors in this class. Here's the appropriate setting, and a sample run:


 rime > setenv F_ERROPT1 274,274,0,0,2,2,2,2
 rime > ./demo
  int   512  0.9980507  0.000000E+00  0.000000E+00
  loop1   512  0.9980507  0.000000E+00  inf
  loop2   512  0.9980507  0.000000E+00  0.000000E+00
  loop2a   256  0.9961089  -0.9961089  nan

(Note the control we get with F_ERROPTn: we would still stop, count,
and terminate on other numerical errors outside of the 274 class.)
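To make the field order of the setting easier to follow, here is a minimal C sketch that splits an F_ERROPTn value into its eight fields. The struct and function names (erropt, parse_erropt, setting_continues_past) are hypothetical and are not part of the SX-6 runtime; the sketch only mirrors the option list quoted from the manual above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical container for the eight F_ERROPTn fields listed above. */
struct erropt {
    int n1, n2; /* first and last error number in the range        */
    int alt;    /* user-defined error routine: 0, 1 or 2           */
    int err;    /* ERR= specifier handling:    0, 1 or 2           */
    int m;      /* error messages:             0, 1 or 2           */
    int t;      /* trace-back messages:        0, 1 or 2           */
    int a;      /* termination:                0, 1 or 2           */
    int cnt;    /* error counting:             0, 1 or 2           */
};

/* Parse "n1,n2,alt,err,m,t,a,cnt". Returns 0 on success, -1 otherwise. */
int parse_erropt(const char *s, struct erropt *out)
{
    assert(s != NULL && out != NULL);

    int got = sscanf(s, "%d ,%d ,%d ,%d ,%d ,%d ,%d ,%d",
                     &out->n1, &out->n2, &out->alt, &out->err,
                     &out->m, &out->t, &out->a, &out->cnt);
    if (got != 8 || out->n1 > out->n2)
        return -1;

    /* Every control field must be 0 (not changed), 1 or 2. */
    int flags[6] = { out->alt, out->err, out->m, out->t, out->a, out->cnt };
    for (int i = 0; i < 6; i++)
        if (flags[i] < 0 || flags[i] > 2)
            return -1;
    return 0;
}

/* True only if this setting covers `code` and asks to continue (a = 2). */
int setting_continues_past(const struct erropt *e, int code)
{
    return code >= e->n1 && code <= e->n2 && e->a == 2;
}
```

Parsing "274,274,0,0,2,2,2,2" with this sketch gives a range of 274 to 274 with a = 2, so setting_continues_past() is true for error 274 and false for the zero divide error 252, which this setting does not touch.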

IBM Regatta

With the default options, the program will run straight through numerical exceptions without halting or warnings:


  f1n2 83% xlf -o demo demo.f
  ** demo_zeros   === End of Compilation 1 ===
  1501-510  Compilation successful for file demo.f.
  f1n2 84% ./demo
    int  512 0.9980506897 0.0000000000E+00 0.0000000000E+00
    loop1  512 0.9980506897 0.0000000000E+00 INF
    loop2  512 0.9980506897 0.0000000000E+00 0.0000000000E+00
    loop2a  256 0.9961089492 -0.9961089492 NaNQ

We can ask the code to stop by using simple compiler options. The option -qflttrap=zerodivide:enable forces the code to stop on zero divides. E.g.:


  f1n2 87% xlf -o demo -qflttrap=zerodivide:enable demo.f
  ** demo_zeros   === End of Compilation 1 ===
  1501-510  Compilation successful for file demo.f.
  f1n2 88% ./demo
    int  512 0.9980506897 0.0000000000E+00 0.0000000000E+00
  Breakpoint (core dumped)
 

The option -qflttrap=invalid:enable stops on invalid operations, a class which includes (-1.0)**(0.5):


  f1n2 89% xlf -o demo -qflttrap=invalid:enable demo.f
  ** demo_zeros   === End of Compilation 1 ===
  1501-510  Compilation successful for file demo.f.
  f1n2 90% ./demo
    int  512 0.9980506897 0.0000000000E+00 0.0000000000E+00
    loop1  512 0.9980506897 0.0000000000E+00 INF
  Breakpoint (core dumped)
 

Compiling with both simply stops at the first error encountered.

Discussion:

It is best to be a responsible user and not do illegal operations. The author recently received a large code which had been developed in the workstation environment. There were several segments where the code would divide by zero yet never use the result. Purists would suggest going through the code line by line to remove these operations, but that would have been a lengthy process. Using the above flags permitted the benchmarks to be run and the performance of the code to be measured, with the results validated to ensure that the use of the flags didn't result in bad behavior on the systems. (It is hoped to do some measurements later to assess whether the generation of errors has an impact on the performance of the systems.)
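One simple way to validate results after running with traps disabled is to scan the output arrays for values that are not finite. The following is a minimal C sketch of that idea, not the check actually used for the benchmarks described above; the function name count_nonfinite is illustrative.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Count the entries of x[0..n-1] that are infinite or NaN. */
size_t count_nonfinite(const float *x, size_t n)
{
    assert(x != NULL || n == 0);

    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (!isfinite(x[i]))   /* true for +inf, -inf and NaN */
            bad++;
    }
    return bad;
}
```

A count of zero on every array the program really uses suggests that the ignored exceptions stayed in the unused parts of the data. It does not prove the results are right, only that no inf or NaN leaked into them.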

Perhaps the best conclusion is that developing code with the flags set to trap any errors is good practice and strongly encouraged. I've often commented that if computers were pieces of workshop machinery, we wouldn't operate them without the safety guards in place, would we?

Next, what is the impact of testing on performance?

Here's the code used in the above examples:

demo.f

      program demo_zeros

      integer  nsize
      parameter (nsize=1024)

      real, dimension(nsize) :: a,b,c

      nhalf=nsize/2
      do n=1,nsize
        a(n)=real(n)/real(n+1)
        b(n)=real(n-nhalf)/real(n+1)
        c(n)=0.0
      enddo

      write(6,*) ' int ',nhalf,a(nhalf),b(nhalf),c(nhalf)
!!
      do n=1,nsize
        c(n)=a(n)/b(n)
      enddo

      write(6,*) ' loop1 ',nhalf,a(nhalf),b(nhalf),c(nhalf)
!!
      do n=1,nsize
        c(n)=b(n)**a(n)
      enddo

      write(6,*) ' loop2 ',nhalf,a(nhalf),b(nhalf),c(nhalf)
      nprint=nhalf/2
      write(6,*) ' loop2a ',nprint,a(nprint),b(nprint),c(nprint)

      stop
      end

SX-6 Quick Look at 32 vs 64 Bit Performance

[ Thanks to Andrew Lee of ARSC for this article. ]

A user on the SX-6 asked: what is the performance difference between using -dw and -ew? The two compiler options disable and enable wide precision. The option -dw (default) specifies that the size of the numeric storage unit is 4 bytes, while -ew changes the size of the numeric storage unit to 8 bytes. ARSC recommends that users stick with -ew unless they have a specific reason to do otherwise.

I sought to answer this question using a small test code written by Guy Robinson for Newsletter 213.

The code compiled and ran without error on rime using the -ew option. However, using "-dw" resulted in run-time errors "... Floating-point data overflow...". I modified the program by removing 24 of the "c(i-1,j)*(" lines and 8 of their corresponding "+1)-1)+1)" lines from each of the three large assignment statements in the main loop. This worked for both options, but the runtime was under 5 seconds, so I increased the number of iterations from 20K to 200K. Below are the compiler options and all performance data, obtained by setting "F_PROGINF" to "DETAIL", for two runs of the modified program.

As you can see, the program achieves over 6.5 GFLOPS on one CPU in both the "-dw" and "-ew" runs, and the measurements of every other metric are also very close. The difference may be attributable to the activity of other users on the system. Thus, for this code, performance does not appear to be a factor in deciding whether to use "-dw" or "-ew".

As a disclaimer, this program does not test every way in which precision can affect performance, such as i/o, memory use, and memory bandwidth.

Reasons a user might compile with the lower precision include compatibility of the program's binary i/o files with other programs and systems, and the need to squeeze more data into available memory.
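The overflow seen with "-dw" is what one would expect when intermediate values exceed the range of a 4-byte real. The C sketch below illustrates the effect with repeated squaring; it is not the gflop.f code, and the function names are made up for the illustration. Under IEEE arithmetic in C the 4-byte result silently becomes infinity, whereas on the SX-6 the same condition was reported as a run-time error.

```c
#include <assert.h>
#include <float.h>

/* Square x `times` times, storing each step in a 4-byte float. */
float repeated_square_float(float x, int times)
{
    assert(times >= 0);
    volatile float v = x;   /* volatile forces rounding to 4 bytes each step */
    for (int i = 0; i < times; i++)
        v = v * v;
    return v;
}

/* The same computation with an 8-byte double. */
double repeated_square_double(double x, int times)
{
    assert(times >= 0);
    volatile double v = x;
    for (int i = 0; i < times; i++)
        v = v * v;
    return v;
}
```

Starting from 1.0e10 and squaring twice aims for 1.0e40. That is beyond FLT_MAX (about 3.4e38), so the float version overflows, while the double version stays comfortably in range.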

Results:

gflop.f with the disable wide option "-dw", run on 1 CPU, compiled with:


  rime> f90 -Wf,-L,fmtlist,map -dw -C hopt   -G global -sx6 -c gflop.f

     ******  Program Information  ******
  Real Time (sec)       :         42.097395
  User Time (sec)       :         42.001194
  Sys  Time (sec)       :          0.037286
  Vector Time (sec)     :         42.000494
  Inst. Count           :        2974633689.
  V. Inst. Count        :        2652801569.
  V. Element Count      :      320326821000.
  FLOP Count            :      280483814445.
  MOPS                  :       7634.274625
  MFLOPS                :       6677.996139
  VLEN                  :        120.750389
  V. Op. Ratio (%)      :         99.899631
  Memory Size (MB)      :         48.031250
  MIPS                  :         70.822597
  I-Cache (sec)         :          0.001511
  O-Cache (sec)         :          0.002131
  Bank (sec)            :          0.000001
 
  Start Time (date)  :  2003/02/20 12:26:18
  End   Time (date)  :  2003/02/20 12:27:00

gflop.f with the enable wide option "-ew", run on 1 CPU, compiled with:


  f90 -Wf,-L,fmtlist,map -ew -C hopt   -G global -sx6 -c gflop.f

     ******  Program Information  ******
  Real Time (sec)       :         42.783361
  User Time (sec)       :         42.716838
  Sys  Time (sec)       :          0.030750
  Vector Time (sec)     :         42.716304
  Inst. Count           :        2974634175.
  V. Inst. Count        :        2652801625.
  V. Element Count      :      320326859400.
  FLOP Count            :      280483814445.
  MOPS                  :       7506.376965
  MFLOPS                :       6566.118299
  VLEN                  :        120.750401
  V. Op. Ratio (%)      :         99.899631
  Memory Size (MB)      :         48.031250
  MIPS                  :         69.636103
  I-Cache (sec)         :          0.001148
  O-Cache (sec)         :          0.001526
  Bank (sec)            :          0.000000
 
  Start Time (date)  :  2003/02/20 12:24:19
  End   Time (date)  :  2003/02/20 12:25:02
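The MFLOPS figures in both listings appear to be the FLOP count divided by the user time, in millions. The short C sketch below reproduces that arithmetic and the relative difference between the two runs; the function names are illustrative, and the formula is inferred from the numbers above rather than taken from SX-6 documentation.

```c
#include <assert.h>

/* Millions of floating point operations per second. */
double mflops(double flop_count, double seconds)
{
    assert(seconds > 0.0);
    return flop_count / seconds / 1.0e6;
}

/* Difference of a relative to b, in percent. */
double percent_diff(double a, double b)
{
    assert(b != 0.0);
    return 100.0 * (a - b) / b;
}
```

With the values printed above, mflops(280483814445.0, 42.001194) gives about 6678 for the "-dw" run and mflops(280483814445.0, 42.716838) gives about 6566 for the "-ew" run, a difference of roughly 1.7%.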

(If you're a rime user, please see the detailed explanation of how -ew and -dw affect the size of each data type: table 2.3, Chapter 2.2.1, "f90 and sxf90 Command Options," found at:

http://www.arsc.edu/craydocs/sx6docs/g1af07e/chap2.html

)

ARSC User News: SV1ex, T3E, and SX-6 Users, and Web Docs

Chilkoot and Yukon:

  • Programming environment 3.6 (PE3.6) remains available for testing, by switching your environment from PrgEnv to PrgEnv.new. This was first announced in issue 257.
  • f90 3.6.0.1 and CC 3.6.0.1 have been installed and correct a couple of minor problems found in 3.6.0.0. Next Wednesday, Feb. 26th, they'll be added to PrgEnv.new.
  • The default programming environment (PrgEnv) will be updated to PE3.6 (PrgEnv.new) on March 12.

Chilkoot:

  • BIOLIB 1.1 has been installed as the default. It is loaded automatically with the command: "module load biolib". This provides a library of low-level routines, optimized for the SV1ex architecture, for use in bioinformatics applications. See: http://www.arsc.edu/support/manuals/LIBCBL/

Rime:

  • /usr/tmp purge policy has been changed. Files not accessed in 7 days are deleted.
  • The NQS queue structure was modified. See "news rime_queues" on rimegate and "news NQS_queues" on rime, for good measure!
  • If you need dedicated timings, simply submit your tested and debugged NQS job to the "sjs" (single job streaming) queue. If your job will take over 2 hours, please contact ARSC consulting in advance, so we can alert the SX-6 operators.

    Once submitted to sjs, your job will wait in the sjs queue until the next SJS time _or_ the next unused dedicated time (see "news rime_sched" on rimegate for the current week's schedule), and then it will be run as the only job on the system. This is an expensive use of the SX-6, so please verify the correctness of your scripts, data, program, etc. using the normal "batch" queues first. On the other hand, dedicated testing is a vital final step in benchmarking codes and we encourage you to do it. So when you're ready, don't be shy.

    As always, contact consult@arsc.edu with any concerns or questions.

Web Documentation:

  • Our web site has had some minor changes. See our top level documentation page: http://www.arsc.edu/support/documentation.html
  • CrayDocs: http://www.arsc.edu/craydocs/
  • Third party product (IMSL, HDF, PGI, Idesk, Ferret) as well as Cray bioinformatics library (CBL) docs: http://www.arsc.edu/support/manuals.html

Iditarod and Next HPC Newsletter

The editors will be attending Cray X1 training in two weeks, so the next newsletter won't appear till March 14th.

We're excited about the training, but to our chagrin, we'll be out of town for the Iditarod Sled Dog Race, which, for the first time ever, is starting here in Fairbanks.

The ceremonial start will be in Anchorage, as usual. But due to lack of snow and cold, the so-called "re-start" has been moved from its usual location in Wasilla to our fair city, 300 miles farther north. For the curious:

http://www.adn.com/iditarod/news/prerace/story/2604849p-2651411c.html

Quick-Tip Q & A


A:[[ A C style question. Don't worry what any of this does, the goal is
  [[ just to compare a couple things and choose one:
  [[
  [[
  [[    Option 1:
  [[   ---------
  [[  vel = (envel / scaleFct) / ((*outv)[n] + posStrt) < minvel ? 
  [[       (envel / scaleFct) / ((*outv)[n] + posStrt) : 
  [[       minvel;
  [[
  [[    Option 2:
  [[   ---------
  [[  vel = (envel / scaleFct) / ((*outv)[n] + posStrt);
  [[ if ( ! (vel < minvel) ) 
  [[   vel = minvel;
  [[  
  [[
  [[  Isn't there a cleaner way?  If not, which should I prefer?

  # 
  # Thanks to Dale Clark of ARSC: 
  # 
  I would format the conditional expression like this:
  
       vel = (envel / scaleFct) / ((*outv)[n] + posStrt) < minvel
           ? (envel / scaleFct) / ((*outv)[n] + posStrt)
           : minvel;
  
  But that's pretty idiosyncratic, and may result in the arithmetic
  expression being computed twice. So, even better (both stylistically and
  functionally):
  
       vel = min((envel / scaleFct) / ((*outv)[n] + posStrt),minvel);
  
  where 'min' is the C equiv. of something like (lapsing into Perl here):
  
  sub min
  {
    my($Min,@Rest) = @_;
  
    for (@Rest) { $Min = $_ if $_ < $Min }
  
    return $Min;
  }
  
  # 
  # Thanks to Brad Chamberlain: 
  # 
  Of these two, I'd stick to the second, since it only writes the new
  velocity expression once rather than twice.  This will make the code
  easier to read, easier to maintain, and possibly more efficient if the
  compiler is unable to factor the common subexpression into a temp (I tend
  to try not to rely on the compiler to optimize my code, especially when it
  also keeps my code cleaner).
  
  To further increase the readability of the second, I'd consider rewriting
  the condition as:
  
        if (vel >= minvel)
  
  since it doesn't rely on the reader to mentally negate the sense of the
  expression.
  
  If you really like conditional expressions, and don't equate shorter code
  with cleaner code, I would also consider doing something like the
  following (naming things often makes them cleaner, and I like that it only
  assigns vel once):
  
        newvel = (envel / scaleFct) / ((*outv)[n] + posStrt);
        vel = (newvel < minvel) ? newvel : minvel;
  
  You could also consider moving this expression to a macro, though it
  wouldn't change the performance or implementation at all, merely how
  the code reads:
  
        #define cap(newvel, minvel) (((newvel) < (minvel)) ? (newvel) : (minvel))
  
        vel = cap((envel / scaleFct) / ((*outv)[n] + posStrt), minvel);
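A plain C function is one way to get the 'min' that Dale Clark sketches in Perl above, and unlike a macro it evaluates each argument only once. The following is a minimal sketch; the names min_double and min_of are illustrative and do not come from any library mentioned here.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two values; each argument is evaluated exactly once. */
static double min_double(double a, double b)
{
    return (a < b) ? a : b;
}

/* Smallest element of v[0..n-1], mirroring the list form of the Perl sub. */
double min_of(const double *v, size_t n)
{
    assert(v != NULL && n > 0);

    double m = v[0];
    for (size_t i = 1; i < n; i++) {
        if (v[i] < m)
            m = v[i];
    }
    return m;
}
```

The assignment from the question then reads:

      vel = min_double((envel / scaleFct) / ((*outv)[n] + posStrt), minvel);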




Q: Here's part of my PATH on ARSC's SX-6 frontend host, "rimegate": 

     rimegate$ echo $PATH
     /SX/opt/crosskit/inst/bin:/SX/opt/sxcc/inst/bin:/SX/opt/sxc++/inst/b
     in:/SX/opt/sxf90/inst/bin:/usr/psuite:/SX/opt/mpi2sx/inst:/SX/opt/mp 
     isx/inst:/usr/psuite:/SX/opt/vampirsx/inst/bin:/usr/local/krb5/bin:/ 
     usr/sbin:etc...

   Can I grep through the separate paths?  For instance, can I tell grep
   to use colons rather than newlines for the delimiter?  I want to see
   all the paths containing "SX", like this:

     echo $PATH | grep SX

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.