ARSC HPC Users' Newsletter 264, February 21, 2003
ARSC HPC Technology Watch
Keeping up with all the developments in high performance computing presents a challenge. New systems, rapidly evolving and maturing software, different approaches to solving old problems, and entirely new areas of HPC application present a flurry of activity to follow.
An informal group is forming which will be known as the "ARSC HPC Technology Watch."
This group will hold its first meeting February 26 in Butrovich 212, and will serve as a forum in which interested and like-minded individuals can gather to discuss the latest developments in HPC and share experiences in an informal setting. Each meeting will have a focus topic that will be discussed for the first half hour, followed by a general session of open discussion, questions, and experiences. Any interested parties are encouraged to attend.
Meetings will be held on the last Wednesday of the month, noon to 1 p.m., on the UAF campus, with the possibility of later meetings including contributions over the Access Grid. Starter topics for the first few meetings are described below:
February 26th: Room 212a, Butrovich
What is HPC technology watch? ARSC Director, Frank Williams, and Associate Director, Barbara Horner-Miller, will describe current activities at ARSC and answer questions.
March 26th: Room 106a, Butrovich
Q+A on storage issues. How to share data between systems, visualize data sets, and keep track of all your results.
April 30th: Room 204/Sherman Carter Conf Room, Butrovich
First experiences with recent arrivals at ARSC. Some benchmarks from new systems at ARSC over the past six months, new libraries, experiences.
Ideas for topics for future meetings are welcome. Please contact Guy Robinson, robinson@arsc.edu, if you plan to attend.
Halting (Or Not) on Numerical Exceptions
How to, or not to, stop on /0.0 or (-1)**0.5?
ARSC now has a variety of different systems and they have different behaviors when encountering some of the common numerical errors which can occur even in the best of programs.
The sample sessions below use a simple test code (given at the end of the article) that generates two errors: the classic divide by zero, and the more interesting case of a negative number raised to a fractional power. They show how to control the behavior of the code when it encounters each error.
Naturally, users should respect such errors and write code that does not divide by zero or perform illegal operations. However, all practical programmers know that one often uses libraries and other shared codes, and that parts of an array may generate these errors without ever being used or referenced in the program. Sometimes, too, the error is a numerical aspect of the problem being solved and not a simple coding error. The techniques below can get you around such problems, but are to be used with caution.
Cray SV1ex
On the SV1ex, the code stops and reports the error.
chilkoot% f90 -o demo demo.f
chilkoot% ./demo
int 512, 0.9980506822612085, 2*0.E+0
Floating point exception
Beginning of Traceback:
Interrupt at address 465d in routine 'DEMO_ZEROS'.
Called from line 350 (address 22655c) in routine '$START$'.
End of Traceback.
Floating exception (core dumped)
How can we get things past the stop if we know it isn't serious?
The "cpu" command lets us control much of the environment in which an executable runs. On the SV1, several hardware options, such as cache and memory access, can be controlled, and exceptions can be disabled as well. (See "man cpu" for full details.)
For example, we can allow the program to continue even after a floating point exception occurs:
chilkoot% /etc/cpu -m fpeoff ./demo
int 512, 0.9980506822612085, 2*0.E+0
loop1 512, 0.9980506822612085, 2*0.E+0
libm-335 : UNRECOVERABLE
Vector Real raised by a scalar or vector Real has negative base.
Abort
Beginning of Traceback:
[ ... cut ... ]
Called from line 35 (address 526a) in routine 'DEMO_ZEROS'.
Called from line 350 (address 22655c) in routine '$START$'.
End of Traceback.
Abort (core dumped)
Trapping the fractional power error is a more complicated process and we'll discuss this in a later item.
Cray SX-6
On the Cray SX-6 the code runs but reports errors. The default limit for the number of warnings before halting is 49 so we get many messages before the code halts.
rime> f90 -o demo demo.f
demo.f:
f90: vec(1): demo.f, line 10: Vectorized loop.
f90: vec(1): demo.f, line 23: Vectorized loop.
f90: vec(1): demo.f, line 34: Vectorized loop.
f90: demo.f, demo_zeros: There are 3 diagnoses.
rime > ./demo
int 512 0.9980507 0.000000E+00 0.000000E+00
* 252 Floating-point zero divide PROG=demo_zeros ELN=25(400001580)
loop1 512 0.9980507 0.000000E+00 inf
* 274 RPWR -> R1<0 in R1**R2 PROG=demo_zeros ELN=36(4000017e8)
[ ... 48 identical lines cut ... ]
**** 99 Execution suspended PROG=demo_zeros ELN=36(4000017e8)
We can modify the number of error messages by setting the "F_ERRCNT" environment variable to another value. But this doesn't really help if we don't want to see any messages and aren't interested in them. However, not stopping on the first error can sometimes be helpful: we get to see more of the numerical behavior of the code, which may be enlightening as the scientist attempts to understand the algorithm's numerical behavior and what is causing the bad values.
On the SX-6, the handling of zero divide errors is through the compiler. (Compiler flags control not only zero divide, but underflow, overflow, and inexact.) We recompile with the appropriate flag:
rime > f90 -o demo -Wf"-M nozdiv " demo.f
demo.f:
f90: vec(1): demo.f, line 10: Vectorized loop.
f90: vec(1): demo.f, line 23: Vectorized loop.
f90: vec(1): demo.f, line 34: Vectorized loop.
f90: demo.f, demo_zeros: There are 3 diagnoses.
When run, the zero divide error is passed, but the program is halted on the other error:
rime > ./demo
int 512 0.9980507 0.000000E+00 0.000000E+00
loop1 512 0.9980507 0.000000E+00 inf
* 274 RPWR -> R1<0 in R1**R2 PROG=demo_zeros ELN=36(4000017e8)
[ ... 48 identical lines cut ... ]
**** 99 Execution suspended PROG=demo_zeros ELN=36(4000017e8)
Getting past the second error requires a little more attention to detail. This error is specific, a negative number raised to a fractional power, and has its very own code, 274 (see the error diagnostics above). We can tell the system how to handle any specific error by setting the environment variable "F_ERROPTn". Details, from the SX-6 online Fortran manual:
2.4.1.5 F_ERROPTn
If an error is detected during execution, an error message is issued or
the program is terminated according to the error processing control data
set for each error (see Section 8.3). This option changes the error
processing control data, thereby changing error processing.
For sh:
F_ERROPTn = n1, n2, alt, err, m, t, a, cnt
export F_ERROPTn
For csh:
setenv F_ERROPTn n1, n2, alt, err, m, t, a, cnt
Option values are:
n The priority of handling error-processing control. Select from
numbers 1 to 9.
n1 The first error number in the range for which error-processing
control data is changed.
n2 The last error number in the range for which error-processing
control data is changed.
alt Determines whether a user-defined error-processing routine is
executed when an error is detected.
0 = Not changed.
1 = A user-defined error-processing routine is executed.
2 = A user-defined error-processing routine is not executed.
err Determines whether control is passed to the specified statement
when an error specifier is included in an input/output statement.
0 = Not changed.
1 = Control is passed to the specified statement
2 = Control is not passed to the specified statement
m Determines if error messages are issued.
0 = Not changed.
1 = Error messages are issued
2 = Error messages are not issued
t Determines if a trace back message is issued.
0 = Not changed.
1 = Trace-back messages are issued
2 = Trace-back messages are not issued
a Determines if the program is terminated when an error is
detected.
0 = Not changed.
1 = The program is terminated abnormally
2 = The program continues processing
cnt Determines whether the number of errors that occurred is counted
when an error is detected.
0 = Not changed.
1 = The number of errors is counted
2 = The number of errors is not counted
We decide on the following: to change the behavior for error 274 only, to leave the settings for user-defined routines and for I/O error specifiers unchanged, not to issue error messages, not to trace back, not to terminate, and finally, not to count the errors in this class. Here's the appropriate setting, and a sample run:
rime > setenv F_ERROPT1 274,274,0,0,2,2,2,2
rime > ./demo
int 512 0.9980507 0.000000E+00 0.000000E+00
loop1 512 0.9980507 0.000000E+00 inf
loop2 512 0.9980507 0.000000E+00 0.000000E+00
loop2a 256 0.9961089 -0.9961089 nan
(Note the control we get with F_ERROPTn: we would still stop, count, and terminate on other numerical errors outside of the 274 class.)
IBM Regatta
With the default options, the program will run straight through numerical exceptions without halting or warnings:
f1n2 83% xlf -o demo demo.f
** demo_zeros === End of Compilation 1 ===
1501-510 Compilation successful for file demo.f.
f1n2 84% ./demo
int 512 0.9980506897 0.0000000000E+00 0.0000000000E+00
loop1 512 0.9980506897 0.0000000000E+00 INF
loop2 512 0.9980506897 0.0000000000E+00 0.0000000000E+00
loop2a 256 0.9961089492 -0.9961089492 NaNQ
We can ask the code to stop by using simple compiler options. The option -qflttrap=zerodivide:enable forces the code to stop on zero divides. E.g.:
f1n2 87% xlf -o demo -qflttrap=zerodivide:enable demo.f
** demo_zeros === End of Compilation 1 ===
1501-510 Compilation successful for file demo.f.
f1n2 88% ./demo
int 512 0.9980506897 0.0000000000E+00 0.0000000000E+00
Breakpoint (core dumped)
The option -qflttrap=invalid:enable stops on invalid operations, a class which includes (-1.0)**(0.5):
f1n2 89% xlf -o demo -qflttrap=invalid:enable demo.f
** demo_zeros === End of Compilation 1 ===
1501-510 Compilation successful for file demo.f.
f1n2 90% ./demo
int 512 0.9980506897 0.0000000000E+00 0.0000000000E+00
loop1 512 0.9980506897 0.0000000000E+00 INF
Breakpoint (core dumped)
Compiling with both simply stops at the first error encountered.
Discussion:
It is best to be a responsible user and not do illegal operations. The author recently received a large code which had been developed in the workstation environment. There were several segments where the code would divide by zero yet never use the result. Purists would suggest going through the code and fixing every such line, but that would have been a lengthy process. Using the above flags permitted the benchmarks to be performed and the performance of the code to be measured, with the results validated to ensure the use of the flags didn't result in bad behavior on the systems. (We hope to do some measurements later to assess whether the generation of errors has an impact on the performance of the systems.)
Perhaps the best conclusion would be that developing code with the flags set to trap any errors is a good practice, strongly encouraged. I've often commented if computers were pieces of workshop machinery we'd not operate them without the safety guards in place would we?
Next, what is the impact of testing on performance?
Here's the code used in the above examples:
demo.f
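      program demo_zeros
!! Generates a zero divide (loop 1) and a negative base raised to a
!! fractional power (loop 2). n, nhalf, nprint are implicit integers.
      integer nsize
      parameter (nsize=1024)
      real, dimension(nsize) :: a,b,c
      nhalf=nsize/2
!! a is always positive; b is negative for n<nhalf, zero at n=nhalf
      do n=1,nsize
        a(n)=real(n)/real(n+1)
        b(n)=real(n-nhalf)/real(n+1)
        c(n)=0.0
      enddo
      write(6,*) ' int ',nhalf,a(nhalf),b(nhalf),c(nhalf)
!! loop 1: divide by zero at n=nhalf
      do n=1,nsize
        c(n)=a(n)/b(n)
      enddo
      write(6,*) ' loop1 ',nhalf,a(nhalf),b(nhalf),c(nhalf)
!! loop 2: negative base, fractional exponent for n<nhalf
      do n=1,nsize
        c(n)=b(n)**a(n)
      enddo
      write(6,*) ' loop2 ',nhalf,a(nhalf),b(nhalf),c(nhalf)
      nprint=nhalf/2
      write(6,*) ' loop2a ',nprint,a(nprint),b(nprint),c(nprint)
      stop
      end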
SX-6 Quick Look at 32 vs 64 Bit Performance
[ Thanks to Andrew Lee of ARSC for this article. ]
A user on the SX-6 asked: what is the performance difference between using -dw and -ew? These two compiler options disable and enable wide precision, respectively. The option -dw (the default) specifies that the size of the numeric storage unit is 4 bytes, while -ew changes it to 8 bytes. ARSC recommends that users stick with -ew unless they have a specific reason to do otherwise.
I sought to answer this question using a small test code written by Guy Robinson for Newsletter 213 :
The code compiled and ran without error on rime using the "-ew" option. However, using "-dw" resulted in run-time errors "... Floating-point data overflow...". I modified the program by removing 24 of the "c(i-1,j)*(" lines and 8 of their corresponding "+1)-1)+1)" lines from each of the three large assignment statements in the main loop. This worked for both options, but the runtime was under 5 seconds, so I increased the number of iterations from 20K to 200K. Below are the compiler options and all performance data obtained by setting "F_PROGINF DETAIL" for two runs of the modified program.
As you can see, the program achieves over 6.5 GFLOPS on one CPU in both the "-dw" and "-ew" runs, and the measurements of every other metric are also very close. The difference may be attributable to the activity of other users on the system. Thus, for this code, performance does not appear to be a factor in deciding whether to use "-dw" or "-ew".
As a disclaimer, this program does not test all performance related precision issues such as i/o, memory, and memory bandwidth.
Reasons a user might compile with the lower precision include compatibility of the program's binary i/o files with other programs and systems and need to squeeze more data into available memory.
Results:
gflop.f compiled with the disable wide option "-dw", run on 1 CPU, compiled with:
rime> f90 -Wf,-L,fmtlist,map -dw -C hopt -G global -sx6 -c gflop.f
****** Program Information ******
Real Time (sec) : 42.097395
User Time (sec) : 42.001194
Sys Time (sec) : 0.037286
Vector Time (sec) : 42.000494
Inst. Count : 2974633689.
V. Inst. Count : 2652801569.
V. Element Count : 320326821000.
FLOP Count : 280483814445.
MOPS : 7634.274625
MFLOPS : 6677.996139
VLEN : 120.750389
V. Op. Ratio (%) : 99.899631
Memory Size (MB) : 48.031250
MIPS : 70.822597
I-Cache (sec) : 0.001511
O-Cache (sec) : 0.002131
Bank (sec) : 0.000001
Start Time (date) : 2003/02/20 12:26:18
End Time (date) : 2003/02/20 12:27:00
gflop.f compiled with the enable wide option "-ew", run on 1 CPU, compiled with:
f90 -Wf,-L,fmtlist,map -ew -C hopt -G global -sx6 -c gflop.f
****** Program Information ******
Real Time (sec) : 42.783361
User Time (sec) : 42.716838
Sys Time (sec) : 0.030750
Vector Time (sec) : 42.716304
Inst. Count : 2974634175.
V. Inst. Count : 2652801625.
V. Element Count : 320326859400.
FLOP Count : 280483814445.
MOPS : 7506.376965
MFLOPS : 6566.118299
VLEN : 120.750401
V. Op. Ratio (%) : 99.899631
Memory Size (MB) : 48.031250
MIPS : 69.636103
I-Cache (sec) : 0.001148
O-Cache (sec) : 0.001526
Bank (sec) : 0.000000
Start Time (date) : 2003/02/20 12:24:19
End Time (date) : 2003/02/20 12:25:02
(If you're a rime user, please see detailed explanations of how -ew and -dw affect the size of each data type: table 2.3, Chapter 2.2.1, "f90 and sxf90 Command Options," found at:
http://www.arsc.edu/craydocs/sx6docs/g1af07e/chap2.html
)
ARSC User News: SV1ex, T3E, and SX-6 Users, and Web Docs
Chilkoot and Yukon:
- Programming environment 3.6 (PE3.6) remains available for testing, by switching your environment from PrgEnv to PrgEnv.new. This was first announced in issue 257.
- f90 3.6.0.1 and CC 3.6.0.1 have been installed and correct a couple of minor problems found in 3.6.0.0. Next Wednesday, Feb. 26th, they'll be added to PrgEnv.new.
- The default programming environment (PrgEnv) will be updated to PE3.6 (PrgEnv.new) on March 12.
Chilkoot:
- BIOLIB 1.1 has been installed as the default. It is loaded automatically with the command: "module load biolib". This provides a library of low-level routines, optimized for the SV1ex architecture, for use in bioinformatics applications. See: http://www.arsc.edu/support/manuals/LIBCBL/
Rime:
- /usr/tmp purge policy has been changed. Files not accessed in 7 days are deleted.
- The NQS queue structure was modified. See "news rime_queues" on rimegate and "news NQS_queues" on rime, for good measure!
- If you need dedicated timings, simply submit your tested and debugged NQS job to the "sjs" (single job streaming) queue. If your job will take over 2 hours, please contact ARSC consulting in advance, so we can alert the SX-6 operators.
Once submitted to sjs, your job will wait in the sjs queue until the next SJS time _or_ the next unused dedicated time (see "news rime_sched" on rimegate for the current week's schedule), and then it will be run as the only job on the system. This is an expensive use of the SX-6, so please verify the correctness of your scripts, data, program, etc. using the normal "batch" queues first. On the other hand, dedicated testing is a vital final step in benchmarking codes and we encourage you to do it. So when you're ready, don't be shy.
As always, contact consult@arsc.edu with any concerns or questions.
Web Documentation:
- Our web has had some minor changes. See our top level documentation page: http://www.arsc.edu/support/documentation.html
- CrayDocs: http://www.arsc.edu/craydocs/
- Third party product (IMSL, HDF, PGI, Idesk, Ferret) as well as Cray bioinformatics library (CBL) docs: http://www.arsc.edu/support/manuals.html
Iditarod and Next HPC Newsletter
The editors will be attending Cray X1 training in two weeks, so the next newsletter won't appear till March 14th.
We're excited about the training, but to our chagrin, we'll be out of town for the Iditarod Sled Dog Race, which, for the first time ever, is starting here in Fairbanks.
The ceremonial start will be in Anchorage, as usual. But due to lack of snow and cold, the so-called "re-start" has been moved from its usual location in Wasilla to our fair city, 300 miles farther north. For the curious:
http://www.adn.com/iditarod/news/prerace/story/2604849p-2651411c.html
Quick-Tip Q & A
A:[[ A C style question. Don't worry what any of this does, the goal is
[[ just to compare a couple things and choose one:
[[
[[
[[ Option 1:
[[ ---------
[[ vel = (envel / scaleFct) / ((*outv)[n] + posStrt) < minvel ?
[[ (envel / scaleFct) / ((*outv)[n] + posStrt) :
[[ minvel;
[[
[[ Option 2:
[[ ---------
[[ vel = (envel / scaleFct) / ((*outv)[n] + posStrt);
[[ if ( ! (vel < minvel) )
[[ vel = minvel;
[[
[[
[[ Isn't there a cleaner way? If not, which should I prefer?
#
# Thanks to Dale Clark of ARSC:
#
I would format the conditional expression like this:
vel = (envel / scaleFct) / ((*outv)[n] + posStrt) < minvel
? (envel / scaleFct) / ((*outv)[n] + posStrt)
: minvel;
But that's pretty idiosyncratic, and may result in the arithmetic
expression being computed twice. So, even better (both stylistically and
functionally):
vel = min((envel / scaleFct) / ((*outv)[n] + posStrt),minvel);
where 'min' is the C equiv. of something like (lapsing into Perl here):
sub min
{
my($Min,@Rest) = @_;
for (@Rest) { $Min = $_ if $_ < $Min }
return $Min;
}
#
# Thanks to Brad Chamberlain:
#
Of these two, I'd stick to the second, since it only writes the new
velocity expression once rather than twice. This will make the code
easier to read, easier to maintain, and possibly more efficient if the
compiler is unable to factor the common subexpression into a temp (I tend
to try not to rely on the compiler to optimize my code, especially when it
also keeps my code cleaner).
To further increase the readability of the second, I'd consider rewriting
the condition as:
if (vel >= minvel)
since it doesn't rely on the reader to mentally negate the sense of the
expression.
If you really like conditional expressions, and don't equate shorter code
with cleaner code, I would also consider doing something like the
following (naming things often makes them cleaner, and I like that it only
assigns vel once):
newvel = (envel / scaleFct) / ((*outv)[n] + posStrt);
vel = (newvel < minvel) ? newvel : minvel;
You could also consider moving this expression to a macro, though it
wouldn't change the performance or implementation at all, merely how
the code reads:
#define cap(newvel, minvel) (((newvel) < (minvel)) ? (newvel) : (minvel))
vel = cap((envel / scaleFct) / ((*outv)[n] + posStrt), minvel);
Q: Here's part of my PATH on ARSC's SX-6 frontend host, "rimegate":
rimegate$ echo $PATH
/SX/opt/crosskit/inst/bin:/SX/opt/sxcc/inst/bin:/SX/opt/sxc++/inst/b
in:/SX/opt/sxf90/inst/bin:/usr/psuite:/SX/opt/mpi2sx/inst:/SX/opt/mp
isx/inst:/usr/psuite:/SX/opt/vampirsx/inst/bin:/usr/local/krb5/bin:/
usr/sbin:etc...
Can I grep through the separate paths? For instance, can I tell grep
to use colons rather than newlines for the delimiter? I want to see
all the paths containing "SX", like this:
echo $PATH
grep SX
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
