ARSC T3D Users' Newsletter 101, August 23, 1996
ARSC Upgrades to UNICOS MAX 1.3.0.2
Last Tuesday, we upgraded the T3D's MAX operating system from version 1.2.0.5 to version 1.3.0.2. This upgrade should be transparent, but you should recompile and relink all of your T3D executables because it affects include and library files as well as kernel routines and the user environment.
Here are the release contents for both MAX 1.3.0.0 and 1.3.0.2. In each case, point #1 is the most important.
Release contents
----------------
The UNICOS MAX 1.3.0.0 release includes the following changes:
1) Added a series of fixes designed to enhance system stability
2) Added support for preallocation of the roll file
3) Added binary executables for SAM, mppview, and URM
4) Added support for Phase III I/O
Release contents
----------------
The UNICOS MAX 1.3.0.2 release includes the following changes:
1) Added a series of fixes designed to enhance system stability
2) Added some improvements to the XDR routines, primarily to improve
the performance by converting numbers in large blocks. This allows
the conversion to vectorize on PVP systems and to execute in a small
(icache) loop on MPP systems.
Use f90 for Loopmark Listings of T3D Codes
If you use the "-rm" flag, CRI's f90 compiler will create a listing file with loops marked and optimizations explained. It can provide this for either T3D or Y-MP compilations. This is a big improvement over cf77, which only does "loopmark listing" of Y-MP compiles.
It's nice to know how a compiler alters your code when it optimizes it. Some optimizations reduce precision. Others, if you mislead the compiler (for instance, by telling the Y-MP compiler to ignore vector dependencies when it shouldn't), can lead to incorrect results.
In Newsletter #99, I gave a program which timed the following loop:
ccc
parameter (N=1000000)
a = K ! A constant
x = a
do i=1,N
x = x * a
enddo
ccc
When I compiled it for the T3D and Y-MP by setting the TARGET environment variable accordingly and then using the cf77 commands:
T3D: "cf77 prog.f -o t3d.exe"
Y-MP: "cf77 prog.f -o ymp.exe"
I was surprised by the timings:
T3D: 200,000 mflop/s
Y-MP: 150 mflop/s
It was easy to find out what the Y-MP compiler had done, as a recompile with the cf77 flag, -Wf"-em":
Y-MP: "cf77 -Wf"-em" prog.f -o ymp.exe"
produced a "loopmark listing" which showed that the loop had vectorized. Good enough.
I assumed that the T3D compiler had actually eliminated the loop, but as "loopmark listing" is not available under cf77 for T3D codes, I didn't know how to prove it. Eventually, I discovered that I could recompile with the -Wf"-cm" flag:
T3D: "cf77 -Wf"-cm" prog.f -o t3d.exe"
which produced a CIF (Compiler Information File). CIFs are not human readable, but in the CIF manual I found a C program which extracts "compiler messages" from CIFs. I copied, compiled, and ran this C program on my CIF to get the following information:
"message at line 22: A loop was eliminated by optimization."
This was moderately satisfying, at best. As far as I know, it's the most information on T3D optimizations you can get from the cf77 compiling system (if anyone knows a better way, let me know, and I'll pass it on).
The solution I found was to use f90.
In f90, you can compile with the same flags for either T3D or Y-MP to get various human readable listing files. For instance:
T3D: "f90 -rm loops.f -o loops"
Y-MP: "f90 -rm loops.f -o loops"
will produce a listing similar to cf77's Y-MP loopmark listing. To provide a T3D vs Y-MP example, I used these compile commands on the following code:
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
program loops
implicit none
integer N
parameter (N=1000)
integer i
real slamch
real xarr(N), yarr(n), zarr(n), a, x, eps
eps = slamch('E')
a = 1.0 - eps
x = a
do i=1,N
x = x * a
enddo
print*, "(1.0 - eps) ^ ", N, " = ", x
do i=1,N
xarr = i * eps
yarr = i + eps
zarr = eps
enddo
call dummy (xarr, yarr, zarr)
do i=1,N
yarr(i) = xarr(i) * i
enddo
call dummy (xarr, yarr, zarr)
end
ccc
subroutine dummy (x,y,z)
real x,y,z
end
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
Here's an excerpt from the resulting Y-MP listing file (automatically named "loops.l"):
12 a = 1.0 - eps
13 x = a
14 1 -------- do i=1,N
15 1 x = x * a
16 1 -------> enddo
17 print*, "(1.0 - eps) ^ ", N, " = ", x
18
19 1 -------- do i=1,N
20 VecArrOps xarr = i * eps
21 ArrayOps yarr = i + eps
22 ArrayOps zarr = eps
23 1 -------> enddo
24
25 call dummy (xarr, yarr, zarr)
26
27 v -------- do i=1,N
28 v yarr(i) = xarr(i) * i
29 v -------> enddo
30
f90 Compiler - 6 messages:
1) <f90-6002,Scalar> A loop starting at line 14 was eliminated
by optimization.
2) <f90-6204,Vector> A loop starting at line 20 was vectorized.
3) <f90-6009,Scalar> A floating point expression involving an
induction variable was strength reduced
by optimization. This may cause numerical
differences.
4) <f90-6004,Scalar> A loop starting at line 21 was fused with
the loop starting at line 20.
5) <f90-6004,Scalar> A loop starting at line 22 was fused with
the loop starting at line 20.
6) <f90-6204,Vector> A loop starting at line 27 was vectorized.
Running the "explain" command on any of these messages provides even more help. Note that explain expects the message ID to start with "cf90", not "f90" as printed in the listing. For instance:
denali$ explain cf90-6204
Vector code was generated for the loop. The compiler vectorizes a loop when it can be determined that the meaning of the loop will not change by doing so. However, the order of expression evaluation may change, and results may differ. Generally, the vector version of a loop executes much faster than the scalar version.
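Since the listing prints message IDs as "f90-NNNN" while explain wants "cf90-NNNN", a small filter can do the renaming. The following is only a sketch: the function name explain_ids is made up for illustration (it is not a CRI tool), and it assumes the message lines look like the ones in the listing excerpt above.

```shell
#!/bin/sh
# explain_ids: hypothetical helper, not part of the CRI tool set.
# Reads a loopmark listing on stdin and prints each distinct message ID
# once, rewritten into the "cf90-NNNN" form that explain accepts.
explain_ids() {
    sed -n 's/.*<f90-\([0-9][0-9]*\),.*/cf90-\1/p' | sort -u
}

# Example use on the Cray (assumes loops.l exists in the current directory):
#   for id in $(explain_ids < loops.l); do explain "$id"; done
```

Continuation lines of a message (such as "by optimization.") carry no ID and are simply skipped by the filter.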
Here's the excerpt from the T3D listing:
12 a = 1.0 - eps
13 x = a
14 1 -------- do i=1,N
15 1 x = x * a
16 1 --------> enddo
17 print*, "(1.0 - eps) ^ ", N, " = ", x
18
19 1 -------- do i=1,N
20 ArrayOps xarr = i * eps
21 ArrayOps yarr = i + eps
22 ArrayOps zarr = eps
23 1 -------> enddo
24
25 call dummy (xarr, yarr, zarr)
26
27 1 -------- do i=1,N
28 1 yarr(i) = xarr(i) * i
29 1 -------> enddo
30
f90 Compiler - 4 messages:
1) <f90-6002,Scalar> A loop starting at line 14 was eliminated by
optimization.
2) <f90-6009,Scalar> A floating point expression involving an
induction variable was strength reduced
by optimization. This may cause numerical
differences.
3) <f90-6004,Scalar> A loop starting at line 21 was fused with
the loop starting at line 20.
4) <f90-6004,Scalar> A loop starting at line 22 was fused with
the loop starting at line 20.
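Message f90-6009 warns of numerical differences because strength reduction generally replaces a multiplication by the loop index (such as i * eps) with a running sum that adds eps on every trip, and the two forms round differently. The sketch below illustrates the idea with awk's double-precision arithmetic, not with the Cray compilers, so the actual differences on the T3D or Y-MP may look different. The function name, the step 0.1, and the trip count 10 are all illustration values.

```shell
#!/bin/sh
# count_mismatches STEP N: hypothetical helper, for illustration only.
# Compares i*STEP (the form written in the source) with a running sum of
# STEP (one plausible strength-reduced form) for i = 1..N, and prints how
# many of the N values differ. awk computes in double precision.
count_mismatches() {
    awk -v step="$1" -v n="$2" 'BEGIN {
        sum = 0; bad = 0
        for (i = 1; i <= n; i++) {
            sum += step               # strength-reduced form
            if (sum != i * step)      # form written in the source
                bad++
        }
        print bad
    }'
}

count_mismatches 0.1 10
```

A step that is exactly representable in binary, such as 0.5, shows no mismatches over the same range; 0.1 is not exactly representable, so rounding errors accumulate in the running sum.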
Getting the same style of listing and messages for both targets is really nice, and a good reason to use f90 instead of cf77.
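Because both compiles use the same source file, each one writes its listing to the same automatically named file, so comparing the two targets means saving one listing before producing the other. Here is a minimal sketch of one way to do that. The function name build_listing is hypothetical, the TARGET values are left as placeholders for whatever settings you normally use, and it assumes the T3D listing gets the same automatic name (loops.l for loops.f) as the Y-MP one.

```shell
#!/bin/sh
# build_listing TARGET_VALUE TAG SOURCE: hypothetical wrapper.
# Compiles SOURCE with "f90 -rm" for one target, then renames the
# listing so that a later build for the other target does not
# overwrite it. The executable is named after the tag as well.
build_listing() {
    target=$1; tag=$2; src=$3
    base=$(basename "$src" .f)
    TARGET=$target f90 -rm "$src" -o "$base.$tag" || return 1
    mv "$base.l" "$base.$tag.l"
}

# Example use; t3d_value and ymp_value are placeholders, not real
# TARGET settings:
#   build_listing t3d_value t3d loops.f
#   build_listing ymp_value ymp loops.f
#   diff loops.t3d.l loops.ymp.l
```

Running diff on the two saved listings then shows at a glance which loops were treated differently on each machine.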
The 'mppfixpe' Command and Plastic Executables
The mppfixpe command converts a plastic executable (one whose number of PEs is chosen at run time) into a fixed one. This may not seem like the most useful command (why would one want to sacrifice flexibility?), but there are good reasons to use fixed executables. For instance, visitors working on-site last week got a 2:1 speedup in the load time of a program when they switched from plastic to fixed. This was a boon because they wanted to do many short test runs, and the load time had become a major fraction of the total time spent on each run.
They used 'mppldr -X $(NPES) ...' to re-link the program with a fixed number of PEs. Had they no longer had access to the source or object files, however, mppldr would not have worked, and they could have used mppfixpe instead.
For a thorough discussion of plastic and fixed executables, see Newsletter #44. Here, however, is a quick comparison:
Advantages of fixed executables:
- mppldr not called on each run
- smaller file size
Advantages of plastic executables:
- number of PEs is flexible -- determined at runtime
- can usually be converted to fixed, whenever desired, using mppfixpe
This is from CRI's man page:
NAME
mppfixpe - Reconfigures a CRAY T3D absolute for a different
number of PEs
SYNOPSIS
mppfixpe -o newname -X npes [-M opts] [-V] oldname
DESCRIPTION
The mppfixpe utility reads an existing CRAY T3D absolute (plastic
a.out file) and, if possible, changes it so that it will execute using
a different number of processing elements (PEs).
A plastic a.out file refers to an a.out file on a CRAY T3D system that
has been created without using either compiler or loader directives to
specify (or fix) the number of processing elements. This lets you
specify the number of PEs at execution time. For example:
/mpp/bin/cft77 t.f
/mpp/bin/mppldr t.o
a.out npes 128
If you fix the number of PEs on either the cf77 or the mppldr command
line, the resulting a.out file no longer is considered to be plastic,
and you cannot specify the number of PEs to use at run time.
A plastic a.out file is assumed to have been targeted for 0 PEs.
The mppfixpe utility accepts the following options:
-o newname Specifies the path name where the new absolute is to be
stored.
-X npes Specifies the number of PEs for which the new absolute
is to be configured.
-M opts Requests that the loader produce a map of the new
absolute. The opts values are those known to mppldr(1).
-V Causes the mppfixpe utility to write its version
identification to stderr.
oldname Specifies the path name of the existing CRAY T3D
absolute.
NOTES
The mppldr and mppfixpe utilities assume that fairly ordinary things
are being done. However, if you are changing the loader's CALLXFER
directive, things may not work the way you want.
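As a usage illustration of the synopsis above, here is a minimal sketch that makes one fixed copy of a plastic executable per PE count. The function name fix_for_pes is hypothetical, the file name and PE counts are illustration values, and only the -o and -X options from the man page are used.

```shell
#!/bin/sh
# fix_for_pes PLASTIC N...: hypothetical wrapper around mppfixpe.
# For each PE count N, writes a fixed copy of the plastic executable
# PLASTIC to PLASTIC.N. Stops at the first failure, since the man page
# says the conversion is done only "if possible".
fix_for_pes() {
    plastic=$1; shift
    for n in "$@"; do
        mppfixpe -o "$plastic.$n" -X "$n" "$plastic" || return 1
    done
}

# Example use (a.out and the PE counts are illustration values):
#   fix_for_pes a.out 8 16 32
```

The original plastic file is left untouched, so the flexible version is still on hand when a different number of PEs is needed.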
Quick-Tip Q & A
- Q: How can you delete a file named "-i" ???
- (You would create it if, for instance, you accidentally typed "cp txt -i" instead of "cp -i txt txt2".)
- A: ???
- (Sorry... not till next week...)
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
