ARSC T3D Users' Newsletter 101, August 23, 1996
ARSC Upgrades to UNICOS MAX 184.108.40.206
Last Tuesday, we upgraded the T3D's MAX operating system from version 220.127.116.11 to version 18.104.22.168. This upgrade should be transparent, but you should recompile and relink all of your T3D executables, as it affects include and library files as well as kernel routines and the user environment.
Here are the release contents for both MAX 22.214.171.124 and 126.96.36.199. In each case, point #1 is the most important.
Release contents
----------------
The UNICOS MAX 188.8.131.52 release includes the following changes:

  1) Added a series of fixes designed to enhance system stability
  2) Added support for preallocation of the roll file
  3) Added binary executables for SAM, mppview, and URM
  4) Added support for Phase III I/O

Release contents
----------------
The UNICOS MAX 184.108.40.206 release includes the following changes:

  1) Added a series of fixes designed to enhance system stability
  2) Added some improvements to the XDR routines, primarily to improve
     the performance by converting numbers in large blocks. This allows
     the conversion to vectorize on PVP systems and to execute in a
     small (icache) loop on MPP systems.
Use f90 for Loopmark Listings of T3D Codes
If you use the "-rm" flag, CRI's f90 compiler will create a listing file with loops marked and optimizations explained. It can provide this for either T3D or Y-MP compilations. This is a big improvement over cf77, which only does "loopmark listing" of Y-MP compiles.
It's nice to know how a compiler alters your code when it optimizes it. Some optimizations reduce precision. Others, if you mislead the compiler (for instance, telling the Y-MP compiler to ignore vector dependencies when it shouldn't) can lead to incorrect results.
In Newsletter #99, I gave a program which timed the following loop:
ccc
      parameter (N=1000000)
      a = K                 ! A constant
      x = a
      do i=1,N
        x = x * a
      enddo
ccc
When I compiled it for the T3D and Y-MP by setting the TARGET environment variable accordingly and then using the cf77 commands:
  T3D:  "cf77 prog.f -o t3d.exe"
  Y-MP: "cf77 prog.f -o ymp.exe"
I was surprised by the timings:
  T3D:  200,000 mflop/s
  Y-MP:     150 mflop/s
It was easy to find out what the Y-MP compiler had done, as a recompile with the cf77 flag, -Wf"-em":
Y-MP: "cf77 -Wf"-em" prog.f -o ymp.exe"
produced a "loopmark listing" which showed that the loop had vectorized. Good enough.
I assumed that the T3D compiler had actually eliminated the loop, but as "loopmark listing" is not available under cf77 for T3D codes, I didn't know how to prove it. Eventually, I discovered that I could recompile with the -Wf"-cm" flag:
T3D: "cf77 -Wf"-cm" prog.f -o t3d.exe"
which produced a CIF (Compiler Information File). CIFs contain data that is not human-readable, but in the CIF manual, I found a C program which extracts "compiler messages" from CIFs. I copied, compiled, and ran this C program on my CIF to get the following information:
"message at line 22: A loop was eliminated by optimization."
This was moderately satisfying, at best. As far as I know, it's the most information on T3D optimizations you can get from the cf77 compiling system (if anyone knows a better way, let me know, and I'll pass it on).
The solution I found was to use f90.
In f90, you can compile with the same flags for either T3D or Y-MP to get various human readable listing files. For instance:
  T3D:  "f90 -rm loops.f -o loops"
  Y-MP: "f90 -rm loops.f -o loops"
will produce a listing similar to cf77's Y-MP loopmark listing. To provide a T3D vs Y-MP example, I used these compile commands on the following code:
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      program loops
      implicit none

      integer N
      parameter (N=1000)
      integer i
      real slamch
      real xarr(N), yarr(n), zarr(n), a, x, eps

      eps = slamch('E')

      a = 1.0 - eps
      x = a
      do i=1,N
        x = x * a
      enddo
      print*, "(1.0 - eps) ^ ", N, " = ", x

      do i=1,N
        xarr = i * eps
        yarr = i + eps
        zarr = eps
      enddo

      call dummy (xarr, yarr, zarr)

      do i=1,N
        yarr(i) = xarr(i) * i
      enddo

      call dummy (xarr, yarr, zarr)
      end
ccc
      subroutine dummy (x,y,z)
      real x,y,z
      end
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
Here's an excerpt from the resulting Y-MP listing file (automatically named "loops.l"):
   12                  a = 1.0 - eps
   13                  x = a
   14  1 --------      do i=1,N
   15  1                 x = x * a
   16  1 ------->      enddo
   17                  print*, "(1.0 - eps) ^ ", N, " = ", x
   18
   19  1 --------      do i=1,N
   20  VecArrOps         xarr = i * eps
   21  ArrayOps          yarr = i + eps
   22  ArrayOps          zarr = eps
   23  1 ------->      enddo
   24
   25                  call dummy (xarr, yarr, zarr)
   26
   27  v --------      do i=1,N
   28  v                 yarr(i) = xarr(i) * i
   29  v ------->      enddo
   30

  f90 Compiler - 6 messages:

  1) <f90-6002,Scalar> A loop starting at line 14 was eliminated by
     optimization.
  2) <f90-6204,Vector> A loop starting at line 20 was vectorized.
  3) <f90-6009,Scalar> A floating point expression involving an
     induction variable was strength reduced by optimization. This
     may cause numerical differences.
  4) <f90-6004,Scalar> A loop starting at line 21 was fused with the
     loop starting at line 20.
  5) <f90-6004,Scalar> A loop starting at line 22 was fused with the
     loop starting at line 20.
  6) <f90-6204,Vector> A loop starting at line 27 was vectorized.
Running the "explain" command on any of these messages provides even more help (but the error codes should start with "cf90", not "f90"). For instance:
  denali$ explain cf90-6204

  Vector code was generated for the loop.

  The compiler vectorizes a loop when it can be determined that the
  meaning of the loop will not change by doing so. However, the order
  of expression evaluation may change, and results may differ.
  Generally, the vector version of a loop executes much faster than
  the scalar version.
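Message 3 in the Y-MP listing above (f90-6009) warns that strength reduction "may cause numerical differences." Here is a minimal sketch of why, written as a small shell function around awk rather than in Fortran so it can be run on any workstation. It compares the multiplied form (i * step) with the strength-reduced form (a running sum that adds step on each trip). The function name and the values n=10 and step=0.1 are illustrative assumptions, and awk works in double precision, so this shows the effect in general, not the exact values a T3D or Y-MP would produce.

```shell
#!/bin/sh
# Sketch: multiplication vs. repeated addition (strength reduction).
# strength_reduction_demo, n=10 and step=0.1 are illustrative choices,
# not taken from the newsletter's program.

strength_reduction_demo() {
    # $1 = number of loop trips, $2 = value multiplied by the loop index
    awk -v n="$1" -v step="$2" 'BEGIN {
        sum = 0
        mismatches = 0
        for (i = 1; i <= n; i++) {
            sum += step              # strength-reduced form: one add per trip
            if (sum != i * step)     # original form: one multiply per trip
                mismatches++
        }
        printf "%d of %d iterations differ\n", mismatches, n
    }'
}

strength_reduction_demo 10 0.1
```

The two forms are algebraically identical, but the running sum accumulates a rounding error on every trip while the multiply rounds only once, so some iterations disagree in the last bits.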
Here's the excerpt from the T3D listing:
   12                  a = 1.0 - eps
   13                  x = a
   14  1 --------      do i=1,N
   15  1                 x = x * a
   16  1 -------->     enddo
   17                  print*, "(1.0 - eps) ^ ", N, " = ", x
   18
   19  1 --------      do i=1,N
   20  ArrayOps          xarr = i * eps
   21  ArrayOps          yarr = i + eps
   22  ArrayOps          zarr = eps
   23  1 ------->      enddo
   24
   25                  call dummy (xarr, yarr, zarr)
   26
   27  1 --------      do i=1,N
   28  1                 yarr(i) = xarr(i) * i
   29  1 ------->      enddo
   30

  f90 Compiler - 4 messages:

  1) <f90-6002,Scalar> A loop starting at line 14 was eliminated by
     optimization.
  2) <f90-6009,Scalar> A floating point expression involving an
     induction variable was strength reduced by optimization. This
     may cause numerical differences.
  3) <f90-6004,Scalar> A loop starting at line 21 was fused with the
     loop starting at line 20.
  4) <f90-6004,Scalar> A loop starting at line 22 was fused with the
     loop starting at line 20.
This consistent behavior across platforms is really nice, and it is a good reason to use f90 instead of cf77.
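One practical wrinkle: the two f90 commands shown earlier both write an executable named "loops" and, if the listing is always named after the source file, the same "loops.l". Here is a minimal sketch of a script that builds for each target in turn and renames the listing so both survive. The TARGET values "cray-t3d" and "cray-ymp", the output file names, and the DRY_RUN switch are assumptions for illustration; check the values used at your site. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: compile loops.f for both targets and keep one listing per target.
# ASSUMPTIONS: the TARGET values and output names below are illustrative;
# the listing is assumed to be written as loops.l, as in the example above.

DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_listing() {
    # $1 = value for TARGET, $2 = short tag used in the output names
    run env TARGET="$1" f90 -rm loops.f -o "loops.$2" &&
    run mv loops.l "loops.$2.l"
}

build_listing cray-t3d t3d
build_listing cray-ymp ymp
```

Run it with DRY_RUN=0 on the Cray host to execute the commands; afterwards the two listings can be compared directly, for instance with diff.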
The 'mppfixpe' Command and Plastic Executables
The mppfixpe command, which converts a plastic executable into one fixed at a given number of PEs, may not seem like the most useful command (why would one want to sacrifice flexibility?), but there are good reasons to use fixed executables. For instance, we had visitors working on-site last week who got a 2:1 speedup in the load time of a program when they switched from plastic to fixed. This was a boon because they wanted to do multiple, short test runs, and the load time had become a major percentage of the total time spent on each run.
They used 'mppldr -X $(NPES) ...' to re-link the program with a fixed number of PEs. However, if they had no longer had access to the source or object files, mppldr would not have worked, and they could have used mppfixpe instead.
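To see why halving the load time matters for short test runs, here is a small arithmetic sketch. The timings (20 seconds of load and 10 seconds of compute for the plastic executable) are hypothetical, chosen only to match the 2:1 load-time ratio mentioned above; they are not the visitors' actual measurements.

```shell
#!/bin/sh
# Sketch: share of each run spent loading, before and after a 2:1
# load-time speedup. All timings are hypothetical.

load_share() {
    # $1 = load seconds, $2 = compute seconds
    # prints load time as a whole percentage of the run (integer division)
    echo $(( 100 * $1 / ($1 + $2) ))
}

echo "plastic: $(load_share 20 10)% of each run is load time"
echo "fixed:   $(load_share 10 10)% of each run is load time"
```

The shorter the compute phase, the larger the share taken by loading, which is why the gain shows up most in quick test runs.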
For a thorough discussion of plastic and fixed executables, see Newsletter #44. Here, however, is a quick comparison:
Advantages of fixed executables:
- mppldr not called on each run
- smaller file size
Advantages of plastic executables:
- number of PEs is flexible -- determined at runtime
- can usually be converted to fixed, whenever desired, using mppfixpe
This is from CRI's man page:
NAME
     mppfixpe - Reconfigures a CRAY T3D absolute for a different
     number of PEs

SYNOPSIS
     mppfixpe -o newname -X npes [-M opts] [-V] oldname

DESCRIPTION
     The mppfixpe utility reads an existing CRAY T3D absolute (plastic
     a.out file) and, if possible, changes it so that it will execute
     using a different number of processing elements (PEs).

     A plastic a.out file refers to an a.out file on a CRAY T3D system
     that has been created without using either compiler or loader
     directives to specify (or fix) the number of processing elements.
     This lets you specify the number of PEs at execution time. For
     example:

          /mpp/bin/cft77 t.f
          /mpp/bin/mppldr t.o
          a.out npes 128

     If you fix the number of PEs on either the cf77 or the mppldr
     command line, the resulting a.out file no longer is considered to
     be plastic, and you cannot specify the number of PEs to use at run
     time.

     A plastic a.out file is assumed to have been targeted for 0 PEs.

     The mppfixpe utility accepts the following options:

     -o newname  Specifies the path name where the new absolute is to
                 be stored.

     -X npes     Specifies the number of PEs for which the new absolute
                 is to be configured.

     -M opts     Requests that the loader produce a map of the new
                 absolute. The opts values are those known to
                 mppldr(1).

     -V          Causes the mppfixpe utility to write its version
                 identification to stderr.

     oldname     Specifies the path name of the existing CRAY T3D
                 absolute.

NOTES
     The mppldr and mppfixpe utilities assume that fairly ordinary
     things are being done. However, if you are changing the loader's
     CALLXFER directive, things may not work the way you want.
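Following the synopsis above, here is a minimal sketch of a script that derives several fixed executables from one plastic a.out. It uses only the -o and -X options from the man page. The output naming scheme (a.out.8pe and so on), the PE counts, and the DRY_RUN switch are assumptions for illustration. By default it only prints the commands it would run, so it can be tried away from the T3D.

```shell
#!/bin/sh
# Sketch: make fixed executables for several PE counts from one
# plastic a.out, using the "mppfixpe -o newname -X npes oldname" form.
# ASSUMPTIONS: the ".<n>pe" suffix and the PE counts are illustrative.

DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = execute them

fix_for_pes() {
    # $1 = plastic executable, $2 = number of PEs to fix it at
    plastic=$1
    npes=$2
    fixed="${plastic}.${npes}pe"    # hypothetical naming scheme
    if [ "$DRY_RUN" = 1 ]; then
        echo mppfixpe -o "$fixed" -X "$npes" "$plastic"
    else
        mppfixpe -o "$fixed" -X "$npes" "$plastic"
    fi
}

for n in 2 4 8; do
    fix_for_pes a.out "$n"
done
```

Keeping the plastic a.out around preserves the option to generate a fixed copy for a new PE count later, which a fixed executable alone would not allow.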
Quick-Tip Q & A
- Q: How can you delete a file named "-i" ???
- (You would create it if, for instance, you accidentally typed "cp txt -i" instead of "cp -i txt txt2".)
- A: ???
- (Sorry... not till next week...)
[ Answers, questions, and tips graciously accepted. ]
Ed Kornkven
ARSC HPC Specialist
ph: 907-450-8669

Kate Hedstrom
ARSC Oceanographic Specialist
ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.