ARSC T3D Users' Newsletter 43, July 7, 1995
Benchmarking Switches: -D "rdahead=on" and -X npes
Benchmarking is sometimes a funny activity because the goals are so open-ended. Basically, a benchmarker is supposed to squeeze the best performance out of a code given the current hardware and software available. Getting a 10X or 10% speedup isn't enough; the benchmarker isn't done until there's nothing left to squeeze. Usually the hardware configuration is fixed, so the concentration is on the benchmark code itself or the software environment. If the rules of the benchmark call for no modifications of the source code (like the 100 by 100 Linpack benchmark), then the benchmarker has to work with software tools like the compilers, loaders and maybe the libraries.
Benchmarkers are famous for pushing software releases before they are ready. A benchmarker thinks: "As long as the answers are correct for this benchmark, the software release is ready." But not all optimizations produce correct answers or the best speedups for every code, so software writers have protected the users (and themselves) by making possibly unsafe or restrictive optimizations available only with a switch. The underlying idea is that a user who chooses a switch other than the default accepts responsibility for wrong answers or limited use. It seems an honest arrangement, since the tradeoff between speed and safety is made so explicit.
Now the question becomes: "Which switches should I use?" Well, of course the answer depends on what they do for the code in terms of:
- Answers still correct?
- Did the code speed up?
Using the -D "rdahead=on" Switch on the mppldr
As described in Jeff Brooks' paper on single PE optimization (available on the ARSC anonymous ftp server), it is possible to put the T3D processor in "read-ahead mode". When the processor is in this mode, a cache line load will cause the next consecutive cache line to be read into a "read-ahead buffer", from which it can be loaded into cache almost twice as fast as a cache line from memory. So in this mode, sequential access is greatly sped up. FFT routines are not known for their sequential access, but it is easy to re-run an existing program with only one change: a new switch on the loader. So I relinked Chris Yerkes' 2D FFT program as:
/mpp/bin/mppldr -o a.out -D "rdahead=on" $(SUBS)
and got the results in the second table below:
Table 1
Performance (MFLOPS) on Chris Yerkes' 2D FFT on ARSC's T3D
(original results from newsletter #40, 6/16/95)
Side of
square<------------------------number of PEs------------------------->
array        1       2       4       8      16      32      64     128
8          5.1     6.0     6.8     6.2     7.6     4.8     2.7     1.4
16        11.0    16.1    21.1    23.8    20.6    23.1    13.7     7.3
32        14.4    24.0    39.2    54.8    65.8    56.8    61.3    35.3
64        16.2    29.0    55.6    92.4   139.1   169.1   145.8   152.3
128       19.5    37.6    71.9   128.2   228.3   334.2   412.3   352.3
256       22.8    44.4    86.7   166.8   311.6   519.2   786.0   957.5
512       21.2    41.8    82.7   162.5   316.0   599.2  1069.2  1620.8
1024      18.6    36.9    73.5   145.8   288.6   565.1  1087.3  1986.9
2048    memory    34.8    69.4   138.6   276.0   547.0  1074.8  2062.7
4096    memory  memory  memory   134.8   269.6   536.8  1066.5  2096.1
8192    memory  memory  memory  memory  memory   464.8   927.3  1842.3
16384   memory  memory  memory  memory  memory  memory  memory  1981.8
32768   memory  memory  memory  memory  memory  memory  memory  memory
Table 2
Performance (MFLOPS) on Chris Yerkes' 2D FFT on ARSC's T3D
(with /mpp/bin/mppldr -D "rdahead=on")
Side of
square<------------------------number of PEs------------------------->
array        1       2       4       8      16      32      64     128
8          5.2     3.1     6.9     6.2     7.7     4.8     2.6     1.4
16        11.5    16.6    22.0    24.2    20.8    23.1    13.7     7.3
32        14.9    25.1    39.0    57.9    67.3    57.5    58.6    35.3
64        19.2    33.9    67.3   105.6   154.4   170.6   149.2   152.3
128       21.9    41.9    79.2   140.0   252.3   358.5   424.4   359.7
256       25.7    49.6    95.9   182.8   337.6   555.7   855.7  1000.1
512       22.8    44.8    87.9   172.3   334.3   631.0  1120.8  1759.6
1024      19.3    38.2    75.8   150.1   296.6   580.5  1116.8  2029.9
2048    memory    35.7    71.3   142.1   282.4   559.2  1098.4  2114.3
4096    memory  memory  memory   137.9   275.5   548.0  1087.7  2137.7
8192    memory  memory  memory  memory  memory   472.4   941.8  1870.4
16384   memory  memory  memory  memory  memory  memory  memory  2011.4
32768   memory  memory  memory  memory  memory  memory  memory  memory
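A comparison of the tables shows an improvement in most cases, even for the FFT's memory access pattern, with gains of up to about 20% (the 64 by 64 array on 4 PEs). Some of the smaller cases are unchanged or slower. This improvement is something that a benchmarker could get excited about.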
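To put numbers on the comparison, the short Python sketch below computes the percentage change between the two tables for two rows (array sides 64 and 1024). The MFLOPS values are copied from Tables 1 and 2; the names TABLE1, TABLE2, percent_gain and gains_for are made up for this illustration, and the choice of rows is arbitrary.

```python
# MFLOPS for 1, 2, 4, 8, 16, 32, 64 and 128 PEs, copied from Tables 1 and 2.
PES = (1, 2, 4, 8, 16, 32, 64, 128)

TABLE1 = {  # default link (Table 1)
    64: (16.2, 29.0, 55.6, 92.4, 139.1, 169.1, 145.8, 152.3),
    1024: (18.6, 36.9, 73.5, 145.8, 288.6, 565.1, 1087.3, 1986.9),
}
TABLE2 = {  # linked with -D "rdahead=on" (Table 2)
    64: (19.2, 33.9, 67.3, 105.6, 154.4, 170.6, 149.2, 152.3),
    1024: (19.3, 38.2, 75.8, 150.1, 296.6, 580.5, 1116.8, 2029.9),
}


def percent_gain(before, after):
    """Percentage change in MFLOPS going from 'before' to 'after'."""
    return 100.0 * (after - before) / before


def gains_for(side):
    """Per-PE-count percentage gains for one array size."""
    return [percent_gain(b, a) for b, a in zip(TABLE1[side], TABLE2[side])]


for side in sorted(TABLE1):
    print(f"side {side}:")
    for npes, gain in zip(PES, gains_for(side)):
        print(f"  {npes:4d} PEs  {gain:+6.1f}%")
```

For these two rows the gain ranges from nothing (side 64 on 128 PEs) to roughly 21% (side 64 on 4 PEs).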
Plastic vs. Compiled for a Fixed Number of PEs
The "-X npes" switch is described in the man pages for both the Craft Fortran compiler (cf77) and the Fortran 90 compiler (f90): man 1m cf77 and man 1m f90. Without this switch, the compilers produce code that determines the number of processors at run time and executes correctly. An executable compiled this way is called "plastic" because it is flexible enough to fit into the number of PEs in the runtime environment. But if the code is compiled with the "-X npes" switch, then the compiler assumes that the executable will run only on a T3D configuration of npes PEs. With the number of PEs fixed, the compilers can produce more efficient code, in both speed and size.
To investigate this switch, I compiled and ran one of my lab exercises from the ARSC T3D class. This simple program:
- Initializes a 128 by 128 array with a spike of influence
- Applies a relaxation operator until the residual from the previous time step is small
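The lab program itself is not reproduced in this newsletter. As a rough illustration of the two steps above, here is a minimal Python sketch that assumes a four-point averaging (Jacobi) operator, boundary cells held at zero, and a residual defined as the largest change in any cell between sweeps. The operator, residual definition and tolerance used in the class exercise are not given here, so all of these, along with the function names, are assumptions made for the example.

```python
def make_spike_grid(n=128, height=1.0):
    """Return an n-by-n grid of zeros with a single spike at the center."""
    grid = [[0.0] * n for _ in range(n)]
    grid[n // 2][n // 2] = height
    return grid


def relax_once(grid):
    """Apply one four-point averaging sweep; return (new_grid, residual).

    The residual is the largest absolute change in any interior cell
    (an assumed definition, not taken from the lab exercise).
    """
    n = len(grid)
    new = [row[:] for row in grid]  # boundary cells are copied unchanged
    residual = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
            residual = max(residual, abs(new[i][j] - grid[i][j]))
    return new, residual


def relax(n=128, tol=1.0e-4, max_steps=10000):
    """Sweep until the residual drops to tol or max_steps is reached."""
    grid = make_spike_grid(n)
    residual = float("inf")
    steps = 0
    while residual > tol and steps < max_steps:
        grid, residual = relax_once(grid)
        steps += 1
    return grid, steps, residual


if __name__ == "__main__":
    # A small grid keeps the pure-Python demo quick; the lab used 128 by 128.
    _, steps, residual = relax(n=32, tol=1.0e-3)
    print(f"stopped after {steps} sweeps, residual {residual:.2e}")
```

The three timed phases in the table below correspond loosely to make_spike_grid (initialization), the averaging loop (relaxation) and the residual update, which this sketch folds into the same loop for brevity.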
Timings (seconds) for three phases of a relaxation program: effect
of compiling a "plastic" executable vs. compiling for a fixed number of PEs
                     <------------phases------------>
Compilation   # of   initial-   relaxation   residual
options       PEs    ization                 calculation
plastic          1   0.010992   3.601141     1.021725
with -X1         1   0.010328   3.568962     1.027948
plastic          2   0.011143   2.314543     0.662548
with -X2         2   0.011134   2.323314     0.662165
plastic          4   0.011221   1.162306     0.255969
with -X4         4   0.011217   1.156340     0.255204
plastic          8   0.011261   0.581741     0.131107
with -X8         8   0.011259   0.578826     0.130379
plastic         16   0.011280   0.291116     0.065957
with -X16       16   0.011275   0.289595     0.065732
plastic         32   0.011290   0.145783     0.023971
with -X32       32   0.011288   0.145234     0.023458
plastic         64   0.011293   0.074228     0.014759
with -X64       64   0.011291   0.073646     0.014307
plastic        128   0.011290   0.040066     0.011630
with -X128     128   0.011002   0.037600     0.010887
With this switch the timings move in the right direction in 22 of the 24 cases, but the effect is minimal: usually under 1%, and about 6% at best.
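The size of the effect can be checked directly from the table. The Python sketch below uses the timings copied from the table above and reports, for each phase and PE count, the percentage of time saved by compiling with "-X npes". The names PLASTIC, FIXED, percent_saved and compare are invented for this illustration.

```python
# Timings (seconds) copied from the table above, keyed by number of PEs:
# (initialization, relaxation, residual calculation).
PLASTIC = {
    1: (0.010992, 3.601141, 1.021725),
    2: (0.011143, 2.314543, 0.662548),
    4: (0.011221, 1.162306, 0.255969),
    8: (0.011261, 0.581741, 0.131107),
    16: (0.011280, 0.291116, 0.065957),
    32: (0.011290, 0.145783, 0.023971),
    64: (0.011293, 0.074228, 0.014759),
    128: (0.011290, 0.040066, 0.011630),
}
FIXED = {  # compiled with -X npes
    1: (0.010328, 3.568962, 1.027948),
    2: (0.011134, 2.323314, 0.662165),
    4: (0.011217, 1.156340, 0.255204),
    8: (0.011259, 0.578826, 0.130379),
    16: (0.011275, 0.289595, 0.065732),
    32: (0.011288, 0.145234, 0.023458),
    64: (0.011291, 0.073646, 0.014307),
    128: (0.011002, 0.037600, 0.010887),
}
PHASES = ("initialization", "relaxation", "residual")


def percent_saved(plastic, fixed):
    """Percentage of the plastic time saved by the fixed-PE build."""
    return 100.0 * (plastic - fixed) / plastic


def compare(plastic=PLASTIC, fixed=FIXED):
    """Return a list of (npes, phase, percent saved) tuples."""
    rows = []
    for npes in sorted(plastic):
        for phase, p, f in zip(PHASES, plastic[npes], fixed[npes]):
            rows.append((npes, phase, percent_saved(p, f)))
    return rows


rows = compare()
for npes, phase, pct in rows:
    print(f"{npes:4d} PEs  {phase:<14s} {pct:+6.2f}%")
faster = sum(1 for _, _, pct in rows if pct > 0)
print(f"{faster} of {len(rows)} timings are faster with -X npes")
```

A negative percentage marks a case where the fixed-PE executable was slower than the plastic one. With single runs and differences this small, timer noise may account for some of the variation.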
If any users have similar experiences in "switchology", I'd be happy to pass them on through this newsletter.
List of Differences Between T3D and Y-MP
The current list of differences between the T3D and the Y-MP is:
- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (Newsletter #7)
- There is no GETENV on the T3D (Newsletter #8)
- Missing routine SMACH on T3D (Newsletter #9)
- Different Arithmetics (Newsletter #9)
- Different clock granularities for gettimeofday (Newsletter #11)
- Restrictions on record length for direct I/O files (Newsletter #19)
- Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
- Missing Linpack and Eispack routines in libsci (Newsletter #25)
- F90 manual for Y-MP, no manual for T3D (Newsletter #31)
- RANF() and its manpage differ between machines (Newsletter #37)
- CRAY2IEG is available only on the Y-MP (Newsletter #40)
- Missing sort routines on the T3D (Newsletter #41)
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
