ARSC T3D Users' Newsletter 43, July 7, 1995

Benchmarking Switches: -D "rdahead=on" and -X npes

Benchmarking is sometimes a funny activity because the goals are so open-ended. Basically, a benchmarker is supposed to squeeze the best performance out of a code given the current hardware and software available. Getting a 10X or 10% speedup isn't enough; the benchmarker isn't done until there's nothing left to squeeze. Usually the hardware configuration is fixed, so the concentration is on the benchmark code itself or the software environment. If the rules of the benchmark call for no modifications of the source code (like the 100 by 100 linpack benchmark), then the benchmarker has to work with software tools like the compilers, loaders and maybe the libraries.

Benchmarkers are famous for pushing software releases before they are ready. A benchmarker thinks: "As long as the answers are correct for this benchmark, the software release is ready." But not all optimizations produce the correct answers or the best speedups for every code, so software writers have protected the users (and themselves) by making possibly unsafe or restrictive optimizations available only through a switch. The underlying idea is that a user who chooses a switch setting different from the default accepts responsibility for wrong answers or limited use. It seems an honest arrangement that the tradeoff between speed and safety is made so explicit.

Now the question becomes: "Which switches should I use?" Well, of course the answer depends on what they do for the code in terms of:

  1. Answers still correct?
  2. Did the code speed up?
A further complication is that the effects of these switches are not independent: in combination they can have effects different from those they have when used separately. So a benchmarker's job is sometimes to try all possible switches in all possible combinations. The combination that produces the fastest code AND the correct answers is then the "right" combination for this benchmark. Looking at the table of the 100 by 100 linpack benchmark results shows how this "switchology" can get out of hand. Below is a description of two switches for the T3D.
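
The exhaustive search over switch combinations described above can be sketched in a few lines of Python. This is only an illustration, not a tool that exists on the T3D: the function names are made up for this example, and run_benchmark is a hypothetical callback that the user would supply to build and run the code with a given set of switches and report back (answers_correct, seconds).

```python
from itertools import product


def switch_combinations(switches):
    """Yield every on/off combination of the given switches as a tuple."""
    for mask in product([False, True], repeat=len(switches)):
        yield tuple(s for s, on in zip(switches, mask) if on)


def best_combination(switches, run_benchmark):
    """Return (combo, seconds) for the fastest combination with correct answers.

    run_benchmark is a hypothetical user-supplied callback that takes a tuple
    of switches and returns (answers_correct, seconds).
    Returns None if no combination gives correct answers.
    """
    best = None
    for combo in switch_combinations(switches):
        correct, seconds = run_benchmark(combo)
        # Speed only counts when the answers are still correct.
        if correct and (best is None or seconds < best[1]):
            best = (combo, seconds)
    return best
```

With n switches there are 2**n combinations to build and run, which is one way to see how "switchology" gets out of hand.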

Using the -D "rdahead=on" Switch on the mppldr

As described in Jeff Brooks' paper on single PE optimization (available on the ARSC anonymous ftp server), it is possible to put the T3D processor in "read-ahead mode". When the processor is in this mode, a cache line load will cause the next consecutive cache line to be read into a "read-ahead buffer" where it will be loaded into cache almost twice as fast as a cache line from memory. So in this mode, sequential access is greatly sped up.
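
To see why read-ahead helps sequential access and does little for strided access, here is a toy cost model in Python. It is a sketch under loud assumptions, not a description of the actual T3D hardware: the costs are arbitrary units (the 2-to-1 ratio just mirrors "almost twice as fast"), every cache line load is assumed to put the next consecutive line in the read-ahead buffer, and cache hits are ignored entirely.

```python
def load_cost(line_addresses, memory_cost=2.0, buffer_cost=1.0, read_ahead=True):
    """Total cost of a sequence of cache line loads in a toy read-ahead model.

    memory_cost and buffer_cost are hypothetical units, not T3D timings.
    """
    buffered = None  # line number currently held in the read-ahead buffer
    total = 0.0
    for line in line_addresses:
        if read_ahead and line == buffered:
            total += buffer_cost   # line was already fetched ahead of time
        else:
            total += memory_cost   # line has to come from memory
        if read_ahead:
            buffered = line + 1    # assumption: each load prefetches the next line
    return total


sequential = list(range(8))        # lines 0, 1, 2, ... 7
strided = list(range(0, 16, 2))    # lines 0, 2, 4, ... 14

print(load_cost(sequential, read_ahead=False), load_cost(sequential))
print(load_cost(strided, read_ahead=False), load_cost(strided))
```

In this model the sequential sweep pays the full memory cost only for its first line, while the strided sweep never finds its next line in the buffer and gains nothing.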

FFT routines are not known for their sequential access, but it is easy to re-run an existing program with only one change: a new switch on the loader. So I relinked Chris Yerkes' 2D FFT program as:


  /mpp/bin/mppldr -o a.out -D "rdahead=on" $(SUBS)
and got the results shown in Table 2 below:

Table 1


  Performance (MFLOPS) on Chris Yerkes' 2D FFT on ARSC's T3D
       (original results from newsletter #40, 6/16/95 )

  Side of
   square  <--------------------number of PEs------------------------->
    array
              1       2       4       8      16      32      64     128

      8     5.1     6.0     6.8     6.2     7.6     4.8     2.7     1.4
     16    11.0    16.1    21.1    23.8    20.6    23.1    13.7     7.3
     32    14.4    24.0    39.2    54.8    65.8    56.8    61.3    35.3
     64    16.2    29.0    55.6    92.4   139.1   169.1   145.8   152.3
    128    19.5    37.6    71.9   128.2   228.3   334.2   412.3   352.3
    256    22.8    44.4    86.7   166.8   311.6   519.2   786.0   957.5
    512    21.2    41.8    82.7   162.5   316.0   599.2  1069.2  1620.8
   1024    18.6    36.9    73.5   145.8   288.6   565.1  1087.3  1986.9
   2048  memory    34.8    69.4   138.6   276.0   547.0  1074.8  2062.7
   4096  memory  memory  memory   134.8   269.6   536.8  1066.5  2096.1
   8192  memory  memory  memory  memory  memory   464.8   927.3  1842.3
  16384  memory  memory  memory  memory  memory  memory  memory  1981.8
  32768  memory  memory  memory  memory  memory  memory  memory  memory

  
Table 2
 

  Performance (MFLOPS) on Chris Yerkes' 2D FFT on ARSC's T3D
           (with /mpp/bin/mppldr -D "rdahead=on" )

  Side of
   square  <--------------------number of PEs------------------------->
    array
              1       2       4       8      16      32      64     128

      8     5.2     3.1     6.9     6.2     7.7     4.8     2.6     1.4
     16    11.5    16.6    22.0    24.2    20.8    23.1    13.7     7.3
     32    14.9    25.1    39.0    57.9    67.3    57.5    58.6    35.3
     64    19.2    33.9    67.3   105.6   154.4   170.6   149.2   152.3
    128    21.9    41.9    79.2   140.0   252.3   358.5   424.4   359.7
    256    25.7    49.6    95.9   182.8   337.6   555.7   855.7  1000.1
    512    22.8    44.8    87.9   172.3   334.3   631.0  1120.8  1759.6
   1024    19.3    38.2    75.8   150.1   296.6   580.5  1116.8  2029.9
   2048  memory    35.7    71.3   142.1   282.4   559.2  1098.4  2114.3
   4096  memory  memory  memory   137.9   275.5   548.0  1087.7  2137.7
   8192  memory  memory  memory  memory  memory   472.4   941.8  1870.4
  16384  memory  memory  memory  memory  memory  memory  memory  2011.4
  32768  memory  memory  memory  memory  memory  memory  memory  memory
A comparison of the tables shows an improvement for nearly every entry, even for the FFT's access of memory; the few exceptions are among the smallest arrays (sides 8 and 32). This improvement is something that a benchmarker could get excited about.
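
The size of the improvement is easier to judge as a percentage. The short Python sketch below computes the gain of Table 2 over Table 1 for three of the rows (sides 64, 256 and 1024); the numbers are copied from the tables above, while the names PES, DEFAULT, RDAHEAD, percent_gain and gains are just labels chosen for this example.

```python
PES = [1, 2, 4, 8, 16, 32, 64, 128]

# MFLOPS from Table 1 (default link), keyed by side of the square array.
DEFAULT = {
    64:   [16.2, 29.0, 55.6, 92.4, 139.1, 169.1, 145.8, 152.3],
    256:  [22.8, 44.4, 86.7, 166.8, 311.6, 519.2, 786.0, 957.5],
    1024: [18.6, 36.9, 73.5, 145.8, 288.6, 565.1, 1087.3, 1986.9],
}

# MFLOPS from Table 2 (linked with -D "rdahead=on").
RDAHEAD = {
    64:   [19.2, 33.9, 67.3, 105.6, 154.4, 170.6, 149.2, 152.3],
    256:  [25.7, 49.6, 95.9, 182.8, 337.6, 555.7, 855.7, 1000.1],
    1024: [19.3, 38.2, 75.8, 150.1, 296.6, 580.5, 1116.8, 2029.9],
}


def percent_gain(before, after):
    """Percentage change in MFLOPS going from 'before' to 'after'."""
    return 100.0 * (after - before) / before


def gains(side):
    """Percentage gain for each PE count for one array size."""
    return [percent_gain(b, a) for b, a in zip(DEFAULT[side], RDAHEAD[side])]


for side in sorted(DEFAULT):
    row = "  ".join(f"{g:5.1f}%" for g in gains(side))
    print(f"{side:5d}  {row}")
```

The remaining rows of the tables can be added to the two dictionaries in the same way.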

Plastic vs. Compiled for a Fixed Number of PEs

The "-X npes" switch is described in both the Craft Fortran compiler (cf77) and the Fortran 90 (f90) compiler man pages:

  man 1m cf77   
and

  man 1m f90
Without this switch, the compilers produce code that can determine the number of processors at run time and execute correctly. An executable compiled this way is called "plastic" as it is flexible enough to fit into the number of PEs in the runtime environment. But if the code is compiled with the "-X npes" switch, then the compiler will assume that the target of the executable will be only a T3D configuration of npes PEs. With the number of PEs fixed, the compilers can produce more efficient code, in both speed and size.

To investigate this switch, I compiled and ran one of my lab exercises from the ARSC T3D class. This simple program:

  1. Initializes a 128 by 128 array with a spike of influence
  2. Applies a relaxation operator until the residual from the previous time step is small
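
For readers who have not seen the class exercise, the two steps above might look roughly like the serial Python sketch below. It is not the lab code itself: the four-point averaging (Jacobi) operator, the spike in the center of the array, the zero boundary, the tolerance and the name relax are all assumptions made for illustration, and nothing here is distributed across PEs.

```python
def relax(n=128, tol=1e-4, max_steps=10000):
    """Relax an n-by-n grid holding a single spike until the residual is small.

    Returns (grid, steps, residual). All parameters are illustrative choices.
    """
    # Initialization: a zero grid with one spike of influence in the middle.
    grid = [[0.0] * n for _ in range(n)]
    grid[n // 2][n // 2] = 1.0

    steps = 0
    residual = float("inf")
    while residual > tol and steps < max_steps:
        # Relaxation: each interior point becomes the average of its
        # four neighbours from the previous step; the boundary stays fixed.
        new = [row[:] for row in grid]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                    + grid[i][j - 1] + grid[i][j + 1])
        # Residual calculation: largest change from the previous step.
        residual = max(abs(new[i][j] - grid[i][j])
                       for i in range(n) for j in range(n))
        grid = new
        steps += 1
    return grid, steps, residual


# A small grid keeps the pure-Python run short; n=128 is much slower.
grid, steps, residual = relax(n=16)
print(steps, residual)
```

The three commented sections correspond to the three phases timed in the table: initialization, relaxation and residual calculation.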
The two numbers at each entry in the table below show the effect of the compilation switch on the execution time. The first number is the time of the plastic executable; the second is the time when compiled for a fixed number of PEs. Three different phases of the program are timed.

  Timings (seconds) for three phases of a relaxation program; effect
  of compiling a "plastic" executable vs. for fixed number of PEs

                     <-------------phases------------->

  Compilation  # of  initial-   relaxation   residual
  options       PEs  ization                calculation
   
  plastic       1    0.010992    3.601141    1.021725
  with -X1      1    0.010328    3.568962    1.027948
  
  plastic       2    0.011143    2.314543    0.662548
  with -X2      2    0.011134    2.323314    0.662165
 
  plastic       4    0.011221    1.162306    0.255969
  with -X4      4    0.011217    1.156340    0.255204
   
  plastic       8    0.011261    0.581741    0.131107
  with -X8      8    0.011259    0.578826    0.130379
   
  plastic      16    0.011280    0.291116    0.065957
  with -X16    16    0.011275    0.289595    0.065732
   
  plastic      32    0.011290    0.145783    0.023971
  with -X32    32    0.011288    0.145234    0.023458
   
  plastic      64    0.011293    0.074228    0.014759
  with -X64    64    0.011291    0.073646    0.014307
   
  plastic     128    0.011290    0.040066    0.011630
  with -X128  128    0.011002    0.037600    0.010887
With this switch the timings almost always move in the right direction (the exceptions are the residual calculation on 1 PE and the relaxation on 2 PEs), but the effect is minimal.
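
To put a number on "minimal", the Python sketch below computes the percentage of time saved by the fixed-PE executable for every entry in the table. The timings are copied from the table above; the names PHASES, TIMINGS, percent_saved and summarize are just labels chosen for this example.

```python
PHASES = ("initialization", "relaxation", "residual")

# npes: (plastic timings, fixed -X timings), seconds, one value per phase.
TIMINGS = {
    1:   ((0.010992, 3.601141, 1.021725), (0.010328, 3.568962, 1.027948)),
    2:   ((0.011143, 2.314543, 0.662548), (0.011134, 2.323314, 0.662165)),
    4:   ((0.011221, 1.162306, 0.255969), (0.011217, 1.156340, 0.255204)),
    8:   ((0.011261, 0.581741, 0.131107), (0.011259, 0.578826, 0.130379)),
    16:  ((0.011280, 0.291116, 0.065957), (0.011275, 0.289595, 0.065732)),
    32:  ((0.011290, 0.145783, 0.023971), (0.011288, 0.145234, 0.023458)),
    64:  ((0.011293, 0.074228, 0.014759), (0.011291, 0.073646, 0.014307)),
    128: ((0.011290, 0.040066, 0.011630), (0.011002, 0.037600, 0.010887)),
}


def percent_saved(plastic, fixed):
    """Percentage of the plastic time saved by the fixed-PE executable."""
    return 100.0 * (plastic - fixed) / plastic


def summarize(timings):
    """Map (npes, phase) to the percentage of time saved by -X npes."""
    return {
        (npes, phase): percent_saved(p, f)
        for npes, (plastic, fixed) in timings.items()
        for phase, p, f in zip(PHASES, plastic, fixed)
    }


savings = summarize(TIMINGS)
faster = sum(1 for v in savings.values() if v > 0)
print(f"{faster} of {len(savings)} entries are faster with -X npes")
print(f"largest saving: {max(savings.values()):.1f}%")
```

A negative value in the result marks an entry where the plastic executable was the faster of the two.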

If any users have similar experiences in "switchology", I'd be happy to pass them on through this newsletter.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
  12. RANF() and its manpage differ between machines (Newsletter #37)
  13. CRAY2IEG is available only on the Y-MP (Newsletter #40)
  14. Missing sort routines on the T3D (Newsletter #41)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.