ARSC T3D Users' Newsletter 48, August 18, 1995

A Comparison of the SRC ac Compiler and the CRI cc Compiler

When the AC compiler was announced at the Spring CUG meeting, there was a lot of interest because it was supposedly faster than the Standard C compiler provided by CRI. I believe CRI's contention was that the speed differences vary from program to program. To investigate this claim, I ran some standard benchmarks and some of my own to put some numbers into the argument. The two tables below show the effect of the standard optimization switches for each compiler. Each table has the same format. (When results are given in MFLOPS, Dhrystones, or Whetstones, bigger is better; when the results are in seconds, small is beautiful.)


Table 1


  Performance results for the CRI Standard C compiler (PE 1.2.2):
  
  compiler switch: NONE     -O    -O0    -O1    -O2    -O3    
  
  Livermore loops   
    long loops     9.98  10.03   1.84   8.59  10.01  11.41  MFLOPS
    medium loops   9.72   9.73   1.84   8.38   9.73  10.27  MFLOPS
    short loops    9.73   9.74   1.84   8.40   9.37  10.01  MFLOPS
  
  Linpack
    100
      single       14.6   14.6    2.7   14.6   16.4   16.3  MFLOPS
      double       12.6   12.6    2.6   12.4   13.8   13.8  MFLOPS
    1000
      single       15.3   15.3    2.8   15.3   15.8   15.8  MFLOPS
      double       10.8   10.8    2.6   10.8   11.0   11.0  MFLOPS
  
  Dhrystones
    with          37037  37037  28571  45454  47619  55556  Dhrystones
    without       37037  38461  28571  45454  45454  55556  Dhrystones
  
  Whetstones      36405  36404  36405  36405  36406  36397  Whetstones
  
  Puzzle           45.8   45.8  135.0   40.8   45.8   48.2  seconds
  
  Nest calls      112.8  112.8  175.9    1.3    0.0    0.0  seconds
      inline        0.2    0.2   72.5    1.7    0.2    0.2  seconds
  
 
  
Table 2


  Performance results for the IDA SRC Gnu C compiler (AC 2.6.2):
  
  compiler switch: NONE     -O    -O0    -O1    -O2    -O3    -O4    -O5
  
  Livermore loops
    long loops     2.37   8.26   2.42   8.24   9.52   9.51   9.51   9.51  MFLOPS
    medium loops   2.33   7.47   2.33   7.47   8.36   8.35   8.34   8.34  MFLOPS
    short loops    2.33   7.33   2.33   7.34   8.14   8.18   8.17   8.15  MFLOPS
  
  Linpack
    100
      single        3.7   15.2    3.7   15.2   17.0   17.0   17.0   17.0  MFLOPS
      double        3.6   13.0    3.7   13.0   14.3   14.3   14.3   14.3  MFLOPS
   1000
      single        3.9   15.4    3.9   15.4   16.8   16.8   16.7   16.7  MFLOPS
      double        3.5   11.0    3.5   11.0   11.6   11.6   11.6   11.6  MFLOPS
  
  Dhrystones
    with          45454  66667  45454  66667  90909  90909  90909  90909  Dhrystones
    without       43478  66667  43478  66667  90909 100000  90909  90909  Dhrystones
  
  Whetstones      24160  37896  24160  37896  38612  38613  38611  38611  Whetstones
  
  Puzzle          113.8   36.5  113.8   36.5   33.6   35.2   33.6   33.6  seconds
  
  Nest calls      223.5   75.1  223.5   75.1  120.4  120.4  120.3  120.4  seconds
      inline      125.2   18.1  125.2   18.1   17.9   17.9   17.9   17.9  seconds

Both compilers support other performance switches (the Gnu C compiler presents a "switch Heaven" for those so inclined), but I did not test them. The sources and timers I used were exactly the same for both compilers, although modifications to the codes could have dramatically changed the results. The source for each C benchmark is available from the netlib ftp site at netlib2.cs.utk.edu or from me. Below is a short description of each benchmark and its results:
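
As an aside, the kind of wall-clock timer that can be linked unchanged into each benchmark looks roughly like the sketch below. This is only an illustration (not the actual timing code behind the tables above, and the function name is mine); it relies only on gettimeofday(), which both compilers pick up from the same C library:

  #include <sys/time.h>

  /* Return elapsed wall-clock time in seconds; a minimal sketch of a
     timer that compiles identically under cc and ac. */
  double seconds(void)
  {
      struct timeval tv;

      gettimeofday(&tv, (struct timezone *) 0);
      return (double) tv.tv_sec + 1.0e-6 * (double) tv.tv_usec;
  }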

Livermore Loops - this is the standard Fortran benchmark converted to C. It times 24 loops, which consist mostly of floating point operations. What is shown above is the harmonic mean over all 24 loops when run with short (average length 18), medium (average 89), and long (average 468) loop lengths. On both compilers, there is a slight increase in the MFLOPS rate with increasing loop length.
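
As a reminder of what the summary number is, the harmonic mean of the 24 per-loop MFLOPS rates can be computed as in the sketch below (illustrative only, not the actual Livermore Loops source; the function name is mine). The harmonic mean weights the slow loops more heavily than an arithmetic mean would:

  /* Harmonic mean of n per-loop MFLOPS rates. */
  double harmonic_mean(double mflops[], int n)
  {
      double sum = 0.0;
      int i;

      for (i = 0; i < n; i++)
          sum += 1.0 / mflops[i];
      return (double) n / sum;
  }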

Linpack - this code is in C and is not comparable to the published Fortran results. This benchmark is really just a timing of a single saxpy loop, but that loop is still the most common loop in linear algebra. Both compilers did best on the version that unrolled that saxpy loop. Single precision runs roughly 20 to 45% faster than double precision on this benchmark.
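
The loop in question is a saxpy, y(i) = a*x(i) + y(i), and the "unrolled" variant writes it out several elements at a time. The sketch below is only an illustration of the two forms (it is not the netlib source, the function names are mine, and the cleanup loop for leftover iterations is omitted):

  /* Rolled saxpy. */
  void saxpy(int n, float a, float *x, float *y)
  {
      int i;

      for (i = 0; i < n; i++)
          y[i] += a * x[i];
  }

  /* The same loop unrolled by four. */
  void saxpy_unrolled(int n, float a, float *x, float *y)
  {
      int i;

      for (i = 0; i <= n - 4; i += 4) {
          y[i]     += a * x[i];
          y[i + 1] += a * x[i + 1];
          y[i + 2] += a * x[i + 2];
          y[i + 3] += a * x[i + 3];
      }
  }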

Dhrystones - a character and integer benchmark that has been around a long time; some compilers have accumulated "dhrystone tricks" over the years. The "with" and "without" rows in the tables are the benchmark's two standard timings, presumably with and without register variables.

Whetstones - times mostly elementary functions from libm, which is the same /mpp/lib/libm.a for both compilers.
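
A typical Whetstone module spends its time in a loop of transcendental calls along the lines of the fragment below (a schematic of one of the standard modules, not the actual source; the function name is mine), so both compilers are largely measuring the same libm:

  #include <math.h>

  /* Schematic of a Whetstone "elementary function" module: the time
     goes into repeated sin/cos/atan calls from libm. */
  double whet_kernel(long iters)
  {
      double t = 0.499975, x = 0.5, y = 0.5;
      long i;

      for (i = 0; i < iters; i++) {
          x = t * atan(2.0 * sin(x) * cos(x) /
                       (cos(x + y) + cos(x - y) - 1.0));
          y = t * atan(2.0 * sin(y) * cos(y) /
                       (cos(x + y) + cos(x - y) - 1.0));
      }
      return x + y;
  }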

Puzzle - this is my own benchmark that tests integer arithmetic, array references and deeply nested loops.

Nest - this benchmark computes N! with deeply nested loops and a counter in the innermost loop. If the compiler can inline automatically and simplify the loop nest, then there are real performance differences.
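
To make the idea concrete, here is a hypothetical reconstruction of the depth-4 case, built only from the description above (it is not the actual Nest source and the function name is mine): loop k runs k times, so the innermost counter finishes at 1*2*3*4 = 4! = 24. The "calls" and "inline" rows in the tables presumably differ in whether the inner levels sit behind function calls or are written out inline; a compiler that inlines and then simplifies the nest can reduce the whole computation to a constant, which is consistent with the near-zero times in Table 1.

  /* Hypothetical sketch of the Nest idea for N = 4: four nested loops
     with trip counts 1, 2, 3, 4 and a counter in the innermost loop.
     The increment executes 1*2*3*4 = 4! = 24 times. */
  long nest4(void)
  {
      long count = 0;
      int i1, i2, i3, i4;

      for (i1 = 0; i1 < 1; i1++)
          for (i2 = 0; i2 < 2; i2++)
              for (i3 = 0; i3 < 3; i3++)
                  for (i4 = 0; i4 < 4; i4++)
                      count++;
      return count;    /* equals 4! */
  }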

General Conclusions

  1. On the Gnu compiler we have that:
    
      no compiler optimization switches = -O0
                                     -O = -O1
    
    I couldn't detect similarly simple rules for the CRI compiler.
  2. It is not always the case, for either compiler, that performance increases monotonically with the optimization level -Oz as z increases. (This, to me, is always a surprise; I guess I should stop being surprised.)
  3. At the highest level of optimization for both compilers, the AC compiler is not always faster than the CRI compiler. In particular, the CRI compiler is faster on the Livermore Loops, but slower on Linpack.

Using the AC compiler at ARSC is described in Newsletter #46 (8/4/95).

The 1.2.2 Release of the Programming Environment

As of the next downtime (6:00 PM, August 22, 1995), ARSC will be running the 1.2.2 release of the Programming Environment as the default. If you have any problems with this release, please contact Mike Ess.

Announcement on the t3d@cug.org Reflector

The following announcement appeared this week on the t3d@cug.org reflector:

  >
  > Announcement for anyone interested in a T3D tool for partitioning
  > unstructured problems.
  > 
  > I have developed a program called pmrsb (Parallel Multilevel Recursive
  > Spectral Bisection) that partitions graphs and finite-element meshes
  > in parallel on the T3D.  It determines processor assignments for
  > vertices of a graph or elements of a mesh that simultaneously balance
  > load and minimize interprocessor communication.  In addition to
  > partitioning a graph, pmrsb can generate a dual graph from a
  > finite-element mesh.  It should be able to handle very large problems
  > (> 10**6 elements).
  > 
  > The pmrsb code is not an officially supported Cray Research product
  > but I can make it available to interested customers.  Please let me
  > know if you would like to try it.
  > 
  >         Steve Barnard
  >         stb@cray.com
  > 

The T3D Reflector

There is a T3D news reflector that you can subscribe to by sending e-mail to t3d-request@cug.org with a short note saying you would like to be on the list of recipients. Bob Stock and Rich Raymond of the Pittsburgh Supercomputing Center and Fred Johnson of NIST are responsible for setting it up. The above announcement was circulated through this reflector.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
  12. RANF() and its manpage differ between machines (Newsletter #37)
  13. CRAY2IEG is available only on the Y-MP (Newsletter #40)
  14. Missing sort routines on the T3D (Newsletter #41)

I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.