| Newsletter Index | Quick-Tip Index | Search Newsletters |
When the AC compiler was announced at the Spring CUG meeting, there was a lot of interest because it was reported to be faster than the Standard C compiler provided by CRI. I believe CRI's contention was that the speed difference varies from program to program. To investigate this claim, I ran some standard benchmarks and some of my own to put some numbers into the argument. The two tables below show the effect of the standard optimization switches for each compiler. Both tables have the same format. (When results are given in MFLOPS, Dhrystones or Whetstones, bigger is better; when the results are in seconds, small is beautiful.)
Table 1
Performance results for the CRI Standard C compiler (PE 1.2.2):

compiler switch:        NONE      -O     -O0     -O1     -O2     -O3
Livermore loops
  long loops            9.98   10.03    1.84    8.59   10.01   11.41  MFLOPS
  medium loops          9.72    9.73    1.84    8.38    9.73   10.27  MFLOPS
  short loops           9.73    9.74    1.84    8.40    9.37   10.01  MFLOPS
Linpack
  100
    single              14.6    14.6     2.7    14.6    16.4    16.3  MFLOPS
    double              12.6    12.6     2.6    12.4    13.8    13.8  MFLOPS
  1000
    single              15.3    15.3     2.8    15.3    15.8    15.8  MFLOPS
    double              10.8    10.8     2.6    10.8    11.0    11.0  MFLOPS
Dhrystones
  with                 37037   37037   28571   45454   47619   55556  Dhrystones
  without              37037   38461   28571   45454   45454   55556  Dhrystones
Whetstones             36405   36404   36405   36405   36406   36397  Whetstones
Puzzle                  45.8    45.8   135.0    40.8    45.8    48.2  seconds
Nest calls             112.8   112.8   175.9     1.3     0.0     0.0  seconds
     inline              0.2     0.2    72.5     1.7     0.2     0.2  seconds
Table 2
Performance results for the IDA SRC Gnu C compiler (AC 2.6.2):

compiler switch:        NONE      -O     -O0     -O1     -O2     -O3     -O4     -O5
Livermore loops
  long loops            2.37    8.26    2.42    8.24    9.52    9.51    9.51    9.51  MFLOPS
  medium loops          2.33    7.47    2.33    7.47    8.36    8.35    8.34    8.34  MFLOPS
  short loops           2.33    7.33    2.33    7.34    8.14    8.18    8.17    8.15  MFLOPS
Linpack
  100
    single               3.7    15.2     3.7    15.2    17.0    17.0    17.0    17.0  MFLOPS
    double               3.6    13.0     3.7    13.0    14.3    14.3    14.3    14.3  MFLOPS
  1000
    single               3.9    15.4     3.9    15.4    16.8    16.8    16.7    16.7  MFLOPS
    double               3.5    11.0     3.5    11.0    11.6    11.6    11.6    11.6  MFLOPS
Dhrystones
  with                 45454   66667   45454   66667   90909   90909   90909   90909  Dhrystones
  without              43478   66667   43478   66667   90909  100000   90909   90909  Dhrystones
Whetstones             24160   37896   24160   37896   38612   38613   38611   38611  Whetstones
Puzzle                 113.8    36.5   113.8    36.5    33.6    35.2    33.6    33.6  seconds
Nest calls             223.5    75.1   223.5    75.1   120.4   120.4   120.3   120.4  seconds
     inline            125.2    18.1   125.2    18.1    17.9    17.9    17.9    17.9  seconds

Both compilers support other performance switches (the GNU C compiler
presents a "switch Heaven" for those inclined), but I did not test them.
Similarly, the sources and timers I used were exactly the same for both
compilers, but modifications to the codes could have dramatically changed
the results. The source for each C benchmark is available from the netlib
ftp site at netlib2.cs.utk.edu or from me. Below is a short description
of each benchmark and its results:
Livermore Loops - this is the standard Fortran benchmark converted to C. It times 24 loops, which consist mostly of floating point operations. What is shown above is the harmonic mean for all 24 loops when run with short (average 18), medium (average 89) and long (average 468) loop lengths. On both compilers, there is a slight increase in the MFLOPS rate with increasing loop length.
Linpack - this code is in C and is not comparable to the published Fortran results. This benchmark is really just a timing of a single saxpy loop, but that loop is still the most common loop of linear algebra. Both compilers did best on the version that unrolled that saxpy loop. In the tables above, with optimization enabled, single precision is roughly 15 to 20% faster than double precision in the 100 case and roughly 40 to 45% faster in the 1000 case.
Dhrystones - a character and integer benchmark. It has been around a long time, and some compilers have accumulated "dhrystone tricks" over the years.
Whetstones - times mostly elementary functions from libm, which is the same /mpp/lib/libm.a for both compilers.
Puzzle - this is my own benchmark that tests integer arithmetic, array references and deeply nested loops.
Nest - this benchmark computes N! with deeply nested loops and a counter in the innermost loop. If the compiler can inline automatically and simplify the loop nest, then there are real performance differences.
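To make the Nest description concrete, here is a minimal sketch in C. It is not the benchmark source. It assumes that loop k of the nest runs k times, so the counter in the innermost loop ends at N! (N = 5 here, so 120). The names nest_inline, nest_calls and level1 through level5 are hypothetical, and the split into a "calls" form and an "inline" form is only my reading of the two Nest rows in the tables.

```c
/* Sketch only: a five-deep loop nest whose innermost counter reaches 5! = 120.
   All names here are hypothetical; this is not the original Nest source. */

/* "inline" form: the whole nest is written out in one function. */
long nest_inline(void)
{
    long count = 0;
    int i1, i2, i3, i4, i5;

    for (i1 = 0; i1 < 1; i1++)
        for (i2 = 0; i2 < 2; i2++)
            for (i3 = 0; i3 < 3; i3++)
                for (i4 = 0; i4 < 4; i4++)
                    for (i5 = 0; i5 < 5; i5++)
                        count++;          /* counter in the innermost loop */
    return count;
}

/* "calls" form: each loop level lives in its own function, so the
   compiler must inline the calls before it can simplify the nest. */
static long level5(void)
{
    long c = 0;
    int i;
    for (i = 0; i < 5; i++)
        c++;
    return c;
}

static long level4(void)
{
    long c = 0;
    int i;
    for (i = 0; i < 4; i++)
        c += level5();
    return c;
}

static long level3(void)
{
    long c = 0;
    int i;
    for (i = 0; i < 3; i++)
        c += level4();
    return c;
}

static long level2(void)
{
    long c = 0;
    int i;
    for (i = 0; i < 2; i++)
        c += level3();
    return c;
}

static long level1(void)
{
    long c = 0;
    int i;
    for (i = 0; i < 1; i++)
        c += level2();
    return c;
}

long nest_calls(void)
{
    return level1();
}
```

Both functions return the same value. A compiler that inlines the calls and then collapses the nest can reduce either form to a constant, which would be consistent with the near-zero times in some columns of Table 1.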
For the AC compiler, the results in Table 2 suggest two simple rules:
  no compiler optimization switches = -O0
  -O = -O1
I couldn't detect similar simple rules for the CRI compiler.
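The Linpack description above says that both compilers did best on the version that unrolled the saxpy loop. As an illustration of what that means, here is a minimal sketch of the loop in rolled and unrolled form. It is not taken from the benchmark source: the function names saxpy_rolled and saxpy_unrolled and the unroll factor of four are assumptions chosen for the example.

```c
#include <stddef.h>

/* Sketch only: names and the unroll factor of 4 are assumptions,
   not taken from the Linpack benchmark source. */

/* Straightforward saxpy: y = a*x + y, one element per iteration. */
void saxpy_rolled(size_t n, float a, const float *x, float *y)
{
    size_t i;
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Same computation, unrolled by four. The first n % 4 elements are
   handled by a short cleanup loop so the main loop can step by four. */
void saxpy_unrolled(size_t n, float a, const float *x, float *y)
{
    size_t i;
    size_t m = n % 4;

    for (i = 0; i < m; i++)
        y[i] += a * x[i];

    for (i = m; i < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

Unrolling cuts the loop overhead per element and gives the compiler more independent operations to schedule in each iteration. The double precision version has the same shape with double in place of float.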
> Announcement for anyone interested in a T3D tool for partitioning
> unstructured problems.
>
> I have developed a program called pmrsb (Parallel Multilevel Recursive
> Spectral Bisection) that partitions graphs and finite-element meshes
> in parallel on the T3D. It determines processor assignments for
> vertices of a graph or elements of a mesh that simultaneously balance
> load and minimize interprocessor communication. In addition to
> partitioning a graph, pmrsb can generate a dual graph from a
> finite-element mesh. It should be able to handle very large problems
> (> 10**6 elements).
>
> The pmrsb code is not an officially supported Cray Research product
> but I can make it available to interested customers. Please let me
> know if you would like to try it.
>
> Steve Barnard
> stb@cray.com
Contact:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Craig Stephenson, ARSC User Consultant, ph: 907-450-8653
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020