ARSC T3D Users' Newsletter 84, April 26, 1996
Asymptotic Benchmark Results at ARSC
I have a few results that I wanted to share before I leave ARSC. As I've pointed out in the ARSC training courses, there is a big difference between the floating point performance of RISC processors and the performance of the memory systems that supply operands to them. The T3D processor is one such RISC processor: if the operands are not in cache, the memory cannot keep up with the processor's floating point units. The Y-MP line of computers, on the other hand, has a substantial memory system supporting its fast floating point units.
At the end of this article, I show the results of the STREAM benchmark and a SPEED benchmark in the same table. The results are a combination of the timings I've done on machines available at ARSC and timings that are publicly available.
Tom Parker's SPEED Benchmark
From Tom Parker of Consulting Services at the National Center for Atmospheric Research in Boulder, Colorado, I received a small Fortran program that gets a very high MFLOPS rate for both vector processors and RISC processors. The program is short and is given below:
ccccccccccccccccccccccccccccccccccccccccccccccccc
c Get MFLOP rate of CRAY.
c
c 20SEP95 Tom Parker, SCD Consulting Office.
ccccccccccccccccccccccccccccccccccccccccccccccccc
      parameter ( n = 1 000 000 )
      double precision x(n),z(n),a(0:10)
      double precision second, t0, t
      do i = 0, 10
        a( i ) = i + .1   ! initialize polynomial coefficients
      enddo
      do i = 1, n
        x( i ) = i + .1   ! initialize values to evaluate polynomial
      enddo
      t0 = second()
      do i = 1,n
        z(i)=(((((((((a(0) *      ! evaluate polynomial using
     &         x(i)+a( 1)) *      ! Horner's method
     &         x(i)+a( 2)) *
     &         x(i)+a( 3)) *
     &         x(i)+a( 4)) *
     &         x(i)+a( 5)) *
     &         x(i)+a( 6)) *
     &         x(i)+a( 7)) *
     &         x(i)+a( 8)) *
     &         x(i)+a( 9)) *
     &         x(i)+a(10)
      enddo
      t=second()-t0
      write ( 6, 600 ) ( 20.0 * n ) / ( t * 1000000.0 )   ! mflop/s
  600 format( f10.3 )
      call dummy( z )   ! Fool the compiler into doing the work !
      end
      subroutine dummy( z )
      end
The purpose of the program is to attain near-peak performance in Fortran and still do something useful. I used to think that a vendor's implementation of the 1000x1000 LINPACK case was the closest you could get to peak performance, but those sources are probably not publicly available, certainly not simple, and may not be in Fortran.
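The factor 20.0 in the write statement of the program comes from Horner's method: a polynomial of degree 10 costs one multiply and one add per degree, or 20 floating point operations per point. The following is a minimal sketch of that accounting only. The function name hmflop and the elapsed time of 0.1 seconds are made up for illustration and are not part of Tom Parker's program.

```fortran
c     Sketch: flop accounting for a Horner evaluation.
c     hmflop is a hypothetical helper, not part of the SPEED program.
      program flops
      implicit none
      integer n, ndeg
      double precision t, hmflop
      n    = 1000000      ! loop length used in the SPEED program
      ndeg = 10           ! degree of the polynomial a(0:10)
      t    = 0.1d0        ! made-up elapsed time in seconds
      write ( 6, 600 ) hmflop( n, ndeg, t )
  600 format( f10.3 )
      end

      double precision function hmflop( n, ndeg, t )
c     One multiply and one add per degree gives 2*ndeg flops per
c     point; dividing by 1.0d6 converts flop/s to Mflop/s.
      implicit none
      integer n, ndeg
      double precision t
      hmflop = ( 2.0d0 * ndeg * n ) / ( t * 1.0d6 )
      end
```

With the made-up time of 0.1 seconds this works out to 200 Mflop/s. A real run substitutes the measured time.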
Parker's program should run well on both vector processors and RISC processors. Of course, it vectorizes on the CRAY Y-MP computers and there is little memory activity, just a vector load at the beginning and a vector store at the end. There are many overlapping or chained vector multiplies and adds in the body of the loop. The compiler can schedule the scalar loads of the invariants a(0:10) without conflict into the 8 scalar registers. On all Cray computers, the -dp switch is used to ensure that double precision is implemented with 64 bits, not 128 bits. Also on all Cray computers, no optimization flags are used, as optimization is on by default.
On the RISC processors, the one stream of inputs is accessed in a cache-friendly manner and the invariants a(0:10) can reside in the 32 registers for the life of the loop. As on all RISCs, double precision is 64 bits by default. I'm willing to experiment to find the best optimization switch -O?, but nobody except the vendor has time to test all possible compiler switches to find the optimal combination. On both the SGIs and the Crays, the compiler will optimize out the entire loop unless the call to the dummy routine is used to trick the compiler into thinking the computed results will be used. Similarly, the coefficients and the evaluation points must be values that the compiler will not special-case.
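The call to dummy is one way to keep the timed loop alive. Another common approach, shown here only as an illustrative substitute and not as something the SPEED program does, is to consume the results after the timing ends, for example by printing a checksum. The sketch below uses a short loop and a degree 2 polynomial to stay small. The program name keep and the variable s are hypothetical.

```fortran
c     Sketch: keep the computed results live by printing a checksum.
c     This is an illustrative alternative to call dummy( z ).
      program keep
      implicit none
      integer n, i
      parameter ( n = 1000 )
      double precision x(n), z(n), a(0:2), s
      do i = 0, 2
         a( i ) = i + .1d0
      enddo
      do i = 1, n
         x( i ) = i + .1d0
      enddo
c     The loop that would be timed.
      do i = 1, n
         z(i) = ( a(0) * x(i) + a(1) ) * x(i) + a(2)
      enddo
c     Outside the timed region: every z(i) feeds the output, so
c     the loop above cannot be discarded as dead code.
      s = 0.0d0
      do i = 1, n
         s = s + z(i)
      enddo
      write ( 6, * ) 'checksum = ', s
      end
```

The checksum loop belongs outside the timed region, otherwise its adds and its extra pass over z would be counted in the measurement.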
Both Tom Parker and I are looking to get more timings with this source. We are particularly interested in C90 and T90 single processor results. RISC processors, on this program, don't get as close to their peak performance as do the vector processors. If there's something I've missed for RISC processors I'd like to hear about it.
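To put numbers on how close each machine gets to peak, the measured/theoretical Mflops pairs from the table at the end of this article can be turned into percentages. The sketch below does only that arithmetic for the rows that list a theoretical peak. The SGI rows are left out because their peak is given as ??. The program and variable names are hypothetical.

```fortran
c     Sketch: percent of theoretical peak for the SPEED benchmark,
c     using measured/theoretical pairs copied from the table.
      program peak
      implicit none
      integer nm, i
      parameter ( nm = 5 )
      character*12 name(nm)
      double precision rmeas(nm), rpeak(nm)
      data name  / 'T3D(1PE)', 'Y-MP M98', 'Cray_Y/MP',
     &             'Cray_J932', 'Cray_EL-98' /
      data rmeas /  22.0d0, 282.0d0, 305.0d0, 183.0d0, 62.0d0 /
      data rpeak / 150.0d0, 333.0d0, 333.0d0, 200.0d0, 66.0d0 /
      do i = 1, nm
         write ( 6, 610 ) name(i), 100.0d0 * rmeas(i) / rpeak(i)
      enddo
  610 format( a12, f6.1, ' % of peak' )
      end
```

The single T3D PE comes out at roughly 15 percent of peak, while the vector machines fall between about 85 and 94 percent.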
The STREAM Benchmark
Another publicly available benchmark is the STREAM benchmark by John McCalpin of the University of Delaware, Graduate College of Marine Studies, Ocean Modeling Research Group. There is a great description of the benchmark and results on many machines at:
http://perelandra.cms.udel.edu/hpc/stream/
A brief description of the benchmark is:
> The STREAM benchmark is a simple synthetic benchmark program that measures
> sustainable memory bandwidth (in MB/s) and the corresponding computation rate
> for simple vector kernels.
The DO loops timed are:
> --------------------------------------------------------
> name     kernel                  bytes/iter   FLOPS/iter
> --------------------------------------------------------
> COPY:    a(i) = b(i)                 16           0
> SCALE:   a(i) = q*b(i)               16           1
> SUM:     a(i) = b(i) + c(i)          24           1
> TRIAD:   a(i) = b(i) + q*c(i)        24           2
> --------------------------------------------------------
From the source available at the above web site, the length of these DO loops (for loops in the C source) is 2,000,000 iterations. In this respect the benchmark is similar to Tom Parker's program: only asymptotic behavior is measured. But unlike the polynomial evaluation program, these loops have large, varying memory requirements. The number of iterations has been chosen so that the loop operands cannot all reside in cache, and therefore the loops mimic memory system performance for large problems. John McCalpin does a much better job of explaining and justifying the benchmark in the papers available on the above web page. For large programs that are not cache contained, these results are important.
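As a rough illustration of how a bandwidth figure follows from the bytes/iter column, here is a minimal sketch of a TRIAD-style measurement. It is not the STREAM source, and the real benchmark is more careful about timing. The use of cpu_time, the single timed pass, and the assumption that 1 MB is 1.0d6 bytes are choices made for this sketch.

```fortran
c     Sketch of a TRIAD-style bandwidth measurement.  This is not
c     the STREAM source; the timer and the single pass are
c     assumptions made for illustration.
      program triad
      implicit none
      integer n, i
      parameter ( n = 2000000 )
      double precision a(n), b(n), c(n), q
      real t0, t1
      q = 3.0d0
      do i = 1, n
         b( i ) = 1.0d0
         c( i ) = 2.0d0
      enddo
      call cpu_time( t0 )
      do i = 1, n
         a( i ) = b( i ) + q * c( i )    ! 24 bytes, 2 flops
      enddo
      call cpu_time( t1 )
      if ( t1 .gt. t0 ) then
c        Two 8-byte loads and one 8-byte store per iteration.
         write ( 6, 620 ) 24.0d0 * n / ( ( t1 - t0 ) * 1.0d6 )
      else
         write ( 6, * ) 'interval too short for this timer'
      endif
  620 format( 'TRIAD MB/s: ', f12.1 )
      write ( 6, * ) a( 1 ), a( n )      ! use the result
      end
```

The three arrays together occupy 48 million bytes, far more than any cache on the machines discussed here, which is the point of choosing such a long loop.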
The table below is a mixture of results:
- entries marked # are from Tom Parker
- all ARSC timings are measured by Mike Ess
- all others are from John McCalpin's STREAM web page

                                    <---------Bandwidth (MB/s)---------->  Mflops
Machine                     NCPUs      Copy     Scale       Sum     Triad  T. Parker's
                                                                           measured/
                                                                           theoretical
--------------------------- -----  --------  --------  --------  --------  -----------
Machines at ARSC
Onyx L R4400(150MHz) onyx2             59.3      57.1      57.1      61.5  38/??
Onyx Reality Engine video1             59.3      57.1      57.1      61.5  38/??
Indy R4600PC(100MHz) amstel            61.5      57.1      53.3      60.0  27/??
Indy R4600PC(133MHz) kvasir            45.7      42.1      42.9      50.0  26/??
Indy R4400(100MHz) guinness            48.5      44.4      44.4      42.1  (C, no f77)
T3D(1PE)                              173.2     132.6     122.6     108.4  22/150
T3D(1PE,-Drdahead=on)                 231.4     181.3     122.6     108.4  24/150
Y-MP M98                             1775.0    1801.9    1929.7    1905.2  282/333
Cray machines
Cray_T90                        1   11341.5   10717.5   14783.6   13920.0
Cray_C90                       16  105497.4  104656.4  101736.1  103812.8
Cray_C90                        8   55071.9   55391.8   60843.3   63229.6
Cray_C90                        4   27610.3   27789.6   34633.3   35044.1
Cray_C90                        2   13866.0   13905.5   18233.2   18246.3
Cray_C90                        1    6965.4    6965.4    9378.7    9500.7
Cray_Y/MP                       8   19291.6   19294.2   26588.9   26802.2
Cray_Y/MP                       4    9685.8    9678.9   13781.4   13851.2
Cray_Y/MP                       1    2426.4    2426.2    3454.4    3396.9  305/333#
Cray_J932                      32   19007.0   18944.1   19993.9   18870.4
Cray_J932                      16   16298.2   15851.5   15657.6   14995.9
Cray_J932                       8    9995.2    9726.8    9087.4    8941.3
Cray_J932                       4    5255.3    5094.9    4688.3    4657.6
Cray_J932                       2    2842.2    2766.3    2493.7    2527.6
Cray_J932                       1    1433.6    1408.6    1260.8    1270.0  183/200#
Cray_EL-98                      8    2362.8    2310.5    2373.7    2363.8
Cray_EL-98                      4    1564.9    1569.8    1933.8    1955.5
Cray_EL-98                      2     826.7     833.8    1049.0    1078.2
Cray_EL-98                      1     437.2     436.7     536.2     476.8  62/66#
Cray_T3D_(assembly)           512  169677.7  166578.1  114976.8  112126.4
Cray_T3D_(assembly)           256   98303.7   84229.5   57622.6   56078.7
Cray_T3D_(assembly)           128   49132.9   42113.9   28811.1   28032.5
Cray_T3D_(assembly)            64   24577.9   21061.6   14405.8   14020.2
Cray_T3D_(assembly)            32   12288.6   10530.7    7204.2    7010.7
Cray_T3D_(assembly)             1     384.5     329.4     225.4     220.1
Cray_T3D_(Fortran)            512  161479.7  168193.5   91775.6   95304.2
Cray_T3D_(Fortran)            256   98316.2   84241.5   47824.1   45248.1
Cray_T3D_(Fortran)            128   49156.4   42128.2   23912.0   22625.2
Cray_T3D_(Fortran)             64   24580.7   21064.7   11955.7   11312.9
Cray_T3D_(Fortran)             32   12290.3   10533.4    5978.3    5656.4
Cray_T3D_(Fortran)              1     384.2     329.2     187.0     176.8
Cray_CS6400                    32     824.1     819.6     885.0     882.6
Cray_CS6400                    24     761.9     753.7     775.5     774.5
Cray_CS6400                    16     611.5     601.0     596.0     594.6
Cray_CS6400                     8     347.9     343.4     341.4     342.6
Cray_CS6400                     4     188.9     184.3     188.4     188.8
Cray_CS6400                     1      51.1      49.9      50.0      50.2
Other machines of interest
DEC_3000/300                    1      33.4      33.5      39.6      38.9
IBM_RS6000-990                  1     663.4     533.4     714.5     713.8
Intel_Pentium/133               1      84.4      77.1      85.7      85.9
Notes from the table:
- The SGI workstations at ARSC vary in floating point performance, but the STREAM benchmark seems to point to a common underlying memory system.
- I believe that my results on the T3D differ from those submitted by CRI because CRI modified the storage of the arrays to minimize cache and page conflicts. Vendors that supply results are allowed to change the source and experiment with compiler flags. The STREAM web page has the email correspondence of vendors submitting their results; they do not all submit their modified source.
- Both the STREAM benchmark and the SPEED benchmark show that the ARSC M98, with its DRAM memory, takes a performance hit relative to the usual Y-MP.
- The contrast between the Cray_T3D and Cray_CS6400 is the classic problem of shared memory versus distributed memory. There are many more machines listed on the STREAM web page. (The DEC_3000/300 is the DEC workstation closest to a single T3D PE.)
- The T3D timings are exceptional, but the STREAM benchmark comes close to measuring performance as:
      MPP performance = (performance of one PE) * (number of PEs)
  which is almost never the case with a real application.
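The last two notes can be checked against the table by dividing a multiprocessor result by the one-processor result times the processor count. The sketch below does this for the Triad column of the Cray_T3D_(Fortran) and Cray_CS6400 rows. The function name eff and the program name are hypothetical, and the numbers are copied from the table.

```fortran
c     Sketch: how close the STREAM Triad results in the table come
c     to (performance of one PE) * (number of PEs).
      program mppeff
      implicit none
      double precision eff
      write ( 6, 630 ) 'Cray_T3D_(Fortran)', 512,
     &   eff( 95304.2d0, 176.8d0, 512 )
      write ( 6, 630 ) 'Cray_CS6400       ',  32,
     &   eff( 882.6d0, 50.2d0, 32 )
  630 format( a18, i5, ' CPUs: ', f6.1, ' % of linear' )
      end

      double precision function eff( bwn, bw1, ncpu )
c     bwn = measured bandwidth on ncpu processors (MB/s)
c     bw1 = measured bandwidth on one processor   (MB/s)
      implicit none
      double precision bwn, bw1
      integer ncpu
      eff = 100.0d0 * bwn / ( bw1 * ncpu )
      end
```

On these numbers the 512-PE T3D lands a little above 100 percent of linear, while the 32-CPU CS6400 reaches roughly 55 percent, which is the shared memory versus distributed memory contrast described in the notes.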
