ARSC T3D Users' Newsletter 75, February 23, 1996
Matrix-Vector Multiplication on the Y-MP and T3D
For an upcoming class, I wanted to illustrate the 'range of performance' that is available on a single machine as a function of optimization effort. Like many computer scientists before me, I milked the example of matrix-vector multiplication. For my timings, I took the problem of:
Ab = c
where A is an n by n matrix and b and c are vectors of n elements.
There is a plethora (nice word) of ways of doing this multiplication; I stopped accumulating them at 10.
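For reference, the product and the operation count behind the MFLOPS figures can be written out as below. The count of n(2n-1) floating-point operations is the one used in the ops expression of the timing program listed near the end of this newsletter; the symbols T (the number of repetitions, maxtrips in the program) and t (the accumulated time in seconds) are notation introduced here for illustration only.

```latex
\begin{align*}
c_i &= \sum_{j=1}^{n} a_{ij}\, b_j, \qquad i = 1, \dots, n \\
\text{flops}(n) &= n \,\bigl( n + (n - 1) \bigr) = n\,(2n - 1)
  && \text{($n$ multiplies and $n-1$ adds per element of $c$)} \\
\text{MFLOPS} &= \frac{T \cdot n\,(2n - 1)}{10^{6} \cdot t}
\end{align*}
```

For n = 1000 this gives 1,999,000 operations per product.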
Top Ten ways of Multiplying a Matrix by a Vector on CRI Hardware
- sgemm - BLAS3 routine, in libsci
- sgemv - BLAS2 routine, in libsci
- mxma - old libsci routine, missing in the T3D version of libsci
- calls to saxpy - BLAS1 routine, in libsci, outer product formulation (callsaxpy)
- calls to sdot - BLAS1 routine, in libsci, inner product formulation (callsdot)
- calls to Fortran version with a saxpy loop (fsaxpy)
- calls to Fortran version with a saxpy loop with one IF statement (fsaxpy1)
- calls to Fortran version with a saxpy loop with two IF statements (fsaxpy2)
- calls to Fortran version with a saxpy loop with loop unrolled 4 times (fsaxpy3)
- calls to Fortran version with a sdot loop (fsdot)
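The saxpy and sdot variants in this list differ only in how the same sum is grouped. As a sketch (the column notation A_{:,j} and the row notation A_{i,:} are introduced here and are not part of the original text): the saxpy form builds c as a sum of scaled columns of A, which Fortran stores contiguously, while the sdot form computes each element of c as the dot product of a row of A with b, which walks through memory with a stride equal to the leading dimension of A.

```latex
\begin{align*}
\text{saxpy form:} \quad & c = \sum_{j=1}^{n} b_j \, A_{:,j}
  && \text{($n$ vector updates, stride 1 through $A$)} \\
\text{sdot form:} \quad & c_i = A_{i,:}\, b = \sum_{j=1}^{n} a_{ij}\, b_j, \quad i = 1, \dots, n
  && \text{($n$ dot products, stride lda through $A$)}
\end{align*}
```

Both forms perform the same n(2n-1) operations once the initial zeroing of c is discounted; only the memory access pattern changes.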
Table 1
MFLOPS for matrix-vector multiplication methods on the ARSC Y-MP
case size sgemm sgemv mxma callsaxpy callsdot fsaxpy fsaxpy1 fsaxpy2 fsaxpy3 fsdot
1 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
3 2 0.3 0.6 0.5 0.5 0.5 0.7 0.6 0.5 0.6 0.5
4 3 0.9 1.4 1.3 1.1 1.0 1.6 1.5 1.1 1.4 1.1
5 4 1.6 2.5 2.3 1.9 1.7 2.7 2.5 1.8 2.6 1.8
6 5 2.5 4.0 3.6 2.8 2.3 4.2 3.8 2.5 3.9 2.5
7 6 3.7 5.8 5.2 3.6 3.0 5.8 5.2 3.3 5.5 3.3
8 7 5.1 7.7 7.0 4.6 3.7 7.6 6.6 4.1 7.2 4.2
9 8 6.6 10.1 9.0 5.6 4.5 9.5 8.2 4.9 9.6 5.0
10 9 8.3 12.5 11.3 6.6 3.6 11.5 9.8 5.8 11.8 5.8
11 10 10.1 15.0 13.5 7.6 4.0 13.5 11.3 6.5 13.6 6.6
12 16 23.8 34.2 30.2 13.5 6.7 26.2 21.4 11.0 29.9 11.2
13 20 34.9 48.4 42.4 17.8 7.6 34.8 27.9 13.4 41.9 13.9
14 30 66.0 86.4 70.2 27.5 11.0 53.8 42.5 18.3 70.6 19.4
15 32 72.7 93.6 75.6 29.3 11.6 56.6 44.9 19.2 77.9 20.6
16 40 97.1 119.6 92.5 35.3 12.8 65.2 52.9 21.7 99.2 23.7
17 50 125.6 148.7 107.3 41.1 15.2 75.5 61.7 24.4 115.5 26.9
18 60 150.6 172.0 118.6 48.6 17.3 83.0 68.8 26.3 131.2 29.6
19 63 153.7 174.9 123.0 50.3 17.9 84.9 70.9 26.9 132.3 30.3
20 64 159.2 179.7 122.7 51.0 18.2 86.4 71.9 27.1 136.5 30.8
21 65 133.7 145.4 103.8 47.8 17.8 74.2 63.4 21.6 117.4 27.2
22 70 143.6 152.7 109.4 50.2 18.9 77.2 66.2 22.5 123.2 29.5
23 80 157.9 170.9 120.6 55.2 21.7 82.7 72.2 24.2 135.1 33.4
24 90 173.2 185.4 127.1 60.9 23.9 88.2 77.6 25.7 142.6 35.9
25 100 180.2 192.7 133.0 65.4 25.9 91.5 80.9 27.0 156.1 38.3
26 128 212.3 221.1 141.6 74.8 31.1 99.4 89.1 29.9 168.0 43.1
27 200 210.2 210.4 139.5 85.1 39.2 97.6 90.9 28.2 168.8 54.0
28 256 231.0 234.5 148.0 94.6 46.1 105.4 99.9 31.2 181.5 59.3
29 300 230.8 231.2 146.8 97.6 47.8 105.9 100.9 30.9 182.1 65.5
30 400 228.4 230.2 147.0 102.5 53.8 105.6 101.5 30.6 180.9 69.2
31 500 234.8 235.8 149.7 108.4 60.9 108.3 106.2 31.8 187.2 72.0
32 512 240.3 242.1 150.6 109.7 61.3 109.2 106.1 31.9 189.7 72.4
33 600 235.0 236.4 149.6 111.7 64.6 108.8 106.4 31.3 185.1 75.0
34 700 240.8 240.3 149.9 113.8 65.8 109.8 108.0 32.3 190.6 78.9
35 800 239.7 238.6 149.9 115.1 67.7 108.9 106.4 31.6 189.2 79.5
36 900 230.0 230.6 149.0 115.1 69.1 109.3 106.9 31.4 187.5 79.1
37 1000 236.1 235.5 150.8 118.9 72.8 109.2 108.0 31.9 190.0 80.4
Table 2
MFLOPS for matrix-vector multiplication methods on the T3D (1PE)
case size sgemm sgemv mxma callsaxpy callsdot fsaxpy fsaxpy1 fsaxpy2 fsaxpy3 fsdot
1 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1 0.1 0.1 0.0 0.1 0.1 0.5 0.5 0.5 0.3 0.5
3 2 0.4 0.7 0.0 1.0 0.9 2.3 2.2 2.0 1.7 2.3
4 3 1.1 1.8 0.0 2.2 1.9 4.6 4.5 3.9 3.7 5.6
5 4 1.7 2.6 0.0 3.5 3.1 6.5 6.6 5.4 6.1 6.9
6 5 1.9 3.3 0.0 4.0 4.2 8.6 8.6 6.9 7.7 8.7
7 6 2.9 4.8 0.0 5.0 5.3 10.0 10.1 8.2 9.6 11.5
8 7 4.1 6.9 0.0 6.8 6.5 11.2 11.4 9.3 11.1 13.2
9 8 3.4 4.5 0.0 7.7 6.1 11.7 12.0 10.1 16.7 15.0
10 9 3.6 5.1 0.0 7.6 6.8 12.5 13.2 10.6 17.1 15.5
11 10 4.2 6.1 0.0 8.2 7.9 13.0 13.9 11.3 17.5 17.2
12 16 9.0 15.2 0.0 10.1 12.2 14.7 15.5 12.7 24.6 18.9
13 20 9.9 15.1 0.0 11.3 12.0 15.3 16.5 13.4 27.0 20.2
14 30 12.8 23.9 0.0 13.9 13.1 12.4 12.9 11.0 18.6 15.2
15 32 15.5 29.9 0.0 15.4 13.1 12.1 12.5 10.7 19.1 14.7
16 40 15.2 29.2 0.0 16.6 13.1 11.5 11.8 8.2 16.6 13.4
17 50 16.0 32.0 0.0 17.0 12.2 11.1 11.4 9.7 16.5 11.7
18 60 17.3 35.0 0.0 18.5 12.8 11.3 11.6 9.8 17.2 12.0
19 63 16.6 33.1 0.0 18.1 11.4 11.0 11.7 9.8 16.8 11.9
20 64 17.8 36.4 0.0 14.1 5.2 11.3 11.7 9.8 17.4 11.9
21 65 16.6 26.4 0.0 14.3 5.2 11.3 11.6 9.8 17.2 11.9
22 70 16.7 32.9 0.0 14.4 5.2 11.0 11.0 9.8 17.2 11.9
23 80 18.5 37.2 0.0 15.8 5.3 11.5 11.8 9.3 17.8 12.0
24 90 17.9 34.6 0.0 16.4 5.4 11.3 11.0 9.6 16.8 11.1
25 100 17.9 35.7 0.0 17.1 5.4 11.3 11.7 9.7 17.2 9.5
26 128 17.8 38.5 0.0 19.0 5.7 11.1 11.8 9.8 17.4 7.3
27 200 18.6 38.3 0.0 21.2 5.9 11.2 11.6 9.6 17.0 6.2
28 256 18.7 39.4 0.0 22.6 5.9 11.2 11.5 9.6 17.3 6.1
29 300 18.7 39.3 0.0 23.2 5.9 11.1 11.4 9.5 17.2 6.1
30 400 19.0 39.2 0.0 24.4 6.0 10.8 11.2 9.3 17.0 5.9
31 500 18.7 39.0 0.0 24.9 6.1 10.6 10.9 9.2 16.9 5.8
32 512 18.8 39.4 0.0 25.0 6.1 10.6 10.9 9.1 16.9 5.8
33 600 18.8 39.2 0.0 25.4 6.1 10.5 10.8 9.0 16.7 5.7
34 700 18.9 39.4 0.0 25.7 6.1 10.2 10.5 8.8 16.6 5.6
35 800 18.9 39.5 0.0 26.0 6.1 10.1 10.4 8.7 16.4 5.4
36 900 18.8 39.3 0.0 26.1 6.1 9.9 10.2 8.5 16.3 5.4
37 1000 18.9 39.4 0.0 26.3 6.1 9.7 10.0 8.4 16.3 5.4
Graphical Presentation
Both tables contain the results of 3700 timing experiments, and it is almost impossible to extract the trends without graphing the data. To do this, I like to use the plotting tool gnuplot. Below are a typical makefile and plotfile for gnuplot:
#makefile
all: results.t3d.100 t3d.plot
	awk '{ print $$2, $$3 }' results.t3d.100 > sgemm
	awk '{ print $$2, $$4 }' results.t3d.100 > sgemv
	awk '{ print $$2, $$6 }' results.t3d.100 > callsaxpy
	awk '{ print $$2, $$7 }' results.t3d.100 > callsdot
	awk '{ print $$2, $$8 }' results.t3d.100 > fsaxpy
	awk '{ print $$2, $$9 }' results.t3d.100 > fsaxpy1
	awk '{ print $$2, $$10 }' results.t3d.100 > fsaxpy2
	awk '{ print $$2, $$11 }' results.t3d.100 > fsaxpy3
	awk '{ print $$2, $$12 }' results.t3d.100 > fsdot
	gnuplot t3d.plot
	lprt out
#plotfile - t3d.plot
set output 'out'
set term postscript
set title "Matrix Vector Multiplication on the ARSC T3D"
set yzeroaxis
set samples 37
set xlabel "Order of Matrix"
set ylabel "Mflop/s rate"
#set noborder
plot 'sgemm' with linespoints 1 1, 'sgemv' with linespoints 2 2, \
'fsaxpy3' with linespoints 3 3, 'callsaxpy' with linespoints 5 5, \
'fsaxpy' with linespoints 6 6, 'fsaxpy1' with linespoints 7 7, \
'fsaxpy2' with linespoints 8 8, 'callsdot' with linespoints 9 9, \
'fsdot' with linespoints 10 10
With the results of Table 1 and Table 2 graphed, it is easy to extract the following:
On the Y-MP:
- The asymptotic speeds for sgemm and sgemv are almost identical.
- As with all CRI vector machines, the asymptotic speed is almost reached at size 200.
- The cost of doing a problem of size 65 is substantially larger than one of 64. (The overhead of partitioning loops into segments of 64 or less is doubled.)
- The version using the unrolled implementation of saxpy is the fastest of all Fortran implementations.
- Enough IF statements can drive the performance to scalar speeds.
On the T3D:
- The breakdown in cache performance for all methods happens at problem sizes below 100.
- The winner is sgemv by a wide margin.
- There is some anomaly for sgemv at size 65. (But the T3D doesn't have vectors!?)
- All asymptotic speeds are below 40 MFLOPS; all Fortran asymptotic speeds are below 20 MFLOPS.
- The simplest Fortran implementations of sdot and saxpy have almost identical speeds.
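The jump in cost between sizes 64 and 65 on the Y-MP can be pictured with a hand-written strip-mined loop. The sketch below is only an illustration: the compiler generates this segmentation itself, and the program name strips, the variable names, and the test sizes are all made up for this example. It runs a saxpy-style update in segments of at most 64 elements (the length of a Y-MP vector register) and counts the segments.

```fortran
      program strips
c     Hypothetical sketch: count the segments needed for a loop of
c     length n when at most vl elements are handled per vector
c     operation (vl = 64, the Y-MP vector register length).
      integer vl
      parameter( vl = 64 )
      integer n, nseg, is, ie, i, k
      integer sizes( 4 )
      real a( 128 ), c( 128 ), b
      data sizes / 63, 64, 65, 128 /
      b = 2.0
      do 40 k = 1, 4
        n = sizes( k )
        do 10 i = 1, n
          a( i ) = 1.0
          c( i ) = 0.0
   10   continue
c       Strip-mined saxpy: the outer loop walks the segments, the
c       inner loop is the part that maps onto one vector operation.
        nseg = 0
        do 30 is = 1, n, vl
          ie = min( is + vl - 1, n )
          do 20 i = is, ie
            c( i ) = c( i ) + b * a( i )
   20     continue
          nseg = nseg + 1
   30   continue
        print *, 'n =', n, ' segments =', nseg
   40 continue
      end
```

Sizes 63 and 64 need one segment, while 65 and 128 need two. At size 65 the second segment holds a single element, so the per-segment overhead is paid twice for almost no extra work.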
MPI Keeps on Growing
> Parallel Programming with MPI
> March 5 & 6, 1996 at OSC
>
> The Ohio Supercomputer Center (OSC) is offering a two-day
> course on using the Message Passing Interface (MPI) standard
> to write parallel programs on several of the OSC MPP systems.
> For more information on MPI, see
> http://www.osc.edu/Lam.html#MPI on the WWW.
>
> MPI topics to be covered include a variety of
> processor-to-processor communication routines, collective
> operations performed by groups of processors, defining and
> using high-level processor connection topologies, and
> user-specified derived data types for message creation.
>
> The MPI workshop will be a combination of lectures and
> hands-on lab session in which the participants will write
> and execute sample MPI programs.
>
> Interested parties should contact Aline Davis at
> aline@osc.edu or (614) 292-9248. Due to the hands-on nature
> of the workshop, REGISTRATION IS LIMITED TO 20 STUDENTS.
New High Performance C++ Compiler for the Cray T3D
Not everyone is happy with CRI's C products for the T3D, and there have been several efforts to supplement them:
- The ACC compiler - ARSC's T3D newsletter #46 (8/4/95)
- The Split C compiler -
> Kuck & Associates, Inc. (KAI) announces the availability of
> the Photon C++ compiler for the Cray T3D computer
> architecture. Photon C++ has optimizations that allow
> developers to use object-oriented design all the way into
> the kernels of the application, and still achieve the
> performance of C code. As an assist to developing
> applications for the Cray T3D, Photon C++ is also available
> on every major Unix workstation.
>
> Photon C++ provides near draft standard syntax and a near
> draft standard C++ class library. In addition, for those
> with legacy codes, Photon C++ has Cfront 3.0 and 2.1
> compatibility modes.
>
> Photon C++ optimizes several paradigms used in
> object-oriented programming. Photon C++ automatically
> optimizes lightweight objects (objects that are created,
> used, and destroyed frequently), data abstractions (allowing
> the programmer to leave them in object-oriented form), and
> control flow to the most efficient form (allowing
> structured control flow to be maintained). Photon C++
> eliminates redundant tests, allowing self-checking member
> functions to be used efficiently.
>
> Photon C++ supports name spaces, exceptions, templates
> (with automatic instantiation), global constructors, RTTI,
> and STL.
>
> Look at this web page for more information on Photon C++
> on the Cray T3D:
>
> http://www.kai.com/photon/photon_t3d.html
>
> If you are looking for a single compiler to use across all
> of your development and production systems, consider using
> Photon C++ on your Unix workstations. Evaluation copies of
> Photon C++ are available now for these workstations. Look
> at this web page for more information on Photon C++ for
> Unix Workstations:
>
> http://www.kai.com/photon/photon_what_is.html
>
> You can contact KAI at:
>
> Kuck & Associates, Inc.    e-mail: kai@kai.com
> 1906 Fox Drive             Voice: +1-217-356-2288
> Champaign, IL 61820        Fax: +1-217-356-5199
> USA
List of Differences Between T3D and Y-MP
The current list of differences between the T3D and the Y-MP is:
- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (Newsletter #7)
- There is no GETENV on the T3D (Newsletter #8)
- Missing routine SMACH on T3D (Newsletter #9)
- Different Arithmetics (Newsletter #9)
- Different clock granularities for gettimeofday (Newsletter #11)
- Restrictions on record length for direct I/O files (Newsletter #19)
- Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
- Missing Linpack and Eispack routines in libsci (Newsletter #25)
- F90 manual for Y-MP, no manual for T3D (Newsletter #31)
- RANF() and its manpage differ between machines (Newsletter #37)
- CRAY2IEG is available only on the Y-MP (Newsletter #40)
- Missing sort routines on the T3D (Newsletter #41)
- Missing compiler allocation flags (Newsletter #52)
- Missing compiler listing flags (Newsletter #53)
- Missing MXMA routine on the T3D (Newsletter #75)
Here is a shortened version of the timing program used to produce Tables 1 and 2:
c     Timing driver: each method is run maxtrips times for every one
c     of the ncases matrix orders listed in index.
      parameter( lda = 1001, nmax = 1000, ncases = 37, maxtrips = 100 )
      integer index( ncases )
      real a( lda, 1000 )
      real b( nmax ), c( nmax ), d( nmax )
      real t( ncases, 10 )
      data index / 0,1,2,3,4,5,6,7,8,9,10,16,20,30,32,40,50,60,63,64,65,
     +       70,80,90,100,128,200,256,300,400,500,512,600,700,800,
     +       900,1000/
c     a( i, j ) = j and b( j ) = j, so each element of the product of
c     order n is 1*1 + 2*2 + ... + n*n; d( n ) holds that reference.
      do 100 j = 1, nmax
        do 90 i = 1, nmax
          a( i, j ) = j
   90   continue
        b( j ) = j
  100 continue
      d( 1 ) = 1.0
      do 110 i = 2, nmax
        d( i ) = d( i-1 ) + i * i
  110 continue
c     t( kcase, m ) accumulates the seconds spent in method m.
      do 130 j = 1, 10
        do 120 i = 1, ncases
          t( i, j ) = 0.0
  120   continue
  130 continue
      do 2000 ntrips = 1, maxtrips
        do 1000 kcase = 1, ncases
          n = index( kcase )
          tt = second()
          call sgemm( 'n','n',n,1,n,1.0,a,lda,b,nmax,0.0,c,nmax)
          t( kcase, 1 ) = t( kcase, 1 ) + second() - tt
c         Check the sgemm result against the reference value.
          do 210 i = 1, n
            error = c( i ) - d( n )
            if( error .ne. 0.0 ) then
              print *, ' error with sgemm', kcase, n, i, c( i ), d( n )
              stop
            endif
  210     continue
          tt = second()
          call sgemv( 'n', n, n, 1.0, a, lda, b, 1, 0.0, c, 1 )
          t( kcase, 2 ) = t( kcase, 2 ) + second() - tt
c         mxma is not called for n = 0; a dummy time is stored instead.
          if( n .gt. 0 ) then
            tt = second()
            call mxma( a, 1, lda, b, 1, nmax, c, 1, nmax, n, n, 1 )
            t( kcase, 3 ) = t( kcase, 3 ) + second() - tt
          else
            t(kcase,3)=1.0
          endif
          tt = second()
          call callsaxpy( a, lda, b, c, n )
          t( kcase, 4 ) = t( kcase, 4 ) + second() - tt
          tt = second()
          call callsdot( a, lda, b, c, n )
          t( kcase, 5 ) = t( kcase, 5 ) + second() - tt
          tt = second()
          call fsaxpy( a, lda, b, c, n )
          t( kcase, 6 ) = t( kcase, 6 ) + second() - tt
          tt = second()
          call fsaxpy1( a, lda, b, c, n )
          t( kcase, 7 ) = t( kcase, 7 ) + second() - tt
          tt = second()
          call fsaxpy2( a, lda, b, c, n )
          t( kcase, 8 ) = t( kcase, 8 ) + second() - tt
          tt = second()
          call fsaxpy3( a, lda, b, c, n )
          t( kcase, 9 ) = t( kcase, 9 ) + second() - tt
          tt = second()
          call fsdot( a, lda, b, c, n )
          t( kcase, 10 ) = t( kcase, 10 ) + second() - tt
 1000   continue
 2000 continue
c     Report MFLOPS: maxtrips products of n*(2n-1) operations each.
      write( 6, 601 )
      write( 6, 602 )
      do 3000 i = 1, ncases
        ops = maxtrips * index(i) * ( index(i) + index(i)-1 ) / 1.0e6
        write( 6, 600 )i,index(i),(ops/t(i,j),j=1,10)
 3000 continue
  600 format( i3, i5, 10f7.1 )
  601 format( 'case size sgemm sgemv mxma call call',
     &        ' fsaxpy fsaxpy1 fsaxpy2 fsaxpy3 fsdot' )
  602 format( ' saxpy sdot' )
      end
c     c = A*b as n calls to the BLAS1 saxpy, one per column of a.
      subroutine callsaxpy( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
   10 continue
      do 20 i = 1, n
        call saxpy( n, b( i ), a( 1, i ), 1, c, 1 )
   20 continue
      end
c     The same column-oriented (saxpy) loop written out in Fortran.
      subroutine fsaxpy( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
   10 continue
      do 20 j = 1, n
        do 9 i = 1, n
          c( i ) = c( i ) + b( j ) * a( i, j )
    9   continue
   20 continue
      end
c     fsaxpy with one IF: columns with b( j ) equal to zero are skipped.
      subroutine fsaxpy1( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
   10 continue
      do 20 j = 1, n
        if( b( j ) .ne. 0.0 ) then
          do 11 i = 1, n
            c( i ) = c( i ) + b( j ) * a( i, j )
   11     continue
        endif
   20 continue
      end
c     fsaxpy with two IFs: zero elements of a are skipped as well.
      subroutine fsaxpy2( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
   10 continue
      do 20 j = 1, n
        if( b( j ) .ne. 0.0 ) then
          do 11 i = 1, n
            if( a( i, j ) .ne. 0.0 ) then
              c( i ) = c( i ) + b( j ) * a( i, j )
            endif
   11     continue
        endif
   20 continue
      end
c     fsaxpy with the column loop unrolled 4 times; the do 30 loop
c     handles the columns left over after the k unrolled passes.
      subroutine fsaxpy3( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
   10 continue
      k = 0
      do 20 j = 1, n-3, 4
        do 11 i = 1, n
          c(i)=c(i)+
     &    b(j)*a(i,j)+b(j+1)*a(i,j+1)+b(j+2)*a(i,j+2)+b(j+3)*a(i,j+3)
   11   continue
        k = k + 1
   20 continue
      do 30 j = 4*k+1, n
        do 21 i = 1, n
          c( i ) = c( i ) + b( j ) * a( i, j )
   21   continue
   30 continue
      end
c     c = A*b as n calls to the BLAS1 sdot, one per row of a; rows
c     are accessed with stride lda.
      subroutine callsdot( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = sdot( n, a( i, 1 ), lda, b, 1 )
   10 continue
      end
c     The same row-oriented (sdot) loop written out in Fortran.
      subroutine fsdot( a, lda, b, c, n )
      real a( lda, 1 ), b( 1 ), c( 1 )
      do 10 i = 1, n
        c( i ) = 0.0
        do 9 j = 1, n
          c( i ) = c( i ) + a( i, j ) * b( j )
    9   continue
   10 continue
      end
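One way to see why the unrolled routine fsaxpy3 outruns the plain fsaxpy loop on both machines is to count memory references per floating-point operation in the two inner loops. This is a rough count, not a measurement: it assumes that the b( j ) values stay in registers and that the compiler does no further restructuring of either loop.

```latex
\begin{align*}
\text{fsaxpy:} \quad &
  \frac{\text{1 load of } a + \text{1 load of } c + \text{1 store of } c}
       {\text{1 multiply} + \text{1 add}} = \frac{3}{2} = 1.5 \\
\text{fsaxpy3:} \quad &
  \frac{\text{4 loads of } a + \text{1 load of } c + \text{1 store of } c}
       {\text{4 multiplies} + \text{4 adds}} = \frac{6}{8} = 0.75
\end{align*}
```

Under these assumptions, unrolling by 4 halves the memory traffic per operation, which is consistent with fsaxpy3 being the fastest Fortran version in both tables.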
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
