ARSC T3D Users' Newsletter 87, May 17, 1996

The Linpack Benchmark on the T3D

When is a benchmarking effort done? I think it's done when squeezing more speed out of the benchmark isn't worth the effort, and I've reached that point with the linpack benchmark. The only parts not covered in last week's newsletter were optimizing SGEFA outside of the main loop and experimenting with the solver routine, SGESL. The last entry in the table below incorporates all of the changes to SGEFA; none of my experiments with SGESL produced any speedup. The complete listing of the modified SGEFA is given at the end of this section; the remainder of the benchmark is available from netlib2.cs.utk.edu.
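
The times in Tables 1 and 2 separate the factorization (SGEFA) from the solve (SGESL). For anyone who wants to reproduce that split, here is a minimal sketch of a single-PE timing harness. It is only a sketch: it assumes the second() timer supplied with the benchmark source and uses a simple diagonally dominant test matrix rather than the official linpack generator.

        program timings
        integer lda, n
        parameter ( lda = 1001, n = 1000 )
        real a( lda, n ), b( n )
        integer ipvt( n ), info, i, j
        real second, t1, tfact, tsolv
        external second
  c
  c build a simple diagonally dominant test matrix (not the official
  c linpack generator); the row sums go into b, so the exact solution
  c is a vector of ones
  c
        do 20 j = 1, n
           do 10 i = 1, n
              a( i, j ) = 1.0
     10    continue
           a( j, j ) = real( n + 1 )
           b( j ) = real( 2 * n )
     20 continue
  c
  c time the factorization and the solve separately, as in Tables 1
  c and 2 below
  c
        t1 = second()
        call sgefa( a, lda, n, ipvt, info )
        tfact = second() - t1
        t1 = second()
        call sgesl( a, lda, n, ipvt, b, 0 )
        tsolv = second() - t1
        print *, 'sgefa time = ', tfact, '   sgesl time = ', tsolv
        end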

Table 1

Times (seconds) for the factorization (SGEFA) of the linpack problem

        problem size    1PE     2PEs     4PEs    8PEs   16PEs   32PEs
        ------------    ---     ----     ----    ----   -----   -----
asis      100x100      .047
results  1000x1000   60.380

lapack    100x100      .018
results  1000x1000   10.080

craft     100x100      .102     .205     .125    .086    .087    .087
         1000x1000  259.079  193.042  101.329  61.096  55.418  50.743

last      100x100      .097     .074     .054    .054    .052    .060
week's   1000x1000   28.045   15.679    8.545   5.179   3.976   4.236

final     100x100      .079     .059     .042    .037    .046    .053
version  1000x1000   26.148   13.766    7.182   4.055   2.974   3.304

Table 2

Times (seconds) for the solver (SGESL) phase of the linpack problem

        problem size    1PE     2PEs     4PEs    8PEs   16PEs   32PEs
        ------------    ---     ----     ----    ----   -----   -----
asis      100x100     .0015
results  1000x1000    .1786

lapack    100x100     .0007
results  1000x1000    .0556

final     100x100     .002      .016     .015    .015    .015    .016
version  1000x1000    .232     1.504    1.449   1.447   1.447   1.474

To summarize the results of these three newsletters:
  1. LAPACK is hard to beat. Every time we measure the efficiency of a parallel implementation, we should compare it against the best scalar implementation; comparing against an unoptimized scalar version is like setting up a straw man.
  2. For large problems, those that can amortize the cost of the subroutine call, the optimized BLAS routines are still a clear win.
  3. Distributing the two-dimensional matrix can effectively load balance the work of the factorization in SGEFA, but parallelizing the SGESL routine is much harder. (Notice in Table 2 how the parallelization of SGEFA has pushed computation onto SGESL.)
  4. Load balancing alone is not enough for an optimal implementation; the distributed computation must also be as local as possible, because the cache can only be used when the data being operated on is local.

Modified Version of SGEFA:


        subroutine sgefa(a,lda,n,ipvt,info)
        integer lda,n,info
        real a(lda,lda)
        real temp( 1024 )
        integer ipvt(lda)
  cdir$ shared a(:,:block(1)), ipvt(:)
        intrinsic my_pe, home
  c
  c     sgefa factors a real matrix by Gaussian elimination.
  c
  c     sgefa is usually called by sgeco, but it can be called
  c     directly with a saving in time if  rcond  is not needed.
  c     (time for sgeco) = (1 + 9/n)*(time for sgefa) .
  c
  c     on entry
  c
  c        a       real(lda, n)
  c                the matrix to be factored.
  c
  c        lda     integer
  c                the leading dimension of the array  a .
  c
  c        n       integer
  c                the order of the matrix  a .
  c
  c     on return
  c
  c        a       an upper triangular matrix and the multipliers
  c                which were used to obtain it.
  c                the factorization can be written  a = l*u  where
  c                l  is a product of permutation and unit lower
  c                triangular matrices and  u  is upper triangular.
  c
  c        ipvt    integer(n)
  c                an integer vector of pivot indices.
  c
  c        info    integer
  c                = 0  normal value.
  c                = k  if  u(k,k) .eq. 0.0 .  this is not an error
  c                     condition for this subroutine, but it does
  c                     indicate that sgesl or sgedi will divide by zero
  c                     if called.  use  rcond  in sgeco for a reliable
  c                     indication of singularity.
  c
  c     linpack. this version dated 08/14/78 .
  c     cleve moler, university of new mexico, argonne national lab.
  c
  c     subroutines and functions
  c
  c     blas saxpy,sscal,isamax
  c
  c     internal variables
  c
        real t
        integer isamax,j,k,kp1,l,nm1
  c
  c     Gaussian elimination with partial pivoting
  c
        me = my_pe()
        info = 0
        nm1 = n - 1
        if (nm1 .lt. 1) go to 70
        do 60 k = 1, nm1
           if( home( a( k, k ) ) .eq. me ) then
  c
  c find l = pivot index
  c
              l = isamax(n-k+1,a(k,k),1) + k - 1
              ipvt(k) = l
  c
  c interchange if necessary
  c
              if (l .eq. k) go to 10
                 t = a(l,k)
                 a(l,k) = a(k,k)
                 a(k,k) = t
     10       continue
  c
  c check for zero pivot
  c
              if( a( k, k ) .eq. 0.0 ) goto 70
  c
  c compute multipliers
  c
              t = -1.0e0/a(k,k)
              call sscal(n-k,t,a(k+1,k),1)
           endif
           call barrier()
           l = ipvt( k )
  c
  c row elimination with column indexing
  c
           me0 = home( a( 1, k+1 ) )
           istart = k+1
           if( me .gt. me0 ) istart = k+1 + ( me - me0 )
           if( me .lt. me0 ) istart = k+1 + ( N$PES - ( me0 - me ) )
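  c
  c under the :block(1) distribution the columns of a are dealt out
  c cyclically, one column per PE, so home( a( 1, k+1 ) ) is the PE
  c that owns column k+1.  istart is the first column at or beyond
  c k+1 that lives on this PE; the do 30 loop below then strides
  c through this PE's own columns with step N$PES.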
  c
  c make local copies
  c
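  c (copying column k of the shared array a into the private array
  c  temp lets the saxpy below read the multipliers from local memory
  c  and cache, instead of re-reading the shared array, which for
  c  most PEs means a remote fetch, on every column update)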
           do 29 j = k+1, n
              temp( j ) = a( j, k )
    29     continue
           do 30 j = istart, n, N$PES
              t = a(l,j)
              if (l .eq. k) go to 20
                 a(l,j) = a(k,j)
                 a(k,j) = t
     20       continue
  c           call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
              call saxpy(n-k,t,temp(k+1),1,a(k+1,j),1)
  c           do 21 i = 1, n-k
  c              a(k+i,j)=t*temp(k+i)+a(k+i,j)
  c              a(k+i,j)=t*a(k+i,k)+a(k+i,j)
  c  21       continue
     30       continue
     60 continue
     70 continue
        ipvt(n) = n
        if (a(n,n) .eq. 0.0e0) info = n
        return
        end

BLACS, PBLAS and ScaLAPACK

In past newsletters, I've passed on requests for examples of using the ScaLAPACK routines on the T3D. I had hoped that someone would send in some examples before I leave, so that I could learn how to use these routines. No such luck. This week and next I will report on my progress in using them, and I will direct my examples at solving the linpack problem (like what else is there?). CRI has an extensive set of man pages for the BLACS, PBLAS and ScaLAPACK on Denali, but I haven't found any documents that give examples or the overall big picture.

In the documentation available on the Oak Ridge National Laboratory web server (http://www.netlib.org/index.html), I found the diagram below (redrawn here for an ASCII newsletter), which shows the relationship between the different components, all of which are available for the T3D in libsci.a:


                            ScaLAPACK
                             /     \
                            /       \
                           /       PBLAS
                          /       /   \
  Global routines        /       /     \       Global routines
  ------------------------------------------------------------
  Local routines       /       /         \     Local routines
                      /       /           \
                     /       /             \
                  LAPACK     /             BLACS
                      \     /                 \
                       \   /                   \
                        \ /                     \
                       BLAS                   shmems
From the ScaLAPACK test driver in the installation suite, I cobbled together the following program; its output follows the listing. The program sets up a 128 by 128 array distributed among 8 T3D PEs. Next week, I'll try to solve, with ScaLAPACK routines, the system of linear equations described by this distributed matrix. I hope that by matching up the output with the commented program, a user can understand the flow.

        implicit none
        integer lda
        parameter( lda = 128 )
  c
  c sets up a column-wise distributed array that looks like:
  c      
  c      PE0  PE1 PE2 PE3 PE4 PE5 PE6 PE7 PE0        PE5 PE6 PE7
  c        1   1   1   1   1   1   1   1   1  ....    1   1   1
  c        1   2   1   1   1   1   1   1   1  ....    1   1   1
  c        1   1   3   1   1   1   1   1   1  ....    1   1   1
  c        1   1   1   4   1   1   1   1   1  ....    1   1   1
  c        1   1   1   1   5   1   1   1   1  ....    1   1   1
  c        1   1   1   1   1   6   1   1   1  ....    1   1   1
  c        1   1   1   1   1   1   7   1   1  ....    1   1   1
  c        1   1   1   1   1   1   1   8   1  ....    1   1   1
  c        1   1   1   1   1   1   1   1   9  ....    1   1   1
  c        .   .   .   .   .   .   .   .   .  ....    1   1   1
  c        .   .   .   .   .   .   .   .   .  ....    1   1   1
  c        .   .   .   .   .   .   .   .   .  ....    1   1   1
  c        1   1   1   1   1   1   1   1   1  ....    1   1   1
  c        1   1   1   1   1   1   1   1   1  ....    1   1   1
  c        1   1   1   1   1   1   1   1   1  ....  126   1   1
  c        1   1   1   1   1   1   1   1   1  ....    1 127   1
  c        1   1   1   1   1   1   1   1   1  ....    1   1 128
  c
        integer ictxt      ! context descriptor for grid of processors
        integer nprocs     ! number of processors
        integer iam        ! processor number in the range [0 to nprocs-1] 
        integer info( 1 )  ! an array for passing information with igsum2d
        integer nrow       ! global number of rows
        integer ncol       ! global number of columns
        integer nprow      ! number of rows in the processor grid
        integer npcol      ! number of columns in the processor grid
        integer myrow      ! this processor's row coordinate in the grid
        integer mycol      ! this processor's column coordinate in the grid
        integer who( 1 )   ! an array for passing information
        integer idesc( 8 ) ! a distributed array descriptor
        integer i, j
        real a( lda, 16 )  ! the local portion of the distributed array
        real work          ! work space for the T3D pblas routines
  c
  c determine how many processors and local processor number
  c
        call blacs_pinfo( iam, nprocs )
        print *, "I am ", iam, " of ", nprocs
  c
  c allocate internal storage area for pblas routines
  c
        call initbuff( work, 100000 )
  
        who( 1 ) = iam
  c
  c initialize processors in Column major grid 
  c
        nprow = 1
        npcol = nprocs
        call blacs_gridinit( ictxt, 'Column-major', nprow, npcol )
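  c
  c query the grid we just built: blacs_gridinfo fills myrow and
  c mycol (declared above) with this processor's coordinates; with a
  c 1 x nprocs grid, myrow is always 0.  (This call is added for
  c illustration only and is not used further below.)
  c
        call blacs_gridinfo( ictxt, nprow, npcol, myrow, mycol )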
  c
  c Sum the local values of who( 1 ) from all processors into who( 1 )
  c i.e. 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28, see output below
  c
        call igsum2d( ictxt, 'All', ' ', 1, 1, who, 1, -1, 0 )
        print *, "On ", iam, " who( 1 ) = ", who( 1 )
  
        nrow = 128         ! set global number of rows
        ncol = 128         ! set global number of columns
  c
  c describe a distributed array over the grid of processors
  c
        call descinit( idesc, nrow, ncol, nprow, npcol, 0, 0,
       +               ictxt, lda, info( 1 ) )
  c
  c collect the status of the descinit routine from each processor
  c
        call igsum2d( ictxt, 'All', ' ', 1, 1, info, 1, -1, 0 )
        if( info( 1 ) .lt. 0 ) then
           if( iam .eq. 0 ) then
              print *, "Status from descinit = ", info
              stop
           endif
        else
           if( iam .eq. 0 ) then
  c
  c print out the descriptor of the distributed array
  c
              print *, "idesc( 1 ) = ", idesc( 1 )
              print *, "idesc( 2 ) = ", idesc( 2 )
              print *, "idesc( 3 ) = ", idesc( 3 )
              print *, "idesc( 4 ) = ", idesc( 4 )
              print *, "idesc( 5 ) = ", idesc( 5 )
              print *, "idesc( 6 ) = ", idesc( 6 )
              print *, "idesc( 7 ) = ", idesc( 7 )
              print *, "idesc( 8 ) = ", idesc( 8 )
           endif
        endif
  c
  c initialize the local portions of the distributed array
  c
        do 20 j = 1, lda / nprocs
           do 10 i = 1, lda
              a( i, j ) = 1                      ! off diagonal elements
    10       continue
    20  continue
        do 30 i = 1, lda
           if( mod( i-1, nprocs ) .eq. iam ) then   ! this PE owns column i
              a( i, ( i - 1 ) / nprocs + 1 ) = i    ! diagonal elements
           endif
    30  continue
  c
  c first 32 rows on the first processor
  c
        if( iam .eq. 0 ) then
           do 40 i = 1, 32
              write( 6, 600 ) ( a( i, j ), j = 1, 16 )
    40       continue
        endif
        call barrier()
  c
  c first 32 rows on the second processor
  c
        if( iam .eq. 1 ) then
           write( 6, 601 )
           do 50 i = 1, 32
              write( 6, 600 ) ( a( i, j ), j = 1, 16 )
    50     continue
        endif
   600  format( 16f4.1 )
   601  format( / )
        end

Output from the Above Program:


  I am 7 of 8
  I am 2 of 8
  I am 3 of 8
  I am 5 of 8
  I am 6 of 8
  I am 4 of 8
  I am 0 of 8
  I am 1 of 8

  On 7 who( 1 ) = 28
  On 2 who( 1 ) = 28
  On 3 who( 1 ) = 28
  On 5 who( 1 ) = 28
  On 6 who( 1 ) = 28
  On 4 who( 1 ) = 28
  On 0 who( 1 ) = 28
  On 1 who( 1 ) = 28

  idesc( 1 ) = 128
  idesc( 2 ) = 128
  idesc( 3 ) = 1
  idesc( 4 ) = 8
  idesc( 5 ) = 0
  idesc( 6 ) = 0
  idesc( 7 ) = 0
  idesc( 8 ) = 128

  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 9.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.017.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.025.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0


  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  2.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.010.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.018.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.026.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
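
Looking ahead to next week's solve, the fragment below sketches what still has to be added to the program above: the right-hand side needs its own local storage and descriptor (built with the same descinit call), and then a ScaLAPACK driver such as PSGESV factors and solves the distributed system. This is only a sketch under my assumptions: the names b, bdesc and ipiv are mine, and the psgesv argument list follows the public ScaLAPACK release, which I have not yet checked against the 8-word descriptors used by the T3D libsci routines.

  c
  c declarations to add to the program above (the names are mine)
  c
        integer bdesc( 8 )        ! descriptor for the distributed rhs
        integer ipiv( lda + 8 )   ! local pivot indices (assumed size)
        real b( lda )             ! local piece of a 128 x 1 rhs
  c
  c describe b as a 128 x 1 array distributed over the same process
  c grid; each PE then fills whatever piece of b it owns
  c
        call descinit( bdesc, nrow, 1, nprow, npcol, 0, 0,
       +               ictxt, lda, info( 1 ) )
  c
  c factor and solve a * x = b in one call; this argument list is the
  c public ScaLAPACK psgesv and still has to be verified against the
  c libsci man pages
  c
        call psgesv( nrow, 1, a, 1, 1, idesc, ipiv,
       +             b, 1, 1, bdesc, info( 1 ) )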

A Call for Material

If you have discovered a good technique or information on the T3D and you think it might benefit others, then send it to the email address below and it will be passed on through this newsletter.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.