ARSC T3D Users' Newsletter 85, May 3, 1996

How Much does that Division Cost?

The DEC Alpha processor is an IEEE processor. It supports both the formats and the arithmetic operations defined by the ANSI/IEEE Std 754-1985 document, produced by the Floating-Point Working Group of the Microprocessor Standards Subcommittee of the Standards Committee of the IEEE Computer Society. This document specifies what is colloquially called 'IEEE arithmetic'. I can U.S. mail copies of this document to whoever requests it.

The standard specifies that implementations compute all operations to within 1/2 ulp (ulp = unit of least precision), provide 4 rounding modes, implement the special quantities Infinity, NaN, and signed zero, raise the correct signals on the specified exceptions, and allow users to install trap handlers that run when an exception occurs. The standard treats the four arithmetic operations of addition, subtraction, multiplication, and division as fundamental, requiring the same precision from each.
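To make the 1/2 ulp requirement concrete, here is a minimal Fortran 90 sketch (it relies only on the standard SPACING intrinsic, nothing T3D-specific): one ulp at 1.0 in IEEE double precision is 2**(-52), so a correctly rounded operation whose exact result lies near 1.0 is off by at most 2**(-53).

      program ulpdemo
      double precision x
      x = 1.0d0
c     spacing(x) is the gap to the next representable number at x,
c     i.e. one ulp; for IEEE double precision at 1.0 this is 2**(-52)
      write( 6, * ) 'one ulp at 1.0  = ', spacing( x )
c     a correctly rounded result is within half an ulp of the exact value
      write( 6, * ) 'half ulp at 1.0 = ', 0.5d0 * spacing( x )
      end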

From the point of view of the hardware engineer, however, the four functions differ considerably in circuitry costs and execution speeds. Below is a table I've accumulated for these arithmetic functions on various machines:


                     <-----------execution speeds----------->

   machine           addition     multiplication     division

   Univac 1108          2.2  us        3.0  us        8.63 us
   IBM 360/85            .38 us        1.36 us        1.64 us
   CDC 6600              .4  us        1.   us        2.9  us
   VAX 11/750 (in C)   10.6  us       15.7  us       15.7  us

   Burroughs BSP      320    ns      320    ns     1280    ns
   Cray-1              75    ns       87.5  ns      362.5  ns

   CDC 205              5    cps       5    cps      54    cps
   Cray X-MP            6    cps       7    cps      29    cps
   Intel 387        29-37    cps   32-57    cps      94    cps
   Cray T3D             6    cps       6    cps      61    cps

   us = microseconds, ns = nanoseconds, cps = clock periods
Certainly, as implemented, the operations take varying amounts of time, and underlying that, they are built from varying amounts of silicon. Of course, it's the last bits of each result that are hardest to get correct, and there were significant speed vs. precision tradeoffs before the IEEE standard preempted them. For example, on the Cray-1, with only 48 bits in the mantissa (52 for IEEE double precision), we had:

  addition          is chopped
  multiplication    is "statistically" rounded
  division          is a reciprocal followed by a multiply
                    (could be off by 2 ulps; see the sketch below)
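To see why a reciprocal followed by a multiply can drift from the correctly rounded quotient, here is a small sketch that compares a true divide against a*(1.0/b) for random operands. It uses the same ranf() generator as the driver program later in this article, so it assumes a Cray-style Fortran environment; it simply reports the worst relative difference it finds (the reciprocal and the multiply each round, so the combined result can be an ulp or two off).

      program recipdiv
      integer i
      real a, b, exact, approx, worst
      worst = 0.0
      do i = 1, 100000
         a = ranf()
         b = ranf()
c        the IEEE divide, correctly rounded to within 1/2 ulp
         exact  = a / b
c        reciprocal followed by a multiply: two roundings, so the
c        result can be an ulp or two away from the true quotient
         approx = a * ( 1.0 / b )
         worst  = max( worst, abs( exact - approx ) / exact )
      enddo
      write( 6, * ) 'worst relative error = ', worst
      end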
But there are still tradeoffs to be made in software for the most expensive operation, division. There are always several options:
  1. use 32-bit instead of 64-bit arithmetic (not an option on the Cray vector machines)
  2. use optimized library functions
  3. do a non-IEEE divide
Because the IEEE standard treats division as a single arithmetic operation, it is against the standard to replace:

  a / b     with     a * ( 1. / b )
either in hardware or in the compiler. But there are times when the denominator is constant within a loop, and then such a transformation seems reasonable:

      do 10 i = 1, large_number
         a( i ) = b( i ) / scale
  10  continue
could be replaced by:

      qscale = 1.0 / scale
      do 10 i = 1, large_number
         a( i ) = b( i ) * qscale
  10  continue
Here we have a large_number of divisions being replaced by one division and a large_number of multiplications. The table below presents timings (in seconds) on a single T3D PE for 1 million "divisions" in a loop, with various options:

                        varying denominators  constant denominator
                        --------------------  --------------------
  f90 64 bits                     .549               .501 
  cft77 -o ieeedivide             .543               .493
  cft77 -o noieeedivide           .543               .108
  benchlib oneover_v function     .336           not applicable
  f90 32 bits                     .305               .277
Notes on the table:
  1. cf77/cft77 does not support 32-bit arithmetic (too bad!)
  2. The F90 compiler does not support the -o noieeedivide switch (too bad!!)
  3. The benchlib routine oneover_v (newsletters #34 (5/5/95) and #74 (2/16/96)) is a replacement for:

       do i = 1, n
          a( i ) = 1.0 / b( i )
       enddo

     with:

       call oneover_v( n, a, b )

The code for the varying denominator case is given below:

      parameter( MAX = 1000000 )
      real   qieee( MAX ), qnoieee( MAX ), qbench( MAX ), num( MAX )
      real*4 q32ieee( MAX ), rnum( MAX )

      do i = 1, MAX
         qieee( i )   = ranf()
         qnoieee( i ) = qieee( i )
         qbench( i )  = qieee( i )
         q32ieee( i ) = qieee( i )
         num( i )     = ranf()
         rnum( i )    = num( i )
      enddo

      call ieee( MAX, num, qieee, tieee )
      call noieee( MAX, num, qnoieee, tnoieee )
      error1 = 0.0
      do i = 1, MAX
         error1 = error1 + ( qieee( i ) - qnoieee( i ) ) ** 2
      enddo

      call ieee32( MAX, rnum, q32ieee, t32ieee )
      error2 = 0.0
      do i = 1, MAX
         error2 = error2 + ( qieee( i ) - q32ieee( i ) ) ** 2
      enddo

      call bench( MAX, num, qbench, tbench )
      error3 = 0.0
      do i = 1, MAX
         error3 = error3 + ( qieee( i ) - qbench( i ) ) ** 2
      enddo

      write( 6, 601 ) MAX, tieee, tnoieee, t32ieee, tbench
 601  format( i8, 4f6.3 )
      write( 6, 602 ) MAX, sqrt(error1), sqrt(error2), sqrt(error3)
 602  format( i8, 3e16.8 )
      end

      real function second( )
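c     irtc() returns the real-time clock in clock periods; the T3D's
c     clock runs at 150 MHz, so dividing by 150000000.0 gives seconds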
      second = dble( irtc( ) ) / 150000000.0
      end

      subroutine ieee( length, num, q, time )
      real q( length ), num( length )
      time = second()
         do i = 1, length
            q( i ) = num( i ) / q( i )
          enddo
      time = second() - time
      end

      subroutine ieee32( length, num, q, time )
      real*4 q( length ), num( length )
      time = second()
         do i = 1, length
            q( i ) = num( i ) / q( i )
          enddo
      time = second() - time
      end

      subroutine bench( length, num, q, time )
      real q( length ), num( length )
      time = second()
         call oneover_v( length, q, q )
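c        oneover_v overwrites q with 1.0/q in place, so the divide
c        finishes as a multiply in the loop below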
         do i = 1, length
            q( i ) = num( i ) * q( i )
          enddo
      time = second() - time
      end

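The noieee subroutine called from the driver isn't shown above; presumably it is just the ieee loop compiled separately with cft77 -o noieeedivide, so that the compiler is free to turn the divide into a reciprocal and a multiply. A sketch of what that source would look like (the flag lives on the compile line, not in the code):

      subroutine noieee( length, num, q, time )
      real q( length ), num( length )
      time = second()
c        same source as ieee(); compiling this file with
c        "cft77 -o noieeedivide" lets the compiler replace the divide
c        with a (faster, slightly less accurate) reciprocal-multiply
         do i = 1, length
            q( i ) = num( i ) / q( i )
         enddo
      time = second() - time
      end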

The Linpack Benchmark on the T3D

Oftentimes, Linpack is dismissed as a 'one loop benchmark' because all of the performance numbers are determined by a single nest of DO loops. That nest is the core of Gaussian Elimination and looks like:

              do 30 j = kp1, n
                 t = a(l,j)
                 if (l .eq. k) go to 20
                    a(l,j) = a(k,j)
                    a(k,j) = t
     20          continue
  c              call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
                 do 21 i = 1, n-k
                    a(k+i,j)=t*a(k+i,k)+a(k+i,j)
     21          continue
     30       continue
where I have replaced the call to the saxpy BLAS1 routine with an equivalent DO loop, much like several inliners would do. But even with such a simple loop, there are many potentially effective optimizations. Some of the possibilities I investigated were:

  A. Asis    - The source code from Oak Ridge National Labs without any
               modifications
  B. Libsci  - The source code with the BLAS1 routines saxpy, sdot, and
               isamax deleted, so that they will be supplied by the
               optimized versions in libsci.a.
  C. Libsci  - The source code with the BLAS1 routines saxpy, sdot, and
               isamax, and the Linpack routines sgefa and sgesl, deleted,
               so that they will be supplied by the optimized versions in
               libsci.a. (Not an option on the T3D, as the Linpack
               library is not part of libsci.a.)
  D. Inline1 - Use either the cf77 inliner or hand inlining with vi to
               replace calls to saxpy, sdot, and isamax.
  E. Inline2 - Use the cf77 inliner and fpp to replace the entire DO
               loop nest with a call to the BLAS2 routine SGERX. The
               above loop is replaced with:

      CDIR@ IVDEP
            DO 30 J = 1, N - KP1 + 1
               T1U(J) = A(L,KP1+J-1)
               IF (L .NE. K) THEN
                  A(L,KP1+J-1) = A(K,KP1+J-1)
                  A(K,KP1+J-1) = T1U(J)
               ENDIF
               J3X = N - K
         30 CONTINUE
            CALL SGERX@(J3X,N-KP1+1,1.,A(1+K,K),1,T1U,1,A(1+K,KP1),1,LDA)

               A simply remarkable transformation!
  F. Lapack    Replace the calling sequence from the Linpack library
               routines of:

                 call sgefa(a,lda,n,ipvt,info)
                 call sgesl(a,lda,n,ipvt,b,0)

               with the equivalent sequence from the Lapack library
               routines (a fuller, self-contained sketch follows this
               list):

                 call sgetrf( n, n, a, lda, ipvt, info )
                 call sgetrs( "n", n, 1, a, lda, ipvt, b, n, info )
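
Here is that sketch: a small, self-contained example of the Lapack calling sequence for one system. The 4x4 test matrix and the program name are made up purely for illustration; sgetrf/sgetrs are the standard Lapack interfaces.

      program lasolve
      integer n, lda
      parameter( n = 4, lda = 4 )
      real a( lda, n ), b( n )
      integer ipvt( n ), info, i, j
c     build a diagonally dominant matrix and a right-hand side whose
c     exact solution is all ones: ones everywhere, n+1 on the diagonal
      do j = 1, n
         do i = 1, n
            a( i, j ) = 1.0
         enddo
         a( j, j ) = real( n + 1 )
      enddo
      do i = 1, n
         b( i ) = real( 2 * n )
      enddo
c     factor a = p*l*u, then solve using the factorization
      call sgetrf( n, n, a, lda, ipvt, info )
      call sgetrs( 'n', n, 1, a, lda, ipvt, b, n, info )
      write( 6, * ) 'solution: ', ( b( i ), i = 1, n )
      end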
A summary of the performance of these methods is given below where the speed is given in MFLOPS for solving two systems of linear equations: 100x100 and 1000x1000.

 
                          <-----100x100 case----->     <----1000x1000 case---->
                          T3D:1PE T3D:1PE Y-MP:M98     T3D:1PE T3D:1PE Y-MP:M98
                                  rdahead                      rdahead
A. original source           12.4    14.1     25.0        10.7    11.4     95.0
B. libsci blas (BLAS1 fcns)  13.8    20.4     47.0        25.2    28.3    118.2
C. libsci linpack                             65.8                         97.2
D. inline1     (BLAS1 fcns)  13.7    15.6     45.4        10.9    11.3     87.5
E. inline2     (BLAS2 fcns)  25.8    26.9    120.3        35.4    35.5    190.8
F. Lapack                    35.7    36.9    125.0        67.1    67.3    266.7
There are several notes to make about the results in the above table:
  1. Lapack is a good deal. If you're still back in the Stone Age using linpack routines, you are shooting yourself in the foot. The T3D Software Development Group took a courageous (but wise and correct) policy step in NOT providing the linpack library on the T3D.
  2. The T3D BLAS2 routine STRSM has a bug in it, and originally the T3D run aborted on the 1000x1000 case. But the beauty of working with a library like Lapack (or Linpack) is that you don't have to wait for the vendor to fix it. You can just go to the Oak Ridge National Labs anonymous ftp server (netlib2.cs.utk.edu), get the source, and compile it into your program as a replacement for the vendor's buggy version. Then, if the vendor ever fixes the bug, you can delete the linking to your source version.
  3. Inlining to the SGER (or SGERX) level isn't available on the T3D because the fpp pass of cf77 is only for the Cray parallel vector processors (PVPs). But at least in this case we can use the transformational abilities of fpp to produce code for the T3D with something like:
    
      t3dline:
        -rm linpacks.j
        (export TARGET=cray-ymp; cf77 -c -Wd-e68 -Zp -J linpacks.f )
        mv linpacks.j linpacksj.f
        (export TARGET=cray-t3d; cf77 -c linpacksj.f second.f )
        mppldr -o inline linpacksj.o second.o
        inline > results.100.nrdw
        mppldr -D"rdahead=on" -o inline linpacksj.o second.o
        inline > results.100.rdw
    
    The -J flag on the cf77 compiler is used to produce the source listing that goes into the T3D version of cf77 in the above makefile rule.
  4. With CRI's use of read-ahead buffers on the T3E, it's a good idea to run with and without them to get some experience on which applications will benefit.

Multiprocessor Efforts - Memory Limitations

On the T3D we need to understand the uniprocessor results before we go on to the more challenging problem of using multiprocessors. The first question is: "What can Craft Fortran do for us?". We'll have to determine how to distribute the arrays to the multiple processors and then attack the above loop nest to share the work among the processors.

The array that matters for this application is the 2-dimensional coefficient matrix that describes the system of linear equations. The linpack benchmark overwrites this matrix with a factorization, and that factorization is then used to solve for one right-hand side. I believe the only array distribution that makes sense for MPPs on this problem is the column-cyclic distribution.

In Craft Fortran:


         parameter( lda = 128 )
         real a( lda, lda )
   cdir$ shared a( :, :block(1) )
In High Performance Fortran:

         parameter( lda = 128 )
         real a( lda, lda )
   !hpf$ distribute a( *, cyclic )
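Either way, column j of the matrix lands on PE mod(j-1,4), so each PE owns every fourth column. A throwaway sketch that just prints this mapping for the 16-column example below (ordinary Fortran, nothing Craft- or HPF-specific):

      program owners
      integer npes, ncols, j
      parameter( npes = 4, ncols = 16 )
c     with a cyclic distribution of columns over npes processors,
c     column j is owned by pe mod(j-1,npes)
      do j = 1, ncols
         write( 6, * ) 'column ', j, ' lives on pe', mod( j - 1, npes )
      enddo
      end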
This distribution (in ASCII) for a 16 by 16 matrix distributed on 4 processors looks like:

        Column number
          1    2    3    4    5    6    7    8                       13   14   15   16
       --------------------------------------------------------------------------------
  R  1   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
  o  2   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
  w  3   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
     4   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
  n  5   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
  u  6   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
  m  7   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
  b  8   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
  e  9    .    .    .    .    .    .    .    .    .    .    .    .   .    .    .    .
  r 10    .    .    .    .    .    .    .    .    .    .    .    .   .    .    .    .
    11    .    .    .    .    .    .    .    .    .    .    .    .   .    .    .    .
    12    .    .    .    .    .    .    .    .    .    .    .    .   .    .    .    .
    13   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
    14   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
    15   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
    16   pe0  pe1  pe2  pe3  pe0  pe1  pe2  pe3   .    .    .    .  pe0  pe1  pe2  pe3
       --------------------------------------------------------------------------------
This distribution has two features that no other distribution will have:
  1. Columns of the matrix will be completely contained on one processor. This will take advantage of the column orientation of the factorization and allow us to use cache efficiently on each processor.
  2. As the factorization proceeds, it updates successively smaller submatrices of the full matrix, shrinking from the upper left corner toward the lower right corner. With the cyclic distribution of columns, the work stays load balanced, because each PE has approximately the same number of columns to update.
After the distribution of the arrays has been decided on, that distribution has to be put into each subroutine that uses those arrays. This can be a big task, but for Linpack, with its 5 subroutines, it's not such a problem. The tool for this work is vi.

One more constraint has to be handled. For this distribution, Craft works only on arrays whose dimensions are powers of two. So for this problem we have:


  the  100x100  problem is solved on a  128x128  array
  the 1000x1000 problem is solved on a 1024x1024 array
  the 2000x2000 problem is solved on a 2048x2048 array
  the 4000x4000 problem is solved on a 4096x4096 array
Using the CDIR$ MASTER/CDIR$ ENDMASTER directives, we can use this distribution of arrays on any number of processors but keep all of the computation on a single processor. Of course, when PE0 needs an element located on another processor, that access will not be as fast as accessing a local element. But we still get the right answers, which is always a good start.
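A sketch of how that looks is below. It assumes the SHARED directive syntax shown earlier; in the real code the same SHARED declaration also has to go on the corresponding dummy arguments inside sgefa and sgesl, as noted above, and the diagonally dominant test matrix here is made up just so the solve has something to chew on.

      program master1pe
      integer lda, n, info, i, j
      parameter( lda = 128, n = 100 )
      real a( lda, lda ), b( lda )
      integer ipvt( lda )
cdir$ shared a( :, :block(1) )
cdir$ master
c     only pe0 executes this region; touching a column of a that
c     lives on another pe becomes a (slow) remote memory reference
      do j = 1, n
         do i = 1, n
            a( i, j ) = 1.0
         enddo
         a( j, j ) = real( n + 1 )
         b( j )    = real( 2 * n )
      enddo
      call sgefa( a, lda, n, ipvt, info )
      call sgesl( a, lda, n, ipvt, b, 0 )
      write( 6, * ) 'b(1) should be 1.0: ', b( 1 )
cdir$ endmaster
      end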

  MFLOPS for a single PE computing the linpack
  solution on a problem distributed among many PEs

  # of PEs     <-------------problem size------------->
  that contain 100x100  1000x1000  2000x2000  4000x4000
  the problem  -------  ---------  ---------  ---------
        1        9.58      2.76     too big    too big
        2        1.42      2.22       1.43     too big
        4        1.46      1.47       1.47     too big
        8        1.45      1.46       1.48       1.47
       16        1.44      1.45       1.45       1.45
       32        1.41      1.43       1.43       1.43
The table shows decreasing performance as more and more of the problem is distributed off of PE0. And now we have the dubious distinction of having solved the largest linpack problem on a single Alpha processor, at abysmal speed. Now that we understand the uniprocessor performance and the memory limitations, we can concentrate on getting multiprocessor speedup using Craft and HPF. Stay tuned for the next newsletter.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions and Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.