ARSC T3D Users' Newsletter 96, July 19, 1996

ARSC's MPP Queues

ARSC's T3D has been busy: 59 - 77% utilized during the last 4 months. Thanks to a cooperative spirit among users, access to the big queues seems to be reasonably equitable.

This seemed, however, like a good time to re-state the "gentlemen's agreement" which has helped maintain fair access. Simply put: please don't have more than one job (queued or executing) in any given MPP queue at a time.

The following description of this "agreement" can be found on denali by typing:

  denali> news t3dbatch

  Queue name   Number  Max jobs   Time limit   When queue
               of PEs  per queue   on queue    is enabled

  m_128pe_5m    128        1       5 minutes   after downtimes
  m_128pe_8h    128        1       8 hours     Friday 6PM to Sunday 4AM
  m_64pe_24h     64        1      24 hours     always
  m_64pe_10m     64        1      10 minutes   always
  m_32pe_24h     32        1      24 hours     always
  m_16pe_24h     16        2      24 hours     always
  m_8pe_24h       8        2      24 hours     always

A request made to these queues will be run as soon as enough PEs are available to satisfy the request.

Please submit/execute, at most, one job per MPP queue at a time. ARSC does not enforce this "queue ethic" (by actively blocking or deleting user jobs), but has found that users are willing to work together to provide fair access to the queues.

As an example of the "queue ethic," if user "jackalope" submitted a job to the m_64pe_24h queue, he would not submit another until the first had run to completion and thus made the queue available to other users. Meanwhile, "jackalope" could have one job in the m_32pe_24h queue, if he wanted, without violating the ethic.

If you feel that the queues are being misused, please contact us and we will work actively to resolve the situation:


  Tom Baring:   baring@arsc.edu   907-474-1899

18:1 Speedup Demonstrated: Free, by Using System Libraries

[ Don Morton, visiting ARSC from Cameron University, provides this article. See his contact information, below. ]

Many of us have come to view the art of T3D programming as one of maximizing parallelism. We concentrate on our speedups, and if they're close to linear we may think we've done all we can to enhance performance. Obviously, we want our code to be as scalable as possible, but the bottom line is that we want greater throughput.

In many codes, the throughput can be improved by enhancing the per-processor performance. This might make our speedups less than perfect, since communication costs and inherently sequential sections of code would take up an increasing portion of the overall processing time, but what's more important - a perfect speedup curve or greater throughput? Often, if we can enhance the throughput, we can consider solving larger problems.

Scientific codes often contain vector, vector-matrix, and matrix-matrix operations which can be greatly optimized by making use of the Basic Linear Algebra Subprograms (BLAS). Although vector programmers should already be aware of this, the information is also useful on non-vector processors.

As a simple example, let's consider some vector operations. The code which follows begins with three vectors, x, y, and z. They are initialized, then both x and y are scaled by dividing by their largest-magnitude element. Finally, z is obtained by adding x, multiplied by some factor (alpha), and y. The latter is a classic SAXPY operation (scalar alpha x plus y), and looks like


  z <-- alpha*x + y
where x, y, and z are vectors and alpha is a scalar value. Saxpy operations are typically very efficient, because vector entries are stored in contiguous logical memory locations. Although cache replacement algorithms vary between architectures, in general, when the first entry of a vector is referenced, it is copied to cache along with a block of memory which follows it, under the assumption that more entries of the vector will be referenced in the near future. Thus, when the second entry of the vector is referenced, it will already be in cache, providing rapid access. The lesson here is that if we can express our code so that contiguous memory locations are referenced in each loop iteration, memory access will be more efficient. Additionally, this allows us to make use of highly-optimized libraries for even greater performance.
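
As a small illustration of this point, here is a minimal sketch of my own (it is not one of the timed examples in this article) that contrasts stride-1 and strided access when summing the entries of a two-dimensional array. Fortran stores arrays by column, so the version with i in the innermost loop walks through contiguous memory, while the version with j innermost jumps n words between references.


        program stride

  ccc   sketch: compare contiguous (stride-1) and strided access
        parameter(n=1000)
        real a(n,n)

        do j=1,n
           do i=1,n
              a(i,j) = 1.0
           enddo
        enddo

  ccc   contiguous access: the inner loop runs down a column of a
        s1 = 0.0
        do j=1,n
           do i=1,n
              s1 = s1 + a(i,j)
           enddo
        enddo

  ccc   strided access: consecutive references are n words apart
        s2 = 0.0
        do i=1,n
           do j=1,n
              s2 = s2 + a(i,j)
           enddo
        enddo

        print *, s1, s2
        stop
        end

Timing the two summation loops (with SECONDR(), as in the examples below) will generally show the difference in memory behavior, even though the arithmetic performed is identical.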

The first section of the following code displays the scaling and saxpy operations discussed above. The second half of the code shows the same operations performed by calling the BLAS routines ISAMAX, to find the index of the largest-magnitude element, SSCAL, to scale the vectors, and SAXPY, to perform the saxpy operation. All of these routines are documented in man pages on denali. When compiled for the T3D with the Fortran 90 compiler, the code required 1.43 seconds to perform these operations without BLAS routines, and 0.25 seconds with BLAS. CRI has put considerable effort into optimizing these routines, so by recognizing operations which can benefit from BLAS, we can simplify our code and achieve better per-processor performance, which of course leads to better throughput.


        program blastest

        parameter(isize=1000000, alpha=3.14)

        real x(isize), y(isize), z(isize)
    


  ccc   initialize x and y to some random values
        do i=1,isize
           x(i) = RANF()
           y(i) = RANF()
        enddo

  ccccccccccccccccc   NON-BLAS
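  ccc   SECONDR() returns elapsed wall-clock time in seconds, so
  ccc   timer2 - timer1 below measures the wall time for this section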

        timer1 = SECONDR()

  ccc   scale each of these vectors by their largest absolute value
        index = 1
        do i=2,isize
           if ( ABS(x(i)) .gt. ABS(x(index)) ) index = i
        enddo
        scale = 1.0/x(index)

        do i=1,isize
           x(i) = x(i)*scale
        enddo

        index = 1
        do i=2,isize
           if ( ABS(y(i)) .gt. ABS(y(index)) ) index = i
        enddo
        scale = 1.0/y(index)

        do i=1,isize
           y(i) = y(i)*scale
        enddo

  ccc   saxpy operation
        do i=1,isize
           z(i) = alpha*x(i) + y(i)
        enddo

        timer2 = SECONDR()
        print *, timer2 - timer1, ' seconds for non-BLAS'


  ccccccccccccccccc   Same operations as above, this time using BLAS routines
  ccc   initialize x and y to some random values
        do i=1,isize
           x(i) = RANF()
           y(i) = RANF()
        enddo


        timer1 = SECONDR()

  ccc   scale each of these vectors by their largest absolute value
        index = ISAMAX(isize, x, 1)
        scale = 1.0/x(index)
        call SSCAL(isize, scale, x, 1)
  
        index = ISAMAX(isize, y, 1)
        scale = 1.0/y(index)
        call SSCAL(isize, scale, y, 1)

  ccc   saxpy operation
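  ccc   (note: SAXPY overwrites y with alpha*x + y; the non-BLAS
  ccc   version above stored the result in a separate vector z)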
        call SAXPY(isize, alpha, x, 1, y, 1)

        timer2 = SECONDR()
        print *, timer2 - timer1, ' seconds wall time for BLAS'

        stop
        end

Even more improvement can be realized with matrix operations. As another simple example, let's consider matrix-matrix multiplication. We have all learned that two n x n matrices can be multiplied by placing the following statement inside a set of three nested loops, iterating on i, j, and k:

  c(i,j) <-- c(i,j) + a(i,k)*b(k,j) 

No matter how we arrange the three nested loops (there are six possible ways), we will obtain the same answer. However, the time it takes to get the answer is influenced by the loop ordering.

In the code which follows, after initializing a and b with random values, and c with zeroes, I demonstrate what I call a "naive" approach - we simply code up our textbook example and get our answer.

In the next multiplication, which I call "structured," I arrange the nested loops so that I have a saxpy operation. In Fortran, which stores matrices in column-major ordering, the innermost loop will iterate down the columns of a and c. Because the columns will map to contiguous logical memory locations, they can be viewed just as the vectors in a saxpy operation. As the innermost loop iterates, b(k,j) is constant, and thus we can think of it as a scalar value.

In the third multiplication, the innermost loop is replaced with the BLAS SAXPY routine.

Finally, in the fourth multiplication, I make use of the BLAS3 (matrix-matrix) routine SGEMM to replace all of the loops.



        program matmult

        parameter(isize=512)
        dimension a(isize, isize), b(isize, isize), c(isize,isize)

  ccc   initialize a and b to random values, c to zero
        do j=1,isize
           do i=1,isize
              a(i,j) = RANF()
              b(i,j) = RANF()
              c(i,j) = 0.0
           enddo
        enddo

  ccc   Naive matrix multiplication
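  ccc   (note: with k innermost, successive references to a(i,k) are
  ccc   isize words apart in memory, since Fortran is column-major,
  ccc   so this ordering makes poor use of cache)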
        timer1 = SECONDR()
        do i=1,isize
           do j=1,isize
              do k=1,isize
                 c(i,j) = c(i,j) + a(i,k)*b(k,j)
              enddo
           enddo
        enddo
        timer2 = SECONDR()

        print *, 'Time for naive matrix multiplication: ', 
       &         timer2-timer1

  ccc   Structured matrix multiplication
  ccc   initialize c to zero
        do j=1,isize
           do i=1,isize
              c(i,j) = 0.0
           enddo
        enddo

        timer1 = SECONDR()
        do k=1,isize
           do j=1,isize
              do i=1,isize
                 c(i,j) = c(i,j) + a(i,k)*b(k,j)
              enddo
           enddo
        enddo
        timer2 = SECONDR()

        print *, 'Time for structured matrix multiplication: ', 
       &         timer2-timer1

  ccc   BLAS1
  ccc   initialize c to zero
        do j=1,isize
           do i=1,isize
              c(i,j) = 0.0
           enddo
        enddo

        timer1 = SECONDR()
        do k=1,isize
           do j=1,isize
              call SAXPY(isize, b(k,j), a(1,k), 1, c(1,j), 1)
           enddo
        enddo
        timer2 = SECONDR()
        print *, 'Time for BLAS1 matrix multiplication: ', 
       &         timer2-timer1

  ccc   BLAS3
  ccc   initialize c to zero
        do j=1,isize
           do i=1,isize
              c(i,j) = 0.0
           enddo
        enddo

        timer1 = SECONDR()
        call SGEMM('N', 'N', isize, isize, isize, 1.0, a, isize, 
       &            b, isize, 1.0, c, isize)
        timer2 = SECONDR()
        print *, 'Time for BLAS3 SGEMM matrix multiplication: ', 
       &         timer2-timer1


        stop
        end

Timings for the code, in seconds, are as follows:

  Time for naive matrix multiplication:       46.09
  Time for structured matrix multiplication:  34.33
  Time for BLAS1 matrix multiplication:       11.18
  Time for BLAS3 SGEMM matrix multiplication:  2.52

That is an 18:1 speedup, from 46.09 seconds for the naive multiplication down to 2.52 seconds with SGEMM, obtained largely by calling a library routine. I have only scratched the surface of linear algebra optimization, but I hope this has demonstrated the benefit of looking more closely at our operations to see where we might gain by calling BLAS routines. With a little effort, we might find it possible to use 16 PE's where we once required 64, simply by improving the per-processor performance.

Closing points - LAPACK routines, for the solution of various linear systems, are also available on denali. If you're worried about portability, both BLAS and LAPACK sources are available from Netlib:


http://www.netlib.org/
and have been implemented on numerous systems ranging from Linux PC's to supercomputers. In fact, the use of these routines allows us to implement operations in a portable, yet often optimized, manner.
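
For instance, solving a dense linear system a*x = b takes a single call to the LAPACK driver routine SGESV. The following is a minimal sketch of my own (not one of the timed examples above); note that SGESV overwrites the matrix with its LU factors and the right-hand side with the solution, so copy them first if the originals are needed again.


        program lasolve

        parameter(n=500)
        real a(n,n), b(n)
        integer ipiv(n), info

  ccc   fill a with random values and set b = a * (1,1,...,1), so
  ccc   the exact solution of a*x = b is a vector of ones
        do j=1,n
           do i=1,n
              a(i,j) = RANF()
           enddo
        enddo

        do i=1,n
           b(i) = 0.0
        enddo
        do j=1,n
           do i=1,n
              b(i) = b(i) + a(i,j)
           enddo
        enddo

  ccc   SGESV solves a*x = b; on return, a holds the LU factors,
  ccc   ipiv the pivot indices, and b the solution x
        call SGESV(n, 1, a, n, ipiv, b, n, info)

        if (info .ne. 0) print *, 'SGESV returned info = ', info
        print *, 'b(1) and b(n) should be near 1.0: ', b(1), b(n)

        stop
        end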

  =====================================================================
  Don Morton                              Email: morton@arsc.edu
  Visiting Scientist (summer)             Voice: (907) 474-5507
  Arctic Region Supercomputing Center     Fax:   (907) 450-8601
  University of Alaska
  Fairbanks, AK 99775
  http://grizzly.cameron.edu/morton.html
  =====================================================================

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.