| Newsletter Index | Quick-Tip Index | Search Newsletters |
ARSC's T3D has been busy: 59 - 77% utilized during the last 4 months. Thanks to a cooperative spirit among users, access to the big queues seems to be reasonably equitable.
This seemed, however, like a good time to re-state the "gentlemen's agreement" which has helped maintain fair access. Simply put: please don't have more than one job (running or executing) in an MPP queue at a time.
The following description of this "agreement" can be found on denali by typing:
denali> news t3dbatchor at the URL:
http://www.arsc.edu/user/UsingT3D.htmlThe T3D batch queues were changed on January 16, 1996. The current T3D queues are:
Queue name Number Max jobs Time limit When queue
of PEs per queue on queue is enabled
m_128pe_5m 128 1 5 minutes after downtimes
m_128pe_8h 128 1 8 hours Friday 6PM to Sunday 4AM
m_64pe_24h 64 1 24 hours always
m_64pe_10m 64 1 10 minutes always
m_32pe_24h 32 1 24 hours always
m_16pe_24h 16 2 24 hours always
m_8pe_24h 8 2 24 hours always
A request made to these queues will be run as soon as enough PEs are
available to satisfy the request.
Please submit/execute, at most, one job per MPP queue at a time. ARSC does not enforce this "queue ethic" (by actively blocking or deleting user jobs), but has found that users are willing to work together to provide fair access to the queues.
As an example of the "queue ethic," if user "jackalope" submitted a job to the m_64pe_24h queue, he would not submit another until the first had run to completion and thus made the queue available to other users. Meanwhile, "jackalope" could have one job in the m_32pe_24h queue, if he wanted, without violating the ethic.
If you feel that the queues are being misused, we will work actively to resolve the situation. Contact:
Tom Baring: 907-474-1899
Many of us have come to view the art of T3D programming as one of maximizing parallelism. We concentrate on our speedups, and if they're close to linear we may think we've done all we can to enhance performance. Obviously, we want our code to be as scalable as possible, but the bottom line is that we want greater throughput.
In many codes, the throughput can be improved by enhancing the per-processor performance. This might make our speedups less than perfect, since communication costs and inherently sequential sections of code would take up an increasing portion of the overall processing time, but what's more important - a perfect speedup curve or greater throughput? Often, if we can enhance the throughput, we can consider solving larger problems.
Scientific codes often contain vector, vector-matrix, and matrix-matrix operations which can be greatly optimized by making use of Basic Linear Algebra Subroutine (BLAS). Although vector programmers should already be aware of this, the information is useful on non-vector processors, also.
As a simple example, let's consider some vector operations. The code which follows begins with three vectors, x, y, and z. They are initialized, then both x and y are scaled by dividing by their largest-magnitude element. Finally, z is obtained by adding x, multiplied by some factor (alpha), and y. The latter is a classic SAXPY operation (scalar alpha x plus y), and looks like
z <-- alpha*x + ywhere x, y, z are vectors and alpha is a scalar value. Saxpy operations are typically very efficient, as vector entries are stored in contiguous logical memory locations. Although cache replacement algorithms vary between architectures, in general, when the first entry of a vector is referenced, it, and a block of memory which follows the entry, is copied to cache under the assumption that more entries in the vector will be referenced in the near future. Thus, when the second entry of the vector is referenced it will already be in cache, providing rapid access. The lesson here is that if we can express our code so that contiguous memory locations are referenced in each loop iteration, memory access will be more efficient. Additionally, this allows us to make use of the highly-optimized libraries for even greater performance.
The first section of the following code displays the scaling and saxpy operations discussed above. The second half of the code shows the same operations performed by calling BLAS routines ISAMAX to find the index of the largest-magnitude element, SSCAL to scale the vectors, and SAXPY to perform the saxpy operation. All of these routines are documented in man pages on denali. When compiled for the T3D with the Fortran 90 compiler, the code required 1.43 seconds to perform these operations without BLAS routines, and 0.25 seconds with BLAS. CRI has put considerable effort into optimizing these routines, so by recognizing operations which can benefit with BLAS, we can simplify our code and exhibit better per-processor performance, which of course leads to better throughput.
program blastest
parameter(isize=1 000 000, alpha=3.14)
real x(isize), y(isize), z(isize)
ccc initialize x and y to some random values
do i=1,isize
x(i) = RANF()
y(i) = RANF()
enddo
ccccccccccccccccc NON-BLAS
timer1 = SECONDR()
ccc scale each of these vectors by their largest absolute value
index = 1
do i=2,isize
if ( ABS(x(i)) .gt. ABS(x(index)) ) index = i
enddo
scale = 1.0/x(index)
do i=1,isize
x(i) = x(i)/scale
enddo
index = 1
do i=2,isize
if ( ABS(y(i)) .gt. ABS(y(index)) ) index = i
enddo
scale = 1.0/y(index)
do i=1,isize
y(i) = y(i)/scale
enddo
ccc saxpy operation
do i=1,isize
z(i) = alpha*x(i) + y(i)
enddo
timer2 = SECONDR()
print *, timer2 - timer1, ' seconds for non-BLAS'
ccccccccccccccccc Same operations as above, this time using BLAS routines
ccc initialize x and y to some random values
do i=1,isize
x(i) = RANF()
y(i) = RANF()
enddo
timer1 = SECONDR()
ccc scale each of these vectors by their largest absolute value
index = ISAMAX(isize, x, 1)
scale = 1.0/x(index)
call SSCAL(isize, scale, x, 1)
index = ISAMAX(isize, y, 1)
scale = 1.0/y(index)
call SSCAL(isize, scale, y, 1)
ccc saxpy operation
call SAXPY(isize, alpha, x, 1, y, 1)
timer2 = SECONDR()
print *, timer2 - timer1, ' seconds wall time for BLAS'
stop
end
Even more improvement can be realized with matrix operations. As another
simple example, let's consider matrix-matrix multiplication. We have all
learned that two nxn matrices can be multiplied by placing the following
statement inside a set of three nested loops, iterating on i, j, and k:
c(i,j) <-- c(i,j) + a(i,k)*b(k,j)No matter how we arrange the three nested loops (there are six possible ways), we will obtain the same answer. However, the time it takes to get an answer is influenced by the loop ordering.
In the code which follows, after initializing a and b with random values, and c with zeroes, I demonstrate what I call a "naive" approach - we simply code up our textbook example and get our answer.
In the next multiplication, which I call "structured," I arrange the nested loops so that I have a saxpy operation. In Fortran, which stores matrices in column-major ordering, the innermost loop will iterate down the columns of a and c. Because the columns will map to contiguous logical memory locations, they can be viewed just as the vectors in a saxpy operation. As the innermost loop iterates, b(k,j) is constant, and thus we can think of it as a scalar value.
In the third multiplication, the innermost loop is replaced with the BLAS SAXPY routine.
Finally, in the fourth multiplication, I make use of a BLAS3 (matrix-matrix) SGEMM routine to replace all of my loops.
program matmult
parameter(isize=512)
dimension a(isize, isize), b(isize, isize), c(isize,isize)
ccc initialize a and b to random values, c to zero
do j=1,isize
do i=1,isize
a(i,j) = RANF()
b(i,j) = RANF()
c(i,j) = 0.0
enddo
enddo
ccc Naive matrix multiplication
timer1 = SECONDR()
do i=1,isize
do j=1,isize
do k=1,isize
c(i,j) = c(i,j) + a(i,k)*b(k,j)
enddo
enddo
enddo
timer2 = SECONDR()
print *, 'Time for naive matrix multiplication: ',
& timer2-timer1
ccc Structured matrix multiplication
ccc initialize c to zero
do j=1,isize
do i=1,isize
c(i,j) = 0.0
enddo
enddo
timer1 = SECONDR()
do k=1,isize
do j=1,isize
do i=1,isize
c(i,j) = c(i,j) + a(i,k)*b(k,j)
enddo
enddo
enddo
timer2 = SECONDR()
print *, 'Time for structured matrix multiplication: ',
& timer2-timer1
ccc BLAS1
ccc initialize c to zero
do j=1,isize
do i=1,isize
c(i,j) = 0.0
enddo
enddo
timer1 = SECONDR()
do k=1,isize
do j=1,isize
call SAXPY(isize, b(k,j), a(1,k), 1, c(1,j), 1)
enddo
enddo
timer2 = SECONDR()
print *, 'Time for BLAS1 matrix multiplication: ',
& timer2-timer1
ccc BLAS3
ccc initialize c to zero
do j=1,isize
do i=1,isize
c(i,j) = 0.0
enddo
enddo
timer1 = SECONDR()
call SGEMM('N', 'N', isize, isize, isize, 1.0, a, isize,
& b, isize, 1.0, c, isize)
timer2 = SECONDR()
print *, 'Time for BLAS3 SGEMM matrix multiplication: ',
& timer2-timer1
stop
end
Timings for the code are as follows:
Time for naive matrix multiplication: 46.09 Time for structured matrix multiplication: 34.33 Time for BLAS1 matrix multiplication: 11.18 Time for BLAS3 SGEMM matrix multiplication: 2.52I have only scratched the surface of linear algebra optimization, but hopefully it has demonstrated the benefits of looking more closely at our operations to see how we might benefit by calling BLAS routines. With a little effort, we might find it possible to use 16 PE's where we once required 64, simply by improving the per-processor performance.
Closing points - LAPACK routines, for the solution of various linear systems, are also available on denali. If you're worried about portability, both BLAS and LAPACK sources are available from Netlib:
http://www.netlib.org/and have been implemented on numerous systems ranging from Linux PC's to supercomputers. In fact, the use of these routines allows us to implement operations in a portable, yet often optimized, manner.
===================================================================== | Don Morton Email: morton@arsc.edu | | Visiting Scientist (summer) Voice: (907) 474-5507 | | Arctic Region Supercomputing Center Fax : (907) 450-8601 | | University of Alaska | | Fairbanks, AK 99775 | | http://grizzly.cameron.edu/morton.html | =====================================================================
Contact:
Donald Bahls ARSC User Consultant ph: 907-450-8674 Ed Kornkven ARSC HPC Specialist ph: 907-450-8669 Arctic Region Supercomputing Center University of Alaska Fairbanks PO Box 756020 Fairbanks AK 99775-6020
Send comments and questions to the current editors using this Contact Form.E-mail Subscriptions:
| Newsletter Index | Quick-Tip Index | Search Newsletters |
Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775 | voice: 907-450-8600 | email:
home | search | about | support | news | science | resources