Performance Analysis of the CRAY T3E-1200E
Edward Anderson
Lockheed Martin Services Inc.
National Environmental Supercomputing Center
Anderson.Edward@epa.gov
November 23, 1999
Most sources for CRAY T3E results give performance measurements for the CRAY T3E (300 MHz) or CRAY T3E-900 (450 MHz) models (,). This paper presents results for the CRAY T3E-1200E (600 MHz), which differs from the earlier models in the following respects:
- The clock speed of the microprocessor is 600 MHz.
- The DRAM speed is 50 ns, vs. 60 ns in earlier models.
- The support circuitry supports hardware read-ahead for stream operations (this is also true of the CRAY T3E-900).
The differences between a CRAY T3E-1200 and a CRAY T3E-1200E include a later revision of the R chip, higher performance for direct memory copies over the interconnection network, and support for different dynamic routing.
1.0 Hardware overview
The CRAY T3E is a distributed memory multiprocessor with a globally addressable address space. The liquid-cooled model is scalable to eight cabinets, each housing one clock module and up to 34 processor modules. Each processor module contains eight nodes or processing elements (PEs), giving a maximum configuration of 272 PEs/cabinet. The PEs are site-configurable into three groups: application (APP) PEs, used to run parallel user jobs, command (CMD) PEs, used to run single-processor user jobs and to process interactive commands, and operating system (OS) PEs, used for system tasks. There must be at least two CMD and two OS PEs in a liquid-cooled system. For a fully-populated cabinet of 272 PEs, the typical configuration is 256 APP PEs and a total of 16 CMD and OS PEs. This ratio of application to support PEs scales to 2048 application processors.
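The cabinet arithmetic in the preceding paragraph can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the figure of 16 support PEs per cabinet is the "typical configuration" quoted above, not a fixed rule.

```python
# Configuration figures quoted in the text.
MODULES_PER_CABINET = 34       # processor modules per liquid-cooled cabinet
PES_PER_MODULE = 8             # processing elements per module
MAX_CABINETS = 8
SUPPORT_PES_PER_CABINET = 16   # typical CMD + OS PEs in a full cabinet

pes_per_cabinet = MODULES_PER_CABINET * PES_PER_MODULE
app_pes_per_cabinet = pes_per_cabinet - SUPPORT_PES_PER_CABINET
max_total_pes = MAX_CABINETS * pes_per_cabinet
max_app_pes = MAX_CABINETS * app_pes_per_cabinet

print(pes_per_cabinet, app_pes_per_cabinet, max_total_pes, max_app_pes)
```

With eight full cabinets the same 256:16 split gives the 2048 application processors mentioned above.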
Each node of a CRAY T3E system consists of an Alpha 21164 microprocessor, a system control chip, 64 MB - 2048 MB (*) of local memory, and a network router. The DEC Alpha 21164 is a superscalar microprocessor with 4-way instruction issue, an 8 KB primary data cache, an 8 KB instruction cache, and a 96 KB secondary cache for both data and instructions. The system logic operates at 75 MHz, and the microprocessor clock speed must be a multiple of this rate. Included on the control chip are a set of six hardware stream buffers, which provide prefetching and buffering for loads of consecutive cache lines from local memory, and an external register set (the "E-registers"), which are used for memory transfers. Although the main purpose of the E-registers is to handle inter-processor communication, they can also be used for non-cached access to local memory. A cache backmap maintains coherence of the local memory, caches, streams, and E-registers by flushing cached lines when a non-cached memory request is initiated.
(*) Most CRAY T3E systems have 128, 256, or 512 MB of memory per node.
There is one network router for each processor on a CRAY T3E system, in contrast to the CRAY T3D system architecture, in which two processors shared a network connection. As in the CRAY T3D, network routers are connected in a 3-D torus topology. Each network router has 16 uni-directional channels, two to connect to and from the support circuitry of its associated PE, two to connect to and from the I/O controller on the processor module, and 12 to connect to and from other nodes in the +/- x, +/- y, and +/- z directions of the 3-D torus. Torus links have a theoretical peak bandwidth of 650 MB/s in each direction.
The test platform for the benchmarks in this paper was the 72-processor CRAY T3E-1200E system at the National Environmental Supercomputing Center, operated by Lockheed Martin for the U. S. Environmental Protection Agency. This system was configured with 64 application PEs, 4 command PEs, and 4 OS PEs, each with 256 MB of memory. The operating system level was Unicos/mk 18.104.22.168 and the programming environment release level was 3.2.
2.0 Local memory bandwidth
The peak bandwidth from local memory to the microprocessor within a CRAY T3E node is rated at 1200 MB/sec. This rate is based on a peak transfer rate of 128 bits (16 bytes) per system clock along the processor bus, or 32 bits per system clock along each of the four channels between the memory controller chips and the main control chip, and a system clock rate of 75 MHz. The theoretical peak sustained bandwidth is 80% of this, or 960 MB/sec.
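The peak figures quoted above follow directly from the bus width and the system clock. The short sketch below reproduces that arithmetic; the constant names are ours, and 1 MB is taken as 10**6 bytes, which is an assumption consistent with the rounded figures in the text.

```python
SYSTEM_CLOCK_HZ = 75e6        # system logic clock (Section 1.0)
BUS_BYTES_PER_CLOCK = 16      # 128 bits per system clock on the processor bus
CHANNELS = 4                  # channels between memory controllers and control chip
CHANNEL_BYTES_PER_CLOCK = 4   # 32 bits per system clock on each channel
SUSTAINED_FRACTION = 0.80     # fraction quoted for peak sustained bandwidth

# Both descriptions of the data path give the same peak rate.
peak_bus_mb_s = BUS_BYTES_PER_CLOCK * SYSTEM_CLOCK_HZ / 1e6
peak_channels_mb_s = CHANNELS * CHANNEL_BYTES_PER_CLOCK * SYSTEM_CLOCK_HZ / 1e6
sustained_mb_s = SUSTAINED_FRACTION * peak_bus_mb_s

print(peak_bus_mb_s, peak_channels_mb_s, sustained_mb_s)
```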
The greatest measured bandwidths are obtained by assembly-coded kernels or library routines. Simple load/store tests are best written in assembly language because high-level languages cannot easily express one-way data transfers except in the context of some other work. Table 1 shows the best measured bandwidth rates for cacheable loads and stores. The effect of the on-chip caches was minimized for these tests by loading or storing unit-stride vectors of millions of words until an asymptotic rate was reached. Cacheable store rates appear slow because the write-allocate secondary cache requires that stored data first be loaded into the cache, then updated and marked as dirty until flushed to memory at some later time, so the reported rate represents only half of the memory traffic. The load/store operation assumes different addresses for the load and store and is equivalent to a vector copy. Experiments with the relative memory addresses of the streams showed that separating the arrays by a multiple of 8192 words plus 64 or 128 words, to avoid cache conflicts, gave the best performance.
|Operation||1 stream||2 streams||3 streams||4 streams|
|cacheable load + store||-||515||-||513|
Table 2 shows the best measured bandwidth rates for uncached loads and stores (typically called GETs and PUTs). These rates were measured using the one-sided E-register GET and PUT routines from benchlib () and the SHMEM library routines SHMEM_GET and SHMEM_PUT. In this context, "one-sided" means a transfer from local memory into the E-registers, or from the E-registers to local memory. A one-sided GET is a meaningless operation, but a one-sided PUT could have application in initializing a data area to a constant value. Because the data never makes it to the Alpha microprocessor during an E-register operation, the uncached load and store rates are useful only for modeling memory-to-memory transfers such as a vector copy or a matrix transpose.
|uncached GET+PUT via benchlib||582|
|uncached GET+PUT via SHMEM_GET||668|
|uncached GET+PUT via SHMEM_PUT||609|
A somewhat more standard measure of "sustainable" local memory bandwidth is the STREAM benchmark (). The single-processor STREAM benchmark program evaluates the performance of the local memory system on the following unit-stride operations:
|Name||Operation||Bytes transferred (at 8 bytes/word)|
|Copy||c(1:n) = a(1:n)||16*n|
|Scale||b(1:n) = s*c(1:n)||16*n|
|Add||c(1:n) = a(1:n) + b(1:n)||24*n|
|Triad||a(1:n) = b(1:n) + s*c(1:n)||24*n|
The value of n is chosen large enough to exceed the size of any local caches. Optimization by use of assembly language kernels or library routines is not allowed. When STREAM benchmark results are given for parallel systems, they include the time to perform the local memory operations and the time to synchronize all the participating processors or processes.
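As a rough illustration of how a STREAM rate is derived, the sketch below counts two 8-byte words per element for Copy and Scale and three for Add and Triad, then divides by the elapsed time. The function names are ours, the timing in the usage line is a placeholder rather than a measurement, and 1 MB is taken as 10**6 bytes, as in the STREAM benchmark's own reporting.

```python
BYTES_PER_WORD = 8

# Words moved per element (loads plus stores) as STREAM counts them.
WORDS_PER_ELEMENT = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_bytes(kernel, n):
    """Bytes counted by STREAM for one pass of the named kernel over n elements."""
    return BYTES_PER_WORD * WORDS_PER_ELEMENT[kernel] * n


def stream_rate_mb_s(kernel, n, seconds):
    """Reported rate in MB/s (1 MB = 10**6 bytes) for one timed pass."""
    return stream_bytes(kernel, n) / seconds / 1e6


# Placeholder timing, not a measured value.
example_rate = stream_rate_mb_s("copy", 2_000_000, 0.1)
```

Note that this count ignores the extra read traffic caused by a write-allocate cache, so the memory system may be moving more data than the reported rate suggests.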
Table 3 shows the best measured STREAM benchmark rates for each operation. The size n was 2000000 elements for all tests, but different values of the "offset" parameter, which helps to control the relative memory addresses, were optimal for the different cases.
|Function||Offset||Compiler options||Rate [MB/s]|
|Copy||7040||f90 -O3,unroll2,pipeline2 -dp||520|
|Scale||7104||f90 -O3,unroll2,pipeline2 -dp||517|
|Add||7040||f90 -O3,unroll2,pipeline2 -dp -a pad||611|
|Triad||10000||f90 -O3,unroll2,pipeline2 -dp||622|
3.0 Inter-processor bandwidth
The COMMS benchmarks from the ParkBench suite () measure inter-processor communication rates. Two versions of these benchmarks are included in the benchmark suite, one using the PVM message-passing library and one using MPI. In addition, we implemented the benchmarks using the SHMEM data-passing library, which has the lowest latency and the highest bandwidth of the supported messaging libraries on the CRAY T3E.
Latency and bandwidth data extracted from the COMMS1 benchmark results are shown in Table 4. For the MPI tests, setting the environment variable MPI_BUFFER_MAX to 2048, thereby limiting buffering to messages less than 2048 bytes, improved the performance by 75% for the larger transfers. For the PVM tests, setting the environment variable PVM_DATA_MAX to 32000 (as suggested in ) improved the performance for messages up to that length, but did not affect the asymptotic rate.
|Library||time for zero-length message [microseconds]||bandwidth [Mbyte/s]|
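Latency and asymptotic bandwidth are commonly combined in a two-parameter linear timing model, t(n) = t0 + n / r_inf. The sketch below assumes that model; the function names and the parameter values are placeholders chosen for illustration and are not the measurements of Table 4.

```python
def message_time(nbytes, t0, r_inf):
    """Seconds to send nbytes, given startup time t0 [s] and asymptotic rate r_inf [bytes/s]."""
    return t0 + nbytes / r_inf


def effective_bandwidth(nbytes, t0, r_inf):
    """Achieved bytes per second for a message of nbytes."""
    return nbytes / message_time(nbytes, t0, r_inf)


def half_performance_length(t0, r_inf):
    """Message length at which half of the asymptotic bandwidth is achieved."""
    return t0 * r_inf


# Placeholder parameters: 10 microseconds startup, 300 MB/s asymptotic rate.
t0 = 10e-6
r_inf = 300e6
n_half = half_performance_length(t0, r_inf)
```

Under this model a library with a small zero-length message time reaches its asymptotic bandwidth at much shorter message lengths, which is why both columns of the table matter.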
The SYNCH1 benchmark measures the overhead for global synchronization by measuring the rate at which a barrier statement can be executed as a function of the number of processes (nodes) taking part in the global barrier synchronization. The SYNCH1 benchmark repeats a sequence of 10 BARRIER statements 1000 times.
|NPES||MPI barrier [usec]||PVM barrier [usec]||SHMEM barrier [usec]|
4.0 Single processor performance
As on previous CRAY systems, the best single-processor performance for computational kernels on the CRAY T3E is obtained by the use of library routines. The CRAY scientific library (libsci) includes optimized single-processor versions of the BLAS (the Basic Linear Algebra Subprograms from netlib), LAPACK (a standard linear algebra package), and FFTs. Compiler optimization options enable vectorization of selected math library (libm) functions, and an alternate math library (libmfastv) provides higher-performing versions of some of these functions. Certain memory-to-memory operations, such as block copies, matrix transposes, data initialization, and gather/scatter operations, can be optimized by using Shared Memory (SHMEM) library routines, specifying the local processor for both the target and source. Finally, the unsupported benchlib library () provides finer control over memory-to-memory transfers by direct manipulation of the E-registers.
The libsci BLAS can attain near peak performance when operating on data in the cache. To illustrate, we performed some basic vector operations (1-norm, 2-norm, vector sum, and dot product) using 64-bit Level 1 BLAS with vectors sized to fit in the cache, and repeated the tests many times to get an in-cache performance rate. The results are shown in Table 6. The performance of SNRM2 is the same as SASUM despite doing twice as much work because a 2-pass algorithm is used in SNRM2 to avoid underflow or overflow during the sum of squares ().
|Subroutine||Operation||Data type||In-cache rate [Mflop/s]|
|SASUM||a <- || x ||_1||real||365|
|SNRM2||a <- || x ||_2||real||361|
|SAXPY||y <- a*x + y||real||523|
|CAXPY||y <- a*x + y||complex||943|
|SDOT||a <- x' * y||real||979|
|CDOTU||a <- x' * y||complex||1175|
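To make the remark about SNRM2 concrete, the sketch below shows one common form of a two-pass 2-norm: the first pass finds the largest magnitude, and the second accumulates the sum of squares of the scaled elements. It is an illustration of the general idea only; the source does not show the libsci implementation, and the function names are ours.

```python
import math


def asum(x):
    """1-norm: about one floating-point operation per element."""
    return sum(abs(v) for v in x)


def nrm2_naive(x):
    """One-pass 2-norm: two operations per element, but squares may overflow."""
    return math.sqrt(sum(v * v for v in x))


def nrm2_two_pass(x):
    """Two-pass 2-norm that scales by the largest magnitude before squaring."""
    scale = max((abs(v) for v in x), default=0.0)   # pass 1: find the scale
    if scale == 0.0:
        return 0.0
    ssq = 0.0
    for v in x:                                     # pass 2: scaled sum of squares
        r = v / scale
        ssq += r * r
    return scale * math.sqrt(ssq)


big = [3e200, 4e200]
print(nrm2_naive(big), nrm2_two_pass(big))
```

The extra pass over the data costs time without adding counted floating-point operations, which is consistent with SNRM2 reporting a rate no better than SASUM.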
Compared to other distributed memory systems, the CRAY T3E has a relatively small cache on each node. To highlight the advantages of a larger cache, some vendors will only show the performance of library routines from the cache. However, when data is not already cached, the asymptotic performance rate is a better predictor of the performance that will actually be achieved. Table 7 shows asymptotic performance rates for representative libsci BLAS operations on 64-bit vectors and matrices.
|Subroutine||Problem size||Rate [Mflop/s]|
Besides the Basic Linear Algebra Subprograms, the Cray Scientific Library also contains software from LAPACK (a Fortran library of linear algebra software), special solvers such as first- and second-order linear system solvers, and FFTs. A sample of the performance from some of the block algorithms in LAPACK, taken directly from the LAPACK timing program, is shown in Table 8. Although significantly below the matrix multiply rate and probably optimizable, these results may be more indicative of the performance of Level 3 BLAS in the context of a user program. Table 9 shows the performance of the FFT library routine CCFFT, including both an in-cache performance rate and a rate with all the vectors starting from memory. The megaflop rate is based on an operation count of 5 n log( n ) for each complex-to-complex FFT. Parallel versions of some of this software are also available through the ScaLAPACK library and several well-optimized 2-D and 3-D FFTs.
|N||In-cache [Mflop/s]||From memory [Mflop/s]|
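The nominal FFT rate can be computed as sketched below. The text gives the count as 5 n log( n ); the sketch assumes the logarithm is base 2, which is the usual convention for FFT operation counts, and the function names and the timing in the usage line are ours.

```python
import math


def fft_flops(n):
    """Nominal operation count for one complex FFT of length n (base-2 log assumed)."""
    return 5.0 * n * math.log2(n)


def fft_mflops(n, seconds):
    """Nominal Mflop/s for one length-n transform that took the given time."""
    return fft_flops(n) / seconds / 1e6


# Placeholder timing, not a measured value.
example = fft_mflops(1024, 1e-4)
```

Because the count is nominal, two FFT implementations that perform different numbers of actual operations are still compared on the same scale.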
A number of common memory-to-memory transfers can be performed more efficiently by using the E-registers to bypass the cache. Examples include block copies, matrix transposes, data initialization, and gather/scatter operations (). Support for this type of data movement is provided through the SHMEM and benchlib libraries or use of the CACHE_BYPASS compiler directive. The next several tables summarize their various implementations and compare their performance to that of regular Fortran using the cache. Some of the operations do not have a straightforward implementation in every method.
|Method||Source code||Rate [Mb/s]|

    DO I = 1, N
      Y(I) = X(I)
    END DO

    !DIR$ CACHE_BYPASS X, Y
    DO I = 1, N
      Y(I) = X(I)
    END DO

    DO I = 1, N, 480
      NI = MIN( 480, N-I+1 )
      CALL LGETV( X(I), 1, NI )
      CALL LPUTV( Y(I), 1, NI )
    END DO

    CALL SHMEM_GET( Y, X, N, MYPE )
|Method||Source code||Rate [Mb/s]|

    DO J = 1, N
      DO I = 1, N
        B(I,J) = A(J,I)
      END DO
    END DO

    DO J = 1, N
    !DIR$ CACHE_BYPASS A, B
      DO I = 1, N
        B(I,J) = A(J,I)
      END DO
    END DO

    DO J = 1, N, 480
      NJ = MIN( N-J+1, 480 )
      DO I = 1, N
        CALL LGETV( A(I,J), LDA, NJ )
        CALL LPUTV( B(J,I), 1, NJ )
      END DO
    END DO
123 IF( LPUTP().NE.0 ) GOTO 123

    DO J = 1, N
      CALL SHMEM_IGET( B(1,J), A(J,1), &
                       1, LDA, N, MYPE )
    END DO
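The strided-get form of the transpose can be modelled in a few lines. The sketch below is a plain Python stand-in, not the SHMEM library: strided_get is a hypothetical helper that copies nelems items with a target stride and a source stride, mirroring the argument order used in the SHMEM_IGET call above, and the matrix is stored column-major in a flat list with 0-based indices.

```python
def strided_get(target, t_off, source, s_off, tst, sst, nelems):
    """Copy nelems items: target[t_off + i*tst] = source[s_off + i*sst]."""
    for i in range(nelems):
        target[t_off + i * tst] = source[s_off + i * sst]


def transpose(a, n, lda):
    """Return B = transpose(A) for an n-by-n column-major matrix in a flat list."""
    b = [0.0] * (lda * n)
    for j in range(n):
        # Column j of B (stride 1) is filled from row j of A (stride lda).
        strided_get(b, j * lda, a, j, 1, lda, n)
    return b


n = lda = 3
a = [float(v) for v in range(lda * n)]   # A(i,j) is stored at a[i + j*lda]
b = transpose(a, n, lda)
```

Each call reads the source with a stride of one leading dimension and writes the target contiguously, which is the access pattern the E-registers handle without polluting the cache.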
|Method||Source code||Rate [Mb/s]|

    DO J = 1, N-1
      DO I = J+1, N
        TMP = A(I,J)
        A(I,J) = A(J,I)
        A(J,I) = TMP
      END DO
    END DO

    DO J = 1, N-1
    !DIR$ CACHE_BYPASS A
      DO I = J+1, N
        TMP = A(I,J)
        A(I,J) = A(J,I)
        A(J,I) = TMP
      END DO
    END DO

    DO J = 1, N-1
      DO I = 1, N-J, 240
        NI = MIN( (N-J)-I+1, 240 )
        CALL LGETVO( A(J+I,J), 1, NI, 0 )
        CALL LGETVO( A(J,J+I), LDA, NI, 240 )
        CALL LPUTVO( A(J,J+I), LDA, NI, 0 )
        CALL LPUTVO( A(J+I,J), 1, NI, 240 )
      END DO
    END DO
123 IF( LPUTP().NE.0 ) GOTO 123

    DO J = 1, N-1
      CALL SHMEM_IGET( B, A(J,J+1), &
                       1, LDA, N-J, MYPE )
      CALL SHMEM_IGET( A(J,J+1), A(J+1,J), &
                       LDA, 1, N-J, MYPE )
      CALL SHMEM_PUT( A(J+1,J), B, N-J, MYPE )
    END DO
|Method||Source code||Rate [Mb/s]|

    DO I = 1, N
      X(I) = SUM
    END DO

    !DIR$ CACHE_BYPASS X
    DO I = 1, N
      X(I) = SUM
    END DO

    CALL LSETV( X, 1, N, SUM )
123 IF( LPUTP().NE.0 ) GO TO 123
|Method||Source code||NGATH||Rate [Mb/s]|

    DO I = 1, NGATH
      X(I) = A(INDEX(I))
    END DO
|10||23|
|100||52|
|1000||61|
|10000||67|

    !DIR$ CACHE_BYPASS A
    DO I = 1, NGATH
      X(I) = A(INDEX(I))
    END DO
|10||21|
|100||111|
|1000||218|
|10000||237|

    DO I = 1, NGATH, 480
      NI = MIN( NGATH-I+1, 480 )
      CALL LGATH( A(0), INDEX(I), NI )
      CALL LPUTV( X(I), 1, NI )
    END DO
123 IF( LPUTP().NE.0 ) GOTO 123
|10||23|
|100||150|
|1000||271|
|10000||310|

    CALL SHMEM_IXGET( X, A(0), INDEX, &
                      NGATH, MYPE )
|10||24|
|100||134|
|1000||202|
|10000||273|
5.0 Compiler optimizations
In order to evaluate the impact of different compiler options on single-processor performance, we conducted a study on the Livermore Fortran Kernels (LFK), a collection of 24 computational kernels with three sets of DO spans developed at the Lawrence Livermore National Laboratory. The LFK are a favorite test collection of compiler writers, and reasonable performance can usually be obtained by compiler options alone. We also experimented with hand optimization of selected kernels with modest success.
The LFK authors stipulate that statistics should be quoted from the summary table of 72 timings in the output file. The baseline performance of the LFK tests, using only the compiler options f90 -O2, was as follows:
Maximum Rate   =   596.9183 Mega-Flops/Sec.
Quartile Q3    =   157.0089 Mega-Flops/Sec.
Average Rate   =   143.0450 Mega-Flops/Sec.
GEOMETRIC MEAN =   115.8271 Mega-Flops/Sec.
Median Q2      =   103.9888 Mega-Flops/Sec.
Harmonic Mean  =    95.1324 Mega-Flops/Sec.
Quartile Q1    =    84.7023 Mega-Flops/Sec.
Minimum Rate   =    19.5352 Mega-Flops/Sec.
We added compiler options one at a time in order to study the effects of each on performance. These experiments can be summarized by looking at the geometric means from each test:
|f90 -dp -O3||122|
|f90 -dp -O3,unroll2||147|
|f90 -dp -O3,unroll2,pipeline2||151|
|f90 -dp -O3,unroll2,pipeline2 -a pad||142|
|f90 -dp -O3,unroll2,pipeline2 -lmfastv||151|
|f90 -dp -O3,unroll2,pipeline2,split2 -lmfastv||147|
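The three averages that appear in the LFK summary respond differently to outliers, which is why the geometric mean is used for comparison here. The sketch below defines them for a list of rates; the function names are ours, and the only data taken from this paper are the two geometric means used in the last line.

```python
import math


def arithmetic_mean(rates):
    return sum(rates) / len(rates)


def geometric_mean(rates):
    # Computed through logarithms to avoid overflow in a long product.
    return math.exp(sum(math.log(r) for r in rates) / len(rates))


def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)


# Toy data showing harmonic <= geometric <= arithmetic.
toy = [100.0, 400.0]
print(harmonic_mean(toy), geometric_mean(toy), arithmetic_mean(toy))

# Relative gain in geometric mean from f90 -O2 to
# f90 -dp -O3,unroll2,pipeline2, using the values quoted in this section.
gain = 151.3626 / 115.8271 - 1.0
```

The gain evaluates to roughly 31%, and the ordering of the three means matches the ordering seen in the LFK summary statistics.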
Although there was no measurable difference from adding the -lmfastv option, it is effective in some situations (), so we quote the full statistics for the case f90 -dp -O3,unroll2,pipeline2 -lmfastv:
Maximum Rate   =   632.9465 Mega-Flops/Sec.
Quartile Q3    =   214.1753 Mega-Flops/Sec.
Average Rate   =   189.2878 Mega-Flops/Sec.
GEOMETRIC MEAN =   151.3626 Mega-Flops/Sec.
Median Q2      =   147.5358 Mega-Flops/Sec.
Harmonic Mean  =   122.4048 Mega-Flops/Sec.
Quartile Q1    =    97.0975 Mega-Flops/Sec.
Minimum Rate   =    31.2659 Mega-Flops/Sec.
Previous studies have indicated that loop splitting is best applied on a loop-by-loop basis, rather than globally through use of the compiler's -Osplit flag (). Also, the compiler is typically unable to recognize when a scientific library substitution can be made. In the remainder of this section, specific hand optimizations to individual kernels are described that illustrate further opportunities for performance improvements.
Kernel 3: library substitution
Kernel 3 consists of the following three lines of code:
      Q= 0.000d0
      DO 3 k= 1,n
    3 Q= Q + Z(k) * X(k)
It can be replaced by a library call:
Q = SDOT( N, Z, 1, X, 1 )
The library routine is faster than the compiler-generated code when the vectors X and Z are in the cache because the library routine is more effective at hiding the latency of an Scache load. In the LFK test suite, the vectors do become cache resident because each test is executed several times. The statistics for kernel 3 for each do span length are as follows:
KERNEL      FLOPS   MICROSEC  MFLOP/SEC  SPAN  WEIGHT              CHECK-SUMS  PRECIS
------  ---------  ---------  ---------  ----  ------  ----------------------  ------
     3  1.598E+06  8.714E+03    183.439    27    1.00  1.0555606580531002E-01   16.90
     3  2.141E+06  8.251E+03    259.508   101    2.00  3.9489293708756362E-01   16.90
     3  1.802E+06  6.233E+03    289.073  1001    1.00  3.9140054768099826E+00   16.90

     3  1.598E+06  7.190E+03    222.323    27    1.00  1.0555606580531003E-01   16.90
     3  2.141E+06  4.547E+03    470.892   101    2.00  3.9489293708756368E-01   16.90
     3  1.802E+06  2.142E+03    841.000  1001    1.00  3.9140054768099817E+00   16.65
Similarly, Kernel 24 can be replaced by a call to ISMIN with a small increase in performance.
Kernel 13: cache bypass
Kernel 13, a 2-D particle in cell computation, benefitted from the common block padding option, -a pad, but the performance of the full LFK benchmark set degraded slightly with this option. The kernel is
      fw= 1.000d0
 1013 DO 13 k= 1,n
      i1= P(1,k)
      j1= P(2,k)
      i1= 1 + MOD2N(i1,64)
      j1= 1 + MOD2N(j1,64)
      P(3,k)= P(3,k) + B(i1,j1)
      P(4,k)= P(4,k) + C(i1,j1)
      P(1,k)= P(1,k) + P(3,k)
      P(2,k)= P(2,k) + P(4,k)
      i2= P(1,k)
      j2= P(2,k)
      i2= MOD2N(i2,64)
      j2= MOD2N(j2,64)
      P(1,k)= P(1,k) + Y(i2+32)
      P(2,k)= P(2,k) + Z(j2+32)
      i2= i2 + E(i2+32)
      j2= j2 + F(j2+32)
      H(i2,j2)= H(i2,j2) + fw
   13 CONTINUE
A contributing factor to the poor performance of this kernel is the non-unit stride pattern of access to the arrays B and C. The compiler always generates cacheable loads because it cannot tell what is in cache and what is not, but we can request non-cached access by preceding the above DO loop with the directive
!dir$ cache_bypass b, c
The baseline performance of kernel 13 was
KERNEL      FLOPS   MICROSEC  MFLOP/SEC  SPAN  WEIGHT              CHECK-SUMS  PRECIS
------  ---------  ---------  ---------  ----  ------  ----------------------  ------
    13  1.389E+06  3.815E+04     36.407     8    1.00  8.7535991110238361E+09   16.90
    13  1.837E+06  4.805E+04     38.230    32    2.00  3.0307123384693829E+10   16.90
    13  1.613E+06  4.347E+04     37.102    64    1.00  4.5614594609595291E+10   16.90

    13  1.389E+06  4.971E+04     27.940     8    1.00  8.7535991110238361E+09   16.90
    13  1.837E+06  3.972E+04     46.247    32    2.00  3.0307123384693829E+10   16.90
    13  1.613E+06  3.392E+04     47.541    64    1.00  4.5614594609595291E+10   16.90
The degradation for the smallest DO span is expected because the non-cached access requested by the cache_bypass directive will force cache-resident elements of B and C to be loaded from memory instead.
Kernel 14: loop combining
Kernel 14 is a 1-D particle in cell code consisting of three loops from 1 to n. The first loop has 7 streams and accesses two additional arrays (EX and DEX) in a non-sequential order. The second loop also has 7 streams, while the third uses only two, with some non-unit stride stores to a third array. In general, one wants to limit the number of streams to 6 or fewer. In this case, however, there was a lot of reuse within a loop iteration, so stripmining the outer loops and combining the three loops into two was the most beneficial.
The original code looked like this:
 1014 DO 141 k= 1,n
      VX(k)= 0.0d0
      XX(k)= 0.0d0
      IX(k)= INT( GRD(k))
      XI(k)= REAL( IX(k))
      EX1(k)= EX ( IX(k))
      DEX1(k)= DEX ( IX(k))
  141 CONTINUE
c
      DO 142 k= 1,n
      VX(k)= VX(k) + EX1(k) + (XX(k) - XI(k))*DEX1(k)
      XX(k)= XX(k) + VX(k) + FLX
      IR(k)= XX(k)
      RX(k)= XX(k) - IR(k)
      IR(k)= MOD2N( IR(k),2048) + 1
      XX(k)= RX(k) + IR(k)
  142 CONTINUE
c
      DO 14 k= 1,n
      RH(IR(k) )= RH(IR(k) ) + fw - RX(k)
      RH(IR(k)+1)= RH(IR(k)+1) + RX(k)
   14 CONTINUE
The optimized code is as follows:
      do kk = 1, n, 128
      kn = min(n-kk+1,128)
      DO 141 k = kk, kk+kn-1
      VX(k)= 0.0d0
      XX(k)= 0.0d0
      IX(k)= INT( GRD(k))
      XI(k)= REAL( IX(k))
      EX1(k)= EX ( IX(k))
      DEX1(k)= DEX ( IX(k))
      VX(k)= VX(k) + EX1(k) + (XX(k) - XI(k))*DEX1(k)
      XX(k)= XX(k) + VX(k) + FLX
  141 CONTINUE
c
      DO 142 k = kk, kk+kn-1
      IR(k)= XX(k)
      RX(k)= XX(k) - IR(k)
      IR(k)= MOD2N( IR(k),2048) + 1
      XX(k)= RX(k) + IR(k)
      RH(IR(k) )= RH(IR(k) ) + fw - RX(k)
      RH(IR(k)+1)= RH(IR(k)+1) + RX(k)
  142 CONTINUE
c
      end do
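The restructuring above is legal because each element k depends only on its own earlier results, and the updates to RH still occur in increasing k. The Python model below checks that claim on synthetic data. It is a sketch under stated assumptions: indices are 0-based, MOD2N is assumed to mean "modulo a power of two", the helper names and test data are ours, and the explicit zeroing of VX and XX is replaced by zero-initialized lists.

```python
def mod2n(i, j):
    # Assumed meaning of the LFK helper MOD2N: i modulo j, with j a power of two.
    return i & (j - 1)


def make_state(n):
    s = {name: [0.0] * n for name in ("vx", "xx", "xi", "ex1", "dex1", "rx")}
    s["ir"] = [0] * n
    s["rh"] = [0.0] * 2050          # IR(k)+1 can reach 2049
    return s


def gather(k, s, grd, ex, dex):     # body of the original loop 141
    ix = int(grd[k])
    s["xi"][k] = float(ix)
    s["ex1"][k] = ex[ix]
    s["dex1"][k] = dex[ix]


def push(k, s, flx):                # first two statements of loop 142
    s["vx"][k] = s["vx"][k] + s["ex1"][k] + (s["xx"][k] - s["xi"][k]) * s["dex1"][k]
    s["xx"][k] = s["xx"][k] + s["vx"][k] + flx


def locate(k, s):                   # remaining statements of loop 142
    ir = int(s["xx"][k])
    s["rx"][k] = s["xx"][k] - ir
    s["ir"][k] = mod2n(ir, 2048) + 1
    s["xx"][k] = s["rx"][k] + s["ir"][k]


def deposit(k, s, fw):              # body of the original loop 14
    i = s["ir"][k]
    s["rh"][i] = s["rh"][i] + fw - s["rx"][k]
    s["rh"][i + 1] = s["rh"][i + 1] + s["rx"][k]


def three_loops(grd, ex, dex, flx, fw):
    n = len(grd)
    s = make_state(n)
    for k in range(n):
        gather(k, s, grd, ex, dex)
    for k in range(n):
        push(k, s, flx)
        locate(k, s)
    for k in range(n):
        deposit(k, s, fw)
    return s


def strip_mined(grd, ex, dex, flx, fw, strip=128):
    n = len(grd)
    s = make_state(n)
    for kk in range(0, n, strip):
        hi = min(kk + strip, n)
        for k in range(kk, hi):     # loops 141 and 142 (first part) combined
            gather(k, s, grd, ex, dex)
            push(k, s, flx)
        for k in range(kk, hi):     # loops 142 (second part) and 14 combined
            locate(k, s)
            deposit(k, s, fw)
    return s


# Synthetic test data (not the LFK input).
grd = [1.25 + (7 * k) % 50 for k in range(300)]
ex = [0.5 * i for i in range(64)]
dex = [0.01 * i for i in range(64)]
same = three_loops(grd, ex, dex, 0.5, 1.0) == strip_mined(grd, ex, dex, 0.5, 1.0)
```

The model says nothing about speed; it only shows that the two loop orderings produce identical results, which is the precondition for the optimization.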
The baseline performance was
KERNEL      FLOPS   MICROSEC  MFLOP/SEC  SPAN  WEIGHT              CHECK-SUMS  PRECIS
------  ---------  ---------  ---------  ----  ------  ----------------------  ------
    14  1.901E+06  3.049E+04     62.339    27    1.00  1.9943880114661271E+06   16.90
    14  2.222E+06  3.297E+04     67.387   101    2.00  2.3107401197908435E+07   16.90
    14  2.202E+06  7.043E+04     31.266  1001    1.00  2.1783317062516003E+09   16.65
Improvements with the optimized code ranged from 20-70%:
    14  1.901E+06  1.813E+04    104.834    27    1.00  1.9943880114661271E+06   16.90
    14  2.222E+06  2.062E+04    107.755   101    2.00  2.3107401197908435E+07   16.90
    14  2.202E+06  5.610E+04     39.254  1001    1.00  2.1783317062516003E+09   16.65
Although dramatic in a few cases, the overall effect of the hand optimizations was only about a 5% improvement in the average or mean performance rates:
Maximum Rate   =   831.2763 Mega-Flops/Sec.
Quartile Q3    =   215.1152 Mega-Flops/Sec.
Average Rate   =   198.2022 Mega-Flops/Sec.
GEOMETRIC MEAN =   157.3169 Mega-Flops/Sec.
Median Q2      =   152.3032 Mega-Flops/Sec.
Harmonic Mean  =   128.0855 Mega-Flops/Sec.
Quartile Q1    =   100.2081 Mega-Flops/Sec.
Minimum Rate   =    29.4546 Mega-Flops/Sec.
6.0 Multiple processor computational performance
The LINPACK benchmark for scalable parallel systems measures the greatest sustainable performance for the solution of a dense system of linear equations (). The problem size and implementation are left to the implementor, except that the algorithm used must be Gaussian elimination with partial pivoting. The popularity of the benchmark stems from its use in ranking computers for the biannual list of the "Top 500" supercomputers (). Factoring a matrix (the main part of the LINPACK benchmark) is a particularly cache-friendly operation because it is dominated by matrix-matrix multiplications. Although performance comparable to that of the LINPACK benchmark is seldom replicated in real applications, it is a good test of new systems and a first validation of a scalable design. The CRAY T3E-1200 exhibits near linear scalability on this benchmark, as can be seen from Table 15 (*).
|NPES||Rmax [Gflop/s]||Nmax||N 1/2||Rpeak [Gflop/s]||% peak|
(*) LINPACK benchmark results are from the LINPACK benchmark report, and were not run at NESC.
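The columns of the LINPACK table are related by simple arithmetic, sketched below. Two assumptions are made that the text does not state explicitly: a peak of 1200 Mflop/s per PE (two floating-point operations per clock at 600 MHz), and the conventional LINPACK operation count of 2/3 n^3 + 2 n^2. The function names and the numbers in the usage line are placeholders, not entries from the table.

```python
PEAK_MFLOPS_PER_PE = 1200.0   # assumption: 2 flops per clock at 600 MHz


def linpack_flops(n):
    """Conventional operation count for solving a dense n-by-n system."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def rpeak_gflops(npes):
    """Theoretical peak in Gflop/s for npes processors."""
    return npes * PEAK_MFLOPS_PER_PE / 1000.0


def rmax_gflops(n, seconds):
    """Achieved rate in Gflop/s for a problem of order n solved in the given time."""
    return linpack_flops(n) / seconds / 1e9


def percent_of_peak(rmax, npes):
    return 100.0 * rmax / rpeak_gflops(npes)


# Placeholder values for illustration only.
example = percent_of_peak(38.4, 64)
```

Near-linear scalability corresponds to the percent-of-peak figure staying roughly constant as NPES grows.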
The NAS parallel benchmarks (NPB) are a collection of eight benchmarks developed as part of the Numerical Aerodynamic Simulation (NAS) program at the NASA Ames Research Center to measure and compare the performance of highly parallel computers. These benchmarks, which are derived from computational fluid dynamic codes, have become a standard measure of supercomputer performance. Five of the benchmarks (EP, CG, MG, FT, and IS) represent computational kernels, while the other three (LU, SP, and BT) represent simplified applications. Three problem classes of increasing size are specified for each benchmark: class A, class B, and class C. Although source code implementations of the benchmarks in MPI now exist, the original specification, now called NPB 1.0, was a "pencil and paper" benchmark, in which the actual implementation was left to the computer vendor.
NPB 1.0 results are the most optimized versions, and it is those results (courtesy of the Cray Research benchmarking group) that are presented in Table 16, Table 17, and Table 18. The benchmarks vary widely in their communication requirements and patterns. EP has almost no communication, MG uses ghost-cell update communication and FT and BT use data set transpose algorithms. The drop-off in FT performance occurs when we run out of "planes" to do in parallel and have to do domain decomposition within a plane, which adds another communication step.
|Rate per PE [Mflop/s]:|
|Rate per PE [Mflop/s]:|
|Rate per PE [Mflop/s]:|
7.0 I/O performance
The subject of distributing the I/O of a parallel program is only discussed in general terms in this paper (see , , and  for more details). On a CRAY T3E system, the processors have a common file space, and the programmer must decide whether one or multiple processors will access a file. If only one processor handles I/O, then all other processors must communicate with that one in order to access or store data on disk. If multiple processors handle the I/O in the program, then either each processor must access its own file or the Cray Global I/O layer must be used to safely buffer data and coordinate the accesses of multiple processes to the same file. On some file systems, the files may be striped across different partitions on the disk in order to increase the rate of access observed by any one processor.
Each Fortran OPEN statement issued by a processor has its own file descriptor and counts against the per job limit of open files (currently 256 at NESC), even if it is accessing the same file as another processor. It is easy to exceed this limit when using the Cray global I/O layer if every processor opens several files. That is why designating one processor as the I/O processor, and using the fast inter-processor connection network to move data as needed, is often seen as the best alternative.
In order to quantify the I/O performance of the CRAY T3E, we will look at two metrics: the time for a global file OPEN, as would be necessary when doing global I/O to a particular file, and the maximum read and write bandwidth to disk from a single processor. The latter of these is somewhat specific to the current NESC configuration, but it is not atypical for a T3E system.
The command "df -p" shows the partitions available on the currently mounted filesystems. A filesystem with more than one partition may benefit from user-level striping. On the NESC CRAY T3E system at the time of these tests, the following configuration was in place for the /tmp disk:
/tmp (/dev/dsk/tmp ): 3923683 sectors 505891 I-nodes
               total: 6250000 sectors 524288 I-nodes
Big file threshold: 32768 bytes
Big file allocation minimum: 24 blocks
Allocation Strategy: round robin files round robin all user data
Primary partitions allocation unit: 16K byte blocks
part     start     total  free (%)           frags (%)      device
----  --------  --------  -----------------  -------------  ------
   0         0   6250000  4230096 ( 67.7%)   24 ( 0.002%)   tmp1
   1   6250000   6250000  3872580 ( 62.0%)   35 ( 0.004%)   tmp3
   2  12500000   6250000  4062884 ( 65.0%)   18 ( 0.002%)   tmp2
   3  18750000   6250000  3529172 ( 56.5%)   55 ( 0.006%)   tmp4
Our test programs are taken from the NERSC I/O guide (), modified to write or read an unformatted 100 MB file. As documented in the I/O guide, we use the assign command with options to specify the size of the file in 4KB blocks (-n 25600), the chunksize, a blocking parameter (-q 128), and the global I/O layer (-F bufa). Results for different numbers of partitions (specified with the -p option) are shown in Table 19.
|assign command||Write [MB/s]||Read [MB/s]|
|No assign (1 partition)||32.12||44.50|
|assign -n 25600 -q 128 -F bufa:128:2 u:10||40.52||57.79|
|assign -n 25600 -q 128 -p 0-1 -F bufa:128:4 u:10||77.03||111.37|
|assign -n 25600 -q 128 -p 0-2 -F bufa:128:6 u:10||104.82||162.40|
|assign -n 25600 -q 128 -p 0-3 -F bufa:128:8 u:10||106.99||219.53|
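The block count passed to assign and the benefit of striping can both be checked from the figures in the table. In the sketch below the rates are copied from the rows that use the assign command; treating the row without a -p option as a single partition is our assumption, and the variable names are ours.

```python
FILE_BYTES = 100 * 1024 * 1024    # the 100 MB test file
BLOCK_BYTES = 4 * 1024            # assign -n counts 4 KB blocks

blocks = FILE_BYTES // BLOCK_BYTES

# (write MB/s, read MB/s) by number of partitions, from the rows using assign.
rates = {
    1: (40.52, 57.79),     # no -p option; assumed to be one partition
    2: (77.03, 111.37),
    3: (104.82, 162.40),
    4: (106.99, 219.53),
}

base_write, base_read = rates[1]
speedup = {p: (w / base_write, r / base_read) for p, (w, r) in rates.items()}

for p in sorted(speedup):
    print(p, round(speedup[p][0], 2), round(speedup[p][1], 2))
```

On these figures the read rate scales almost linearly with the number of partitions, while the write rate levels off after three.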
Table 20 shows the best observed file open times for the 100 MB file in the previous exercise assuming that every processor opens the file before the file is created. The time to execute the OPEN statement varied widely from run to run and in some cases was an order of magnitude larger than these times. These data support the NERSC I/O guide's statement that "it could take a significant amount of time to open a large number of large files" (). Indeed, the assignment that optimizes the file transfer rate may cause the file open time to be larger than the time to read or write the file itself.
|NPES||1 partition [sec]||4 partitions [sec]|
Acknowledgements
I am grateful to Jeff Brooks of SGI/Cray Research for some helpful discussions and for providing the three tables of NAS Parallel Benchmark results in Section 6.0.
References
- Ed Anderson, Jeff Brooks, Charles Grassl, and Steve Scott, Performance [Analysis] of the CRAY T3E Multiprocessor, Proceedings of SC97, http://www.supercomp.org/sc97/proceedings/TECH/ANDERSON/INDEX.HTM, November 1997.
- Ed Anderson, Jeff Brooks, and Tom Hewitt, The Benchmarker's Guide to Single-processor Optimization for CRAY T3E Systems, http://www.sgi.com/t3e/images/benchmark.pdf, June 1997.
- Edward Anderson and Mark Fahey, Performance Improvements to LAPACK for the Cray Scientific Library, LAPACK Working Note 126, http://www.netlib.org/lapack/lawns/lawn126.ps, April 1997.
- Michael Berry and Roger Hockney, Public International Benchmarks for Parallel Computers, PARKBENCH Committee Report 1, http://www.netlib.org/parkbench/, February 1994.
- Cray Research, CRAY T3E Fortran Optimization Guide, SG-2518 3.0, http://www.cray.com/products/software/publications/, 1997.
- Cray Research, Application Programmer's I/O Guide, 007-3695-005, http://www.cray.com/products/software/publications/, 1997.
- Jack J. Dongarra, Performance of Various Computers Using Standard Linear Equations Software, University of Tennessee, report number CS-89-85, http://www.netlib.org/benchmark/performance.ps, April 1999.
- Jack J. Dongarra, Hans W. Meuer, and Erich Strohmaier, Top500 Supercomputer Sites, University of Tennessee, report number CS-99-425, http://www.netlib.org/benchmark/top500.html, June 1999.
- Richard A. Gerber, I/O on the NERSC Cray T3E, National Energy Research Scientific Computing Center, http://hpcf.nersc.gov/training/tutorials/T3E/IO/, July 1998.
- John D. McCalpin, Sustainable Memory Bandwidth in Current High Performance Computers, http://www.cs.virginia.edu/stream/, October 1995.