Most sources for CRAY T3E results give performance measurements for the CRAY T3E (300 MHz) or CRAY T3E-900 (450 MHz) models ([1],[2]). This paper presents results for the CRAY T3E-1200E (600 MHz), which differs from the earlier models in the following respects:
The CRAY T3E is a distributed memory multiprocessor with a globally addressable address space. The liquid-cooled model is scalable to eight cabinets, each housing one clock module and up to 34 processor modules. Each processor module contains eight nodes or processing elements (PEs), giving a maximum configuration of 272 PEs/cabinet. The PEs are site-configurable into three groups: application (APP) PEs, used to run parallel user jobs, command (CMD) PEs, used to run single-processor user jobs and to process interactive commands, and operating system (OS) PEs, used for system tasks. There must be at least two CMD and two OS PEs in a liquid-cooled system. For a fully-populated cabinet of 272 PEs, the typical configuration is 256 APP PEs and a total of 16 CMD and OS PEs. This ratio of application to support PEs scales to 2048 application processors.
Each node of a CRAY T3E system consists of an Alpha 21164 microprocessor, a system control chip, 64 MB - 2048 MB (*) of local memory, and a network router. The DEC Alpha 21164 is a superscalar microprocessor with 4-way instruction issue, an 8 KB primary data cache, an 8 KB instruction cache, and a 96 KB secondary cache for both data and instructions. The system logic operates at 75 MHz, and the microprocessor clock speed must be a multiple of this rate. Included on the control chip are a set of six hardware stream buffers, which provide prefetching and buffering for loads of consecutive cache lines from local memory, and an external register set (the "E-registers"), which are used for memory transfers. Although the main purpose of the E-registers is to handle inter-processor communication, they can also be used for non-cached access to local memory. A cache backmap maintains coherence of the local memory, caches, streams, and E-registers by flushing cached lines when a non-cached memory request is initiated.
(*) Most CRAY T3E systems have 128, 256, or 512 MB of memory per node.
There is one network router for each processor on a CRAY T3E system, in contrast to the CRAY T3D system architecture, in which two processors shared a network connection. As in the CRAY T3D, network routers are connected in a 3-D torus topology. Each network router has 16 uni-directional channels, two to connect to and from the support circuitry of its associated PE, two to connect to and from the I/O controller on the processor module, and 12 to connect to and from other nodes in the +/- x, +/- y, and +/- z directions of the 3-D torus. Torus links have a theoretical peak bandwidth of 650 MB/s in each direction.
The test platform for the benchmarks in this paper was the 72-processor CRAY T3E-1200E system at the National Environmental Supercomputing Center, operated by Lockheed Martin for the U. S. Environmental Protection Agency. This system was configured with 64 application PEs, 4 command PEs, and 4 OS PEs, each with 256 MB of memory. The operating system level was Unicos/mk 2.0.4.60 and the programming environment release level was 3.2.
The greatest measured bandwidths are obtained by assembly-coded kernels or library routines. Simple load/store tests are best written in assembly language because high-level languages can not easily express one-way data transfers except in the context of some other work. Table 1 shows the best measured bandwidth rates for cacheable loads and stores. The effect of the on-chip caches was minimized for these tests by loading or storing unit-stride vectors of millions of words until an asymptotic rate was reached. Cacheable store rates appear slow because the write-allocate secondary cache requires that stored data first be loaded into the cache, then updated and marked as dirty until flushed to memory at some later time, so the reported rate represents only half of the memory traffic. The load/store operation assumes different addresses for the load and store and is equivalent to a vector copy. Experiments with the relative memory addresses of the streams showed that separating the arrays by a multiple of 8192 words plus 64 or 128 words for avoiding cache conflicts gave the best performance.
| Operation | 1 stream | 2 streams | 3 streams | 4 streams |
| cacheable load | 453 | 815 | 873 | 865 |
| cacheable store | 302 | 356 | 332 | 332 |
| cacheable load + store | - | 515 | - | 513 |
Table 2 shows the best measured bandwidth rates for uncached loads and stores (typically called GETs and PUTs). These rates were measured using the one-sided E-register GET and PUT routines from benchlib ([2]) and the SHMEM library routines SHMEM_GET and SHMEM_PUT. In this context, "one-sided" means a transfer from local memory into the E-registers, or from the E-registers to local memory. A one-sided GET is a meaningless operation, but a one-sided PUT could have application in initializing a data area to a constant value. Because the data never makes it to the Alpha microprocessor during an E-register operation, the uncached load and store rates are useful only for modeling memory-to-memory transfers such as a vector copy or a matrix transpose.
| Operation | Rate [MB/s] |
| uncached GET | 598 |
| uncached PUT | 532 |
| uncached GET+PUT via benchlib | 582 |
| uncached GET+PUT via SHMEM_GET | 668 |
| uncached GET+PUT via SHMEM_PUT | 609 |
A somewhat more standard measure of "sustainable" local memory bandwidth is the STREAM benchmark ([10]). The single-processor STREAM benchmark program evaluates the performance of the local memory system on the following unit-stride operations:
| Name | Operation | Bytes transferred (at 8 bytes/word) |
| Copy | c(1:n) = a(1:n) | 8*(16*n) |
| Scale | b(1:n) = s*c(1:n) | 8*(16*n) |
| Add | c(1:n) = a(1:n) + b(1:n) | 8*(24*n) |
| Triad | a(1:n) = b(1:n) + s*c(1:n) | 8*(24*n) |
The value of n is chosen large enough to exceed the size of any local caches. Optimization by use of assembly language kernels or library routines is not allowed. When STREAM benchmark results are given for parallel systems, they include the time to perform the local memory operations and the time to synchronize all the participating processors or processes.
Table 3 shows the best measured STREAM benchmark rates for each operation. The size n was 2000000 elements for all tests, but different values of the "offset" parameter, which helps to control the relative memory addresses, were optimal for the different cases.
| Function | Offset | Compiler options | Rate [MB/s] |
| Copy | 7040 | f90 -O3,unroll2,pipeline2 -dp | 520 |
| Scale | 7104 | f90 -O3,unroll2,pipeline2 -dp | 517 |
| Add | 7040 | f90 -O3,unroll2,pipeline2 -dp -a pad | 611 |
| Triad | 10000 | f90 -O3,unroll2,pipeline2 -dp | 622 |
Latency and bandwidth data extracted from the COMMS1 benchmark results are shown in Table 4. For the MPI tests, setting the environment variable MPI_BUFFER_MAX to 2048, thereby limiting buffering to messages less than 2048 bytes, improved the performance by 75% for the larger transfers. For the PVM tests, setting the environment variable PVM_DATA_MAX to 32000 (as suggested in [5]) improved the performance for messages up to that length, but did not affect the asymptotic rate.
| Library | time for zero-length message [microseconds] |
bandwidth [Mbyte/s] |
| MPI | 14 | 315 |
| PVM | 11 | 154 |
| SHMEM | 2 | 418 |
The SYNCH1 benchmark measures the overhead for global synchronization by measuring the rate at which a barrier statement can be executed as a function of the number of processes (nodes) taking part in the global barrier synchronization. The SYNCH1 benchmark repeats a sequence of 10 BARRIER statements 1000 times.
| NPES | MPI barrier [usec] |
PVM barrier [usec] |
SHMEM barrier [usec] |
| 2 | 4.8 | 3.2 | 1.07 |
| 4 | 9.3 | 6.1 | 1.31 |
| 8 | 14. | 9.0 | 1.55 |
| 16 | 18. | 11.9 | 1.54 |
| 32 | 23. | 14.6 | 1.78 |
| 64 | 27. | 17.5 | 2.02 |
The libsci BLAS can attain near peak performance when operating on data in the cache. To illustrate, we performed some basic vector operations (1-norm, 2-norm, vector sum, and dot product) using 64-bit Level 1 BLAS with vectors sized to fit in the cache, and repeated the tests many times to get an in-cache performance rate. The results are shown in Table 6. The performance of SNRM2 is the same as SASUM despite doing twice as much work because a 2-pass algorithm is used in SNRM2 to avoid underflow or overflow during the sum of squares ([3]).
| Subroutine | Operation | Data type | In-cache rate [Mflop/s] |
| SASUM | a <- || x ||_1 | real | 365 |
| SNRM2 | a <- || x ||_2 | real | 361 |
| SAXPY | y <- a*x + y | real | 523 |
| CAXPY | y <- a*x + y | complex | 943 |
| SDOT | a <- x' * y | real | 979 |
| CDOTU | a <- x' * y | complex | 1175 |
Compared to other distributed memory systems, the CRAY T3E has a relatively small cache on each node. To highlight the advantages of a larger cache, some vendors will only show the performance of library routines from the cache. However, when data is not already cached, the asymptotic performance rate is a better predictor of the performance that will actually be achieved. Table 7 shows asymptotic performance rates for representative libsci BLAS operations on 64-bit vectors and matrices.
| Subroutine | Problem size | Rate [Mflop/s] |
| SNRM2 | 30000000 | 102 |
| SAXPY | 15000000 | 61 |
| CAXPY | 7500000 | 122 |
| SDOT | 15000000 | 102 |
| CDOTU | 7500000 | 164 |
| SGER | 2000 | 81 |
| SGEMV | 2000 | 156 |
| STRSV | 2000 | 85 |
| SGEMM | 2000 | 722 |
| STRSM | 2000 | 524 |
Besides the Basic Linear Algebra Subprograms, the Cray Scientific Library also contains software from LAPACK (a Fortran library of linear algebra software), special solvers such as first- and second-order linear system solvers, and FFTs. A sample of the performance from some of the block algorithms in LAPACK, taken directly from the LAPACK timing program, is shown in Table 8. Although signficantly below the matrix multiply rate and probably optimizable, these results may be more indicative of the performance of Level 3 BLAS in the context of a user program. Table 9 shows the performance of the FFT library routine CCFFT, including both an in-cache performance rate and a rate with all the vectors starting from memory. The megaflop rate is based on an operation count of 5n log(n) for each complex-to-complex FFT. Parallel versions of some of this software are also available through the ScaLAPACK library and several well-optimized 2-D and 3-D FFTs.
| Subroutine | 100 | 200 | 300 | 400 | 500 | SGEMM | 551 | 632 | 638 | 656 | 679 | SGETRF | 256 | 355 | 375 | 422 | 429 | SPOTRF(U) | 188 | 301 | 365 | 412 | 444 | SPOTRF(L) | 187 | 292 | 299 | 349 | 345 | SGEQRF | 185 | 290 | 297 | 348 | 351 | SGEHRD | 126 | 154 | 184 | 208 | 225 | SSYTRD(U) | 159 | 130 | 149 | 163 | 178 | SSYTRD(L) | 137 | 126 | 144 | 165 | 172 | SGEBRD | 98 | 104 | 122 | 137 | 147 |
| CGEMM | 707 | 744 | 755 | 754 | 773 | CGETRF | 302 | 414 | 491 | 544 | 581 | CPOTRF(U) | 215 | 289 | 324 | 342 | 353 | CPOTRF(L) | 221 | 293 | 314 | 313 | 308 | CGEQRF | 259 | 257 | 273 | 282 | 293 | CGEHRD | 186 | 209 | 227 | 234 | 238 | CHETRD(U) | 185 | 184 | 215 | 238 | 253 | CHETRD(L) | 183 | 186 | 219 | 242 | 257 | CGEBRD | 129 | 148 | 172 | 186 | 194 | Table 8: LAPACK routine performance for different values of N [Mflop/s]
| N | In-cache [Mflop/s] |
From memory [Mflop/s] |
| 64 | 474 | 87 |
| 128 | 486 | 117 |
| 256 | 569 | 182 |
| 512 | 533 | 195 |
| 1024 | 535 | 220 |
| 2048 | 170 | 164 |
| 4096 | 82 | 76 |
| 8192 | 27 | 27 |
| 16384 | 25 | 25 |
| 32768 | 23 | 23 |
A number of common memory-to-memory transfers can be performed more efficiently by using the E-registers to bypass the cache. Examples include block copies, matrix transposes, data initialization, and gather/scatter operations ([2]). Support for this type of data movement is provided through the SHMEM and benchlib libraries or use of the CACHE_BYPASS compiler directive. The next several tables summarize their various implementations and compare their performance to that of regular Fortran using the cache. Some of the operations do not have a straightforward implementation in every method.
| Method | Source code | Rate [Mb/s] |
| Fortran |
DO I = 1, N
Y(I) = X(I)
END DO
|
508 |
| CACHE_BYPASS |
!DIR$ CACHE_BYPASS X, Y
DO I = 1, N
Y(I) = X(I)
END DO
|
582 |
| Benchlib |
DO I = 1, N, 480
NI = MIN( 480, N-I+1 )
CALL LGETV( X(I), 1, NI )
CALL LPUTV( Y(I), 1, NI )
END DO
|
582 |
| SHMEM |
CALL SHMEM_GET( Y, X, N, MYPE )
|
679 |
| Method | Source code | Rate [Mb/s] |
| Fortran |
DO J = 1, N
DO I = 1, N
B(I,J) = A(J,I)
END DO
END DO
|
106 |
| CACHE_BYPASS |
DO J = 1, N
!DIR$ CACHE_BYPASS A, B
DO I = 1, N
B(I,J) = A(J,I)
END DO
END DO
|
442 |
| Benchlib |
DO J = 1, N, 480
NJ = MIN( N-J+1, 480 )
DO I = 1, N
CALL LGETV( A(I,J), LDA, NJ )
CALL LPUTV( B(J,I), 1, NJ )
END DO
END DO
123 IF( LPUTP().NE.0 ) GOTO 123
|
547 |
| SHMEM |
DO J = 1, N
CALL SHMEM_IGET( B(1,J), A(J,1),
& 1, LDA, N, MYPE )
END DO
|
421 |
| Method | Source code | Rate [Mb/s] |
| Fortran |
DO J = 1, N-1
DO I = J+1, N
TMP = A(I,J)
A(I,J) = A(J,I)
A(J,I) = TMP
END DO
END DO
|
151 |
| CACHE_BYPASS |
DO J = 1, N-1
!DIR$ CACHE_BYPASS A
DO I = J+1, N
TMP = A(I,J)
A(I,J) = A(J,I)
A(J,I) = TMP
END DO
END DO
|
167 |
| Benchlib |
DO J = 1, N-1
DO I = 1, N-J, 240
NI = MIN( (N-J)-I+1, 240 )
CALL LGETVO( A(J+I,J), 1, NI, 0 )
CALL LGETVO( A(J,J+I), LDA, NI, 240 )
CALL LPUTVO( A(J,J+I), LDA, NI, 0 )
CALL LPUTVO( A(J+I,J), 1, NI, 240 )
END DO
END DO
123 IF( LPUTP().NE.0 ) GOTO 123
|
465 |
| SHMEM |
DO J = 1, N-1
CALL SHMEM_IGET( B, A(J,J+1),
& 1, LDA, N-J, MYPE )
CALL SHMEM_IGET( A(J,J+1), A(J+1,J),
& LDA, 1, N-J, MYPE )
CALL SHMEM_PUT( A(J+1,J), B, N-J, MYPE )
END DO
|
321 |
| Method | Source code | Rate [Mb/s] |
| Fortran |
DO I = 1, N
X(I) = SUM
END DO
|
332 |
| CACHE_BYPASS |
!DIR$ CACHE_BYPASS X
DO I = 1, N
X(I) = SUM
END DO
|
331 |
| Benchlib |
CALL LSETV( X, 1, N, SUM )
123 IF( LPUTP().NE.0 ) GO TO 123
|
530 |
| Method | Source code | NGATH | Rate [Mb/s] |
| Fortran |
DO I = 1, NGATH
X(I) = A(INDEX(I))
END DO
|
10 100 1000 10000 |
23 52 61 67 |
| CACHE_BYPASS |
!DIR$ CACHE_BYPASS A
DO I = 1, NGATH
X(I) = A(INDEX(I))
END DO
|
10 100 1000 10000 |
21 111 218 237 |
| Benchlib |
DO I = 1, NGATH, 480
NI = MIN( NGATH-I+1, 480 )
CALL LGATH( A(0), INDEX(I), NI )
CALL LPUTV( X(I), 1, NI )
END DO
123 IF( LPUTP().NE.0 ) GOTO 123
|
10 100 1000 10000 |
23 150 271 310 |
| SHMEM |
CALL SHMEM_IXGET( X, A(0), INDEX,
& NGATH, MYPE )
|
10 100 1000 10000 |
24 134 202 273 |
The LFK authors stipulate that statistics should be quoted from the summary table of 72 timings in the output file. The baseline performance of the LFK tests, using only the compiler options f90 -O2, was as follows:
Maximum Rate = 596.9183 Mega-Flops/Sec.
Quartile Q3 = 157.0089 Mega-Flops/Sec.
Average Rate = 143.0450 Mega-Flops/Sec.
GEOMETRIC MEAN = 115.8271 Mega-Flops/Sec.
Median Q2 = 103.9888 Mega-Flops/Sec.
Harmonic Mean = 95.1324 Mega-Flops/Sec.
Quartile Q1 = 84.7023 Mega-Flops/Sec.
Minimum Rate = 19.5352 Mega-Flops/Sec.
We added compiler options one at a time in order to study the effects of each on performance. These experiments can be summarized by looking at the geometric means from each test:
| Options | Rate [Mflop/s] |
| f90 -dp -O3 | 122 |
| f90 -dp -O3,unroll2 | 147 |
| f90 -dp -O3,unroll2,pipeline2 | 151 |
| f90 -dp -O3,unroll2,pipeline2 -a pad | 142 |
| f90 -dp -O3,unroll2,pipeline2 -lmfastv | 151 |
| f90 -dp -O3,unroll2,pipeline2,split2 -lmfastv | 147 |
Although there was no measurable difference from adding the -lmfastv option, it is effective in some situations ([2]), so we quote the full statistics for the case f90 -dp -O3,unroll2,pipeline2 -lmfastv:
Maximum Rate = 632.9465 Mega-Flops/Sec.
Quartile Q3 = 214.1753 Mega-Flops/Sec.
Average Rate = 189.2878 Mega-Flops/Sec.
GEOMETRIC MEAN = 151.3626 Mega-Flops/Sec.
Median Q2 = 147.5358 Mega-Flops/Sec.
Harmonic Mean = 122.4048 Mega-Flops/Sec.
Quartile Q1 = 97.0975 Mega-Flops/Sec.
Minimum Rate = 31.2659 Mega-Flops/Sec.
Previous studies have indicated that loop splitting is best applied on a loop-by-loop basis, rather than globally through use of the compiler's -Osplit flag ([2]). Also, the compiler is typically unable to recognize when a scientific library substitution can be made. In the remainder of this section, specific hand optimizations to individual kernels are described that illustrate further opportunities for performance improvements.
Kernel 3: library substitution
Kernel 3 consists of the following three lines of code:
Q= 0.000d0
DO 3 k= 1,n
3 Q= Q + Z(k) * X(k)
It can be replaced by a library call:
Q = SDOT( N, Z, 1, X, 1 )
The library routine is faster than the compiler-generated code when the vectors X and Z are in the cache because the library routine is more effective at hiding the latency of an Scache load. In the LFK test suite, the vectors do become cache resident because each test is executed several times. The statistics for kernel 3 for each do span length are as follows:
KERNEL FLOPS MICROSEC MFLOP/SEC SPAN WEIGHT CHECK-SUMS PRECIS ------ ----- -------- --------- ---- ------ ---------------------- ----- 3 1.598E+06 8.714E+03 183.439 27 1.00 1.0555606580531002E-01 16.90 3 2.141E+06 8.251E+03 259.508 101 2.00 3.9489293708756362E-01 16.90 3 1.802E+06 6.233E+03 289.073 1001 1.00 3.9140054768099826E+00 16.90
After optimization:
3 1.598E+06 7.190E+03 222.323 27 1.00 1.0555606580531003E-01 16.90 3 2.141E+06 4.547E+03 470.892 101 2.00 3.9489293708756368E-01 16.90 3 1.802E+06 2.142E+03 841.000 1001 1.00 3.9140054768099817E+00 16.65
Similarly, Kernel 24 can be replaced by a call to ISMIN with a small increase in performance.
Kernel 13: cache bypass
Kernel 13, a 2-D particle in cell computation, benefitted from the common block padding option, -a pad, but the performance of the full LFK benchmark set degraded slightly with this option. The kernel is
fw= 1.000d0
1013 DO 13 k= 1,n
i1= P(1,k)
j1= P(2,k)
i1= 1 + MOD2N(i1,64)
j1= 1 + MOD2N(j1,64)
P(3,k)= P(3,k) + B(i1,j1)
P(4,k)= P(4,k) + C(i1,j1)
P(1,k)= P(1,k) + P(3,k)
P(2,k)= P(2,k) + P(4,k)
i2= P(1,k)
j2= P(2,k)
i2= MOD2N(i2,64)
j2= MOD2N(j2,64)
P(1,k)= P(1,k) + Y(i2+32)
P(2,k)= P(2,k) + Z(j2+32)
i2= i2 + E(i2+32)
j2= j2 + F(j2+32)
H(i2,j2)= H(i2,j2) + fw
13 CONTINUE
A contributing factor to the poor performance of this kernel is the non-unit stride pattern of access to the arrays B and C. The compiler always generates cacheable loads because it can not tell what is in cache and what is not, but we can request non-cached access by preceding the above DO loop with the directive
!dir$ cache_bypass b, c
The baseline performance of kernel 13 was
KERNEL FLOPS MICROSEC MFLOP/SEC SPAN WEIGHT CHECK-SUMS PRECIS ------ ----- -------- --------- ---- ------ ---------------------- ----- 13 1.389E+06 3.815E+04 36.407 8 1.00 8.7535991110238361E+09 16.90 13 1.837E+06 4.805E+04 38.230 32 2.00 3.0307123384693829E+10 16.90 13 1.613E+06 4.347E+04 37.102 64 1.00 4.5614594609595291E+10 16.90
With cache_bypass:
13 1.389E+06 4.971E+04 27.940 8 1.00 8.7535991110238361E+09 16.90 13 1.837E+06 3.972E+04 46.247 32 2.00 3.0307123384693829E+10 16.90 13 1.613E+06 3.392E+04 47.541 64 1.00 4.5614594609595291E+10 16.90
The degradation for the smallest DO span is expected because the non-cached access requested by the cache_bypass directive will force cache-resident elements of B and C to be loaded from memory instead.
Kernel 14: loop combining
Kernel 14 is a 1-D particle in cell code consisting of three loops from 1 to n. The first loop has 7 streams and accesses two additional arrays (EX and DEX) in a non-sequential order. The second loop also has 7 streams, while the third uses only two, with some non-unit stride stores to a third array. In general, one wants to limit the number of streams to 6 or fewer. In this case, however, there was a lot of reuse within a loop iteration, so stripmining the outer loops and combining the three loops into two was the most beneficial.
The original code looked like this:
1014 DO 141 k= 1,n
VX(k)= 0.0d0
XX(k)= 0.0d0
IX(k)= INT( GRD(k))
XI(k)= REAL( IX(k))
EX1(k)= EX ( IX(k))
DEX1(k)= DEX ( IX(k))
141 CONTINUE
c
DO 142 k= 1,n
VX(k)= VX(k) + EX1(k) + (XX(k) - XI(k))*DEX1(k)
XX(k)= XX(k) + VX(k) + FLX
IR(k)= XX(k)
RX(k)= XX(k) - IR(k)
IR(k)= MOD2N( IR(k),2048) + 1
XX(k)= RX(k) + IR(k)
142 CONTINUE
c
DO 14 k= 1,n
RH(IR(k) )= RH(IR(k) ) + fw - RX(k)
RH(IR(k)+1)= RH(IR(k)+1) + RX(k)
14 CONTINUE
The optimized code is as follows:
do kk = 1, n, 128
kn = min(n-kk+1,128)
DO 141 k = kk, kk+kn-1
VX(k)= 0.0d0
XX(k)= 0.0d0
IX(k)= INT( GRD(k))
XI(k)= REAL( IX(k))
EX1(k)= EX ( IX(k))
DEX1(k)= DEX ( IX(k))
VX(k)= VX(k) + EX1(k) + (XX(k) - XI(k))*DEX1(k)
XX(k)= XX(k) + VX(k) + FLX
141 CONTINUE
c
DO 142 k = kk, kk+kn-1
IR(k)= XX(k)
RX(k)= XX(k) - IR(k)
IR(k)= MOD2N( IR(k),2048) + 1
XX(k)= RX(k) + IR(k)
RH(IR(k) )= RH(IR(k) ) + fw - RX(k)
RH(IR(k)+1)= RH(IR(k)+1) + RX(k)
142 CONTINUE
c
end do
The baseline performance was
KERNEL FLOPS MICROSEC MFLOP/SEC SPAN WEIGHT CHECK-SUMS PRECIS ------ ----- -------- --------- ---- ------ ---------------------- ----- 14 1.901E+06 3.049E+04 62.339 27 1.00 1.9943880114661271E+06 16.90 14 2.222E+06 3.297E+04 67.387 101 2.00 2.3107401197908435E+07 16.90 14 2.202E+06 7.043E+04 31.266 1001 1.00 2.1783317062516003E+09 16.65
Improvements with the optimized code ranged from 20-70%:
14 1.901E+06 1.813E+04 104.834 27 1.00 1.9943880114661271E+06 16.90 14 2.222E+06 2.062E+04 107.755 101 2.00 2.3107401197908435E+07 16.90 14 2.202E+06 5.610E+04 39.254 1001 1.00 2.1783317062516003E+09 16.65
Although dramatic in a few cases, the overall effect of the hand optimizations was only about a 5% improvement in the average or mean performance rates:
Maximum Rate = 831.2763 Mega-Flops/Sec.
Quartile Q3 = 215.1152 Mega-Flops/Sec.
Average Rate = 198.2022 Mega-Flops/Sec.
GEOMETRIC MEAN = 157.3169 Mega-Flops/Sec.
Median Q2 = 152.3032 Mega-Flops/Sec.
Harmonic Mean = 128.0855 Mega-Flops/Sec.
Quartile Q1 = 100.2081 Mega-Flops/Sec.
Minimum Rate = 29.4546 Mega-Flops/Sec.
| NPES | Rmax [Gflop/s] | Nmax | N 1/2 | Rpeak [Gflop/s] | % peak |
| 256 | 212 | 125962 | 11520 | 307 | 69 |
| 128 | 106 | 89088 | 7488 | 154 | 69 |
| 64 | 53.1 | 62976 | 4992 | 76.8 | 69 |
| 32 | 26.6 | 44544 | 3456 | 38.4 | 69 |
| 16 | 13.4 | 31680 | 2304 | 19.2 | 70 |
| 8 | 6.67 | 22272 | 1536 | 9.6 | 69 |
| 4 | 3.37 | 15936 | 960 | 4.8 | 70 |
| 2 | 1.68 | 11040 | 576 | 2.4 | 70 |
(*) LINPACK benchmark results are from the LINPACK benchmark report, and were not run at NESC.
The NAS parallel benchmarks (NPB) are a collection of eight benchmarks developed as part of the Numerical Aerodynamic Simulation (NAS) program at the NASA Ames Research Center to measure and compare the performance of highly parallel computers. These benchmarks, which are derived from computational fluid dynamic codes, have become a standard measure of supercomputer performance. Five of the benchmarks (EP, CG, MG, FT, and IS) represent computational kernels, while the other three (LU, SP, and BT) represent simplified applications. Three problem classes of increasing size are specified for each benchmark: class A, class B, and class C. Although source code implementations of the benchmarks in MPI now exist, the original specification, now called NPB 1.0, was a "pencil and paper" benchmark, in which the actual implementation was left to the computer vendor.
NPB 1.0 results are the most optimized versions, and it is those results (courtesy of the Cray Research benchmarking group) that are presented in Table 16, Table 17, and Table 18. The benchmarks vary widely in their communication requirements and patterns. EP has almost no communication, MG uses ghost-cell update communication and FT and BT use data set transpose algorithms. The drop-off in FT performance occurs when we run out of "planes" to do in parallel and have to do domain decomposition within a plane, which adds another communication step.
| Time [Seconds]: | ||||||||
| PEs | ep | cg | mg | ft | is | lu | sp | bt |
| 32 | 1.8 | 1.5 | 1.1 | 1.3 | 0.7 | 12.8 | 21.7 | 24.5 |
| 64 | 0.9 | 0.6 | 0.6 | 0.6 | 0.3 | 7.4 | 11.5 | 12.9 |
| 128 | 0.4 | 0.4 | 0.3 | 0.3 | 0.2 | 4.4 | 7.3 | 7.0 |
| 256 | 0.2 | 0.3 | 0.2 | 0.2 | 0.1 | 3.1 | 4.6 | 3.9 |
| 512 | 0.1 | 0.2 | 0.1 | 0.2 | 0.1 | 2.3 | 2.6 | 2.2 |
| 1024 | 0.1 | 0.2 | 0.1 | 0.1 | 0.2 | 2.0 | 1.5 | 1.3 |
| Rate per PE [Mflop/s]: | ||||||||
| PEs | ep | cg | mg | ft | is | lu | sp | bt |
| 32 | 471 | 32 | 113 | 138 | 37 | 158 | 147 | 232 |
| 64 | 471 | 36 | 105 | 136 | 37 | 137 | 138 | 220 |
| 128 | 470 | 28 | 97 | 134 | 34 | 114 | 109 | 202 |
| 256 | 467 | 21 | 88 | 124 | 27 | 82 | 86 | 180 |
| 512 | 463 | 14 | 67 | 72 | 11 | 55 | 78 | 163 |
| 1024 | 451 | 9 | 49 | 62 | 4 | 31 | 64 | 137 |
| Time [Seconds]: | ||||||||
| PEs | ep | cg | mg | ft | is | lu | sp | bt |
| 32 | 7.1 | 50.4 | 5.1 | 15.0 | 2.8 | 48.5 | 82.4 | 106.7 |
| 64 | 3.5 | 27.7 | 2.7 | 7.6 | 1.7 | 26.5 | 44.0 | 56.4 |
| 128 | 1.8 | 17.3 | 1.5 | 3.9 | 0.8 | 16.8 | 25.9 | 29.4 |
| 256 | 0.9 | 9.0 | 0.8 | 2.1 | 0.4 | 9.9 | 14.8 | 15.5 |
| 512 | 0.4 | 5.3 | 0.5 | 1.6 | 0.3 | 6.5 | 9.0 | 8.9 |
| 1024 | 0.2 | 3.1 | 0.4 | 1.0 | 0.2 | 4.9 | 5.1 | 5.0 |
| Rate per PE [Mflop/s]: | ||||||||
| PEs | ep | cg | mg | ft | is | lu | sp | bt |
| 32 | 446 | 34 | 116 | 149 | 35 | 206 | 170 | 211 |
| 64 | 446 | 31 | 108 | 146 | 30 | 188 | 159 | 200 |
| 128 | 446 | 25 | 98 | 144 | 30 | 149 | 135 | 192 |
| 256 | 446 | 24 | 87 | 134 | 29 | 127 | 118 | 182 |
| 512 | 444 | 20 | 68 | 85 | 22 | 96 | 97 | 158 |
| 1024 | 441 | 17 | 49 | 73 | 12 | 64 | 86 | 140 |
| Time [Seconds]: | ||||||||
| PEs | ep | cg | mg | ft | is | lu | sp | bt |
| 32 | 28.2 | 125.8 | 41.1 | 65.5 | 13.4 | 199.4 | 326.8 | 561.1 |
| 64 | 14.1 | 68.0 | 19.9 | 33.3 | 8.8 | 100.9 | 171.6 | 285.2 |
| 128 | 7.1 | 40.5 | 10.3 | 16.8 | 4.7 | 52.0 | 93.9 | 145.6 |
| 256 | 3.5 | 30.6 | 5.2 | 8.9 | 2.5 | 29.6 | 50.1 | 75.7 |
| 512 | 1.8 | 16.7 | 2.8 | 4.6 | 1.4 | 17.0 | 28.4 | 40.3 |
| 1024 | 0.9 | 7.3 | 1.6 | 3.6 | 0.9 | 11.2 | 15.8 | 21.6 |
| Rate per PE [Mflop/s]: | ||||||||
| PEs | ep | cg | mg | ft | is | lu | sp | bt |
| 32 | 447 | 34 | 118 | 193 | 30 | 320 | 139 | 170 |
| 64 | 447 | 31 | 122 | 189 | 23 | 316 | 132 | 167 |
| 128 | 447 | 26 | 118 | 188 | 21 | 306 | 121 | 164 |
| 256 | 447 | 17 | 116 | 177 | 20 | 269 | 113 | 157 |
| 512 | 446 | 16 | 107 | 172 | 19 | 234 | 100 | 148 |
| 1024 | 445 | 18 | 92 | 110 | 14 | 178 | 90 | 138 |
Each Fortran OPEN statement issued by a processor has its own file descriptor and counts against the per job limit of open files (currently 256 at NESC), even if it is accessing the same file as another processor. It is easy to exceed this limit when using the Cray global I/O layer if every processor opens several files. That is why designating one processor as the I/O processor, and using the fast inter-processor connection network to move data as needed, is often seen as the best alternative.
In order to quantify the I/O performance of the CRAY T3E, we will look at two metrics: the time for a global file OPEN, as would be necessary when doing global I/O to a particular file, and the maximum read and write bandwidth to disk from a single processor. The latter of these is somewhat specific to the current NESC configuration, but it is not atypical for a T3E system.
The command "df -p" shows the partitions available on the currently mounted filesystems. A filesystem with more than one partition may benefit from user-level striping. On the NESC CRAY T3E system at the time of these tests, the following configuration was in place for the /tmp disk:
/tmp (/dev/dsk/tmp ): 3923683 sectors 505891 I-nodes
total: 6250000 sectors 524288 I-nodes
Big file threshold: 32768 bytes
Big file allocation minimum: 24 blocks
Allocation Strategy: round robin files
round robin all user data
Primary partitions allocation unit: 16K byte blocks
part start total free (%) frags (%) device
---- -------- -------- ----------------- ---------------- --------
0 0 6250000 4230096 ( 67.7%) 24 ( 0.002%) tmp1
1 6250000 6250000 3872580 ( 62.0%) 35 ( 0.004%) tmp3
2 12500000 6250000 4062884 ( 65.0%) 18 ( 0.002%) tmp2
3 18750000 6250000 3529172 ( 56.5%) 55 ( 0.006%) tmp4
Our test programs are taken from the NERSC I/O guide ([9]), modified to write or read an unformatted 100 MB file. As documented in the I/O guide, we use the assign command with options to specify the size of the file in 4KB blocks (-n 25600), the chunksize, a blocking parameter (-q 128), and the global I/O layer (-F bufa). Results for different numbers of partitions (specified with the -p command) are shown in Table 19.
| assign command | Write [MB/s] |
Read [MB/s] |
| No assign (1 partition) | 32.12 | 44.50 |
| assign -n 25600 -q 128 -F bufa:128:2 u:10 | 40.52 | 57.79 |
| assign -n 25600 -q 128 -p 0-1 -F bufa:128:4 u:10 | 77.03 | 111.37 |
| assign -n 25600 -q 128 -p 0-2 -F bufa:128:6 u:10 | 104.82 | 162.40 |
| assign -n 25600 -q 128 -p 0-3 -F bufa:128:8 u:10 | 106.99 | 219.53 |
Table 20 shows the best observed file open times for the 100 MB file in the previous exercise assuming that every processor opens the file before the file is created. The time to execute the OPEN statement varied widely from run to run and in some cases was an order of magnitude larger than these times. These data support the NERSC I/O guide's statement that "it could take a significant amount of time to open a large number of large files" ([9]). Indeed, the assignment that optimizes the file transfer rate may cause the file open time to be larger than the time to read or write the file itself.
| NPES | 1 partition [sec] |
4 partitions [sec] |
| 1 | 0.045 | 0.94 |
| 2 | 0.058 | 7.4 |
| 4 | 0.099 | 14. |
| 8 | 0.20 | 28. |
| 16 | 0.78 | 58. |
| 32 | 1.7 | 115. |