Performance Analysis of the CRAY T3E-1200E

Edward Anderson
Lockheed Martin Services Inc.
National Environmental Supercomputing Center
Anderson.Edward@epa.gov

November 23, 1999

Most sources for CRAY T3E results give performance measurements for the CRAY T3E (300 MHz) or CRAY T3E-900 (450 MHz) models ([1],[2]). This paper presents results for the CRAY T3E-1200E (600 MHz), which differs from the earlier models in the following respects:

  • The clock speed of the microprocessor is 600 MHz.
  • The DRAM speed is 50 ns, vs. 60 ns in earlier models.
  • The support circuitry provides hardware read-ahead for stream operations (this is also true of the CRAY T3E-900).

The differences between a CRAY T3E-1200 and a CRAY T3E-1200E include a later revision of the R chip, higher performance for direct memory copies over the interconnection network, and support for different dynamic routing.

1.0 Hardware overview

The CRAY T3E is a distributed memory multiprocessor with a globally addressable address space. The liquid-cooled model is scalable to eight cabinets, each housing one clock module and up to 34 processor modules. Each processor module contains eight nodes or processing elements (PEs), giving a maximum configuration of 272 PEs/cabinet. The PEs are site-configurable into three groups: application (APP) PEs, used to run parallel user jobs, command (CMD) PEs, used to run single-processor user jobs and to process interactive commands, and operating system (OS) PEs, used for system tasks. There must be at least two CMD and two OS PEs in a liquid-cooled system. For a fully-populated cabinet of 272 PEs, the typical configuration is 256 APP PEs and a total of 16 CMD and OS PEs. This ratio of application to support PEs scales to 2048 application processors.

Each node of a CRAY T3E system consists of an Alpha 21164 microprocessor, a system control chip, 64 MB - 2048 MB (*) of local memory, and a network router. The DEC Alpha 21164 is a superscalar microprocessor with 4-way instruction issue, an 8 KB primary data cache, an 8 KB instruction cache, and a 96 KB secondary cache for both data and instructions. The system logic operates at 75 MHz, and the microprocessor clock speed must be a multiple of this rate. Included on the control chip are a set of six hardware stream buffers, which provide prefetching and buffering for loads of consecutive cache lines from local memory, and an external register set (the "E-registers"), which are used for memory transfers. Although the main purpose of the E-registers is to handle inter-processor communication, they can also be used for non-cached access to local memory. A cache backmap maintains coherence of the local memory, caches, streams, and E-registers by flushing cached lines when a non-cached memory request is initiated.

(*) Most CRAY T3E systems have 128, 256, or 512 MB of memory per node.

There is one network router for each processor on a CRAY T3E system, in contrast to the CRAY T3D system architecture, in which two processors shared a network connection. As in the CRAY T3D, network routers are connected in a 3-D torus topology. Each network router has 16 uni-directional channels, two to connect to and from the support circuitry of its associated PE, two to connect to and from the I/O controller on the processor module, and 12 to connect to and from other nodes in the +/- x, +/- y, and +/- z directions of the 3-D torus. Torus links have a theoretical peak bandwidth of 650 MB/s in each direction.

The test platform for the benchmarks in this paper was the 72-processor CRAY T3E-1200E system at the National Environmental Supercomputing Center, operated by Lockheed Martin for the U. S. Environmental Protection Agency. This system was configured with 64 application PEs, 4 command PEs, and 4 OS PEs, each with 256 MB of memory. The operating system level was Unicos/mk 2.0.4.60 and the programming environment release level was 3.2.

2.0 Local memory bandwidth

The peak bandwidth from local memory to the microprocessor within a CRAY T3E node is rated at 1200 MB/sec. This rate is based on a peak transfer rate of 128 bits (16 bytes) per system clock along the processor bus, or 32 bits per system clock along each of the four channels between the memory controller chips and the main control chip, and a system clock rate of 75 MHz. The theoretical peak sustained bandwidth is 80% of this, or 960 MB/sec.
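Spelling out the arithmetic:

        16 bytes/clock x 75 MHz = 1200 MB/s   (theoretical peak)
        0.80 x 1200 MB/s        =  960 MB/s   (peak sustained)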

The greatest measured bandwidths are obtained by assembly-coded kernels or library routines. Simple load/store tests are best written in assembly language because high-level languages cannot easily express one-way data transfers except in the context of some other work. Table 1 shows the best measured bandwidth rates for cacheable loads and stores. The effect of the on-chip caches was minimized for these tests by loading or storing unit-stride vectors of millions of words until an asymptotic rate was reached. Cacheable store rates appear slow because the write-allocate secondary cache requires that stored data first be loaded into the cache, then updated and marked dirty, to be flushed to memory at some later time; the reported rate therefore represents only half of the memory traffic. The load/store operation assumes different addresses for the load and store and is equivalent to a vector copy. Experiments with the relative memory addresses of the streams showed that the best performance was obtained by separating the arrays by a multiple of 8192 words, plus an additional 64 or 128 words to avoid cache conflicts.
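For illustration, one layout that achieves this separation (the length 2**21 is a hypothetical choice that makes the array size an exact multiple of 8192 words):

      INTEGER N, NPAD
      PARAMETER ( N = 2097152, NPAD = 64 )
      REAL X(N), PAD(NPAD), Y(N)
      COMMON /BUF/ X, PAD, Y
C     X and Y begin N + NPAD = 256*8192 + 64 words apart, so the two
C     streams do not conflict in the secondary cache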

Table 1: Measured load/store rates [MB/s]

  Operation                1 stream   2 streams   3 streams   4 streams
  cacheable load              453        815         873         865
  cacheable store             302        356         332         332
  cacheable load + store        -        515           -         513

Table 2 shows the best measured bandwidth rates for uncached loads and stores (typically called GETs and PUTs). These rates were measured using the one-sided E-register GET and PUT routines from benchlib ([2]) and the SHMEM library routines SHMEM_GET and SHMEM_PUT. In this context, "one-sided" means a transfer from local memory into the E-registers, or from the E-registers to local memory. A one-sided GET is a meaningless operation, but a one-sided PUT could have application in initializing a data area to a constant value. Because the data never makes it to the Alpha microprocessor during an E-register operation, the uncached load and store rates are useful only for modeling memory-to-memory transfers such as a vector copy or a matrix transpose.

Table 2: Measured GET/PUT rates

  Operation                          Rate [MB/s]
  uncached GET                           598
  uncached PUT                           532
  uncached GET+PUT via benchlib          582
  uncached GET+PUT via SHMEM_GET         668
  uncached GET+PUT via SHMEM_PUT         609

A somewhat more standard measure of "sustainable" local memory bandwidth is the STREAM benchmark ([10]). The single-processor STREAM benchmark program evaluates the performance of the local memory system on the following unit-stride operations:

  Name    Operation                     Bytes transferred (at 8 bytes/word)
  Copy    c(1:n) = a(1:n)               16*n
  Scale   b(1:n) = s*c(1:n)             16*n
  Add     c(1:n) = a(1:n) + b(1:n)      24*n
  Triad   a(1:n) = b(1:n) + s*c(1:n)    24*n

The value of n is chosen large enough to exceed the size of any local caches. Optimization by use of assembly language kernels or library routines is not allowed. When STREAM benchmark results are given for parallel systems, they include the time to perform the local memory operations and the time to synchronize all the participating processors or processes.
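For reference, the Triad operation is just the following Fortran loop; STREAM times repeated executions of such loops on vectors too large to cache and credits Triad with 24*n bytes of traffic per pass:

      DO I = 1, N
         A(I) = B(I) + S*C(I)
      END DO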

Table 3 shows the best measured STREAM benchmark rates for each operation. The size n was 2000000 elements for all tests, but different values of the "offset" parameter, which helps to control the relative memory addresses, were optimal for the different cases.

Table 3: Single-processor STREAM benchmark results

  Function   Offset   Compiler options                       Rate [MB/s]
  Copy         7040   f90 -O3,unroll2,pipeline2 -dp              520
  Scale        7104   f90 -O3,unroll2,pipeline2 -dp              517
  Add          7040   f90 -O3,unroll2,pipeline2 -dp -a pad       611
  Triad       10000   f90 -O3,unroll2,pipeline2 -dp              622

3.0 Inter-processor bandwidth

The COMMS benchmarks from the ParkBench suite ([4]) measure inter-processor communication rates. Two versions of these benchmarks are included in the benchmark suite, one using the PVM message-passing library and one using MPI. In addition, we implemented the benchmarks using the SHMEM data-passing library, which has the lowest latency and the highest bandwidth of the supported messaging libraries on the CRAY T3E.
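The COMMS1 test is a two-PE ping-pong: one PE sends a message, the other echoes it back, and the round-trip times are fitted to a latency/bandwidth model. A minimal sketch of the pattern in Fortran with MPI (the message length, repetition count, and variable names are our own choices; error checking and buffer initialization are omitted):

      PROGRAM PINGPONG
      INCLUDE 'mpif.h'
      INTEGER N, NREP
      PARAMETER ( N = 65536, NREP = 100 )
      DOUBLE PRECISION BUF(N), T0, T1
      INTEGER MYPE, IREP, IERR, STAT(MPI_STATUS_SIZE)
      CALL MPI_INIT( IERR )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYPE, IERR )
      T0 = MPI_WTIME()
      DO IREP = 1, NREP
         IF( MYPE.EQ.0 ) THEN
            CALL MPI_SEND( BUF, N, MPI_DOUBLE_PRECISION, 1, 0,
     &                     MPI_COMM_WORLD, IERR )
            CALL MPI_RECV( BUF, N, MPI_DOUBLE_PRECISION, 1, 1,
     &                     MPI_COMM_WORLD, STAT, IERR )
         ELSE IF( MYPE.EQ.1 ) THEN
            CALL MPI_RECV( BUF, N, MPI_DOUBLE_PRECISION, 0, 0,
     &                     MPI_COMM_WORLD, STAT, IERR )
            CALL MPI_SEND( BUF, N, MPI_DOUBLE_PRECISION, 0, 1,
     &                     MPI_COMM_WORLD, IERR )
         END IF
      END DO
      T1 = MPI_WTIME()
C     One-way time per message is (T1-T0)/(2*NREP); dividing the
C     message length in bytes (8*N) by that time gives the bandwidth.
      IF( MYPE.EQ.0 ) WRITE(*,*) 8.0D0*N*2*NREP / (T1-T0), ' bytes/s'
      CALL MPI_FINALIZE( IERR )
      END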

Latency and bandwidth data extracted from the COMMS1 benchmark results are shown in Table 4. For the MPI tests, setting the environment variable MPI_BUFFER_MAX to 2048, thereby limiting buffering to messages less than 2048 bytes, improved the performance by 75% for the larger transfers. For the PVM tests, setting the environment variable PVM_DATA_MAX to 32000 (as suggested in [5]) improved the performance for messages up to that length, but did not affect the asymptotic rate.

Table 4: Measured latency and bandwidth of the message-passing libraries

  Library   Time for zero-length message [microsec]   Bandwidth [MB/s]
  MPI                        14                             315
  PVM                        11                             154
  SHMEM                       2                             418

The SYNCH1 benchmark measures the overhead for global synchronization by measuring the rate at which a barrier statement can be executed as a function of the number of processes (nodes) taking part in the global barrier synchronization. The SYNCH1 benchmark repeats a sequence of 10 BARRIER statements 1000 times.
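A sketch of the SHMEM version of this measurement (SHMEM_BARRIER_ALL is the library's global barrier; IRTC, the Cray real-time clock function, is assumed here as the timer, and on UNICOS/mk every PE enters the main program automatically):

      PROGRAM SYNCH1
C     default INTEGER is 64-bit on the CRAY T3E
      INTEGER T0, T1, IRTC, IREP, IB
      T0 = IRTC()
      DO IREP = 1, 1000
         DO IB = 1, 10
            CALL SHMEM_BARRIER_ALL()
         END DO
      END DO
      T1 = IRTC()
C     10000 barriers were executed: (T1-T0)/10000 clock ticks per
C     barrier, divided by the clock rate, gives the times in Table 5
      WRITE(*,*) (T1-T0)/10000
      END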

Table 5: Barrier times from the message-passing libraries

  NPES   MPI barrier [usec]   PVM barrier [usec]   SHMEM barrier [usec]
    2           4.8                  3.2                  1.07
    4           9.3                  6.1                  1.31
    8          14.                   9.0                  1.55
   16          18.                  11.9                  1.54
   32          23.                  14.6                  1.78
   64          27.                  17.5                  2.02

4.0 Single processor performance

As on previous CRAY systems, the best single-processor performance for computational kernels on the CRAY T3E is obtained by the use of library routines. The CRAY scientific library (libsci) includes optimized single-processor versions of the BLAS (the Basic Linear Algebra Subprograms from netlib), LAPACK (a standard linear algebra package), and FFTs. Compiler optimization options enable vectorization of selected math library (libm) functions, and an alternate math library (libmfastv) provides higher-performing versions of some of these functions. Certain memory-to-memory operations, such as block copies, matrix transposes, data initialization, and gather/scatter operations, can be optimized by using Shared Memory (SHMEM) library routines, specifying the local processor for both the target and source. Finally, the unsupported benchlib library ([2]) provides finer control over memory-to-memory transfers by direct manipulation of the E-registers.

The libsci BLAS can attain near-peak performance when operating on data in the cache. To illustrate, we performed some basic vector operations (1-norm, 2-norm, vector sum, and dot product) using 64-bit Level 1 BLAS with vectors sized to fit in the cache, and repeated the tests many times to get an in-cache performance rate. The results are shown in Table 6. The rate for SNRM2 is about the same as for SASUM, even though SNRM2 does twice as much work, because SNRM2 uses a 2-pass algorithm to avoid underflow or overflow during the sum of squares ([3]).
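As a sketch of how such in-cache rates are measured (the length N = 4096 and repetition count are our choices: two 4096-word vectors occupy 64 KB of the 96 KB Scache, and on the T3E the default REAL is 64 bits):

      PROGRAM INCACHE
      INTEGER N, NREP, I, IREP
      PARAMETER ( N = 4096, NREP = 10000 )
      REAL X(N), Y(N), S, SDOT
      EXTERNAL SDOT
      DO I = 1, N
         X(I) = 1.0
         Y(I) = 2.0
      END DO
C     time the following loop; rate = 2.0*N*NREP flops / elapsed time.
C     The first call loads X and Y into the cache; subsequent calls
C     find them cache-resident.
      S = 0.0
      DO IREP = 1, NREP
         S = S + SDOT( N, X, 1, Y, 1 )
      END DO
      WRITE(*,*) S
      END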

Table 6: In-cache BLAS performance

  Subroutine   Operation        Data type   In-cache rate [Mflop/s]
  SASUM        a <- || x ||_1   real                  365
  SNRM2        a <- || x ||_2   real                  361
  SAXPY        y <- a*x + y     real                  523
  CAXPY        y <- a*x + y     complex               943
  SDOT         a <- x' * y      real                  979
  CDOTU        a <- x' * y      complex              1175

Compared to other distributed memory systems, the CRAY T3E has a relatively small cache on each node. To highlight the advantages of a larger cache, some vendors will only show the performance of library routines from the cache. However, when data is not already cached, the asymptotic performance rate is a better predictor of the performance that will actually be achieved. Table 7 shows asymptotic performance rates for representative libsci BLAS operations on 64-bit vectors and matrices.

Table 7: Asymptotic BLAS performance

  Subroutine   Problem size   Rate [Mflop/s]
  SNRM2            30000000        102
  SAXPY            15000000         61
  CAXPY             7500000        122
  SDOT             15000000        102
  CDOTU             7500000        164
  SGER                 2000         81
  SGEMV                2000        156
  STRSV                2000         85
  SGEMM                2000        722
  STRSM                2000        524

Besides the Basic Linear Algebra Subprograms, the Cray scientific library also contains software from LAPACK (a Fortran library of linear algebra software), special solvers such as first- and second-order linear recurrence solvers, and FFTs. A sample of the performance of some of the block algorithms in LAPACK, taken directly from the LAPACK timing program, is shown in Table 8. Although these results are significantly below the matrix multiply rate and could probably be improved, they may be more indicative of the performance of the Level 3 BLAS in the context of a user program. Table 9 shows the performance of the FFT library routine CCFFT, including both an in-cache performance rate and a rate with all the vectors starting from memory. The megaflop rate is based on an operation count of 5 n log( n ) for each complex-to-complex FFT. Parallel versions of some of this software are also available through the ScaLAPACK library and several well-optimized 2-D and 3-D FFTs.

Table 8: LAPACK routine performance for different values of N [Mflop/s]

  Subroutine   N=100   N=200   N=300   N=400   N=500
  SGEMM          551     632     638     656     679
  SGETRF         256     355     375     422     429
  SPOTRF(U)      188     301     365     412     444
  SPOTRF(L)      187     292     299     349     345
  SGEQRF         185     290     297     348     351
  SGEHRD         126     154     184     208     225
  SSYTRD(U)      159     130     149     163     178
  SSYTRD(L)      137     126     144     165     172
  SGEBRD          98     104     122     137     147
  CGEMM          707     744     755     754     773
  CGETRF         302     414     491     544     581
  CPOTRF(U)      215     289     324     342     353
  CPOTRF(L)      221     293     314     313     308
  CGEQRF         259     257     273     282     293
  CGEHRD         186     209     227     234     238
  CHETRD(U)      185     184     215     238     253
  CHETRD(L)      183     186     219     242     257
  CGEBRD         129     148     172     186     194

Table 9: Performance of CCFFT (complex-to-complex Fast Fourier Transform)

  N       In-cache [Mflop/s]   From memory [Mflop/s]
  64             474                    87
  128            486                   117
  256            569                   182
  512            533                   195
  1024           535                   220
  2048           170                   164
  4096            82                    76
  8192            27                    27
  16384           25                    25
  32768           23                    23
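As a check on the rate computation, take n = 1024 with the logarithm base 2: each transform is credited with 5 * 1024 * 10 = 51,200 floating-point operations, so the 220 Mflop/s from-memory rate corresponds to roughly 233 microseconds per transform.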

A number of common memory-to-memory transfers can be performed more efficiently by using the E-registers to bypass the cache. Examples include block copies, matrix transposes, data initialization, and gather/scatter operations ([2]). Support for this type of data movement is provided through the SHMEM and benchlib libraries or use of the CACHE_BYPASS compiler directive. The next several tables summarize their various implementations and compare their performance to that of regular Fortran using the cache. Some of the operations do not have a straightforward implementation in every method.

Table 10: Block copy implementations

  Fortran (508 MB/s):

      DO I = 1, N
         Y(I) = X(I)
      END DO

  CACHE_BYPASS (582 MB/s):

!DIR$ CACHE_BYPASS X, Y
      DO I = 1, N
         Y(I) = X(I)
      END DO

  Benchlib (582 MB/s):

      DO I = 1, N, 480
         NI = MIN( 480, N-I+1 )
         CALL LGETV( X(I), 1, NI )
         CALL LPUTV( Y(I), 1, NI )
      END DO

  SHMEM (679 MB/s):

      CALL SHMEM_GET( Y, X, N, MYPE )
Table 11: Matrix transpose implementations (N = LDA = 1025)

  Fortran (106 MB/s):

      DO J = 1, N
         DO I = 1, N
            B(I,J) = A(J,I)
         END DO
      END DO

  CACHE_BYPASS (442 MB/s):

      DO J = 1, N
!DIR$ CACHE_BYPASS A, B
         DO I = 1, N
            B(I,J) = A(J,I)
         END DO
      END DO

  Benchlib (547 MB/s):

      DO J = 1, N, 480
         NJ = MIN( N-J+1, 480 )
         DO I = 1, N
            CALL LGETV( A(I,J), LDA, NJ )
            CALL LPUTV( B(J,I), 1, NJ )
         END DO
      END DO
  123 IF( LPUTP().NE.0 ) GOTO 123

  SHMEM (421 MB/s):

      DO J = 1, N
         CALL SHMEM_IGET( B(1,J), A(J,1), 1, LDA, N, MYPE )
      END DO
Table 12: In-place matrix transpose implementations (N = 1025)

  Fortran (151 MB/s):

      DO J = 1, N-1
         DO I = J+1, N
            TMP = A(I,J)
            A(I,J) = A(J,I)
            A(J,I) = TMP
         END DO
      END DO

  CACHE_BYPASS (167 MB/s):

      DO J = 1, N-1
!DIR$ CACHE_BYPASS A
         DO I = J+1, N
            TMP = A(I,J)
            A(I,J) = A(J,I)
            A(J,I) = TMP
         END DO
      END DO

  Benchlib (465 MB/s):

      DO J = 1, N-1
         DO I = 1, N-J, 240
            NI = MIN( (N-J)-I+1, 240 )
            CALL LGETVO( A(J+I,J), 1, NI, 0 )
            CALL LGETVO( A(J,J+I), LDA, NI, 240 )
            CALL LPUTVO( A(J,J+I), LDA, NI, 0 )
            CALL LPUTVO( A(J+I,J), 1, NI, 240 )
         END DO
      END DO
  123 IF( LPUTP().NE.0 ) GOTO 123

  SHMEM (321 MB/s):

      DO J = 1, N-1
         CALL SHMEM_IGET( B, A(J,J+1), 1, LDA, N-J, MYPE )
         CALL SHMEM_IGET( A(J,J+1), A(J+1,J), LDA, 1, N-J, MYPE )
         CALL SHMEM_PUT( A(J+1,J), B, N-J, MYPE )
      END DO
Table 13: Data initialization implementations

  Fortran (332 MB/s):

      DO I = 1, N
         X(I) = SUM
      END DO

  CACHE_BYPASS (331 MB/s):

!DIR$ CACHE_BYPASS X
      DO I = 1, N
         X(I) = SUM
      END DO

  Benchlib (530 MB/s):

      CALL LSETV( X, 1, N, SUM )
  123 IF( LPUTP().NE.0 ) GO TO 123
Table 14: Vector gather implementations (N = 100000)

  Fortran:

      DO I = 1, NGATH
         X(I) = A(INDEX(I))
      END DO

      NGATH:         10   100   1000   10000
      Rate [MB/s]:   23    52     61      67

  CACHE_BYPASS:

!DIR$ CACHE_BYPASS A
      DO I = 1, NGATH
         X(I) = A(INDEX(I))
      END DO

      NGATH:         10   100   1000   10000
      Rate [MB/s]:   21   111    218     237

  Benchlib:

      DO I = 1, NGATH, 480
         NI = MIN( NGATH-I+1, 480 )
         CALL LGATH( A(0), INDEX(I), NI )
         CALL LPUTV( X(I), 1, NI )
      END DO
  123 IF( LPUTP().NE.0 ) GOTO 123

      NGATH:         10   100   1000   10000
      Rate [MB/s]:   23   150    271     310

  SHMEM:

      CALL SHMEM_IXGET( X, A(0), INDEX, NGATH, MYPE )

      NGATH:         10   100   1000   10000
      Rate [MB/s]:   24   134    202     273

5.0 Compiler optimizations

In order to evaluate the impact of different compiler options on single-processor performance, we conducted a study of the Livermore Fortran Kernels (LFK), a collection of 24 computational kernels developed at the Lawrence Livermore National Laboratory, each run at three different sets of DO spans. The LFK are a favorite test collection of compiler writers, and reasonable performance can usually be obtained by compiler options alone. We also experimented with hand optimization of selected kernels, with modest success.

The LFK authors stipulate that statistics should be quoted from the summary table of 72 timings in the output file. The baseline performance of the LFK tests, compiled simply with f90 -O2, was as follows:


         Maximum   Rate =    596.9183 Mega-Flops/Sec.
         Quartile  Q3   =    157.0089 Mega-Flops/Sec.
         Average   Rate =    143.0450 Mega-Flops/Sec.
         GEOMETRIC MEAN =    115.8271 Mega-Flops/Sec.
         Median    Q2   =    103.9888 Mega-Flops/Sec.
         Harmonic  Mean =     95.1324 Mega-Flops/Sec.
         Quartile  Q1   =     84.7023 Mega-Flops/Sec.
         Minimum   Rate =     19.5352 Mega-Flops/Sec.

We added compiler options one at a time in order to study the effects of each on performance. These experiments can be summarized by looking at the geometric means from each test:

  Options                                         Geometric mean [Mflop/s]
  f90 -dp -O3                                               122
  f90 -dp -O3,unroll2                                       147
  f90 -dp -O3,unroll2,pipeline2                             151
  f90 -dp -O3,unroll2,pipeline2 -a pad                      142
  f90 -dp -O3,unroll2,pipeline2 -lmfastv                    151
  f90 -dp -O3,unroll2,pipeline2,split2 -lmfastv             147

Although there was no measurable difference from adding the -lmfastv option, it is effective in some situations ([2]), so we quote the full statistics for the case f90 -dp -O3,unroll2,pipeline2 -lmfastv:


         Maximum   Rate =    632.9465 Mega-Flops/Sec.
         Quartile  Q3   =    214.1753 Mega-Flops/Sec.
         Average   Rate =    189.2878 Mega-Flops/Sec.
         GEOMETRIC MEAN =    151.3626 Mega-Flops/Sec.
         Median    Q2   =    147.5358 Mega-Flops/Sec.
         Harmonic  Mean =    122.4048 Mega-Flops/Sec.
         Quartile  Q1   =     97.0975 Mega-Flops/Sec.
         Minimum   Rate =     31.2659 Mega-Flops/Sec.

Previous studies have indicated that loop splitting is best applied on a loop-by-loop basis, rather than globally through use of the compiler's -Osplit flag ([2]). Also, the compiler is typically unable to recognize when a scientific library substitution can be made. In the remainder of this section, specific hand optimizations to individual kernels are described that illustrate further opportunities for performance improvements.

Kernel 3: library substitution

Kernel 3 consists of the following three lines of code:


          Q= 0.000d0
          DO 3 k= 1,n
        3      Q= Q + Z(k) * X(k)

It can be replaced by a library call:


          Q = SDOT( N, Z, 1, X, 1 )

The library routine is faster than the compiler-generated code when the vectors X and Z are in the cache because the library routine is more effective at hiding the latency of a secondary cache (Scache) load. In the LFK test suite, the vectors do become cache-resident because each test is executed several times. The statistics for kernel 3 at each DO span length are as follows:


 KERNEL FLOPS  MICROSEC  MFLOP/SEC SPAN WEIGHT  CHECK-SUMS            PRECIS
 ------ -----  --------  --------- ---- ------  ---------------------- -----
  3 1.598E+06 8.714E+03    183.439   27   1.00  1.0555606580531002E-01 16.90
  3 2.141E+06 8.251E+03    259.508  101   2.00  3.9489293708756362E-01 16.90
  3 1.802E+06 6.233E+03    289.073 1001   1.00  3.9140054768099826E+00 16.90

After optimization:


  3 1.598E+06 7.190E+03    222.323   27   1.00  1.0555606580531003E-01 16.90
  3 2.141E+06 4.547E+03    470.892  101   2.00  3.9489293708756368E-01 16.90
  3 1.802E+06 2.142E+03    841.000 1001   1.00  3.9140054768099817E+00 16.65

Similarly, Kernel 24 can be replaced by a call to ISMIN with a small increase in performance.
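Kernel 24 finds the location of the first minimum of a one-dimensional array, so the substitution is a single call (a sketch, assuming the libsci convention that ISMIN returns the index of the first occurrence of the minimum value):


          m = ISMIN( n, X, 1 )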

Kernel 13: cache bypass

Kernel 13, a 2-D particle-in-cell computation, benefited from the common block padding option, -a pad, but the performance of the full LFK benchmark set degraded slightly with this option. The kernel is


                fw= 1.000d0
 1013 DO  13     k= 1,n
                i1= P(1,k)
                j1= P(2,k)
                i1=       1 + MOD2N(i1,64)
                j1=       1 + MOD2N(j1,64)
            P(3,k)= P(3,k)  + B(i1,j1)
            P(4,k)= P(4,k)  + C(i1,j1)
            P(1,k)= P(1,k)  + P(3,k)
            P(2,k)= P(2,k)  + P(4,k)
                i2= P(1,k)
                j2= P(2,k)
                i2=            MOD2N(i2,64)
                j2=            MOD2N(j2,64)
            P(1,k)= P(1,k)  + Y(i2+32)
            P(2,k)= P(2,k)  + Z(j2+32)
                i2= i2      + E(i2+32)
                j2= j2      + F(j2+32)
          H(i2,j2)= H(i2,j2) + fw
   13 CONTINUE

A contributing factor to the poor performance of this kernel is the non-unit-stride pattern of access to the arrays B and C. The compiler always generates cacheable loads because it cannot tell what is in cache and what is not, but we can request non-cached access by preceding the above DO loop with the directive


!dir$ cache_bypass b, c

The baseline performance of kernel 13 was


 KERNEL FLOPS  MICROSEC  MFLOP/SEC SPAN WEIGHT  CHECK-SUMS            PRECIS
 ------ -----  --------  --------- ---- ------  ---------------------- -----
 13 1.389E+06 3.815E+04     36.407    8   1.00  8.7535991110238361E+09 16.90
 13 1.837E+06 4.805E+04     38.230   32   2.00  3.0307123384693829E+10 16.90
 13 1.613E+06 4.347E+04     37.102   64   1.00  4.5614594609595291E+10 16.90

With cache_bypass:


 13 1.389E+06 4.971E+04     27.940    8   1.00  8.7535991110238361E+09 16.90
 13 1.837E+06 3.972E+04     46.247   32   2.00  3.0307123384693829E+10 16.90
 13 1.613E+06 3.392E+04     47.541   64   1.00  4.5614594609595291E+10 16.90

The degradation for the smallest DO span is expected because the non-cached access requested by the cache_bypass directive will force cache-resident elements of B and C to be loaded from memory instead.

Kernel 14: loop combining

Kernel 14 is a 1-D particle-in-cell code consisting of three loops from 1 to n. The first loop has 7 streams and accesses two additional arrays (EX and DEX) in a non-sequential order. The second loop also has 7 streams, while the third uses only two, with some non-unit-stride stores to a third array. In general, one wants to limit the number of streams to 6 or fewer, to match the six hardware stream buffers. In this case, however, there was a lot of reuse within a loop iteration, so stripmining the loops and combining the three loops into two was the most beneficial transformation.

The original code looked like this:


 1014 DO   141  k= 1,n
            VX(k)= 0.0d0
            XX(k)= 0.0d0
            IX(k)= INT(  GRD(k))
            XI(k)= REAL( IX(k))
           EX1(k)= EX   ( IX(k))
          DEX1(k)= DEX  ( IX(k))
 141  CONTINUE
c
      DO   142  k= 1,n
            VX(k)= VX(k) + EX1(k) + (XX(k) - XI(k))*DEX1(k)
            XX(k)= XX(k) + VX(k)  + FLX
            IR(k)= XX(k)
            RX(k)= XX(k) - IR(k)
            IR(k)= MOD2N(  IR(k),2048) + 1
            XX(k)= RX(k) + IR(k)
 142  CONTINUE
c
      DO  14    k= 1,n
      RH(IR(k)  )= RH(IR(k)  ) + fw - RX(k)
      RH(IR(k)+1)= RH(IR(k)+1) + RX(k)
  14  CONTINUE

The optimized code is as follows:


      do kk = 1, n, 128
         kn = min(n-kk+1,128)
      DO   141  k = kk, kk+kn-1
            VX(k)= 0.0d0
            XX(k)= 0.0d0
            IX(k)= INT(  GRD(k))
            XI(k)= REAL( IX(k))
           EX1(k)= EX   ( IX(k))
          DEX1(k)= DEX  ( IX(k))
            VX(k)= VX(k) + EX1(k) + (XX(k) - XI(k))*DEX1(k)
            XX(k)= XX(k) + VX(k)  + FLX
 141  CONTINUE
c
      DO   142  k = kk, kk+kn-1
            IR(k)= XX(k)
            RX(k)= XX(k) - IR(k)
            IR(k)= MOD2N(  IR(k),2048) + 1
            XX(k)= RX(k) + IR(k)
            RH(IR(k)  )= RH(IR(k)  ) + fw - RX(k)
            RH(IR(k)+1)= RH(IR(k)+1) + RX(k)
 142  CONTINUE
c
      end do

The baseline performance was


 KERNEL FLOPS  MICROSEC  MFLOP/SEC SPAN WEIGHT  CHECK-SUMS            PRECIS
 ------ -----  --------  --------- ---- ------  ---------------------- -----
 14 1.901E+06 3.049E+04     62.339   27   1.00  1.9943880114661271E+06 16.90
 14 2.222E+06 3.297E+04     67.387  101   2.00  2.3107401197908435E+07 16.90
 14 2.202E+06 7.043E+04     31.266 1001   1.00  2.1783317062516003E+09 16.65

Improvements with the optimized code ranged from about 20% to 70%:


 14 1.901E+06 1.813E+04    104.834   27   1.00  1.9943880114661271E+06 16.90
 14 2.222E+06 2.062E+04    107.755  101   2.00  2.3107401197908435E+07 16.90
 14 2.202E+06 5.610E+04     39.254 1001   1.00  2.1783317062516003E+09 16.65

Although dramatic in a few cases, the overall effect of the hand optimizations was only about a 5% improvement in the average or mean performance rates:


         Maximum   Rate =    831.2763 Mega-Flops/Sec.
         Quartile  Q3   =    215.1152 Mega-Flops/Sec.
         Average   Rate =    198.2022 Mega-Flops/Sec.
         GEOMETRIC MEAN =    157.3169 Mega-Flops/Sec.
         Median    Q2   =    152.3032 Mega-Flops/Sec.
         Harmonic  Mean =    128.0855 Mega-Flops/Sec.
         Quartile  Q1   =    100.2081 Mega-Flops/Sec.
         Minimum   Rate =     29.4546 Mega-Flops/Sec.

6.0 Multiple processor computational performance

The LINPACK benchmark for scalable parallel systems measures the greatest sustainable performance for the solution of a dense system of linear equations ([7]). The problem size and implementation are left to the implementor, except that the algorithm used must be Gaussian elimination with partial pivoting. The popularity of the benchmark stems from its use in ranking computers for the twice-yearly list of the "Top 500" supercomputers ([8]). Factoring a matrix (the main part of the LINPACK benchmark) is a particularly cache-friendly operation because it is dominated by matrix-matrix multiplications. Although performance comparable to that of the LINPACK benchmark is seldom replicated in real applications, it is a good test of new systems and a first validation of a scalable design. The CRAY T3E-1200 exhibits near-linear scalability on this benchmark, as can be seen from Table 15 (*).

Table 15: LINPACK benchmark results

  NPES   Rmax [Gflop/s]     Nmax    N_1/2   Rpeak [Gflop/s]   % peak
   256       212           125962   11520        307             69
   128       106            89088    7488        154             69
    64        53.1          62976    4992         76.8           69
    32        26.6          44544    3456         38.4           69
    16        13.4          31680    2304         19.2           70
     8         6.67         22272    1536          9.6           69
     4         3.37         15936     960          4.8           70
     2         1.68         11040     576          2.4           70

(*) LINPACK benchmark results are from the LINPACK benchmark report ([7]) and were not run at NESC.

The NAS parallel benchmarks (NPB) are a collection of eight benchmarks developed as part of the Numerical Aerodynamic Simulation (NAS) program at the NASA Ames Research Center to measure and compare the performance of highly parallel computers. These benchmarks, which are derived from computational fluid dynamic codes, have become a standard measure of supercomputer performance. Five of the benchmarks (EP, CG, MG, FT, and IS) represent computational kernels, while the other three (LU, SP, and BT) represent simplified applications. Three problem classes of increasing size are specified for each benchmark: class A, class B, and class C. Although source code implementations of the benchmarks in MPI now exist, the original specification, now called NPB 1.0, was a "pencil and paper" benchmark, in which the actual implementation was left to the computer vendor.

The NPB 1.0 results are the most highly optimized versions, and it is those results (courtesy of the Cray Research benchmarking group) that are presented in Table 16, Table 17, and Table 18. The benchmarks vary widely in their communication requirements and patterns: EP has almost no communication, MG uses ghost-cell update communication, and FT and BT use data set transpose algorithms. The drop-off in FT performance occurs when we run out of "planes" to do in parallel and have to do domain decomposition within a plane, which adds another communication step.

Table 16: NAS Parallel Benchmarks, Class A

Time [seconds]:

   PEs     ep     cg     mg     ft     is     lu     sp     bt
    32    1.8    1.5    1.1    1.3    0.7   12.8   21.7   24.5
    64    0.9    0.6    0.6    0.6    0.3    7.4   11.5   12.9
   128    0.4    0.4    0.3    0.3    0.2    4.4    7.3    7.0
   256    0.2    0.3    0.2    0.2    0.1    3.1    4.6    3.9
   512    0.1    0.2    0.1    0.2    0.1    2.3    2.6    2.2
  1024    0.1    0.2    0.1    0.1    0.2    2.0    1.5    1.3

Rate per PE [Mflop/s]:

   PEs     ep     cg     mg     ft     is     lu     sp     bt
    32    471     32    113    138     37    158    147    232
    64    471     36    105    136     37    137    138    220
   128    470     28     97    134     34    114    109    202
   256    467     21     88    124     27     82     86    180
   512    463     14     67     72     11     55     78    163
  1024    451      9     49     62      4     31     64    137

Table 17: NAS Parallel Benchmarks, Class B

Time [seconds]:

   PEs     ep     cg     mg     ft     is     lu     sp     bt
    32    7.1   50.4    5.1   15.0    2.8   48.5   82.4  106.7
    64    3.5   27.7    2.7    7.6    1.7   26.5   44.0   56.4
   128    1.8   17.3    1.5    3.9    0.8   16.8   25.9   29.4
   256    0.9    9.0    0.8    2.1    0.4    9.9   14.8   15.5
   512    0.4    5.3    0.5    1.6    0.3    6.5    9.0    8.9
  1024    0.2    3.1    0.4    1.0    0.2    4.9    5.1    5.0

Rate per PE [Mflop/s]:

   PEs     ep     cg     mg     ft     is     lu     sp     bt
    32    446     34    116    149     35    206    170    211
    64    446     31    108    146     30    188    159    200
   128    446     25     98    144     30    149    135    192
   256    446     24     87    134     29    127    118    182
   512    444     20     68     85     22     96     97    158
  1024    441     17     49     73     12     64     86    140

Table 18: NAS Parallel Benchmarks, Class C

Time [seconds]:

   PEs     ep      cg     mg     ft     is      lu      sp      bt
    32   28.2   125.8   41.1   65.5   13.4   199.4   326.8   561.1
    64   14.1    68.0   19.9   33.3    8.8   100.9   171.6   285.2
   128    7.1    40.5   10.3   16.8    4.7    52.0    93.9   145.6
   256    3.5    30.6    5.2    8.9    2.5    29.6    50.1    75.7
   512    1.8    16.7    2.8    4.6    1.4    17.0    28.4    40.3
  1024    0.9     7.3    1.6    3.6    0.9    11.2    15.8    21.6

Rate per PE [Mflop/s]:

   PEs     ep     cg     mg     ft     is     lu     sp     bt
    32    447     34    118    193     30    320    139    170
    64    447     31    122    189     23    316    132    167
   128    447     26    118    188     21    306    121    164
   256    447     17    116    177     20    269    113    157
   512    446     16    107    172     19    234    100    148
  1024    445     18     92    110     14    178     90    138

7.0 I/O performance

The subject of distributing the I/O of a parallel program is discussed only in general terms in this paper (see [5], [6], and [9] for more details). On a CRAY T3E system, the processors share a common file space, and the programmer must decide whether one or multiple processors will access a file. If only one processor handles I/O, then all other processors must communicate with that one in order to access or store data on disk. If multiple processors handle the I/O in the program, then either each processor must access its own file, or the Cray global I/O layer must be used to safely buffer data and coordinate the accesses of multiple processes to the same file. On some file systems, files may be striped across different disk partitions in order to increase the rate of access observed by any one processor.

Each Fortran OPEN statement issued by a processor has its own file descriptor and counts against the per-job limit on open files (currently 256 at NESC), even if it is accessing the same file as another processor. It is easy to exceed this limit when using the Cray global I/O layer if every processor opens several files. That is why designating one processor as the I/O processor, and using the fast inter-processor network to move data as needed, is often seen as the best alternative.
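As an illustration of the single-I/O-processor approach, the sketch below has PE 0 pull each PE's buffer across the network with SHMEM_GET and write it to one file (the file name, unit number, and buffer length are hypothetical; BUF is placed in a common block because the source of a SHMEM_GET must be symmetric, i.e., at the same address on every PE):

      PROGRAM IOPE0
      INTEGER N
      PARAMETER ( N = 32768 )
      REAL BUF(N)
      COMMON /SYMM/ BUF
      REAL TMP(N)
      INTEGER MYPE, NPES, IPE, SHMEM_MY_PE, SHMEM_N_PES
      MYPE = SHMEM_MY_PE()
      NPES = SHMEM_N_PES()
C     ... each PE fills its local copy of BUF ...
      CALL SHMEM_BARRIER_ALL()
      IF( MYPE.EQ.0 ) THEN
         OPEN( 10, FILE='out.dat', FORM='UNFORMATTED' )
         DO IPE = 0, NPES-1
C           pull PE IPE's buffer across the network, then write it
            CALL SHMEM_GET( TMP, BUF, N, IPE )
            WRITE( 10 ) TMP
         END DO
         CLOSE( 10 )
      END IF
      CALL SHMEM_BARRIER_ALL()
      END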

In order to quantify the I/O performance of the CRAY T3E, we will look at two metrics: the time for a global file OPEN, as would be necessary when doing global I/O to a particular file, and the maximum read and write bandwidth to disk from a single processor. The latter of these is somewhat specific to the current NESC configuration, but it is not atypical for a T3E system.

The command "df -p" shows the partitions available on the currently mounted filesystems. A filesystem with more than one partition may benefit from user-level striping. On the NESC CRAY T3E system at the time of these tests, the following configuration was in place for the /tmp disk:


/tmp (/dev/dsk/tmp     ):   3923683 sectors          505891 I-nodes
                   total:   6250000 sectors          524288 I-nodes

Big file threshold:            32768 bytes
Big file allocation minimum:      24 blocks

Allocation Strategy:    round robin files
                        round robin all user data

Primary partitions allocation unit:         16K byte blocks

part   start     total         free (%)          frags (%)      device
----  --------  --------  -----------------  ----------------  --------
   0         0   6250000   4230096 ( 67.7%)     24 (  0.002%)  tmp1    
   1   6250000   6250000   3872580 ( 62.0%)     35 (  0.004%)  tmp3    
   2  12500000   6250000   4062884 ( 65.0%)     18 (  0.002%)  tmp2    
   3  18750000   6250000   3529172 ( 56.5%)     55 (  0.006%)  tmp4    

Our test programs are taken from the NERSC I/O guide ([9]), modified to write or read an unformatted 100 MB file. As documented in the I/O guide, we use the assign command with options to specify the size of the file in 4 KB blocks (-n 25600), the chunk size or blocking parameter (-q 128), and the global I/O layer (-F bufa). Results for different numbers of partitions (specified with the -p option) are shown in Table 19.

Table 19: Write/read rates to /tmp for a 100 MB file

  assign options                                     Write [MB/s]   Read [MB/s]
  No assign (1 partition)                                32.12          44.50
  assign -n 25600 -q 128 -F bufa:128:2 u:10              40.52          57.79
  assign -n 25600 -q 128 -p 0-1 -F bufa:128:4 u:10       77.03         111.37
  assign -n 25600 -q 128 -p 0-2 -F bufa:128:6 u:10      104.82         162.40
  assign -n 25600 -q 128 -p 0-3 -F bufa:128:8 u:10      106.99         219.53

Table 20 shows the best observed file open times for the 100 MB file in the previous exercise, for the case in which every processor opens the file before it has been created. The time to execute the OPEN statement varied widely from run to run and in some cases was an order of magnitude larger than these times. These data support the NERSC I/O guide's statement that "it could take a significant amount of time to open a large number of large files" ([9]). Indeed, the assign options that optimize the file transfer rate may cause the file open time to exceed the time to read or write the file itself.

Table 20: File open times for a 100 MB file

  NPES   1 partition [sec]   4 partitions [sec]
    1         0.045                 0.94
    2         0.058                 7.4
    4         0.099                14.
    8         0.20                 28.
   16         0.78                 58.
   32         1.7                 115.

Acknowledgements

I am grateful to Jeff Brooks of SGI/Cray Research for some helpful discussions and for providing the three tables of NAS Parallel Benchmark results in Section 6.0.

References

[1] Ed Anderson, Jeff Brooks, Charles Grassl, and Steve Scott, Performance of the CRAY T3E Multiprocessor, Proceedings of SC97, http://www.supercomp.org/sc97/proceedings/TECH/ANDERSON/INDEX.HTM, November 1997.
[2] Ed Anderson, Jeff Brooks, and Tom Hewitt, The Benchmarker's Guide to Single-processor Optimization for CRAY T3E Systems, http://www.sgi.com/t3e/images/benchmark.pdf, June 1997.
[3] Edward Anderson and Mark Fahey, Performance Improvements to LAPACK for the Cray Scientific Library, LAPACK Working Note 126, http://www.netlib.org/lapack/lawns/lawn126.ps, April 1997.
[4] Michael Berry and Roger Hockney, Public International Benchmarks for Parallel Computers, PARKBENCH Committee Report 1, http://www.netlib.org/parkbench/, February 1994.
[5] Cray Research, CRAY T3E Fortran Optimization Guide, SG-2518 3.0, http://www.cray.com/products/software/publications/, 1997.
[6] Cray Research, Application Programmer's I/O Guide, 007-3695-005, http://www.cray.com/products/software/publications/, 1997.
[7] Jack J. Dongarra, Performance of Various Computers Using Standard Linear Equations Software, University of Tennessee report CS-89-85, http://www.netlib.org/benchmark/performance.ps, April 1999.
[8] Jack J. Dongarra, Hans W. Meuer, and Erich Strohmaier, Top500 Supercomputer Sites, University of Tennessee report CS-99-425, http://www.netlib.org/benchmark/top500.html, June 1999.
[9] Richard A. Gerber, I/O on the NERSC Cray T3E, National Energy Research Scientific Computing Center, http://hpcf.nersc.gov/training/tutorials/T3E/IO/, July 1998.
[10] John D. McCalpin, Sustainable Memory Bandwidth in Current High Performance Computers, http://www.cs.virginia.edu/stream/, October 1995.