ARSC T3D Users' Newsletter 102, August 30, 1996
f90 (It's not just the law...)
In case you hadn't heard...
- CF77 will not exist on the T3E.
- CF77 is being phased out on all CRI PVP systems. With the release of Programming Environment 2.0 (PE2.0), CF90 replaces the CF77 compiling system. (ARSC plans to upgrade to PE2.0 later this fall.)
Your Fortran 77 codes should compile under CF90, but you might want to make the switch sooner rather than later.
Numerical Recipes in Fortran 90 Available
[ Taken from a posting to comp.parallel ]
A message from the authors of Numerical Recipes:
Our new book, "Numerical Recipes in Fortran 90: The Art of Parallel
Scientific Computing" and our new "Numerical Recipes Code CDROM" are
both out and available now from Cambridge University Press. We've put
a lot of effort into these, and we hope you like them! Here are brief
descriptions:
"Numerical Recipes in Fortran 90: The Art of Parallel Scientific
Computing", Volume 2 of Fortran Numerical Recipes. This new volume,
intended for use with the existing book (now renamed Numerical Recipes
in Fortran 77), reworks all the Numerical Recipes routines to use
Fortran 90's concise parallel language constructions. Even on single
processor machines, you get the benefit of a slick, modern version of
Fortran, and new conciseness and clarity in the code. There are also
three new chapters on Fortran 90 language features and parallel
programming methods, and an introduction by Michael Metcalf. More
information on the book is available at:
http://nr.harvard.edu/nr/nrf90_blurb.html
Simple Vector Operations
[ One of our T3D users, Dr. Alan Wallcraft of Stennis Space Center, contributes this article. ]
Current "workstation" Fortran compilers seem to do a relatively poor job with simple vector operations on REAL*4. This is a problem because Cray vector codes, Fortran 90 codes, and High Performance Fortran (or CM Fortran) codes contain many such operations. Using optimized BLAS can help, but some vendors don't optimize level-1 BLAS and Cray has minimal support for 32-bit BLAS on the T3D.
There is a need for a standard set of vector subroutines, like BLAS but not just for linear algebra, that could be optimized for each machine.
To illustrate the problem, consider A = S and A = B where A and B are vectors and S a scalar. These operations are quite common, although they can often be avoided by code restructuring. I expected compilers to produce almost optimal code for such operations. However, using REAL*8 assignment instead of REAL*4 is 1.5 to 2 times faster on many machines and this can be achieved using "almost standard" f77.
A = S is the simplest example. Here is the test program. The LOC function is non-standard, but almost always available, and may be either INTEGER*4 or INTEGER*8 depending on the machine. It is used to detect how A is aligned w.r.t. REAL*8 word boundaries.
      PROGRAM WSETST
      IMPLICIT NONE
C
      INTEGER NP,NN
      PARAMETER (NP=22, NN=2**NP)
C
      INTEGER IP,L,N
      REAL*4 A(NN+8)
      REAL*8 SECOND
      REAL*8 T0,T1,T2
C
      REAL*4 ZERO4
      PARAMETER (ZERO4=0.0)
C
C     PROGRAM TIMING A(1:N) = 0.0, WITH A IN CACHE (IF IT FITS).
C
C     R4WSET  - SUBROUTINE USING REAL*4 ASSIGNMENT
C     R4WSET8 - SUBROUTINE USING REAL*8 ASSIGNMENT
C
      DO IP= 1,NP
        N = 2**IP
C
        CALL R4WSET(A,ZERO4,N+8)
C
        T0 = SECOND()
C
        DO L= 1,NN,N
          CALL R4WSET(A(1),ZERO4,N)
          CALL R4WSET(A(2),ZERO4,N)
          CALL R4WSET(A(3),ZERO4,N)
          CALL R4WSET(A(4),ZERO4,N)
          CALL R4WSET(A(5),ZERO4,N)
          CALL R4WSET(A(6),ZERO4,N)
          CALL R4WSET(A(7),ZERO4,N)
          CALL R4WSET(A(8),ZERO4,N)
        ENDDO
C
        T1 = SECOND()
C
        DO L= 1,NN,N
          CALL R4WSET8(A(1),ZERO4,N)
          CALL R4WSET8(A(2),ZERO4,N)
          CALL R4WSET8(A(3),ZERO4,N)
          CALL R4WSET8(A(4),ZERO4,N)
          CALL R4WSET8(A(5),ZERO4,N)
          CALL R4WSET8(A(6),ZERO4,N)
          CALL R4WSET8(A(7),ZERO4,N)
          CALL R4WSET8(A(8),ZERO4,N)
        ENDDO
C
        T2 = SECOND()
        WRITE(6,6000) N,NN*32.E-6/(T1-T0),
     +                  NN*32.E-6/(T2-T1),(T1-T0)/(T2-T1)
      ENDDO
C
 6000 FORMAT(2X,'N = ',I8,
     +       3X,'R4WSET,R4WSET8 =',F8.2,',',F8.2,' MB/s',
     +       3X,'SPEEDUP =',F5.2)
      END
      SUBROUTINE R4WSET(S,W,N)
      IMPLICIT NONE
      INTEGER N
      REAL*4 S(N),W
C
C     S = W.
C
      INTEGER I
C
      DO I= 1,N
        S(I) = W
      ENDDO
      RETURN
      END
      SUBROUTINE R4WSET8(S,W,N)
      IMPLICIT NONE
      INTEGER N
      REAL*4 S(N),W
C
C     S = W.
C
C     LOC IS MACHINE DEPENDENT, ASSUMED TO RETURN ADDRESS IN BYTES.
C
      INTEGER*4 LOC,IS1,I8
*     INTEGER*8 LOC,IS1,I8
      PARAMETER (I8=8)
      REAL*8 W8(1)
      REAL*4 W4(2)
      EQUIVALENCE (W8,W4)
C
      W4(1) = W
      W4(2) = W
      IS1 = LOC(S(1))
      IF (MOD(IS1,I8).EQ.0) THEN
        CALL R8WSET(S(1),W8,N/2)
        S(N) = W
      ELSE
        S(1) = W
        CALL R8WSET(S(2),W8,(N-1)/2)
        S(N) = W
      ENDIF
      RETURN
      END
      SUBROUTINE R8WSET(S,W,N)
      IMPLICIT NONE
      INTEGER N
      REAL*8 S(N),W
C
C     S = W.
C
      INTEGER I
C
      DO I= 1,N
        S(I) = W
      ENDDO
      RETURN
      END
      REAL*8 FUNCTION SECOND()
      IMPLICIT NONE
C
C     EMULATION OF CDC'S SECOND TIMING ROUTINE.
C
*
*     UNIX VERSION
*
      REAL*4 TARRAY(2)
      REAL*4 ETIME
      SECOND = ETIME(TARRAY)
*
*     T3D VERSION
*
*     INTEGER IRTC
*     SECOND = IRTC() * 6.6E-9
      RETURN
      END
On each machine, this is compiled using high optimization (including automatic loop unrolling, but excluding subroutine in-lining).
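The SECOND function above depends on the non-standard ETIME (or IRTC on the T3D). As a minimal sketch of a portable alternative, assuming a Fortran 90 compiler is available, the standard intrinsic SYSTEM_CLOCK can be used instead. The names SECONDW and TSECND are hypothetical and not part of the original test program. Note that this returns wall-clock time rather than CPU time, that its resolution is processor dependent, and that the count wraps around at its maximum value, so it is only a rough stand-in.

```fortran
      PROGRAM TSECND
      IMPLICIT NONE
C
C     HYPOTHETICAL DRIVER, TIMES A SIMPLE LOOP USING SECONDW.
C
      REAL*8  SECONDW
      REAL*8  T0,T1,X
      INTEGER I
C
      T0 = SECONDW()
      X  = 0.0D0
      DO I= 1,1000000
        X = X + SQRT(DBLE(I))
      ENDDO
      T1 = SECONDW()
      WRITE(6,*) 'SUM =',X,'  ELAPSED SECONDS =',T1-T0
      END
      REAL*8 FUNCTION SECONDW()
      IMPLICIT NONE
C
C     WALL-CLOCK SECONDS FROM THE FORTRAN 90 INTRINSIC SYSTEM_CLOCK.
C     HYPOTHETICAL REPLACEMENT FOR SECOND, NOT FROM THE ARTICLE.
C     RETURNS 0.0 IF THE PROCESSOR HAS NO CLOCK (COUNT RATE IS ZERO).
C
      INTEGER ICOUNT,IRATE,IMAX
C
      CALL SYSTEM_CLOCK(ICOUNT,IRATE,IMAX)
      IF (IRATE.GT.0) THEN
        SECONDW = DBLE(ICOUNT)/DBLE(IRATE)
      ELSE
        SECONDW = 0.0D0
      ENDIF
      RETURN
      END
```

Because only differences of two calls are used, the arbitrary starting point of the count does not matter, but a difference taken across a wrap-around will be wrong.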
Cray T3D results:
  N =        2   R4WSET,R4WSET8 =   16.96,   10.49 MB/s   SPEEDUP = 0.62
  N =        4   R4WSET,R4WSET8 =   32.93,   19.88 MB/s   SPEEDUP = 0.60
  N =        8   R4WSET,R4WSET8 =   61.86,   36.51 MB/s   SPEEDUP = 0.59
  N =       16   R4WSET,R4WSET8 =  110.06,   68.32 MB/s   SPEEDUP = 0.62
  N =       32   R4WSET,R4WSET8 =  174.05,  119.27 MB/s   SPEEDUP = 0.69
  N =       64   R4WSET,R4WSET8 =  231.35,  195.69 MB/s   SPEEDUP = 0.85
  N =      128   R4WSET,R4WSET8 =  277.62,  282.63 MB/s   SPEEDUP = 1.02
  N =      256   R4WSET,R4WSET8 =  306.82,  367.51 MB/s   SPEEDUP = 1.20
  N =      512   R4WSET,R4WSET8 =  324.51,  430.44 MB/s   SPEEDUP = 1.33
  N =     1024   R4WSET,R4WSET8 =  334.36,  470.84 MB/s   SPEEDUP = 1.41
  N =     2048   R4WSET,R4WSET8 =  339.38,  492.19 MB/s   SPEEDUP = 1.45
  N =     4096   R4WSET,R4WSET8 =  341.82,  505.49 MB/s   SPEEDUP = 1.48
  N =     8192   R4WSET,R4WSET8 =  343.28,  511.05 MB/s   SPEEDUP = 1.49
  N =    16384   R4WSET,R4WSET8 =  343.96,  514.65 MB/s   SPEEDUP = 1.50
  N =    32768   R4WSET,R4WSET8 =  344.16,  516.32 MB/s   SPEEDUP = 1.50
  N =    65536   R4WSET,R4WSET8 =  344.47,  516.78 MB/s   SPEEDUP = 1.50
  N =   131072   R4WSET,R4WSET8 =  344.56,  517.58 MB/s   SPEEDUP = 1.50
  N =   262144   R4WSET,R4WSET8 =  344.52,  517.85 MB/s   SPEEDUP = 1.50
  N =   524288   R4WSET,R4WSET8 =  344.78,  518.04 MB/s   SPEEDUP = 1.50
  N =  1048576   R4WSET,R4WSET8 =  344.65,  518.17 MB/s   SPEEDUP = 1.50
  N =  2097152   R4WSET,R4WSET8 =  344.74,  518.35 MB/s   SPEEDUP = 1.50
  N =  4194304   R4WSET,R4WSET8 =  344.86,  518.58 MB/s   SPEEDUP = 1.50
SGI Power Challenge results:
  N =        2   R4WSET,R4WSET8 =   22.16,   12.66 MB/s   SPEEDUP = 0.57
  N =        4   R4WSET,R4WSET8 =   54.40,   22.90 MB/s   SPEEDUP = 0.42
  N =        8   R4WSET,R4WSET8 =   99.73,   39.39 MB/s   SPEEDUP = 0.40
  N =       16   R4WSET,R4WSET8 =  140.79,   73.92 MB/s   SPEEDUP = 0.53
  N =       32   R4WSET,R4WSET8 =  227.93,  133.41 MB/s   SPEEDUP = 0.59
  N =       64   R4WSET,R4WSET8 =  330.14,  240.05 MB/s   SPEEDUP = 0.73
  N =      128   R4WSET,R4WSET8 =  425.43,  399.90 MB/s   SPEEDUP = 0.94
  N =      256   R4WSET,R4WSET8 =  497.27,  599.47 MB/s   SPEEDUP = 1.21
  N =      512   R4WSET,R4WSET8 =  543.04,  798.79 MB/s   SPEEDUP = 1.47
  N =     1024   R4WSET,R4WSET8 =  569.36,  957.71 MB/s   SPEEDUP = 1.68
  N =     2048   R4WSET,R4WSET8 =  583.59, 1063.42 MB/s   SPEEDUP = 1.82
  N =     4096   R4WSET,R4WSET8 =  590.88, 1126.19 MB/s   SPEEDUP = 1.91
  N =     8192   R4WSET,R4WSET8 =  594.41, 1160.20 MB/s   SPEEDUP = 1.95
  N =    16384   R4WSET,R4WSET8 =  596.47, 1177.80 MB/s   SPEEDUP = 1.97
  N =    32768   R4WSET,R4WSET8 =  597.23, 1187.54 MB/s   SPEEDUP = 1.99
  N =    65536   R4WSET,R4WSET8 =  597.57, 1191.56 MB/s   SPEEDUP = 1.99
  N =   131072   R4WSET,R4WSET8 =  597.94, 1193.86 MB/s   SPEEDUP = 2.00
  N =   262144   R4WSET,R4WSET8 =  597.76, 1194.96 MB/s   SPEEDUP = 2.00
  N =   524288   R4WSET,R4WSET8 =  590.84, 1196.06 MB/s   SPEEDUP = 2.02
  N =  1048576   R4WSET,R4WSET8 =  492.22, 1063.97 MB/s   SPEEDUP = 2.16
  N =  2097152   R4WSET,R4WSET8 =  132.57,  147.14 MB/s   SPEEDUP = 1.11
  N =  4194304   R4WSET,R4WSET8 =  115.25,  126.86 MB/s   SPEEDUP = 1.10
DEC alpha (DEC2100_A500) results:
  N =        2   R4WSET,R4WSET8 =   59.82,   42.88 MB/s   SPEEDUP = 0.72
  N =        4   R4WSET,R4WSET8 =  154.17,   76.87 MB/s   SPEEDUP = 0.50
  N =        8   R4WSET,R4WSET8 =  202.23,  129.86 MB/s   SPEEDUP = 0.64
  N =       16   R4WSET,R4WSET8 =  309.73,  218.63 MB/s   SPEEDUP = 0.71
  N =       32   R4WSET,R4WSET8 =  392.91,  308.34 MB/s   SPEEDUP = 0.78
  N =       64   R4WSET,R4WSET8 =  399.76,  367.70 MB/s   SPEEDUP = 0.92
  N =      128   R4WSET,R4WSET8 =  404.46,  408.07 MB/s   SPEEDUP = 1.01
  N =      256   R4WSET,R4WSET8 =  408.07,  431.09 MB/s   SPEEDUP = 1.06
  N =      512   R4WSET,R4WSET8 =  334.59,  445.04 MB/s   SPEEDUP = 1.33
  N =     1024   R4WSET,R4WSET8 =  316.13,  439.35 MB/s   SPEEDUP = 1.39
  N =     2048   R4WSET,R4WSET8 =  265.48,  369.67 MB/s   SPEEDUP = 1.39
  N =     4096   R4WSET,R4WSET8 =  214.54,  258.49 MB/s   SPEEDUP = 1.20
  N =     8192   R4WSET,R4WSET8 =  162.94,  195.06 MB/s   SPEEDUP = 1.20
  N =    16384   R4WSET,R4WSET8 =  137.66,  159.90 MB/s   SPEEDUP = 1.16
  N =    32768   R4WSET,R4WSET8 =  127.21,  140.47 MB/s   SPEEDUP = 1.10
  N =    65536   R4WSET,R4WSET8 =  118.96,  128.16 MB/s   SPEEDUP = 1.08
  N =   131072   R4WSET,R4WSET8 =  112.17,  115.56 MB/s   SPEEDUP = 1.03
  N =   262144   R4WSET,R4WSET8 =  104.98,  107.69 MB/s   SPEEDUP = 1.03
  N =   524288   R4WSET,R4WSET8 =  101.64,  101.27 MB/s   SPEEDUP = 1.00
  N =  1048576   R4WSET,R4WSET8 =   98.16,   98.79 MB/s   SPEEDUP = 1.01
  N =  2097152   R4WSET,R4WSET8 =   96.57,   96.98 MB/s   SPEEDUP = 1.00
  N =  4194304   R4WSET,R4WSET8 =   95.43,   95.70 MB/s   SPEEDUP = 1.00
Sun SPARC 20/61 results:
  N =        2   R4WSET,R4WSET8 =   25.78,    9.07 MB/s   SPEEDUP = 0.35
  N =        4   R4WSET,R4WSET8 =   39.02,   12.38 MB/s   SPEEDUP = 0.32
  N =        8   R4WSET,R4WSET8 =   74.87,   17.37 MB/s   SPEEDUP = 0.23
  N =       16   R4WSET,R4WSET8 =  104.07,   23.05 MB/s   SPEEDUP = 0.22
  N =       32   R4WSET,R4WSET8 =  130.78,   30.57 MB/s   SPEEDUP = 0.23
  N =       64   R4WSET,R4WSET8 =  149.31,   53.29 MB/s   SPEEDUP = 0.36
  N =      128   R4WSET,R4WSET8 =  160.56,   89.04 MB/s   SPEEDUP = 0.55
  N =      256   R4WSET,R4WSET8 =  166.90,  134.45 MB/s   SPEEDUP = 0.81
  N =      512   R4WSET,R4WSET8 =  170.30,  179.61 MB/s   SPEEDUP = 1.05
  N =     1024   R4WSET,R4WSET8 =  172.14,  217.24 MB/s   SPEEDUP = 1.26
  N =     2048   R4WSET,R4WSET8 =  172.94,  242.02 MB/s   SPEEDUP = 1.40
  N =     4096   R4WSET,R4WSET8 =  173.50,  255.16 MB/s   SPEEDUP = 1.47
  N =     8192   R4WSET,R4WSET8 =  173.52,  264.75 MB/s   SPEEDUP = 1.53
  N =    16384   R4WSET,R4WSET8 =  173.47,  268.55 MB/s   SPEEDUP = 1.55
  N =    32768   R4WSET,R4WSET8 =  171.82,  267.37 MB/s   SPEEDUP = 1.56
  N =    65536   R4WSET,R4WSET8 =  168.56,  264.91 MB/s   SPEEDUP = 1.57
  N =   131072   R4WSET,R4WSET8 =  162.87,  256.73 MB/s   SPEEDUP = 1.58
  N =   262144   R4WSET,R4WSET8 =  161.47,  247.10 MB/s   SPEEDUP = 1.53
  N =   524288   R4WSET,R4WSET8 =   41.48,   44.59 MB/s   SPEEDUP = 1.07
  N =  1048576   R4WSET,R4WSET8 =   41.47,   44.76 MB/s   SPEEDUP = 1.08
  N =  2097152   R4WSET,R4WSET8 =   41.60,   44.61 MB/s   SPEEDUP = 1.07
  N =  4194304   R4WSET,R4WSET8 =   41.49,   44.62 MB/s   SPEEDUP = 1.08
Sun UltraSPARC 1/140 results:
  N =        2   R4WSET,R4WSET8 =   49.78,   25.36 MB/s   SPEEDUP = 0.51
  N =        4   R4WSET,R4WSET8 =  129.54,   46.50 MB/s   SPEEDUP = 0.36
  N =        8   R4WSET,R4WSET8 =  117.02,   89.79 MB/s   SPEEDUP = 0.77
  N =       16   R4WSET,R4WSET8 =  197.45,  130.32 MB/s   SPEEDUP = 0.66
  N =       32   R4WSET,R4WSET8 =  269.68,  234.06 MB/s   SPEEDUP = 0.87
  N =       64   R4WSET,R4WSET8 =  362.30,  350.17 MB/s   SPEEDUP = 0.97
  N =      128   R4WSET,R4WSET8 =  424.71,  472.96 MB/s   SPEEDUP = 1.11
  N =      256   R4WSET,R4WSET8 =  460.14,  589.28 MB/s   SPEEDUP = 1.28
  N =      512   R4WSET,R4WSET8 =  482.29,  656.96 MB/s   SPEEDUP = 1.36
  N =     1024   R4WSET,R4WSET8 =  495.44,  709.14 MB/s   SPEEDUP = 1.43
  N =     2048   R4WSET,R4WSET8 =  501.21,  735.97 MB/s   SPEEDUP = 1.47
  N =     4096   R4WSET,R4WSET8 =  504.29,  747.54 MB/s   SPEEDUP = 1.48
  N =     8192   R4WSET,R4WSET8 =  505.50,  753.96 MB/s   SPEEDUP = 1.49
  N =    16384   R4WSET,R4WSET8 =  506.07,  757.10 MB/s   SPEEDUP = 1.50
  N =    32768   R4WSET,R4WSET8 =  505.23,  754.07 MB/s   SPEEDUP = 1.49
  N =    65536   R4WSET,R4WSET8 =  499.08,  739.38 MB/s   SPEEDUP = 1.48
  N =   131072   R4WSET,R4WSET8 =  381.13,  506.37 MB/s   SPEEDUP = 1.33
  N =   262144   R4WSET,R4WSET8 =  153.88,  171.16 MB/s   SPEEDUP = 1.11
  N =   524288   R4WSET,R4WSET8 =  149.14,  165.40 MB/s   SPEEDUP = 1.11
  N =  1048576   R4WSET,R4WSET8 =  149.13,  165.53 MB/s   SPEEDUP = 1.11
  N =  2097152   R4WSET,R4WSET8 =  149.15,  165.47 MB/s   SPEEDUP = 1.11
  N =  4194304   R4WSET,R4WSET8 =  149.31,  165.78 MB/s   SPEEDUP = 1.11
In all cases, the REAL*8 version is faster for O(1000) vector lengths but not necessarily faster once the secondary cache size is exceeded. For N = 64 and below it is slower than the REAL*4 version on every machine tested. Presumably, hand-coded assembly language could do even better.
The A = B case is similar, but only if A and B are appropriately aligned with each other. The BLAS routine SCOPY (HCOPY on T3D) should be the fastest way to do A = B, if it has been optimized for a given machine.
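As a minimal sketch of what the A = B case might look like, the routines below follow the pattern of R4WSET8: the REAL*8 path is taken only when A and B have the same alignment, otherwise a plain REAL*4 loop is used. The names R4WCPY8, R8WCPY, and WCPYTS are hypothetical and this is not the author's code. It makes the same assumptions as R4WSET8: LOC returns an address in bytes (switch to the INTEGER*8 declaration where addresses are 64-bit), REAL*4 data may be accessed through a REAL*8 dummy argument, and A and B do not overlap. A compiler that checks argument types across program units may reject the R8WCPY calls unless that check is relaxed.

```fortran
      PROGRAM WCPYTS
      IMPLICIT NONE
C
C     HYPOTHETICAL DRIVER, CHECKS ALL FOUR ALIGNMENT COMBINATIONS.
C
      INTEGER N
      PARAMETER (N=1001)
      INTEGER I,IA,IB,NBAD
      REAL*4  A(N+1),B(N+1)
C
      DO I= 1,N+1
        B(I) = I
      ENDDO
      NBAD = 0
      DO IA= 1,2
        DO IB= 1,2
          DO I= 1,N+1
            A(I) = 0.0
          ENDDO
          CALL R4WCPY8(A(IA),B(IB),N)
          DO I= 0,N-1
            IF (A(IA+I).NE.B(IB+I)) NBAD = NBAD + 1
          ENDDO
        ENDDO
      ENDDO
      WRITE(6,*) 'MISMATCHES =',NBAD
      END
      SUBROUTINE R4WCPY8(A,B,N)
      IMPLICIT NONE
      INTEGER N
      REAL*4  A(N),B(N)
C
C     A = B, USING REAL*8 ASSIGNMENT WHEN A AND B ARE ALIGNED ALIKE.
C     HYPOTHETICAL ROUTINE, MODELLED ON R4WSET8.
C
C     LOC IS MACHINE DEPENDENT, ASSUMED TO RETURN ADDRESS IN BYTES.
C
      INTEGER*4 LOC,IA1,IB1,I8
*     INTEGER*8 LOC,IA1,IB1,I8
      PARAMETER (I8=8)
      INTEGER I
C
      IF (N.LE.0) RETURN
      IA1 = MOD(LOC(A(1)),I8)
      IB1 = MOD(LOC(B(1)),I8)
      IF (IA1.NE.IB1) THEN
C       DIFFERENT ALIGNMENT, FALL BACK TO REAL*4 ASSIGNMENT.
        DO I= 1,N
          A(I) = B(I)
        ENDDO
      ELSEIF (IA1.EQ.0) THEN
C       BOTH START ON A REAL*8 BOUNDARY.
        CALL R8WCPY(A(1),B(1),N/2)
        A(N) = B(N)
      ELSE
C       BOTH START HALF WAY THROUGH A REAL*8 WORD.
        A(1) = B(1)
        CALL R8WCPY(A(2),B(2),(N-1)/2)
        A(N) = B(N)
      ENDIF
      RETURN
      END
      SUBROUTINE R8WCPY(A,B,N)
      IMPLICIT NONE
      INTEGER N
      REAL*8  A(N),B(N)
C
C     A = B.
C
      INTEGER I
C
      DO I= 1,N
        A(I) = B(I)
      ENDDO
      RETURN
      END
```

The fallback loop is why the gain depends on relative alignment: when A and B are offset by one REAL*4 word, no pairing of elements puts both the load and the store on REAL*8 boundaries.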
Quick-Tip Q & A
A: {{ How can you delete a file named "-i"? }}
   rm ./-i    # Succinct! Sent in by a reader.
   rm -- -i   # Also sent in. The flag "--" is common to many
              # UNICOS commands (e.g., "f90"), and says: "I am
              # the last flag."
Q: What's a handy way to "vi" every file in the current working
   directory which contains a given string? (E.g., you want to
   read every T3D Newsletter which mentions "CRAFT".)
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
