ARSC T3D Users' Newsletter 102, August 30, 1996

f90 (It's not just the law...)

In case you hadn't heard...

  • CF77 will not exist on the T3E.
  • CF77 is being phased out on all CRI PVP systems. With the release of the programming environment 2.0, CF90 replaces the CF77 compiling system. (ARSC plans to upgrade to PE2.0 later this fall.)

Your Fortran 77 codes should compile under CF90, but you might want to make the switch earlier rather than later.

Numerical Recipes in Fortran 90 Available

[ Taken from a posting to comp.parallel ]

 A message from the authors of Numerical Recipes:

 Our new book, "Numerical Recipes in Fortran 90: The Art of Parallel
 Scientific Computing" and our new "Numerical Recipes Code CDROM" are
 both out and available now from Cambridge University Press.  We've put
 a lot of effort into these, and we hope you like them!  Here are brief
 descriptions:

 "Numerical Recipes in Fortran 90: The Art of Parallel Scientific
 Computing", Volume 2 of Fortran Numerical Recipes.  This new volume,
 intended for use with the existing book (now renamed Numerical Recipes
 in Fortran 77), reworks all the Numerical Recipes routines to use
 Fortran 90's concise parallel language constructions.  Even on single
 processor machines, you get the benefit of a slick, modern version of
 Fortran, and new conciseness and clarity in the code.  There are also
 three new chapters on Fortran 90 language features and parallel
 programming methods, and an introduction by Michael Metcalf.  More
 information on the book is available at:

     http://nr.harvard.edu/nr/nrf90_blurb.html

Simple Vector Operations

[ One of our T3D users, Dr. Alan Wallcraft of Stennis Space Center, contributes this article. ]

Current "workstation" Fortran compilers seem to do a relatively poor job with simple vector operations on REAL*4. This is a problem because Cray vector codes, Fortran 90 codes, and High Performance Fortran (or CM Fortran) codes contain many such operations. Using optimized BLAS can help, but some vendors don't optimize level-1 BLAS and Cray has minimal support for 32-bit BLAS on the T3D.

There is a need for a standard set of vector subroutines, like BLAS but not just for linear algebra, that could be optimized for each machine.

To illustrate the problem, consider A = S and A = B, where A and B are vectors and S is a scalar. These operations are quite common, although they can often be avoided by code restructuring. I expected compilers to produce almost optimal code for such operations. However, REAL*8 assignment is 1.5 to 2 times faster than REAL*4 assignment on many machines, and this can be exploited using "almost standard" f77.

A = S is the simplest example. Here is the test program. The LOC function is non-standard but almost always available, and may return either INTEGER*4 or INTEGER*8 depending on the machine. It is used to detect how A is aligned w.r.t. REAL*8 word boundaries.



      PROGRAM WSETST
      IMPLICIT NONE
C
      INTEGER    NP,NN
      PARAMETER (NP=22, NN=2**NP)
C
      INTEGER IP,L,N
      REAL*4  A(NN+8)
      REAL*8  SECOND
      REAL*8  T0,T1,T2
C
      REAL*4     ZERO4
      PARAMETER (ZERO4=0.0)
C
C     PROGRAM TIMING A(1:N) = 0.0, WITH A IN CACHE (IF IT FITS).
C
C     R4WSET  - SUBROUTINE USING REAL*4 ASSIGNMENT
C     R4WSET8 - SUBROUTINE USING REAL*8 ASSIGNMENT 
C
      DO IP= 1,NP
        N = 2**IP
C
        CALL R4WSET(A,ZERO4,N+8)
C
        T0 = SECOND()
C
        DO L= 1,NN,N
          CALL R4WSET(A(1),ZERO4,N)
          CALL R4WSET(A(2),ZERO4,N)
          CALL R4WSET(A(3),ZERO4,N)
          CALL R4WSET(A(4),ZERO4,N)
          CALL R4WSET(A(5),ZERO4,N)
          CALL R4WSET(A(6),ZERO4,N)
          CALL R4WSET(A(7),ZERO4,N)
          CALL R4WSET(A(8),ZERO4,N)
        ENDDO
C
        T1 = SECOND()
C
        DO L= 1,NN,N
          CALL R4WSET8(A(1),ZERO4,N)
          CALL R4WSET8(A(2),ZERO4,N)
          CALL R4WSET8(A(3),ZERO4,N)
          CALL R4WSET8(A(4),ZERO4,N)
          CALL R4WSET8(A(5),ZERO4,N)
          CALL R4WSET8(A(6),ZERO4,N)
          CALL R4WSET8(A(7),ZERO4,N)
          CALL R4WSET8(A(8),ZERO4,N)
        ENDDO
C
        T2 = SECOND()
        WRITE(6,6000) N,NN*32.E-6/(T1-T0),
     +                  NN*32.E-6/(T2-T1),(T1-T0)/(T2-T1)
      ENDDO
C
 6000 FORMAT(2X,'N = ',I8,
     +   3X,'R4WSET,R4WSET8 =',F8.2,',',F8.2,' MB/s',
     +   3X,'SPEEDUP =',F5.2)
      END
      SUBROUTINE R4WSET(S,W,N)
      IMPLICIT NONE
      INTEGER N
      REAL*4  S(N),W
C
C     S = W.
C
      INTEGER I
C
      DO I= 1,N
        S(I) = W
      ENDDO
      RETURN
      END
      SUBROUTINE R4WSET8(S,W,N)
      IMPLICIT NONE
      INTEGER N
      REAL*4  S(N),W
C
C     S = W.
C
C     LOC IS MACHINE DEPENDENT, ASSUMED TO RETURN ADDRESS IN BYTES.
C
      INTEGER*4 LOC,IS1,I8
*     INTEGER*8 LOC,IS1,I8
      PARAMETER (I8=8)
      REAL*8    W8(1)
      REAL*4    W4(2)
      EQUIVALENCE (W8,W4)
C
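C     PACK TWO COPIES OF W INTO ONE REAL*8 WORD.
      W4(1) = W
      W4(2) = W
      IS1   = LOC(S(1))
      IF     (MOD(IS1,I8).EQ.0) THEN
C       S(1) IS REAL*8 ALIGNED: SET PAIRS, THEN S(N) IN CASE N IS ODD.
        CALL R8WSET(S(1),W8,N/2)
        S(N) = W
      ELSE
C       S(2) IS REAL*8 ALIGNED: SET S(1), PAIRS FROM S(2), THEN S(N).
        S(1) = W
        CALL R8WSET(S(2),W8,(N-1)/2)
        S(N) = W
      ENDIF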
      RETURN
      END
      SUBROUTINE R8WSET(S,W,N)
      IMPLICIT NONE
      INTEGER N
      REAL*8  S(N),W
C
C     S = W.
C
      INTEGER I
C
      DO I= 1,N
        S(I) = W
      ENDDO
      RETURN
      END
      REAL*8 FUNCTION SECOND()
      IMPLICIT NONE
C
C     EMULATION OF CDC'S SECOND TIMING ROUTINE.
C
*
*     UNIX VERSION
*
      REAL*4  TARRAY(2)
      REAL*4  ETIME
      SECOND = ETIME(TARRAY)
*
*     T3D VERSION
*
*     INTEGER IRTC
*     SECOND = IRTC() * 6.6E-9
      RETURN
      END

On each machine, the program was compiled with high optimization (including automatic loop unrolling, but excluding subroutine in-lining).
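
As a check on how to read the MB/s figures, the arithmetic behind the WRITE statement in WSETST can be worked out as follows. This is only a restatement of the listing: each timed loop makes NN/N passes, each pass makes 8 calls, and each call sets N REAL*4 elements (4N bytes). "MB" here means 10^6 bytes, and the SPEEDUP column is the ratio of the two elapsed times.

```latex
\begin{align*}
\text{bytes per timed loop}
  &= \frac{NN}{N} \times 8 \times 4N = 32\,NN
   = 32 \times 2^{22} = 134{,}217{,}728 \\
\text{rate (MB/s)}
  &= \frac{32\,NN \times 10^{-6}}{T_1 - T_0}
  \quad\text{(R4WSET)}, \qquad
     \frac{32\,NN \times 10^{-6}}{T_2 - T_1}
  \quad\text{(R4WSET8)} \\
\text{speedup}
  &= \frac{T_1 - T_0}{T_2 - T_1}
\end{align*}
```

Because the eight calls start at A(1) through A(8), four of them begin on a REAL*8 boundary and four do not, so each reported rate averages over both alignments.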

Cray T3D results:


  N =        2   R4WSET,R4WSET8 =   16.96,   10.49 MB/s   SPEEDUP = 0.62
  N =        4   R4WSET,R4WSET8 =   32.93,   19.88 MB/s   SPEEDUP = 0.60
  N =        8   R4WSET,R4WSET8 =   61.86,   36.51 MB/s   SPEEDUP = 0.59
  N =       16   R4WSET,R4WSET8 =  110.06,   68.32 MB/s   SPEEDUP = 0.62
  N =       32   R4WSET,R4WSET8 =  174.05,  119.27 MB/s   SPEEDUP = 0.69
  N =       64   R4WSET,R4WSET8 =  231.35,  195.69 MB/s   SPEEDUP = 0.85
  N =      128   R4WSET,R4WSET8 =  277.62,  282.63 MB/s   SPEEDUP = 1.02
  N =      256   R4WSET,R4WSET8 =  306.82,  367.51 MB/s   SPEEDUP = 1.20
  N =      512   R4WSET,R4WSET8 =  324.51,  430.44 MB/s   SPEEDUP = 1.33
  N =     1024   R4WSET,R4WSET8 =  334.36,  470.84 MB/s   SPEEDUP = 1.41
  N =     2048   R4WSET,R4WSET8 =  339.38,  492.19 MB/s   SPEEDUP = 1.45
  N =     4096   R4WSET,R4WSET8 =  341.82,  505.49 MB/s   SPEEDUP = 1.48
  N =     8192   R4WSET,R4WSET8 =  343.28,  511.05 MB/s   SPEEDUP = 1.49
  N =    16384   R4WSET,R4WSET8 =  343.96,  514.65 MB/s   SPEEDUP = 1.50
  N =    32768   R4WSET,R4WSET8 =  344.16,  516.32 MB/s   SPEEDUP = 1.50
  N =    65536   R4WSET,R4WSET8 =  344.47,  516.78 MB/s   SPEEDUP = 1.50
  N =   131072   R4WSET,R4WSET8 =  344.56,  517.58 MB/s   SPEEDUP = 1.50
  N =   262144   R4WSET,R4WSET8 =  344.52,  517.85 MB/s   SPEEDUP = 1.50
  N =   524288   R4WSET,R4WSET8 =  344.78,  518.04 MB/s   SPEEDUP = 1.50
  N =  1048576   R4WSET,R4WSET8 =  344.65,  518.17 MB/s   SPEEDUP = 1.50
  N =  2097152   R4WSET,R4WSET8 =  344.74,  518.35 MB/s   SPEEDUP = 1.50
  N =  4194304   R4WSET,R4WSET8 =  344.86,  518.58 MB/s   SPEEDUP = 1.50

SGI Power Challenge results:


  N =        2   R4WSET,R4WSET8 =   22.16,   12.66 MB/s   SPEEDUP = 0.57
  N =        4   R4WSET,R4WSET8 =   54.40,   22.90 MB/s   SPEEDUP = 0.42
  N =        8   R4WSET,R4WSET8 =   99.73,   39.39 MB/s   SPEEDUP = 0.40
  N =       16   R4WSET,R4WSET8 =  140.79,   73.92 MB/s   SPEEDUP = 0.53
  N =       32   R4WSET,R4WSET8 =  227.93,  133.41 MB/s   SPEEDUP = 0.59
  N =       64   R4WSET,R4WSET8 =  330.14,  240.05 MB/s   SPEEDUP = 0.73
  N =      128   R4WSET,R4WSET8 =  425.43,  399.90 MB/s   SPEEDUP = 0.94
  N =      256   R4WSET,R4WSET8 =  497.27,  599.47 MB/s   SPEEDUP = 1.21
  N =      512   R4WSET,R4WSET8 =  543.04,  798.79 MB/s   SPEEDUP = 1.47
  N =     1024   R4WSET,R4WSET8 =  569.36,  957.71 MB/s   SPEEDUP = 1.68
  N =     2048   R4WSET,R4WSET8 =  583.59, 1063.42 MB/s   SPEEDUP = 1.82
  N =     4096   R4WSET,R4WSET8 =  590.88, 1126.19 MB/s   SPEEDUP = 1.91
  N =     8192   R4WSET,R4WSET8 =  594.41, 1160.20 MB/s   SPEEDUP = 1.95
  N =    16384   R4WSET,R4WSET8 =  596.47, 1177.80 MB/s   SPEEDUP = 1.97
  N =    32768   R4WSET,R4WSET8 =  597.23, 1187.54 MB/s   SPEEDUP = 1.99
  N =    65536   R4WSET,R4WSET8 =  597.57, 1191.56 MB/s   SPEEDUP = 1.99
  N =   131072   R4WSET,R4WSET8 =  597.94, 1193.86 MB/s   SPEEDUP = 2.00
  N =   262144   R4WSET,R4WSET8 =  597.76, 1194.96 MB/s   SPEEDUP = 2.00
  N =   524288   R4WSET,R4WSET8 =  590.84, 1196.06 MB/s   SPEEDUP = 2.02
  N =  1048576   R4WSET,R4WSET8 =  492.22, 1063.97 MB/s   SPEEDUP = 2.16
  N =  2097152   R4WSET,R4WSET8 =  132.57,  147.14 MB/s   SPEEDUP = 1.11
  N =  4194304   R4WSET,R4WSET8 =  115.25,  126.86 MB/s   SPEEDUP = 1.10

DEC Alpha (DEC2100_A500) results:


  N =        2   R4WSET,R4WSET8 =   59.82,   42.88 MB/s   SPEEDUP = 0.72
  N =        4   R4WSET,R4WSET8 =  154.17,   76.87 MB/s   SPEEDUP = 0.50
  N =        8   R4WSET,R4WSET8 =  202.23,  129.86 MB/s   SPEEDUP = 0.64
  N =       16   R4WSET,R4WSET8 =  309.73,  218.63 MB/s   SPEEDUP = 0.71
  N =       32   R4WSET,R4WSET8 =  392.91,  308.34 MB/s   SPEEDUP = 0.78
  N =       64   R4WSET,R4WSET8 =  399.76,  367.70 MB/s   SPEEDUP = 0.92
  N =      128   R4WSET,R4WSET8 =  404.46,  408.07 MB/s   SPEEDUP = 1.01
  N =      256   R4WSET,R4WSET8 =  408.07,  431.09 MB/s   SPEEDUP = 1.06
  N =      512   R4WSET,R4WSET8 =  334.59,  445.04 MB/s   SPEEDUP = 1.33
  N =     1024   R4WSET,R4WSET8 =  316.13,  439.35 MB/s   SPEEDUP = 1.39
  N =     2048   R4WSET,R4WSET8 =  265.48,  369.67 MB/s   SPEEDUP = 1.39
  N =     4096   R4WSET,R4WSET8 =  214.54,  258.49 MB/s   SPEEDUP = 1.20
  N =     8192   R4WSET,R4WSET8 =  162.94,  195.06 MB/s   SPEEDUP = 1.20
  N =    16384   R4WSET,R4WSET8 =  137.66,  159.90 MB/s   SPEEDUP = 1.16
  N =    32768   R4WSET,R4WSET8 =  127.21,  140.47 MB/s   SPEEDUP = 1.10
  N =    65536   R4WSET,R4WSET8 =  118.96,  128.16 MB/s   SPEEDUP = 1.08
  N =   131072   R4WSET,R4WSET8 =  112.17,  115.56 MB/s   SPEEDUP = 1.03
  N =   262144   R4WSET,R4WSET8 =  104.98,  107.69 MB/s   SPEEDUP = 1.03
  N =   524288   R4WSET,R4WSET8 =  101.64,  101.27 MB/s   SPEEDUP = 1.00
  N =  1048576   R4WSET,R4WSET8 =   98.16,   98.79 MB/s   SPEEDUP = 1.01
  N =  2097152   R4WSET,R4WSET8 =   96.57,   96.98 MB/s   SPEEDUP = 1.00
  N =  4194304   R4WSET,R4WSET8 =   95.43,   95.70 MB/s   SPEEDUP = 1.00

Sun SPARC 20/61 results:


  N =        2   R4WSET,R4WSET8 =   25.78,    9.07 MB/s   SPEEDUP = 0.35
  N =        4   R4WSET,R4WSET8 =   39.02,   12.38 MB/s   SPEEDUP = 0.32
  N =        8   R4WSET,R4WSET8 =   74.87,   17.37 MB/s   SPEEDUP = 0.23
  N =       16   R4WSET,R4WSET8 =  104.07,   23.05 MB/s   SPEEDUP = 0.22
  N =       32   R4WSET,R4WSET8 =  130.78,   30.57 MB/s   SPEEDUP = 0.23
  N =       64   R4WSET,R4WSET8 =  149.31,   53.29 MB/s   SPEEDUP = 0.36
  N =      128   R4WSET,R4WSET8 =  160.56,   89.04 MB/s   SPEEDUP = 0.55
  N =      256   R4WSET,R4WSET8 =  166.90,  134.45 MB/s   SPEEDUP = 0.81
  N =      512   R4WSET,R4WSET8 =  170.30,  179.61 MB/s   SPEEDUP = 1.05
  N =     1024   R4WSET,R4WSET8 =  172.14,  217.24 MB/s   SPEEDUP = 1.26
  N =     2048   R4WSET,R4WSET8 =  172.94,  242.02 MB/s   SPEEDUP = 1.40
  N =     4096   R4WSET,R4WSET8 =  173.50,  255.16 MB/s   SPEEDUP = 1.47
  N =     8192   R4WSET,R4WSET8 =  173.52,  264.75 MB/s   SPEEDUP = 1.53
  N =    16384   R4WSET,R4WSET8 =  173.47,  268.55 MB/s   SPEEDUP = 1.55
  N =    32768   R4WSET,R4WSET8 =  171.82,  267.37 MB/s   SPEEDUP = 1.56
  N =    65536   R4WSET,R4WSET8 =  168.56,  264.91 MB/s   SPEEDUP = 1.57
  N =   131072   R4WSET,R4WSET8 =  162.87,  256.73 MB/s   SPEEDUP = 1.58
  N =   262144   R4WSET,R4WSET8 =  161.47,  247.10 MB/s   SPEEDUP = 1.53
  N =   524288   R4WSET,R4WSET8 =   41.48,   44.59 MB/s   SPEEDUP = 1.07
  N =  1048576   R4WSET,R4WSET8 =   41.47,   44.76 MB/s   SPEEDUP = 1.08
  N =  2097152   R4WSET,R4WSET8 =   41.60,   44.61 MB/s   SPEEDUP = 1.07
  N =  4194304   R4WSET,R4WSET8 =   41.49,   44.62 MB/s   SPEEDUP = 1.08

Sun UltraSPARC 1/140 results:


  N =        2   R4WSET,R4WSET8 =   49.78,   25.36 MB/s   SPEEDUP = 0.51
  N =        4   R4WSET,R4WSET8 =  129.54,   46.50 MB/s   SPEEDUP = 0.36
  N =        8   R4WSET,R4WSET8 =  117.02,   89.79 MB/s   SPEEDUP = 0.77
  N =       16   R4WSET,R4WSET8 =  197.45,  130.32 MB/s   SPEEDUP = 0.66
  N =       32   R4WSET,R4WSET8 =  269.68,  234.06 MB/s   SPEEDUP = 0.87
  N =       64   R4WSET,R4WSET8 =  362.30,  350.17 MB/s   SPEEDUP = 0.97
  N =      128   R4WSET,R4WSET8 =  424.71,  472.96 MB/s   SPEEDUP = 1.11
  N =      256   R4WSET,R4WSET8 =  460.14,  589.28 MB/s   SPEEDUP = 1.28
  N =      512   R4WSET,R4WSET8 =  482.29,  656.96 MB/s   SPEEDUP = 1.36
  N =     1024   R4WSET,R4WSET8 =  495.44,  709.14 MB/s   SPEEDUP = 1.43
  N =     2048   R4WSET,R4WSET8 =  501.21,  735.97 MB/s   SPEEDUP = 1.47
  N =     4096   R4WSET,R4WSET8 =  504.29,  747.54 MB/s   SPEEDUP = 1.48
  N =     8192   R4WSET,R4WSET8 =  505.50,  753.96 MB/s   SPEEDUP = 1.49
  N =    16384   R4WSET,R4WSET8 =  506.07,  757.10 MB/s   SPEEDUP = 1.50
  N =    32768   R4WSET,R4WSET8 =  505.23,  754.07 MB/s   SPEEDUP = 1.49
  N =    65536   R4WSET,R4WSET8 =  499.08,  739.38 MB/s   SPEEDUP = 1.48
  N =   131072   R4WSET,R4WSET8 =  381.13,  506.37 MB/s   SPEEDUP = 1.33
  N =   262144   R4WSET,R4WSET8 =  153.88,  171.16 MB/s   SPEEDUP = 1.11
  N =   524288   R4WSET,R4WSET8 =  149.14,  165.40 MB/s   SPEEDUP = 1.11
  N =  1048576   R4WSET,R4WSET8 =  149.13,  165.53 MB/s   SPEEDUP = 1.11
  N =  2097152   R4WSET,R4WSET8 =  149.15,  165.47 MB/s   SPEEDUP = 1.11
  N =  4194304   R4WSET,R4WSET8 =  149.31,  165.78 MB/s   SPEEDUP = 1.11

In all cases, the REAL*8 version is faster for O(1000) vector lengths but not necessarily faster once the secondary cache size is exceeded. Presumably, hand-coded assembly language could do even better.

The A = B case is similar, but only if A and B are appropriately aligned with each other. The BLAS routine SCOPY (HCOPY on T3D) should be the fastest way to do A = B, if it has been optimized for a given machine.

Quick-Tip Q & A


Q: What's a handy way to "vi" every file in the current working
   directory that contains a given string?  (E.g., you want to
   read every T3D Newsletter which mentions "CRAFT".)


A: {{ How can you delete a file named "-i" ??? }}

    rm ./-i        # Succinct! Sent in by a reader. 

    rm -- -i       # Also sent in. The flag "--" is common to many 
                   #   UNICOS commands (e.g., "f90"), and says: "I am
                   #   the last flag." 
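
Both answers can be tried safely with a short script like the one below. It is a sketch rather than part of the readers' submissions: the scratch directory, the use of "mktemp -d" (which may not exist on every system), and the second file name "-f" are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: work in a scratch directory so nothing real is removed.
start=$PWD
scratch=$(mktemp -d)      # assumption: mktemp is available on this system
cd "$scratch" || exit 1

# Create two files whose names begin with a dash.
touch ./-i
touch -- -f               # "-f" is a hypothetical second example

# Method 1: a path prefix keeps rm from reading the name as a flag.
rm ./-i

# Method 2: "--" ends flag processing, so "-f" is taken as a file name.
rm -- -f

# Count what is left (tr strips the padding some wc versions add).
remaining=$(ls -A | wc -l | tr -d ' ')
echo "files left: $remaining"

# Clean up the scratch directory.
cd "$start" || exit 1
rmdir "$scratch"
```

Either form also works for other commands that take file names, provided the command honors "--".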

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.