ARSC T3D Users' Newsletter 59, November 3, 1995

Looking at Generated Code on the T3D

Both the PVP and MPP compilers from CRI let the user look at the assembly language generated by the compiler. In particular, the command:


  cf77 -S main.f
produces a file main.s that can be assembled by CAM (Cray Assembler for MPP) to give the same main.o file as:

  cf77 -c main.f
Nobody likes looking at assembly language, but sometimes it can answer questions that timing programs can't or the documentation doesn't.

Example 1

to divide or not to divide

From the beginning, I've been curious whether there would be a difference between the two timing subroutines below:


  real function second( )
  second = irtc( ) / 150000000.0
  end
and

  real function second2( )
  second2 = irtc( ) * .0000000066667 
  end
where irtc() gets the number of clock ticks since the beginning of program execution. The first function divides by the 150 MHz speed of the T3D processor and the second one has done the division before compilation to trade a division for a multiplication. The motivation is that because the divide takes longer than the multiply, the second timer may be more precise. (This hardly matters on the T3D Alpha processor where the difference in time between a multiply and divide is only a few clocks.)
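One side effect of the second form is worth noting: the constant .0000000066667 is a rounded version of 1/150000000, so the two timers do not return quite the same value. A minimal Python sketch of the arithmetic follows (Python is used here only as a calculator; the tick count is a made-up sample, not a measurement from the T3D):

```python
# Compare the two tick-to-seconds conversions from second() and second2().
# The tick count below is a hypothetical sample value (one second of ticks).
CLOCK_HZ = 150000000.0          # divisor used in second()
RECIPROCAL = 0.0000000066667    # rounded constant used in second2()

def seconds_by_divide(ticks):
    return ticks / CLOCK_HZ

def seconds_by_multiply(ticks):
    return ticks * RECIPROCAL

ticks = 150000000               # hypothetical: ticks in one second
t_div = seconds_by_divide(ticks)
t_mul = seconds_by_multiply(ticks)
rel_err = abs(t_mul - t_div) / t_div

print("divide:   %.10f s" % t_div)
print("multiply: %.10f s" % t_mul)
print("relative difference: %.2e" % rel_err)
```

The rounded constant is off by about 5 parts in a million, that is, about 5 microseconds per second of elapsed time. Carrying more digits in the constant would shrink that difference.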

Compiling each routine with the -S flag and doing the diff between second.s and second2.s gives:


  1c1
  <         .ident    SECOND
  ---
  >         .ident    SECOND2
  54c55
  <         .quad                 ^X41a1e1a300000000
  ---
  >         .quad                 ^X3e3ca21d3a2c3310
  97,99c98,100
  <         divt/d    f0,  f1,        f0          ; Ln 2 0x00000080
  <         stt       f0,  -160(r15)              ; Ln 2 0x00000084 SECOND
  <         ldt       f0,  -160(r15)              ; Ln 3 0x00000088 SECOND
  ---
  >         mult/d    f1,  f0,        f0          ; Ln 2 0x00000080
  >         stt       f0,  -160(r15)              ; Ln 2 0x00000084 SECOND2
  >         ldt       f0,  -160(r15)              ; Ln 3 0x00000088 SECOND2
  154c155
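The two .quad values in the diff are the 64-bit bit patterns of the two constants. As a cross-check, a short Python sketch (not part of the original experiment) can decode them; it assumes the ^X prefix in the CAM listing denotes hexadecimal and that the words are IEEE doubles:

```python
import struct

def quad_to_double(hex_digits):
    """Interpret 16 hex digits as a big-endian IEEE 754 double."""
    return struct.unpack(">d", bytes.fromhex(hex_digits))[0]

divisor = quad_to_double("41a1e1a300000000")      # constant in second.s
reciprocal = quad_to_double("3e3ca21d3a2c3310")   # constant in second2.s

print(divisor)      # 150000000.0
print(reciprocal)   # close to 6.6667e-09
```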
So each has set up a different internal constant and the first function does the divide and the second one a multiply. It should not be surprising that the compiler didn't replace the divide with a multiply because the IEEE standard does not allow this "optimization". But the CRI MPP compiler has a flag to relax the IEEE restriction. From the cft77 MPP manpage:

  > ieeedivide, noieeedivide
  >   Specifying noieeedivide allows the compiler to
  >   decompose divides into multiply-by-reciprocal
  >   for situations in which a performance gain is
  >   anticipated.  Default: ieeedivide.
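The reason the substitution is not allowed by default is that a quotient is rounded once, while multiplying by a precomputed reciprocal rounds twice (once for the reciprocal, once for the product). A small Python sketch, using IEEE double arithmetic with a divisor and range chosen arbitrarily for illustration, shows that the two forms can disagree:

```python
# Count how often x / d and x * (1.0 / d) differ in IEEE double precision.
# The divisors and the range 1..1000 are arbitrary illustration values.
def count_mismatches(d, n):
    r = 1.0 / d                      # reciprocal, rounded once
    mismatches = 0
    for k in range(1, n + 1):
        x = float(k)
        if x / d != x * r:           # the product is rounded a second time
            mismatches += 1
    return mismatches

print(count_mismatches(3.0, 1000))   # nonzero: results are not always identical
print(count_mismatches(2.0, 1000))   # 0: a power-of-two reciprocal is exact
```

For example, 5.0 / 3.0 and 5.0 * (1.0 / 3.0) differ in the last bit, which is why the compiler needs explicit permission to make the change.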
Using "noieeedivide" can speed up some programs, but it made no difference to my small timing routine. The last experiment I tried had the following version of the timer:

  real function second3( )
  rjunk = irtc( )
  second3 = rjunk / 150000000.0
  end
and the following compile command:

  cf77 -S -Wf"-o noieeedivide,aggress" second3.f
But still the relevant lines of the diff of second.s and second3.s show that the divide has not been replaced with a multiply:

  1c1
  <         .ident    SECOND
  ---
  >         .ident    SECOND3
  4,5c4,5
  <         ; Using command line options  -dB -ssecond.s
  <         ; Created from source file second.f
  ---
  >         ; Using command line options  -dB -ssecond3.s -oaggress,noieeedivide
  >         ; Created from source file second3.f
  7c7
  <         ; On 10/30/95 at 09:38:30
  ---
  >         ; On 10/30/95 at 09:38:33
  97,99c98,100
  <         divt/d    f0,  f1,        f0          ; Ln 2 0x00000080
  <         stt       f0,  -160(r15)              ; Ln 2 0x00000084 SECOND
  <         ldt       f0,  -160(r15)              ; Ln 3 0x00000088 SECOND
  ---
  >         divt/d    f0,  f1,        f0          ; Ln 3 0x00000080
  >         stt       f0,  -160(r15)              ; Ln 3 0x00000084 SECOND3
  >         ldt       f0,  -160(r15)              ; Ln 4 0x00000088 SECOND3
In a future newsletter, I will show a case where this "noieeedivide" has a very positive effect.

Example 2

not all twos are equal

One of ARSC's users tracked down a slow subroutine with Apprentice. The slowdown was due to differences in how an expression in a DO loop was evaluated, for example:


  expression #1    t1( i ) = x( i ) ** 2.
  expression #2    t2( i ) = x( i ) ** 2
  expression #3    t3( i ) = x( i ) * x( i )
Of course all of them compute the same result, but they don't all take the same amount of time on the T3D. Below is a table of timings using the irtc() clock for each of the three expressions. Each timing was repeated 50 times, and the average and minimum of those 50 trials are shown below for various compiler commands.

Table 1

Timings (in clock ticks) for three expressions and various compiler commands


  compiler command               expr. 1       expr. 2       expr. 3
                                avg.   min.   avg.   min.   avg.   min.

  cf77 -g main.f               1506.   1491   361.    350   146.    140
  cf77 -O0 main.f              1508.   1492   447.    432   143.    140
  cf77 -O1 main.f                16.     16    39.     36    24.     24
  cf77 -Wf" -Ta" main.f          10.     10     9.      9    10.     10
  cf77 -Wf" -oaggress" main.f    18.     18    33.     30    25.     25
  cf77 -Oscalar1 main.f          16.     16    36.     33    25.     17
  cf77 -Oscalar2 main.f          16.     16    37.     33    24.     17
  cf77 -Oscalar3 main.f          18.     18    33.     30    24.     18
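The measurement scheme behind Table 1 (repeat the timing 50 times, report the average and the minimum) can be sketched as follows. This is a Python illustration only: time.perf_counter_ns stands in for irtc(), and the input data and the helper name time_trials are hypothetical.

```python
import time

def time_trials(fn, trials=50):
    """Return (average, minimum) elapsed nanoseconds over repeated calls."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter_ns()   # stands in for irtc()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return sum(samples) / len(samples), min(samples)

x = [float(i) for i in range(1, 101)]    # hypothetical input data

avg, low = time_trials(lambda: [v * v for v in x])
print("avg %.0f ns, min %d ns" % (avg, low))
```

The minimum is the trial least disturbed by other activity, while the average also reflects occasional slow trials, which is why the two columns in the table can differ.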
Although there seems to be a lot of variability in the times, by looking at the generated code there are only three ways in which the expressions are evaluated:

  evaluation #1 call to the library function RTOR (as in expression #1 and -g)
  evaluation #2 call to the library function RTOI (as in expression #2 and -O0)
  evaluation #3 a multiplication (as in all expressions and -O1 or better)
RTOR is the "Real to Real" version of the power operation and RTOI is the "Real to Integer" version; both routines are loaded by default from the library libm.a.
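The source does not show how RTOR and RTOI are implemented. As a rough illustration of why a general real power tends to cost more than an integer power, which in turn costs more than a single multiply, here is a Python sketch of two textbook approaches. Both functions are assumptions made for illustration, not the libm.a code:

```python
import math

def power_real_real(x, y):
    """General x**y for x > 0 via exp and log (one common approach)."""
    return math.exp(y * math.log(x))

def power_real_int(x, n):
    """x**n for integer n >= 0 by repeated squaring."""
    result = 1.0
    base = x
    while n > 0:
        if n & 1:
            result *= base
        base *= base
        n >>= 1
    return result

x = 3.0
print(power_real_real(x, 2.0))   # approximately 9.0
print(power_real_int(x, 2))      # 9.0
print(x * x)                     # 9.0
```

The first form needs a logarithm and an exponential, the second a short loop of multiplies, and the third exactly one multiply, which matches the ordering of the unoptimized timings in Table 1.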

Code for Totalview must be compiled with the -g switch, and the sample timings above show how big an impact that can have. On some compilers the -g flag is relatively innocuous, but the cf77 manpage describes the -g flag as follows:


  >  -g   Generates a debug symbol table in the output,
  >       suppressing optimization. Breakpoints can be
  >       set at each line. Equivalent to -Wf "-ed -o off".
Code for Apprentice need only be compiled with the "-Ta" flag sent to the cft77 compiler. In my timings above, this flag did not affect the timings much, but for a program like the Whetstone benchmark it can have a large effect:

Table 2

Timings in seconds of the Whetstone benchmark for various compiler options:


                            time in seconds

  default compilation            36.8
  compiled for totalview         90.4
  compiled for apprentice       280.6

Documentation on SHMEMs

Below is the list of documentation available about SHMEMs; I can send copies to anyone who emails me a request:

  shmemug_c.ps   SHMEM Technical Note for C      , CRI SN-2517 2.3, 10/25/94
  shmemug_ftn.ps SHMEM Technical Note for Fortran, CRI SN-2516 2.3, 10/25/94
  shmem.ascii    SHARED MEMORY GETS AND PUTS ON THE CRAY T3D SYSTEM, from the
                 "CRAY T3D Optimization Techniques", Draft A, 11/94

Another Solution for SHMEM Cache Coherence

Bob Numerich of CRI sent in another method of maintaining cache coherence when using SHMEMs:

  > We have found that turning the cache_inv() flag on makes
  > cache coherence problems less of a problem. Any line coming
  > in from a remote processor automatically invalidates that
  > cache line. Hence, you can spin locally until the line
  > becomes invalid. Also, you don't need to do a cache flush
  > before reading data that has arrived from a remote processor.
From the man page for shmem_cache on Denali we have:

  > shmem_set_cache_inv enables automatic cache invalidation.
  > When automatic cache invalidation is enabled, remote
  > writes into the calling PE's local memory from other PEs
  > via shmem_put(3) or shmem_swap(3) do not disrupt data
  > cache coherency because the affected cache line in the
  > local PE gets invalidated automatically.
  >
  > shmem_clear_cache_inv disables automatic cache
  > invalidation previously enabled by shmem_set_cache_inv
  > or shmem_set_cache_line_inv.
These two routines could be used to accomplish cache consistency in the code region before the remotely written information is used.
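The effect described above can be pictured with a toy model. The Python sketch below is not the SHMEM API; the class ToyPE and its methods are hypothetical names that model one PE with a one-word cache in front of its local memory:

```python
# Toy model: one PE with a one-word "cache" in front of its local memory.
# All names here are hypothetical; this is not the SHMEM API.
class ToyPE:
    def __init__(self, auto_invalidate):
        self.memory = 0                 # local memory word
        self.cached = None              # cached copy, None when invalid
        self.auto_invalidate = auto_invalidate

    def local_read(self):
        if self.cached is None:         # miss: refill from memory
            self.cached = self.memory
        return self.cached

    def remote_put(self, value):
        self.memory = value             # remote PE writes memory directly
        if self.auto_invalidate:
            self.cached = None          # incoming line invalidates the cache

    def flush(self):
        self.cached = None              # explicit cache flush

stale = ToyPE(auto_invalidate=False)
stale.local_read()                      # caches the old value 0
stale.remote_put(42)
print(stale.local_read())               # 0: stale cached copy
stale.flush()
print(stale.local_read())               # 42 after an explicit flush

fresh = ToyPE(auto_invalidate=True)
fresh.local_read()
fresh.remote_put(42)
print(fresh.local_read())               # 42 without any flush
```

With automatic invalidation on, the local read after the remote write sees the new value without an explicit flush, which is the behavior the quoted note relies on.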

Reading and Writing T3D and Y-MP Files

In the last newsletter we had the Y-MP reading T3D files; in this issue the information goes the other way: the Y-MP writes a direct access file and the T3D reads it. This expands our table with 3 more examples:

  Example     Type of    Source (writing)          Target (reading)
  (newsletter)  File

  1 (58)  Direct Access  T3D(t3d1.f)               Y-MP(ymp1.f), using asnunit
  2 (58)    "      "     T3D(t3d1.f)               Y-MP(ymp2.f), using IEG2CRAY
  3 (58)    "      "     T3D(t3d2.f),cray format   Y-MP(ymp3.f) 
  4 (59)    "      "     Y-MP(ymp4.f)              T3D(t3d4.f), using asnunit
  5 (59)    "      "     Y-MP(ymp5.f),ieee 64bits  T3D(t3d5.f)
  6 (59)    "      "     Y-MP(ymp6.f),ieee 32bits  T3D(t3d6.f)
In the example below, the Y-MP writes an ordinary direct access file. To read it on the T3D side, we use the asnunit call to tell the T3D that the floating point numbers are in CRAY format:

        On the Y-MP (ymp4.f)                  On the T3D (t3d4.f)

        real a( 10 ), b( 10 )                 real a( 10 ), b( 10 )
        sum1 = 0.0                            sum1 = 0.0
        open( unit=10, file='ymp.dir',        call asnunit( 10,
     +   form='unformatted',recl=80,        +  ' -F syscall -N cray', ier )
     +   access='direct',iostat=istat,         print *, ier
     +   status='new' )                        open( unit=10,access='direct',
        print *, istat                      +   recl=80,file='ymp.dir',
        do 10 i = 1, 10                     +   form='unformatted',iostat=istat,
           a( i ) = i                       +   status='old')
           sum1 = sum1 + a( i )                print *, istat
  10    continue                               if( istat .ne. 0 ) stop
        print *, "sum1 = ", sum1               read( 10, rec=1 ) a
        write( 10, rec=1 ) a                   sum1 = 0.0
        sum2 = 0.0                             do 10 i = 1, 10
        do 20 i = 1, 10                           sum1 = sum1 + a( i )
           b( i ) = 10 * i                        write( 6, 600 ) a( i )
           sum2 = sum2 + b( i )             10 continue
  20    continue                     600       format( o22 )
        print *, "sum2 = ", sum2               print *, 'sum1 = ', sum1
        write( 10, rec=2 ) b                   read( 10, rec=2 ) b
        write( 10, rec=3 ) a                   sum2 = 0.0
        close( 10 )                            do 20 i = 1, 10
        print *, 'close done'                     sum2 = sum2 + b( i )
        open( unit=10,                      20 continue
     +   access='direct',                      print *, 'sum2 = ', sum2
     +   recl=80,file='ymp.dir',               read( 10, rec=3 ) a
     +   form='unformatted',                   sum3 = 0.0
     +   iostat=istat,                         do 30 i = 1, 10
     +   status='old')                            sum3 = sum3 + a( i )
        print *, istat                      30 continue
        read( 10, rec=1 ) a                    print *, 'sum3 = ', sum3
        sum1 = 0.0                             end
        do 11 i = 1, 10
           sum1 = sum1 + a( i )
           write( 6, 600 ) a( i )
  11    continue
 600    format( o22 )
        print *, 'sum1 = ', sum1
        read( 10, rec=2 ) b
        sum2 = 0.0
        do 21 i = 1, 10
           sum2 = sum2 + b( i )
  21    continue
        print *, 'sum2 = ', sum2
        read( 10, rec=3 ) a
        sum3 = 0.0
        do 31 i = 1, 10
           sum3 = sum3 + a( i )
  31    continue
        print *, 'sum3 = ', sum3
        close( 10 )
        end

        Results on the Y-MP from ymp4.f       Results on the T3D from t3d4.f

                                             0
         0                                   0
         sum1 =        55.                   0377600000000000000000
         sum2 =        550.                  0400000000000000000000
         close done                          0400100000000000000000
         0                                   0400200000000000000000
        0400014000000000000000               0400240000000000000000
        0400024000000000000000               0400300000000000000000
        0400026000000000000000               0400340000000000000000
        0400034000000000000000               0400400000000000000000
        0400035000000000000000               0400420000000000000000
        0400036000000000000000               0400440000000000000000
        0400037000000000000000                 sum1 = 55.
        0400044000000000000000                 sum2 = 550.
        0400044400000000000000                 sum3 = 55.
        0400045000000000000000
         sum1 =        55.
         sum2 =        550.
         sum3 =        55.
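The octal dumps in the two columns differ because the Y-MP prints its native CRAY floating point words while the T3D prints IEEE words, even though both hold the values 1 through 10. A Python sketch can decode a few of the words from each column. It assumes the usual layouts: a CRAY word with 1 sign bit, a 15-bit exponent biased by 40000 octal, and a 48-bit fraction with no hidden bit; and a 64-bit IEEE 754 double. The function names are illustrative only.

```python
import struct

def decode_cray(octal_word):
    """Decode a 64-bit CRAY-format real: 1 sign bit, 15-bit exponent
    biased by 40000 octal, 48-bit fraction with no hidden bit."""
    w = int(octal_word, 8)
    sign = -1.0 if (w >> 63) & 1 else 1.0
    exponent = ((w >> 48) & 0x7FFF) - 0o40000
    fraction = (w & ((1 << 48) - 1)) / float(1 << 48)
    return sign * fraction * 2.0 ** exponent

def decode_ieee64(octal_word):
    """Decode a 64-bit IEEE 754 double from its octal dump."""
    w = int(octal_word, 8)
    return struct.unpack(">d", w.to_bytes(8, "big"))[0]

# First three and last words from each column of the output above.
ymp_words = ["0400014000000000000000", "0400024000000000000000",
             "0400026000000000000000", "0400045000000000000000"]
t3d_words = ["0377600000000000000000", "0400000000000000000000",
             "0400100000000000000000", "0400440000000000000000"]

print([decode_cray(w) for w in ymp_words])     # [1.0, 2.0, 3.0, 10.0]
print([decode_ieee64(w) for w in t3d_words])   # [1.0, 2.0, 3.0, 10.0]
```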
In the next example, the Y-MP writes the file with a foreign IEEE machine as the target for reading the file. Using the asnunit call, the I/O is set up to write floating point numbers in the 64 bit IEEE format:

        On the Y-MP (ymp5.f)                 On the T3D (t3d5.f)

        real a( 10 ), b( 10 )                real a( 10 ), b( 10 )
        sum1 = 0.0                           sum1 = 0.0
        call asnunit( 10,                    open( unit=10,access='direct',
     +   ' -F syscall -N ieee_dp', ier )    + recl=80,file='ymp.dir',
        print *, 'ier = ', ier              + form='unformatted',
        open( unit=10, file='ymp.dir',      + iostat=istat,status='old')
     +   form='unformatted',recl=80,          print *, istat
     +   access='direct',iostat=istat,       if( istat .ne. 0 ) stop
     +   status='new' )                      read( 10, rec=1 ) a
        print *, 'istat = ' ,istat           sum1 = 0.0
        do 10 i = 1, 10                      do 10 i = 1, 10
           a( i ) = i                           sum1 = sum1 + a( i )
           sum1 = sum1 + a( i )              write( 6, 600 ) a( i )
  10    continue                         10  continue
        print *, "sum1 = ", sum1        600  format( o22 )
        write( 10, rec=1 ) a                 print *, 'sum1 = ', sum1
        sum2 = 0.0                           read( 10, rec=2 ) b
        do 20 i = 1, 10                      sum2 = 0.0
           b( i ) = 10 * i                   do 20 i = 1, 10
           sum2 = sum2 + b( i )                 sum2 = sum2 + b( i )
  20    continue                         20  continue
        print *, "sum2 = ", sum2             print *, 'sum2 = ', sum2
        write( 10, rec=2 ) b                 read( 10, rec=3 ) a
        write( 10, rec=3 ) a                 sum3 = 0.0
        close( 10 )                          do 30 i = 1, 10
        print *, 'close done'                   sum3 = sum3 + a( i )
        open( unit=10,access='direct',   30  continue
     +   recl=80,file='ymp.dir',             print *, 'sum3 = ', sum3
     +   form='unformatted',iostat=istat,    end
     +   status='old')
        print *, istat
        read( 10, rec=1 ) a
        sum1 = 0.0
        do 11 i = 1, 10
           sum1 = sum1 + a( i )
           write( 6, 600 ) a( i )
  11    continue
 600    format( o22 )
        print *, 'sum1 = ', sum1
        read( 10, rec=2 ) b
        sum2 = 0.0
        do 21 i = 1, 10
           sum2 = sum2 + b( i )
  21    continue
        print *, 'sum2 = ', sum2
        read( 10, rec=3 ) a
        sum3 = 0.0
        do 31 i = 1, 10
           sum3 = sum3 + a( i )
  31    continue
        print *, 'sum3 = ', sum3
        close( 10 )
        end

        Output on the Y-MP from ymp5.f       Output on the T3D from t3d5.f

         ier = 0                             0
         istat = 0                          0377600000000000000000
         sum1 = 55.                         0400000000000000000000
         sum2 = 550.                        0400100000000000000000
         close done                         0400200000000000000000
         0                                  0400240000000000000000
        0400014000000000000000              0400300000000000000000
        0400024000000000000000              0400340000000000000000
        0400026000000000000000              0400400000000000000000
        0400034000000000000000              0400420000000000000000
        0400035000000000000000              0400440000000000000000
        0400036000000000000000               sum1 = 55.
        0400037000000000000000               sum2 = 550.
        0400044000000000000000               sum3 = 55.
        0400044400000000000000
        0400045000000000000000
         sum1 = 55.
         sum2 = 550.
         sum3 = 55.
This last example is similar to the previous one, but in this case the written floating point numbers are in 32 bit IEEE format. On the T3D it is possible to read these numbers using arrays declared with the real*4 declaration and the f90 compiler. Notice that the previous example used the "-N ieee_dp" flag with asnunit, but in this case the flag is "-N ieee":

        On the Y-MP (ymp6.f)                On the T3D (t3d6.f)

        real a( 10 ), b( 10 )               real*4 a( 10 ), b( 10 )
        sum1 = 0.0                          sum1 = 0.0
        call asnunit( 10,                   open( unit=10,access='direct',
     +   ' -F syscall -N ieee', ier )    +   recl=80,file='ymp.dir',
        print *, 'ier = ', ier           +   form='unformatted',
        open( unit=10, file='ymp.dir',   +   iostat=istat,status='old')
     +   form='unformatted',recl=80 ,        print *, istat
     +   access='direct',iostat=istat,      if( istat .ne. 0 ) stop
     +   status='new' )                     read( 10, rec=1 ) a
        print *, 'istat = ' ,istat          sum1 = 0.0
        do 10 i = 1, 10                     do 10 i = 1, 10
           a( i ) = i                          sum1 = sum1 + a( i )
           sum1 = sum1 + a( i )                write( 6, 600 ) a( i )
  10    continue                        10  continue
        print *, "sum1 = ", sum1       600  format( o22 )
        write( 10, rec=1 ) a                print *, 'sum1 = ', sum1
        sum2 = 0.0                          read( 10, rec=2 ) b
        do 20 i = 1, 10                     sum2 = 0.0
           b( i ) = 10 * i                  do 20 i = 1, 10
           sum2 = sum2 + b( i )                sum2 = sum2 + b( i )
  20    continue                        20  continue
        print *, "sum2 = ", sum2            print *, 'sum2 = ', sum2
        write( 10, rec=2 ) b                read( 10, rec=3 ) a
        write( 10, rec=3 ) a                sum3 = 0.0
        close( 10 )                         do 30 i = 1, 10
        print *, 'close done'               sum3 = sum3 + a( i )
        open( unit=10,access='direct',  30  continue
     +   recl=80,file='ymp.dir',            print *, 'sum3 = ', sum3
     +   form='unformatted',iostat=istat,   end
     +   status='old')
        print *, istat
        read( 10, rec=1 ) a
        sum1 = 0.0
        do 11 i = 1, 10
           sum1 = sum1 + a( i )
           write( 6, 600 ) a( i )
  11    continue
 600    format( o22 )
        print *, 'sum1 = ', sum1
        read( 10, rec=2 ) b
        sum2 = 0.0
        do 21 i = 1, 10
           sum2 = sum2 + b( i )
  21    continue
        print *, 'sum2 = ', sum2
        read( 10, rec=3 ) a
        sum3 = 0.0
        do 31 i = 1, 10
           sum3 = sum3 + a( i )
  31    continue
        print *, 'sum3 = ', sum3
        close( 10 )
        end

        Output on the Y-MP from ymp6.f      Output on the T3D from t3d6.f

                                         0
         ier = 0                        7740000000
         istat = 0                     10000000000
         sum1 = 55.                    10020000000
         sum2 = 550.                   10040000000
         close done                    10050000000
         0                             10060000000
        0400014000000000000000         10070000000
        0400024000000000000000         10100000000
        0400026000000000000000         10104000000
        0400034000000000000000         10110000000
        0400035000000000000000           sum1 = 55.
        0400036000000000000000           sum2 = 550.
        0400037000000000000000           sum3 = 55.
        0400044000000000000000
        0400044400000000000000
        0400045000000000000000
         sum1 = 55.
         sum2 = 550.
         sum3 = 55.
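The T3D column in this output shows shorter octal words than in the previous examples because each real*4 value is a 32-bit IEEE number. A Python sketch (the function name is illustrative) decodes a few of them, assuming the standard IEEE 754 single precision layout:

```python
import struct

def decode_ieee32(octal_word):
    """Decode a 32-bit IEEE 754 single from its octal dump."""
    w = int(octal_word, 8)
    return struct.unpack(">f", w.to_bytes(4, "big"))[0]

# First three and last words from the T3D column of the output above.
t3d_words = ["7740000000", "10000000000", "10020000000", "10110000000"]
print([decode_ieee32(w) for w in t3d_words])   # [1.0, 2.0, 3.0, 10.0]
```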
In the next newsletter, we'll look more at these direct access files.

ARSC is Now at MAX 1.2.0.5

During the downtime on Tuesday 10/31/95, ARSC upgraded to UNICOS 8.0.4.1 and MAX 1.2.0.5. If any gremlins got into your T3D code since then, please contact Mike Ess. With this version of MAX, we have started using mppview instead of mppmon.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.