ARSC T3D Users' Newsletter 58, October 27, 1995

Getting the Right Answer with SHMEM_PUTS

Two conditions must be met to get correct answers with SHMEM_PUTS: synchronization and cache management. Maybe in the past this wasn't made explicit, and the time spent making the cache coherent was, in effect, the delay needed to synchronize the processors. The following scenarios show which sequences produce the correct value, B = 1, on PE1:

Scenario 1


     time           activity on PE0       activity on PE1
  (arbitrary units)
       0            A = 1                 A = 0
       1            start shmem_put A
       2                                  shmem_udcflush()
       3                                  B = A (update #1)
       4            end shmem_put               
       5                                  memory update finished
       6
       7                                  shmem_udcflush()
       8                                  B = A (update #3)
Update #1 is not correct even though it is preceded by a cache flush operation because the new value of A has not been furnished by PE0 yet. Update #3 should be correct because the memory of PE1 has been updated and the old value of A has been flushed from cache.

Scenario 2


     time           activity on PE0       activity on PE1
  (arbitrary units)
       0            A = 1                 A = 0
       1            start shmem_put A                  
       2
       3
       4            end shmem_put
       5                                  memory update finished
       6                                  B = A (update #2) 
       7                                  shmem_udcflush()
       8                                  B = A (update #3)
Even though update #2 happens after the memory of PE1 has the correct value for A, we still need the cache management routine to flush the cache of the old value of A. So to get the correct value on PE1 we need these two steps to happen:
  1. PE1's memory gets the correct value
  2. PE1's cache is flushed of the previous value
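
The two steps can be replayed with a toy model of PE1: one word of local memory plus a one-entry data cache. This is only a sketch for reasoning about the timelines above. All of the names (pe_model, remote_put_arrives, cache_flush and so on) are hypothetical and are not part of the SHMEM library, and the model assumes that a remote put updates memory without touching the cache.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of PE1 (hypothetical, not the SHMEM API): one word of
   local memory and a one-entry data cache that may hold a copy. */
typedef struct {
    long memory;       /* value of A in PE1's local memory     */
    long cached;       /* value of A held in PE1's data cache  */
    int  cache_valid;  /* nonzero while the cache holds A      */
} pe_model;

/* PE1 stores to A itself: memory and cache agree afterwards. */
void local_store(pe_model *pe, long v)
{
    pe->memory = v;
    pe->cached = v;
    pe->cache_valid = 1;
}

/* A put from PE0 completes: memory changes, the cache is not told
   (assumption of this model). */
void remote_put_arrives(pe_model *pe, long v)
{
    pe->memory = v;
}

/* Stands in for shmem_udcflush(): discard the cached copy. */
void cache_flush(pe_model *pe)
{
    pe->cache_valid = 0;
}

/* PE1 loads A: a hit returns the cached copy, a miss refills it. */
long local_load(pe_model *pe)
{
    if (!pe->cache_valid) {
        pe->cached = pe->memory;
        pe->cache_valid = 1;
    }
    return pe->cached;
}

/* Scenario 1: flush and read before the put has arrived. */
void scenario1(long *update1, long *update3)
{
    pe_model pe1;
    assert(update1 != NULL && update3 != NULL);
    local_store(&pe1, 0);          /* time 0: A = 0 on PE1            */
    cache_flush(&pe1);             /* time 2                          */
    *update1 = local_load(&pe1);   /* time 3: memory still holds 0    */
    remote_put_arrives(&pe1, 1);   /* time 5: memory update finished  */
    cache_flush(&pe1);             /* time 7                          */
    *update3 = local_load(&pe1);   /* time 8                          */
}

/* Scenario 2: read after the put has arrived but before the flush. */
void scenario2(long *update2, long *update3)
{
    pe_model pe1;
    assert(update2 != NULL && update3 != NULL);
    local_store(&pe1, 0);          /* time 0: A = 0 on PE1            */
    remote_put_arrives(&pe1, 1);   /* time 5: memory update finished  */
    *update2 = local_load(&pe1);   /* time 6: stale cache hit         */
    cache_flush(&pe1);             /* time 7                          */
    *update3 = local_load(&pe1);   /* time 8                          */
}
```

In this model updates #1 and #2 both come back as 0 and update #3 comes back as 1 in both scenarios, which matches the discussion above: neither the flush alone nor the memory update alone is enough.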
The cache flush can happen before or after the memory is updated but both the flush and the memory update must happen before the updated value is used. But a program is hardly ever as simple as the two scenarios above. In particular, we could have other actions implementing steps 1 and 2. For example:
  1. Processing on PE1 could invalidate A's cache line, meaning the flush was not necessary.
  2. The cache flush could push the access of A to a time after the memory update was complete.
But action 1 is just fortuitous, and action 2 gets the correct answer only because of timing. A program that depends on either 1 or 2 cannot be considered a correct program. Below is a program that shows that using the cache management routines alone is not enough:

  #include <stdio.h>
  #define NMAX 1000000
  
  main()
  {
          int i, j, itest, NPES, MYPE;
          long a[ NMAX ];
          long b[ NMAX ]; 
          int count, wait;
  
          NPES = _num_pes();
          MYPE = _my_pe();
          itest = 0;
  
    for( i = 0; i < NMAX; i++ ) b[ i ] = 2;
    for( j = 0; j < 10; j++ ) {
          barrier( );
            for( i = 0; i < NMAX; i++ ) a[ i ] = 1;
            if( MYPE == 0 ) printf( "case 1\n" ); fflush( stdout );
          barrier( );
          if( MYPE == 0 ) {
            shmem_put( a, b, 1, 1 );
          } else {
            count = 0;
  again:;
            if( a[ itest ] != 2 ) {
              count++;
              a[ itest+1 ] = 3;      /* one of these statements will flush     */ 
              a[ itest-1 ] = 3;      /* a[itest]'s cache line                  */
              goto again;
            }
            printf( " Case1, a[%d] on PE1 = %d, count = %d\n", itest, a[itest], count );
          }
          barrier( );
            if( MYPE == 0 ) printf( "case 2\n" ); fflush( stdout );
            for( i = 0; i < NMAX; i++ ) a[ i ] = 1;
          barrier( );
          if( MYPE == 0 ) {
            shmem_put( a, b, 1, 1 );
          } else {
            shmem_udcflush();
            printf( " Case2, a[%d] on PE1 = %d\n", itest, a[itest] );
          }
          barrier( );
            if( MYPE == 0 ) printf( "case 3\n" );
            for( i = 0; i < NMAX; i++ ) a[ i ] = 1;
          barrier( );
          if( MYPE == 0 ) {
            shmem_put( a, b, 1, 1 );
          } else {
            shmem_udcflush_line( &a[ itest ] );
            printf( " Case3, a[%d] on PE1 = %d\n", itest, a[itest] );
          }
          barrier( );
            if( MYPE == 0 ) printf( "case 4\n" ); fflush( stdout );
            for( i = 0; i < NMAX; i++ ) a[ i ] = 1;
          barrier( );
          if( MYPE == 0 ) {
            shmem_put( a, b, 1, 1 );
          } else {
            wait = 0;
            if( a[ itest ] == 1 ) {
              wait = 1;
              shmem_wait( &a[ itest ], 1 );
            }
            printf( " Case4, a[%d] on PE1 = %d, wait = %d\n", itest, a[itest], wait );
          }
          barrier( );
    }
  }
In all four cases PE0 changes the value of A on PE1, and PE1 uses a different strategy to obtain the updated value. A sample of the output:

  case 1
  Case1, a[0] on PE1 = 2, count = 90503
  case 2
  Case2, a[0] on PE1 = 1
  case 3
  Case3, a[0] on PE1 = 1
  case 4
  Case4, a[0] on PE1 = 2, wait = 1

  Case1: Shows that both synchronization and a cache flush are necessary
  Case2: Shows shmem_udcflush is not sufficient
  Case3: Shows shmem_udcflush_line is not sufficient
  Case4: Shows shmem_wait is sufficient
Using SHMEM_PUTS may be faster than using SHMEM_GETS, PVM, or MPI, but it requires the programmer to be more careful.

How Are SHMEM PUTS Implemented?

Moon Kyeong-Deok of KIST/SERI, in Korea, suggested that I reexamine the shmem_put timings of Newsletter #56 (10/13/95) in MB/second instead of raw times. This reveals a very interesting gap in the timing data, which leads me to think that shmem_put is implemented with one algorithm for messages shorter than 384 words and another for messages of 384 words or more. A plot of the timings shows this more dramatically (I can email the postscript plots to anyone who requests them), but the table below also shows this "discontinuity":

Table 1

execution speeds for shmem_put (MB/second)

  size of
  message  PE0toPE1 PE0toPE2 PE0toPE3 PE0toPE4 PE0toPE5 PE0toPE6 PE0toPE7
  (8-byte words)
   370       95.2     95.8     96.4     95.4     94.2     96.1     95.6
   380       94.3     95.2     95.5     94.3     94.1     94.4     94.0
   381       94.4     95.2     94.7     95.2     94.7     94.7     94.1
   382       94.7     95.5     95.0     95.9     95.5     95.5     95.3
   383       93.9     96.0     95.2     96.0     95.4     95.7     94.9
   384      110.8    113.6    113.9    113.7    114.6    114.3    113.7
   385      112.7    114.2    113.3    114.4    114.2    113.5    114.5
   386      111.7    114.7    114.6    114.7    114.8    113.9    114.8
   387      113.4    114.1    113.9    115.3    114.9    114.1     34.7
   388      113.1    112.6    113.1    113.3    114.7    113.7    104.5
   389      112.7    114.1    113.0    113.5    114.5    113.7    114.6
   390      113.1    114.5    114.3    115.4    114.0    114.3    114.3
   400      113.4    114.6    114.4    115.1    115.1    114.4    113.9
Without documentation, many library routines become opportunities for such "black box" experiments.
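
The "discontinuity" in Table 1 can also be located mechanically. The sketch below is only an illustration: the function names are hypothetical, the data are the PE0toPE1 column of Table 1, and the conversion from a raw time to MB/second assumes 8-byte words and 1 MB = 10^6 bytes, which the timings in Newsletter #56 may or may not have used.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a raw transfer time to MB/second.
   Assumptions: 8-byte words, 1 MB = 10^6 bytes. */
double mb_per_second(long words, double seconds)
{
    assert(seconds > 0.0);
    return (double)words * 8.0 / seconds / 1.0e6;
}

/* Return the index i (i >= 1) where rate[i] / rate[i-1] is largest,
   i.e. the first message size of the faster regime. */
size_t largest_jump(const double *rate, size_t n)
{
    size_t best = 1;
    double best_ratio;
    size_t i;

    assert(rate != NULL && n >= 2);
    best_ratio = rate[1] / rate[0];
    for (i = 2; i < n; i++) {
        double ratio = rate[i] / rate[i - 1];
        if (ratio > best_ratio) {
            best_ratio = ratio;
            best = i;
        }
    }
    return best;
}

/* PE0toPE1 column of Table 1: message size in words, rate in MB/second. */
const long table1_words[] = {
    370, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 400
};
const double table1_rate[] = {
    95.2, 94.3, 94.4, 94.7, 93.9, 110.8, 112.7,
    111.7, 113.4, 113.1, 112.7, 113.1, 113.4
};
```

Calling largest_jump( table1_rate, 13 ) and looking up the result in table1_words gives 384 words, the size at which the rate steps from about 94 to about 111 MB/second.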

Reading and Writing T3D and Y-MP Files

I have been avoiding this subject of reading and writing files for the other machine because it is a mix of trickology and magic. I think the best way to approach it is with a list of examples for the common cases and then a resourceful programmer can work from there. In the coming newsletters, I will try to come up with a list of working examples and expand the list below. In this issue, we have:

  Example      Type of    Source (writing)         Target (reading)
  (newsletter)   File

  1 (58)   Direct Access  T3D(t3d1.f)              Y-MP(ymp1.f), using asnunit
  2 (58)                  T3D(t3d1.f)              Y-MP(ymp2.f), using IEG2CRAY
  3 (58)                  T3D(t3d3.f),cray format  Y-MP(ymp3.f) 
Below are the sample programs and their output on each machine. The makefile that I used for this set of examples is:

  TT3D=TARGET=cray-t3d
  TY-MP=TARGET=cray-ymp
  MNPE=MPP_NPES=1

  t3d1:   t3d1.f
          -rm t3d.dir
          assign -R
          (export $(TT3D); cf77 t3d1.f )
          (export $(TT3D); export $(MNPE); a.out )

  t3d3:   t3d3.f
          -rm ymp.dir
          assign -R
          (export $(TT3D); cf77 t3d3.f )
          (export $(TT3D); export $(MNPE); a.out )

  ymp1:   ymp1.f
          assign -R
          (export $(TY-MP); cf77 ymp1.f )
          (export $(TY-MP); a.out )
  
  ymp2:   ymp2.f
          assign -R
          (export $(TY-MP); cf77 ymp2.f )
          (export $(TY-MP); a.out )
  
  ymp3:   ymp3.f
          assign -R
          (export $(TY-MP); cf77 ymp3.f )
          (export $(TY-MP); a.out )

  clean:
          rm -f a.out ymp.dir t3d.dir mppcore core
The "assign -R" command removes all options set by previous assigns and opens. It is essential to start with a clean slate for each example. Another invaluable trick: the error codes returned by the Fortran open statement are explained in the file:

  /usr/include/liberrno.h

Example 1

A direct access file written on the T3D and read on the Y-MP, with data conversion done by assign (or asnunit)

        t3d1.f                              ymp1.f

        real a( 10 ), b( 10 )               real a( 10 ), b( 10 )
        sum1 = 0.0                          sum1 = 0.0
        open( unit=10, file='t3d.dir',      call asnunit( 10,
     +   form='unformatted',recl=80,     +    '-F syscall -N ieee_dp', ier )
     +   access='direct',iostat=istat,      print *, ier
     +   status='new' )                     open( unit=10,access='direct',
        print *, istat                   +   recl=80,file='t3d.dir',
        do 10 i = 1, 10                  +   form='unformatted',iostat=istat,
           a( i ) = i                    +   status='old')
           sum1 = sum1 + a( i )             print *, istat
  10    continue                            if( istat .ne. 0 ) stop
        print *, "sum1 = ", sum1            read( 10, rec=1 ) a
        write( 10, rec=1 ) a                sum1 = 0.0
        sum2 = 0.0                          do 10 i = 1, 10
        do 20 i = 1, 10                        sum1 = sum1 + a( i )
           b( i ) = 10 * i                     write( 6, 600 ) a( i )
           sum2 = sum2 + b( i )       10    continue
  20    continue                     600    format( o22 )
        print *, "sum2 = ", sum2            print *, 'sum1 = ', sum1
        write( 10, rec=2 ) b                read( 10, rec=2 ) b
        write( 10, rec=3 ) a                sum2 = 0.0
        close( 10 )                         do 20 i = 1, 10
        print *, 'close done'                  sum2 = sum2 + b( i )
        open( unit=10,                20    continue
     +   access='direct',                   print *, 'sum2 = ', sum2
     +   recl=80,file='t3d.dir',            read( 10, rec=3 ) a
     +   form='unformatted',                sum3 = 0.0
     +   iostat=istat,                      do 30 i = 1, 10
     +   status='old')                         sum3 = sum3 + a( i )
        print *, istat                30    continue
        read( 10, rec=1 ) a                 print *, 'sum3 = ', sum3
        sum1 = 0.0                          end
        do 11 i = 1, 10
           sum1 = sum1 + a( i )
           write( 6, 600 ) a( i )
  11    continue
 600    format( o22 )
        print *, 'sum1 = ', sum1
        read( 10, rec=2 ) b
        sum2 = 0.0
        do 21 i = 1, 10
           sum2 = sum2 + b( i )
  21    continue
        print *, 'sum2 = ', sum2
        read( 10, rec=3 ) a
        sum3 = 0.0
        do 31 i = 1, 10
           sum3 = sum3 + a( i )
  31    continue
        print *, 'sum3 = ', sum3
        close( 10 )
        end

        Output on T3D                        Output on Y-MP
         
          0                                  0
          sum1 =        55.                  0
          sum2 =        550.                0400014000000000000000
          close done                        0400024000000000000000
          0                                 0400026000000000000000
         0377600000000000000000             0400034000000000000000
         0400000000000000000000             0400035000000000000000
         0400100000000000000000             0400036000000000000000
         0400200000000000000000             0400037000000000000000
         0400240000000000000000             0400044000000000000000
         0400300000000000000000             0400044400000000000000
         0400340000000000000000             0400045000000000000000
         0400400000000000000000              sum1 = 55.
         0400420000000000000000              sum2 = 550.
         0400440000000000000000              sum3 = 55.
          sum1 =        55.
          sum2 =        550.
          sum3 =        55.

Example 2

A direct access file written on the T3D and read on the Y-MP, with data conversion done by the routine IEG2CRAY (see the man page on denali)

        t3d1.f(same as above)               ymp2.f

                                            real a( 10 ), b( 10 )
                                            real c( 10 )
                                            sum1 = 0.0
                                            open( unit=10,
                                         +   access='direct',recl=80,
                                         +   file='t3d.dir',form='unformatted',
                                         +   iostat=istat,status='old')
                                            print *, istat
                                            if( istat .ne. 0 ) stop
                                            read( 10, rec=1 ) a
                                            ier = ieg2cray( 8,10,a,0,c,1 )
                                            print *, 'ier =', ier
                                            sum1 = 0.0
                                            do 10 i = 1, 10
                                               sum1 = sum1 + c( i )
                                               write( 6, 600 ) a( i ), c( i )
                                      10    continue
                                     600    format( o22, 1x, o22 )
                                            print *, 'sum1 = ', sum1
                                            read( 10, rec=2 ) b
                                            ier = ieg2cray( 8,10,b,0,c,1 )
                                            sum2 = 0.0
                                            do 20 i = 1, 10
                                               sum2 = sum2 + c( i )
                                      20    continue
                                            print *, 'sum2 = ', sum2
                                            read( 10, rec=3 ) a
                                            ier = ieg2cray( 8,10,a,0,c,1 )
                                            sum3 = 0.0
                                            do 30 i = 1, 10
                                               sum3 = sum3 + c( i )
                                      30    continue
                                            print *, 'sum3 = ', sum3
                                            end
                                   
                                       Output on Y-MP

                                0
                                ier =0
                               0377600000000000000000 0400014000000000000000
                               0400000000000000000000 0400024000000000000000
                               0400100000000000000000 0400026000000000000000
                               0400200000000000000000 0400034000000000000000
                               0400240000000000000000 0400035000000000000000
                               0400300000000000000000 0400036000000000000000
                               0400340000000000000000 0400037000000000000000
                               0400400000000000000000 0400044000000000000000
                               0400420000000000000000 0400044400000000000000
                               0400440000000000000000 0400045000000000000000
                                sum1 = 55.
                                sum2 = 550.
                                sum3 = 55.
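
The two columns in the output above are the same ten values, first as IEEE bit patterns and then in Cray format. The sketch below shows how the two layouts relate. It is not the IEG2CRAY implementation, and the function names are hypothetical. It assumes the Cray single-precision layout of 1 sign bit, a 15-bit exponent biased by 040000 (octal) and a 48-bit mantissa with no hidden bit. It simply truncates the five low-order IEEE mantissa bits and does not handle subnormals, infinities or NaNs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Raw bit pattern of an IEEE 754 double. */
uint64_t ieee_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Convert an IEEE double to a 64-bit Cray-format word (assumed layout:
   1 sign bit, 15-bit exponent biased by 040000, 48-bit mantissa).
   Returns 0 on success, -1 for inputs this sketch does not handle. */
int ieee_to_cray(double x, uint64_t *cray)
{
    uint64_t bits = ieee_bits(x);
    uint64_t sign = bits >> 63;
    uint64_t exp  = (bits >> 52) & 0x7FF;
    uint64_t frac = bits & ((UINT64_C(1) << 52) - 1);
    uint64_t mant, cexp;

    assert(cray != NULL);
    if (exp == 0 && frac == 0) {        /* zero */
        *cray = 0;
        return 0;
    }
    if (exp == 0 || exp == 0x7FF)       /* subnormal, infinity or NaN */
        return -1;

    /* IEEE value: 1.frac * 2^(exp-1023) = 0.1frac * 2^(exp-1022).
       The Cray mantissa stores the leading 1 explicitly in bit 47. */
    mant = (UINT64_C(1) << 47) | (frac >> 5);   /* drops 5 low bits */
    cexp = 040000 + exp - 1022;
    *cray = (sign << 63) | (cexp << 48) | mant;
    return 0;
}

/* Format a word as 22 octal digits, like the Fortran o22 edit descriptor. */
void octal22(uint64_t word, char out[23])
{
    snprintf(out, 23, "%022llo", (unsigned long long)word);
}
```

For 1.0, octal22( ieee_bits( 1.0 ), buf ) and the converted word reproduce the first row of the output above, 0377600000000000000000 and 0400014000000000000000.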

Example 3

A direct access file written on the T3D with Cray-format floating point numbers and read on the Y-MP

        t3d3.f                                ymp3.f


        real a( 10 ), b( 10 )               real a( 10 ), b( 10 )
        sum1 = 0.0                          sum1 = 0.0
        call asnunit( 10,                   open( unit=10,access='direct',
     +   ' -F syscall -N cray', ier )    +   recl=80,file='ymp.dir', 
        print *, 'ier = ', ier           +   form='unformatted',
        open( unit=10, file='ymp.dir',   +   iostat=istat,status='old')
     +   form='unformatted',recl=80,        print *, istat
     +   access='direct',iostat=istat,      if( istat .ne. 0 ) stop
     +   status='new' )                     read( 10, rec=1 ) a
        print *, 'istat = ' ,istat          sum1 = 0.0
        do 10 i = 1, 10                     do 10 i = 1, 10
           a( i ) = i                          sum1 = sum1 + a( i )
           sum1 = sum1 + a( i )                write( 6, 600 ) a( i )
  10    continue                      10    continue
        print *, "sum1 = ", sum1     600    format( o22 )
        write( 10, rec=1 ) a                print *, 'sum1 = ', sum1
        sum2 = 0.0                          read( 10, rec=2 ) b
        do 20 i = 1, 10                     sum2 = 0.0
           b( i ) = 10 * i                  do 20 i = 1, 10
           sum2 = sum2 + b( i )                sum2 = sum2 + b( i )
  20    continue                      20    continue
        print *, "sum2 = ", sum2            print *, 'sum2 = ', sum2
        write( 10, rec=2 ) b                read( 10, rec=3 ) a
        write( 10, rec=3 ) a                sum3 = 0.0
        close( 10 )                         do 30 i = 1, 10
        print *, 'close done'                  sum3 = sum3 + a( i )
        open( unit=10,access='direct',  30    continue
     +   recl=80,file='ymp.dir',            print *, 'sum3 = ', sum3
     +   form='unformatted',                end
     +   iostat=istat,status='old')
        print *, istat
        read( 10, rec=1 ) a
        sum1 = 0.0
        do 11 i = 1, 10
           sum1 = sum1 + a( i )
           write( 6, 600 ) a( i )
  11    continue
 600    format( o22 )
        print *, 'sum1 = ', sum1
        read( 10, rec=2 ) b
        sum2 = 0.0
        do 21 i = 1, 10
           sum2 = sum2 + b( i )
  21    continue
        print *, 'sum2 = ', sum2
        read( 10, rec=3 ) a
        sum3 = 0.0
        do 31 i = 1, 10
           sum3 = sum3 + a( i )
  31    continue
        print *, 'sum3 = ', sum3
        close( 10 )
        end

        Output on T3D                        Output on Y-MP

          ier =     0                         0
          istat = 0                         0400014000000000000000
          sum1 =        55.                 0400024000000000000000
          sum2 =        550.                0400026000000000000000
          close done                        0400034000000000000000
          0                                 0400035000000000000000
         0377600000000000000000             0400036000000000000000
         0400000000000000000000             0400037000000000000000
         0400100000000000000000             0400044000000000000000
         0400200000000000000000             0400044400000000000000
         0400240000000000000000             0400045000000000000000
         0400300000000000000000              sum1 = 55.
         0400340000000000000000              sum2 = 550.
         0400400000000000000000              sum3 = 55.
         0400420000000000000000
         0400440000000000000000
          sum1 =        55.
          sum2 =        550.
          sum3 =        55.
I would like to thank those who sent in suggestions or examples about reading and writing files on the other machine. We will all benefit from their effort to share hard-earned experience:
  • Wieslaw Maslowski, NCAR, Boulder, Colorado
  • Vera Voronina, Geophysical Institute, UAF, Fairbanks, Alaska
  • Adwait Sathye, Center for Analysis and Prediction of Storms, Norman, OK
  • Frank Chism, CRI, Seattle, WA
I am still looking for more tricks and magic; if you have some, we'd all like to see them.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.