ARSC T3D Users' Newsletter 62, November 24, 1995

CRAFT 'atomic update' Extension

The following article appeared in the Cray Service Bulletin in August 1994. It describes the 'atomic update' extension of CRAFT Fortran that I use in the second article of this newsletter.


  > CRAFT vector updates and array syntax
  >  
  > Systems:     CRAY T3D
  > OS:          UNICOS MAX
  > Product:     CF77 Programming Environment for MPP
  > Audience:    Programmers
  > Date:        August 1994
  >  
  > Vector updates are assignment statements that modify or update an
  > array reference that has an element of indirection. The following
  > is an example:
  >  
  > DO I=1, N
  >   X(IX(I)) = X(IX(I)) + V(I)
  > END DO
  >  
  > In this example, IX may contain values that occur more than once;
  > if so, executing the update in parallel can cause race
  > conditions, which may produce incorrect results for X.
  >  
  > The Cray Research Adaptive Fortran (CRAFT) language has a
  > directive, called ATOMIC UPDATE, that directs the compiler to
  > ensure that multiple updates to one shared element occur
  > atomically; that is, X(IX(I)) will be evaluated by only one
  > processor element (PE) at a time. The syntax of the ATOMIC UPDATE
  > directive is as follows:
  >  
  > CDIR$ DOSHARED (I) ON IX(I)
  >       DO I = 1, N
  > CDIR$ ATOMIC UPDATE
  >         X(IX(I)) = X(IX(I)) + V(I)
  >       END DO
  >  
  > The following examples all are very similar, and they all will
  > produce the correct results only when run on one PE. Each example
  > contains a race condition that will cause incorrect results when
  > run using multiple PEs.
  >  
  >  
  > Example 1: DOSHARED vector update without using ATOMIC UPDATE
  > directive 
  >  
  > Failure to use an ATOMIC UPDATE directive to protect the vector
  > update in the following test case will lead to a race condition:
  >  
  >         program ex1
  >         real x(1024)
  >         integer ix(1024)
  > cdir$ shared x(:block), ix(:block)
  >         data x/1024*0./, ix/1024*1/
  > cdir$ doshared (i) on ix(i)
  >         do i = 1, 1024
  >           x(ix(i)) = x(ix(i)) + 1. 
  >         end do
  > cdir$ master
  >         print *, 'x(1) = ', x(1)
  > cdir$ end master
  >         end
  >  
  > Compilation and execution are as follows:
  >  
  > t3d: cf77 ex1.f
  > t3d: a.out -npes 8
  > x(1)= 187.
  > t3d: a.out -npes 8
  > x(1)= 185.
  > t3d: a.out -npes 1
  > x(1)= 1024.
  >  
  > To run this code using multiple PEs and obtain the correct
  > results, you must include an ATOMIC UPDATE directive immediately
  > before the vector update.
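  >  
  > For example, a sketch of the corrected loop for Example 1, with
  > the ATOMIC UPDATE directive placed as shown above, is:
  >  
  > cdir$ doshared (i) on ix(i)
  >         do i = 1, 1024
  > cdir$ atomic update
  >           x(ix(i)) = x(ix(i)) + 1.
  >         end do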
  >  
  >  
  > Example 2: Use of array syntax 
  >  
  > If IX contains values that occur more than once, using array
  > syntax to express the vector update of X will introduce a race
  > condition. When using the following array syntax, the indirection
  > of X(IX(I)) is less obvious:
  >  
  >         program ex2
  >         real x(1024)
  >         integer ix(1024)
  > cdir$ shared x(:block), ix(:block)
  >         data x/1024*0./, ix/1024*1/
  >         x(ix)=x(ix)+1.
  > cdir$ master
  >         print *, 'x(1)= ', x(1)
  > cdir$ end master
  >         end
  >  
  > Compilation and execution are as follows:
  >  
  > t3d: cf77 ex2.f
  > t3d: a.out -npes 8
  > x(1)= 183.
  > t3d: a.out -npes 8
  > x(1)= 180.
  > t3d: a.out -npes 1
  > x(1)= 1024.
  >  
  > Because the compiler treats array syntax that uses shared arrays
  > as shared loops, the preceding syntax introduces a race
  > condition; also, it violates the Fortran 90 "Vector Subscript"
  > standard (paragraph 6.2.2.3.2), which prohibits array syntax that
  > has repeated indices in a vector update. The standard states:
  >  
  > "A vector subscript designates a sequence of subscripts
  > corresponding to the values of the elements of the expression.
  > Each element of the expression must be defined. A `many-one array
  > section' is an array section with a vector subscript having two
  > or more elements with the same value. A `many-one array section'
  > must not appear on the left of the equals in an assignment
  > statement or as an input item in a READ statement."
  >  
  >  
  > Example 3: Using a CDIR$ ATOMIC UPDATE before the array syntax in
  > example 2
  >  
  > Although it may appear that the use of a CDIR$ ATOMIC UPDATE
  > directive before the x(ix)=x(ix)+1. statement in Example 2 would
  > correct the race condition, it will not, as the following test
  > case shows:
  >  
  >         program ex3
  >         real x(1024)
  >         integer ix(1024)
  > cdir$ shared x(:block), ix(:block)
  >         data x/1024*0./, ix/1024*1/
  > cdir$ atomic update
  >         x(ix)=x(ix)+1.
  > cdir$ master
  >         print *, 'x(1)= ', x(1)
  > cdir$ end master
  >         end
  >  
  > Compilation and execution are as follows:
  >  
  > t3d: cf77 ex3.f
  > t3d: a.out -npes 8
  > x(1)= 184.
  > t3d: a.out -npes 8
  > x(1)= 185.
  > t3d: a.out -npes 1
  > x(1)= 1024.
  >  
  > It still violates the Fortran 90 "Vector Subscript" standard;
  > therefore, the compiler does not consider the ATOMIC UPDATE
  > directive before the array syntax when it do-shares the array
  > syntax.
  >  
  > In summary, the ATOMIC UPDATE directive is required for vector
  > updates that contain repeated indices, and you must place this
  > directive immediately before the assignment statement, inside a
  > user DOSHARED loop.

<end of article>

Managing Parallel I/O on the T3D

Over the past year, several ARSC users have complained that their T3D jobs hang when doing I/O. These cases usually implement parallel I/O with one of the two methods described in newsletter #60 (11/10/95): a separate unformatted file per PE, or a separate direct-access record per PE. Both methods allow the user to swamp the I/O capabilities of the system, which I believe is the reason for the hangs, and I advocate managing I/O to avoid them. I haven't had much success with this advocacy because it requires extra coding to manage these large transfers. In this article I try to show that the effort to manage I/O is small.
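
Of those two methods, only the file-per-PE variant appears in the examples below. For reference, the record-per-PE variant looks something like the following sketch (this is not the newsletter #60 code; the file name and unit number are only illustrative, and the units of RECL vary between compilers, so check the OPEN documentation before relying on this):

      parameter( MAX = 7000000 )
      real a ( MAX )
      intrinsic my_pe
      integer mype
      mype = my_pe()
c for direct access the default FORM is 'unformatted'; the RECL
c of 8*MAX assumes byte units with 8-byte words (an assumption)
      open( 10, file='/tmp/ess/direct', access='direct', recl=8*MAX )
c every PE writes its 7 MW into its own record of the shared file
      write( 10, rec = mype+1 ) a
      end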

In most cases, it seems that the results are ready for saving to disk only after a long time spent computing, which makes this a particularly unfortunate time for a hang. That the hangs are intermittent and not reproducible adds to the user's frustration. Of the three cases I know of, the results have been:

  1. One user gave up and moved his application to the SP2.
  2. One user continues to hope the run won't hang, but kills the run if the I/O phase takes too long. (Being a non-multiprogrammed machine, the T3D is very predictable time-wise.)
  3. Another user (me) has wasted a lot of computer time and programming time trying to find a program that reliably reproduces the I/O hang, but with no luck.
As an added benefit, managing this I/O load can also improve the overall transfer rate because of reduced contention for shared resources. Let's look at a very simple starting case:

      parameter( MAX = 7000000 )
      real a ( MAX )
      intrinsic my_pe
      character*16 filename
      character*2 ciun
      iun = 10 + my_pe()
      write( ciun, "(i2)" ) iun
      filename = '/tmp/ess/fort.'//ciun
      open( iun, form = 'unformatted', file=filename )
      call barrier()
      t1 = second()
      write( iun ) a
      t1 = second() - t1
      write( 6, 600 ) MAX, iun, MAX/( 1000000.0 * t1 ), t1
  600 format( i10, i4, f10.3, f10.6 ) 
      end
This program, running on N$PES PEs, has each PE write its own huge file. Typical output on 8 PEs is:

  7000000  12     0.486 14.390356
  7000000  16     0.485 14.419075
  7000000  17     0.485 14.447319
  7000000  10     0.484 14.455764
  7000000  15     0.484 14.464045
  7000000  14     0.484 14.475598
  7000000  11     0.483 14.483802
  7000000  13     0.478 14.643125
So we can say that the aggregate output rate is about 8 * 0.484 = 3.87 MW/s. Now, as a first attempt to manage this I/O, let's have an array of N$PES flags to signal when a PE may write its file. As each PE finishes its I/O, it sets the next PE's flag, and then that PE does its I/O without any contention from the other PEs. Here's a possibility:

        parameter( MAX = 7000000 )
        real a ( MAX )
        intrinsic my_pe
        integer mype, npes 
        character*16 filename
        character*2 ciun
        integer flags( 0:127 )
  cdir$ shared flags( :block )
        mype = my_pe()
        npes = n$pes
  cdir$ master
        do i = 0, npes-1
           flags( i ) = 0
        enddo
        flags( 0 ) = 1
  cdir$ endmaster
        iun = 10 + my_pe()
        write( ciun, "(i2)" ) iun
        filename = '/tmp/ess/fort.'//ciun
        open( iun, form = 'unformatted', file=filename )
        call barrier()
        t1 = second()
    10  continue
  cdir$ suppress
        if( flags( mype ) .eq. 1 ) then
           write( iun ) a
           flags( mype+1 ) = 1
        else
           goto 10
        endif
        t1 = second() - t1
        write( 6, 600 ) MAX, iun, MAX/( 1000000.0 * t1 ), t1
   600  format( i10, i4, f10.3, f10.6 ) 
        end
To keep this program from hanging, it must be compiled with -ooff or use CDIR$ SUPPRESS as shown above; otherwise the compiler may keep flags( mype ) in a register, and the spinning PE will never see the value set by its predecessor. (Testing a shared array this way was discussed in newsletter #12 (11/11/94).) A typical run is:

  7000000  10     2.349  2.979504
  7000000  11     1.278  5.479143
  7000000  12     0.882  7.940334
  7000000  13     0.675 10.369488
  7000000  14     0.541 12.935420
  7000000  15     0.456 15.353703
  7000000  16     0.391 17.880117
  7000000  17     0.344 20.330558
From this run we can compute the aggregate I/O speed as:

  7000000 * 8 / 20.330558 = 2.75 MW/s
This is slower, but managed (notice that the output now comes out in PE order).

The I/O hangs at ARSC have occurred with more than 8 PEs. Rather than single-threading all I/O operations, as in the above example, it is possible to let several PEs, but fewer than N$PES, write concurrently by using the "atomic update" Fortran extension. Here is an example:


        parameter( MAX = 7000000 )
        real a ( MAX )
        intrinsic my_pe
        integer mype, npes 
        character*17 filename
        character*2 ciun
  cdir$ shared ihi
        mype = my_pe()
        npes = n$pes
        iun = 10 + my_pe()
        write( ciun, "(i2)" ) iun
        filename = '/tmp/ess/fort.'//ciun
        open( iun, form = 'unformatted', file=filename )
  c set the number of PEs that can do I/O concurrently
        ihi = 4
        call barrier()
        t1 = second()
    10  continue
  cdir$ suppress
           if( mype .lt. ihi ) then
              write( iun ) a
  cdir$ atomic update
              ihi = ihi + 1
           else
              goto 10
           endif
        t1 = second() - t1
        call barrier()
        if( mype .eq. (npes-1) ) then
           write( 6, 601 ) ihi, npes*MAX, npes*MAX/( 1000000.0 * t1 )
        endif
   100  continue
   601  format( i4, i14, 4x, f10.3        ) 
        end
Below are the results for various values of N$PES and an increasing number of concurrent I/O operations. The variability is typical of a heavily loaded machine like ARSC's Y-MP, but some trends can be discerned:

Table 1

I/O transfer rates (MW/s), with each PE transferring 7 MW

# of   test 1 test 2 test 3 test 4 test 5 test 6 test 7 test 8
concurrent
xfers   8 PEs  8 PEs 16 PEs 16 PEs 32 PEs 32 PEs 64 PEs 64 PEs

   1   2.706  2.191  2.083  2.108  1.688  2.336  1.624
   2   4.508  2.849  4.360  3.319  2.728  2.242  2.611  2.529
   3   4.945  2.660  5.431  3.772  2.853  2.972  2.648
   4   3.999  2.299  4.881  3.109  2.482  2.633  2.392  4.216
   5   4.663  2.750  1.941  4.533  2.671  2.537  1.792
   6   4.495  3.822  5.040  3.003  2.264  2.571  1.807  3.649
   7   4.244  3.010  4.493  3.273  1.932  2.404  2.324
   8   4.235  2.796  4.385  2.998  2.405  1.886  2.290  3.407
   9                 1.915  3.288  2.377  1.648  2.694
  10                 3.605  3.784  1.809  2.079  2.967  4.923
  11                 4.365  3.266  1.705  1.584  2.722
  12                 4.526  2.712  1.745  1.795  2.665  3.917
  13                 4.884  3.276  1.510  1.572  2.959
  14                 2.134  2.627  4.301  1.852  3.149  5.047
  15                 1.807  1.917  1.025  1.850  2.890
  16                 3.378  2.690  1.691  1.458  3.054  4.041
  17                               1.542  1.642  2.910
  18                               1.968  2.207  2.758  3.677
  19                               3.931  1.432  3.412
  20                               1.604  1.640  2.578  4.240
  21                               1.666  1.939  3.318
  22                               2.886  1.384  3.200  3.327
  23                               1.345  1.188  3.061
  24                               1.515  1.700  3.723  3.331
  25                               1.506  1.771  2.803
  26                               1.824  2.089  3.268  3.648
  27                               1.533  1.783  2.970
  28                               1.565  1.428  2.604  3.660
  29                               1.507  1.593  2.545
  30                               2.024  1.614  3.351  3.302
  31                               1.516  1.442  2.970
  32                               1.436  1.789  3.073  3.383
  33                                             2.936
  34                                             3.029  2.773
  35                                             3.244
  36                                             3.258  2.719
  37                                             2.812
  38                                             2.487  3.154
  39                                             2.347
  40                                             2.913  3.695
  41                                             2.188
  42                                             3.126  2.718
  43                                             2.963
  44                                             3.971  2.796
  45                                             3.221
  46                                             2.317  2.868
  47                                             2.320
  48                                             2.839  2.845
  49                                             2.460
  50                                             2.433  3.196
  51                                             2.363
  52                                             2.445  3.213
  53                                             2.850
  54                                             2.858  3.402
  55                                             2.843
  56                                             3.069  2.920
  57                                             3.184
  58                                             2.648  2.798
  59                                             2.744
  60                                                    3.561
  61
  62                                                    3.203
  63
  64                                                    3.113
Although the variability of the timings obscures firm conclusions, I think we can say:
  1. Beyond a relatively small number of concurrent transfers (say 16), there is no improvement in I/O transfer rate.
  2. The best results consistently seem to come with fewer than 8 concurrent transfers.
  3. For runs with 32 and 64 PEs, using all PEs to transfer concurrently produces less than optimal transfer rates.
As shown in the example program above, the modification from having all N$PES PEs write concurrently to having a managed number of PEs write concurrently is small, and the I/O rate is not affected. If this technique averts the possibility of a hang, that is an added benefit.
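
Stripped of the timing code, the entire managed write amounts to the following (a condensed sketch of the example above; the initial value of 4 concurrent writers and the /tmp/ess file names are just the illustrative choices used earlier):

        parameter( MAX = 7000000 )
        real a ( MAX )
        intrinsic my_pe
        integer mype
        character*16 filename
        character*2 ciun
  cdir$ shared ihi
        mype = my_pe()
        iun = 10 + mype
        write( ciun, "(i2)" ) iun
        filename = '/tmp/ess/fort.'//ciun
        open( iun, form = 'unformatted', file = filename )
  c at most 4 PEs write at any one time
        ihi = 4
        call barrier()
    10  continue
  cdir$ suppress
        if( mype .lt. ihi ) then
           write( iun ) a
  c the write is done, so let one more PE start writing
  cdir$ atomic update
           ihi = ihi + 1
        else
           goto 10
        endif
        end
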
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions and Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.