ARSC T3D Users' Newsletter 62, November 24, 1995
CRAFT 'atomic update' Extension
The following article appeared in the Cray Service Bulletin in August 1994. It gives some description of the 'atomic update' extension of CRAFT Fortran that I used in the second article of this newsletter.
> CRAFT vector updates and array syntax
>
> Systems:  CRAY T3D
> OS:       UNICOS MAX
> Product:  CF77 Programming Environment for MPP
> Audience: Programmers
> Date:     August 1994
>
> Vector updates are assignment statements that modify or update an
> array reference that has an element of indirection. The following
> is an example:
>
>       DO I=1, N
>          X(IX(I)) = X(IX(I)) + V(I)
>       END DO
>
> In this example, IX may contain values that occur more than once;
> if so, executing the update in parallel can cause race
> conditions, which may produce incorrect results for X.
>
> The Cray Research Adaptive Fortran (CRAFT) language has a
> directive, called ATOMIC UPDATE, that directs the compiler to
> ensure that multiple updates to one shared element occur
> atomically; that is, X(IX(I)) will be evaluated by only one
> processor element (PE) at a time. The syntax of the ATOMIC UPDATE
> directive is as follows:
>
> CDIR$ DOSHARED (I) on IX(I)
>       DO I = 1, N
> CDIR$ ATOMIC UPDATE
>          X(IX(I)) = X(IX(I)) + V(I)
>       END DO
>
> The following examples all are very similar, and they all will
> produce the correct results only when run on one PE. Each example
> contains a race condition that will cause incorrect results when
> run using multiple PEs.
>
>
> Example 1: DOSHARED vector update without using ATOMIC UPDATE
> directive
>
> Failure to use an ATOMIC UPDATE directive to protect the vector
> update in the following test case will lead to a race condition:
>
>       program ex1
>       real x(1024)
>       integer ix(1024)
> cdir$ shared x(:block), ix(:block)
>       data x/1024*0./, ix/1024*1/
> cdir$ doshared (i) on ix(i)
>       do i = 1, 1024
>          x(ix(i)) = x(ix(i)) + 1.
>       end do
> cdir$ master
>       print *, 'x(1) = ', x(1)
> cdir$ end master
>       end
>
> Compilation and execution is as follows:
>
>   t3d: cf77 ex1.f
>   t3d: a.out -npes 8
>   x(1)= 187.
>   t3d: a.out -npes 8
>   x(1)= 185.
>   t3d: a.out -npes 1
>   x(1)= 1024.
>
> To run this code using multiple PEs and obtain the correct
> results, you must include an ATOMIC UPDATE directive immediately
> before the vector update.
>
>
> Example 2: Use of array syntax
>
> If IX contains values that occur more than once, using array
> syntax to express the vector update of X will introduce a race
> condition. When using the following array syntax, the indirection
> of X(IX(I)) is less obvious:
>
>       program ex2
>       real x(1024)
>       integer ix(1024)
> cdir$ shared x(:block), ix(:block)
>       data x/1024*0./, ix/1024*1/
>       x(ix)=x(ix)+1.
> cdir$ master
>       print *, 'x(1)= ', x(1)
> cdir$ end master
>       end
>
> Compilation and execution is as follows:
>
>   t3d: cf77 ex2.f
>   t3d: a.out -npes 8
>   x(1)= 183.
>   t3d: a.out -npes 8
>   x(1)= 180.
>   t3d: a.out -npes 1
>   x(1)= 1024.
>
> Because the compiler treats array syntax that uses shared arrays
> as shared loops, the preceding syntax introduces a race
> condition; also, it violates the Fortran 90 "Vector Subscript"
> standard (paragraph 6.2.2.3.2), which prohibits array syntax that
> has repeated indices in a vector update. The standard states:
>
> "A vector subscript designates a sequence of subscripts
> corresponding to the values of the elements of the expression.
> Each element of the expression must be defined. A `many-one array
> section' is an array section with a vector subscript having two
> or more elements with the same value. A `many-one array section'
> must not appear on the left of the equals in an assignment
> statement or as an input item in a READ statement."
>
>
> Example 3: Using a CDIR$ ATOMIC UPDATE before the array syntax in
> example 2
>
> Although it may appear that the use of a CDIR$ ATOMIC UPDATE
> directive before the x(ix)=x(ix)+1. statement in Example 2 would
> correct the race condition, it will not, as the following test
> case shows:
>
>       program ex3
>       real x(1024)
>       integer ix(1024)
> cdir$ shared x(:block), ix(:block)
>       data x/1024*0./, ix/1024*1/
> cdir$ atomic update
>       x(ix)=x(ix)+1.
> cdir$ master
>       print *, 'x(1)= ', x(1)
> cdir$ end master
>       end
>
> Compilation and execution is as follows:
>
>   t3d: cf77 ex3.f
>   t3d: a.out -npes 8
>   x(1)= 184.
>   t3d: a.out -npes 8
>   x(1)= 185.
>   t3d: a.out -npes 1
>   x(1)= 1024.
>
> It still violates the Fortran 90 "Vector Subscript" standard;
> therefore, the compiler does not consider the ATOMIC UPDATE
> directive before the array syntax when it do-shares the array
> syntax.
>
> In summary, the ATOMIC UPDATE directive is required for vector
> updates that contain repeated indices, and you must place this
> directive immediately before the assignment statement, inside a
> user DOSHARED loop.

<end of article>
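
The effect of repeated indices can be reproduced away from the T3D with a small sketch. It is written in Python rather than CRAFT Fortran (a substitution made only so that it runs anywhere), and the function names serial_loop, gather_then_scatter and locked_update are hypothetical. A threading.Lock stands in for ATOMIC UPDATE; this is an analogy only, not a description of how the CRAFT compiler implements the directive. Indices are 0-based here, so x[0] plays the role of x(1).

```python
import threading

N = 1024
ix = [0] * N  # every index names the same element (Fortran: ix/1024*1/)


def serial_loop():
    """One PE: each update sees the result of the previous one."""
    x = [0.0] * N
    for i in range(N):
        x[ix[i]] += 1.0
    return x[0]


def gather_then_scatter():
    """Read the whole right-hand side first, then store it.

    With a many-one subscript every store writes the same stale value,
    which is why such an assignment has no well-defined result.
    """
    x = [0.0] * N
    rhs = [x[j] + 1.0 for j in ix]
    for j, value in zip(ix, rhs):
        x[j] = value
    return x[0]


def locked_update(npes=8):
    """Several workers, each owning a block of i, with the update protected."""
    x = [0.0] * N
    lock = threading.Lock()  # assumption: stand-in for ATOMIC UPDATE
    block = N // npes

    def pe(rank):
        for i in range(rank * block, (rank + 1) * block):
            with lock:  # only one worker at a time updates x
                x[ix[i]] += 1.0

    workers = [threading.Thread(target=pe, args=(r,)) for r in range(npes)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return x[0]


print(serial_loop(), gather_then_scatter(), locked_update())
```

The serial loop and the locked version both keep all 1024 increments, while reading the whole right-hand side before storing leaves x[0] at 1.0. The values of 180 to 187 reported on 8 PEs lie between these two extremes, which is what one would expect when some, but not all, of the updates are lost.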
Managing Parallel I/O on the T3D
Over the past year, several ARSC users have complained that their T3D jobs hang when doing I/O. Usually these cases implement parallel I/O with one of the two methods described in newsletter #60 (11/10/95). Both methods, using a different unformatted file per PE or different direct access records per PE, allow the user to swamp the I/O capabilities of the system. I believe this is the reason for the hangs, and I advocate managing I/O to avoid them. I haven't had much success with this advice because it requires extra coding to manage these large transfers. In this article I try to show that the effort to manage I/O is small.

In most cases, it seems that after a long time computing, the results are ready for saving to disk, and this is a particularly unfortunate time for a hang. That the hangs are intermittent and not reproducible adds to the user's frustration. Of the three cases I know of, the results have been:
- One user gave up and moved his application to the SP2.
- One user continues to hope the run won't hang but kills the run if the I/O phase takes too long. (Because it is not multiprogrammed, the T3D is very predictable in its timings.)
- Another user (me) has wasted a lot of computer time and programming time trying to find a program that reliably reproduces the I/O hang, but with no luck.
      parameter( MAX = 7000000 )
      real a ( MAX )
      intrinsic my_pe
      character*16 filename
      character*2 ciun
      iun = 10 + my_pe()
      write( ciun, "(i2)" ) iun
      filename = '/tmp/ess/fort.'//ciun
      open( iun, form = 'unformatted', file=filename )
      call barrier()
      t1 = second()
      write( iun ) a
      t1 = second() - t1
      write( 6, 600 ) MAX, iun, MAX/( 1000000.0 * t1 ), t1
 600  format( i10, i4, f10.3, f10.6 )
      end
This program, running on N$PES PEs, has each PE write its own large file. Typical output on 8 PEs (words written, unit number, MW/s and seconds for each PE) is:

   7000000  12     0.486 14.390356
   7000000  16     0.485 14.419075
   7000000  17     0.485 14.447319
   7000000  10     0.484 14.455764
   7000000  15     0.484 14.464045
   7000000  14     0.484 14.475598
   7000000  11     0.483 14.483802
   7000000  13     0.478 14.643125

So we can say that the aggregate output is about 8 * .484 = 3.87 MW/s. Now, as a first attempt to manage this I/O, let's have an array of N$PES flags to signal when a PE can write its file. As each PE finishes its I/O, it changes the status of the next PE's flag, and then that PE does its I/O without any contention with other PEs. Here's a possibility:
      parameter( MAX = 7000000 )
      real a ( MAX )
      intrinsic my_pe
      integer mype, npes
      character*16 filename
      character*2 ciun
      integer flags( 0:127 )
cdir$ shared flags( :block )
      mype = my_pe()
      npes = n$pes
cdir$ master
      do i = 0, npes-1
         flags( i ) = 0
      enddo
      flags( 0 ) = 1
cdir$ endmaster
      iun = 10 + my_pe()
      write( ciun, "(i2)" ) iun
      filename = '/tmp/ess/fort.'//ciun
      open( iun, form = 'unformatted', file=filename )
      call barrier()
      t1 = second()
 10   continue
cdir$ suppress
      if( flags( mype ) .eq. 1 ) then
         write( iun ) a
         flags( mype+1 ) = 1
      else
         goto 10
      endif
      t1 = second() - t1
      write( 6, 600 ) MAX, iun, MAX/( 1000000.0 * t1 ), t1
 600  format( i10, i4, f10.3, f10.6 )
      end
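To keep this program from hanging, it must either be compiled with -ooff or use CDIR$ SUPPRESS as shown above, so that the flag is reread from memory on every pass through the loop instead of being held in a register. (Testing a shared array this way was discussed in newsletter #12 (11/11/94).) A typical run is: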
   7000000  10     2.349  2.979504
   7000000  11     1.278  5.479143
   7000000  12     0.882  7.940334
   7000000  13     0.675 10.369488
   7000000  14     0.541 12.935420
   7000000  15     0.456 15.353703
   7000000  16     0.391 17.880117
   7000000  17     0.344 20.330558

From this run we can compute the aggregate I/O speed as:

   7000000 * 8 / 20.330558 = 2.75 MW/s

This is slower but managed (notice that the output now comes out in PE order).
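
The arithmetic behind the two aggregate figures can be checked with a short script. This is a Python sketch, not part of the original test programs, and the variable names are hypothetical; the timings are copied from the two runs above. It also computes a wall-clock rate for the unmanaged run (total words divided by the slowest PE's time), an extra figure that is not in the text but is directly comparable to the serialized number.

```python
WORDS_PER_PE = 7_000_000

# Seconds per PE from the unmanaged run (all 8 PEs writing at once).
concurrent_times = [
    14.390356, 14.419075, 14.447319, 14.455764,
    14.464045, 14.475598, 14.483802, 14.643125,
]

# In the flag-passing run each time is measured from the barrier,
# so the last PE's time is the elapsed time for all 8 writes.
serialized_last_finish = 20.330558


def mw_per_s(words, seconds):
    """Transfer rate in millions of words per second."""
    return words / (1_000_000.0 * seconds)


# Sum of the per-PE rates, the figure quoted as "8 * .484".
sum_of_rates = sum(mw_per_s(WORDS_PER_PE, t) for t in concurrent_times)

# Total words over the time of the slowest PE.
wall_clock_rate = mw_per_s(
    WORDS_PER_PE * len(concurrent_times), max(concurrent_times)
)

# Total words over the elapsed time of the serialized run.
serialized_rate = mw_per_s(WORDS_PER_PE * 8, serialized_last_finish)

print(f"unmanaged, sum of rates : {sum_of_rates:.2f} MW/s")
print(f"unmanaged, wall clock   : {wall_clock_rate:.2f} MW/s")
print(f"serialized              : {serialized_rate:.2f} MW/s")
```

The three values come out at about 3.87, 3.82 and 2.75 MW/s, so on this 8-PE run the serialized scheme gives up a little over a quarter of the unmanaged transfer rate.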
The I/O hangs at ARSC occur with more than 8 PEs. Rather than 'single thread' all I/O operations, as in the above example, it is possible to let several PEs, but fewer than N$PES, do I/O at the same time by using the "atomic update" Fortran extension. Here is an example:
      parameter( MAX = 7000000 )
      real a ( MAX )
      intrinsic my_pe
      integer mype, npes
      character*17 filename
      character*2 ciun
cdir$ shared ihi
      mype = my_pe()
      npes = n$pes
      iun = 10 + my_pe()
      write( ciun, "(i2)" ) iun
      filename = '/tmp/ess/fort.'//ciun
      open( iun, form = 'unformatted', file=filename )
c set the number of PEs that can do I/O concurrently
      ihi = 4
      call barrier()
      t1 = second()
 10   continue
cdir$ suppress
      if( mype .lt. ihi ) then
         write( iun ) a
cdir$ atomic update
         ihi = ihi + 1
      else
         goto 10
      endif
      t1 = second() - t1
      call barrier()
      if( mype .eq. (npes-1) ) then
         write( 6, 601 ) ihi, npes*MAX, npes*MAX/( 1000000.0 * t1 )
      endif
 100  continue
 601  format( i4, i14, 4x, f10.3 )
      end
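
The admission logic of this program can be tried out on any machine with a thread-based sketch. It is Python, not CRAFT Fortran, and the names run_throttled, pe, state and work are hypothetical. A threading.Condition replaces both the busy-wait loop and the ATOMIC UPDATE directive, which is a substitution for illustration rather than a description of what the T3D does; a short sleep stands in for the write of array a.

```python
import threading
import time


def run_throttled(npes, limit, work=lambda rank: time.sleep(0.01)):
    """Let at most `limit` workers do their 'I/O' at once, in rank order.

    Mirrors the Fortran: rank may proceed when rank < ihi, and each
    finished worker raises ihi by one to admit the next rank.
    """
    cond = threading.Condition()
    state = {"ihi": limit, "active": 0, "max_active": 0}

    def pe(rank):
        with cond:
            # Sleep until admitted (the Fortran spins on "goto 10" instead).
            cond.wait_for(lambda: rank < state["ihi"])
            state["active"] += 1
            state["max_active"] = max(state["max_active"], state["active"])
        work(rank)  # stands in for "write( iun ) a"
        with cond:
            state["active"] -= 1
            state["ihi"] += 1  # the protected update of the shared counter
            cond.notify_all()

    workers = [threading.Thread(target=pe, args=(r,)) for r in range(npes)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return state["ihi"], state["max_active"]


final_ihi, max_active = run_throttled(npes=8, limit=3)
print(final_ihi, max_active)
```

The counter ends at limit + npes because every worker increments it exactly once, and the number of simultaneously active workers never exceeds limit because a rank is admitted only after enough earlier ranks have finished.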
Below are the results for various values of N$PES and an increasing number of concurrent I/O operations. The variability is typical of a heavily loaded machine like ARSC's Y-MP, but some trends can be discerned:
Table 1
I/O transfer rates (MW/s) for all PEs concurrently transferring 7 MW

# of        test 1  test 2  test 3  test 4  test 5  test 6  test 7  test 8
concurrent   8 PEs   8 PEs  16 PEs  16 PEs  32 PEs  32 PEs  64 PEs  64 PEs
xfers
    1        2.706   2.191   2.083   2.108   1.688   2.336   1.624
    2        4.508   2.849   4.360   3.319   2.728   2.242   2.611   2.529
    3        4.945   2.660   5.431   3.772   2.853   2.972   2.648
    4        3.999   2.299   4.881   3.109   2.482   2.633   2.392   4.216
    5        4.663   2.750   1.941   4.533   2.671   2.537   1.792
    6        4.495   3.822   5.040   3.003   2.264   2.571   1.807   3.649
    7        4.244   3.010   4.493   3.273   1.932   2.404   2.324
    8        4.235   2.796   4.385   2.998   2.405   1.886   2.290   3.407
    9                        1.915   3.288   2.377   1.648   2.694
   10                        3.605   3.784   1.809   2.079   2.967   4.923
   11                        4.365   3.266   1.705   1.584   2.722
   12                        4.526   2.712   1.745   1.795   2.665   3.917
   13                        4.884   3.276   1.510   1.572   2.959
   14                        2.134   2.627   4.301   1.852   3.149   5.047
   15                        1.807   1.917   1.025   1.850   2.890
   16                        3.378   2.690   1.691   1.458   3.054   4.041
   17                                        1.542   1.642   2.910
   18                                        1.968   2.207   2.758   3.677
   19                                        3.931   1.432   3.412
   20                                        1.604   1.640   2.578   4.240
   21                                        1.666   1.939   3.318
   22                                        2.886   1.384   3.200   3.327
   23                                        1.345   1.188   3.061
   24                                        1.515   1.700   3.723   3.331
   25                                        1.506   1.771   2.803
   26                                        1.824   2.089   3.268   3.648
   27                                        1.533   1.783   2.970
   28                                        1.565   1.428   2.604   3.660
   29                                        1.507   1.593   2.545
   30                                        2.024   1.614   3.351   3.302
   31                                        1.516   1.442   2.970
   32                                        1.436   1.789   3.073   3.383
   33                                                        2.936
   34                                                        3.029   2.773
   35                                                        3.244
   36                                                        3.258   2.719
   37                                                        2.812
   38                                                        2.487   3.154
   39                                                        2.347
   40                                                        2.913   3.695
   41                                                        2.188
   42                                                        3.126   2.718
   43                                                        2.963
   44                                                        3.971   2.796
   45                                                        3.221
   46                                                        2.317   2.868
   47                                                        2.320
   48                                                        2.839   2.845
   49                                                        2.460
   50                                                        2.433   3.196
   51                                                        2.363
   52                                                        2.445   3.213
   53                                                        2.850
   54                                                        2.858   3.402
   55                                                        2.843
   56                                                        3.069   2.920
   57                                                        3.184
   58                                                        2.648   2.798
   59                                                        2.744
   60                                                                3.561
   61
   62                                                                3.203
   63
   64                                                                3.113

Although the variability of the timings obscures possible conclusions, I think we can say:
- Beyond a relatively small number of concurrent transfers (say 16 PEs), there is no improvement in I/O transfer rate.
- The best results seem to occur consistently with fewer than 8 concurrent processors.
- For runs with 32 and 64 PEs, using all PEs to transfer concurrently produces less than optimal transfer rates.
