ARSC T3D Users' Newsletter 58, October 27, 1995
Getting the Right Answer with SHMEM_PUTS
Two conditions must be met to get correct answers with SHMEM_PUTS: synchronization and cache management. In the past this may not have been made explicit, and the time spent making the cache coherent may actually have supplied the delay needed to synchronize the processors. The following scenarios show which sequences produce the correct value of B = 1 on PE1:
Scenario 1
time               activity on PE0      activity on PE1
(arbitrary units)
0                  A = 1                A = 0
1                  start shmem_put A
2                                       shmem_udcflush()
3                                       B = A (update #1)
4                  end shmem_put
5                                       memory update finished
6
7                                       shmem_udcflush()
8                                       B = A (update #3)
Update #1 is not correct even though it is preceded by a cache flush operation because the new value of A has not been furnished by PE0 yet. Update #3 should be correct because the memory of PE1 has been updated and the old value of A has been flushed from cache.
Scenario 2
time               activity on PE0      activity on PE1
(arbitrary units)
0                  A = 1                A = 0
1                  start shmem_put A
2
3
4                  end shmem_put
5                                       memory update finished
6                                       B = A (update #2)
7                                       shmem_udcflush()
8                                       B = A (update #3)
Even though update #2 happens after the memory of PE1 has the correct value for A, we still need the cache management routine to flush the old value of A from the cache. So to get the correct value on PE1 we need these two steps to happen:
- PE1's memory gets the correct value
- PE1's cache is flushed of the previous value
A program that skips one of these steps may still appear to work, for example because:
- Processing on PE1 could invalidate A's cache line, meaning the flush was not necessary.
- The cache flush could push the access of A to a time after the memory update was complete.
The following T3D program tries four ways of reading the updated value on PE1:
#include <stdio.h>
#define NMAX 1000000
main()
{
  int i, j, itest, NPES, MYPE;
  long a[ NMAX ];
  long b[ NMAX ];
  int count, wait;
  NPES = _num_pes();
  MYPE = _my_pe();
  itest = 0;
  for( i = 0; i < NMAX; i++ ) b[ i ] = 2;
  for( j = 0; j < 10; j++ ) {
    /* Case 1: PE1 polls a[itest], touching its neighbors on each pass */
    barrier( );
    for( i = 0; i < NMAX; i++ ) a[ i ] = 1;
    if( MYPE == 0 ) printf( "case 1\n" ); fflush( stdout );
    barrier( );
    if( MYPE == 0 ) {
      shmem_put( a, b, 1, 1 );
    } else {
      count = 0;
again:;
      if( a[ itest ] != 2 ) {
        count++;
        a[ itest+1 ] = 3; /* one of these statements will flush */
        a[ itest-1 ] = 3; /* A[itest]'s cache line */
        goto again;
      }
      printf( " Case1, a[%d] on PE1 = %d, count = %d\n", itest, a[itest], count );
    }
    /* Case 2: PE1 flushes the whole data cache, with no synchronization */
    barrier( );
    if( MYPE == 0 ) printf( "case 2\n" ); fflush( stdout );
    for( i = 0; i < NMAX; i++ ) a[ i ] = 1;
    barrier( );
    if( MYPE == 0 ) {
      shmem_put( a, b, 1, 1 );
    } else {
      shmem_udcflush();
      printf( " Case2, a[%d] on PE1 = %d\n", itest, a[itest] );
    }
    /* Case 3: PE1 flushes only a[itest]'s cache line, with no synchronization */
    barrier( );
    if( MYPE == 0 ) printf( "case 3\n" );
    for( i = 0; i < NMAX; i++ ) a[ i ] = 1;
    barrier( );
    if( MYPE == 0 ) {
      shmem_put( a, b, 1, 1 );
    } else {
      shmem_udcflush_line( &a[ itest ] );
      printf( " Case3, a[%d] on PE1 = %d\n", itest, a[itest] );
    }
    /* Case 4: PE1 waits with shmem_wait until a[itest] is no longer 1 */
    barrier( );
    if( MYPE == 0 ) printf( "case 4\n" ); fflush( stdout );
    for( i = 0; i < NMAX; i++ ) a[ i ] = 1;
    barrier( );
    if( MYPE == 0 ) {
      shmem_put( a, b, 1, 1 );
    } else {
      wait = 0;
      if( a[ itest ] == 1 ) {
        wait = 1;
        shmem_wait( &a[ itest ], 1 );
      }
      printf( " Case4, a[%d] on PE1 = %d, wait = %d\n", itest, a[itest], wait );
    }
    barrier( );
  }
}
In all 4 cases we have PE0 changing the value of A on PE1 and PE1 using some strategy to get the updated value. We have:
case 1
 Case1, a[0] on PE1 = 2, count = 90503
case 2
 Case2, a[0] on PE1 = 1
case 3
 Case3, a[0] on PE1 = 1
case 4
 Case4, a[0] on PE1 = 2, wait = 1

Case1: Shows that both synchronization and a cache flush are necessary
Case2: Shows shmem_udcflush is not sufficient
Case3: Shows shmem_udcflush_line is not sufficient
Case4: Shows shmem_wait is sufficient

Using SHMEM_PUTS may be faster than SHMEM_GETS, PVM or MPI, but it requires that the programmer be more careful.
How Are SHMEM PUTS Implemented?
Moon Kyeong-Deok of KIST/SERI, in Korea, suggested that I reexamine the shmem_put timings of Newsletter #56 (10/13/95) in MB/second instead of raw times. This reveals a very interesting gap in the timing data, which leads me to think that shmem_put is implemented with one algorithm for messages shorter than 384 words and another for messages of 384 words or more. A plot of the timings shows this more dramatically (I can email the PostScript plots to anyone who requests them), but the table below also shows this "discontinuity":
Table 1
execution speeds for shmem_put (MB/second)

size of message  PE0toPE1  PE0toPE2  PE0toPE3  PE0toPE4  PE0toPE5  PE0toPE6  PE0toPE7
(8-byte words)
      370           95.2      95.8      96.4      95.4      94.2      96.1      95.6
      380           94.3      95.2      95.5      94.3      94.1      94.4      94.0
      381           94.4      95.2      94.7      95.2      94.7      94.7      94.1
      382           94.7      95.5      95.0      95.9      95.5      95.5      95.3
      383           93.9      96.0      95.2      96.0      95.4      95.7      94.9
      384          110.8     113.6     113.9     113.7     114.6     114.3     113.7
      385          112.7     114.2     113.3     114.4     114.2     113.5     114.5
      386          111.7     114.7     114.6     114.7     114.8     113.9     114.8
      387          113.4     114.1     113.9     115.3     114.9     114.1      34.7
      388          113.1     112.6     113.1     113.3     114.7     113.7     104.5
      389          112.7     114.1     113.0     113.5     114.5     113.7     114.6
      390          113.1     114.5     114.3     115.4     114.0     114.3     114.3
      400          113.4     114.6     114.4     115.1     115.1     114.4     113.9

Without documentation, many library routines become opportunities for such "black box" experiments.
Reading and Writing T3D and Y-MP Files
I have been avoiding the subject of reading and writing files for the other machine because it is a mix of trickology and magic. I think the best way to approach it is with a list of examples for the common cases; a resourceful programmer can then work from there. In the coming newsletters, I will try to come up with a list of working examples and expand the list below. In this issue, we have:

Example        Type of         Source (writing)           Target (reading)
(newsletter)   File
1 (58)         Direct Access   T3D(t3d1.f)                Y-MP(ymp1.f), using asnunit
2 (58)         Direct Access   T3D(t3d1.f)                Y-MP(ymp2.f), using IEG2CRAY
3 (58)         Direct Access   T3D(t3d3.f), cray format   Y-MP(ymp3.f)

Below are the sample programs and their output on each machine. The makefile that I used for this set of examples is:
TT3D=TARGET=cray-t3d
TY-MP=TARGET=cray-ymp
MNPE=MPP_NPES=1

t3d1: t3d1.f
	-rm t3d.dir
	assign -R
	(export $(TT3D); cf77 t3d1.f )
	(export $(TT3D); export $(MNPE); a.out )

t3d3: t3d3.f
	-rm ymp.dir
	assign -R
	(export $(TT3D); cf77 t3d3.f )
	(export $(TT3D); export $(MNPE); a.out )

ymp1: ymp1.f
	assign -R
	(export $(TY-MP); cf77 ymp1.f )
	(export $(TY-MP); a.out )

ymp2: ymp2.f
	assign -R
	(export $(TY-MP); cf77 ymp2.f )
	(export $(TY-MP); a.out )

ymp3: ymp3.f
	assign -R
	(export $(TY-MP); cf77 ymp3.f )
	(export $(TY-MP); a.out )

clean:
	rm -f a.out ymp.dir t3d.dir mppcore core
The "assign -R" command removes all options left over from previous assigns and opens. It is essential to start with a clean slate for each example. Another invaluable trick: the error messages for the Fortran open statement can be looked up in the file:
/usr/include/liberrno.h.
Example 1
A direct access file written on the T3D is read on the Y-MP, with the data conversion done by assign (or asnunit).
t3d1.f

      real a( 10 ), b( 10 )
      sum1 = 0.0
      open( unit=10, file='t3d.dir',
     +      form='unformatted',recl=80,
     +      access='direct',iostat=istat,
     +      status='new' )
      print *, istat
      do 10 i = 1, 10
         a( i ) = i
         sum1 = sum1 + a( i )
   10 continue
      print *, "sum1 = ", sum1
      write( 10, rec=1 ) a
      sum2 = 0.0
      do 20 i = 1, 10
         b( i ) = 10 * i
         sum2 = sum2 + b( i )
   20 continue
      print *, "sum2 = ", sum2
      write( 10, rec=2 ) b
      write( 10, rec=3 ) a
      close( 10 )
      print *, 'close done'
      open( unit=10,
     +      access='direct',
     +      recl=80,file='t3d.dir',
     +      form='unformatted',
     +      iostat=istat,
     +      status='old')
      print *, istat
      read( 10, rec=1 ) a
      sum1 = 0.0
      do 11 i = 1, 10
         sum1 = sum1 + a( i )
         write( 6, 600 ) a( i )
   11 continue
  600 format( o22 )
      print *, 'sum1 = ', sum1
      read( 10, rec=2 ) b
      sum2 = 0.0
      do 21 i = 1, 10
         sum2 = sum2 + b( i )
   21 continue
      print *, 'sum2 = ', sum2
      read( 10, rec=3 ) a
      sum3 = 0.0
      do 31 i = 1, 10
         sum3 = sum3 + a( i )
   31 continue
      print *, 'sum3 = ', sum3
      close( 10 )
      end

ymp1.f

      real a( 10 ), b( 10 )
      sum1 = 0.0
      call asnunit( 10,
     +      '-F syscall -N ieee_dp', ier )
      print *, ier
      open( unit=10,access='direct',
     +      recl=80,file='t3d.dir',
     +      form='unformatted',iostat=istat,
     +      status='old')
      print *, istat
      if( istat .ne. 0 ) stop
      read( 10, rec=1 ) a
      sum1 = 0.0
      do 10 i = 1, 10
         sum1 = sum1 + a( i )
         write( 6, 600 ) a( i )
   10 continue
  600 format( o22 )
      print *, 'sum1 = ', sum1
      read( 10, rec=2 ) b
      sum2 = 0.0
      do 20 i = 1, 10
         sum2 = sum2 + b( i )
   20 continue
      print *, 'sum2 = ', sum2
      read( 10, rec=3 ) a
      sum3 = 0.0
      do 30 i = 1, 10
         sum3 = sum3 + a( i )
   30 continue
      print *, 'sum3 = ', sum3
      end
Output on T3D

 0
 sum1 = 55.
 sum2 = 550.
 close done
 0
0377600000000000000000
0400000000000000000000
0400100000000000000000
0400200000000000000000
0400240000000000000000
0400300000000000000000
0400340000000000000000
0400400000000000000000
0400420000000000000000
0400440000000000000000
 sum1 = 55.
 sum2 = 550.
 sum3 = 55.

Output on Y-MP

 0
 0
0400014000000000000000
0400024000000000000000
0400026000000000000000
0400034000000000000000
0400035000000000000000
0400036000000000000000
0400037000000000000000
0400044000000000000000
0400044400000000000000
0400045000000000000000
 sum1 = 55.
 sum2 = 550.
 sum3 = 55.
Example 2
A direct access file written on the T3D is read on the Y-MP, with the data conversion done by the routine IEG2CRAY (see the manpage on denali).
t3d1.f (same as above)

ymp2.f

      real a( 10 ), b( 10 )
      real c( 10 )
      sum1 = 0.0
      open( unit=10,
     +      access='direct',recl=80,
     +      file='t3d.dir',form='unformatted',
     +      iostat=istat,status='old')
      print *, istat
      if( istat .ne. 0 ) stop
      read( 10, rec=1 ) a
      ier = ieg2cray( 8,10,a,0,c,1 )
      print *, 'ier =', ier
      sum1 = 0.0
      do 10 i = 1, 10
         sum1 = sum1 + c( i )
         write( 6, 600 ) a( i ), c( i )
   10 continue
  600 format( o22, 1x, o22 )
      print *, 'sum1 = ', sum1
      read( 10, rec=2 ) b
      ier = ieg2cray( 8,10,b,0,c,1 )
      sum2 = 0.0
      do 20 i = 1, 10
         sum2 = sum2 + c( i )
   20 continue
      print *, 'sum2 = ', sum2
      read( 10, rec=3 ) a
      ier = ieg2cray( 8,10,a,0,c,1 )
      sum3 = 0.0
      do 30 i = 1, 10
         sum3 = sum3 + c( i )
   30 continue
      print *, 'sum3 = ', sum3
      end
Output on Y-MP
0
ier =0
0377600000000000000000 0400014000000000000000
0400000000000000000000 0400024000000000000000
0400100000000000000000 0400026000000000000000
0400200000000000000000 0400034000000000000000
0400240000000000000000 0400035000000000000000
0400300000000000000000 0400036000000000000000
0400340000000000000000 0400037000000000000000
0400400000000000000000 0400044000000000000000
0400420000000000000000 0400044400000000000000
0400440000000000000000 0400045000000000000000
sum1 = 55.
sum2 = 550.
sum3 = 55.
Example 3
A direct access file is written on the T3D with Cray-format floating point numbers and then read on the Y-MP.
t3d3.f

      real a( 10 ), b( 10 )
      sum1 = 0.0
      call asnunit( 10,
     +      ' -F syscall -N cray', ier )
      print *, 'ier = ', ier
      open( unit=10, file='ymp.dir',
     +      form='unformatted',recl=80,
     +      access='direct',iostat=istat,
     +      status='new' )
      print *, 'istat = ' ,istat
      do 10 i = 1, 10
         a( i ) = i
         sum1 = sum1 + a( i )
   10 continue
      print *, "sum1 = ", sum1
      write( 10, rec=1 ) a
      sum2 = 0.0
      do 20 i = 1, 10
         b( i ) = 10 * i
         sum2 = sum2 + b( i )
   20 continue
      print *, "sum2 = ", sum2
      write( 10, rec=2 ) b
      write( 10, rec=3 ) a
      close( 10 )
      print *, 'close done'
      open( unit=10,access='direct',
     +      recl=80,file='ymp.dir',
     +      form='unformatted',
     +      iostat=istat,status='old')
      print *, istat
      read( 10, rec=1 ) a
      sum1 = 0.0
      do 11 i = 1, 10
         sum1 = sum1 + a( i )
         write( 6, 600 ) a( i )
   11 continue
  600 format( o22 )
      print *, 'sum1 = ', sum1
      read( 10, rec=2 ) b
      sum2 = 0.0
      do 21 i = 1, 10
         sum2 = sum2 + b( i )
   21 continue
      print *, 'sum2 = ', sum2
      read( 10, rec=3 ) a
      sum3 = 0.0
      do 31 i = 1, 10
         sum3 = sum3 + a( i )
   31 continue
      print *, 'sum3 = ', sum3
      close( 10 )
      end

ymp3.f

      real a( 10 ), b( 10 )
      sum1 = 0.0
      open( unit=10,access='direct',
     +      recl=80,file='ymp.dir',
     +      form='unformatted',
     +      iostat=istat,status='old')
      print *, istat
      if( istat .ne. 0 ) stop
      read( 10, rec=1 ) a
      sum1 = 0.0
      do 10 i = 1, 10
         sum1 = sum1 + a( i )
         write( 6, 600 ) a( i )
   10 continue
  600 format( o22 )
      print *, 'sum1 = ', sum1
      read( 10, rec=2 ) b
      sum2 = 0.0
      do 20 i = 1, 10
         sum2 = sum2 + b( i )
   20 continue
      print *, 'sum2 = ', sum2
      read( 10, rec=3 ) a
      sum3 = 0.0
      do 30 i = 1, 10
         sum3 = sum3 + a( i )
   30 continue
      print *, 'sum3 = ', sum3
      end
Output on T3D

 ier = 0
 istat = 0
 sum1 = 55.
 sum2 = 550.
 close done
 0
0377600000000000000000
0400000000000000000000
0400100000000000000000
0400200000000000000000
0400240000000000000000
0400300000000000000000
0400340000000000000000
0400400000000000000000
0400420000000000000000
0400440000000000000000
 sum1 = 55.
 sum2 = 550.
 sum3 = 55.

Output on Y-MP

 0
0400014000000000000000
0400024000000000000000
0400026000000000000000
0400034000000000000000
0400035000000000000000
0400036000000000000000
0400037000000000000000
0400044000000000000000
0400044400000000000000
0400045000000000000000
 sum1 = 55.
 sum2 = 550.
 sum3 = 55.
I would like to thank those who sent in suggestions or examples about reading and writing files on the other machine. We will all benefit from their effort to share hard-earned experience:
- Wieslaw Maslowski, NCAR, Boulder, Colorado
- Vera Voronina, Geophysical Institute, UAF, Fairbanks, Alaska
- Adwait Sathye, Center for Analysis and Prediction of Storms, Norman, OK
- Frank Chism, CRI, Seattle, WA
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020

E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
