ARSC T3D Users' Newsletter 56, October 13, 1995
The Speed of SHMEMS
At the Alaska CUG, there were several polished papers on using SHMEMS to get high performance. For those of us brought up on send and receive message passing, these one-sided message passing routines seem a little daring. In the two-sided message passing world of send and receive, if a program didn't work (or hung) it was always possible to start matching up message ids in pairs of a send and then a receive. With one-sided routines like SHMEMS, the programmer must have some other criterion for knowing that the sender and receiver are in sync.
But these one-sided routines have much lower overhead, increased opportunities for asynchronous operations and are easier to use (even if more dangerous). I believe that the trend in parallel processing is towards this type of communication. I have two more data points to support this contention:
- CRI has extended SHMEMs to its PVP line of computers.
- The new MPI2 standard will have a similar form of one-sided messages.
I did all of my timings in C but I'm sure the timings are similar in the Fortran world. The easiest function to measure is a shmem_get, because it is a "blocking" function. "Blocking" means that the PE that executes the shmem_get doesn't continue executing until the data from another PE has been copied into the memory of the PE that executed the shmem_get. The program below measures the time for shmem_get on PE0 for increasingly larger messages from PE0 to PE7:
main()
{
int i,j,NPES,MYPE;
long space[128];
double t1,t2,second();
double a[2000000];
double b[2000000];
double t[18][128];
int n[18] = {0,1,2,3,4,5,10,20,100,200,1000,2000,
10000,20000,100000,200000,1000000,2000000};
NPES = _num_pes();
MYPE = _my_pe();
shmem_barrier(0,1,NPES,space);
if (MYPE == 0) {
for (i = 0; i < NPES; i++) {
for (j = 0; j < 18; j++) {
t1 = second();
shmem_get(a,b,n[j],i);
t2 = second();
t[j][i] = t2 - t1;
}
}
for (j = 0; j < 18; j++) {
printf("%7d",n[j]);
for (i = 0; i < 8; i++) {
printf("%9.6f",t[j][i]);
}
printf("\n");
}
} else {
}
shmem_barrier(0,1,NPES,space);
}
double second()
{
double junk;
fortran irtc();
junk = irtc( ) / 150000000.0;
return(junk);
}
The program above produces the following table in seconds:
Table 1
Execution times for shmem_get:
Size of
Message PE0toPE0 PE0toPE1 PE0toPE2 PE0toPE3 PE0toPE4 PE0toPE5 PE0toPE6 PE0toPE7
(8byte words)
0 0.000070 0.000005 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
V 1 0.000010 0.000004 0.000004 0.000004 0.000004 0.000004 0.000004 0.000004
2 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003
3 0.000003 0.000004 0.000004 0.000004 0.000004 0.000004 0.000004 0.000004
4 0.000006 0.000007 0.000007 0.000007 0.000007 0.000007 0.000007 0.000007
5 0.000006 0.000006 0.000007 0.000007 0.000007 0.000007 0.000007 0.000007
10 0.000006 0.000008 0.000009 0.000009 0.000008 0.000008 0.000008 0.000008
20 0.000007 0.000010 0.000010 0.000010 0.000010 0.000010 0.000010 0.000010
100 0.000012 0.000026 0.000028 0.000029 0.000027 0.000027 0.000029 0.000028
200 0.000020 0.000049 0.000051 0.000050 0.000048 0.000048 0.000051 0.000051
1000 0.000069 0.000215 0.000232 0.000232 0.000219 0.000219 0.000234 0.000234
2000 0.000150 0.000424 0.000459 0.000526 0.000434 0.000434 0.000464 0.000464
10000 0.000735 0.002105 0.002279 0.002280 0.002153 0.002154 0.002301 0.002301
20000 0.001457 0.004200 0.004554 0.004555 0.004369 0.004374 0.004597 0.004665
100000 0.007348 0.021118 0.022949 0.022886 0.021625 0.021627 0.023156 0.023121
200000 0.014663 0.042219 0.045762 0.045834 0.043246 0.043246 0.046181 0.046436
1000000 0.073088 0.211071 0.228920 0.228919 0.216425 0.216191 0.231018 0.231043
2000000 0.146075 0.422373 0.457952 0.457735 0.432346 0.432682 0.461994 0.462064
So for small messages the latency is on the order of a few microseconds and for large messages the bandwidth is about 35 Mbytes/second for off PE messages.
Measuring the speed of the shmem_put is a much more challenging task, because shmem_put is a "non-blocking" operation. "Non-blocking" means that the operation is executed and the PE that executed the operation may continue executing before the operation has completed. As a first attempt, we just substitute 'get' for 'put' in the above program:
< shmem_put(a,b,n[j],i); --- > shmem_get(a,b,n[j],i);Now we have a lower bound on the time for a shmem_put for different size messages between PE0 and PE0 to PE7:
Table 2
Lower bound execution times for shmem_put:
Size of
Message PE0toPE0 PE0toPE1 PE0toPE2 PE0toPE3 PE0toPE4 PE0toPE5 PE0toPE6 PE0toPE7
(8byte words)
0 0.000068 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
V 1 0.000006 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
2 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
3 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
4 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
5 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
10 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003
20 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003
100 0.000007 0.000008 0.000008 0.000008 0.000008 0.000008 0.000008 0.000008
200 0.000011 0.000014 0.000014 0.000014 0.000014 0.000014 0.000014 0.000014
1000 0.000061 0.000065 0.000065 0.000065 0.000065 0.000065 0.000065 0.000065
2000 0.000214 0.000127 0.000127 0.000127 0.000127 0.000194 0.000127 0.000127
10000 0.000873 0.000632 0.000631 0.000631 0.000631 0.000632 0.000631 0.000632
20000 0.001725 0.001259 0.001259 0.001259 0.001326 0.001260 0.001259 0.001259
100000 0.008704 0.006350 0.006351 0.006351 0.006284 0.006283 0.006283 0.006284
200000 0.017330 0.012637 0.012633 0.012634 0.012630 0.012630 0.012692 0.012692
1000000 0.086890 0.063190 0.063184 0.063184 0.063187 0.063250 0.063187 0.063186
2000000 0.173579 0.126370 0.126359 0.126564 0.126358 0.126357 0.126359 0.126361
This lower bound seems in line with our assumption that shmem_puts are faster but the PE0 to PE0 shmem_put is slower than the shmem_get timing, so maybe it is better to special case such "on PE" operations with a copy operation.
The reason these timings are too low is that we're not sure that the operation has completed before we take the second reading of the clock with the second function. To get a more accurate timing, we read the clock after the receiving PE has sent a message back to the sending PE. This will be an upper bound because there will have to be spin loops on both PEs:
- The receiving PE must spin until the put is finished.
- The sending PE must spin until told the operation is complete.
main()
{
int i,j,k,NPES,MYPE;
long space[128];
double t1,t2,second();
int a[2000000];
int b[2000000];
double t[18][128];
int n[18] = {0,1,2,3,4,5,10,20,100,200,1000,2000,
10000,20000,100000,200000,1000000,2000000};
int icase,count;
int ok[1];
int volatile go[1];
NPES = _num_pes();
MYPE = _my_pe();
for (i = 0; i < 2000000; i++) a[i] = -1;
a[0] = 0;
shmem_barrier(0,1,NPES,space);
if (MYPE == 0) {
for (i = 1; i < NPES; i++) {
for (j = 1; j < 18; j++) {
go[0] = 0;
for (k = 0; k < n[ j ]; k++) b[k] = n[j];
t1 = second();
shmem_put(a,b,n[j],i);
loop:;
if (go[0] == 0) goto loop;
t2 = second();
t[j][i] = t2 - t1;
}
}
for (j = 0; j < 18; j++) {
printf("%7d",n[j]);
for (i = 0; i < 8; i++) {
printf("%9.6f",t[j][i]);
}
printf("\n");
}
} else {
count = 1;
a[count-1] = -2;
ok[0] = 1;
again:;
if (a[count-1] == n[count]) {
shmem_put(go,ok,1,0);
count++;
if (count == 18) goto out;
}
goto again;
}
out:;
shmem_barrier(0,1,NPES,space);
}
The two explicit spin loops hammer memory too much and interfere with the put operation. The timings are worthless but illustrate that synchronization in the one-sided message passing world has to be handled well.
Table 3
Bad upper bound execution times for shmem_put:
Size of
Message PE0toPE0 PE0toPE1 PE0toPE2 PE0toPE3 PE0toPE4 PE0toPE5 PE0toPE6 PE0toPE7
(8byte words)
0 NaN NaN NaN NaN NaN NaN NaN NaN
V 1 NaN 1.720021 1.331687 1.342175 1.342173 1.331689 1.331685 2.483250
2 NaN 1.331687 1.342173 1.331686 1.331686 1.331685 1.331688 2.673862
3 NaN 1.331691 1.331689 1.331687 1.331840 1.342174 1.342176 2.673867
4 NaN 1.342169 1.342171 1.342174 1.342022 1.331686 1.331685 2.663377
5 NaN 1.331687 1.331688 1.331689 1.331687 1.331691 1.342174 1.342174
10 NaN 1.342173 1.331686 1.331685 1.342174 1.342170 1.331686 2.673864
20 NaN 1.331689 1.342177 1.342173 1.331691 1.331686 1.331689 2.663378
100 NaN 1.331682 1.331681 1.331684 1.331679 1.331685 1.342169 1.342174
200 NaN 1.342168 1.331682 1.342320 1.342168 1.342172 1.331681 1.331679
1000 NaN 1.331661 1.342148 1.331509 1.331661 1.331657 1.331661 1.331662
2000 NaN 1.331634 1.331634 1.331634 1.331634 1.342120 1.342123 1.342120
10000 NaN 1.331416 1.331420 1.341907 1.341906 1.331421 1.331418 1.331419
20000 NaN 1.341640 1.341640 1.331154 1.331153 1.331153 1.331154 1.331155
100000 NaN 1.318535 1.329022 1.329020 1.329037 1.339507 1.339507 1.339507
200000 NaN 1.326343 1.336838 1.336840 1.336824 1.326353 1.326353 0.012632
1000000 NaN 0.063192 1.304759 1.304924 1.304919 1.304926 1.315406 2.629070
2000000 NaN 2.488364 1.278112 1.278115 1.288599 1.288600 0.126551 2.609808
But CRI can handle this type of synchronization with a function, shmem_wait (see man page), this function executes entirely in cache and does not interfere with the on going shmem_put memory operations. Now the above program is modified with on sending side, the old spin loop code:
loop:;
if (go[0] == 0) goto loop;
changes to new shmem_wait code:
if (go[0] == 0 ) {
shmem_wait(go,0);
}
and on the receiving side, the old spin loop code:
count = 1;
a[count-1] = -2;
ok[0] = 1;
again:;
if (a[count-1] == n[count]) {
shmem_put(go,ok,1,0);
count++;
if (count == 18) goto out;
}
goto again;
changes to new shmem_wait code:
count = 1;
ok[0] = 1;
again:;
a[count-1] = -2;
if (a[count-1] == -2 ) {
shmem_wait(&a[count-1], -2 );
shmem_put(go,ok,1,0);
count++;
if (count == 18) goto out;
}
goto again;
Now the timings seem much better:
Table 4
Reasonable upper bound execution times for shmem_put:
Size of
Message PE0toPE0 PE0toPE1 PE0toPE2 PE0toPE3 PE0toPE4 PE0toPE5 PE0toPE6 PE0toPE7
(8byte words)
0 NaN NaN NaN NaN NaN NaN NaN NaN
V 1 NaN 0.000083 0.000012 0.000012 0.000014 0.000013 0.000013 0.000013
2 NaN 0.000006 0.000006 0.000006 0.000006 0.000006 0.000006 0.000006
3 NaN 0.000006 0.000006 0.000006 0.000006 0.000006 0.000006 0.000006
4 NaN 0.000006 0.000007 0.000006 0.000007 0.000007 0.000007 0.000006
5 NaN 0.000006 0.000007 0.000006 0.000006 0.000006 0.000007 0.000006
10 NaN 0.000007 0.000007 0.000006 0.000007 0.000007 0.000007 0.000007
20 NaN 0.000007 0.000007 0.000007 0.000007 0.000007 0.000007 0.000007
100 NaN 0.000012 0.000012 0.000012 0.000012 0.000012 0.000012 0.000012
200 NaN 0.000018 0.000018 0.000018 0.000018 0.000018 0.000018 0.000018
1000 NaN 0.000066 0.000066 0.000066 0.000066 0.000066 0.000066 0.000066
2000 NaN 0.000128 0.000128 0.000128 0.000128 0.000128 0.000128 0.000128
10000 NaN 0.000632 0.000632 0.000631 0.000632 0.000632 0.000631 0.000631
20000 NaN 0.001261 0.001260 0.001260 0.001260 0.001260 0.001260 0.001260
100000 NaN 0.006355 0.006287 0.006353 0.006287 0.006352 0.006287 0.006352
200000 NaN 0.012633 0.012634 0.012633 0.012633 0.012633 0.012636 0.012633
1000000 NaN 0.063203 0.063186 0.063183 0.063183 0.063187 0.063181 0.063186
2000000 NaN 0.126370 0.126356 0.126594 0.126359 0.126355 0.126354 0.126530
The overhead of stopping the clock when the operation is done makes the latencies for small messages too large but the bandwidth for large transfers at 125 Mbytes/second seem to uphold the contention that shmem_puts are faster than shmem_gets, in this case by more than a factor of three.
Surprising, for me, is that the naive attempt that produced table 2. is close to the final table 4. I think these examples are correct as to most of their content but I would like to hear about any mistakes or improvements and I will put them in upcoming issues of the ARSC T3D newsletter. I have postscript copies of the SHMEM for C and Fortran user's manuals and I can e-mail them to anyone who requests them.
Correction to the 'pref' Routine in benchlib
Jeff Brooks of CRI again gave his famous talk about single PE optimization at the Alaska CUG with updated examples. He illustrated the speedups available by the judicious cache usage with plenty of examples. One utility for managing cache usage is the utility 'pref' which was described in Jeff Brook's paper on single PE optimization (available on the ARSC anonymous ftp server, ftp.arsc.edu, as pub/submissions/t3d_opt.ps.Z) and also described in ARSC T3D newsletter #30 (4/7/95).With the 32 bit loads available from f90 and cc, there has been an improvement to the utility routine 'pref' to handle "misaligned loads". The library lib_util.a at ARSC and the benchlib source on the anonymous FTP server as pub/submissions/libbnch.tar.Z have been updated with this improvement.
New Version of MPICH at ARSC
The latest version (1.0.11) of the Argonne/Mississippi State implementation of MPI has been installed in /usr/local/examples/mpp/mpich. There are several directories to investigate with this version of MPI:/usr/local/examples/mpp/mpich/lib/cray_t3d/t3d the libraries /usr/local/examples/mpp/mpich/include the include files /usr/local/examples/mpp/mpich/doc the documentation /usr/local/examples/mpp/mpich/man the man pages /usr/local/examples/mpp/mpich/examples ready to run examples for cf77 and CThis version of MPI fixes at least one previously reported timing anomaly with messages of an odd number of 32 bit words. Any problems with this version should be reported to Mike Ess.
List of Differences Between T3D and Y-MP
The current list of differences between the T3D and the Y-MP is:- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (Newsletter #7)
- There is no GETENV on the T3D (Newsletter #8)
- Missing routine SMACH on T3D (Newsletter #9)
- Different Arithmetics (Newsletter #9)
- Different clock granularities for gettimeofday (Newsletter #11)
- Restrictions on record length for direct I/O files (Newsletter #19)
- Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
- Missing Linpack and Eispack routines in libsci (Newsletter #25)
- F90 manual for Y-MP, no manual for T3D (Newsletter #31)
- RANF() and its manpage differ between machines (Newsletter #37)
- CRAY2IEG is available only on the Y-MP (Newsletter #40)
- Missing sort routines on the T3D (Newsletter #41)
- Missing compiler allocation flags (Newsletter #52)
- Missing compiler listing flags (Newsletter #53)
Current Editors:
E-mail Subscriptions:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669 Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678 Arctic Region Supercomputing Center University of Alaska Fairbanks PO Box 756020 Fairbanks AK 99775-6020
-
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
-
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
