ARSC HPC Users' Newsletter 240, March 5, 2002
MPI Send/Recv Performance
The two articles in our last issue on Cray's updated message passing toolkit (MPT), mpt 1.4.0.4, didn't consider performance.
Given the multitude of MPI-enabled systems available these days, it seemed like a good time to resurrect a program we've used for doing simple timings of MPI sends/recvs. It was contributed by Alan Wallcraft, and made its first appearance in issue #66 of the T3D Newsletter.
The program sends messages of different sizes around a ring of processors and returns the average time for a single send/recv. The program was modified to use larger messages, 8-byte words, and to report bandwidth as well as time. The code is included, below.
Runs were done on the T3E, IBM-SP, SV1ex, and a small Linux cluster here at ARSC. Runs were made on 4 and 16 processors on each system, and other system-specific parameters were varied as well to make this more interesting.
The tables of results include:
- "mb/s": bandwidth in mbytes/sec, for all message sizes.
- "(usec)": to show latency, the absolute time in microseconds, for small messages.
- "Size 8-byte words": number of REAL*8 words per message.
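As a rough illustration of how the two reported quantities relate, here is a minimal Python sketch of the arithmetic behind them. It is not part of the original Fortran program, and the function name ring_stats is made up for this example. It assumes, as the timing code does, that a megabyte is 10^6 bytes and that each lap around the ring involves one send/recv per processor.

```python
def ring_stats(nn_words, nr_laps, npes, elapsed_sec, bytes_per_word=8):
    """Return (usec per send/recv, MB/s), mirroring what the ring program reports."""
    transfers = nr_laps * npes                  # one send/recv per PE per lap
    usec = elapsed_sec * 1.0e6 / transfers      # average time for a single transfer
    mb_per_sec = (nn_words * bytes_per_word * transfers) / (elapsed_sec * 1.0e6)
    return usec, mb_per_sec


if __name__ == "__main__":
    # Hypothetical run: 1-word messages, 16384 laps on 4 PEs, 5.3 usec per transfer.
    elapsed = 5.3e-6 * 16384 * 4
    usec, mb = ring_stats(1, 16384, 4, elapsed)
    print(f"TIME = {usec:.1f} usec   BW = {mb:.1f} MB/sec")
```

For a fixed message size the two numbers carry the same information: bandwidth is simply the message size in bytes divided by the time in microseconds, which is why the tables drop the "(usec)" column for larger messages.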
Cray T3E-900
Observations:
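MPT 1.4.0.4 gives a major improvement in the performance of MPI sends/recvs: the time for the smallest messages drops from about 18 to about 5 usec, and bandwidth for the largest messages roughly doubles (176.6 to 344.1 mb/s on 4 PEs). Also note that on the T3E, there's no added cost in using 16 rather than 4 processors. Transfer rates across the torus tend to be very uniform.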
T3E
========
 Size     mpt.1.3.0.0   mpt.1.3.0.0   mpt.1.4.0.4   mpt.1.4.0.4
8-byte       4 PEs        16 PEs         4 PEs        16 PEs
 words    mb/s (usec)   mb/s (usec)   mb/s (usec)   mb/s (usec)
-------   ---- ------   ---- ------   ---- ------   ---- ------
      1     0.4 ( 18)     0.4 ( 18)     1.5 (5.3)     1.5 (5.2)
      2     0.9 ( 18)     0.9 ( 18)     2.9 (5.4)     3.0 (5.4)
      3     1.3 ( 18)     1.3 ( 18)     4.4 (5.5)     4.4 (5.5)
      4     1.5 ( 22)     1.5 ( 22)     5.8 (5.5)     5.8 (5.5)
      7     2.4 ( 23)     2.4 ( 23)     9.9 (5.7)     9.9 (5.6)
      8     2.7 ( 24)     2.7 ( 24)     5.8 ( 11)     5.9 ( 11)
     15     5.2 ( 23)     5.2 ( 23)    10.7 ( 11)    10.9 ( 11)
     16     5.5 ( 23)     5.7 ( 23)    11.5 ( 11)    11.6 ( 11)
     31    10.0 ( 25)    10.0 ( 25)    21.2 ( 12)    21.2 ( 12)
     32    10.6 ( 24)    10.5 ( 24)    21.0 ( 12)    22.2 ( 12)
     63    17.9          18.1          39.1          40.3
     64    18.5          18.9          41.1          40.8
    127    29.5          29.1          68.6          72.2
    128    30.2          30.6          72.9          73.4
    255    44.1          42.5         104.0         118.1
    256    44.0          43.9         120.5         119.5
    511    60.7          62.1         163.4         163.6
    512    61.7          62.6         176.8         176.3
   1023    73.8          75.9         162.6         213.4
   1024    92.0          90.7         236.1         231.5
   2047   103.1         118.6         272.1         273.1
   2048   121.3         118.5         221.7         275.1
   4095   130.5         139.8         254.4         304.0
   4096   145.4         140.2         312.5         314.6
   8191    67.8          70.8         333.4         298.9
   8192   161.1         156.9         330.0         332.5
  16383    59.7          56.2         253.1         318.7
  16384   172.1         165.8         329.9         342.5
  32767    79.3          78.3         346.0         348.0
  32768   176.6         178.3         344.1         348.3
IBM SP 365MHz Power3 processors, 4 processors per shared memory node.
Notes on the table:
- Runs using both 4 and 16 CPUs were made. On icehawk, there are 4 CPUs available per node. For these runs, the loadleveler specifications "nodes" and "tasks_per_node" are shown as:
    <nodes> x <tasks_per_node>
  E.g., "16 x 1" indicates that 16 nodes were used, one CPU per node, for a total of 16 CPUs.
- On icehawk, two networks are available for internode communication, and tests using both were conducted. They are annotated as follows:
  - "IP": Internet protocol. Selected using the loadleveler option:
      # @ network.mpi = css0,shared,IP
  - "US": IBM high-performance switch. Selected using the loadleveler option:
      # @ network.mpi = css0,shared,US
- Intranode MPI traffic can use the node's shared memory or it can be sent out and back through the node router.
  - "not.sh.": shared memory was not used for MPI traffic.
  - "sh.mem.": shared memory was used. Set the environment variable:
      export MP_SHARED_MEMORY=yes
Observations:
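You should always specify the US network and MP_SHARED_MEMORY=yes. The performance penalty for doing otherwise is clear from the second and third columns: at 32768 words, the 4 x 4 runs reached 261.5 mb/s with US and shared memory, but only 83.0 mb/s without shared memory and 190.6 mb/s over IP.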
The non-uniformity in communication rates expected in the distributed shared-memory architecture shows up in the comparison between the 4 and 16 CPU runs and the 4x4 and 16x1 runs. The fastest transfers, for this particular program, are intranode. The very best bandwidth observed was in the 1x4 case, with MPI traffic using the node's shared memory (see the first column). From the 4x4 and 16x1 cases (last two columns), it's clear that there's a reward for keeping tasks in co-habitation on the nodes.
Other programs, data types, and communication patterns will show different results, as will runs on IBM's most recent switch and network technologies (Colony or Federation switches, for instance).
IBM-SP
========
              US            US            IP            US            US
 Size       sh.mem.       not.sh.       sh.mem.       sh.mem.       sh.mem.
8-byte      4 CPUs        16 CPUs       16 CPUs       16 CPUs       16 CPUs
 words       1 x 4         4 x 4         4 x 4         4 x 4        16 x 1
          mb/s (usec)   mb/s (usec)   mb/s (usec)   mb/s (usec)   mb/s (usec)
-------   ---- ------   ---- ------   ---- ------   ---- ------   ---- ------
      1     1.2 (6.9)     0.2 ( 32)     0.2 ( 48)     0.5 ( 17)     0.4 ( 21)
      2     2.3 (6.9)     0.5 ( 35)     0.3 ( 47)     1.0 ( 15)     0.8 ( 21)
      3     3.4 (7.0)     0.7 ( 34)     0.5 ( 47)     1.6 ( 15)     1.1 ( 22)
      4     4.6 (7.0)     0.9 ( 34)     0.7 ( 48)     2.1 ( 16)     1.5 ( 22)
      7     7.8 (7.2)     1.6 ( 35)     1.2 ( 48)     3.5 ( 16)     2.6 ( 22)
      8     9.0 (7.1)     1.8 ( 36)     1.3 ( 48)     4.0 ( 16)     2.9 ( 22)
     15    16.4 (7.3)     3.0 ( 40)     1.9 ( 62)     6.7 ( 18)     5.2 ( 23)
     16    17.0 (7.5)     3.2 ( 40)     2.6 ( 50)     7.0 ( 18)     5.5 ( 23)
     31    31.4 (7.9)     5.9 ( 42)     4.2 ( 60)    12.4 ( 20)     8.7 ( 28)
     32    32.3 (7.9)     6.1 ( 42)     4.2 ( 60)    12.6 ( 20)     9.0 ( 28)
     63    56.7          11.0           8.2          23.7          15.7
     64    55.9          10.8           8.2          24.4          15.9
    127    95.2          18.4          15.0          37.9          23.0
    128    95.6          18.6          15.1          41.5          23.2
    255   143.8          29.1          27.1          67.2          36.3
    256   144.8          30.0          27.2          66.8          36.5
    511   202.8          38.5          45.5         101.6          51.4
    512   188.1          36.6          45.4         101.0          51.1
   1023   191.6          52.4          67.2         138.1          68.7
   1024   244.8          52.1          68.1         141.4          68.5
   2047   282.0          64.5         100.3         174.1          87.5
   2048   272.0          64.6         101.1         173.1          87.7
   4095   296.3          69.7         131.6         194.6         104.0
   4096   295.0          69.8         132.8         188.8         103.9
   8191   343.0          75.9         146.1         219.0         119.5
   8192   343.0          76.6         154.6         217.0         119.9
  16383   372.4          77.4         171.7         244.7         122.7
  16384   379.9          78.4         158.8         248.2         123.2
  32767   433.2          82.7         195.8         263.6         131.6
  32768   430.9          83.0         190.6         261.5         131.5
Cray SV1ex
It might surprise a traditional PVP user, but we have an increasing number of users whose SV1ex jobs have some MPI component. For example, NCAR's climate system model (CSM) has several individual components (land, atmosphere, ice, etc.) which are individually multi-tasked and vectorized, but which are coupled using MPI.
All of the MPI messages are "passed" using shared memory.
ARSC's SV1ex can be considered a single 32-processor shared memory node, and if we had more nodes, MPI could be used between them. Clusters of SMPs are going to be around for a while, so this is likely to remain a portable approach.
Cray SV1ex
==========
 Size
8-byte      4 CPUs        16 CPUs
 words    mb/s (usec)   mb/s (usec)
-------   ---- ------   ---- ------
      1     0.1 ( 67)     0.1 ( 69)
      2     0.2 ( 68)     0.2 ( 70)
      3     0.3 ( 69)     0.3 ( 70)
      4     0.5 ( 68)     0.5 ( 70)
      7     0.8 ( 69)     0.8 ( 70)
      8     0.9 ( 69)     0.9 ( 70)
     15     1.3 ( 90)     1.3 ( 91)
     16     1.4 ( 89)     1.4 ( 93)
     31     2.8 ( 89)     2.7 ( 91)
     32     2.9 ( 89)     2.8 ( 92)
     63     5.6           5.5
     64     5.7           5.6
    127    11.3          10.9
    128    11.3          11.2
    255    22.3          21.9
    256    22.3          21.3
    511    43.1          42.3
    512    43.1          42.6
   1023    78.6          77.1
   1024    80.4          80.1
   2047   143.4         142.5
   2048   142.2         142.3
   4095   293.1         288.6
   4096   290.3         288.6
   8191   473.5         441.3
   8192   475.2         467.4
  16383   663.9         667.9
  16384   670.4         655.8
  32767   867.1         865.4
  32768   881.0         862.0
Linux Cluster with Myrinet network
This is an 8-node, dual 333 MHz Pentium II cluster used for training. For this test I used the default GNU f77 compiler and MPICH.
Linux Cluster (Myrinet network)
===============================
 Size
8-byte      4 CPUs        16 CPUs
 words    mb/s (usec)   mb/s (usec)
-------   ---- ------   ---- ------
      1     0.8 (9.6)     0.8 (9.6)
      2     1.6 (9.7)     1.7 (9.6)
      3     2.4 (9.8)     2.5 (9.8)
      4     3.2 ( 10)     3.2 ( 10)
      7     5.4 ( 10)     5.4 ( 10)
      8     6.1 ( 11)     6.1 ( 10)
     15     9.0 ( 13)     9.0 ( 13)
     16     9.2 ( 14)     9.2 ( 14)
     31    13.5 ( 18)    13.6 ( 18)
     32    13.9 ( 18)    14.0 ( 18)
     63    20.6          20.7
     64    20.5          20.6
    127    26.8          27.0
    128    24.9          25.1
    255    30.0          30.4
    256    30.0          26.5
    511    34.6          36.0
    512    35.4          36.2
   1023    36.8          37.4
   1024    37.1          37.3
   2047    38.5          38.2
   2048    57.9          58.1
   4095    65.5          66.5
   4096    65.9          65.8
   8191    69.6          70.9
   8192    70.0          71.0
  16383    71.5          71.5
  16384    72.6          71.3
  32767    68.6          69.1
  32768    71.8          71.0
Note that many aspects of MPI performance on a given architecture are not measured by this particular code. All it does is one send/recv at a time, and there is no contention from multiple pairs communicating simultaneously. Collective operations and different algorithms using point-to-point communication can create competition for resources such as switches, routers, and buffers, and can produce different results.
Here's the program used in the above runs:
      PROGRAM RING
      IMPLICIT NONE
C
      INTEGER MPROC,NPROC
      COMMON/CPROCI/ MPROC,NPROC
C
C**********
C*
C  1) PROGRAM TIMING A 'RING' FOR VARIOUS BUFFER LENGTHS.
C
C  2) MPI VERSION.
C*
C**********
C
      INCLUDE "mpif.h"
C
      INTEGER MPIERR,MPIREQ(4),MPISTAT(MPI_STATUS_SIZE,4)
      INTEGER MYPE,MYPEM,MYPEP,NPES
C
*     REAL*8 MPI_Wtime
      REAL*8 T0,T1
C
      INTEGER I,IRING,N2,NN,NR
      REAL*8 BUFFER(32768)
C
C     INITIALIZE.
C
      CALL MPI_INIT(MPIERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYPE, MPIERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPES, MPIERR)
      MYPEM = MOD( NPES + MYPE - 1, NPES)
      MYPEP = MOD( MYPE + 1, NPES)
C
      IF (MYPE.EQ.0) THEN
        WRITE(6,*)
        WRITE(6,*) 'NPES,MYPE,MYPE[MP],KIND = ',
     &             NPES,MYPE,MYPEM,MYPEP,KIND(BUFFER)
        WRITE(6,*)
        CALL FLUSH(6)
      ENDIF
      CALL MPI_BARRIER(MPI_COMM_WORLD,MPIERR)
C
      DO I= 1,32768
        BUFFER(I) = I
      ENDDO
C
C     SMALL BUFFER TIMING LOOP.
C
      DO N2= 1,15
        DO I= -1,0
          NN = 2**N2 + I
          NR = 32768/(2**N2)
          CALL MPI_BARRIER(MPI_COMM_WORLD,MPIERR)
          T0 = MPI_Wtime()
          DO IRING= 0,NR-1
            IF (MYPE.EQ.0) THEN
              CALL MPI_SEND(BUFFER(1+IRING*NN),NN,MPI_REAL8,
     +                      MYPEP, 9901, MPI_COMM_WORLD,
     +                      MPIERR)
              CALL MPI_RECV(BUFFER(1+IRING*NN),NN,MPI_REAL8,
     +                      MYPEM, 9901, MPI_COMM_WORLD,
     +                      MPISTAT, MPIERR)
            ELSE
              CALL MPI_RECV(BUFFER(1+IRING*NN),NN,MPI_REAL8,
     +                      MYPEM, 9901, MPI_COMM_WORLD,
     +                      MPISTAT, MPIERR)
              CALL MPI_SEND(BUFFER(1+IRING*NN),NN,MPI_REAL8,
     +                      MYPEP, 9901, MPI_COMM_WORLD,
     +                      MPIERR)
            ENDIF
          ENDDO
          T1 = MPI_Wtime()
*         CALL MPI_BARRIER(MPI_COMM_WORLD,MPIERR)
          IF (MYPE.EQ.0) THEN
            WRITE(6,6000) NN,(T1-T0)*1.0D6/(NR*NPES),
     &                    (NN*8*NR*NPES)/((T1-T0)*1.0D6)
            CALL FLUSH(6)
          ENDIF
        ENDDO
      ENDDO
      CALL MPI_BARRIER(MPI_COMM_WORLD,MPIERR)
      CALL MPI_FINALIZE(MPIERR)
      STOP
 6000 FORMAT(' BUFFER = ',I6,' TIME =',F10.1,' Microsec',
     &       ' BW =',F6.1, ' MB/sec')
C     END OF RING.
      END
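To make the loop structure easier to follow, here is a small Python sketch (an illustration only, not part of the benchmark; the name ring_schedule is invented here) that lists the message sizes and lap counts the two outer loops generate. It mirrors the NN and NR expressions in the Fortran above and checks that every lap's slice stays inside the 32768-word BUFFER.

```python
def ring_schedule(max_words=32768, max_power=15):
    """List (message size in words, laps around the ring) in the order they are timed."""
    schedule = []
    for n2 in range(1, max_power + 1):
        for i in (-1, 0):
            nn = 2**n2 + i               # one below a power of two, then the power itself
            nr = max_words // 2**n2      # laps; shrinks as the messages grow
            assert nr * nn <= max_words  # the last lap's slice still fits in BUFFER
            schedule.append((nn, nr))
    return schedule


if __name__ == "__main__":
    for nn, nr in ring_schedule():
        print(f"{nn:6d} words x {nr:6d} laps")
```

The sizes come out as 1, 2, 3, 4, 7, 8, ... 32767, 32768, matching the "Size" column of the tables. Note that the lap count falls from 16384 for the smallest messages to a single lap for the largest, so the large-message entries rest on far fewer timed transfers.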
More Memory for SP Jobs: -bmaxdata
On icehawk, if your program needs more than 256 MB of memory, you need to tell the loader. The compiler option:
    -bmaxdata:<size_in_bytes>
will do it. For instance:
    mpxlf90 -bmaxdata:375000000 -o prog prog.f
or
    xlc -bmaxdata:375000000 -o prog prog.c
will request 375 MB.
Even though nodes on icehawk have 2 GB, we advise caution in using more than about 1.5 GB. The OS and MPI buffers need some too. If you're running four MPI tasks on a single node, they allocate their memory individually, so each task should stay below about 1.5 / 4 GB, or 375 MB. OpenMP threads, on the other hand, share memory. Thus, if you're using multilevel parallel programming, with MPI between nodes and OpenMP within nodes, you can specify the full 1.5 GB per task.
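The per-task arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper name maxdata_bytes is made up, and the 1500 MB figure is the rough guideline from the paragraph above, not a hard system limit.

```python
def maxdata_bytes(usable_mb, mpi_tasks_per_node):
    """Per-task -bmaxdata value in bytes, splitting the usable node memory evenly."""
    return usable_mb * 1_000_000 // mpi_tasks_per_node


if __name__ == "__main__":
    # 4 MPI tasks per node, then 1 MPI task per node (e.g. with OpenMP threads inside it).
    for tasks in (4, 1):
        print(f"{tasks} task(s) per node: -bmaxdata:{maxdata_bytes(1500, tasks)}")
```

With four tasks per node this gives the 375000000 used in the examples above; with one task per node it gives the full 1500000000.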
Here's more on "maxdata," from man ld :
Options (-bOptions)
The following values are possible for the Options variable of the -b
flag. You can list more than one option after the -b flag, separating
them with a single blank.
[...]
D: Number or maxdata:Number Sets the maximum size (in bytes)
allowed for the user data area (or user heap) when the executable
program is run. This value is saved in the auxiliary header and
used by the system loader to set the soft data ulimit. The default
value is 0.
Loadleveler Scripts Should Specify Time Limit
Icehawk users: please try to specify an accurate time request in loadleveler scripts. The loadleveler specification is:
# @ wall_clock_limit=<time_in_seconds>
For instance, the specification:
    # @ wall_clock_limit=7200
requests two hours. Overestimate: if you request less time than your job needs, the system will cut it off. You may be able to refine the request by timing a series of runs.
When you neglect the wall_clock_limit specification, the system must assume your job needs the maximum time available to the class. Given loadleveler's backfill algorithm, shorter requests are more likely to start sooner, so this probably isn't what you want.
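One way to turn a series of timed runs into a request is sketched below in Python. The helper name wall_clock_limit, the 25% safety margin, and the sample timings are all assumptions made for this illustration; choose a margin that suits how much your run times vary.

```python
import math


def wall_clock_limit(run_times_sec, margin=1.25):
    """Suggest a limit in seconds: the slowest timed run plus a safety margin."""
    if not run_times_sec:
        raise ValueError("need at least one timed run")
    return math.ceil(max(run_times_sec) * margin)


if __name__ == "__main__":
    timings = [5400, 5760, 5580]   # hypothetical wall-clock times, in seconds
    print(f"# @ wall_clock_limit={wall_clock_limit(timings)}")
```

Basing the request on the slowest run rather than the average keeps the estimate on the safe side while still being much shorter than the class maximum.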
Quick-Tip Q & A
Q: I resolve to stop using sed, awk, cut, split, complicated egrep commands, etc., in favor of perl. My goal is to simplify, and learn just one way to do everything. Can you help me get started? I'd appreciate a couple of examples, with explanations, of using perl on the command line or in short scripts to accomplish common unix tasks.
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
