ARSC HPC Users' Newsletter 240, March 5, 2002

MPI Send/Recv Performance

The two articles on Cray's updated Message Passing Toolkit (MPT), mpt.1.4.0.4, which appeared in our last issue, didn't consider performance.

Given the multitude of MPI-enabled systems available these days, it seemed like a good time to resurrect a program we've used for doing simple timings of MPI sends/recvs. It was contributed by Alan Wallcraft and made its first appearance in issue #66 of the T3D Newsletter.

The program sends messages of different sizes around a ring of processors and reports the average time for a single send/recv. We modified it to use larger messages and 8-byte words, and to report bandwidth as well as time. The code is included below.

Runs were done on the T3E, IBM-SP, SV1ex, and a small Linux cluster here at ARSC. On each system, runs were made on 4 and 16 processors, and other system-specific parameters were varied as well to make this more interesting.

The tables of results include:

"mb/s"
bandwidth in mbytes/sec for all message sizes
"(usec)"
to show latency, the absolute time in microseconds for small messages.
"Size 8-byte words"
number of REAL*8 words per message.

Cray T3E-900

Observations:

MPT 1.4.0.4 gives a major improvement in the performance of MPI sends/recvs. Also note that on the T3E, there's no added cost in using 16 rather than 4 processors. Transfer rates across the torus tend to be very uniform.


T3E
========
  Size   mpt.1.3.0.0  mpt.1.3.0.0  mpt.1.4.0.4  mpt.1.4.0.4
 8-byte     4 PEs        16 PEs       4 PEs        16 PEs
  words  mb/s (usec)  mb/s (usec)  mb/s (usec)  mb/s (usec)
 ------  -----------  -----------  -----------  -----------
      1    0.4 ( 18)    0.4 ( 18)    1.5 (5.3)    1.5 (5.2)
      2    0.9 ( 18)    0.9 ( 18)    2.9 (5.4)    3.0 (5.4)
      3    1.3 ( 18)    1.3 ( 18)    4.4 (5.5)    4.4 (5.5)
      4    1.5 ( 22)    1.5 ( 22)    5.8 (5.5)    5.8 (5.5)
      7    2.4 ( 23)    2.4 ( 23)    9.9 (5.7)    9.9 (5.6)
      8    2.7 ( 24)    2.7 ( 24)    5.8 ( 11)    5.9 ( 11)
     15    5.2 ( 23)    5.2 ( 23)   10.7 ( 11)   10.9 ( 11)
     16    5.5 ( 23)    5.7 ( 23)   11.5 ( 11)   11.6 ( 11)
     31   10.0 ( 25)   10.0 ( 25)   21.2 ( 12)   21.2 ( 12)
     32   10.6 ( 24)   10.5 ( 24)   21.0 ( 12)   22.2 ( 12)
     63   17.9         18.1         39.1         40.3
     64   18.5         18.9         41.1         40.8
    127   29.5         29.1         68.6         72.2
    128   30.2         30.6         72.9         73.4
    255   44.1         42.5        104.0        118.1
    256   44.0         43.9        120.5        119.5
    511   60.7         62.1        163.4        163.6
    512   61.7         62.6        176.8        176.3
   1023   73.8         75.9        162.6        213.4
   1024   92.0         90.7        236.1        231.5
   2047  103.1        118.6        272.1        273.1
   2048  121.3        118.5        221.7        275.1
   4095  130.5        139.8        254.4        304.0
   4096  145.4        140.2        312.5        314.6
   8191   67.8         70.8        333.4        298.9
   8192  161.1        156.9        330.0        332.5
  16383   59.7         56.2        253.1        318.7
  16384  172.1        165.8        329.9        342.5
  32767   79.3         78.3        346.0        348.0
  32768  176.6        178.3        344.1        348.3

IBM SP (365MHz Power3 processors, 4 processors per shared-memory node)

Notes on the table:

  • Runs using both 4 and 16 CPUs were made. On icehawk, there are 4 CPUs available per node. For these runs, the loadleveler specifications "nodes" and "tasks_per_node" are shown as:

    <nodes> x <tasks_per_node>

    E.g., "16 x 1" indicates that 16 nodes were used, one CPU per node, for a total of 16 CPUs.

  • On icehawk, two networks are available for internode communication, and tests using both were conducted. They are annotated as follows:
    IP
    Internet protocol. Selected using the loadleveler option: # @ network.mpi = css0,shared,IP
    US
    IBM high-performance switch: # @ network.mpi = css0,shared,US
  • Intranode MPI traffic can use the node's shared memory, or it can be sent out and back through the node router.
    "not.sh."
    shared memory was not used for MPI traffic.
    "sh.mem."
    shared memory used. Set the environment variable: export MP_SHARED_MEMORY=yes

Observations:

You should always specify the US network and MP_SHARED_MEMORY=yes. The performance penalty for doing otherwise is clear from the second and third columns of the table.
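
Putting these two recommendations together, the relevant lines look like this (just the two settings; the rest of the loadleveler script is omitted):

# @ network.mpi = css0,shared,US

and, in the body of the script or in your login environment:

export MP_SHARED_MEMORY=yes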

The non-uniform communication rates expected in this kind of system, distributed memory built from shared-memory nodes, show up in the comparison between the 4- and 16-CPU runs and between the 4x4 and 16x1 runs. The fastest transfers, for this particular program, are intranode. The very best bandwidth observed was in the 1x4 case, with MPI traffic using the node's shared memory (see the first column). From the 4x4 and 16x1 cases (last two columns), it's clear that there's a reward for keeping communicating tasks together on a node.

Other programs, data types, and communication patterns will show different results, as will runs on IBM's most recent switch and network technologies (Colony or Federation switches, for instance).


IBM-SP
========
  Size      US           US           IP           US           US
 8-byte   sh.mem.      not.sh.      sh.mem.      sh.mem.      sh.mem.
  words   4 CPUs       16 CPUs      16 CPUs      16 CPUs      16 CPUs
          1 x 4        4 x 4        4 x 4        4 x 4        16 x 1
         mb/s (usec)  mb/s (usec)  mb/s (usec)  mb/s (usec)  mb/s (usec)
 ------  -----------  -----------  -----------  -----------  -----------
      1    1.2 (6.9)    0.2 ( 32)    0.2 ( 48)    0.5 ( 17)    0.4 ( 21)
      2    2.3 (6.9)    0.5 ( 35)    0.3 ( 47)    1.0 ( 15)    0.8 ( 21)
      3    3.4 (7.0)    0.7 ( 34)    0.5 ( 47)    1.6 ( 15)    1.1 ( 22)
      4    4.6 (7.0)    0.9 ( 34)    0.7 ( 48)    2.1 ( 16)    1.5 ( 22)
      7    7.8 (7.2)    1.6 ( 35)    1.2 ( 48)    3.5 ( 16)    2.6 ( 22)
      8    9.0 (7.1)    1.8 ( 36)    1.3 ( 48)    4.0 ( 16)    2.9 ( 22)
     15   16.4 (7.3)    3.0 ( 40)    1.9 ( 62)    6.7 ( 18)    5.2 ( 23)
     16   17.0 (7.5)    3.2 ( 40)    2.6 ( 50)    7.0 ( 18)    5.5 ( 23)
     31   31.4 (7.9)    5.9 ( 42)    4.2 ( 60)   12.4 ( 20)    8.7 ( 28)
     32   32.3 (7.9)    6.1 ( 42)    4.2 ( 60)   12.6 ( 20)    9.0 ( 28)
     63   56.7         11.0          8.2         23.7         15.7
     64   55.9         10.8          8.2         24.4         15.9
    127   95.2         18.4         15.0         37.9         23.0
    128   95.6         18.6         15.1         41.5         23.2
    255  143.8         29.1         27.1         67.2         36.3
    256  144.8         30.0         27.2         66.8         36.5
    511  202.8         38.5         45.5        101.6         51.4
    512  188.1         36.6         45.4        101.0         51.1
   1023  191.6         52.4         67.2        138.1         68.7
   1024  244.8         52.1         68.1        141.4         68.5
   2047  282.0         64.5        100.3        174.1         87.5
   2048  272.0         64.6        101.1        173.1         87.7
   4095  296.3         69.7        131.6        194.6        104.0
   4096  295.0         69.8        132.8        188.8        103.9
   8191  343.0         75.9        146.1        219.0        119.5
   8192  343.0         76.6        154.6        217.0        119.9
  16383  372.4         77.4        171.7        244.7        122.7
  16384  379.9         78.4        158.8        248.2        123.2
  32767  433.2         82.7        195.8        263.6        131.6
  32768  430.9         83.0        190.6        261.5        131.5

Cray SV1ex

It might surprise a traditional PVP user, but we have an increasing number of users with some MPI component in their SV1ex jobs. For example, NCAR's Climate System Model (CSM) has several individual components (land, atmosphere, ice, etc.) which are individually multi-tasked and vectorized, but which are coupled using MPI.

All of the MPI messages are "passed" using shared memory.

ARSC's SV1ex can be considered a single 32-processor shared memory node, and if we had more nodes, MPI could be used between them. Clusters of SMPs are going to be around for a while, so this is likely to remain a portable approach.


Cray SV1ex
==========
  Size
 8-byte     4 CPUs       16 CPUs
  words  mb/s (usec)  mb/s (usec)
 ------  -----------  -----------
      1    0.1 ( 67)    0.1 ( 69)
      2    0.2 ( 68)    0.2 ( 70)
      3    0.3 ( 69)    0.3 ( 70)
      4    0.5 ( 68)    0.5 ( 70)
      7    0.8 ( 69)    0.8 ( 70)
      8    0.9 ( 69)    0.9 ( 70)
     15    1.3 ( 90)    1.3 ( 91)
     16    1.4 ( 89)    1.4 ( 93)
     31    2.8 ( 89)    2.7 ( 91)
     32    2.9 ( 89)    2.8 ( 92)
     63    5.6          5.5
     64    5.7          5.6
    127   11.3         10.9
    128   11.3         11.2
    255   22.3         21.9
    256   22.3         21.3
    511   43.1         42.3
    512   43.1         42.6
   1023   78.6         77.1
   1024   80.4         80.1
   2047  143.4        142.5
   2048  142.2        142.3
   4095  293.1        288.6
   4096  290.3        288.6
   8191  473.5        441.3
   8192  475.2        467.4
  16383  663.9        667.9
  16384  670.4        655.8
  32767  867.1        865.4
  32768  881.0        862.0

Linux Cluster with Myrinet network

This is an 8-node cluster of dual 333MHz Pentium II machines, used for training. For this test, I used the default GNU f77 compiler and MPICH.
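
If you'd like to try this yourself, the build and launch steps with MPICH generally look like the following sketch. The wrapper and launcher names (mpif77, mpirun) are the stock MPICH ones; a Myrinet-enabled installation may put them under different paths or require extra mpirun options:

# assumes stock MPICH wrapper/launcher names; adjust for your installation
mpif77 -O2 -o ring ring.f
mpirun -np 4 ./ring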


Linux Cluster (Myrinet network)
===============================
  Size
 8-byte     4 CPUs       16 CPUs
  words  mb/s (usec)  mb/s (usec)
 ------  -----------  -----------
      1    0.8 (9.6)    0.8 (9.6)
      2    1.6 (9.7)    1.7 (9.6)
      3    2.4 (9.8)    2.5 (9.8)
      4    3.2 ( 10)    3.2 ( 10)
      7    5.4 ( 10)    5.4 ( 10)
      8    6.1 ( 11)    6.1 ( 10)
     15    9.0 ( 13)    9.0 ( 13)
     16    9.2 ( 14)    9.2 ( 14)
     31   13.5 ( 18)   13.6 ( 18)
     32   13.9 ( 18)   14.0 ( 18)
     63   20.6         20.7
     64   20.5         20.6
    127   26.8         27.0
    128   24.9         25.1
    255   30.0         30.4
    256   30.0         26.5
    511   34.6         36.0
    512   35.4         36.2
   1023   36.8         37.4
   1024   37.1         37.3
   2047   38.5         38.2
   2048   57.9         58.1
   4095   65.5         66.5
   4096   65.9         65.8
   8191   69.6         70.9
   8192   70.0         71.0
  16383   71.5         71.5
  16384   72.6         71.3
  32767   68.6         69.1
  32768   71.8         71.0

Note that many aspects of MPI performance on a given architecture are not measured by this particular code. It performs only one send/recv at a time, so there is no contention from multiple pairs communicating simultaneously. Collective operations, and other algorithms built on point-to-point communication, can create competition for resources such as switches, routers, and buffers, and can produce quite different numbers.

Here's the program used in the above runs:


      PROGRAM RING
      IMPLICIT NONE
C
      INTEGER          MPROC,NPROC
      COMMON/CPROCI/   MPROC,NPROC
C
C**********
C*
C 1)  PROGRAM TIMING A 'RING' FOR VARIOUS BUFFER LENGTHS.
C
C 2)  MPI VERSION.
C*
C**********
C
      INCLUDE "mpif.h"
C
      INTEGER MPIERR,MPIREQ(4),MPISTAT(MPI_STATUS_SIZE,4)
      INTEGER MYPE,MYPEM,MYPEP,NPES
C
*     REAL*8 MPI_Wtime
      REAL*8 T0,T1
C
      INTEGER I,IRING,N2,NN,NR
      REAL*8  BUFFER(32768)
C
C     INITIALIZE.
C
      CALL MPI_INIT(MPIERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYPE, MPIERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPES, MPIERR)
      MYPEM = MOD( NPES + MYPE - 1, NPES)
      MYPEP = MOD(        MYPE + 1, NPES)
C
      IF     (MYPE.EQ.0) THEN
        WRITE(6,*) 
        WRITE(6,*) 'NPES,MYPE,MYPE[MP],KIND = ',
     &              NPES,MYPE,MYPEM,MYPEP,KIND(BUFFER)
        WRITE(6,*) 
        CALL FLUSH(6)
      ENDIF
      CALL MPI_BARRIER(MPI_COMM_WORLD,MPIERR)
C
      DO I= 1,32768
        BUFFER(I) = I
      ENDDO
C
C     SMALL BUFFER TIMING LOOP.
C
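C     FOR EACH POWER OF TWO UP TO 32768, TIME MESSAGES OF NN = 2**N2-1
C     AND NN = 2**N2 WORDS.  NR REPETITIONS ARE CHOSEN SO THAT EACH PE
C     SENDS ROUGHLY 32768 WORDS IN TOTAL AT EVERY MESSAGE SIZE.
C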
      DO N2= 1,15
        DO I= -1,0
          NN = 2**N2 + I
          NR = 32768/(2**N2)
          CALL MPI_BARRIER(MPI_COMM_WORLD,MPIERR)
          T0 = MPI_Wtime()
          DO IRING= 0,NR-1
            IF     (MYPE.EQ.0) THEN
              CALL MPI_SEND(BUFFER(1+IRING*NN),NN,MPI_REAL8,
     +                      MYPEP, 9901, MPI_COMM_WORLD,
     +                      MPIERR)
              CALL MPI_RECV(BUFFER(1+IRING*NN),NN,MPI_REAL8,
     +                      MYPEM, 9901, MPI_COMM_WORLD,
     +                      MPISTAT, MPIERR)
            ELSE
              CALL MPI_RECV(BUFFER(1+IRING*NN),NN,MPI_REAL8,
     +                      MYPEM, 9901, MPI_COMM_WORLD,
     +                      MPISTAT, MPIERR)
              CALL MPI_SEND(BUFFER(1+IRING*NN),NN,MPI_REAL8,
     +                      MYPEP, 9901, MPI_COMM_WORLD, 
     +                      MPIERR)
            ENDIF
          ENDDO
          T1 = MPI_Wtime()
*         CALL MPI_BARRIER(MPI_COMM_WORLD,MPIERR)
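C         REPORT THE AVERAGE TIME PER SEND/RECV HOP IN MICROSECONDS,
C         AND THE CORRESPONDING BANDWIDTH IN MB/SEC (8*NN BYTES MOVED
C         PER HOP, NR*NPES HOPS PER TIMING).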
          IF     (MYPE.EQ.0) THEN
            WRITE(6,6000) NN,(T1-T0)*1.0D6/(NR*NPES),
     &                    (NN*8*NR*NPES)/((T1-T0)*1.0D6)
            CALL FLUSH(6)
          ENDIF
        ENDDO
      ENDDO
      CALL MPI_BARRIER(MPI_COMM_WORLD,MPIERR)
      CALL MPI_FINALIZE(MPIERR)
      STOP
 6000 FORMAT(' BUFFER = ',I6,' TIME =',F10.1,' Microsec',
     &       '   BW =',F6.1, ' MB/sec')
C     END OF RING.
      END

More Memory for SP Jobs: -bmaxdata

On icehawk, if your program needs more than 256 MB memory, you need to tell the loader. The compiler option:

-bmaxdata:<size_in_bytes>

will do it. For instance:

mpxlf90 -bmaxdata:375000000 -o prog prog.f

or

xlc -bmaxdata:375000000 -o prog prog.c

will request 375 MB.

Even though nodes on icehawk have 2 GB of memory, we advise caution in using more than about 1.5 GB of it: the OS and MPI buffers need some too. If you're running four MPI tasks on a single node, they allocate their memory individually, so you might stay below about 1.5/4 GB, or 375 MB, per task. OpenMP threads, on the other hand, share memory. Thus, if you're using multilevel parallel programming, with MPI between nodes and OpenMP within nodes, you can specify the full 1.5 GB for the single MPI task on each node.
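
As an illustration only: mpxlf90_r and -qsmp=omp are the usual AIX names for the thread-safe compiler driver and the OpenMP option, but check your local installation. A mixed MPI/OpenMP build asking for the full 1.5 GB per task might then look like:

# thread-safe driver (mpxlf90_r) and OpenMP flag (-qsmp=omp) may differ locally
mpxlf90_r -qsmp=omp -bmaxdata:1500000000 -o prog prog.f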

Here's more on "maxdata," from "man ld":


  Options (-bOptions)
  
  The following values are possible for the Options variable of the -b
  flag. You can list more than one option after the -b flag, separating
  them with a single blank.

  [...]

      D: Number or maxdata:Number Sets the maximum size (in bytes)
      allowed for the user data area (or user heap) when the executable
      program is run.  This value is saved in the auxiliary header and
      used by the system loader to set the soft data ulimit. The default
      value is 0.

Loadleveler Scripts Should Specify Time Limit

Icehawk users: please try to specify an accurate time request in loadleveler scripts. The loadleveler specification is:

# @ wall_clock_limit=<time_in_seconds>

For instance, the specification:

# @ wall_clock_limit=7200

requests two hours. Overestimate a little: if you request less time than your job needs, the system will kill the job when the limit expires. You may be able to refine the request by timing a series of runs.

When you omit the wall_clock_limit specification, the system must assume your job needs the maximum time available to the class. Given loadleveler's backfill algorithm, jobs with shorter time requests are more likely to start sooner, so this probably isn't what you want.
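
Pulling together the loadleveler settings mentioned in this issue, a script for a 16-CPU, two-hour MPI job might look something like the following sketch. The executable and output file names are hypothetical, and the exact keyword spellings (and any "class" line your site requires) should be checked against your local loadleveler documentation:

#!/bin/ksh
# sketch only: verify keyword names and add a class line if your site requires one
# @ job_type         = parallel
# @ output           = ring.$(jobid).out
# @ error            = ring.$(jobid).err
# @ node             = 4
# @ tasks_per_node   = 4
# @ network.mpi      = css0,shared,US
# @ wall_clock_limit = 7200
# @ queue
export MP_SHARED_MEMORY=yes
./ring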

Quick-Tip Q & A


Q: I resolve to stop using sed, awk, cut, split, complicated egrep
   commands, etc..., in favor of perl. 
   
   My goal is to simplify, and learn just one way to do everything.

   Can you help me get started? I'd appreciate a couple examples
   --with explanations--of using perl on the command line or in short
   scripts, to accomplish common unix tasks.

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.