ARSC T3D Users' Newsletter 81, April 5, 1996

MPI and PVM Speeds

In a recent benchmarking effort, I converted a uniprocessor code to a multiprocessor version using MPI. I chose MPI as the tool for implementing parallelism because there was interest from the customer in seeing how the program runs on both the T3D and a COW, "Cluster of Workstations" (one of the best acronyms going around). MPI works relatively seamlessly on both the T3D and the ARSC cluster of SGI workstations.

The benchmarking effort using MPI was not a success story on either platform because there was too much communication of relatively small messages between time steps and not enough computation to do in parallel. That's just a characteristic of this application. But I wasn't satisfied with this result, as I thought the overhead of using MPI might be too high. So I recoded the MPI multiprocessor version in PVM. I thought that since PVM is a more mature library than MPI, it might be faster on small messages. I would have tried "shmems" too, but I thought I'd better wait until shmems become an industry standard and are available on COWs (like that will ever happen).

Unfortunately, the PVM version was even slower than MPI on the T3D. So I thought I'd measure the speeds of the two libraries, MPI and PVM, for the message-passing constructs used in this benchmark. Basically, I wanted to measure ping and bandwidth times between two processors, with one acting as the master (doing the timing) and the other as a slave receiving messages and acknowledging each receive. A typical time measurement would be a "two-way ping".

On the master:


  t1 = second();
  MPI_Send( iarray, 1, MPI_INT, other, tagsend, MPI_COMM_WORLD );
  MPI_Recv( iarray, 1, MPI_INT, other, tagrecv, MPI_COMM_WORLD, &stat );
  t1 = second() - t1;
On the slave:

  MPI_Recv( iarray, 1, MPI_INT, other, tagsend, MPI_COMM_WORLD, &stat );
  MPI_Send( iarray, 1, MPI_INT, other, tagrecv, MPI_COMM_WORLD );
In this case, a single int is passed between the master and the slave. Similarly, I wanted to see how the bandwidth scales up as the message gets longer and longer. In particular, I used this sequence in the PVM version.

On the master:


  t1 = second();
  pvm_pkint( iarray, numint, 1 );
  if( pvm_send( other, tagsend ) < 0 ) {
    printf( "can't send to bandwidth test\n" );
    goto bail;
  }
  if( pvm_recv( other, tagrecv ) < 0 ) {
    printf( "recv error in bandwidth test\n" );
    goto bail;
  }
  pvm_upkint( iarray, 1, 1 );
  t1 = second() - t1;
On the slave:

  pvm_recv( other, tagsend );
  pvm_upkint( iarray, numint, 1 );
  pvm_pkint( iarray, 1, 1 );
  pvm_send( other, tagrecv );
On both master and slave, the value of numint goes through the sequence 25, 250, 2500, 25000, 250000. Most of the ideas for this timing program come from the timing program by Robert Manchek of Oak Ridge National Laboratory that comes with the public domain distribution of PVM. (I have presented the results of that program on various machines in the T3D class at ARSC.)

Both MPI and PVM offer a wide variety of function calls, and the one-to-one association between MPI and PVM is not always as clear-cut as the way I programmed it in this benchmark. To get a feel for the common functionality between MPI and PVM, below is a "diff" of the MPI and PVM versions:


  3c3
  < #include <mpi.h>
  ---
  > #include <pvm3.h>
  24d23
  <   MPI_Status stat;
  26,28c25,27
  <   MPI_Init( &argc, &argv );
  <   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  <   MPI_Comm_size(MPI_COMM_WORLD, &size);
  ---
  >   rank = pvm_get_PE( pvm_mytid() );
  >   pvm_setopt( PvmRoute, PvmRouteDirect );
  >   pvm_initsend( PvmDataRaw );
  39,40c38,41
  <       MPI_Send( iarray, 1, MPI_INT, other, tagsend, MPI_COMM_WORLD );
  <       MPI_Recv( iarray, 1, MPI_INT, other, tagrecv, MPI_COMM_WORLD, &stat );
  ---
  >          pvm_pkint( iarray, 1, 1 );
  >          pvm_send( other, tagsend );
  >          pvm_recv( other, tagrecv );
  >          pvm_upkint( iarray, 1, 1 );
  56c57,58
  <         if( MPI_Send( iarray,numint,MPI_INT,other,tagsend,MPI_COMM_WORLD ) ) {
  ---
  >         pvm_pkint( iarray, numint, 1 );
  >         if( pvm_send( other, tagsend ) < 0 ) {
  60c62
  <         if( MPI_Recv( iarray,1,MPI_INT,other,tagrecv,MPI_COMM_WORLD,&stat)){
  ---
  >         if( pvm_recv( other, tagrecv ) < 0 ) {
  63a66
  >         pvm_upkint( iarray, 1, 1 );
  78,79c81,85
  <       MPI_Recv( iarray, 1, MPI_INT, other, tagsend, MPI_COMM_WORLD, &stat );
  <       MPI_Send( iarray, 1, MPI_INT, other, tagrecv, MPI_COMM_WORLD );
  ---
  >       pvm_recv( other, tagsend );
  >       pvm_upkint( iarray, 1, 1 );
  >       pvm_pkint( iarray, 1, 1 );
  >       pvm_send( other, tagrecv );
  > 
  83,85c89,92
  <         MPI_Recv( iarray, numint, MPI_INT,other,tagsend,MPI_COMM_WORLD,&stat );
  <         MPI_Send( iarray, 1, MPI_INT, other, tagrecv, MPI_COMM_WORLD );
  < 
  ---
  >          pvm_recv( other, tagsend );
  >          pvm_upkint( iarray, numint, 1 );
  >          pvm_pkint( iarray, 1, 1 );
  >          pvm_send( other, tagrecv );
  89c96
  <   MPI_Finalize( );
  ---
  >   pvm_exit( );
As a sample of the results so far, I have:

                       Preliminary results for MPI/PVM comparison

                          <--------T3D------>         <----SGI COW---->

                          CRI    EPCC   Argonne      Oak Ridge  Argonne
                          PVM     MPI     MPI           PVM       MPI

  ping test               263      77      96          2495      2848
  (microseconds)

  bandwidth test (Mbytes/second)
  length of message (bytes) 

      100                 .43    1.03     .98           .04       .03
     1000                3.09    5.08    8.20           .24       .28
    10000                9.19   11.89   33.33           .54       .77
   100000               12.58   14.56   49.31           .60       .99
  1000000               13.07   14.77   51.68           .52       .88
In the above table, the programs were run many times (for the COW, hundreds of times), but I report only the best times. Although the term "MPP" is applied to both the T3D and the COW, the performance difference is striking. Another startling difference between the platforms is that the T3D timings were always reproducible to within 1%, while from run to run the COW results could vary by more than a factor of five. With such differing characteristics, it's unlikely that one version of any multiprocessing code could be optimal on both platforms. Although there are multiprocessing codes in MPI and PVM that will run on both the T3D and a COW, these portable libraries can't hide the hardware differences between the platforms.

On both the T3D and the COW, the MPI implementations were faster than PVM. (Don't have a COW, man!) I like this kind of result, and in future newsletters I will expand the table above to include other experiments:

  1. What is the effect of reducing PVM overhead by using pvm_psend? (See the sketch below.)
  2. What is the effect of changing the communicating pair from PE0 and PE1 to PE0 and PEn, where n = 2, 3, 4, ... ?
  3. What about other send and recv functions?
  4. What changes were made to the T3D versions to get COW versions?
If some readers have input on this topic, I would be happy to incorporate it into future newsletters.
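
As a first pass at question 1, here is an untested sketch of how the timed section of the master's bandwidth loop in the PVM version might look using pvm_psend() and pvm_precv() (introduced in PVM 3.3), which combine the pack/send and receive/unpack steps into single calls and so avoid one trip through the pack buffer. The variables are the same ones used in the listings below; whether the T3D implementation of these routines is actually faster is exactly what remains to be measured.

  int atid, atag, alen;   /* actual sender, tag, and length returned by pvm_precv */

  t1 = second();
  /* pack and send in one call, no explicit pvm_pkint() */
  if( pvm_psend( other, tagsend, iarray, numint, PVM_INT ) < 0 ) {
    printf( "can't send in bandwidth test\n" );
    goto bail;
  }
  /* receive and unpack the one-word acknowledgement in one call */
  if( pvm_precv( other, tagrecv, iarray, 1, PVM_INT, &atid, &atag, &alen ) < 0 ) {
    printf( "recv error in bandwidth test\n" );
    goto bail;
  }
  t2 = second();

The slave side would change analogously, with pvm_precv()/pvm_psend() replacing its recv/unpack/pack/send sequence.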

MPI Source for the Above Benchmark


  #include <stdio.h>
  #include <time.h>
  #include <mpi.h>
  #define MIN( a, b )  (( a < b ) ? a : b )
  
  int iarray[ 250000 ];
  main(argc, argv)
    int argc;
    char *argv[];
  {
    double t1, t2, second();
    int reps = 10;      /* number of samples per test */
    struct timeval tv1, tv2;  /* for timing */
    int dt1, dt2;      /* time for one iter */
    int at1, at2;      /* accum. time */
    int mt1, mt2, mt3;              /* minimum times */
    int numint;      /* message length */
    int n;
    int i;
    int rank;
    int size;
    int other;
    int tagsend, tagrecv;
    MPI_Status stat;
  
    MPI_Init( &argc, &argv );
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    tagsend = 100;
    tagrecv = 101;
    if( rank == 0 ) other = 1;
    if( rank == 1 ) other = 0;
    if( rank == 0 ) {
      at1 = 0;
  /*    puts(" N     uSec"); */
      mt1 = 10000000;
      for (n = 1; n <= reps; n++) {
        t1 = second( );
        MPI_Send( iarray, 1, MPI_INT, other, tagsend, MPI_COMM_WORLD );
        MPI_Recv( iarray, 1, MPI_INT, other, tagrecv, MPI_COMM_WORLD, &stat );
        t2 = second( );
        dt1 = ( t2 - t1 ) * 1000000.0;
  /*      printf("%2d %8d\n", n, dt1); */
        at1 += dt1;
        mt1 = MIN( dt1, mt1 );
      }
      printf("RTT Avg uSec %d ", at1 / reps);
      printf("RTT Min uSec %d\n", mt1 );
      for (numint = 25; numint < 1000000; numint *= 10) {
        printf("\nMessage size %d\n", numint * 4);
        at2 = 0;
        mt2 = 10000000;
  /*      puts(" N  Pack uSec  Send uSec"); */
        for (n = 1; n <= reps; n++) {
          t1 = second();
          if( MPI_Send( iarray,numint,MPI_INT,other,tagsend,MPI_COMM_WORLD ) ) {
            printf( "can't send to bandwidth test\n" );
            goto bail;
          } 
          if( MPI_Recv( iarray,1,MPI_INT,other,tagrecv,MPI_COMM_WORLD,&stat)){
            printf( "recv error in bandwidth test\n" );
            goto bail;
          }
          t2 = second();
          dt2 = ( t2 - t1 ) * 1000000.0;
  /*        printf("%2d   %8d\n", n, dt2); */
          at2 += dt2;
          mt2 = MIN( mt2, dt2 );
        }
        at2 /= reps;
  /*      printf("Avg uSec %8d for %d", at2, numint*4); */
        printf("Avg Byte/uSec %8f ", (numint * 4) / (double)at2);
  /*      printf("Min uSec %8d\n", mt2); */
        printf("Max Byte/uSec %8f\n", (numint * 4) / (double)mt2);
      }
    } else {
      for ( n = 1; n <= reps; n++ ) {
        MPI_Recv( iarray, 1, MPI_INT, other, tagsend, MPI_COMM_WORLD, &stat );
        MPI_Send( iarray, 1, MPI_INT, other, tagrecv, MPI_COMM_WORLD );
      }
      for (numint = 25; numint < 1000000; numint *= 10) {
        for (n = 1; n <= reps; n++) {
          MPI_Recv( iarray, numint, MPI_INT,other,tagsend,MPI_COMM_WORLD,&stat );
          MPI_Send( iarray, 1, MPI_INT, other, tagrecv, MPI_COMM_WORLD );
  
        }
      }
    }
    MPI_Finalize( );
    printf( "Done on PE%d\n", rank );
    exit( 0 );
  bail:
    printf( "Bailing out on PE%d\n", rank );
    exit( -1 );
  }
  double second()
  {
    double junk;
    fortran irtc();
    junk = irtc( ) / 150000000.0;
    return( junk );
  }
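
A note on the second() routine above: it relies on the Cray-specific irtc() intrinsic and the 150 MHz T3D clock, so it will not compile on the SGI workstations. For the COW version of the MPI program, one possibility (a sketch only, not what was used for the timings above) is the standard MPI wall-clock timer:

  /* Possible portable replacement for second() in the MPI version.
     MPI_Wtime() returns elapsed wall clock time in seconds as a double
     and is part of the MPI standard; <mpi.h> is already included above. */
  double second()
  {
    return( MPI_Wtime( ) );
  }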

PVM Source for the Above Benchmark


  #include <stdio.h>
  #include <time.h>
  #include <pvm3.h>
  #define MIN( a, b )  (( a < b ) ? a : b )

  int iarray[ 250000 ];
  main(argc, argv)
    int argc;
    char *argv[];
  {
    double t1, t2, second();
    int reps = 10;      /* number of samples per test */
    struct timeval tv1, tv2;  /* for timing */
    int dt1, dt2;      /* time for one iter */
    int at1, at2;      /* accum. time */
    int mt1, mt2, mt3;              /* minimum times */
    int numint;      /* message length */
    int n;
    int i;
    int rank;
    int size;
    int other;
    int tagsend, tagrecv;
  
    rank = pvm_get_PE( pvm_mytid() );
    pvm_setopt( PvmRoute, PvmRouteDirect );
    pvm_initsend( PvmDataRaw );
    tagsend = 100;
    tagrecv = 101;
    if( rank == 0 ) other = 1;
    if( rank == 1 ) other = 0;
    if( rank == 0 ) {
      at1 = 0;
  /*    puts(" N     uSec"); */
      mt1 = 10000000;
      for (n = 1; n <= reps; n++) {
        t1 = second( );
           pvm_pkint( iarray, 1, 1 );
           pvm_send( other, tagsend );
           pvm_recv( other, tagrecv );
           pvm_upkint( iarray, 1, 1 );
        t2 = second( );
        dt1 = ( t2 - t1 ) * 1000000.0;
  /*      printf("%2d %8d\n", n, dt1); */
        at1 += dt1;
        mt1 = MIN( dt1, mt1 );
      }
      printf("RTT Avg uSec %d ", at1 / reps);
      printf("RTT Min uSec %d\n", mt1 );
      for (numint = 25; numint < 1000000; numint *= 10) {
        printf("\nMessage size %d\n", numint * 4);
        at2 = 0;
        mt2 = 10000000;
  /*      puts(" N  Pack uSec  Send uSec"); */
        for (n = 1; n <= reps; n++) {
          t1 = second();
          pvm_pkint( iarray, numint, 1 );
          if( pvm_send( other, tagsend ) < 0 ) {
            printf( "can't send to bandwidth test\n" );
            goto bail;
          } 
          if( pvm_recv( other, tagrecv ) < 0 ) {
            printf( "recv error in bandwidth test\n" );
            goto bail;
          }
          pvm_upkint( iarray, 1, 1 );
          t2 = second();
          dt2 = ( t2 - t1 ) * 1000000.0;
  /*        printf("%2d   %8d\n", n, dt2); */
          at2 += dt2;
          mt2 = MIN( mt2, dt2 );
        }
        at2 /= reps;
  /*      printf("Avg uSec %8d for %d", at2, numint*4); */
        printf("Avg Byte/uSec %8f ", (numint * 4) / (double)at2);
  /*      printf("Min uSec %8d\n", mt2); */
        printf("Max Byte/uSec %8f\n", (numint * 4) / (double)mt2);
      }
    } else {
      for ( n = 1; n <= reps; n++ ) {
        pvm_recv( other, tagsend );
        pvm_upkint( iarray, 1, 1 );
        pvm_pkint( iarray, 1, 1 );
        pvm_send( other, tagrecv );
  
      }
      for (numint = 25; numint < 1000000; numint *= 10) {
        for (n = 1; n <= reps; n++) {
           pvm_recv( other, tagsend );
           pvm_upkint( iarray, numint, 1 );
           pvm_pkint( iarray, 1, 1 );
           pvm_send( other, tagrecv );
        }
      }
    }
    pvm_exit( );
    printf( "Done on PE%d\n", rank );
    exit( 0 );
  bail:
    printf( "Bailing out on PE%d\n", rank );
    exit( -1 );
  }
  double second()
  {
    double junk;
    fortran irtc();
    junk = irtc( ) / 150000000.0;
    return( junk );
  }
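
The same portability issue applies to second() in the PVM version; PVM itself provides no wall clock routine. A plausible replacement for the COW (again a sketch only; the timings above used irtc() on the T3D) is gettimeofday():

  #include <sys/time.h>

  /* Possible portable replacement for second() in the PVM version.
     gettimeofday() is available on the SGI workstations and gives
     microsecond resolution wall clock time. */
  double second()
  {
    struct timeval tv;

    gettimeofday( &tv, (struct timezone *) 0 );
    return( tv.tv_sec + tv.tv_usec / 1000000.0 );
  }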

T3D Class at ARSC

Next week I will be teaching "Introduction to Parallel Programming on the CRAY T3D". The class will be held April 9th to the 11th, from 9 AM to 5 PM, at the Butrovich Building. If you are interested in the class, please contact Mike Ess.