ARSC T3D Users' Newsletter 56, October 13, 1995

The Speed of SHMEMs

At the Alaska CUG, there were several polished papers on using SHMEMs to get high performance. For those of us brought up on send-and-receive message passing, these one-sided message passing routines seem a little daring. In the two-sided world of send and receive, if a program didn't work (or hung), it was always possible to start matching up message IDs in pairs of a send and its receive. With one-sided routines like SHMEMs, the programmer must have some other criterion for knowing that the sender and receiver are in sync.

But these one-sided routines have much lower overhead, offer increased opportunities for asynchronous operations, and are easier to use (even if more dangerous). I believe that the trend in parallel processing is towards this type of communication. I have two more data points to support this contention:

  1. CRI has extended SHMEMs to its PVP line of computers.
  2. The new MPI2 standard will have a similar form of one-sided messages.
But their speed is currently their most endearing quality. So what is their speed? All of the presenters who spoke about using SHMEMs said they used SHMEM PUTs because they were faster than SHMEM GETs. Are SHMEM PUTs faster because they execute faster, or because they are one-sided? In the exercises below I tried to answer these questions.
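Before getting to the measurements, here is a minimal sketch (not taken from any of the presented papers) of the kind of criterion a one-sided program can use: the receiver plants a sentinel in the last word of the target buffer, and after the sender's shmem_put it waits, with shmem_wait, for that word to change. The names target, send_to, plant_sentinel and wait_for_data are made up for illustration; the same trick, with an explicit flag sent back to the sender, is what the timing programs later in this article use.

  #include <mpp/shmem.h>

  #define N 1000

  static long target[N];             /* symmetric: same address on every PE */

  /* Sender side: deposit N words into PE 'pe'; there is no matching receive. */
  void send_to(int pe, long *source)
  {
    shmem_put(target, source, N, pe);
  }

  /* Receiver side: the sentinel in the last word of target[] is the        */
  /* criterion for knowing the data has arrived.  The sentinel must be      */
  /* planted (and a barrier passed) before the sender's put, and the        */
  /* source data must never contain -1.                                     */
  void plant_sentinel(void)
  {
    target[N-1] = -1;
  }

  void wait_for_data(void)
  {
    shmem_wait(&target[N-1], -1);    /* spins in cache until the word changes */
  }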

I did all of my timings in C, but I'm sure the timings are similar in the Fortran world. The easiest function to measure is shmem_get, because it is a "blocking" function. "Blocking" means that the PE that executes the shmem_get doesn't continue executing until the data from another PE has been copied into the memory of the PE that executed the shmem_get. The program below measures the time for shmem_get on PE0 for increasingly larger messages fetched from each of PE0 through PE7:


  #include <stdio.h>
  #include <mpp/shmem.h>

  main()
  {
    int i,j,NPES,MYPE;
    long space[128];
    double t1,t2,second();
    double a[2000000];               /* local target of the shmem_get     */
    double b[2000000];               /* source fetched from the other PEs */
    double t[18][128];
    int n[18] = {0,1,2,3,4,5,10,20,100,200,1000,2000,
          10000,20000,100000,200000,1000000,2000000};

    NPES = _num_pes();
    MYPE = _my_pe();

    shmem_barrier(0,1,NPES,space);
    if (MYPE == 0) {
      for (i = 0; i < NPES; i++) {
        for (j = 0; j < 18; j++) {
          t1 = second();
            shmem_get(a,b,n[j],i);   /* blocking: returns only after n[j] */
                                     /* words have arrived in a[]         */
          t2 = second();
          t[j][i] = t2 - t1;
        }
      }

      for (j = 0; j < 18; j++) {
        printf("%7d",n[j]);
        for (i = 0; i < 8; i++) {
          printf("%9.6f",t[j][i]);
        }
        printf("\n");
      }
    } else {
      /* the other PEs only supply b[] and wait at the final barrier */
    }
    shmem_barrier(0,1,NPES,space);
  }

  /* Read the T3D's 150 MHz real-time clock and convert to seconds. */
  double second()
  {
    double junk;
    fortran irtc();
    junk = irtc( ) / 150000000.0;
    return(junk);
  }
The program above produces the following table in seconds:

Table 1

Execution times for shmem_get:

  Size of Message  PE0toPE0 PE0toPE1 PE0toPE2 PE0toPE3 PE0toPE4 PE0toPE5 PE0toPE6 PE0toPE7
  (8-byte words)

        0 0.000070 0.000005 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
        1 0.000010 0.000004 0.000004 0.000004 0.000004 0.000004 0.000004 0.000004
        2 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003
        3 0.000003 0.000004 0.000004 0.000004 0.000004 0.000004 0.000004 0.000004
        4 0.000006 0.000007 0.000007 0.000007 0.000007 0.000007 0.000007 0.000007
        5 0.000006 0.000006 0.000007 0.000007 0.000007 0.000007 0.000007 0.000007
       10 0.000006 0.000008 0.000009 0.000009 0.000008 0.000008 0.000008 0.000008
       20 0.000007 0.000010 0.000010 0.000010 0.000010 0.000010 0.000010 0.000010
      100 0.000012 0.000026 0.000028 0.000029 0.000027 0.000027 0.000029 0.000028
      200 0.000020 0.000049 0.000051 0.000050 0.000048 0.000048 0.000051 0.000051
     1000 0.000069 0.000215 0.000232 0.000232 0.000219 0.000219 0.000234 0.000234
     2000 0.000150 0.000424 0.000459 0.000526 0.000434 0.000434 0.000464 0.000464
    10000 0.000735 0.002105 0.002279 0.002280 0.002153 0.002154 0.002301 0.002301
    20000 0.001457 0.004200 0.004554 0.004555 0.004369 0.004374 0.004597 0.004665
   100000 0.007348 0.021118 0.022949 0.022886 0.021625 0.021627 0.023156 0.023121
   200000 0.014663 0.042219 0.045762 0.045834 0.043246 0.043246 0.046181 0.046436
  1000000 0.073088 0.211071 0.228920 0.228919 0.216425 0.216191 0.231018 0.231043
  2000000 0.146075 0.422373 0.457952 0.457735 0.432346 0.432682 0.461994 0.462064
So for small messages the latency is on the order of a few microseconds, and for large messages the bandwidth is about 35 Mbytes/second for off-PE messages.
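As a quick check of that bandwidth figure, here is a trivial sketch of the arithmetic; the two values are copied from the last row of Table 1 and nothing here is measured:

  #include <stdio.h>

  /* Bandwidth implied by a Table 1 entry: bytes transferred / elapsed time. */
  main()
  {
    double words   = 2000000.0;                /* last row of Table 1       */
    double seconds = 0.462064;                 /* PE0toPE7 column, last row */

    printf("%.1f Mbytes/second\n", words * 8.0 / seconds / 1.0e6);
    /* prints about 34.6 Mbytes/second */
  }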

Measuring the speed of shmem_put is a much more challenging task, because shmem_put is a "non-blocking" operation. "Non-blocking" means that the operation is issued and the PE that issued it may continue executing before the operation has completed. As a first attempt, we just substitute a shmem_put for the shmem_get in the program above:


  <         shmem_put(a,b,n[j],i);
  ---
  >         shmem_get(a,b,n[j],i);
Now we have a lower bound on the time for a shmem_put of different size messages from PE0 to each of PE0 through PE7:

Table 2

Lower bound execution times for shmem_put:

  Size of Message  PE0toPE0 PE0toPE1 PE0toPE2 PE0toPE3 PE0toPE4 PE0toPE5 PE0toPE6 PE0toPE7
  (8-byte words)

        0 0.000068 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
        1 0.000006 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
        2 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
        3 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
        4 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
        5 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002 0.000002
       10 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003
       20 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003 0.000003
      100 0.000007 0.000008 0.000008 0.000008 0.000008 0.000008 0.000008 0.000008
      200 0.000011 0.000014 0.000014 0.000014 0.000014 0.000014 0.000014 0.000014
     1000 0.000061 0.000065 0.000065 0.000065 0.000065 0.000065 0.000065 0.000065
     2000 0.000214 0.000127 0.000127 0.000127 0.000127 0.000194 0.000127 0.000127
    10000 0.000873 0.000632 0.000631 0.000631 0.000631 0.000632 0.000631 0.000632
    20000 0.001725 0.001259 0.001259 0.001259 0.001326 0.001260 0.001259 0.001259
   100000 0.008704 0.006350 0.006351 0.006351 0.006284 0.006283 0.006283 0.006284
   200000 0.017330 0.012637 0.012633 0.012634 0.012630 0.012630 0.012692 0.012692
  1000000 0.086890 0.063190 0.063184 0.063184 0.063187 0.063250 0.063187 0.063186
  2000000 0.173579 0.126370 0.126359 0.126564 0.126358 0.126357 0.126359 0.126361
This lower bound seems in line with our assumption that shmem_puts are faster, but the PE0-to-PE0 shmem_put is slower than the corresponding shmem_get timing, so it may be better to special-case such "on-PE" operations with a plain copy, as in the sketch below.
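Here is a minimal sketch of that special case, assuming 64-bit words; the wrapper name put_words is made up for illustration, and memcpy is just one way to do the local copy:

  #include <string.h>
  #include <mpp/shmem.h>

  /* Hypothetical wrapper: copy locally when the target PE is this PE,  */
  /* otherwise fall through to the ordinary shmem_put.                  */
  void put_words(long *target, long *source, int nwords, int pe)
  {
    if (pe == _my_pe())
      memcpy(target, source, nwords * sizeof(long));
    else
      shmem_put(target, source, nwords, pe);
  }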

The reason these timings are too low is that we're not sure the operation has completed before we take the second reading of the clock. To get a more accurate timing, we stop the clock only after the receiving PE has sent a message back to the sending PE. This will be an upper bound, because there have to be spin loops on both PEs:

  1. The receiving PE must spin until the put is finished.
  2. The sending PE must spin until told the operation is complete.
Below is a bad attempt at this:

  #include <stdio.h>
  #include <mpp/shmem.h>

  main()
  {
    int i,j,k,NPES,MYPE;
    long space[128];
    double t1,t2,second();
    int a[2000000];                  /* target on the receiving PE        */
    int b[2000000];                  /* source on PE0                     */
    double t[18][128];
    int n[18] = {0,1,2,3,4,5,10,20,100,200,1000,2000,
          10000,20000,100000,200000,1000000,2000000};

    int count;
    int ok[1];
    int volatile go[1];              /* set remotely by the receiving PE  */

    NPES = _num_pes();
    MYPE = _my_pe();

    for (i = 0; i < 2000000; i++) a[i] = -1;
    a[0] = 0;
    shmem_barrier(0,1,NPES,space);
    if (MYPE == 0) {
      /* Row 0 and column 0 of t[][] are never filled in, which is why   */
      /* Table 3 shows NaN in those positions.                            */
      for (i = 1; i < NPES; i++) {
        for (j = 1; j < 18; j++) {
          go[0] = 0;
          /* fill the source with the message length so the receiver can */
          /* recognize when the last word of the put has arrived          */
          for (k = 0; k < n[j]; k++) b[k] = n[j];
          t1 = second();
            shmem_put(a,b,n[j],i);
  loop:;
            if (go[0] == 0) goto loop;   /* spin until the receiver says done */
          t2 = second();
          t[j][i] = t2 - t1;
        }
      }

      for (j = 0; j < 18; j++) {
        printf("%7d",n[j]);
        for (i = 0; i < 8; i++) {
          printf("%9.6f",t[j][i]);
        }
        printf("\n");
      }
    } else {
      /* Receiving PE: spin until the put of size n[count] has landed,   */
      /* then put a flag back into go[] on PE0.                           */
      count = 1;
      a[count-1] = -2;
      ok[0] = 1;
  again:;
      if (a[count-1] == n[count]) {
        shmem_put(go,ok,1,0);
        count++;
        if (count == 18) goto out;
      }
      goto again;
    }
  out:;
    shmem_barrier(0,1,NPES,space);
  }
The two explicit spin loops hammer memory and interfere with the put operation in progress. The timings are worthless, but they illustrate that synchronization in the one-sided message passing world has to be handled carefully.

Table 3

Bad upper bound execution times for shmem_put:

  Size of Message  PE0toPE0 PE0toPE1 PE0toPE2 PE0toPE3 PE0toPE4 PE0toPE5 PE0toPE6 PE0toPE7
  (8-byte words)

        0      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
        1      NaN 1.720021 1.331687 1.342175 1.342173 1.331689 1.331685 2.483250
        2      NaN 1.331687 1.342173 1.331686 1.331686 1.331685 1.331688 2.673862
        3      NaN 1.331691 1.331689 1.331687 1.331840 1.342174 1.342176 2.673867
        4      NaN 1.342169 1.342171 1.342174 1.342022 1.331686 1.331685 2.663377
        5      NaN 1.331687 1.331688 1.331689 1.331687 1.331691 1.342174 1.342174
       10      NaN 1.342173 1.331686 1.331685 1.342174 1.342170 1.331686 2.673864
       20      NaN 1.331689 1.342177 1.342173 1.331691 1.331686 1.331689 2.663378
      100      NaN 1.331682 1.331681 1.331684 1.331679 1.331685 1.342169 1.342174
      200      NaN 1.342168 1.331682 1.342320 1.342168 1.342172 1.331681 1.331679
     1000      NaN 1.331661 1.342148 1.331509 1.331661 1.331657 1.331661 1.331662
     2000      NaN 1.331634 1.331634 1.331634 1.331634 1.342120 1.342123 1.342120
    10000      NaN 1.331416 1.331420 1.341907 1.341906 1.331421 1.331418 1.331419
    20000      NaN 1.341640 1.341640 1.331154 1.331153 1.331153 1.331154 1.331155
   100000      NaN 1.318535 1.329022 1.329020 1.329037 1.339507 1.339507 1.339507
   200000      NaN 1.326343 1.336838 1.336840 1.336824 1.326353 1.326353 0.012632
  1000000      NaN 0.063192 1.304759 1.304924 1.304919 1.304926 1.315406 2.629070
  2000000      NaN 2.488364 1.278112 1.278115 1.288599 1.288600 0.126551 2.609808
But CRI provides a function for exactly this kind of synchronization: shmem_wait (see its man page). It executes entirely in cache and so does not interfere with the ongoing shmem_put memory operations. The program above is modified as follows. On the sending side, the old spin loop code:

  loop:;
    if (go[0] == 0) goto loop;
changes to new shmem_wait code:

    if (go[0] == 0 ) {
      shmem_wait(go,0);
    }
and on the receiving side, the old spin loop code:

  count = 1;
  a[count-1] = -2;
  ok[0] = 1;
again:;
  if (a[count-1] == n[count]) {
    shmem_put(go,ok,1,0);
    count++;
    if (count == 18) goto out;
   }
  goto again;
changes to new shmem_wait code:

    count = 1;
    ok[0] = 1;
  again:;
    a[count-1] = -2;
    if (a[count-1] == -2 ) {
      shmem_wait(&a[count-1], -2 );
      shmem_put(go,ok,1,0);
      count++;
      if (count == 18) goto out;
    }
    goto again;
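To see the two shmem_wait fragments in context without repeating the entire measurement loop, here is a compact sketch that times a single put of NWORDS words from PE0 to PE1. The constant NWORDS and the array names are made up for illustration, and second() is assumed to be the same clock routine defined earlier:

  #include <stdio.h>
  #include <mpp/shmem.h>

  #define NWORDS 10000

  static int a[NWORDS];              /* target on PE1                       */
  static int b[NWORDS];              /* source on PE0                       */
  static int go[1];                  /* flag put back to PE0 by PE1         */
  static long space[128];            /* work array for shmem_barrier        */

  double second();                   /* 150 MHz clock routine defined above */

  main()
  {
    int k, ok[1];
    double t1, t2;
    int MYPE = _my_pe();
    int NPES = _num_pes();

    a[NWORDS-1] = -2;                /* sentinel in the last word of a[]    */
    go[0] = 0;
    shmem_barrier(0,1,NPES,space);

    if (MYPE == 0) {
      for (k = 0; k < NWORDS; k++) b[k] = NWORDS;
      t1 = second();
      shmem_put(a,b,NWORDS,1);       /* one-sided transfer to PE1           */
      shmem_wait(go,0);              /* wait, in cache, for PE1's reply     */
      t2 = second();
      printf("put plus acknowledgement: %9.6f seconds\n", t2 - t1);
    } else if (MYPE == 1) {
      ok[0] = 1;
      shmem_wait(&a[NWORDS-1],-2);   /* spin in cache until the put lands   */
      shmem_put(go,ok,1,0);          /* tell PE0 the data has arrived       */
    }
    shmem_barrier(0,1,NPES,space);
  }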
Now the timings seem much better:

Table 4

Reasonable upper bound execution times for shmem_put:

  Size of Message  PE0toPE0 PE0toPE1 PE0toPE2 PE0toPE3 PE0toPE4 PE0toPE5 PE0toPE6 PE0toPE7
  (8-byte words)

        0      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
        1      NaN 0.000083 0.000012 0.000012 0.000014 0.000013 0.000013 0.000013
        2      NaN 0.000006 0.000006 0.000006 0.000006 0.000006 0.000006 0.000006
        3      NaN 0.000006 0.000006 0.000006 0.000006 0.000006 0.000006 0.000006
        4      NaN 0.000006 0.000007 0.000006 0.000007 0.000007 0.000007 0.000006
        5      NaN 0.000006 0.000007 0.000006 0.000006 0.000006 0.000007 0.000006
       10      NaN 0.000007 0.000007 0.000006 0.000007 0.000007 0.000007 0.000007
       20      NaN 0.000007 0.000007 0.000007 0.000007 0.000007 0.000007 0.000007
      100      NaN 0.000012 0.000012 0.000012 0.000012 0.000012 0.000012 0.000012
      200      NaN 0.000018 0.000018 0.000018 0.000018 0.000018 0.000018 0.000018
     1000      NaN 0.000066 0.000066 0.000066 0.000066 0.000066 0.000066 0.000066
     2000      NaN 0.000128 0.000128 0.000128 0.000128 0.000128 0.000128 0.000128
    10000      NaN 0.000632 0.000632 0.000631 0.000632 0.000632 0.000631 0.000631
    20000      NaN 0.001261 0.001260 0.001260 0.001260 0.001260 0.001260 0.001260
   100000      NaN 0.006355 0.006287 0.006353 0.006287 0.006352 0.006287 0.006352
   200000      NaN 0.012633 0.012634 0.012633 0.012633 0.012633 0.012636 0.012633
  1000000      NaN 0.063203 0.063186 0.063183 0.063183 0.063187 0.063181 0.063186
  2000000      NaN 0.126370 0.126356 0.126594 0.126359 0.126355 0.126354 0.126530
The overhead of stopping the clock when the operation is done makes the latencies for small messages look too large, but the bandwidth for large transfers, about 125 Mbytes/second (16 Mbytes in about 0.126 seconds), seems to uphold the contention that shmem_puts are faster than shmem_gets, in this case by more than a factor of three.

Surprising, for me, is that the naive attempt that produced Table 2 is close to the final Table 4. I think these examples are correct in most of their content, but I would like to hear about any mistakes or improvements, and I will put them in upcoming issues of the ARSC T3D newsletter. I have PostScript copies of the SHMEM for C and Fortran user's manuals and can e-mail them to anyone who requests them.

Correction to the 'pref' Routine in benchlib

Jeff Brooks of CRI again gave his famous talk on single-PE optimization at the Alaska CUG, with updated examples. He illustrated the speedups available through judicious cache usage with plenty of examples. One utility for managing cache usage is 'pref', which was described in Jeff Brooks' paper on single-PE optimization (available on the ARSC anonymous ftp server, ftp.arsc.edu, as pub/submissions/t3d_opt.ps.Z) and also in ARSC T3D newsletter #30 (4/7/95).

With the 32-bit loads available from f90 and cc, the utility routine 'pref' has been improved to handle "misaligned loads". The library lib_util.a at ARSC and the benchlib source on the anonymous FTP server (pub/submissions/libbnch.tar.Z) have been updated with this improvement.

New Version of MPICH at ARSC

The latest version (1.0.11) of the Argonne/Mississippi State implementation of MPI has been installed in /usr/local/examples/mpp/mpich. There are several directories to investigate with this version of MPI:

  /usr/local/examples/mpp/mpich/lib/cray_t3d/t3d  the libraries 
  /usr/local/examples/mpp/mpich/include           the include files 
  /usr/local/examples/mpp/mpich/doc               the documentation
  /usr/local/examples/mpp/mpich/man               the man pages
  /usr/local/examples/mpp/mpich/examples          ready to run examples for cf77 and C
This version of MPI fixes at least one previously reported timing anomaly with messages of an odd number of 32-bit words. Any problems with this version should be reported to Mike Ess.
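For anyone compiling against the include and lib directories listed above for the first time, here is a minimal sketch of an MPI program; it is not one of the shipped examples and uses only standard MPI-1 calls:

  #include <stdio.h>
  #include <mpi.h>

  /* Each PE reports its rank; PE 0 also reports the total number of PEs. */
  main(int argc, char **argv)
  {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("hello from PE %d\n", rank);
    if (rank == 0) printf("running on %d PEs\n", size);

    MPI_Finalize();
    return(0);
  }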

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
  12. RANF() and its manpage differ between machines (Newsletter #37)
  13. CRAY2IEG is available only on the Y-MP (Newsletter #40)
  14. Missing sort routines on the T3D (Newsletter #41)
  15. Missing compiler allocation flags (Newsletter #52)
  16. Missing compiler listing flags (Newsletter #53)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.