ARSC HPC Users' Newsletter 317, June 03, 2005

IBM: Using RDMA and Striping


[ Thanks to Tom Logan of ARSC for sharing the slides from his talk
  at Scicomp-11.  His presentation was used for some of the content
  of this article.  Scicomp is the IBM System Scientific User Group
  Meeting:

  http://www.spscicomp.org ]

In April, upgrades to the High Performance Switch on the IBM system, iceberg, enabled Remote Direct Memory Access (RDMA) transport and striping. In this article we take a look at what these features do and how each can be enabled.

RDMA

RDMA transport allows a portion of the segmentation and reassembly of messages to be done by the switch hardware rather than by the processors. The result is a reduced load on the processors during communication, which ideally frees CPU resources for computation.

Adding the keyword "bulkxfer" to your Loadleveler job script enables RDMA.


E.g.
# @ bulkxfer = yes

The poe environment variable MP_USE_BULK_XFER similarly allows RDMA to be used with interactive jobs. However, if you are submitting through Loadleveler, use the Loadleveler keyword instead, to guarantee that RDMA resources are available (i.e., not in use by another job) on each node the job is scheduled on.
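
For interactive work, a minimal sketch might look like the following (the executable name and task count are placeholders; see "man poe" for the full set of options):


  export MP_USE_BULK_XFER=yes
  poe ./a.out -procs 16 -euilib us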

We can verify that a job will use RDMA by looking at the output of "llq -l". Below is the output for a job requesting RDMA. The "Resources" and "Bulk Transfer" fields are of particular interest.


iceberg1 1% llq -l b1n2.35171.0 | egrep "Bulk|RDMA"
          Resources: RDMA(1)
      Bulk Transfer: Yes

Striping

Striping allows a single message to be sent over more than one network or for a task to communicate over more than one adapter window on a single adapter. This functionality has the potential to drastically increase the communications bandwidth. On Iceberg, the p655+ nodes connect to two separate networks, which theoretically allows striping to double the bandwidth for large messages.

Using "sn_all" or "csss" for the adapter set in the network statement will enable striping.


E.g.
# @ network.mpi = sn_all,shared,us

Striping over a single adapter is also possible; however, in tests run by ARSC staff this form of striping has so far shown limited or no benefit. To avoid confusion we won't show the syntax for single-network striping in this article.

Here is an example script using striping with RDMA:


#!/bin/ksh
# @ output           = $(Executable).$(jobid).out
# @ error            = $(Executable).$(jobid).err
# @ environment      = MP_SHARED_MEMORY=yes; COPY_ALL
# @ notification     = never
# @ network.mpi      = sn_all,shared,US
# @ bulkxfer         = yes
# @ node_usage       = not_shared
# @ wall_clock_limit = 8:00:00
# @ job_type         = parallel
# @ node             = 4
# @ tasks_per_node   = 8
# @ class            = standard
# @ queue

./a.out

Experiences

ARSC MPP specialist Tom Logan performed extensive tests of RDMA and striping for his talk at Scicomp-11 using an application called dcprog. This application tests bi-directional network bandwidth and has been used routinely at ARSC as a sanity check and performance test following system upgrades. Below are some observations based on Tom's runs of dcprog.

  • RDMA generally increased bandwidth over sn_single performance with RDMA disabled, though there was a drop-off in performance near the bulk transfer cutoff point.
  • The environment variable MP_BULK_MIN_MSG_SIZE smoothed out some of the variability near the bulk transfer cutoff point. In his runs Tom used MP_BULK_MIN_MSG_SIZE=512000 (see the example following this list).
  • In general RDMA and striping provide the most benefit for large messages. Gains in bandwidth were most noticeable for messages larger than 2 MB.
  • Striping with RDMA enabled nearly doubled the bandwidth for messages larger than 2 MB compared to sn_single with RDMA enabled.
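
As a sketch, the MP_BULK_MIN_MSG_SIZE threshold can be set either in a Loadleveler script (via the "environment" keyword, alongside the variables shown in the example script above) or interactively before invoking poe; the value below is simply the one used in Tom's runs:


  # In a Loadleveler script:
  # @ environment = COPY_ALL; MP_SHARED_MEMORY=yes; MP_BULK_MIN_MSG_SIZE=512000

  # Or in an interactive session:
  export MP_BULK_MIN_MSG_SIZE=512000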

FPGA Applications on the Cray XD1

Dr. Maltby of Cray Inc. will give a presentation on the capabilities of the Cray model XD1 computer as they pertain to the availability of field-programmable gate arrays (FPGAs).


Title:    "FPGA Applications on the Cray XD1"
Date:     Tuesday June 7
Time:     2:00 - 3:00 pm
Location: WRRB 010 (ARSC seminar room)
Speaker:  Dr. James Maltby of Cray, Inc.

OpenMP Schedules and Sections

When ARSC hosted WOMPAT several years ago, one of the speakers suggested that only static scheduling should be used with OpenMP parallel loops. (See: /arsc/support/news/hpcnews/hpcnews252/index.xml.) What follows are some results from a simple OpenMP test program on the X1 and P6X Complex.

The program was originally written to help a user who wanted to use OpenMP sections, so a little complexity is added to facilitate breaking the overall job into large sections. Here's the basic idea:


  #define NSECTS 8
  #define WKPERSECT 1000000
  #define SZ (NSECTS*WKPERSECT)

"SZ" is the total array length, as used in the array declarations:


  float a[SZ];
  float b[SZ];
  float c[SZ];
  float x[SZ];
  float y[SZ];

Here's the "work" performed on these arrays, modeled on the user's code:


  for (n=0; n<NSECTS; n++) {
    for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
      x[i] = sin ((float)i / 1000.0);
      a[i] = k1 * x[i] ;
      y[i] = sin ((float)i / 900.0);
      b[i] = k2 * y[i] ;
      c[i] = a[i] + b[i];
    }
  }

This work could be divided amongst OpenMP threads in various ways. The test program experiments with five methods:

  1. parallelizing the 8-iteration outer loop (with static scheduling)
  2. parallelizing the inner loop (static)
  3. parallelizing the inner loop (guided)
  4. parallelizing the inner loop (dynamic, with a chunk size of 1000)
  5. replacing the outer loop with 8 sections in a "sections" block
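
All five variants are compiled into a single test program (listed in full at the end of this article). For reference, a driver along these lines could reproduce the timing runs (a sketch; the source file name is hypothetical, the compile line is typical for the IBM, and the X1 uses the Cray compiler with its own SSP/OpenMP options):


  #!/bin/ksh
  # Build the OpenMP test program (IBM shown; adjust for the X1).
  xlc_r -qsmp=omp -O3 omptest.c -o omptest -lm

  # Two runs for each thread count used in the tables below.
  for N in 2 3 4 5 6 7 8; do
    export OMP_NUM_THREADS=$N
    print "=== OMP_NUM_THREADS=$N ==="
    ./omptest
    ./omptest
  done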

Here are some results. Two runs were made on each platform for each value of OMP_NUM_THREADS ("N" in the tables), and, except for a couple of serious outliers on the X1 (noted below), times were consistent to within 10%.


Table 1: P655+ Results
  Time (seconds)

   N   OUTER   STATIC   GUIDED  DYNAMIC SECTIONS
   -   -----   ------   ------  ------- --------
   2   0.659    0.658    0.661    0.659   0.661 
   3   0.492    0.440    0.441    0.440   0.495 
   4   0.332    0.340    0.331    0.330   0.331 
   5   0.328    0.312    0.275    0.264   0.349 
   6   0.338    0.266    0.222    0.234   0.329 
   7   0.323    0.208    0.189    0.189   0.345 
   8   0.184    0.165    0.166    0.172   0.167 

Table 2: X1 Results (compiled in SSP mode)
  Time (seconds)

   N   OUTER   STATIC   GUIDED  DYNAMIC SECTIONS
   -   -----   ------   ------  ------- --------
   2   0.384    0.389   13.184    0.396   0.374 
   3   0.288    0.259    8.776    0.264   0.280 
   4   0.191    0.195    6.578    0.197   0.187 
   5   0.191    0.155    5.261    0.158   0.187 
   6   0.191    0.130    4.385    0.132   0.187 
   7   0.192    0.111    3.757    0.113   0.186 
   8   0.095    0.097    3.292    0.099   0.094 

Observations:

  1. OpenMP "sections" performed almost identically to parallelizing the outer, 8-iteration loop with a "parallel for" directive. In both cases, when the number of threads was 2, 4, or 8, and thus evenly divided the 8 sections, the time for these two methods was as good as the best of the other methods. When the number of threads didn't divide the 8 sections evenly, the time didn't improve in proportion to the number of threads. This is no surprise: with only 8 chunks of work, good load balance can't be achieved when the threads can't share them evenly.
  2. Parallelizing on the inner loop, with its 1000000 iterations to play with, produced good load-balance and speedup for all values of OMP_NUM_THREADS.
  3. On the IBM, the schedule type for this little code didn't matter. Guided and dynamic did as well as static.
  4. On the Cray, "guided" was a disaster (note the GUIDED column in Table 2), but dynamic (with the chunk size of 1000) was as good as static.
  5. Some strange timings appeared on the X1. These were excluded from Table 2 but are shown here:
    
    Table 3: X1 -- Outliers: 
      Time (seconds)
    
     N   OUTER   STATIC   GUIDED  DYNAMIC SECTIONS
     -   -----   ------   ------  ------- --------
     6   10.334   3.557    4.383    0.132   0.187 
     8   68.454  18.700    3.288    0.099   0.094
    

On the X1, the scheduler places these jobs on any node where it can find the required number of SSPs. (On the IBM, the tests were run on the debug nodes and never had to share.) Given the recent utilization of klondike, this meant that these X1 jobs always had to share a node with other jobs. Thus, a likely explanation is that these particular runs were affected by something else happening on the node.

Here's the complete test program, if you'd like to experiment with it:


#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <sys/time.h>


#define NSECTS 8
#define WKPERSECT 1000000
#define SZ (NSECTS*WKPERSECT)
#define NLINES 2

int main () {
  float a[SZ];
  float b[SZ];
  float c[SZ];
  float x[SZ];
  float y[SZ];
  float k1,k2;
  struct timeval t;
  double t1,t2; 
  int   i,n; 

  /* INITIALIZATION */
  k1 = 0.5;
  k2 = -0.5;

  #pragma omp parallel 
  {
    printf ("Hello from %d of %d openmp threads.\n", omp_get_thread_num(), omp_get_num_threads() );
  }


  /* Initialization: do nothing but load cache? Place the arrays?  
   * Without this, the first actual test is consistently slower. */
    #pragma omp parallel for private(i)
    for (n=0; n<NSECTS; n++) 
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) 
        x[i] = a[i] = y[i] = b[i] = c[i] = 0.0;


  /* OUTER LOOP PARALLEL FOR */

  gettimeofday(&t,NULL);
  t1 = (double) t.tv_sec +  (double) t.tv_usec / 1000000.0;

    #pragma omp parallel for private(i)
    for (n=0; n<NSECTS; n++) {
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

  gettimeofday(&t,NULL);
  t2 = (double) t.tv_sec +  (double) t.tv_usec / 1000000.0;

  printf ("outer loop, PARALLEL FOR: %lf (seconds)\n", t2-t1);

  /* POST-PROCESSING */
  for (i=0; i<NLINES; i++) 
    fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);

  for (i=SZ-1; i>SZ-NLINES; i--) 
    fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);


  /* INNER LOOP PARALLEL FOR -- STATIC SCHEDULE */

  gettimeofday(&t,NULL);
  t1 = (double) t.tv_sec +  (double) t.tv_usec / 1000000.0;

    for (n=0; n<NSECTS; n++) {
      #pragma omp parallel for 
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

  gettimeofday(&t,NULL);
  t2 = (double) t.tv_sec +  (double) t.tv_usec / 1000000.0;

  printf ("inner loop, PARALLEL FOR: STATIC: %lf (seconds)\n", t2-t1);

  /* POST-PROCESSING */
  for (i=0; i<NLINES; i++) 
    fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);

  for (i=SZ-1; i>SZ-NLINES; i--) 
    fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);



  /* INNER LOOP PARALLEL FOR -- GUIDED SCHEDULE */

  gettimeofday(&t,NULL);
  t1 = (double) t.tv_sec +  (double) t.tv_usec / 1000000.0;

    for (n=0; n<NSECTS; n++) {
      #pragma omp parallel for schedule (guided)
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

  gettimeofday(&t,NULL);
  t2 = (double) t.tv_sec +  (double) t.tv_usec / 1000000.0;

  printf ("inner loop, PARALLEL FOR: GUIDED: %lf (seconds)\n", t2-t1);

  /* POST-PROCESSING */
  for (i=0; i<NLINES; i++) 
    fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);

  for (i=SZ-1; i>SZ-NLINES; i--) 
    fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);



  /* INNER LOOP PARALLEL FOR -- DYNAMIC SCHEDULE        */

  gettimeofday(&t,NULL);
  t1 = (double) t.tv_sec +  (double) t.tv_usec / 1000000.0;

    for (n=0; n<NSECTS; n++) {
      #pragma omp parallel for schedule (dynamic, 1000)
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

  gettimeofday(&t,NULL);
  t2 = (double) t.tv_sec +  (double) t.tv_usec / 1000000.0;

  printf ("inner loop, PARALLEL FOR: DYNAMIC: %lf (seconds)\n", t2-t1);

  /* POST-PROCESSING */
  for (i=0; i<NLINES; i++) 
    fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);

  for (i=SZ-1; i>SZ-NLINES; i--) 
    fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);



  /* SECTIONS */

  gettimeofday(&t,NULL);
  t1 = (double) t.tv_sec +  (double) t.tv_usec / 1000000.0;

  #pragma omp parallel sections private(i,n)
  {
    #pragma omp section
    {
      n=0;
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

    #pragma omp section
    {
      n=1;
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

    #pragma omp section
    {
      n=2;
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

    #pragma omp section
    {
      n=3;
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

    #pragma omp section
    {
      n=4;
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

    #pragma omp section
    {
      n=5;
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

    #pragma omp section
    {
      n=6;
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }

    #pragma omp section
    {
      n=7;
      for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
        x[i] = sin ((float)i / 1000.0);
        a[i] = k1 * x[i] ;
        y[i] = sin ((float)i / 900.0);
        b[i] = k2 * y[i] ;
        c[i] = a[i] + b[i];
      }
    }
  }  /* End sections */

  gettimeofday(&t,NULL);
  t2 = (double) t.tv_sec +  (double) t.tv_usec / 1000000.0;

  printf ("SECTIONS: %lf (seconds)\n", t2-t1);

  /* POST-PROCESSING */
  for (i=0; i<NLINES; i++) 
    fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);

  for (i=SZ-1; i>SZ-NLINES; i--) 
    fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);

}

Quick-Tip Q & A



A:[[ I always keep two terminal windows open to klondike (or whatever
  [[ remote system I'm using).  
  [[
  [[ Are there any tricks for exchanging information between the two
  [[ windows?  For instance, I may type a command in one session, but want
  [[ the same thing to happen in both sessions. E.g., "module switch
  [[ PrgEnv PrgEnv.new" or  "cd ~/blah/blah/blah".  Or I may want to pass
  [[ an environment variable between sessions.
  [[
  [[ Yes, I know how to "cut" and "paste" between windows using the
  [[ so-called "clipboard"... but it's a pain in the guru.  Any other
  [[ solutions?


#
# Thanks to Ed Kornkven:
#

One solution for ksh users, although not without its problems, is to use
a common HISTFILE for both (all) terminal sessions.  The ksh HISTFILE is
where the most recently issued commands are stored for possible later
reuse.  

I usually keep separate files for each session so that the histories of
the two sessions are not confused (I do this by appending the terminal
id -- the output of `tty` -- to the history file name).  If a "constant"
name is used, however, the command histories of both sessions will be
visible to both sessions, and you can then reuse commands from one
session window in the other.
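
A minimal sketch of both arrangements in ~/.profile (file names are illustrative):


  # Per-session history: append the terminal id so sessions stay separate.
  export HISTFILE=$HOME/.sh_hist.$(tty | tr -d '/')

  # Shared history: use one fixed name so every session sees the others' commands.
  # export HISTFILE=$HOME/.sh_hist.shared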


#
# And an idea from the Editor...
#

The multiple sessions can share files. Here's one sample "utility" using this idea: it lets you save a directory path and later recall it and "cd" to it from any session. You'd add this to your .profile file:

  export cdx_file_name="$HOME/.cdx_file_name.temp"
  alias svx='wdx=`pwd`; echo $wdx > $cdx_file_name'
  alias cdx='cd $(cat $cdx_file_name)'

When you want to save your PWD, type "svx."  Later, type "cdx" and
you'll "cd" right back to that directory.  If a memory of one isn't
enough, you could create more, like: "sv1" and "cd1", "sv2" and "cd2",
etc...
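
As a sketch, a numbered variant could follow the same idea (the names are arbitrary):


  export cd1_file_name="$HOME/.cd1_file_name.temp"
  alias sv1='pwd > $cd1_file_name'
  alias cd1='cd $(cat $cd1_file_name)'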



Q: I'm trying to find the version of the IBM Fortran compiler, xlf,
   but the man page doesn't list an option for this.  Where can I find
   the version?  

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.