[ Thanks to Tom Logan of ARSC for sharing his slides for his talk at Scicomp-11. His presentation was used for some of the content of this article. Scicomp is the IBM System Scientific User Group Meeting. http://www.spscicomp.org ]
In April, upgrades to the High Performance Switch on the IBM system, iceberg, enabled Remote Direct Memory Access (RDMA) transport and striping. In this article we will take a look at what these features do and how each can be enabled.
RDMA transport allows a portion of the segmentation and reassembly of messages to be done by the switch hardware rather than by the processors. The result is a reduced load on the processors during communication, which ideally frees CPU resources for computation.
Adding the Loadleveler keyword "bulkxfer" to your Loadleveler job script will enable RDMA.
E.g. # @ bulkxfer= yes
The poe environment variable MP_USE_BULK_XFER similarly allows RDMA to be used with interactive jobs. However, if you are using Loadleveler, the Loadleveler keyword should be used to guarantee that RDMA resources are available (i.e., not being used by another job) on each node the job is scheduled on.
We can verify that a job will use RDMA by looking at the output of "llq -l". Below is the output for a job requesting RDMA. The "Resources" and "Bulk Transfer" fields are of particular interest.
iceberg1 1% llq -l b1n2.35171.0 | egrep "Bulk|RDMA"
Resources: RDMA(1)
Bulk Transfer: Yes
Striping allows a single message to be sent over more than one network or for a task to communicate over more than one adapter window on a single adapter. This functionality has the potential to drastically increase the communications bandwidth. On Iceberg, the p655+ nodes connect to two separate networks, which theoretically allows striping to double the bandwidth for large messages.
Using "sn_all" or "csss" for the adapter set in the network statement will enable striping.
E.g. # @ network.mpi = sn_all,shared,us
Striping over a single adapter is also possible; however, thus far this form of striping has shown limited or no benefit in tests run by ARSC staff. To avoid confusion, we won't show the syntax for single network striping in this article.
Here is an example script using striping with RDMA:
#!/bin/ksh
# @ output = $(Executable).$(jobid).out
# @ error = $(Executable).$(jobid).err
# @ environment = MP_SHARED_MEMORY=yes; COPY_ALL
# @ notification = never
# @ network.mpi = sn_all,shared,US
# @ bulkxfer = yes
# @ node_usage = not_shared
# @ wall_clock_limit = 8:00:00
# @ job_type = parallel
# @ node = 4
# @ tasks_per_node = 8
# @ class = standard
# @ queue

./a.out
ARSC MPP specialist Tom Logan performed extensive tests of RDMA and striping for a talk at Scicomp-11 using an application called dcprog. This application tests bi-directional network bandwidth and has been routinely used at ARSC as a sanity check and performance test following system upgrades. Below are some of the observations based on Tom's runs of dcprog.
Dr. Maltby of Cray Inc. will give a presentation on the capabilities of the Cray model XD1 computer as they pertain to the availability of field-programmable gate arrays (FPGAs).
Title: "FPGA Applications on the Cray XD1"
Date: Tuesday June 7
Time: 2:00 - 3:00 pm
Location: WRRB 010 (ARSC seminar room)
Speaker: Dr. James Maltby of Cray, Inc.
When ARSC hosted WOMPAT several years ago, one of the speakers suggested only static scheduling should be used with OpenMP parallel loops. (See: http://www.arsc.edu/support/news/HPCnews/HPCnews252.shtml.) What follows are some results from a simple OpenMP test program on the X1 and P6X Complex.
The program was originally written to help a user who wanted to use OpenMP sections, so a little complexity is added to facilitate breaking the overall job into large sections. Here's the basic idea:
#define NSECTS 8
#define WKPERSECT 1000000
#define SZ (NSECTS*WKPERSECT)
"SZ" is the total array length, as used in the array declarations:
float a[SZ];
float b[SZ];
float c[SZ];
float x[SZ];
float y[SZ];
Here's the "work" performed on these arrays, modeled on the user's code:
for (n=0; n<NSECTS; n++) {
  for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
    x[i] = sin ((float)i / 1000.0);
    a[i] = k1 * x[i] ;
    y[i] = sin ((float)i / 900.0);
    b[i] = k2 * y[i] ;
    c[i] = a[i] + b[i];
  }
}
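This work could be divided amongst OpenMP threads in various ways. The test program experiments with five methods:
1. OUTER: "parallel for" on the outer loop.
2. STATIC: "parallel for" on the inner loop, with no schedule clause (labeled as the static schedule in the program).
3. GUIDED: "parallel for" on the inner loop, with schedule (guided).
4. DYNAMIC: "parallel for" on the inner loop, with schedule (dynamic, 1000).
5. SECTIONS: "parallel sections", with one section for each block of WKPERSECT iterations.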
Here are some results. Two runs were made on each platform for each value of OMP_NUM_THREADS ("N" in the tables) and, except for a couple of serious outliers on the X1, as noted below, times were consistent within 10%.
Table 1: P655+ Results

                 Time (seconds)
N    OUTER   STATIC   GUIDED   DYNAMIC   SECTIONS
-    -----   ------   ------   -------   --------
2    0.659    0.658    0.661     0.659      0.661
3    0.492    0.440    0.441     0.440      0.495
4    0.332    0.340    0.331     0.330      0.331
5    0.328    0.312    0.275     0.264      0.349
6    0.338    0.266    0.222     0.234      0.329
7    0.323    0.208    0.189     0.189      0.345
8    0.184    0.165    0.166     0.172      0.167
Table 2: X1 Results (compiled in SSP mode)

                 Time (seconds)
N    OUTER   STATIC   GUIDED   DYNAMIC   SECTIONS
-    -----   ------   ------   -------   --------
2    0.384    0.389   13.184     0.396      0.374
3    0.288    0.259    8.776     0.264      0.280
4    0.191    0.195    6.578     0.197      0.187
5    0.191    0.155    5.261     0.158      0.187
6    0.191    0.130    4.385     0.132      0.187
7    0.192    0.111    3.757     0.113      0.186
8    0.095    0.097    3.292     0.099      0.094
Table 3: X1 -- Outliers:

                 Time (seconds)
N    OUTER   STATIC   GUIDED   DYNAMIC   SECTIONS
-    -----   ------   ------   -------   --------
6   10.334    3.557    4.383     0.132      0.187
8   68.454   18.700    3.288     0.099      0.094
On the X1, the scheduler places these jobs on any node where it can find the required number of SSPs. (On the IBM, the tests were run on the debug nodes, and never had to share.) Given the utilization of klondike recently, this meant that these X1 jobs always had to share a node with other jobs. Thus, a likely explanation is that these particular runs were affected by something else happening on the node.
Here's the complete test program, if you'd like to experiment with it:
#include <stdio.h>
#include <math.h>
#include <sys/time.h>
#include <omp.h>
#define NSECTS 8
#define WKPERSECT 1000000
#define SZ (NSECTS*WKPERSECT)
#define NLINES 2
int main () {
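/* static: five 32 MB arrays (160 MB total) can overflow the default stack on many systems */
static float a[SZ];
static float b[SZ];
static float c[SZ];
static float x[SZ];
static float y[SZ];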
float k1,k2;
struct timeval t;
double t1,t2;
int i,n;
/* INITIALIZATION */
k1 = 0.5;
k2 = -0.5;
#pragma omp parallel
{
printf ("Hello from %d of %d openmp threads.\n", omp_get_thread_num(), omp_get_num_threads() );
}
/* Initialization: do nothing but load cache? Place the arrays?
* Without this, the first actual test is consistently slower. */
#pragma omp parallel for private(i)
for (n=0; n<NSECTS; n++)
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++)
x[i] = a[i] = y[i] = b[i] = c[i] = 0.0;
/* OUTER LOOP PARALLEL FOR */
gettimeofday(&t,NULL);
t1 = (double) t.tv_sec + (double) t.tv_usec / 1000000.0;
#pragma omp parallel for private(i)
for (n=0; n<NSECTS; n++) {
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
gettimeofday(&t,NULL);
t2 = (double) t.tv_sec + (double) t.tv_usec / 1000000.0;
printf ("outer loop, PARALLEL FOR: %lf (seconds)\n", t2-t1);
/* POST-PROCESSING */
for (i=0; i<NLINES; i++)
fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);
for (i=SZ-1; i>SZ-NLINES; i--)
fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);
/* INNER LOOP PARALLEL FOR -- STATIC SCHEDULE */
gettimeofday(&t,NULL);
t1 = (double) t.tv_sec + (double) t.tv_usec / 1000000.0;
for (n=0; n<NSECTS; n++) {
#pragma omp parallel for
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
gettimeofday(&t,NULL);
t2 = (double) t.tv_sec + (double) t.tv_usec / 1000000.0;
printf ("inner loop, PARALLEL FOR: STATIC: %lf (seconds)\n", t2-t1);
/* POST-PROCESSING */
for (i=0; i<NLINES; i++)
fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);
for (i=SZ-1; i>SZ-NLINES; i--)
fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);
/* INNER LOOP PARALLEL FOR -- GUIDED SCHEDULE */
gettimeofday(&t,NULL);
t1 = (double) t.tv_sec + (double) t.tv_usec / 1000000.0;
for (n=0; n<NSECTS; n++) {
#pragma omp parallel for schedule (guided)
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
gettimeofday(&t,NULL);
t2 = (double) t.tv_sec + (double) t.tv_usec / 1000000.0;
printf ("inner loop, PARALLEL FOR: GUIDED: %lf (seconds)\n", t2-t1);
/* POST-PROCESSING */
for (i=0; i<NLINES; i++)
fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);
for (i=SZ-1; i>SZ-NLINES; i--)
fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);
/* INNER LOOP PARALLEL FOR -- DYNAMIC SCHEDULE */
gettimeofday(&t,NULL);
t1 = (double) t.tv_sec + (double) t.tv_usec / 1000000.0;
for (n=0; n<NSECTS; n++) {
#pragma omp parallel for schedule (dynamic, 1000)
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
gettimeofday(&t,NULL);
t2 = (double) t.tv_sec + (double) t.tv_usec / 1000000.0;
printf ("inner loop, PARALLEL FOR: DYNAMIC: %lf (seconds)\n", t2-t1);
/* POST-PROCESSING */
for (i=0; i<NLINES; i++)
fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);
for (i=SZ-1; i>SZ-NLINES; i--)
fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);
/* SECTIONS */
gettimeofday(&t,NULL);
t1 = (double) t.tv_sec + (double) t.tv_usec / 1000000.0;
#pragma omp parallel sections private(i,n)
{
#pragma omp section
{
n=0;
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
#pragma omp section
{
n=1;
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
#pragma omp section
{
n=2;
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
#pragma omp section
{
n=3;
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
#pragma omp section
{
n=4;
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
#pragma omp section
{
n=5;
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
#pragma omp section
{
n=6;
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
#pragma omp section
{
n=7;
for (i=n*WKPERSECT; i<(n+1)*WKPERSECT; i++) {
x[i] = sin ((float)i / 1000.0);
a[i] = k1 * x[i] ;
y[i] = sin ((float)i / 900.0);
b[i] = k2 * y[i] ;
c[i] = a[i] + b[i];
}
}
} /* End sections */
gettimeofday(&t,NULL);
t2 = (double) t.tv_sec + (double) t.tv_usec / 1000000.0;
printf ("SECTIONS: %lf (seconds)\n", t2-t1);
/* POST-PROCESSING */
for (i=0; i<NLINES; i++)
fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);
for (i=SZ-1; i>SZ-NLINES; i--)
fprintf (stderr, "i=%d: %12.10f %12.10f %12.10f \n", i, a[i], b[i], c[i]);
return 0;
}
A:[[ I always keep two terminal windows open to klondike (or whatever
  [[ remote system I'm using).
  [[
  [[ Are there any tricks for exchanging information between the two
  [[ windows? For instance, I may type a command in one session, but want
  [[ the same thing to happen in both sessions. E.g., "module switch
  [[ PrgEnv PrgEnv.new" or "cd ~/blah/blah/blah". Or I may want to pass
  [[ an environment variable between sessions.
  [[
  [[ Yes, I know how to "cut" and "paste" between windows using the
  [[ so-called "clipboard"... but it's a pain in the guru. Any other
  [[ solutions?

  #
  # Thanks to Ed Kornkven:
  #
  One solution for ksh users, although not without its problems, is to
  use a common HISTFILE for both (all) terminal sessions. The ksh
  HISTFILE is where the most recently issued commands are stored for
  possible later reuse. I usually keep separate files for each session
  so that the histories of the two sessions are not confused (I do this
  by appending the terminal id -- the output of `tty` -- to the history
  file name). If a "constant" name is used, however, the command
  histories of both sessions will be visible to both sessions, and you
  can then reuse commands from one session window in the other.

  #
  # And an idea from the Editor...
  #
  The multiple sessions can share files. Here's one sample "utility"
  using this idea. This allows you to save a directory path, and recall
  it and "cd" to it later, from any session. You'd add this to your
  .profile file:

    export cdx_file_name="$HOME/.cdx_file_name.temp"
    alias svx='wdx=`pwd`; echo $wdx >| $cdx_file_name'
    alias cdx='cd $(cat $cdx_file_name)'

  When you want to save your PWD, type "svx". Later, type "cdx" and
  you'll "cd" right back to that directory. If a memory of one isn't
  enough, you could create more, like: "sv1" and "cd1", "sv2" and "cd2",
  etc...

Q: I'm trying to find the version of the IBM Fortran compiler, xlf, but
   the man page doesn't list an option for this. Where can I find the
   version?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Contact:
Thomas J. Baring
ARSC Web Specialist
ph: 907-450-8619

Donald Bahls
ARSC User Consultant
ph: 907-450-8674

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020