Congratulations and Hello from Mike Ess from DEC West
With the 100th T3D newsletter, ARSC has shown its commitment to distributing information to the T3D community. Parallel processing is tough, and sometimes the right information, at the right time, can make a big difference. I hope that ARSC someday publishes the 200th T3E newsletter. Since leaving ARSC, I have taken a job working on DEC's version of the Visual C++ compiler in Windows NT. (I guess I just couldn't get enough of that DEC Alpha processor.) Seattle is as close to Alaska as I could get in the lower 48, but a world apart.
>
> What's the mileage to Lawton, OK from Fairbanks, AK? (How 'bout route,
> bandwidth, and latency?)
>
Mileage   - 3900 miles - 3900 LONG miles!!
Route     - Only 2 possibilities to Edmonton - Alcan or Cassiar
Bandwidth - Two-lane till around Edmonton
Latency   - VERY heavy, due to number of motor-homes!!
www-ocean.tamu.edu/GOM/gom-resource.html.
Project Title: High-performance Modeling of the Arctic Ocean and Sea Ice.
People Involved:
  - Wieslaw Maslowski, Albert Semtner, Yuxia Zhang,
    Naval Postgraduate School
  - Anthony Craig, Robert Chervin, NCAR
This project can be described in short as an advanced oceanographic
application using massively parallel processor (MPP) and parallel-vector
processor (PVP) architectures. Our modeling efforts to use advanced
computers have resulted in development of two high resolution coupled
models of the Arctic Ocean and Sea Ice. The PVP-version uses a modified
free-surface version of the Semtner/Chervin parallel ocean climate
model (POCM) coupled to the thermodynamic-dynamic sea-ice model of Hibler
with more efficient numerics (this version runs elsewhere now). The MPP-
version uses the Parallel Ocean Program (POP) of Los Alamos National
Laboratory adapted to the Arctic Ocean coupled to a recently developed
massively parallel version of the above mentioned sea ice model. The
model domain extends beyond the Central Arctic and includes the Nordic
Seas, Canadian Archipelago, and subarctic North Atlantic. The resolution
is ~18km (1/6-degree in rotated coordinates) and 30 levels.
In this ongoing study both models have already been integrated for multiple decades (the MPP-version for 80+ years) using high-frequency realistic atmospheric forcing. We will continue integration (~200 years) of the massively parallel coupled model forced with high-frequency multi-year observed and re-analyzed atmospheric forcing to determine the long-term turbulent circulation of the ice-covered Arctic Ocean. An eddy-resolving version of the model at 1/12-degree (~9km) and 40 levels will be configured and run as soon as a new, more powerful Cray T3E becomes available at ARSC.
Many scientific and practical applications can be made with the model and its output, related to climate change, navigational forecasts, biological productivity, pollutant dispersal, and basic ocean dynamics. Our results already provide improved information on ocean thermohaline and wind-driven circulation, water mass formation, shelf/basin and ice/ocean interactions, and neutral tracer dispersion in the Arctic.
Wieslaw Maslowski, Oceanography Department, Naval Postgraduate School
www-hpcc.astro.washington.edu/picture.html.
Quote: "Science! Curse thee thou vain toy; and cursed be all the things that cast man's eyes aloft to that heaven." -- Moby Dick, chapter 118, Herman Melville
www7320.nrlssc.navy.mil/html/lsm-home.html.
==== P.S. my vote: MORE PE's!
(although more PE's implicitly means more memory)
==== Vote: B) More PES
==== ps - I vote for more memory.
==== I vote of course for both bigger memory and more PEs
ARSC hired me last November. Prior to that, I had been a software engineer at NOAA's Climate Monitoring and Diagnostics Lab (CMDL) in Boulder, where I worked with Dr. Jim Elkins on in situ measurements of stratospheric halocompounds and nitrous oxide. In other words, we studied the ozone hole.
Jim (et al.) build miniaturized, high-precision gas chromatographs and launch them on balloons and stratospheric aircraft. If you're interested, here's the URL:
www.cmdl.noaa.gov/noah_home/noah.html
I started at NOAA while finishing up my master's in computer science at CU-Boulder, where I worked on load-balancing and distributed processing. Here's the URL of the CS department at Boulder:
I have written a program that performs simple timings of a one-sided "put" operation, and implemented it using both EPCC's MPI2 library and CRI's shmem library. My code, makefiles, and results from runs on ARSC's T3D are provided below. Since the MPI2 routines are built on top of shmem, we expected them to add some overhead. In these runs they increase latency, per "Put" call, by 20 or so microseconds for small buffers, and decrease bandwidth by roughly 2 to 6 Mbytes/sec for buffers of 10,000 words or more.
Here is my timing algorithm:
----------------------------
  switch on PE type:
    case SENDER:
      Synchronize at barrier;
      Start timer;
      Put data to the RECEIVER PE;
      Stop timer;
    case RECEIVER:
      Synchronize at barrier;
      Start timer;
      Wait for acknowledgment that Put has completed;
      Stop timer;
  end switch;
The primary usefulness of the MPI2 prototype is probably stated best in EPCC's documentation, which is available, in PostScript, at:
www.epcc.ed.ac.uk/t3dmpi/Product/
Title: "Using MPI 2 One Sided Communications on Cray T3D."
Author: A. Gordon Smith
Date: 12 Dec 1995.
Here is an excerpt:
"The MPI 2 Forum is actively - at the time of writing - devising
extensions to the standard message-passing interface, MPI. One area
of this effort is the One-sided Communications interface. This
attempts to standardise remote memory read and write operations,
within the existing MPI framework. The communications model provided
is similar to that of the Cray T3D Shared Memory Access (SMA)
library, via shmem_put and shmem_get, familiar to many Cray T3D
users. Edinburgh Parallel Computing Centre (EPCC) has provided an
efficient and robust MPI for the Cray T3D and has extended this with
a prototype implementation of a subset of the currently proposed
One-sided Communications interface. It is hoped that this will
encourage a useful exchange between Cray T3D users and the MPI2
Forum: T3D users benefit from the portability required of MPI
One-sided; and the MPI 2 Forum benefits from feedback on the proposal
from Cray T3D users that have developed skills and experience using
the remote read/write model."
Incidentally, Gordon Smith has been extremely helpful whenever I have had questions concerning either EPCC/MPI or EPCC/MPI2.
Another source of information is the MPI2 page at Argonne National Laboratory:
www.mcs.anl.gov/Projects/mpi/mpi2/mpi2.html
======================================================================
Here are my codes, makefiles, and results:
[ shmem version ]
####################
# makefile
all: shmem_wait

shmem_wait: shmem_wait.c
	/mpp/bin/cc -Tcray-t3d -X2 -O0 -o shmem_wait \
		shmem_wait.c
####################
/* shmem_wait.c */
/* Measure communication rates between two PEs using shmem_put
 * and shmem_wait.
 *
 * Tom Baring, ARSC, August, 1996
 */
#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <mpp/shmem.h>
#include <mpp/limits.h>
#include <mpp/time.h>

#define BUFSZ 1000000

/* d=delta clock ticks; n=number of 8-byte words */
#define USECS(d) ((float)(d)*1000000.0/(float)CLK_TCK)
#define MBR(d,n) (((float)(n)*8.0/1000000.0)/(USECS(d)/1000000.0))

/* long buf[BUFSZ]; */

main () {
  int n;
  long nwords;
  int mype, otherpe, npes;
  long t1,t2;
  long buf[BUFSZ];
  fortran irtc();
  long pSync[_SHMEM_BARRIER_SYNC_SIZE];

  npes = shmem_n_pes();
  if (npes < 2) {
    printf ("ERROR: Minimum of 2 PEs required.\n");
    exit (1);
  }

  /* Init work array for shmem_barrier() */
  for (n = 0; n < _SHMEM_BARRIER_SYNC_SIZE; n++)
    pSync[n] = _SHMEM_SYNC_VALUE;

  /* Can't use shmem_barrier() until all PEs have initialized pSync */
  barrier();

  for (n = 0; n < BUFSZ; n++)
    buf[n] = 0;

  mype = shmem_my_pe();
  shmem_set_cache_inv();   /* Reload cache when put received */

  switch (mype) {
    case 0:   /* SENDER */
      otherpe = 1;
      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        buf[nwords-1] = 1;  /* Use as flag to shmem_wait() on receiver */
        shmem_barrier( 0, 1, 2, pSync );  /* sync the PEs */
        t1 = irtc ();
        shmem_put (buf, buf, nwords, otherpe);
        t2 = irtc ();
        printf ("SENDER: nwords=%ld ticks=%ld usecs=%f\n",
                nwords, (t2-t1), USECS(t2-t1));
      }
      break;
    case 1:   /* RECEIVER */
      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        shmem_barrier( 0, 1, 2, pSync );  /* sync the PEs */
        t1 = irtc ();
        shmem_wait( &buf[nwords-1], 0 );  /* wait for last element */
        t2 = irtc ();
        printf ("RECEIVER: nwords=%ld ticks=%ld usecs=%f; mbr=%f\n",
                nwords, t2-t1, USECS(t2-t1), MBR(t2-t1,nwords));
      }
      break;
    default:  /* any additional PEs do no work */
      break;
  }
}
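As a cross-check on the unit conversions done by the USECS and MBR macros, here is a minimal portable sketch of the same arithmetic written as plain C functions. The function names and the ASSUMED_TICKS_PER_SEC constant are illustrative assumptions, not part of the original program: the tick rate is inferred from the ticks and usecs columns of the output reported in this issue (about 150 MHz), not read from CLK_TCK on the T3D.

```c
/* Assumed tick rate, inferred from the reported output (ticks / usecs).
 * This is NOT taken from the T3D headers; treat it as an approximation. */
#define ASSUMED_TICKS_PER_SEC 150015000.0

/* Convert a tick delta to microseconds (same formula as the USECS macro,
 * but in double precision and with the tick rate passed in explicitly). */
double ticks_to_usecs(long ticks, double ticks_per_sec)
{
    return (double)ticks * 1.0e6 / ticks_per_sec;
}

/* Transfer rate in Mbytes/sec for nwords 8-byte words moved in the given
 * number of microseconds (same formula as the MBR macro; 1 Mbyte = 10^6 bytes). */
double mbytes_per_sec(long nwords, double usecs)
{
    double mbytes = (double)nwords * 8.0 / 1.0e6;
    double secs   = usecs / 1.0e6;
    return mbytes / secs;
}
```

For example, ticks_to_usecs(9478044, ASSUMED_TICKS_PER_SEC) comes out at roughly 63180.6 microseconds, and mbytes_per_sec(1000000, 63180.638567) at roughly 126.62, in line with the last RECEIVER line of the shmem output.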
======================================================================
[ MPI2 version ]
####################
# makefile
MPI2_15_INC_PATH=/u1/uaf/baring/MPI2
MPI2_15_LIB_PATH=/u1/uaf/baring/MPI2

all: mpi2_wait

mpi2_wait: mpi2_wait.c
	/mpp/bin/cc -V -Tcray-t3d -X2 -o mpi2_wait mpi2_wait.c \
		-I$(MPI2_15_INC_PATH) \
		-L$(MPI2_15_LIB_PATH) \
		-lmpi2
####################
/* mpi2_wait.c */
/* Measure communication rates between two PEs using MPI2_Put
 * and MPI_Wait.
 *
 * Tom Baring, ARSC, August, 1996
 */
/* Note on linking: do not link libmpi.a -- libmpi2.a has MPI2
 * specific version of MPI_Wait().
 */
#include <stdio.h>
#include <mpp/limits.h>
#include <mpp/time.h>
#include "mpi.h"    /* Must include mpi.h and mpi2.h: mpi.h first. */
#include "mpi2.h"

#define BUFSZ 1000000

/* d=delta clock ticks; n=number of 8-byte words */
#define USECS(d) ((float)(d)*1000000.0/(float)CLK_TCK)
#define MBR(d,n) (((float)(n)*8.0/1000000.0)/(USECS(d)/1000000.0))

#define NCOUNTERS 1
#define COUNTER_0 0

void dump_status (MPI_Status *s);

main (int argc, char *argv[]) {
  int n;
  long nwords;
  int mype, otherpe, npes;
  int err, cnt;
  long buf[BUFSZ];
  long t1,t2;
  fortran irtc();
  MPI_Comm comm;
  MPI_Request req;
  MPI_Status status;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &mype);
  MPI_Comm_size(MPI_COMM_WORLD, &npes);
  if (npes < 2) {
    printf ("ERROR: minimum of 2 pes required.\n");
    goto FINALIZE;
  }

  MPI2_RMC_init (buf, BUFSZ, MPI_LONG, NCOUNTERS, &req,
                 MPI_COMM_WORLD, &comm);

  /* Initialize buf */
  for (n = 0; n < BUFSZ; n++)
    buf[n] = 0;

  /* Divide "work" among pes */
  switch (mype) {
    case 0:   /* SENDER */
      otherpe = 1;
      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        MPI_Barrier(MPI_COMM_WORLD);
        t1 = irtc ();
        err = MPI2_Put (buf, nwords, MPI_LONG, otherpe, 0, nwords,
                        MPI_LONG, comm, COUNTER_0, 1);
        t2 = irtc ();
        printf ("SENDER: nwords=%ld ticks=%ld usecs=%f\n",
                nwords, (t2-t1), USECS(t2-t1));
      }
      break;
    case 1:   /* RECEIVER */
      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        err = MPI2_Set_counter_threshold (req, 1);
        MPI_Barrier(MPI_COMM_WORLD);
        t1 = irtc ();
        err = MPI_Wait (&req, &status);
        t2 = irtc ();
        printf ("RECEIVER: nwords=%ld ticks=%ld usecs=%f; mbr=%f\n",
                nwords, t2-t1, USECS(t2-t1), MBR(t2-t1,nwords));
      }
      break;
    default:  /* any additional PEs do no work */
      break;
  }

FINALIZE:
  MPI_Finalize ();
}
======================================================================
Here is the output from these programs, obtained on ARSC's T3D:
[ shmem version ]
RECEIVER: nwords=1 ticks=1317 usecs=8.779122; mbr=0.911253
RECEIVER: nwords=10 ticks=957 usecs=6.379362; mbr=12.540440
RECEIVER: nwords=100 ticks=1476 usecs=9.839016; mbr=81.308947
RECEIVER: nwords=1000 ticks=10136 usecs=67.566573; mbr=118.401743
RECEIVER: nwords=10000 ticks=95936 usecs=639.509348; mbr=125.095904
RECEIVER: nwords=100000 ticks=956004 usecs=6372.722388; mbr=125.535046
RECEIVER: nwords=1000000 ticks=9478044 usecs=63180.638567; mbr=126.621069
SENDER: nwords=1 ticks=1094 usecs=7.292604
SENDER: nwords=10 ticks=758 usecs=5.052828
SENDER: nwords=100 ticks=1577 usecs=10.512282
SENDER: nwords=1000 ticks=10163 usecs=67.746555
SENDER: nwords=10000 ticks=95966 usecs=639.709328
SENDER: nwords=100000 ticks=956124 usecs=6373.522308
SENDER: nwords=1000000 ticks=9478176 usecs=63181.518478
[ MPI2 version ]
RECEIVER: nwords=1 ticks=4549 usecs=30.323633; mbr=0.263821
RECEIVER: nwords=10 ticks=3419 usecs=22.791053; mbr=3.510149
RECEIVER: nwords=100 ticks=4508 usecs=30.050327; mbr=26.622007
RECEIVER: nwords=1000 ticks=13048 usecs=86.977964; mbr=91.977319
RECEIVER: nwords=10000 ticks=101002 usecs=673.279303; mbr=118.821416
RECEIVER: nwords=100000 ticks=977533 usecs=6516.234696; mbr=122.770286
RECEIVER: nwords=1000000 ticks=9658318 usecs=64382.344998; mbr=124.257667
SENDER: nwords=1 ticks=3987 usecs=26.577341
SENDER: nwords=10 ticks=3456 usecs=23.037695
SENDER: nwords=100 ticks=4104 usecs=27.357263
SENDER: nwords=1000 ticks=12858 usecs=85.711424
SENDER: nwords=10000 ticks=100647 usecs=670.912873
SENDER: nwords=100000 ticks=977088 usecs=6513.268326
SENDER: nwords=1000000 ticks=9657798 usecs=64378.878679
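
To make the comparison between the two libraries easier to read, the following small sketch tabulates the MPI2 overhead from the RECEIVER lines above. The arrays simply copy the reported usecs and mbr values; the function and array names are hypothetical helpers for illustration, not part of either timing program.

```c
#include <stdio.h>

#define NSIZES 7

/* RECEIVER results copied from the output listed in this issue. */
static const long   nwords[NSIZES]      = { 1, 10, 100, 1000, 10000, 100000, 1000000 };
static const double shmem_usecs[NSIZES] = { 8.779122, 6.379362, 9.839016, 67.566573,
                                            639.509348, 6372.722388, 63180.638567 };
static const double mpi2_usecs[NSIZES]  = { 30.323633, 22.791053, 30.050327, 86.977964,
                                            673.279303, 6516.234696, 64382.344998 };
static const double shmem_mbr[NSIZES]   = { 0.911253, 12.540440, 81.308947, 118.401743,
                                            125.095904, 125.535046, 126.621069 };
static const double mpi2_mbr[NSIZES]    = { 0.263821, 3.510149, 26.622007, 91.977319,
                                            118.821416, 122.770286, 124.257667 };

/* Extra time per Put (usecs) of the MPI2 version relative to shmem. */
double added_latency_usecs(int i)
{
    return mpi2_usecs[i] - shmem_usecs[i];
}

/* Bandwidth given up (Mbytes/sec) by the MPI2 version relative to shmem. */
double lost_bandwidth_mbytes(int i)
{
    return shmem_mbr[i] - mpi2_mbr[i];
}

/* Print one line of overhead figures per message size. */
void print_overhead(void)
{
    int i;
    for (i = 0; i < NSIZES; i++)
        printf("nwords=%-8ld added latency=%10.3f usecs  lost bandwidth=%7.3f Mbytes/sec\n",
               nwords[i], added_latency_usecs(i), lost_bandwidth_mbytes(i));
}
```

Calling print_overhead() from a main() lists the differences for all seven message sizes. For the three smallest sizes the added latency works out to roughly 16 to 22 microseconds, and for 10,000 words and up the bandwidth difference is roughly 2 to 6 Mbytes/sec; the 1000-word case shows a larger gap of about 26 Mbytes/sec.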
>
> PITTSBURGH SUPERCOMPUTING CENTER ACCEPTS FIRST
> PRODUCTION MODEL OF POWERFUL NEW SYSTEM FROM
> CRAY RESEARCH
>
> Highly Parallel CRAY T3E Advances U.S. Science and
> Engineering
>
> CHIPPEWA FALLS, Wisc., Aug. 15, 1996 -- The Pittsburgh
> Supercomputing Center (PSC) today formally accepted the first
> production model of the CRAY T3E. Produced by Silicon Graphics
> subsidiary Cray Research, the CRAY T3E is one of the world's
> most powerful supercomputers, a "highly parallel" system that
> teams tens, hundreds or thousands of processors to work
> simultaneously on the same computing task.
>
> PSC's CRAY T3E system will have 512 processors and be capable
> of more than 300 billion calculations per second. This
> represents a four-fold increase in scientific productivity over
> PSC's current highly parallel system, the CRAY T3D.
>
> This new system substantially boosts the computational
> capability of U.S. scientists and engineers, said PSC scientific
> directors Michael Levine and Ralph Roskies: "This is a big step
> forward in U.S. high-performance computing."
>
Contact:
Donald Bahls, ARSC User Consultant, ph: 907-450-8674
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020