ARSC T3D Users' Newsletter 100, August 16, 1996

CENTENNIAL ISSUE!!!

Hello From Mike Ess

[ In case you just subscribed, Mike Ess started this Newsletter, and edited/published/delivered/wrote the first 87 issues. ]

Congratulations and Hello from Mike Ess from DEC West

With the 100th T3D newsletter, ARSC has shown its commitment to distributing information to the T3D community. Parallel processing is tough, and sometimes the right information, at the right time, can make a big difference. I hope that ARSC someday publishes the 200th T3E newsletter. Since leaving ARSC, I have taken a job working on DEC's version of the Visual C++ compiler for Windows NT. (I guess I just couldn't get enough of that DEC Alpha processor.) Seattle is as close to Alaska as I could get in the lower 48, but a world apart.

Mike Ess (ess@zso.dec.com)

Don Morton Drives the Alcan -- Again

[ We got a response to the following question, which you may remember from Newsletter #98. ]

> What's the mileage to Lawton, OK from Fairbanks, AK?  (How 'bout route,
> bandwidth, and latency?)

Mileage   - 3900 miles - 3900 LONG miles!!
Route     - Only 2 possibilities to Edmonton - Alcan or Cassiar
Bandwidth - Two-lane till around Edmonton
Latency   - VERY heavy, due to number of motor-homes!!

Meet ARSC's Big T3D Users

[ Run "mppview" right now: I bet you'll see some familiar names. I asked our big T3D users to contribute something for this Newsletter, and got the following responses (I bet everyone else is on vacation). Thanks to all! ]

-- Charlie Barron --

I am a postdoc at the Naval Research Laboratory, Stennis Space Center, MS. I am running two Gulf of Mexico circulation models at ARSC: one to examine the circulation with high horizontal resolution and low vertical resolution, and another with lower horizontal but higher vertical resolution. The NRL Stennis web address is www.nrlssc.navy.mil. For my group, look for the Ocean Dynamics and Prediction Branch under the Oceanography Division. You can find more Gulf of Mexico resources from my alma mater at:

www-ocean.tamu.edu/GOM/gom-resource.html .

-- Wieslaw Maslowski --


Project Title: High-performance Modeling of the Arctic Ocean and Sea Ice.
People Involved: - Wieslaw Maslowski, Albert Semtner, Yuxia Zhang,
                   Naval Postgraduate School
                 - Anthony Craig, Robert Chervin, NCAR
This project can be described, in short, as an advanced oceanographic application using massively parallel processor (MPP) and parallel-vector processor (PVP) architectures. Our modeling efforts on advanced computers have resulted in the development of two high-resolution coupled models of the Arctic Ocean and Sea Ice. The PVP-version uses a modified free-surface version of the Semtner/Chervin parallel ocean climate model (POCM) coupled to the thermodynamic-dynamic sea-ice model of Hibler with more efficient numerics (this version runs elsewhere now). The MPP-version uses the Parallel Ocean Program (POP) of Los Alamos National Laboratory adapted to the Arctic Ocean, coupled to a recently developed massively parallel version of the above-mentioned sea-ice model. The model domain extends beyond the Central Arctic and includes the Nordic Seas, Canadian Archipelago, and subarctic North Atlantic. The resolution is ~18 km (1/6-degree in rotated coordinates) and 30 levels.

In this ongoing study both models have already been integrated for multiple decades (the MPP-version for 80+ years) using high-frequency realistic atmospheric forcing. We will continue integration (~200 years) of the massively parallel coupled model, forced with high-frequency, multi-year observed and re-analyzed atmospheric forcing, to determine the long-term turbulent circulation of the ice-covered Arctic Ocean. An eddy-resolving version of the model at 1/12-degree (~9 km) and 40 levels will be configured and run as soon as a new, more powerful Cray T3E becomes available at ARSC.

Many scientific and practical applications can be made with the model and its output, related to climate change, navigational forecasts, biological productivity, pollutant dispersal, and basic ocean dynamics. Our results already provide improved information on ocean thermohaline and wind-driven circulation, water mass formation, shelf/basin and ice/ocean interactions, and neutral tracer dispersion in the Arctic.

Wieslaw Maslowski, Oceanography Department, Naval Postgraduate School, Monterey, CA, maslowsk@ucar.edu, (408) 656-3162, (fax) (408) 656-2712, vislab-www.nps.navy.mil/~braccio/maslowski/arctic.html

-- Tom Quinn --

I am a Research Assistant Professor at the University of Washington; my primary interest is the simulation of the large-scale structure of the Universe. On the ARSC T3D, we have been simulating the formation of clusters of galaxies similar to the group to which our galaxy belongs. The method of simulation is a particle code: galaxies are represented as collections of particles interacting through their mutual gravitational force. A tree is used to reduce the computational complexity from order N^2 to order N log(N). For some pictures of our simulations see:

www-hpcc.astro.washington.edu/picture.html .
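
[ Editor's aside: for readers who haven't run into tree codes before, here
  is a minimal, hypothetical sketch of the idea Tom describes.  It is NOT
  his code; the node layout, routine names, and opening angle THETA are my
  own illustrative assumptions.  The point is that a tree cell which is far
  away, relative to its size, can be replaced by a single point mass at its
  center of mass, which is where the order N log(N) behavior comes from.
  Compile with something like "cc -o tree_sketch tree_sketch.c -lm". ]

/* tree_sketch.c -- illustrative only (see note above). */
#include <stdio.h>
#include <math.h>

#define THETA 0.5            /* opening-angle criterion (assumed value) */

typedef struct node {
    double mass;             /* total mass in this cell */
    double com[3];           /* center of mass of this cell */
    double size;             /* side length of this cell */
    struct node *child[8];   /* sub-cells; unused entries are NULL */
    int nchild;              /* 0 => leaf holding a single particle */
} Node;

/* Accumulate into acc[] the gravitational acceleration at pos[] due to the
*  tree rooted at n (G = 1 in these code units).  A cell that subtends a
*  small angle (size/distance < THETA) is treated as one point mass at its
*  center of mass; otherwise its children are examined recursively.  Each
*  particle thus interacts with O(log N) cells instead of N-1 particles.
*/
static void tree_accel(const Node *n, const double pos[3], double acc[3])
{
    double dx[3], r2 = 0.0, r, f;
    int i;

    if (n == NULL || n->mass == 0.0)
        return;

    for (i = 0; i < 3; i++) {
        dx[i] = n->com[i] - pos[i];
        r2 += dx[i] * dx[i];
    }
    if (r2 == 0.0)           /* skip self-interaction */
        return;
    r = sqrt(r2);

    if (n->nchild == 0 || n->size / r < THETA) {
        f = n->mass / (r2 * r);          /* |a| = m/r^2, direction dx/r */
        for (i = 0; i < 3; i++)
            acc[i] += f * dx[i];
    } else {
        for (i = 0; i < n->nchild; i++)
            tree_accel(n->child[i], pos, acc);
    }
}

int main(void)
{
    /* Two distant clumps of mass and a parent cell containing both --
    *  just enough structure to exercise both branches of the test.   */
    Node a    = { 1.0e3, { 10.0,  0.0, 0.0 },  1.0, { NULL }, 0 };
    Node b    = { 2.0e3, {  0.0, 20.0, 0.0 },  1.0, { NULL }, 0 };
    Node root = { 3.0e3, {  0.0,  0.0, 0.0 }, 40.0, { &a, &b }, 2 };
    double pos[3] = { 0.0, 0.0, 0.0 };
    double acc[3] = { 0.0, 0.0, 0.0 };
    int i;

    /* Parent's center of mass = mass-weighted mean of its children's. */
    for (i = 0; i < 3; i++)
        root.com[i] = (a.mass * a.com[i] + b.mass * b.com[i]) / root.mass;

    tree_accel(&root, pos, acc);
    printf("acc = (%g, %g, %g)\n", acc[0], acc[1], acc[2]);
    return 0;
}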


Quote:
        Science!  Curse thee thou vain toy; and cursed be all the
things that cast man's eyes aloft to that heaven.
                                        Moby Dick, chapter 118
                                        Herman Melville

-- Jay Shriver --

I am an oceanographer with the Naval Research Laboratory at Stennis Space Center, Mississippi. The work I do involves using numerical models to better understand the dynamics of the ocean. To view some on-line examples of our project's recent accomplishments, see the URL:

www7320.nrlssc.navy.mil/html/lsm-home.html

VOTE: Bigger Memory? More PEs?

[ I also asked the Big users to vote on this happy choice (no promises, here). Unfortunately, I don't know how to interpret the results. ]

==== P.S. my vote:  MORE PE's! 
     (although more PE's implicitly means more memory)

==== Vote: B) More PES

==== ps - I vote for more memory. 

==== I vote of course for both bigger memory and more PEs

Meet Tom Baring

I'm a User Consultant specializing in the T3D, and the current editor of this Newsletter.

ARSC hired me last November. Prior to that, I had been a software engineer at NOAA's Climate Monitoring and Diagnostics Lab (CMDL) in Boulder, where I worked with Dr. Jim Elkins on in situ measurements of stratospheric halocompounds and nitrous oxide. In other words, we studied the ozone hole.

Jim (et al.) build miniaturized, high-precision gas chromatographs and fly them on balloons and stratospheric aircraft. If you're interested, here's the URL:

www.cmdl.noaa.gov/noah_home/noah.html

I started at NOAA while finishing up my master's in computer science at CU-Boulder, where I worked on load-balancing and distributed processing. Here's the URL of the CS department at Boulder:

www.cs.colorado.edu

Shmem and The MPI 2 One-Sided Communication Interface

EPCC provides a "prototype implementation of a subset of the currently proposed [MPI 2] One-sided Communications interface." If this standard is accepted, it could be a boon to T3D users who desire the low latency and speed of the shmem routines, but fear their platform dependence.

I have written a program which performs simple timings of a one-sided "put" operation, and implemented it using both EPCC's MPI2 library and CRI's shmem library. My code, makefiles, and results, running on ARSC's T3D, are provided below. We expect that, since the MPI2 routines are built on top of shmem, they will add some overhead. They seem to increase latency, per "Put" call, by 20 or so microseconds for small buffers, and decrease bandwidth by 1-10 mbytes/sec for large buffers.


  Here is my timing algorithm:
  ----------------------------   
      switch on PE type:
        case SENDER:
            Synchronize at barrier;
    
            Start timer;
              Put data to the RECEIVER PE;
            Stop timer;
    
        case RECEIVER:
            Synchronize at barrier;
    
            Start timer;
              Wait for acknowledgment that Put has completed;
            Stop timer;

      end switch;

The primary usefulness of the MPI2 prototype is probably stated best in EPCC's documentation. This is available, in postscript, at:

www.epcc.ed.ac.uk/t3dmpi/Product/


Title:  "Using MPI 2 One Sided Communications on Cray T3D."
Author: A. Gordon Smith
Date:   12 Dec 1995.

Here is an excerpt:
    
      "The MPI 2 Forum is actively - at the time of writing - devising
  extensions to the standard message-passing interface, MPI.  One area
  of this effort is the One-sided Communications interface.  This
  attempts to standardise remote memory read and write operations,
  within the existing MPI framework.  The communications model provided
  is similar to that of the Cray T3D Shared Memory Access (SMA)
  library, via shmem_put and shmem_get, familiar to many Cray T3D
  users.  Edinburgh Parallel Computing Centre (EPCC) has provided an
  efficient and robust MPI for the Cray T3D and has extended this with
  a prototype implementation of a subset of the currently proposed
  One-sided Communications interface. It is hoped that this will
  encourage a useful exchange between Cray T3D users and the MPI2
  Forum:  T3D users benefit from the portability required of MPI
  One-sided; and the MPI 2 Forum benefits from feedback on the proposal
  from Cray T3D users that have developed skills and experience using
  the remote read/write model."

Incidentally, Gordon Smith has been extremely helpful whenever I have had questions concerning either EPCC/MPI or EPCC/MPI2.

Another source of information is the MPI2 page at Argonne National Laboratory:

www.mcs.anl.gov/Projects/mpi/mpi2/mpi2.html


======================================================================
Here are my codes, makefiles, and results:


 [ shmem version ]
####################
# makefile 

all:        shmem_wait

shmem_wait:        shmem_wait.c
        /mpp/bin/cc -Tcray-t3d -X2 -O0 -o shmem_wait \
        shmem_wait.c
####################

/* shmem_wait.c */
/* Measure communication rates between two PEs using shmem_put 
*  and shmem_wait.  
*
*  Tom Baring, ARSC, August, 1996 
*/

#include <stdio.h>
#include <mpp/shmem.h>
#include <mpp/limits.h>
#include <mpp/time.h>

#define BUFSZ 1000000

/* d=delta clock tics; n=number words*/
#define USECS(d)  ((float)(d)*1000000.0/(float)CLK_TCK)
#define MBR(d,n)  (((float)(n)*8.0/1000000.0)/(USECS(d)/1000000.0))

/*  long        buf[BUFSZ];    */

main () {
  int         n;
  long        nwords;
  int         mype, otherpe, npes; 
  long        t1,t2;
  long        buf[BUFSZ]; 
  fortran     irtc();
  long        pSync[_SHMEM_BARRIER_SYNC_SIZE];

  npes = shmem_n_pes();
  if (npes < 2) {
    printf ("ERROR: Minimum of 2 PEs required.\n");
    exit (1);
  }  

  /* Init work array for shmem_barrier() */
  for (n = 0; n < _SHMEM_BARRIER_SYNC_SIZE; n++)
    pSync[n] = _SHMEM_SYNC_VALUE;

  /* Can't use shmem_barrier() until all PEs have initialized pSync */
  barrier(); 

  for (n = 0; n < BUFSZ; n++) 
    buf[n] = 0; 
        
  mype = shmem_my_pe();
  shmem_set_cache_inv();             /* Reload cache when put received */

  switch (mype) {
    case 0:
      otherpe = 1;

      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        buf[nwords-1] = 1;    /* Use as flag to shmem_wait() on receiver */

        shmem_barrier( 0, 1, 2, pSync );                 /* sync the PEs */

        t1 = irtc ();
        shmem_put (buf, buf, nwords, otherpe);
        t2 = irtc ();

        printf ("SENDER: nwords=%ld ticks=%ld usecs=%f\n", 
          nwords, (t2-t1), USECS(t2-t1));
      }
      
      break;


    case 1:
      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        shmem_barrier( 0, 1, 2, pSync );                 /* sync the PEs */

        t1 = irtc ();
        shmem_wait( &buf[nwords-1], 0 );    /* wait for last element */
        t2 = irtc ();

        printf ("RECEIVER: nwords=%ld ticks=%ld usecs=%f; mbr=%f\n", 
          nwords, t2-t1,  USECS(t2-t1), MBR(t2-t1,nwords);
      }
      
      break;


    default:
      break;
  }
}

======================================================================
 
 [ MPI2 version ]
####################
# makefile

MPI2_15_INC_PATH=/u1/uaf/baring/MPI2
MPI2_15_LIB_PATH=/u1/uaf/baring/MPI2

all:        mpi2_wait

mpi2_wait:        mpi2_wait.c
        /mpp/bin/cc -V -Tcray-t3d -X2 -o mpi2_wait mpi2_wait.c \
        -I$(MPI2_15_INC_PATH) \
        -L$(MPI2_15_LIB_PATH) \
        -lmpi2
####################

/* mpi2_wait.c */
/* Measure communication rates between two PEs using MPI2_Put 
*  and MPI_Wait.  
*
*  Tom Baring, ARSC, August, 1996 
*/

/* Note on linking:  do not link libmpi.a -- libmpi2.a has MPI2 
*   specific version of MPI_Wait().
*/
#include <stdio.h>
#include <mpp/limits.h>
#include <mpp/time.h>
#include "mpi.h"      /* Must include mpi.h and mpi2.h: mpi.h first. */
#include "mpi2.h"

#define BUFSZ 1000000

/* d=delta clock tics; n=number words*/
#define USECS(d)  ((float)(d)*1000000.0/(float)CLK_TCK)
#define MBR(d,n)  (((float)(n)*8.0/1000000.0)/(USECS(d)/1000000.0))

#define NCOUNTERS 1
#define COUNTER_0 0

void dump_status (MPI_Status *s);

main (int argc, char *argv[]) {
  int         n;
  long        nwords;
  int         mype, otherpe, npes; 
  int         err, cnt;
  long        buf[BUFSZ]; 
  long        t1,t2;
  fortran     irtc();
  MPI_Comm    comm;
  MPI_Request req;
  MPI_Status  status;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &mype);
  MPI_Comm_size(MPI_COMM_WORLD, &npes);

  if (npes < 2) {
    printf ("ERROR: minimum of 2 pes required.\n");
    goto FINALIZE;
  }


  MPI2_RMC_init (buf, BUFSZ, MPI_LONG, NCOUNTERS, &req, 
                 MPI_COMM_WORLD, &comm);

            
  /* Initialize buf */
  for (n = 0; n < BUFSZ; n++) 
    buf[n] = 0; 

        
  /* Divide "work" among pes */
  switch (mype) {
    case 0:
      otherpe = 1;
        
      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {

        MPI_Barrier(MPI_COMM_WORLD);

        t1 = irtc ();
        err = MPI2_Put (buf, nwords, MPI_LONG, otherpe, 0, nwords,  
                        MPI_LONG, comm, COUNTER_0, 1);
        t2 = irtc ();

        printf ("SENDER: nwords=%ld ticks=%ld usecs=%f\n", 
          nwords, (t2-t1), USECS(t2-t1));
      }
      
      break;


    case 1:
 

      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        err = MPI2_Set_counter_threshold (req, 1);

        MPI_Barrier(MPI_COMM_WORLD);

        t1 = irtc ();
        err = MPI_Wait (&req, &status);
        t2 = irtc ();


        printf ("RECEIVER: nwords=%ld ticks=%ld usecs=%f; mbr=%f\n", 
          nwords, t2-t1,  USECS(t2-t1), MBR(t2-t1,nwords);
      }
      
      break;


    default:
      break;
  }

  
FINALIZE:
  MPI_Finalize ();

}

======================================================================
Here is the output from these programs, obtained on ARSC's T3D:

 [ shmem version ]
RECEIVER: nwords=1 ticks=1317 usecs=8.779122; mbr=0.911253
RECEIVER: nwords=10 ticks=957 usecs=6.379362; mbr=12.540440
RECEIVER: nwords=100 ticks=1476 usecs=9.839016; mbr=81.308947
RECEIVER: nwords=1000 ticks=10136 usecs=67.566573; mbr=118.401743
RECEIVER: nwords=10000 ticks=95936 usecs=639.509348; mbr=125.095904
RECEIVER: nwords=100000 ticks=956004 usecs=6372.722388; mbr=125.535046
RECEIVER: nwords=1000000 ticks=9478044 usecs=63180.638567; mbr=126.621069
SENDER: nwords=1 ticks=1094 usecs=7.292604
SENDER: nwords=10 ticks=758 usecs=5.052828
SENDER: nwords=100 ticks=1577 usecs=10.512282
SENDER: nwords=1000 ticks=10163 usecs=67.746555
SENDER: nwords=10000 ticks=95966 usecs=639.709328
SENDER: nwords=100000 ticks=956124 usecs=6373.522308
SENDER: nwords=1000000 ticks=9478176 usecs=63181.518478

 [ MPI2 version ]
RECEIVER: nwords=1 ticks=4549 usecs=30.323633; mbr=0.263821
RECEIVER: nwords=10 ticks=3419 usecs=22.791053; mbr=3.510149
RECEIVER: nwords=100 ticks=4508 usecs=30.050327; mbr=26.622007
RECEIVER: nwords=1000 ticks=13048 usecs=86.977964; mbr=91.977319
RECEIVER: nwords=10000 ticks=101002 usecs=673.279303; mbr=118.821416
RECEIVER: nwords=100000 ticks=977533 usecs=6516.234696; mbr=122.770286
RECEIVER: nwords=1000000 ticks=9658318 usecs=64382.344998; mbr=124.257667
SENDER: nwords=1 ticks=3987 usecs=26.577341
SENDER: nwords=10 ticks=3456 usecs=23.037695
SENDER: nwords=100 ticks=4104 usecs=27.357263
SENDER: nwords=1000 ticks=12858 usecs=85.711424
SENDER: nwords=10000 ticks=100647 usecs=670.912873
SENDER: nwords=100000 ticks=977088 usecs=6513.268326
SENDER: nwords=1000000 ticks=9657798 usecs=64378.878679
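
[ Reading the RECEIVER lines above side by side:

   nwords=1        shmem   8.78 usecs    MPI2  30.32 usecs    (+21.5 usecs)
   nwords=100      shmem   9.84 usecs    MPI2  30.05 usecs    (+20.2 usecs)
   nwords=10000    shmem 125.10 MB/s     MPI2 118.82 MB/s     ( -6.3 MB/s)
   nwords=1000000  shmem 126.62 MB/s     MPI2 124.26 MB/s     ( -2.4 MB/s)

  These differences are consistent with the roughly 20 microsecond latency
  and 1-10 mbytes/sec bandwidth overhead per "Put" quoted at the top of
  this article. ]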

PSC Accepts Production Model of T3E

[ This is from a press release we received yesterday. ]

>  
> PITTSBURGH SUPERCOMPUTING CENTER ACCEPTS FIRST 
> PRODUCTION MODEL OF POWERFUL NEW SYSTEM FROM 
> CRAY RESEARCH 
>  
> Highly Parallel CRAY T3E Advances U.S. Science and 
> Engineering 
>  
> CHIPPEWA FALLS, Wisc., Aug. 15, 1996 -- The Pittsburgh 
> Supercomputing Center (PSC) today formally accepted the first 
> production model of the CRAY T3E. Produced by Silicon Graphics 
> subsidiary Cray Research, the CRAY T3E is one of the world's 
> most powerful supercomputers, a "highly parallel" system that 
> teams tens, hundreds or thousands of processors to work 
> simultaneously on the same computing task. 
>  
> PSC's CRAY T3E system will have 512 processors and be capable 
> of more than 300 billion calculations per second. This 
> represents a four-fold increase in scientific productivity over 
> PSC's current highly parallel system, the CRAY T3D. 
>  
> This new system substantially boosts the computational 
> capability of U.S. scientists and engineers, said PSC scientific 
> directors Michael Levine and Ralph Roskies: "This is a big step 
> forward in U.S. high-performance computing." 
> 


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions and Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.