ARSC T3D Users' Newsletter 82, April 12, 1996

MPI and PVM speeds (and SHMEMS too)

I guess I was too flippant in last week's newsletter when I dismissed SHMEM as a way to get good performance on the T3D for short messages. Frank Chism of CRI takes me to task for not using SHMEM:


> Mike,
> 
> I think you are really missing the point of the shmem library.  If you
> want to show the T3D is useful in the case where there are many short
> messages,  you will get the best results using shmem.
> 
> Here are the results of a ping-pong test where the latency is defined
> as one half the round trip time for a one word message exchanged between
> a pair of PEs.  This test times every possible pair of PEs and reports
> the global minimum, maximum and average latencies:
> 
> rain% ping_pong -npes 16
>  
> BEGIN SINGLE PAIRS PING-PONG PUT LATENCY TEST WITH NPES=16
>  
>  
> Begin single pairs of each PE to all other PEs ping-pong latency test
> ****************************************************************
> * Max latency for paired PE one word ping-pong is 3.433507 microseconds
> * Min latency for paired PE one word ping-pong is 1.629012 microseconds
> * Ave latency for paired PE one word ping-pong is 2.076176 microseconds
> * Average MB/s for paired PE one word ping-pong is 3.853237MB/s
> ****************************************************************
> 
> Somewhat better than 77 microseconds, eh?  Perhaps even worth using
> something that is not standard.
> 
> With regard to bandwidth,  here is a simple test that exchanges messages
> between pairs of PEs with all PEs sending at once.  Again all possible
> pairs are tried and the global statistics reported:
> 
> rain% multiple_pairs -npes 16
>  
> BEGIN MULTIPLE PAIRS PUT TEST WITH NPES=16
>  
>  
> Begin all pairs of each PE to all other PEs send latency test
> ======================================================================
> ======================================================================
> =For all distances to neighbors with all PEs sending and receiving:
> =
> = Max latency for paired PE one word messages is 3.065986 microseconds
> = Min latency for paired PE one word messages is 0.874104 microseconds
> = Ave latency for paired PE one word messages is 1.008117 microseconds
> = Average MB/s for paired PE one word messages is 7.935589MB/s
> ======================================================================
> ======================================================================
>  
> Begin all pairs of each PE to all other PEs send bandwidth test
> ****************************************************************
> * Rmax (Rinf) = 61.431111 MB/s
> ****************************************************************
>  
>   Search for N one-half
> ****************************************************************
> * N 1/2 = 30.715556 MB/s at 10 words (80 bytes)
> ****************************************************************
> 
> Remember this is with ALL PEs sending data to some other PE at any given
> time.  For single PE sending to one other PE with the other PEs idle
> the results are:
> 
> rain% single_pairs -npes 16
>  
> BEGIN SINGLE PAIRS PUT TEST WITH NPES=16
>  
>  
> Begin single pairs of each PE to all other PEs send latency test
> ****************************************************************
> * Max latency for paired PE one word messages is  2.635556 microseconds
> * Min latency for paired PE one word messages is  1.066142 microseconds
> * Ave latency for paired PE one word messages is 1.096107 microseconds
> * Average MB/s for paired PE one word messages is 7.298560MB/s
> ****************************************************************
>  
> Begin single pairs of each PE to all other PEs send bandwidth test
> ****************************************************************
> * Rmax (Rinf) = 126.016816 MB/s
> ****************************************************************
>  
>   Search for N one-half
> ****************************************************************
> * N 1/2 = 63.008408 MB/s at 22 words (176 bytes)
> ****************************************************************
> 
> So,  when communications is a problem,  use shmem for performance.  The
> effort will be rewarded.  It's kind of like evaluating a vector machine
> and limiting your tests to only scalar code.
> 
> Frank
Frank is right; I should have investigated SHMEM and provided that alternative to the customer.
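
For readers who want to experiment, here is a minimal sketch of a one-word
ping-pong test in the spirit of Frank's (this is not his actual code). It is
written against the SHMEM calls as I understand them on the T3D (shmem_put,
shmem_wait, and the _my_pe()/_num_pes() intrinsics); check the shmem man
pages for the exact names and argument conventions on your system, and treat
the clock()-based timing as illustrative only.

  /* ping_pong_sketch.c -- one-word SHMEM ping-pong latency estimate.       */
  /* Hypothetical example; names and header paths should be verified.       */
  #include <stdio.h>
  #include <time.h>
  #include <mpp/shmem.h>                 /* Cray SHMEM header               */

  #define REPS 10000

  long flag = 0;                         /* symmetric: same address on all PEs */

  int main(void)
  {
      int  me      = _my_pe();
      int  npes    = _num_pes();
      int  partner = me ^ 1;             /* pair PE 0 with 1, 2 with 3, ... */
      long token   = 1;
      int  i;
      clock_t t0, t1;

      if (partner < npes) {
          t0 = clock();
          for (i = 0; i < REPS; i++) {
              if (me < partner) {
                  shmem_put(&flag, &token, 1, partner);  /* send one word    */
                  shmem_wait(&flag, 0);                  /* spin until reply */
                  flag = 0;
              } else {
                  shmem_wait(&flag, 0);                  /* spin until ping  */
                  flag = 0;
                  shmem_put(&flag, &token, 1, partner);  /* send it back     */
              }
          }
          t1 = clock();
          if (me == 0)
              printf("approx. one-way latency: %g microseconds\n",
                     1.0e6 * (t1 - t0) / CLOCKS_PER_SEC / (2.0 * REPS));
      }
      return 0;
  }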

Linking Tricks

Alan Watson of the Department of Astronomy at New Mexico State University has sent in this tip for linking the benchlib routines:

> This is what I use to transparently link C with the unsupported benchlib
> faster math routines (the ones mentioned in newsletters 29 and 33).
> It's not exactly rocket science, but it took me a little while to
> figure out:

> cc -Tcray-t3d \
> -Dcos=_COS -Dexp=_EXP -Dlog=_ALOG -Dpow=_RTOR -Dsin=_SIN -Dsqrt=_SQRT \
> ... \
> /full/path/to/lib_scalar.a -lm
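
As a quick illustration of what those -D flags do (a made-up example, not
part of Alan's message): the preprocessor rewrites each math call before
compilation, so the calls below resolve to the benchlib entry points in
lib_scalar.a rather than the -lm versions. The prototypes in math.h get
renamed the same way, so the declarations stay consistent.

  /* fastmath_demo.c -- hypothetical file; compile with the cc line above.  */
  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      double x = 2.0;
      printf("sqrt(%g)   = %g\n", x, sqrt(x));      /* becomes _SQRT(x)      */
      printf("exp(%g)    = %g\n", x, exp(x));       /* becomes _EXP(x)       */
      printf("pow(%g, 3) = %g\n", x, pow(x, 3.0));  /* becomes _RTOR(x, 3.0) */
      return 0;
  }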

News from Pharoh

The Arctic Region Supercomputing Center is a member of the NSF project "Pharoh," along with the Ohio Supercomputer Center and the Pittsburgh Supercomputing Center. Pharoh is a Regional Alliance to promote awareness of the benefits of HPCC technologies and to develop expertise among industrial and academic users. Pharoh has a nice webpage at http://www.osc.edu/Pharoh/pharoh.html . Here is an excerpt from the latest update on Pharoh activities:

> Upcoming T3D Workshops
> ----------------------
>    OSC - May 7-8, 1996
>    PSC - May 20-23, 1996
> 
> Upcoming workshops on related topics
> ------------------------------------
>    PSC/Digital Workshop on Optimized Medium Scale Parallel Program,
>        April 29 - May 2, 1996
>    PSC/Biomedical Workshop: Supercomputing Techniques for Biomedical
> Researchers
>        May 5-9, 1996
>    OSC/Scientific Visualization (AVS)
>        May 21-22, 1996
>    OSC/Parallel Computing in Education (see below)
>        May 17, 1996
>    NCSU/Regional Training Center for Parallel Processing Workshop
>        May 6-9, 1996 (see http://renoir.csc.ncsu.edu/RTCPP/)
> 
> Available Proceedings
> ---------------------
> Proceedings of the "The Meeting of Optimization of Codes for the Cray MPP
> Systems". A collection of the presentation materials from each of the
> speakers.
> 
> Proceedings of the Parallel CFD '95 Conference, Pasadena, CA, 26-29 June 1995.
> The papers presented at this conference included subject areas such as novel
> parallel algorithms, parallel Euler and Navier-Stokes solvers, parallel
> Direct Simulation Monte Carlo method, parallel multigrid techniques, parallel
> flow visualization and grid generation, and parallel adaptive and irregular
> solvers. (Copyright 1996 Elsevier Science B.V., ISBN 0-444-82322-0).

T3D Class at ARSC

I taught a T3D class this week, and a point I made over and over is summarized in the table below, taken from page 17 of the Cray training manual "Cray T3D Applications Programming" (TR-T3DAPPL). The table lists the time it takes on the T3D for data to be loaded from one storage area to another. The first column of numbers is the time in clock periods; from that number and the number of words transferred, the bandwidth per operation can be computed (see the worked example after the table).

> T3D Load Times/BWs
>                                                            Bandwidth
>                                                            64bit
> Source             Destination             Latency(cp)    Wds/cp  MB/s
> 
> Cache                                            3         1/1    1200
> DRAM ------------> Cache(in page)               24         4/24    200
> DRAM ------------> Cache(out of page)           39         4/39    123
> DRAM Read Ahead -> Cache                        15         4/15    320
> DRAM ------------> Register(in page)            22         1/22     55
> DRAM ------------> Register(out of page)        37         1/37     32
> Remote DRAM -----> Cache(in page)              107         4/111    43
> Remote DRAM -----> Cache(out of page)          122         4/126    37
> Remote DRAM -----> Register(in page)            86         1/83     15
> Remote DRAM -----> Register(out of page)       101         1/98     12
> Remote DRAM -----> Prefetch(in page)          ~ 86         1/21     67
> Remote DRAM -----> Prefetch(out of page)      ~101         1/23     50
> 
> All remote DRAM values assume no network contention and nearest neighbor(not
> same node) communication. For more distant neighbors, add 1 cp for each 
> network switch passed and 2 cp's for each change of direction(maximum of 2).
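
As a check on the arithmetic (assuming the T3D's 150 MHz clock, about 6.67 ns
per clock period): the in-page DRAM-to-cache line moves 4 words (32 bytes) in
24 cp, and 32 bytes / (24 * 6.67 ns) is roughly 0.2 bytes/ns, which is the
200 MB/s shown in the table. Likewise, the cache line's 1 word per cp is
8 bytes / 6.67 ns, or about 1200 MB/s.
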
The numbers may not be exact, but I think the broad picture is correct. A remote fetch at the hardware level costs, in clock periods, something like:

  101 + (number of dimension changes, at most 2) * 2 + (number of hops) * 1
from one PE to another.
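
As a quick worked example of how small the placement terms are: a remote
fetch that passes four network switches with one change of direction would
cost roughly 101 + 1 * 2 + 4 * 1 = 107 clock periods, only about 6% more than
the base figure.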

So optimizing for physical placement (the number of hops and dimension changes between PEs) is not nearly as important as reducing the number of messages and the message software overhead. It's not often in the MPP world that the situation is this simple, but the T3D is one of the best cases.


Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.