ARSC HPC Users' Newsletter 242, April 01, 2002

T3E MPI Performance Followup

Thanks to Ed Anderson of Lockheed Martin Technology Services for this email:

You reported MPI performance results in newsletter 240 (March 5) that showed significant improvement for mpt 1.4 over mpt 1.3 on the T3E. What was your setting of the environment variable MPI_BUFFER_MAX for these tests? If you do a lot of large transfers without other work in between, the buffering will slow you down, so you're better off going memory to memory. I like a setting of MPI_BUFFER_MAX=2048, which limits buffering to messages less than 2048 bytes.

This inspired a little more investigation, which turned up the following in man mpi.

From man mpi under mpt.


MPI_BUFFER_MAX

Specifies maximum buffer size. Specifies a maximum message size, in bytes, that will be buffered for MPI standard, buffered, or ready send communication modes.

Default: No limit

Operating system: UNICOS/mk systems

From man mpi under mpt.


MPI_BUFFER_MAX

Specifies a maximum message size, in bytes, that will be buffered internally within MPI for standard, buffered, or ready-send communication patterns. Setting MPI_BUFFER_MAX to 0 disables buffering, except for noncontiguous or misaligned data, which must always be buffered.

Using the default buffering setting provides the best bandwidth performance for synchronized MPI programs (where the sender and receiver are both available at the same time to transfer data). If senders and receivers are not synchronized, it may be better for overall performance to buffer data internally. To do this, set the MPI_BUFFER_MAX environment variable to the number of bytes that you want buffered.

On Cray T3E-600 systems, MPI_BUFFER_MAX is ignored unless hardware data streams are disabled.

Default: 0 bytes

Operating system: UNICOS/mk

As a test, I reran the "ring" program under both MPTs, once with MPI_BUFFER_MAX=10000000 to mimic the "No limit" default and once with MPI_BUFFER_MAX=0, the other default.

Here are the last two timing results from each of these four runs:

  MPI_BUFFER_MAX =  10000000
   BUFFER =  32767 TIME =    3862.1 Microsec   BW =  67.9 MB/sec
   BUFFER =  32768 TIME =    1853.8 Microsec   BW = 141.4 MB/sec
   BUFFER =  32767 TIME =     811.3 Microsec   BW = 323.1 MB/sec
   BUFFER =  32768 TIME =     814.6 Microsec   BW = 321.8 MB/sec
  MPI_BUFFER_MAX =  10000000
   BUFFER =  32767 TIME =    3484.0 Microsec   BW =  75.2 MB/sec
   BUFFER =  32768 TIME =    1583.0 Microsec   BW = 165.6 MB/sec
   BUFFER =  32767 TIME =     901.8 Microsec   BW = 290.7 MB/sec
   BUFFER =  32768 TIME =     901.4 Microsec   BW = 290.8 MB/sec
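
The BW column can be reproduced from the BUFFER and TIME columns. The short sketch below does so under two assumptions that are not stated in the output itself but are consistent with every figure printed above: BUFFER counts 8-byte words, and 1 MB means 10^6 bytes. The function and variable names are illustrative only and are not taken from the "ring" program.

```python
# Assumption: BUFFER is a count of 8-byte words (not stated in the output above).
WORD_BYTES = 8


def bandwidth_mb_per_sec(buffer_words, time_microsec, word_bytes=WORD_BYTES):
    """Return bandwidth in MB/sec, taking 1 MB = 10**6 bytes (an assumption)."""
    nbytes = buffer_words * word_bytes
    # Bytes per microsecond is numerically equal to 10**6 bytes per second.
    return nbytes / time_microsec


# (BUFFER, TIME in microseconds) pairs copied from the first four timing lines.
timings = [(32767, 3862.1), (32768, 1853.8), (32767, 811.3), (32768, 814.6)]

for words, usec in timings:
    bw = bandwidth_mb_per_sec(words, usec)
    print(f"BUFFER = {words:6d} TIME = {usec:9.1f} Microsec   BW = {bw:5.1f} MB/sec")
```

Rounded to one decimal place, the computed values agree with the BW figures in the listing, which is what suggests the 8-byte word size.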

This suggests that MPI_BUFFER_MAX does indeed account for the apparent improvement. It's also likely responsible for the observation made in issue #239:


that some codes that worked fine on the T3E under MPT (but on no other platform) stopped working under MPT

Used Cyclotron Achieves 10 Peta-FLOPS, Sustained Performance

For Immediate Release April 1, 2002 Fairbanks, AK

Self-taught computer engineer Svensen Smits, of Circle City, Alaska, has escaped national attention. Until now. Three years ago, Smits purchased a cyclotron on e-bay and has brought it to life using a clever dialect of the FORTRAN IV.1 programming language.

Smits' results have astounded physicists and computer scientists alike. According to Guy Robertson of the Arctic Region Supercomputing Centre, "No one ever dreamt that a particle accelerator could be used to solve partial differential equations."

"I've spent my whole life outside the box," says Smits. "Once in a while, it pays off."

The principle is really very simple. The accelerator's beam and detectors are adjusted according to his FORTRAN program, and when the beam strikes the target -- a fleck of gold which Smits panned himself on the Chatanika River, "one day last summer, when the fishing got bad" -- over 10^23 interactions take place instantaneously. This massive, massive, massive parallelism is captured and the results are plotted on the iBook front-end.

Once he completes the C++ compiler for his invention and installs a robotic tape silo to handle all the data it can produce, Smits plans to lease time to HPC centers worldwide. "Users won't even have to learn MPI."

One drawback: the "Superconducting Cyclo-Computron" only runs in winter, on those days when the temperature in Smits' garage drops below -30 degrees Fahrenheit. "I'll be jiggered if I'm gonna buy an AC unit when I live in Alaska!" mutters Smits.

The inventor transported his $1,300 e-bay purchase from a storage unit in Cambridge, MA to his home in Circle City using a 1979 Ford F-150 pickup truck. It took 9 trips on the Alcan Highway, 17 spare tires, and 6 months. It was worth it, says Smits, though he claims he put on 30 pounds from way too much fast food and sitting.

On-Line Resources: Citation Index and Comp Bio.

[ A reader sent this suggestion and comment: ]

> You might visit the ResearchIndex page:
>
> This has a link to a very nice search engine for CS publications:
>
> which has the nice feature that it has links to online copies of the
> papers, because it finds them on the Web in the first place. It's
> always fun to type your own name in and see what it finds :-) And
> there's also a link to a list of the most cited authors in computer
> science:
>
> which I found fascinating. I know three of the top five. The top 30
> names or so read like a who's who of computer science, though after that
> the names I recognise get thin on the ground. I had a look at Donald
> Knuth's Web page and found it very interesting and entertaining - check
> it out:
>
> He's apparently working on volume 4 of The Art of Computer
> Programming! It's not available yet, but are taking
> advance orders, so its publication must be imminent.

[ Jim Long of ARSC sent this 'round: ]

> Archive of classic computational biology papers:



International Workshop on


Moscow State University, Moscow, Russia October 21-23, 2002

Supported by IEEE Computer Society Technical Committee on Parallel Processing and IEEE Task Force on Cluster Computing

(Submission deadline: May 1, 2002)

Local networks of computers are the most common and available parallel architecture now.

Unlike dedicated parallel computer systems, local networks are inherently heterogeneous. They consist of diverse computers of different performances interconnected via mixed network equipment providing communication links of different speeds and bandwidths.

Traditional parallel algorithms and tools are aimed at homogeneous multiprocessors and cannot be efficiently used for parallel computing on heterogeneous clusters. New ideas, dedicated algorithms, and tools are needed to efficiently use this new type of parallel architecture. The workshop is intended to be a forum for people working on algorithms, programming languages, tools, and theoretical models aimed at the efficient solution of problems on heterogeneous clusters.

Access Grid Class: "Porting Cray Code to the NERSC SP"

ARSC will be "sitting in" on this class from NERSC, being broadcast via the Access Grid:


"Porting Cray Code to the NERSC SP"

Where and When:

109 Butrovich, University of Alaska Fairbanks Thursday 4 April, 8:00-9:30am ALASKA TIME

More Information:

Feel free to drop in and take a look if you are interested.

Quick-Tip Q & A

A:[[ Recently I was debugging a code on the SV1 using totalview.  In one
  [[  of the subroutines there were several matrices declared as:
  [[      COMPLEX         A( LDA, * ), B( LDB, * )
  [[      COMPLEX         Z( LDZ, * )
  [[ Where LDA = LDB = LDZ = 147.  When viewing the values, totalview
  [[ displayed all the data arrays as all being dimensioned (147, 1) when
  [[ in reality the sizes were A(147,147), B(147,147) and Z(147,15).
  [[ I tried changing the sizes in the variable display window, but
  [[ couldn't get it to work.  I also found that I could type in single
  [[ values to look at, say Z(15,15) and that would work.  However, I'd
  [[ like to be able to see the whole matrix at once.  Does anyone know if
  [[ this can be done?

  # ARGH! No responses, and the editors don't have an answer ready to go.

Q: I've been connecting remotely to an SGI Octane2. I use the DISPLAY
   environment variable to export the X Windows display back to my
   personal workstation.

   For some reason, when I sit down at this SGI and log onto the
   console, the screen flashes, and it immediately logs me off. I'm
   definitely not over-quota, my account is active, and everything works
   perfectly when I connect remotely again.

   Any ideas what's up?

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.