ARSC T3D Users' Newsletter 72, February 2, 1996

Meeting on the Optimization of Codes for the CRAY MPP Systems

The meeting on optimizing T3D codes at the Pittsburgh Supercomputing Center (PSC), January 24th to 26th, went very well. There were about 56 attendees, all united in the common goal of getting high performance out of the T3D. When subjects like padding, mppexec and craft came up, everyone in the audience knew what was what and had their own stories to tell.

Cathy Mulligan, who ran the conference for PSC, has collected the slides of each presenter and will have them duplicated for each attendee. When I get my copy of these slides, I will again list the talks and readers of this newsletter can request copies from me. The PSC staff were very helpful and provided attendees a lot of information on many topics.

Below are some of my impressions of the talks. It was interesting how often cache utilization came up during each of the talks. Cache utilization was as important at this meeting as vectorization would be at a conference on Y-MP optimization. On the down side, there were many disturbing results where parallelization efforts conflicted directly with cache utilization.

Short Summaries

Performance of a Major Oil Reservoir Simulation, Olaf Lubeck (LASL)

The speaker described his group's results on a CRADA project. This project is to develop a portable MPP oil reservoir model. His current results show a strong conflict between portability/maintainability and performance. He gave a very telling "back of the envelope" explanation of why most users were getting only about 10 mflops/PE on the T3D:

Consider a DO loop that loads two operands and stores one result per loop trip, and assume:

  1. No caching
  2. The floating point operations (4 per trip) are free

Maybe something like:

  do i = 1, 100, 4                  ! the skip defeats caching
     c(i) = ( a(i) * a(i) + b(i) * b(i) ) / 2.0
  end do

Then for each trip thru the loop we have:

  load a(i)                                   24 clock periods
  load b(i)                                   24 clock periods
  floating point operations (and indexing)    free
  store c(i)                                  12 clock periods

Then for the speed of this loop we have

     ( number of flops / time per trip ) * processor clock rate 

  =  ( 4 flops / 60 clock periods ) * 150 MHz

  =  10 mflop/s

Sure, there are a lot of special assumptions above, but cache utilization comes out as critical. If the loop is like:

  do i = 1, 100                   ! cache provides 3 easy hits
     c(i) = ( a(i) * a(i) + b(i) * b(i) ) / 2.0
  end do

then the average load time per operand is something like:

  ( 24 + 3 + 3 + 3 ) / 4 = 8.25 clock periods

and the speed goes way up. (The access time for an operand already in cache is only 3 clock periods.)
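
The arithmetic above can be checked with a few lines of Fortran. This is only a sketch, not code from the talk. The clock-period figures are the ones quoted above; the program and variable names are made up for illustration. It also assumes that a store still costs 12 clock periods in the cached case and that one miss followed by three hits is the whole cache effect.

```fortran
program envelope
  implicit none
  ! Figures quoted in the summary above.
  real, parameter :: clock_mhz  = 150.0   ! processor clock rate
  real, parameter :: load_miss  = 24.0    ! clock periods, operand from memory
  real, parameter :: load_hit   = 3.0     ! clock periods, operand from cache
  real, parameter :: store_cost = 12.0    ! clock periods per store
  real, parameter :: flops      = 4.0     ! flops per trip, assumed free
  ! Assumption: one miss brings in 4 operands, so 1 miss then 3 hits.
  real, parameter :: words_per_line = 4.0
  real :: avg_load, cp_uncached, cp_cached

  ! Stride-4 loop: both loads miss on every trip.
  cp_uncached = 2.0 * load_miss + store_cost

  ! Unit-stride loop: average load cost over 4 consecutive trips.
  avg_load  = (load_miss + (words_per_line - 1.0) * load_hit) / words_per_line
  ! Assumption: the store cost is unchanged by caching.
  cp_cached = 2.0 * avg_load + store_cost

  print '(a, f6.2, a)', 'uncached: ', flops / cp_uncached * clock_mhz, ' mflop/s'
  print '(a, f6.2, a)', 'cached:   ', flops / cp_cached   * clock_mhz, ' mflop/s'
end program envelope
```

Under those assumptions the unit-stride loop comes out at about 21 mflop/s, roughly double the 10 mflop/s of the stride-4 loop, with the store now the largest single cost per trip.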

Parallel Sequence Analysis, Alexander J. Ropelewski (PSC)

PSC has been a champion of parallel sequence analysis and the speaker showed us what can be done on various platforms. The problem seems to be constrained by large datasets and their attendant I/O problems. With the T3E, PSC sees a cleaner interface to I/O and so better performance. I was surprised by the number of presenters who have reached a limit on optimizing their code and were waiting for the T3E.

Software Package for Simulation of Electro-magnetic Fields, Daniel Katz (CRI)

The speaker described the problems of porting an entire package to MPP platforms. (Too often we hear of good MPP results on a portion of a code, but that effort is useless until the entire code runs on the MPP.) In most useful applications, each component must have a solution on the MPP, but the run time of each component determines how much optimization effort it deserves. To the surprise of most attendees, the speaker showed that Craft was an efficient, quick solution for those components that had a minimal effect on run time. This allowed the programming team to concentrate their parallelization and optimization efforts on the components that determine run time.

The Parallel Finite Element Method, Dave O'Neal (PSC)

The speaker's talk described a new algorithm for parallel implementation of the finite element method as a replacement for the more traditional preconditioned conjugate gradient method (PCG). This algorithm has a greater potential for parallelism than PCG and is described in the 1995 Spring CUG proceedings. Contact the speaker for further details.

Mathematical Offsetting Scheme to Improve Alignment and Enhance Performance on MPP Systems, David Wong (NCSU)

The audience was very interested in these results. David Wong showed that the improvement gained by padding arrays to reduce cache conflicts is also a function of the number of processors: a padding strategy that succeeds for N processors might fail for 2*N processors. He then devised criteria for choosing a padding that reduces cache conflicts independent of the number of processors. Using these criteria he could calculate optimal padding in specific cases, and he showed speedups that held up across a range of processor counts. The results were complicated, and all of us who have wrestled with this problem are interested in seeing the details of his unifying results in copies of his slides.
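
For readers new to padding, the basic effect can be sketched in Fortran. This is not the speaker's scheme, only an illustration under stated assumptions. It assumes a direct-mapped data cache of 1024 eight-byte words with 4 words per line (a geometry the newsletter does not state), and three arrays stored back to back, as in a COMMON block. The array length n, the pad values and the function line_of are hypothetical.

```fortran
program padding_sketch
  implicit none
  ! Assumed cache geometry (not stated in the newsletter): a direct-mapped
  ! data cache of 1024 eight-byte words with 4 words per line.
  integer, parameter :: cache_words = 1024
  integer, parameter :: line_words  = 4
  integer, parameter :: n = 2048          ! hypothetical per-PE array length
  integer :: pad

  ! Arrays a, b and c are assumed to sit back to back, each n + pad words.
  do pad = 0, 8, 4
     print '(a, i2, a, 3i5)', 'pad = ', pad, &
          '  cache lines of a(1), b(1), c(1):', &
          line_of(0), line_of(n + pad), line_of(2 * (n + pad))
  end do

contains

  ! Cache line that the word at the given offset (in words) maps to.
  integer function line_of(offset)
    integer, intent(in) :: offset
    line_of = mod(offset, cache_words) / line_words
  end function line_of

end program padding_sketch
```

With no padding, a(1), b(1) and c(1) all map to the same cache line, so a loop touching a(i), b(i) and c(i) evicts its own data on every trip. A pad of 4 or 8 words moves the three arrays onto different lines. In a distributed code the per-PE length n changes with the number of PEs, so a pad that separates the arrays for one PE count need not do so for another.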

Parallel Simulated Annealing on the T3D, Carlos Gonzalez (PSC)

Optimization by genetic search or simulated annealing requires a large number of expensive function evaluations and MPP seems to be the right tool. But the search strategy has to coordinate the efforts of the processors doing the evaluations. The speaker described his efforts to combine the results on the individual processors into a coherent successful search. Interestingly, most of the development work was done on the C90 with a simulator of his parallel code and then when results were reasonable on that friendly platform, the code was moved to the T3D.

AMBER 4.1 for the T3D, James Vincent (Penn State University); Parallel AMBER Enhancements and Particle-Mesh Ewald Electrostatics, Tom Darden (National Institute of Environmental Health Sciences); CHARMM, Bill Young (PSC)

PSC is famous for parallel chemistry/biochemistry applications like GAMESS, AMBER and CHARMM. These packages are available in source form from their original authors, and researchers all over the world have taken the source and modified it for their own molecules and machines. In these areas, the optimizations are not all on the machine side; sometimes the optimizations seem like cheating the laws of physics, with their limits on force fields and disappearing energies.

Lattice QCD Simulation Programs on the Cray T3D, S.P. Booth (Univ. of Edinburgh); Implementing the Monte Carlo and Sparse Matrix Algorithms Needed for Lattice QCD on the T3D, Greg Kilcup (OSU); High Performance MPP Codes, Nicholas Mark Hazel (Univ. of Edinburgh)

These QCD people need cycles and they're willing to go to assembly language to get them. I would like to believe that assembly language on the T3D is as easy as they describe but I vote for better compiler efforts.

Experience with Early Versions of the HPF Compilers for the T3D, Mike Ess (ARSC)

This was an updated version of my presentation at the Alaska CUG. I went into more detail than before and presented some results on the 2.0 version of PGHPF.

F-- : A Minimalist's View of Parallel Fortran for Shared and Distributed Memory Computers, Robert Numrich (CRI)

The speaker reported on CRI's efforts to prototype this extension to Fortran 90. The extension cleverly makes explicit the distinction between what is local and what is remote on a distributed memory machine. The author is looking for user input on this extension.

Introduction to the CRAY T3E, Peter Rigsbee (CRI)

This talk provided more detail than was given at the Alaska CUG (newsletter #54, 9/30/95). The main differences from that newsletter that I noted were:
  1. The 300MHz version of the 21164 chip will be available in quantity for T3E shipment. (Earlier reports had said the 275MHz version would be used.) Future models of the T3E will be able to use versions of the 21164 that run faster than 300MHz.
  2. All communication functions will be done thru the 512 off-chip E registers and users will be able to access these registers at the assembly language level. A message queue is implemented with these registers and will be accessible to the users.
  3. Hardware (peak) transfer rates are 600MB/second thru the E registers.
  4. Initial memory configurations will be 64MB, 128MB, 256MB or 512MB per PE. Later models could support 1 and 2 GB versions.
  5. Configurations will consist of user PEs and service PEs in an approximate ratio of 16 to 1. The service PEs will initially be dedicated to service tasks such as being the redundant processor and providing OS functions. The user PEs will run three servers: a process server, a file system server and an error manager.
  6. All of UNICOS functionality should be available to users on each PE by the end of 1996.
  7. Initial results seem to show a 3X performance improvement on optimized T3D codes and a 5X improvement on unoptimized T3D codes.
  8. There is a backlog of orders thru 1996.
  9. All T3D problems will be fixed on the T3E. (JOKE)

Design and Implementation of Efficient Bitonic Sorting Algorithms on the Cray T3D, Chua-Huang Huang; Optimizing the ARPS Model for Execution on MPPs, Adwait Sathye (Univ. of OK)

This effort again brought out the conflict between portability and maintainability versus performance.

I/O Optimization on the T3D, John Urbanic (PSC)

Many of the attendees who left early missed these last but illuminating results. The speaker approached the I/O as an information path between T3D processors and archived storage. PSC has compared hardware peak performance at each segment of the path to measured performance for each segment of the path. So optimizing I/O on the T3D becomes a case history in "bottle-neckology" with some bottlenecks very restrictive on hardware peaks. Two interesting results that were confirmed by audience participation were:
  1. Optimal I/O involves at most 4 to 8 PEs doing all I/O, otherwise additional contention reduces total bandwidth.
  2. After all the tricks are played, 80 Mbytes/second is all that can be expected for realized I/O bandwidth, and in many cases not even this can be expected.
It is lamentable that these results have become generally known only after two years in the field. The individual effort spent reaching similar results independently at several sites has been wasted.

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.