ARSC T3D Users' Newsletter 28, March 24, 1995

A Summary of T3D Information from the Denver CUG

The North American Cray User Group (CUG) meeting was held March 13th to 17th in Denver and was sponsored by the National Center for Atmospheric Research (NCAR). In conjunction with this CUG, the CRI PATP (Parallel Applications Technology Program) put on a conference run in parallel with the CUG sessions. This "conference within a conference" was called the SPAD (Scalable Parallel Applications Developers) conference and was a forum for CRI and members of the PATP to present their results on the T3D.

Together with the regular sessions of the CUG and the SPAD conference, there were 27 hours of presentations about the T3D in the four-and-a-half-day CUG conference. I counted at least 6 hours in which two or more T3D sessions were going on at the same time, and one one-hour session overlapped with four half-hour sessions. There was a lot of information presented. The T3D-related talks that were part of CUG will eventually appear in the CUG proceedings. Unfortunately, there are no current plans to publish the SPAD results; you just had to be there. The SPAD results might someday be available on the CRI web server, but prior to the conference there weren't any plans to capture this information.

Below is the list of all T3D activities at CUG or SPAD with the notes I have from the ones I attended:


8:00 - 10:00
Single PE Optimization. Jeff Brooks, CRI
Jeff did a real good job of illustrating the T3D processor with small snippets of Fortran, correlating hardware performance with the measured performance of a Fortran code. Jeff will also be making the optimized routines and utilities of benchlib available, and I have contacted him about distributing it. Jeff also gave a few details about the T3E processor:
  1. 250 MHz
  2. 500 MFLOPs
  3. 8 KB I and D caches, 96 KB secondary cache
  4. I and D cache are 3-way set associative
  5. No Y-MP front end
A lot of what he covered is in the postscript paper that I distributed in Newsletter #18. I have an extended version of that paper that I can supply upon request.
10:30 - 12:00
MPP Performance Tool. Carol Beatty, CRI
This was a talk about Totalview and Apprentice. It was a good overall talk and gave the status of the current versions. From the talk:
  1. Totalview line mode version in 4th quarter 1995
  2. Totalview with your own signals causes problems
  3. Totalview requires entry/exit macro for CAM routines
  4. Apprentice can be controlled by instrumenting your own calls (see man page apprif)
  5. Apprentice doesn't count what goes in a library like libm.a
  6. Neither Totalview nor Apprentice works well with CRI's C++, because that C++ is based on cfront, which converts a C++ program into a C program with new variable names.
I have a copy of Carol's slides.
5:00 - 6:30
Cray T3D Sparse & Dense Matrix Solvers. Guangye Li and Majed Sidani, CRI
Guangye described CRI's ambitious plans for callable sparse iterative solvers. Basically a user will be supplied with tools and a library that they use to access optimized solvers. There was no date given on availability. I have a copy of Guangye's slides.
Majed described the distributed dense solvers based on ScaLAPACK, the parallel BLAS (PBLAS), and the communication BLAS (BLACS) out of Oak Ridge National Laboratory. The PBLAS and BLACS are in the current 1.2 PE, but there is an inconsistency between the calling sequence of the T3D BLACS and the newest version of ScaLAPACK, 2.0, released by ORNL in the past month. I will try to sort this situation out in a future newsletter. There is a man page on the BLACS on denali.


8:30 - 9:30
Parallel Applications Status. Sara Graffunder, CRI
The Applications Department of Cray has become a Division, putting it on an equal footing with Development and Sales & Marketing. This reflects CRI's commitment to getting applications up and running on the T3D. Someone at the conference paraphrased the current MPP marketing strategy as: "It's the applications, stupid!" (As in "It's the economy, stupid!" from the last election.) The vehicle for these applications is the PATP, a three-year program now consisting of 100 to 200 researchers and scientists worldwide. The four early T3D sites that belong to the PATP group are:
  1. Pittsburgh Supercomputing Center
  2. Jet Propulsion Laboratory
  3. Lawrence Livermore National Laboratory
  4. Swiss Federal Institute of Technology
PVM is the language of most of these conversions because, although it may be the "assembly language of message passing", it is the only common language among all the targeted platforms. Here is the list of programs/packages/libraries that Sara presented, copied down as fast as I could:

  Currently available on the T3D:

  Structures:       LS-DYNA3D
  CFD:              FLO67
  Electronics:      FAIM, SEED, FDTD
  Environmental:    IFS, PCCM, POP
  Petroleum:        3DGEO
  Math:             Elegant, NAG (1 PE)

  Available in '95 on the T3D:

  Chemistry:        CHARMM (MSI), DGAUSS, XPLOR, Gaussian '94
  Structures:       MARC, PAMCRASH, RADIOSS
  Environmental:    HIRLAM, SKIHI
  Math:             IMSL, HARWELL, NAG (multiple PEs)

  Misc. notes:

  Gaussian depends on Linda running on the T3D.
  ABAQUS will be ported to the SP2 first, then to the T3D.
  NASTRAN might be ported by 1996.
  32 PEs on LS-DYNA3D and PAMCRASH are about the same as 1
  C90 processor.
8:30 - 9:00
LS DYNA3D in Production on the T3D. John Gregory, CRI
8:30 - 9:00
Operation of a CRAY T3D. Mike Brown, Edinburgh Parallel Computing Centre
9:00 - 9:30
CRAY T3D at ECMWF. Graham Holt, ECMWF
Didn't attend; I have a paper on ECMWF's T3D and Y-MP performance.
9:00 - 9:30
Programming Models for Cray MPPs. Tom MacDonald, CRI
9:30 - 10:00
Elegant Math. Alex Yeremin, Elegant Mathematics, Inc.
This was a Russian software house getting good results on sparse matrix solvers and a Fortran analyzer. They were looking for partners. I have copies of papers on the sparse solvers and the Fortran analyzer.
1:30 - 2:00
CRAY T3D Metal Forming Applications. Rob Neely, Lawrence Livermore Labs
This was a description of the ALE3D package converted to the LLNL Meiko machine. They plan to convert it to the T3D but haven't done so yet. This has been an 8-year project and involves about 20 people. The functionality is greater than LS-DYNA3D's, and it will scale to about 32 PEs. On this application, 16 PEs are about equal to one Y-MP processor.
2:00 - 2:30
CRAY T3D CFD Applications. Roland Richter, CRI
This application uses the master/slave paradigm to implement an adaptive mesh for 2- and 3-dimensional flows. There were a lot of details on cache management and word alignment that I'll illustrate in future newsletters. On this application he got about 15 to 20 Mflops per PE.
2:30 - 3:00
CRAY T3D Biomedical Applications. Nick Nystrom, PSC
This was a description of the chemistry programs ported to the T3D at the Pittsburgh Supercomputing Center; most are on Sara Graffunder's list. Nick also gave some typical speeds for their applications:

    PVM to host:  3 to 13 Mb/s
    File I/O:     35-45 Mb/s
Being one of the oldest T3D sites, they have basically exhausted what they can do on the T3D as far as speed goes and are waiting for the T3E. I have a copy of Nick's slides.
3:30 - 5:00
Heterogeneous Computing Workshop
3:30 - 4:00
Binary-swap Volume Rendering on the CRAY T3D. Hansen, et al, Los Alamos
This was a port of CM5 code to the T3D; the port hasn't gone too well, and most of the results were from the CM5 effort.
4:00 - 4:30
Medical Diagnostics using the CRAY T3D. Jon Genetti & Greg Johnson, ARSC
This talk described using AVS running on a workstation as a front end to volume rendering modules running on the T3D. The images formed on the T3D are sent to the /proc file on the Y-MP and then on to the workstation. There was a video demonstrating almost real-time interaction. I can supply copies of the paper.


8:30 - 9:00
TOP2 Tool Suite. Ulrich Detert and Michael Gerndt, KFA Juelich
This was a very interesting tool that takes a program running on the Y-MP and moves parts of the calling tree onto the T3D. A user annotates his Fortran to specify what should run on the T3D, and the tools produce a Y-MP master program and a T3D slave program. The program starts running on the Y-MP and then transfers all arrays and variables to the T3D for further processing. When the T3D is done, the entire state is sent back to the Y-MP. Currently the transfers are done with either shared files or UNIX sockets.
9:00 - 9:30
AC and the CRAY T3D. Jesse Draper and Bill Carlson, SRC and IDA
This was a description of the port of the GNU C compiler, gcc, to the T3D. The compiler has a single extension, 'dist' (for distributed), which allows shared arrays in C. In one application the AC compiler produced an executable 3 times faster than one from the CRI C compiler. I am trying to get a copy of this compiler for use at ARSC. I have a copy of the report and the slides.
9:30 - 10:00
Remote Memory Access (libsma). Karl Feind, CRI
This was a description of the SHMEM routines from CRI. The speaker went over the internals of how SHMEMs are implemented and some of the new routines that are available in the 1.2 PE. In future newsletters I'll describe some of the effects of SHMEMs and cache coherency. I have a copy of the slides.
1:30 - 2:00
Sorting on the CRAY T3D. Brandon Dixon, University of Alabama
How fast can you sort on the T3D? There were several tricks here that were fun to see (the radix sort is the winner). In these applications the AC compiler was about 30% faster than the CRI C compiler.
2:00 - 2:30
Evolution of MD Codes from PVP to the CRAY T3D/T3E. Barry Bolding, CRI
Although CHARMM has been converted to the T3D, it doesn't always scale up well in the number of PEs. The speaker reported on the results of a 20-person team, funded by a CRADA, that wrote from scratch an MD code with the functionality of CHARMM but designed to scale with the number of processors. The project is not yet done, but it will be an interesting data point in the argument of "porting" vs. "rewriting" as the way to get good performance on MPPs. I have a copy of the slides.
2:30 - 3:00
Spectral FE Model for Shallow Water Equations. Giovanni Erbacci, CINECA
This finite element model was implemented on both a C90 and a T3D. The performance was roughly 1 C90 processor = 4 T3D PEs, but it pointed out that some of the tradeoffs that help the T3D are not advantageous on the C90.
3:30 - 4:00
Timesharing on the CRAY T3D. Brent Gorda, LLNL
On busy T3D sites there are situations where a mismatch between requested PE configurations can leave the machine underutilized. With a limited amount of roll-out and roll-in capability, the T3D can be "repacked" to use more of the PEs for a given sequence of requests. This work in progress does the roll-in/roll-out manually. Swap space and swap time for T3D applications will be big issues for MPPs in the future.
4:00 - 4:30
Modeling Serverized UNICOS on the T3D. Bruce Schneider, CRI
CRI is studying the tradeoffs for a distributed O.S. on the T3E. This talk supplied no performance numbers but showed that, with simulation, decisions could be made about how the O.S. functions would be split between PEs. CRI will use an internally developed SCX channel to implement connections between T3E PEs and attached peripherals. The T3E will be "self-hosted", i.e., no Y-MP front end.
4:30 - 5:00
A Parallel Algorithm for Image Sequence Coding on the CRAY T3D. Henri Nicolas & Martin Schutz, CRI and Swiss Inst. of Tech.
This result from a PATP member was about image processing.


8:30 - 9:00
PATP Status. John Champine, CRI
8:30 - 9:00
Parallel I/O. Kent Koeninger, CRI
Kent distributed some good ideas about I/O on the T3D, and I will experiment with these in the next few months. As more and more applications find I/O a bottleneck, more effort is spent finding out why. Kent also reviewed the status and performance of Phase I, II and III I/O. His advice for a site like ARSC was to stay with Phase I I/O and try experimenting with:
  • well-formed I/O
  • disk striping, user-implemented and system-implemented
  • balancing IOG capacity and disk contention
  • tuning the mpp agent
The results of these experiments were elaborated on in Paul Helvig's talk on Friday morning.
9:00 - 9:30
Performance of Parallel Programs Optimization. W. Nagel, KFA Juelich
This tool from the PARBENCH effort instruments the message-passing routines to provide a trace history of communication during a program run. It is a typical tool from the Intel Paragon environment, but CRI resists such tools because its emphasis is on large configurations where they are impractical; even small runs can produce trace files up to 1 GW. ARSC will try to make this program available on denali.
9:00 - 9:30
UNICOS/mk. Gabriel Broner, CRI
9:00 - 10:00
Centric Engineering. Mike Eldredge, ISV
9:30 - 10:00
High Level Language & Cache Issues on the Cray T3D. Ken Koch, Los Alamos
This was a horror story of a port of a CM5 3D reservoir modeling code to the T3D. Expectations were low, at 10% of peak on the T3D, but because the CM5 code had no notion of data locality, the cache performance on the T3D was atrocious. Also, the CM5 code was written in TMC's Fortran 90 extensions, and the T3D Fortran 90 had lots of trouble with it; in particular, it didn't do a good job on loop fusion. The developers went back to Craft F77 but still couldn't get good performance. They did a lot of work on cache and page management. They are hoping that the T3E will be much better. I have a copy of the paper.
1:30 - 2:00
Cray T3D Petroleum Applications. Bob Stephenson, CRI?
1:30 - 2:00
Support Functions for CRAY T3D Users at ARSC. Mike Ess, ARSC
This was a well-received talk on how the T3D support is different from the Y-MP support. I have copies of the slides and the paper.
2:00 - 2:30
CRAY T3D Aerospace Applications. Steve Taylor, Caltech
2:30 - 3:00
CRAY T3D Atmosphere Applications. Jim Rosinski, NCAR
2:30 - 3:00
CRAY T3D Help Desk Panel.
This open discussion was about how each site is developing its own experience in the T3D world. We all agreed there was little coordination of T3D information and we each fight the steep learning curve by ourselves. The older users, like the PATP group, haven't found a way to impart their experience to the newer T3D users, except through something like the SPAD conference.
3:30 - 4:00
MPP Mutual Interest Group.
The MIG, not being a full SIG (Special Interest Group), was disbanded for lack of interest. The motion to disband was a surprise to most attendees, I think. Basically, no one has the time to organize and run an MPP SIG. I don't think the problem was a lack of interest, just a lack of commitment. Future talks about the T3D will be part of the regular CUG proceedings.
4:00 - 4:30
SPAD Business Meeting. John Champine, CRI
Everyone agreed that there was a tremendous amount of collective experience on the T3D and that a lot of it got out at this SPAD conference. That there would be no written record of SPAD would be a real loss for new users and CRI. John Champine was going to look into whether some of the SPAD material could be made available on the CRI web server.
5:00 - 6:00
Parallel Visualization Workshop.
This panel discussion went over what LLNL, Los Alamos, JPL and CRI are doing in visualization. The general feeling was that the Y-MP gets in the way of what the T3D could produce, and there was much anticipation of the T3E. There seemed to be different camps as to whether AVS, Wavefront, or roll-your-own was the correct approach to visualization on the T3D. But the group tried to separate visualization into different tasks, then decide which were suitable for the T3D and what the proper software tool was for each.


8:00 - 9:30
Parallel I/O Tutorial. Paul Helvig, CRI
In a logical format, Paul outlined approaches to I/O on the T3D, discussing the advantages and disadvantages of each and providing performance data on which approaches are most promising. Over and over he came back to the tradeoffs among:
  • IOG capacity
  • Disk contention
  • System activity
I have a lot of ideas to try out myself now.
8:30 - 9:00
Simulating Quarks and Gluons. Tom DeGrand & Bob Sugar, Univ. of Colorado
9:30 - 10:00
Clusters vs. MPP. Kent Koeninger, CRI
9:30 - 10:00
Terrain Correction of SAR Imagery using the CRAY T3D. Tom Logan, ARSC
Tom gave the results of his master's thesis on porting different modules of SAR processing to the T3D. With the new 8 MW nodes and Tom's effort, SAR processing will be more cost effective than the same processing on the Y-MP. Tom's experience has made him one of the most knowledgeable T3D users at ARSC.

Further Information

I have the e-mail addresses, phone numbers and mailing addresses of the authors above. If you would like details on any of their talks I can send this information to you and you can contact them directly. For those talks that I have a copy of the slides or paper, I will send out copies upon request.

The Future of CRI Hardware

Here is my long-term picture for CRI hardware from what I gathered at the CUG (severe speculation). The follow-on to the T3D is the T3E, which will ship in the 1st quarter of 1996. Once CRI has made the transition to a distributed O.S. on the T3E, it will be in a position to use commodity processors and fold in the programming environment and tools from the Y-MP line; this will be when CRI develops the "scalable node" architecture. The scalable node architecture will be nodes of a small number of processors sharing memory (the Y-MP model), with these nodes scalably configured together (the T3E connections). There will be no follow-on to the T90 or J90; the time of customized architectures is almost over. CRI's long-term future will look something like:

  1993  1994  1995  1996  1997  1998  1999  2000

  T3D--------------->T3E------------------->The Scalable Node
The long-term future of CRI is tied to both the T3D line and the older Y-MP line. So effort invested in the T3D/T3E is effort that will grow with the future of CRI.

ARSC T3D Future Upgrades

We are testing the upgrade to the T3D 1.2 Programming Environment (libraries, tools, and compilers). If all goes well it will be on the system in one week.

We are also planning to install CF90 and C++ for the T3D. This will come after the upgrade to the 1.2 P.E. I am interested in hearing from users who want to use the CF90 and C++ products as soon as they are available.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions and Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.