ARSC T3D Users' Newsletter 109, October 25, 1996

Report on The Carolina CUG

Five ARSC staff members spent last week in sunny Charlotte. Dale Clark and I are now back in the land of winter. For this Newsletter, I am going to share excerpts from our CUG reports. A lot of this material was taken from notes, so if we have gotten anything wrong, my apologies to the presenters (corrections, embellishments, and other contributions are, of course, always welcome).

Thanks to Dale for his material, which I'll give first:

Introduction

The 38th Cray User Group (CUG) conference was held in Charlotte, North Carolina during the week of October 14 - 18, 1996. The host site was the North Carolina Supercomputing Center, which was, however, located about 120 miles east at Research Triangle Park. The theme of the conference was "Speeding by Design", inspired by the Charlotte Motor Speedway, "America's premier NASCAR facility and home of the Coca-Cola 600 race event".

Charlotte is a city of some 455,000 people, and was incorporated in 1768. Nicknamed "the Queen City", Charlotte's actual name derives from George III's German queen, Charlotte of Mecklenburg. The weather during the conference was pleasant and sunny, with daily highs in the upper seventies.

Conclusions

This conference, the first since Seymour Cray's death and CRI's merger with SGI, in many respects was classic CUG. Well organized and graciously hosted, its tutorials, BOFs, SICs, MIGs, general sessions and parallel presentations provided useful forums for us users to learn from one another's experiences, and - of at least equal importance - for Cray to instruct us and inform us of their future plans.

Certain problems suggest, however, that future CUGs will be different. Monetary problems, for one thing, have led delegates to approve a move to once-yearly meetings, to be held each Spring, making this CUG 38 the last Fall meeting.

More importantly, the CRI/SGI merger has plunged CUG into an identity crisis. The legal requirement that references to 'CRAY' in CUG bylaws be replaced with references to 'S.G.I./CRAY', for example, potentially opens the door to CUG membership for anyone with an SGI workstation. Does CUG then abandon its focus on high end machines and become a broad-based corporate user group, or should it retain its focus on high performance computing, possibly expanding it to admit users of all high performance computers, whatever the make?

Whatever the resolution, CUG stands to lose its special relationship with Cray. How willingly will Cray share privileged information with a group whose members may include competing supercomputer manufacturers, who could not be kept out under either scenario? More importantly, how willingly will Cray's new parent support CUG? Remarks by SGI's President and CEO, Ed McCracken, seem to indicate that he regards user groups, with their natural focus on current needs and problems, as something of a drag on progress.

Perhaps, as some delegates remarked, the sky will not fall, and CUG is in no real danger. Whether it will be business as usual, and whether CUG can continue its traditions, however, remains to be seen.


================================ reports ===============================

Cray Corporate Report

General session presentation by Bob Ewald, President and COO, CRI

  • 11 T3Es comprising 864 PEs shipped so far. (average: 79 PEs per T3E)
  • J90s and T90s also selling well.
  • Legal action in progress against NEC, for allegedly outbidding CRI for NCAR contract by dumping its machine.
  • "Cray-class performance + SGI-class visualization + balanced systems => INSIGHT"
  • A successor machine to the T3E is planned; tentatively called T3E+.
  • Effects of merger with SGI:
    • Staff of ~3650 CRI employees reduced by ~140 redundant employees.
    • Some turf wars between CRI and SGI sales departments.

Hardware Report

General session presentation by Steve Johnson, CRI

  • C90 out of production.
  • J90 in sustaining mode.
  • J90++ design underway; twice J90 performance.
  • T90 enhancements:
    • CMO4 (SSRAM)
    • IEEE CPUs
    • gigaring I/O
  • T90P under development:
    • faster CPUs
    • larger memory
    • IEEE FP
    • BiCMOS
  • T3E:
    • 28 systems to be shipped by end of October
    • 512 PE system built
    • 1192 PE system on order
    • 512MB/PE in production
    • 2GB/PE prototyped
    • multiple gigarings by November 1
    • multi-purpose node under development; being tested on J90
  • Origin 2000
    • 195 MHz MIPS R10000 (R10K)
    • 4MB cache
    • 65 - 128 CPUs
  • SN1 (Scalable Node) teams being formed; MIPS based.
  • SN2 in definition stage; CRI/SGI architecture.

Software Report

General session presentation by Mike Booth, CRI

  • 827 UNICOS systems averaging 99.59% availability.
  • Release levels in field:
    • 7.0 15%
    • 8.0 70% (we have plenty of company)
    • 9.0 15%
  • UNICOS 8.0.4.4 is the final 8.0 release.
  • Future releases:
    • 10.0 1997
    • 11.0 1999

Service Report

General session presentation by Mick Dungworth, CRI

  • Former head of CRI service now head of combined CRI/SGI service.
  • 2,000 combined service employees.
  • $600 million combined service revenue.

Computational Chemistry: Two Success Stories

General session presentation by Lee Bartolotti, NCSC

This was an interesting presentation showing how theoretical chemists, using complex molecular models and massive amounts of computation time, can achieve results that have eluded their experimental chemistry colleagues. The first example was a fine-grained model applied to the small toluene (methyl benzene) molecule, in a search for its decomposition products, a matter of concern in the fragile upper atmosphere. Results pointed to a preponderance of an unexpected decomposition product, a finding that was later confirmed experimentally. The second success story concerned the elucidation of the stereochemistry of a key protein associated with Alzheimer's disease. This involved a coarser-grained model, which resulted in less exact predictions, but still advanced researchers' knowledge of this protein's likely structure.

SGI/Cray Next Generation

General session presentation by Rick Bahr, SGI, and Karl Freund, CRI

A frequently-seen chart showing the expected development of SGI and Cray product lines was again trotted out, the key message seeming to be that both product lines are converging on the SN1 and SN2, scalable node architectures using MIPS processors and key Cray technology. Also discussed were the advantages and disadvantages of current architectures:

  • shared memory:
    • easy to program
    • hard to scale
  • parallel systems:
    • hard to program
    • easy to scale

The Silicon Valley CUG

General session presentation by David Robertson, Sterling Software

This presentation consisted of a brief but entertaining video of the attractions of the San Jose area, including some sight gags on the conference theme, "seismic computing." From the looks of things, a rattlingly good time is in store for all attendees.

Tom Baring's Material:

I tried to hit every talk or BOF which had "T3E" somewhere in the title. People seem excited about the 'E, and a lot of enthusiasm swirls around it. Early systems were apparently unstable, but they are stabilizing quickly, due to a tremendous and exacting effort on the part of CRI.

T3E Optimization -- Jeff Brooks (CRI)

The T3E uses the DEC Alpha 21164 (EV5) chip. Some specs:

  • 32 int and 32 float registers
  • 4 clock fp add/mult pipes (13.3 ns)
  • 1K-word (8 KB) direct-mapped primary data cache
  • 96 KB 3-way set associative secondary cache
  • quad issue
  • 300 MHz

EV5 cache system:

  • Primary dcache like EV4--
    • 1024 words, 4 word lines
    • direct mapped
    • read allocate, write through
    • NOTE: on EV5, fp loads frequently bypass this cache
  • Secondary cache--
    • 12288 words, 8 word lines
    • 3-way set associative with random replacement policy
    • read / write allocate and write back
    • caches data AND instructions
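
To make the cache geometry above concrete, here is a minimal Python sketch of how a word address would map into each cache under these figures. It assumes 64-bit (8-byte) words and plain modulo indexing; the function and constant names are made up for illustration, and the real hardware indexing may differ in detail.

```python
# Illustrative model of the EV5 cache geometry listed above.
# Assumptions (not from the talk): 8-byte words, simple modulo indexing.

WORD_BYTES = 8

# Primary dcache: 1024 words, 4-word lines, direct mapped.
DCACHE_WORDS, DCACHE_LINE = 1024, 4
DCACHE_LINES = DCACHE_WORDS // DCACHE_LINE                  # number of lines

# Secondary cache: 12288 words, 8-word lines, 3-way set associative.
SCACHE_WORDS, SCACHE_LINE, SCACHE_WAYS = 12288, 8, 3
SCACHE_SETS = SCACHE_WORDS // (SCACHE_LINE * SCACHE_WAYS)   # number of sets
SCACHE_WAY_WORDS = SCACHE_SETS * SCACHE_LINE                # words covered by one way


def dcache_line(word_addr):
    """Line index a word address maps to in the direct-mapped dcache."""
    return (word_addr // DCACHE_LINE) % DCACHE_LINES


def scache_set(word_addr):
    """Set index a word address maps to in the 3-way secondary cache."""
    return (word_addr // SCACHE_LINE) % SCACHE_SETS


if __name__ == "__main__":
    print("dcache bytes:", DCACHE_WORDS * WORD_BYTES)
    print("scache bytes:", SCACHE_WORDS * WORD_BYTES)
    print("words per scache way:", SCACHE_WAY_WORDS)
    # Addresses one "way" apart share a set in this model.
    print([scache_set(a) for a in (0, 4096, 8192, 12288)])
```

Under this model each way of the secondary cache spans 4096 words, so arrays whose starting addresses differ by a multiple of 4096 words compete for the same three slots in a set.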

Local Memory System:

Streams allow prefetching from DRAM to fast buffers to speed data loading. There are 6 streams available; they are managed in hardware and are allocated after two secondary cache misses. They achieve >600 MB/sec sustained bandwidth on code with multiple RHS operands. Optimize for streams by keeping the number of data streams on RHS's to 6 or fewer.

Global Memory Access:

512 "E-Registers" are used for GETs and PUTs to global memory. E-register data can load to EV5's registers at 600 MB/s. Using low-level directives, you can compute straight out of E-registers; this is a possible optimization for strange strides.

Floating Point Functional Units:

Mult/Add are pipelined and take 4 clock periods (6 cp on the T3D). Divide is not pipelined and takes 22-60 cp (61 cp on the T3D). Unroll loops to optimize; this exposes more parallelism to the compiler.
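
As a rough illustration of why unrolling helps a pipelined multiply/add unit, the sketch below evaluates a 4th-order polynomial by Horner's rule two ways. It is written in Python purely to show the loop transformation (Python itself gains nothing from it), and treating "unroll by 4" as four data points per loop trip is one plausible reading of the notes, not something the talk confirmed. The function names are hypothetical.

```python
# Horner's rule for a 4th-order polynomial over an array of x values.
# c = [c0, c1, c2, c3, c4] so that p(x) = c0 + c1*x + ... + c4*x**4.

def horner_simple(c, xs):
    """One point per iteration: every multiply/add waits on the previous one."""
    out = []
    for x in xs:
        p = c[4]
        for k in (3, 2, 1, 0):
            p = p * x + c[k]
        out.append(p)
    return out


def horner_unrolled4(c, xs):
    """Four points per iteration: four independent dependency chains that a
    pipelined functional unit could overlap."""
    out = []
    n4 = len(xs) - len(xs) % 4
    for i in range(0, n4, 4):
        x0, x1, x2, x3 = xs[i:i + 4]
        p0 = p1 = p2 = p3 = c[4]
        for k in (3, 2, 1, 0):
            p0 = p0 * x0 + c[k]
            p1 = p1 * x1 + c[k]
            p2 = p2 * x2 + c[k]
            p3 = p3 * x3 + c[k]
        out.extend((p0, p1, p2, p3))
    out.extend(horner_simple(c, xs[n4:]))  # remainder loop for leftover points
    return out
```

Both versions perform the same operations in the same order for each point, so they return identical results; only the amount of independent work visible inside one loop body changes.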

Some Speed comparisons w/ T3D:


  4th order Horner's rule polynomial--
    not unrolled (as is): 3.6x faster
    unroll by 4:          6.5x 
  libm intrinsic funcs--
    sqrt:                 5.7x
    1.0/sqrt              5.1x
    alog                  2.9x
    exp                   4.5x
    sin                   2.6x
    cos                   3.1x
    a**b                  3.4x
  saxpy--
    no unroll             4.8x
    unroll                6.4x

Summary:

  • Stride 1 is important.
  • Optimize for 2ndary cache use.
  • Treat Scache like a 4096 word direct-mapped cache for key loops.
  • Keep # of streams at 6 or less: try "cdir$ split" on key loops where more than 6.
  • Note: stores that miss in the Scache result in a read stream
  • Try "cdir$ unroll" w/ various levels of unrolling.
  • Try E-Registers for strided address computations and for gather/scatter ops that are not likely to be cache resident.
  • Try 32-bit floats if possible
  • Use SCILIB for BLAS-2 and -3 and for FFT funcs. Not necessary for BLAS-1.
  • Use bnchlib intrinsics.
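
The stream advice in the summary can be sketched as follows. This Python fragment counts the arrays a loop touches and shows a hand-split version of a loop that would otherwise exceed 6 streams; it presumably mirrors what "cdir$ split" asks the compiler to do, but that is an assumption. The counting rule is a worst-case estimate (every stored array is assumed to miss in the Scache), and all names are hypothetical.

```python
MAX_STREAMS = 6  # stream count quoted in the talk


def count_streams(reads, writes):
    """Worst-case estimate: one stream per distinct array read, plus one per
    distinct array written (assuming its stores miss in the Scache)."""
    return len(set(reads) | set(writes))


def fused(a, b, c, d, e, f, g, h):
    """One loop touching 8 input arrays and 2 output arrays."""
    n = len(a)
    x, y = [0.0] * n, [0.0] * n
    for i in range(n):
        x[i] = a[i] + b[i] * c[i] + d[i]
        y[i] = e[i] + f[i] * g[i] + h[i]
    return x, y


def split(a, b, c, d, e, f, g, h):
    """Same arithmetic, split into two loops that each touch 5 arrays."""
    n = len(a)
    x, y = [0.0] * n, [0.0] * n
    for i in range(n):
        x[i] = a[i] + b[i] * c[i] + d[i]
    for i in range(n):
        y[i] = e[i] + f[i] * g[i] + h[i]
    return x, y
```

Here the fused loop is estimated at 10 streams, while each half of the split version is estimated at 5, which fits within the 6 available.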

Jeff has graciously supplied postscript copies of his overheads, which are now available:

pub/mpp/docs/T3Etutorial.ps.Z


At:
   
ftp.arsc.edu

SGI Merger: Impact on Cray Users -- Ed McCracken (SGI Pres & CEO)

In '97, SGI will be a $4B company w/ ~11,000 employees.

Principal markets:

  • 1/3 -- manufacturing (automotive and aircraft)
  • 1/3 -- defense & intelligence (image processing, simulations, HPC)
  • 1/3 -- science (weather, oil/gas, pharm, etc... Universities)
  • <15% -- "tele-entertainment"

Product categories:

  • $2B Low-end (desktop) machines
  • $2B High-end (on the floor) machines, which can be divided:
    • 1/4 in graphics supercomputers
    • 1/4 in database/data warehousing/web servers
    • 1/2 in pure supercomputing, which can be divided:
      • 2/3rds CRI
      • 1/3rd SGI

Five primary groups in SGI makeup:

  • MTI (MIPS and PowerPC)
  • Desktop workstations
  • Scalable Systems
  • CRI
  • Silicon interactive (software)

Products:

  • J90 and T90 new generations
  • Scalable node product lines (CRI & SGI plans were similar)
  • SGI SN line is core
  • CRI responsible for largest configurations

Other comments:

SGI often uses "just in time research." When they need an idea, they go out and find it; they maintain close ties with Universities, in particular Stanford. They are heavier in engineering than research, but compared to competitors, try to stay much higher in R&D -- they want to develop the new ideas, not the clones. They have cut very few CRI employees, and those from redundant departments, like HR.

Installation and configuration of UNICOS/mk on a Cray T3E -- David C. Holst (CRI)

This talk was more for system admins but I have the handout.

A couple things:

  • Preinstallation will be a very simple configuration, but all T3Es will ship from a working, known starting point.

  • A minimum of 1 support (or shell) PE is required. Thus, a 128 node T3E can have at most 127 application nodes.

  • Fairly easy to change "support" nodes into "application" nodes, but will take a reboot.

  • Can convert "system" nodes into support or application nodes, but not recommended. CRI may not be able to read dumps if you alter recommended number of system nodes.

Benchmarking the SNL MPI Suite on T3E -- Mike Davis (CRI at SNL)

They moved code to T3E in steps:

  • on PVP, convert from f77 to f90
    • Change to POSIX Fortran interface: no more GETARG or RENAME.
    • Adjust for mpirun which adds 2 command line args after user command.
  • port to T3D
    • sizeof (Fort real) vs sizeof (C float)
    • remove SIGN (X,-0) which had been used as a way around if-then-else
    • sbrk() change
    • Fortran character descriptor (alignment in PVP had allowed alias of 1st byte to char*)
    • MPI_CHARACTER change
  • port to T3E
    • No Problems; Fully compatible.

See: www.sandia.gov for codes

Comparison of CF77 and CF90

Programmers should be aware of several techniques for getting the best compiler optimization out of CF90. My notes don't have the detail, but look for:

  • inlining array intrinsics
  • loop fusion
  • loop interchange
  • outer loop vectorization
  • unroll & jam
  • loop splitting
  • loop collapse and unroll

Cray Supercomputing Report -- Irene Qualters (VP, CRI)

T3Es built:

  • 17 air-cooled by end of Oct
  • 11 liq-cooled by end of Oct
  • multiple 256 PE T3Es
  • building a 512 PE T3E now
  • largest order: 1192 PEs
  • largest air-cooled: 68 PEs
  • T3Es w/ 64, 128, 256, 512 MB memory per PE in full production
  • Prototype using faster alpha and 1 & 2 GB memories in checkout
  • Prototype using multiple Gigarings in checkout

"Birds-of-a-feather" discussion on T3E Status -- Steve Reinhart, CRI

  • 18 T3Es have been delivered to date
  • Deliveries now at 3-5 systems (300-500 PEs) per week

Software in delivery:

  • progEnv, CF90, C, C++, PVM, MPI, Shmem
  • TCP/IP, telnet, NFS, NQS
  • basic UNICOS, Single-PE swapping
  • single MPN
  • other...
    • checkpoint/restart planned for ver. 1.5

Stability of existing T3Es:

  • Remaining instability is primarily in low-level HW. Thus, one bug fix removes multiple manifestations. (CRI)

  • Upper layers are stabilizing quickly. (CRI)

  • Staff from three sites already running T3Es noted that, although the stability of early T3Es was a problem, CRI has been working extremely hard and making dramatic and consistent progress. My sense was that everyone is really excited about their T3Es.

Streams:

  • Use caution with current mk release.
  • The programmer is responsible for preventing simultaneous local computation and remote communication on the same vector via stream access.

Installing and Configuring a T3E -- David C. Holst (CRI)

At CRI, 3-5 users per shell (support) PE at a time is typical.

Automated T3D error reporting -- M.W. Brown (EPCC)

They have developed a tool, the "patrol" system, for automatic checking of T3D status. It scans mppsyslog for certain patterns; invokes mppping to check communication status; and checks fsmon (free space monitor).
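
A check of this kind could look roughly like the Python sketch below. It is not EPCC's code: the alert patterns are hypothetical (the notes do not record which strings patrol looks for in mppsyslog), the mppping step is left out because its invocation and output format are not described here, and the free-space check uses Python's standard shutil.disk_usage as a stand-in for fsmon.

```python
import re
import shutil

# Hypothetical patterns, for illustration only.
ALERT_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bpanic\b", r"\bPE \d+ down\b", r"\bhardware error\b")
]


def scan_log(lines, patterns=ALERT_PATTERNS):
    """Return (line_number, text) for every log line matching any pattern."""
    hits = []
    for num, line in enumerate(lines, start=1):
        if any(p.search(line) for p in patterns):
            hits.append((num, line.rstrip("\n")))
    return hits


def low_on_space(path, min_free_fraction=0.10):
    """True if the filesystem holding `path` has less free space than the
    given fraction of its total size."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False
    return usage.free / usage.total < min_free_fraction
```

A driver script would read the log file, call scan_log on its lines, and mail or page an operator when either check reports a problem.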

Chemistry Apps on the T3E -- John Carpenter (CRI)

This authoritative talk covered progress in porting chemistry codes to the T3E. Computational goal is to compute 100 atom molecules in 20 hours -- much longer becomes impractical for researchers. Here is the status of some ports:


Gaussian 94 -- being ported to 'E
UniChem     --
GAMESS      -- 'D and 'E versions available
DMOLE       -- 'D version in beta; 'E being ported
Turbomole   -- 'D and 'E versions in beta
QChem       -- 'D and 'E versions in beta
NWChem      -- 'D and 'E versions in beta

HPF-CRAFT "Birds-of-a-feather" discussion -- Doug Miles of PGI

This BOF brought good news, as I had thought that CRAFT on the T3E would be scaled down and that users would be faced with major rewrites. It turns out that T3E-CF90 will support the full CRAFT standard. There will be some minor syntax changes relative to CRAFT 77. Here are three that I got down:


 Directive prefix !DIR$ --> !HPF$
     "     shared       --> DISTRIBUTE
     "     DOSHARED     --> INDEPENDENT

But all CRAFT features will be implemented.
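
For anyone wanting to size up a port, the three renamings above could be located in existing source with a small scanner like this Python sketch. It only flags lines and suggests the new spelling; it does not rewrite anything, since the notes do not say whether the arguments of each directive carry over unchanged. The function name is hypothetical.

```python
import re

# The three renamings noted above (CRAFT spelling -> new spelling).
NEW_PREFIX = "!HPF$"
KEYWORDS = {"SHARED": "DISTRIBUTE", "DOSHARED": "INDEPENDENT"}

# A line beginning with the old !DIR$ prefix, followed by a keyword.
DIRECTIVE = re.compile(r"^\s*!DIR\$\s+(\w+)", re.IGNORECASE)


def flag_craft_directives(source_lines):
    """Yield (line_number, old_keyword, suggestion) for each line that uses
    one of the renamed directives. Nothing is rewritten."""
    for num, line in enumerate(source_lines, start=1):
        m = DIRECTIVE.match(line)
        if m:
            old = m.group(1).upper()
            if old in KEYWORDS:
                yield num, old, NEW_PREFIX + " " + KEYWORDS[old]
```

Feeding it the lines of a source file yields one tuple per directive that needs attention, which gives a quick count of how much editing a code would need.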

The HPF features in T3E-CF90 will be standard, and thus, if users stick with HPF rather than CRAFT, their code should port to different platforms. The generic HPF uses message passing, but on the T3E, PGI will be working toward an implementation that takes advantage of shmem.

T3E-CF90 will be interoperable w/ totalview and apprentice.

So Optimization Breaks Your Code ... -- R. K. Owen (NAS)

He has developed a tool, "bchop," which helps determine which of many object files has an error under different compiler options. The concept and program are both fairly simple:

  1. create a directory ./good, and put object files produced with known "good" compiler options into it.

  2. create a ./bad directory, and put object files into it with the questionable compiler options.

  3. create a script which can tell good program output from bad.

  4. bchop relinks the program over and over with different combinations of good/bad object files and uses the test script to evaluate the executable resulting from each combination. It does a binary search through the possible combinations for the object file(s) which cause the program to produce bad output.

bchop is available from the NAS www site.
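
The search in step 4 can be sketched in a few lines of Python. This is not bchop itself: it assumes exactly one object file is at fault, and the relink-and-test step is abstracted into a caller-supplied function (here called passes_test, a hypothetical name) that reports whether the program behaves when a given set of files is taken from ./bad and the rest from ./good.

```python
def find_bad_object(objects, passes_test):
    """Binary search for a single miscompiled object file.

    objects:     list of object file names present in both ./good and ./bad
    passes_test: callable taking the set of names to link from ./bad (all
                 others come from ./good); returns True if output is good
    Returns the culprit's name, or None if the all-bad build passes.
    """
    candidates = list(objects)
    if passes_test(set(candidates)):
        return None
    while len(candidates) > 1:
        mid = len(candidates) // 2
        half = candidates[:mid]
        if passes_test(set(half)):
            candidates = candidates[mid:]   # culprit is in the other half
        else:
            candidates = half               # culprit is in this half
    return candidates[0]
```

In real use, passes_test would relink the executable from the chosen mix of ./good and ./bad objects and run the user's test script on it. With N object files this needs about log2(N) relinks rather than N.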

Quick-Tip Q & A


A: {{ What mode should you give a directory so that members of your
      unix permission group can create files in it, edit their own and
      other members' files, remove and rename their own files, but not
      remove or rename anyone else's files? }}

   Set the "sticky-bit" (1xxx).

   Mode: 1770   (denies world permissions)
   Mode: 1775   (gives world read/execute permission)

   Example:

   denali$ mkdir Dir_Sticky
   denali$ chmod 1770 Dir_Sticky
   denali$ ls -ld Dir_Sticky
   drwxrwx--T   2 baring   staff     4096 Oct 11 14:49 Dir_Sticky/

   The "T" in the "ls" output indicates that the sticky-bit is set.  
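
The same directory can be set up from a script. The Python sketch below uses only the standard os and stat modules; the function name is made up for illustration. The mode is applied with chmod after mkdir so that the process umask cannot strip any of the requested bits.

```python
import os
import stat
import tempfile


def make_sticky_group_dir(path, world_readable=False):
    """Create `path` with the sticky bit set and return its mode string
    in the same form that "ls -l" prints."""
    os.mkdir(path)
    mode = 0o1775 if world_readable else 0o1770
    os.chmod(path, mode)
    return stat.filemode(os.stat(path).st_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(make_sticky_group_dir(os.path.join(tmp, "Dir_Sticky")))
```

The sticky bit itself is stat.S_ISVTX, so a script can also test for it directly with os.stat(path).st_mode & stat.S_ISVTX.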


Q: You telnet to a remote site, and suddenly your backspace key
   produces funny text instead of spacing back.  How do you fix this?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.