ARSC T3E Users' Newsletter 154, October 30, 1998

CUG Origin Workshop



[ Thanks to ARSC Director of User Services Barbara Horner-Miller and
  ARSC System Storage Specialist Gene McGill for sharing their CUG
  reports with us.  Below are excerpts.  Many of the papers will
  appear later on the CUG web site:
  http://www.cug.org/ ]

From Barbara Horner-Miller's Report

The workshop ran as a single track from Sunday, October 11, at 8:00am until Tuesday, October 13, at 12:00pm. It covered the full spectrum of Origin issues and included 3 tutorials. There were 157 attendees from 51 sites, with no breakdown of how many sites actually run Origins versus how many came just to hear the issues. One bit of Cray history that I found interesting is covered by the following table. It demonstrates that while CPU cycle times keep shrinking, memory latency has stayed roughly constant, so the number of operations required to cover that latency keeps growing.


Machine        Latency    CPU Period    Ops needed to
Name             (ns)        (ns)       cover latency
-----------    -------    ----------    -------------
CDC 7600         275         27               10
CRAY-1           150         12               24
CRAY J90         310         10               62
Origin 2000      300          5              117
CRAY T3E         280          1.7            340
CRAY SV1         300          3.3            360

Origin Optimization and Programming Differences:

  • Tools: Miser replaces Atexpert; timex replaces time; perfex replaces hpm.
  • Divide by zero generates an interrupt and continues processing rather than aborting.
  • Two processors share memory on a node, so memory conflicts are more common.
  • Cache is per processor, therefore the more processors, the more cache.
  • Memory is loaded by cache line at 16 words/line.
  • Stride 1 yields best results (see the sketch following this list).
  • Grouping together data used at the same time helps cache usage.
  • Avoid powers of two sized arrays.
  • The compiler will pad arrays to keep calculations from beginning in the middle of a cache line.
  • Prefetch is on with -O3 optimization or by compiler directive.
  • Default optimization is -O0, far too slow. Use -O2 or -O3 optimization.
  • Subroutine calls are almost free.
  • Users can beat the libraries if their usage is sophisticated enough.
  • To see what the compiler is doing, use PHASE: flist=on.
  • Co-Array Fortran and SHMEM are available.
  • There are no hardware barriers, so "randomly placed" barriers are very expensive.
  • Can't mix SHMEM and MPI in the same program; MPI must be used across systems.
  • Use SHMEM_GET not _PUT; _GET8 is preferred for performance.
  • Loop fusion provides better register/cache reuse; remove temporaries if used.
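
Several of these points - stride-1 access, grouping data, and padding
away from powers of two - come together in the following minimal
Fortran sketch. The program and array names are hypothetical, not
from the talks:

      program stride1
      integer nmax, m
c     Pad the leading dimension to 1025 rather than the power of two
c     1024 to reduce cache conflicts between columns.
      parameter (nmax = 1025, m = 100)
      real a(nmax, m)
      integer i, j
c     Fortran is column-major: the leftmost subscript varies fastest
c     in memory, so keep it in the innermost loop for stride-1 access.
      do 20 j = 1, m
         do 10 i = 1, nmax
            a(i, j) = 2.0 * real(i + j)
   10    continue
   20 continue
      print *, a(1, 1), a(nmax, m)
      end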

Daniel Pressel

  • Cache and TLB Tuning: Profile. Remove unneeded data motion. Reorder indices & loops. Merge data that is always used together. Split arrays that contain multiple distinct groups. Use transposes where appropriate.
  • Remove scratch arrays from COMMON blocks and from SAVE statements.
  • Pipeline Tuning: Profile. Use sparingly. Loop unrolling and reordering.
  • Merge loops to give the compiler more to bite on.
  • Parallelization: Use loop-level parallelization (c$doacross). Move parallelization to outer loops. (See the sketch following this list.)
  • Miscellaneous comment: the system runs better if my job sleeps a few minutes every day (or every couple of days? I didn't write it down exactly and now can't remember, but you get the point).
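
As a sketch of that loop-level approach - assuming the MIPSpro-style
c$doacross directive named above, with hypothetical arrays -
parallelizing the outer loop gives each processor a large block of
contiguous, stride-1 work:

      subroutine vadd(a, b, c, n, m)
      integer n, m, i, j
      real a(n, m), b(n, m), c(n, m)
c     Parallelize the OUTER loop: each CPU gets whole columns, and
c     the inner loop stays stride 1.
c$doacross local(i, j), share(a, b, c, n, m)
      do 20 j = 1, m
         do 10 i = 1, n
            c(i, j) = a(i, j) + b(i, j)
   10    continue
   20 continue
      end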

Origin 2000 Status:

  • 128 PE Systems: ~100 systems are operational at ~40 sites.
  • 256 PE Systems: Operational at SGI.
  • 512 PE Systems: Under development as a part of SSI.

From Gene McGill's Report

First, a glossary of some terms used at the conference:

  • Architect: The marketeers have pirated this noun and turned it into a verb because it is no longer dazzling enough to say someone "designed" a system - now they "architect" it. Or if they are doing it over, they "re-architect" it.
  • Partial affinity: As in "some software packages only have a partial affinity for modules." This means they don't work with modules.
  • Stabilized: A euphemistic term indicating death of a product. A product is "stabilized" when no new features are added and problems will only be fixed on a limited basis for a limited period of time. In other words, it's dead.

What have we done to the English language?

Conference Overview:

This was a very technical and valuable conference. The attendees were very open, clearly expressing both the good and bad about the Origin2000. There have been some great successes, but many sites, especially the larger ones, are experiencing a bit of pain.

A summary of some of the presentations:

Tutorial: Origin2000 Optimization, Charles Grassl, SGI

Grassl gave a good talk on issues to consider, primarily from the perspective of converting traditional Cray vector code to the Origin2000 model. Of course, the NUMA memory model is key, and taking good advantage of level 1 and level 2 cache leads to better performance. If you have to go to memory, it is best to stay local, but remote memory accesses, while suffering greater latency, still have bandwidth close to that of local memory. Stride-1 accesses take best advantage of cache, and power-of-two sized arrays are to be avoided. Grassl gave quite a bit of detail, more than my note-taking ability could keep up with. He promised to make the slides available from the CUG web page, with pointers to other sources of optimization information.

Welcome, Sally Haerer, CUG President, Gary Jensen, NCSA, and Mick Dungworth, SGI

Haerer gave a brief welcome speech and introduced Jensen.

Jensen welcomed everyone to Denver, pointing out that Denver is the home of the Super Bowl champion Broncos. Jensen explained that the motivation for the Sunday conference was that too many sites can't afford to have people away too long, and it was too difficult getting management to approve Saturday travel, even though it clearly saves money. Then he introduced Dungworth.

Dungworth gave the 'real' reason for the Sunday conference - the NFL has sunk to such a sorry state that the Broncos can win the Super Bowl and most Americans would rather work on Sunday than watch football.

Ocean Nest Grid Models and Moderately Parallel Environment, Germana Peggion, University of Southern Mississippi

Peggion spoke of some of the problems of ocean modeling, including dealing with boundary conditions and handling changes in grid resolution. She is working on a Gulf of Mexico model and must deal with the boundaries of the Yucatan Straits and the Florida Straits. The GOM model is at 20 km resolution, but the coastal areas are modeled at 4 km, with some inlets at 1 km. The talk was rather mathematically oriented and did not spend much time on Origin-specific details.

Optimization and Parallelization of a vector code (C90) for Origin2000 Performance: What we accomplished in 7 days, Punyam Satya-narayna, Raytheon Systems Company at ARL DSRC, Aberdeen Maryland, Phil Mucci, Computer Science Department, University of Tennessee, Knoxville, Ravi Avancha, Mechanical Engineering Dept., Iowa State University, Ames Iowa

Punyam described how the three of them worked at their separate labs optimizing code that Avancha had developed. The code originally ran quite well on a T90, achieving a peak of 518 MFLOPS on a single CPU with vector lengths of 96. In seven days, they improved the code enough to cut its first O2K run from 60 seconds to 26. Most of the optimizations were for single-CPU performance ("parallelization will be the next 7 days"). Pfa was not effective.

Loop Level Parallelism Using Moderate Sized Parallel Processor: Performance Issues, Daniel M. Pressel, Computer Scientist, U.S. Army Research Laboratory

Pressel discussed various issues including optimization and system throughput management. The optimization issues followed the same themes as previous speakers: better use of cache, reducing TLB misses, merging arrays that are used as a group, and splitting arrays that contain multiple distinct groups. He also recommended minimizing scratch arrays and processing them a row at a time instead of a plane at a time. Scratch arrays should not be in COMMON blocks or marked SAVE. Avoid paging like the plague - it drags you and everybody else down. He recommended programming checkpoint/restart into the application itself to save time and space - system-level checkpoint/restart is too slow and saves too much data. [Interesting - ARSC's T3E checkpoint seems to perform quite well.] He pointed out that the Origin system isn't like UNICOS systems that stay up forever, and he recommends that any long-running job checkpoint once per day.
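
Application-level checkpointing of this sort can be as simple as
periodically dumping the handful of arrays needed for a restart.
Here is a minimal Fortran sketch (all names hypothetical); it writes
far less than a full system-level checkpoint image would save:

c     Write a restart file holding only what is needed to resume:
c     the step counter and the solution arrays.
      subroutine ckpt(u, v, n, step)
      integer n, step
      real*8 u(n), v(n)
      open(99, file='restart.dat', form='unformatted')
      write(99) step, n
      write(99) u, v
      close(99)
      end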

Managing Origin Resources using Job Performance Monitor, Michael Shapiro, NCSA

Shapiro has developed tools to better track resources on a running system; one is an enhanced ps command. Based on its output, they have developed better process-limit mechanisms and accounting abilities. The tools are available at ftp://ftp.ncsa.uiuc.edu/outgoing/mshapiro (files psd.tar and ilimit.tar).

NCAR Experiences with the Origin2000-128 CPU 250 MHz, Mary Ann Ciuffini

Ciuffini gave a very good, very detailed talk, mostly about problems they encountered bringing the O2K into production. Her slides are better viewed directly than through my summary. See the NCAR SCD web pages: http://www.scd.ucar.edu/hps/PAPERS/NCAR.O2000_981011/ and http://www.scd.ucar.edu/hps/papers.html

Shared Memory Multi-Level Parallelism for CFD, OVERFLOW-MLP: A Case Study, James Taft, Sierra Software, Inc., NASA Ames Research Center

Overflow consumes 25% of the cycles at NASA Ames. It is pure vector code that can consume a 16-processor C90, sustaining over 4.6 GFLOPS in the process. A complete aircraft model may have 30,000,000 points. Taft worked on converting it to run on the Origin using multi-level parallelism with forks and barriers. The code converted to the O2000 very well, running at 10 GFLOPS on a 128-processor O2K. This is without much single-processor optimization, which is expected to give another 2X speedup. It scales linearly up to 128 processors, and beyond that it could scale super-linearly, since more processors mean more cache. NASA is about to release MLPlib to make conversions like this easier.

CLK_TCK, Cycle Time, IRTC... Which Clock?

One of our readers was doing some timings with IRTC and discovered that CLK_TCK was reporting 75,000,000. He was expecting 450,000,000, since the clock speed on the T3E-900 is 450 MHz.

Here are results from various interface routines on yukon:

  • If you execute the command "sysconf" you'll see this:
    
      Hardware: 
            [ ... ]
            Cycle time in nanosecs (LPE 0x060) ... 2.2220
            Clock ticks per second (LPE 0x060) ... 75000000
            [ ... ]
    

    Inverting the cycle time yields 450 MHz, the expected clock speed of the T3E-900.

    The "Clock ticks" value of 75Mhz is the clock module oscillator frequency--prior to its multiplication in hardware by 6.

  • If you use the POSIX function "pxfsysconf" to determine the clock speed (as shown in newsletter #116), you'll be accessing the "CLK_TCK" global variable.

    CLK_TCK gets the value 75,000,000.

    At the time we printed issue #116, yukon was a T3E-600, and CLK_TCK returned the anticipated value of 300,000,000.

  • If you use the "sysconf" function you'll get this:
    
      _SC_CLK_TCK       (the number of clock ticks per second) gets   75000000.
      _SC_CRAY_CPCYCLE  (the CPU cycle time in picoseconds)  gets     2222. 
    
  • If you use:
    
      IRTC()
    

    to count the number of clock ticks that occur in a second of wall-clock time, you'll get 75,000,000.

  • If you use the "target" system call:
    
      target(MC_GET_TARGET, &data)  
    

    as shown in a C program given on the man page for "target", you'll find that the field:

    
      mc_clktck 
    

    gets the value 450,000,000. And that the field

    
      mc_clk
    

    gets the value 2222 (picoseconds), which yields 450 MHz when inverted.

If you're using any of the above commands/functions, be sure that your constants are consistent.

In particular, if you time your program using IRTC and then convert to seconds by dividing by CLK_TCK (as derived from the POSIX function), you're okay. If you've hardwired a clock speed of 300,000,000 or 450,000,000 (or 600,000,000, if you're lucky enough) into your code, you'd better double-check your timings.
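
For instance, a consistent IRTC/CLK_TCK pairing looks like the
following minimal Fortran sketch (assuming the Cray IRTC intrinsic
discussed above; the 75,000,000 is yukon's current CLK_TCK and
should really be obtained at run time, e.g., via pxfsysconf as shown
in issue #116):

      program tictoc
      integer*8 t0, t1, irtc
      real*8 seconds, s
      integer i, clktck
c     Ticks per second as reported on yukon today; better to query
c     this at run time than to hardwire a clock speed.
      parameter (clktck = 75000000)
      t0 = irtc()
      s = 0.0d0
      do 10 i = 1, 1000000
         s = s + dble(i)
   10 continue
      t1 = irtc()
      seconds = dble(t1 - t0) / dble(clktck)
      print *, 'work result:     ', s
      print *, 'elapsed seconds: ', seconds
      end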

AAAS/IARC

The AAAS conference and IARC Inauguration occurred this week at UAF with over 200 attendees primarily from the U.S. and Japan. "Images from the conference"--including one of Guy (fourth image down on the left, second "guy" on the right)--are available at:

http://www.gi.alaska.edu/aaas/event.html

PLAPACK Update


[ This came in yesterday... ]
> During the last 2 1/2 years the PLAPACK project at UT-Austin has
> developed an MPI based Parallel Linear Algebra Package (PLAPACK)
> designed to provide a user friendly infrastructure for building
> parallel dense linear algebra libraries.  
> 
> The BETA release of R1.2 has a number of important additions:
> 
> 1) In addition to the general and positive definite (dense) matrix
> solver in R1.1, we now feature a QR decomposition based linear least
> squares solver.
> 
> 2) PLAPACK now includes a Fortran interface.
> 
> 3) Although R1.1 was already competitive with other parallel dense
> linear algebra packages, performance has noticeably improved:
> 
>   Performance in MFLOPS/sec/processor on 16 node T3E-600 (300 MHz),
>                   streams turned on
>
>            (Cholesky)    (LU w/ pivot)   (LU w/ pivot)      (QR)
>           double real     double real    double complex  double real
>     n     R1.1   R1.2    R1.1   R1.2     R1.1   R1.2    R1.1   R1.2
> -------+------------+--------------+---------------+--------------
>  2000      90    130      66     79       *     178      *      73
>  5000     195    253     174    212       *     332      *     209
>  7500     245    303     228    268       *     376      *     272
> 10000     280    333     268    307       *     397+     *     312
> 12000     300    346            327       *      *       *      *
> 
> (For the above performance results, complete solves were performed,
> including related triangular solves. +The R1.2 performance for LU
> double complex attaining 397 MFLOPS/sec/processor was for a problem
> size of 9500.)  This performance improvement was primarily due to a
> reduction in communication overhead and more sophisticated
> implementation of the factorization routines.
> 
> PLAPACK has been ported to platforms ranging from parallel
> supercomputers like the Cray T3E and IBM SP-2 to Pentium based Beowulf
> class systems running Windows NT or Linux.
> 
> For more information on PLAPACK:
>       http://www.cs.utexas.edu/users/plapack
> 
> The PLAPACK Users' Guide is available from The MIT Press:
>    Robert van de Geijn, "Using PLAPACK: Parallel Linear Algebra
>    Package", The MIT Press, 1997, http://mitpress.mit.edu/
> 
> Coming soon: A Matlab and Mathematica interface to PLAPACK.
> 
> Greg Morrow and Robert van de Geijn
> for the PLAPACK team

Quick-Tip Q & A


A: {{ Why does Fortran array indexing start at 1 while C starts at 0? }}
  We got two responses from readers (Thanks!):
  #####

  Because FORTRAN (FORmula TRANslation, in the history books) was
  written by mathematicians and C was written by computer scientists
  :-).

  #####

  Interesting question.  I assumed that Fortran started at 1 because it
  made more sense in the human-oriented notion of counting and
  indexing, whereas C's implementors were concerned with efficiency and
  realized that counting from 0 made more sense from a machine's point
  of view (for example, it makes the array ~= pointer analogy clearer:
  x[i] == *(x+i);  or most importantly: x[0] == *x).

  Here's another example where counting from 0 makes sense. Imagine
  that you want to permute a vector by simply shifting it by an amount
  "shift".  For example, the following example shows a vector with
  shift = 3.

          1 2 3 4 5 6 ->  4 5 6 1 2 3

  In C, this is pretty clean:

          for (i = 0; i < len; i++) {
            x[i] = y[(i + shift) % len]; }

  In a 1-based indexing dialect of C, it would have to be:

          for (i = 1; i <= len; i++) {
            x[i] = y[((i + shift - 1) % len) + 1]; }

  Note that ZPL represents the next logical step in the evolution,
  since it allows arrays to have arbitrary lower bounds.  :)

  #####
  
  Editor's comment:

  I think it is a cultural matter. Some cultures never had a concept of
  zero. The ancient Mayans of Central America are considered to be the
  first with a true understanding of it. They had a highly sophisticated set
  of numerical notations and understood the concept of the quantity of
  zero a thousand years before anyone else did.

  Do remember that in Fortran 90 you can declare array bounds as you
  wish, using the array subscript features.  I think the most
  important issue is to be consistent.  More codes are now mixing C
  and Fortran, which can be confusing: is the first element number 0
  or number 1?
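
  For example, here is a minimal Fortran 90 illustration (the
  variable names are hypothetical):

          real :: a(10)      ! default lower bound: a(1)..a(10)
          real :: b(0:9)     ! C-style indexing: b(0)..b(9)
          real :: c(-5:5)    ! arbitrary: eleven elements centered on 0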


Q: Here's one for the Minnesotans, Norwegians, and other Arctic
   dwellers out there:

   You're obsessed to the point of absent-mindedness by the excitement
   of parallel programming.  What's the best way to remember that
   you've plugged your car in (so you don't drive away, rip the socket
   out of the wall, and run over your own extension cord)?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.