ARSC T3E Users' Newsletter 188, February 4, 2000

ARSC News Briefs

  • ARSC Employment opportunities:
    1. We're looking for students to serve as consultants for the viz labs.
    2. We're also seeking two full-time User Services Consultants.
    For details, see:

    http://www.arsc.edu/misc/jobs.html

  • "Arctic Visions: Supercomputing in the Far North," an informational video about ARSC was awarded "First Place - Video or Film Feature" in the Aurora Awards sponsored annually by the Alaska Chapter of the Public Relations Society of America.

    Congrats to LJ Evans, Roger Edberg, and Jenn Wagaman, who produced the film, and to the others who contributed. A limited number of copies of the video are available for educational purposes upon request to LJ Evans at ljevans@arsc.edu.

  • Tom Baring spoke at the UA-Anchorage ACM student chapter last week, introducing ARSC, supercomputing, and programming the T3E.
  • We've given many tours recently to various groups, including Cub, Girl, and Boy Scouts; junior and senior high-school classes; UA students; researchers from the State Division of Forestry; forensic experts working in the Balkans; and other VIP visitors.
  • Coming public events:
    • Feb. 17th: The "Crayons" of ARSC will compete again in the Literacy Council of Alaska's annual spelling bee fund-raiser.
    • Feb. 26th: Open house for National Engineering Week in the Duckering Viz Lab.
    • June 7th: The first of our regular Wednesday afternoon summer tours.

Comparison of Languages for Multi-Grid Methods

[ This is the first in a two-part series contributed by Brad Chamberlain of the University of Washington. ]

INTRODUCTION

This past fall, members of the ZPL Parallel Programming Project at the University of Washington conducted a study of language support for dense multigrid-style applications on a few parallel architectures, including the Cray T3E.

The languages compared on the T3E were F90+MPI, Co-Array Fortran (CAF), High Performance Fortran (HPF), and ZPL. The benchmark used for the study was the MG benchmark from the NAS Parallel Benchmark Suite (NPB) v2.3. This benchmark uses a multigrid computation to obtain an approximate solution to a scalar Poisson problem on a discrete 3D grid with periodic boundary conditions. The ARSC T3E was used to conduct the test runs.
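
For readers unfamiliar with this kind of computation, the sketch below shows a single serial residual evaluation on a periodic 3D grid. It is not taken from the NAS source: the grid size, the simple 7-point Laplacian, and the random initial data are illustrative assumptions, whereas the real MG benchmark uses weighted 27-point stencils inside a full multigrid V-cycle.

  ! Minimal serial sketch of one residual evaluation for a Poisson
  ! problem on a periodic 3D grid.  This is NOT the NAS MG kernel
  ! (which uses weighted 27-point stencils and a full V-cycle); the
  ! grid size n and the 7-point Laplacian are illustrative only.
  program mg_residual_sketch
    implicit none
    integer, parameter :: n = 32          ! assumed (toy) grid size
    real, dimension(n,n,n) :: u, v, r
    integer :: i, j, k, im, ip, jm, jp, km, kp
    real :: h

    h = 1.0 / real(n)
    call random_number(u)                 ! stand-in approximate solution
    call random_number(v)                 ! stand-in right-hand side

    do k = 1, n
      km = mod(k-2+n, n) + 1;  kp = mod(k, n) + 1   ! periodic neighbors
      do j = 1, n
        jm = mod(j-2+n, n) + 1;  jp = mod(j, n) + 1
        do i = 1, n
          im = mod(i-2+n, n) + 1;  ip = mod(i, n) + 1
          ! r = v - A u, with A the standard 7-point Laplacian
          r(i,j,k) = v(i,j,k) - ( u(im,j,k) + u(ip,j,k)   &
                                + u(i,jm,k) + u(i,jp,k)   &
                                + u(i,j,km) + u(i,j,kp)   &
                                - 6.0*u(i,j,k) ) / (h*h)
        end do
      end do
    end do

    print *, 'residual norm (toy problem):', sqrt(sum(r*r)/real(n)**3)
  end program mg_residual_sketch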

THE BENCHMARKS

The spirit of NPB v2 is to measure the performance of a specific algorithmic approach rather than to expend effort being clever about finding new ways to obtain the solution. To this end, versions of the benchmark were sought which (a) followed the NAS implementation of the benchmark as closely as possible, and (b) were written by someone with knowledge of the language in question.

Accordingly, the F90+MPI code is the original version written by NAS. The CAF version was written by a member of the CAF team and was developed by simply replacing the MPI calls in the F90+MPI version with equivalent CAF statements or subroutines. The HPF version was written at NASA Ames as part of a project to implement the NAS benchmarks as efficiently as possible in HPF; PGI has identified it as the best implementation of MG for the pghpf compiler of which they are aware. The ZPL version was obtained by making a careful translation of the F90+MPI code into ZPL.

Each implementation of the benchmark was evaluated both in terms of its performance and its expressiveness (i.e., its ability to express the MG computation both cleanly and succinctly). The compilers used on the T3E are summarized here:


  language   compiler   version   command-line options
  --------   --------   -------   --------------------
  F90+MPI    f90        3.2.0.1   -O3                
  CAF        f90        3.2.0.1   -O3 -X 1 -Z <nprocs>
  HPF        pghpf      2.4.4     -O3 -Mautopar -Moverlap=size:1 -Msmp
  ZPL        zc         1.15a
             cc         6.2.0.1   -O3

Note that to achieve portability across numerous platforms, ZPL compiles to ANSI C and then uses the machine's native C compiler (in this case cc) to create the executable.

It should also be noted that the implementations made different assumptions about the execution parameters and when they are bound:

  • The F90+MPI and CAF codes fix both the problem size and number of processors into the code, and the number of processors must be a power of 2.
  • The HPF code fixes the problem size statically, but leaves the number of processors unspecified, and it can be any number.
  • The ZPL code fixes neither the problem size nor the number of processors, and again the number of processors can be any value.

These differences are largely a matter of philosophy and convenience. F90+MPI is a completely general system, but since this code was the original NAS benchmark, it made sense to provide as much information as possible to achieve peak speeds. The constraint that the number of processors be a power of 2 eases the task of writing the local-view code without worrying about awkward processor topologies. CAF is also a general language, but since its version was written by porting the F90+MPI code, it inherited those assumptions. HPF manages the details of data distribution between processors for the user, but still specifies the array sizes statically, as is the norm in Fortran. The ZPL implementation adheres to ZPL's philosophy that the user should not be required to recompile for each problem size and processor set.
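
To make the compile-time binding concrete, here is a minimal sketch in the spirit of the F90+MPI and CAF versions. The module name, parameter names, and values are illustrative assumptions; the NAS distribution generates such parameters with its own build tools, and the ZPL version instead reads the problem size and processor count at run time.

  ! Illustrative sketch only: problem size and processor count are
  ! fixed at compile time, as in the F90+MPI and CAF versions.
  module problem_params
    implicit none
    ! Class A grid at the finest level; nprocs must be a power of 2
    ! for the NAS-style processor decomposition.
    integer, parameter :: nx = 256, ny = 256, nz = 256
    integer, parameter :: nprocs = 256
  end module problem_params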

PERFORMANCE

For this article, performance will be summarized by giving speedup numbers using 256 processors. First, we look at MG's Class A size problem, which has 256x256x256 elements at the finest level. Class A requires a minimum of 4 ARSC T3E processors to obtain sufficient memory. All speedups are computed relative to the fastest 4-processor time (12.612 seconds, obtained by CAF); higher numbers are better; and a speedup of 64 would be considered ideal.

  Speedup of MG Class A (CAF 4-proc time / 256-proc time)
  ---------------------
  F90+MPI: 30.91
  CAF:     50.65                
  HPF:     1.01
  ZPL:     33.28

For Class C, which has 512x512x512 elements at its finest level, a minimum of 16 processors is required to obtain sufficient memory. All speedups are thus computed relative to the fastest 16-processor time (119.20 seconds, obtained by CAF), and a speedup of 16 would be considered ideal.

  Speedup of MG Class C (CAF 16-proc time / 256-proc time)
  ---------------------
  F90+MPI: 10.91
  CAF:     14.98                
  HPF:     --.--
  ZPL:     11.90

Looking at the Class C results, CAF achieves the best speedup at roughly 15. After a sizeable gap, ZPL comes in at roughly 12, and just behind it the F90+MPI version at roughly 11. The HPF version was unable to run on 256 processors due to its excessive memory requirements.

The differences in performance between the languages were found to be due largely to three factors (in no particular order):

  1. Communication Protocol (MPI, SHMEM, or other)
  2. Stencil Optimizations
  3. Base Language (Fortran or C)

Each is addressed in more detail here:
  1. As discussed in past T3E newsletters, SHMEM is a faster communication mechanism than MPI on the T3E and can result in significant improvements in execution time. SHMEM has the disadvantage of being trickier and less portable than MPI for programmers to write by hand. However, compilers like zc and pghpf can take advantage of SHMEM and realize these benefits without the user ever being aware of its existence. The CAF compiler is similar, but uses an even lower-level interface than SHMEM, which yields additional benefit. Thus, for this factor, CAF does best, followed by ZPL and HPF, followed by F90+MPI. (A minimal sketch contrasting the MPI and CAF styles of communication follows this list.)
  2. The original F90+MPI code contains several hand-coded optimizations that eliminate redundant calculations in MG's 27-point stencils, at the cost of somewhat obscuring the code's intent. The same optimization appears in the CAF and HPF versions. Unfortunately, it is not efficiently expressible in ZPL, so the ZPL version uses the naive form. Since stencil computations are extremely common in ZPL programs, this optimization is currently being added to the ZPL compiler, and with it ZPL's speedup number above improves to 14.10. Thus, for this factor, F90+MPI, CAF, and HPF do best, and ZPL does worst. (A sketch of the optimization follows the summary below.)
  3. F90+MPI, CAF, and HPF all use Fortran as their base language, whereas ZPL uses C (as noted above, the ZPL compiler generates ANSI C). Fortran has been shown to generally produce more efficient code than C, owing to its freedom from the aliasing problems introduced by C's pointer arithmetic and lax subscripting. This gives the Fortran-based languages a slight advantage over ZPL.
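
As a concrete (and heavily simplified) illustration of the first factor, the sketch below shows the kind of boundary exchange that dominates MG's communication, written with two-sided MPI and, in the comments, the equivalent one-sided Co-Array Fortran assignment. It is not taken from either benchmark: the array size, the one-dimensional ring pattern, and the variable names are assumptions made for brevity.

  ! Minimal sketch of a 1-D halo exchange, the kind of communication
  ! that dominates the local-view MG codes.  Not taken from the NAS
  ! source; sizes and the ring pattern are illustrative.
  program halo_sketch
    use mpi
    implicit none
    integer, parameter :: n = 1000
    real :: interior(n), halo(n)
    integer :: ierr, rank, npes, left, right

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)
    left  = mod(rank - 1 + npes, npes)          ! periodic neighbors
    right = mod(rank + 1, npes)
    interior = real(rank)

    ! F90+MPI style: two-sided message passing.
    call MPI_Sendrecv(interior, n, MPI_REAL, right, 0, &
                      halo,     n, MPI_REAL, left,  0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

    ! CAF expresses the same transfer as a one-sided assignment, e.g.
    !   halo(:)[right+1] = interior(:)    ! with halo declared (n)[*]
    ! followed by a barrier; the compiler maps this onto the T3E's
    ! low-level remote-memory hardware.

    print *, 'rank', rank, 'received value', halo(1), 'from rank', left
    call MPI_Finalize(ierr)
  end program halo_sketch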

The summary of these three factors is that CAF's performance advantage is due to its use of Fortran as a base language, its optimized stencil computations, and its use of a lower-level, SHMEM-style communication. ZPL suffers primarily from its lack of the stencil optimization and its use of C as a base language. F90+MPI suffers due to its reliance on the MPI interface.
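
The stencil optimization referred to in factor 2 can be illustrated as follows. This sketch is loosely modeled on the hand-optimized Fortran codes but is not taken from them; the grid size and the weights w(d), indexed by "distance class" (0 = center, 1 = face, 2 = edge, 3 = corner neighbors), are illustrative assumptions. The factored form reuses the partial sums u1 and u2 across the inner loop, eliminating redundant additions at some cost to readability; the two forms agree to within rounding, which the final print checks.

  ! Sketch of the hand-coded 27-point stencil optimization, loosely
  ! modeled on the optimized Fortran versions.  Weights and grid size
  ! are illustrative; the benchmark uses its own coefficient sets.
  program stencil_factoring_sketch
    implicit none
    integer, parameter :: n = 32
    real :: u(n,n,n), naive(n,n,n), fact(n,n,n), u1(n), u2(n)
    real :: w(0:3), s
    integer :: i, j, k, di, dj, dk, d

    w = (/ -3.0, 0.5, 0.25, 0.125 /)   ! illustrative distance-class weights
    call random_number(u)
    naive = 0.0;  fact = 0.0

    ! Naive form: 27 loads and 26 adds per point, but easy to read.
    do k = 2, n-1
      do j = 2, n-1
        do i = 2, n-1
          s = 0.0
          do dk = -1, 1
            do dj = -1, 1
              do di = -1, 1
                d = abs(di) + abs(dj) + abs(dk)
                s = s + w(d) * u(i+di, j+dj, k+dk)
              end do
            end do
          end do
          naive(i,j,k) = s
        end do
      end do
    end do

    ! Factored form: partial sums u1/u2 are reused across the i loop,
    ! eliminating redundant additions at the cost of clarity.
    do k = 2, n-1
      do j = 2, n-1
        do i = 1, n
          u1(i) = u(i,j-1,k) + u(i,j+1,k) + u(i,j,k-1) + u(i,j,k+1)
          u2(i) = u(i,j-1,k-1) + u(i,j+1,k-1) + u(i,j-1,k+1) + u(i,j+1,k+1)
        end do
        do i = 2, n-1
          fact(i,j,k) = w(0)*u(i,j,k)                              &
                      + w(1)*( u(i-1,j,k) + u(i+1,j,k) + u1(i) )   &
                      + w(2)*( u2(i) + u1(i-1) + u1(i+1) )         &
                      + w(3)*( u2(i-1) + u2(i+1) )
        end do
      end do
    end do

    print *, 'max difference between forms:', maxval(abs(naive - fact))
  end program stencil_factoring_sketch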

EXPRESSIVENESS

Expressiveness of the languages was judged both quantitatively and qualitatively. The quantitative evaluation was performed by reducing each implementation to the lines of code that make up the core of the computation. These lines were then categorized as declarations, computation, or communication. The results are summarized below:
    
  language   lines   decls       comp        comm
  --------   -----   ---------   ---------   ---------
  F90+MPI     992    168 (16%)   237 (23%)   587 (59%)
  CAF        1150    243 (21%)   238 (20%)   669 (58%)
  HPF         433    129 (29%)   304 (70%)     0 ( 0%)
  ZPL         192     90 (46%)   102 (53%)     0 ( 0%)
    
The first thing to notice is that the languages fall into two general camps: those that provide a local view of the computation, in which the programmer writes per-node code and is therefore explicitly responsible for expressing the data layout and interprocessor communication of the program; and those that provide a global view, in which the programmer essentially writes a sequential computation and the compiler is responsible for the issues of distribution and communication. F90+MPI and CAF both require a local view, while HPF and ZPL provide a global view. The result is that the global-view codes are 2-6 times shorter than the local-view codes.

As can be seen from the figures above, the length of the local-view implementations is primarily due to the large amount of code required to specify communication (roughly 60% of the lines for this benchmark). CAF tends to require slightly more code than F90+MPI because MPI provides high-level communication mechanisms such as reductions; in CAF these operations must be written by hand using its co-array indexing syntax (a small example follows this paragraph). In contrast, the HPF and ZPL programs require far fewer lines, since the programmer can ignore the details of distribution and communication. ZPL's computation portion is more succinct than HPF's due to its use of "regions," which eliminate the need for explicit looping and indexing.
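
A small example of the reduction point made above: with MPI, a global maximum is a single collective call, while CAF spells the same operation out with co-array indexing (shown here only in comments, hedged because the exact CAF syntax of that era varied). The program, variable names, and the linear loop over images are illustrative assumptions, not code from either benchmark.

  ! Sketch: an MPI reduction versus a hand-written CAF equivalent.
  program reduction_sketch
    use mpi
    implicit none
    real :: local_max, global_max
    integer :: ierr, rank

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    local_max = real(rank)                    ! stand-in local value

    ! F90+MPI: one library call does the whole reduction.
    call MPI_Allreduce(local_max, global_max, 1, MPI_REAL, MPI_MAX, &
                       MPI_COMM_WORLD, ierr)

    ! In CAF the same operation is written by hand, roughly:
    !   real :: cmax[*]
    !   cmax = local_max
    !   call sync_all()
    !   global_max = cmax
    !   do img = 1, num_images()
    !     global_max = max(global_max, cmax[img])
    !   end do
    ! (or a log-tree variant for better scaling).

    if (rank == 0) print *, 'global max =', global_max
    call MPI_Finalize(ierr)
  end program reduction_sketch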

This difference between the global and local views is not merely a matter of how much typing one has to do; it is also an issue of complexity. The hardest part of coding most parallel algorithms is correctly specifying data distribution and communication while dealing with boundary conditions, race conditions, deadlock situations, and so on. Thus, the difference in line counts represents not merely a large portion of the code, but an extremely intricate one that distracts from the algorithm at hand.

Naturally, line counts alone are not sufficient to evaluate the expressiveness of the programs, so the code was also scrutinized to get a sense of how clear it is. ZPL was judged the best description of the algorithm, as it is not obscured by data distribution, communication, looping, indexing, and hand optimizations. The claim is that if you want to understand how the MG benchmark works, the ZPL code will be the easiest to read.

The next issue of the T3E Newsletter will contain code samples from the different implementations.

CONCLUSIONS

Some conclusions of this experiment:
  1. (PGI's) HPF is not yet sufficiently advanced to support dense multigrid-style computations. At present, it requires allocating excessive amounts of memory, which interferes both with performance and with the ability to run large problem sizes.
  2. CAF can significantly outperform hand-coded F90+MPI. Moreover, converting an F90+MPI program to CAF can be done without extreme effort, and the performance benefits will probably make that effort worthwhile.
  3. ZPL's performance is better than hand-coded F90+MPI due to its use of the SHMEM interface. In future releases, when the compiler optimizes stencils, it should be competitive with CAF as well.
  4. When writing a parallel algorithm for the first time, F90+MPI and CAF both require the user to manage many low-level details specific to the local view. As a result, a language providing a global view, such as ZPL or HPF, might be a better first cut for implementing the algorithm.
  5. By providing high-level concepts such as regions and a syntax-based performance model, ZPL supports succinct and clear coding without greatly sacrificing performance.

FURTHER INFORMATION

Information on the four languages is available on-line:

  MPI : http://www-unix.mcs.anl.gov/mpi/index.html
  CAF : http://www.co-array.org/
  HPF : http://www.crpc.rice.edu/HPFF/home.html
  ZPL : http://www.cs.washington.edu/research/zpl/

CONTACT INFORMATION

Many details of these experiments had to be omitted to keep this article at a practical length. For more information, please contact brad@cs.washington.edu.

30th International Arctic Workshop

Conference announcement:

  The 30th International Arctic Workshop

  Thursday through Saturday, March 16-18, 2000

  Institute of Arctic and Alpine Research
  University of Colorado, Boulder, CO USA

  The 30th Arctic Workshop will be held at the Institute of Arctic and
  Alpine Research, University of Colorado. The meeting will consist of
  a series of talks and poster sessions covering all aspects of
  high-latitude environments, past and present.

  Previous Arctic Workshops have included presentations on arctic and
  antarctic climate, geomorphology, hydrology, glaciology, soils,
  ecology, oceanography, and Quaternary history.

For registration and details, see:

http://instaar.colorado.edu/AW2000/

Language Challenge 2000 Contest

A technical programming contest, in which any language may be used, has been announced at:

http://www.sdynamix.com

The basic problem is:

Find the optimal initial angle for a trajectory to reach a target at 2000 m to within 0.5 m.

From the announcement:

The best entry in each language will be posted on our site to serve as a barometer for those pondering what language to choose for their technical computing. Questions regarding the contest should be directed to info@sdynamix.com with "Contest 2000" in the "Subject" line.
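
Purely as an illustration of the flavor of the problem, here is a drag-free sketch that bisects on the launch angle. The muzzle speed v0, the neglect of air resistance, and the choice of bisection are assumptions made for this sketch only; the actual contest rules and model are defined at the URL above.

  ! Rough, drag-free sketch of the contest problem (not the official
  ! specification).  v0 and the bisection search are assumptions.
  program trajectory_sketch
    implicit none
    real, parameter :: g = 9.81, v0 = 150.0        ! assumed values
    real, parameter :: dist = 2000.0, tol = 0.5    ! from the announcement
    real :: lo, hi, mid, r
    integer :: iter

    lo = 0.0                     ! radians; range increases on [0, pi/4]
    hi = atan(1.0)               ! pi/4
    do iter = 1, 100
      mid = 0.5 * (lo + hi)
      r = v0*v0 * sin(2.0*mid) / g       ! drag-free range formula
      if (abs(r - dist) < tol) exit
      if (r < dist) then
        lo = mid
      else
        hi = mid
      end if
    end do
    print *, 'launch angle (degrees):', mid * 45.0 / atan(1.0)
    print *, 'range (m):', r
  end program trajectory_sketch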

Quick-Tip Q & A



A: {{ Is it a good idea to compress files which are to be DMF migrated? }}

 
   If your only goal is to save space on tape, it's a bad idea.  
   
   All files are automatically compressed by the tape hardware, so if
   you "pre-compress" you'll be wasting CPU time and creating extra
   work for yourself.  (And, interestingly, compressing a compressed
   file can actually increase its size.)



Q: I tried to authenticate using Kerberos/SecurID and got this message:

     kerberos skew too great

   What does this mean?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.