ARSC HPC Users' Newsletter 252, August 16, 2002

WOMPAT Reviews

The Workshop on OpenMP Applications and Tools (WOMPAT2002) was held at ARSC last week. Thanks to Tom Logan and Jim Long for submitting their notes:

#
# Tom Logan's review:
#

The WOMPAT2002 workshop was attended by many OpenMP experts and experienced programmers, so the discussions about the current state and the future of OpenMP were lively and informative. The workshop was organized as one day of tutorials and two days of presentations and discussions.

Tutorial 1:

Dual-Level Parallelism Using MPI and OpenMP
Thomas Oppe and Carrie Mahood, ERDC

This was an informative session that covered issues in using MPI and OpenMP in the same programs. It also included discussions on parallel architectures, parallel programming strategies, processes and threads, an introduction to OpenMP, a Laplace equation example, several OpenMP case studies, and MLP (a multi-level parallelism package developed out of NASA Ames). The tutorial ended with a set of general guidelines for dual-level parallel programming.

Overall this session contained a lot of information - it would be a good introduction for someone just learning about OpenMP and combined OpenMP/MPI programs that effectively take advantage of modern distributed shared memory systems (DSMs).

Tutorial 2:

Advanced OpenMP Programming
Tim Mattson & Sanjiv Shah, Intel

The second tutorial was more of an "in the trenches" session on how to program with OpenMP. It was obvious that Tim and Sanjiv are very experienced with OpenMP (they should be - they helped write the language specs and implement the compilers). Here are some notes that I took:

  • OpenMP 2.0 standard is finished (the C/C++ 2.0 standard has been released, but not implemented as of yet)
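  • New in 2.0 is the "copyprivate" clause (used with "single") - it acts like a broadcast, copying a value from the one thread that executed the single block to the corresponding private variable of every other thread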
  • The abstract OpenMP machine includes
    • Multiple PEs
    • Shared Flat Address Space
    • Lightweight processes
  • The reality of machines is
    • NUMA architectures are NOT flat
    • Cache hierarchies are NOT flat
  • The best way to use OpenMP is to NOT use the clauses. Instead, declare any private variables inside the parallel section (in C/C++) or move the work into a subroutine (in Fortran)
  • Most of the time threads that are released from a parallel region are kept in a "thread pool." One of the issues with implementing OpenMP is deciding what these threads do when not working (e.g. spin wait, go to sleep, or spin for a while and then go to sleep). It was suggested that rather than leave this to the OpenMP implementor, controls should be incorporated into the OpenMP specification using environment variables.
  • The tutorial briefly covered different aspects of the OpenMP language, current implementations, and OpenMP optimizations. Here are some high-level tidbits:
    • Scheduling - use static scheduling unless load balancing is a real issue. If it is, you should use static with a chunk size specified; the dynamic and guided schedule types carry too much overhead.
    • Synchronization - Just don't use synchronization constructs if at all possible, particularly 'flush'
    • Common Problems with OpenMP programs
      • Too fine grain parallelism
      • Overly synchronized
      • Load Imbalance
      • True Sharing (ping-ponging in cache)
      • False Sharing (again a cache problem)
    • Tuning Keys
      • Know your application
      • Use only the parallel do/for statements where you can
      • Everything else in the language only slows you down, but is a necessary evil.
    • Some example overheads (in clock cycles)
            Thread Id          10-50
            Static Do/For      100-200
            barrier            200-500
            Dynamic Do/For     1000-2000
            Ordered statement  5000-10000
    • Use a profiler on your code whenever possible. This will point out bottlenecks and other problems.
    • Common correctness issues are race conditions and deadlocks.
    • Remember that reduction clauses need a barrier - so make sure not to use the 'nowait' clause when a reduction is being computed.
    • Don't use locks if at all possible! If you have to use them, make sure that you unset them!
    • If you think you have a race condition, run the loops backwards and see if you get the same result. (I think this is a pretty cool idea).
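The last tip can be tried on plain serial code before any directives are added. The sketch below is not from the tutorial, and the function names are hypothetical. It runs one loop with independent iterations and one with a loop-carried dependence in both directions. Only the dependent loop changes its answer, which is the warning sign that a parallel version would race.

```c
#include <string.h>

/* Independent iterations: out[i] depends only on in[i].
   Passing reverse != 0 walks the loop from the last index down. */
void double_all(int *out, const int *in, long n, int reverse)
{
    long i;
    if (!reverse) {
        for (i = 0; i < n; i++)
            out[i] = 2 * in[i];
    } else {
        for (i = n - 1; i >= 0; i--)
            out[i] = 2 * in[i];
    }
}

/* Loop-carried dependence: a[i] reads a[i-1], which another
   iteration may or may not have updated yet. */
void running_sum(int *a, long n, int reverse)
{
    long i;
    if (!reverse) {
        for (i = 1; i < n; i++)
            a[i] += a[i - 1];
    } else {
        for (i = n - 1; i >= 1; i--)
            a[i] += a[i - 1];
    }
}

/* Returns 1 when two int arrays of length n hold identical values. */
int same_result(const int *p, const int *q, long n)
{
    return memcmp(p, q, (size_t)n * sizeof *p) == 0;
}
```

With the input {1, 2, 3, 4}, double_all gives {2, 4, 6, 8} in either direction, while running_sum gives {1, 3, 6, 10} forward and {1, 3, 5, 7} backward. A matching result does not prove the loop is safe, but a mismatch is a strong hint that it is not.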

Over the following two days of the workshop, 11 papers were presented and many discussion sessions were held.

Rather than give details of the papers, I will refer readers of this review to the forthcoming publication (watch the cOMPunity web site).

The discussions included topics such as how to teach OpenMP, status and future of implementations, a cOMPunity status report (the OpenMP user group), and what is next for OpenMP.

As for the main points that I took away from the conference:

  1. OpenMP is DEFINITELY the simplest parallel programming language that I have ever seen.
  2. OpenMP can be rather effective for small-scale parallelism. On 4 to 16 processors, it gets decent performance and cannot be beat for ease of programming.
  3. OpenMP is not very useful for large-scale parallelism. It is difficult to get good parallel efficiency for any significant number of processors.
  4. The fact that OpenMP is designed for a "flat address space" that doesn't really exist in modern computer systems is probably the biggest drawback to true scalable parallel performance.

#
# Some notes from Jim Long:
#

When to use OpenMP under MPI?

  1. When there is not enough memory on a node for all processors to host an MPI process, there is usually enough memory left for those unused processors to run OpenMP threads.
  2. When using vendor supplied libraries that spawn threads.
  3. When a larger number of MPI processes would put unacceptable stress on I/O and/or file systems, i.e., the optimal number of MPI processes is allocated and there are CPUs left over that could run threads.
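The first case comes down to simple arithmetic. As a rough sketch only, the helper below works out how many MPI processes fit in a node's memory and how many OpenMP threads each could then run on the leftover processors. The struct, the function name, and the assumption that extra threads need no significant extra memory are all made up for illustration.

```c
/* Hypothetical helper: how to fill one node when memory, not the CPU
   count, limits the number of MPI processes. */
struct node_plan {
    int mpi_procs;        /* MPI processes that fit in node memory */
    int threads_per_proc; /* OpenMP threads each process can run   */
};

struct node_plan plan_node(long node_mem_mb, long proc_mem_mb, int cpus)
{
    struct node_plan plan = {0, 0};
    long fit;

    if (node_mem_mb <= 0 || proc_mem_mb <= 0 || cpus <= 0)
        return plan;                 /* nothing sensible to plan */

    fit = node_mem_mb / proc_mem_mb; /* processes that memory allows */
    if (fit > cpus)
        fit = cpus;                  /* never more than one per CPU */
    if (fit == 0)
        return plan;                 /* even one process does not fit */

    plan.mpi_procs = (int)fit;
    /* Assumption: CPUs without an MPI process are shared out evenly
       as OpenMP threads; any remainder is left idle. */
    plan.threads_per_proc = cpus / plan.mpi_procs;
    return plan;
}
```

For example, a 32-processor node with 32768 MB of memory and a code needing 4096 MB per MPI process gives 8 MPI processes with 4 threads each.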

Free OpenMP compilers available:

p690 Cache

Here are cache sizes for IBM's p690 processor in the Regatta compute server:


L1 cache:
128KB per CPU (chip), or 64KB per processor (there are 2 processors per chip). There are 4 CPUs per MCM (multi-chip module) and 4 MCMs in a 32-processor system, which gives 64 x 2 x 4 x 4 = 2048KB in total. But the number that is important is the 128KB per processor pair, or 64KB per processor.

L2 cache:
1.41MB per chip, shared between the two processors. The big thing here is that this is shared space.

L3 cache:
128MB per MCM, made up of four modules, each of which contains 2 x 16MB of merged-logic DRAM. So each CPU (2 processors) shares 32MB of L3 cache, each MCM has 128MB of L3 cache available to it, and a 4-MCM system has a total of 512MB of L3 cache.
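The totals above follow directly from the per-processor and per-chip figures. Here is a small sketch that redoes the arithmetic. The constant and function names are made up for this example, and the figures are simply the ones quoted in the text.

```c
/* Figures quoted in the text for a 32-processor p690. */
enum {
    PROCS_PER_CHIP    = 2,   /* two processors per chip (CPU)      */
    CHIPS_PER_MCM     = 4,   /* four chips per multi-chip module   */
    MCMS_PER_SYSTEM   = 4,   /* four MCMs in a 32-processor system */
    PER_PROC_CACHE_KB = 64,  /* the 64KB-per-processor cache       */
    L3_PER_CHIP_MB    = 32   /* L3 shared by the two processors    */
};

/* Number of processors in the full system. */
int total_processors(void)
{
    return PROCS_PER_CHIP * CHIPS_PER_MCM * MCMS_PER_SYSTEM;
}

/* System-wide total of the per-processor cache, in KB. */
int per_proc_cache_total_kb(void)
{
    return PER_PROC_CACHE_KB * total_processors();
}

/* L3 cache available to one MCM, in MB. */
int l3_per_mcm_mb(void)
{
    return L3_PER_CHIP_MB * CHIPS_PER_MCM;
}

/* L3 cache in the whole system, in MB. */
int l3_total_mb(void)
{
    return l3_per_mcm_mb() * MCMS_PER_SYSTEM;
}
```

These reproduce the 2048KB, 128MB, and 512MB totals given above.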

For (much) more on the p690, see:

(The cache details are probably there somewhere... but I sure couldn't find them.)

Thanks to Rich Hickey for help with this.

Schedule ARSC Faculty Camp Week 2

As announced in the last newsletter, local Fairbanks users have been invited to attend open lectures in ARSC Faculty Camp 2002. Please contact Guy Robinson in advance so we have an idea of numbers and can let you know of changes.

All events are held in 109 Butrovich Building, UAF Campus unless indicated.

Faculty Camp Schedule, Week 2.

Monday, 19th August.
    LUNCH: Brown Bag with other ARSC users.

    14:00-15:00  Creating and Validating an Ocean Model.
                  Kate Hedstrom (ARSC Oceanographic Specialist).
    15:00-16:00  SMS, a Parallel Library.
                  Kate Hedstrom (ARSC Oceanographic Specialist).

Tuesday, 20th August.
    9:00-10:00   Performance Monitoring and Tuning Tricks.
                Guy Robinson (ARSC MPP Specialist).

Wednesday, 21st August: Bioinformatics Special Topic Day.

    10:00  ARSC introduction. Tom Baring/ARSC
    10:15  ARSC Presentations: Jim Long/ARSC and Tom Baring/ARSC 
            describe recent work and benchmarks.
    11:00  Nat Goodman: Bioinformatics Overview.
    12:00  Brown Bag lunch discussions.
    13:00  Jack Collins/NCI: NCI Programs and Services Provided.
    13:30  Jack Collins/NCI: NCI Web Demonstration.
    14:00  ARSC Faculty Camp: Needs and Plans Descriptions.
            Robert Hubley, ISB.
            Jeff Shrager, Stanford.
            Adlai Burman, IAB/UAF.
    14:30  Continued Discussions.

Thursday, 22nd August.
    14:00-15:00   Discussion: Future Developments.
                HPC Growth, what is happening at ARSC.  Future
                technology, surviving change.  The GRID, its impact on
                me?  Necessary skills for students/researchers.  AGN,
                future collaborative technology.

Construction Cam

If you're bored with the Iowa Farmer's CornCam, you might check out the UAF Museum Expansion Construction Cam: (linked from: )

or the UAF Geophysical Institute's Weather Cam, with a view of the roof of ARSC's own Butrovich Building, a parking lot, and the Tanana Flats:

I don't get it

(Richard Griswold's email signature:)

There are only 10 types of people who understand binary - those who do and those who don't

Quick-Tip Q & A

A:[[ I'm using "vi" and want to "yank" some text in one document, and
  [[ "put" it into another.  I tried this:
  [[    $ vi file1     # start editing file1
  [[     22j          # move to desired text
  [[     y10y         # yank next 10 lines
  [[     :n file2     # open file2
  [[     p            # attempt to put the 10 lines
  [[ It just beeps at me. This is such an obvious operation, there
  [[ must be a way... Ideas?

# Thanks to Paul Mercer, Derek Bastille, Kate Hedstrom, and Rich
# Griswold. There was general agreement on two solutions:

First Solution (from Kate's message):
You are trying to use the default unnamed buffer. You have to use one of
the 26 named buffers:

  % vi file1 file2         # edit two files
    22j                    # jump 22 lines
    "a10yy                 # yank ten lines into buffer a
    :n                     # next file
    "ap                    # put buffer a 

I like to use the named buffers for something like adding #ifdef lines,
where buffer "a" gets the #ifdef, buffer "b" gets the #else, and buffer
"c" gets the #endif. Using more than three or four gets a bit unwieldy,
but there are 26, named "a" to "z".

Second Solution:
Use vim, which supports pasting the unnamed buffer across files, exactly
as described in the question.

Q: I thought OpenMP was supposed to be easy!  It won't even compile. 
   What's wrong???

   Here's the relevant loop, printed with line numbers, followed by the
   error message:

     +32  #pragma omp for reduction(+:overallsum)
     +33      for (n = 0; n != ARRSZ; n++) {
     +34        overallsum += array[n] ;
     +35      }

  ibmsp$  xlc_r -qsmp=omp -o openmp_tester openmp_tester.c
    "openmp_tester.c", line 33.17: 1506-818 (S) Controlling expression
    of the for loop is not in the canonical form.

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.