ARSC HPC Users' Newsletter 207, November 3, 2000



UAF Colloquium Series

The UAF Department of Mathematical Sciences and ARSC are jointly sponsoring a Mathematical Modelling, Computational Science, and Supercomputing Colloquium Series.

The schedule and abstracts for the '00-'01 academic year are available at:

The following events are scheduled for this academic year:
  Nov. 16   Pat Lambert, University of Alaska Fairbanks, "Quasi Monte Carlo Methods"
  Dec. 7    Jon Genetti, San Diego Supercomputer Center, "Bringing Space Into Focus"
  Jan. ??   David Woodall, University of Alaska Fairbanks (Title TBD)
  Feb. 19   Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory (Title TBD)
  Apr. ??   Vince Mousseau, Los Alamos National Laboratory (Title TBD)

Check the web site later, as an event will almost certainly be scheduled for March. Talks normally occur at 2:00 pm in Natural Sciences rm 165.



Here are some SC2000 booths and events that have a connection to this newsletter.

(Unfortunately, we forgot to publish our annual request for this kind of information in the last newsletter. If you participate in SC2000, feel free to send us something after the fact, and we'll run it as a followup.)


Our booth will feature Co-Editor Guy Robinson. Ask him about commuting at 40 below.

Here's the official blurb:

The Arctic Region Supercomputing Center supports the computational needs of researchers within the Department of Defense High Performance Computing Modernization Program, the University of Alaska Fairbanks, other academic institutions and government agencies by providing high performance computing, visualization and networking resources, programming and technical expertise, and training. Areas of specialty supported by ARSC include ocean modeling, atmospheric sciences, climate/global change, space physics, satellite remote sensing, and civil, environmental and petroleum engineering. ARSC collaborates in a number of partnerships, including a joint effort with the U.S. Army Engineer Research and Development Center and SGI to build and evaluate a 512-processor SGI Origin 3800 single-system image. The Arctic Region Supercomputing Center operates a Cray T3E and a Cray SV-1, with visualization resources including a Pyramid Systems ImmersaDesk and a network of SGI workstations located in a video production/training lab and three additional access labs on campus.


This research exhibit will describe the underlying concepts of UPC, an explicitly parallel extension of ANSI C designed to provide both good performance and ease of programming for high-end parallel computers. UPC provides a distributed shared-memory-programming model and includes features that allow programmers to specify and exploit memory locality. Such constructs facilitate explicit control of data and work distribution among threads so that remote memory accesses are minimized. Thus, UPC maintains the C language heritage of keeping programmers in control of and close to the hardware. Among the advanced features of UPC are shared and private pointers into the shared and private address spaces, shared and private data, efficient synchronization mechanisms including non-blocking barriers, and support for establishing different memory consistency models. In addition to its original open-source implementation, UPC has gained acceptance from several vendors who are producing exploratory compilers. Additional information can be found at

TECHNICAL PAPER: Brad Chamberlain
Time: Wednesday, November 8, 3:30 - 5:00 PM
Room: D 274

A Comparative Study of the NAS MG Benchmark across Parallel Languages and Architectures

Bradford L. Chamberlain, Steven J. Deitz, Lawrence Snyder, University of Washington

Hierarchical algorithms such as multigrid applications form an important cornerstone for scientific computing. In this study, we take a first step toward evaluating parallel language support for hierarchical applications by comparing implementations of the NAS MG benchmark in several parallel programming languages: Co-Array Fortran, High Performance Fortran, Single Assignment C, and ZPL. We evaluate each language in terms of its portability, its performance, and its ability to express the algorithm clearly and concisely. Experimental platforms include the Cray T3E, IBM SP, SGI Origin, Sun Enterprise 5500 and a high-performance Linux cluster. Our findings indicate that while it is possible to achieve good portability, performance, and expressiveness, most languages currently fall short in at least one of these areas. We find a strong correlation between expressiveness and a language's support for a global view of computation, and we identify key factors for achieving portable performance in multigrid applications.

BOF: Don Morton and Guy Robinson
Time: Wednesday, 5:30 - 7:00 PM
Location: D267

Don Morton, ARSC affiliate and frequent contributor to this newsletter, has organized this BOF, along with Guy:

Promoting High-Performance Computing Literacy at Low Cost

Despite years of promoting high performance computing, many educational and industrial organizations with interest in this field still lack the expertise and/or leadership to become literate and productive. It has become evident that simply making HPC architectures widely available in the form of high end supercomputers or even low-cost Beowulfs plays a minimal role in developing the intellectual infrastructure necessary for making HPC a productive tool for the masses.

We are interested in initiating programs that proactively reach out to motivated organizations, help them build small local physical HPC infrastructures and then kick-start them in training and organization. Such outreach can help these organizations become independent and self-sustaining in their use of HPC, and ultimately help them to offer their own contributions to the community. You are invited to participate in a discussion of issues in making high-performance computing available to all who are interested. The organizers are currently embarking on a National Science Foundation funded project that proactively promotes and facilitates high performance computing among faculty and students at the state university level, and wish to extend this outreach to smaller schools and industries.

Discussion Areas:

  • What kinds of groups do we know that might be interested in parallel computing, but will likely never get far on their own?

  • What do these groups need to satisfy their interests in parallel computing?

  • What kinds of outreach programs will help satisfy the needs of these groups?

  • How do we organize and standardize so that targeted groups are not "on their own" and so that they can readily migrate between various available HPC environments and keep themselves "technically up to date"?

  • Brainstorming to develop plans for initiatives to prototype programs that successfully bring high performance computing to organizations with a long-term commitment.

  • The overall, broad question is: How do we transform HPC-illiterate groups into active users and contributors in the world of high performance computing?


CUG SV1 Workshop Review

The CUG Fall 2000 SV1 Workshop was held last week.

This 3-day conference focussed on the SV1. About 70 people attended. It was held in the very nice Bloomington (Minnesota) Hotel Sofitel.

Here are some general observations:

Separation from SGI

Cray employees seem content with their separation from SGI. About two weeks ago, they moved back to the pre-90's offices in Mendota Heights, from Eagan.

Purchase by Tera

Jim Rottsolk, President and CEO of Cray Inc. and co-founder of Tera, and Burton Smith, Chief Scientist at Cray Inc. and Tera and principal architect of the MTA system, were extremely involved and accessible at this CUG. Tera and Cray seem to be getting on well, and there was good energy throughout the conference.


People are learning how to use and optimize for the SV1. A first port from a J90 generally yields a speedup in line with the clock speed improvement. Optimizing specifically for the SV1 cache has given the biggest payoffs, as shown in talks by both Cray application engineers and user support engineers from centers.

The SV1 is proving useful as a T90 replacement (2-4 SV1 processors per T90 processor) and, for some problems, it has been a huge success because of its generally larger memory. Results are mixed on the usefulness of MSPs.

Cray application engineers have optimized, and continue optimizing, important industrial applications for the SV1.


The topic of the SV2 was raised in several talks, and it is a major focus. The SV2 will be a descendant of both the SV1 and the T3E and will be true to its name: "Scalable Vector".



ARSC recommends that SV1 users monitor the performance of all significant jobs using, at a minimum, "ja" (job accounting) and preferably using both "ja" and "hpm". A performance history can help you understand and improve your code, and catch problems.

"hpm" (the hardware performance monitor) is a non-intrusive tool that reports on the overall CPU performance of your code. You don't need to recompile your code, and hpm will not affect its performance.

To use it, simply execute your compiled program using hpm. You can do this at the command line:

  chilkoot$ hpm ./a.out

Or you can include it in your qsub scripts, in which case the report will appear in the stderr file for the NQS job. For instance:

  #QSUB -lT 8:00:00            # Request 8 hours
  #QSUB -lM 100MW              # Request 100 megawords
  #QSUB -eo                    # This combines stderr & stdout

  cd $QSUB_WORKDIR             # cd back to directory from which submitted
  hpm ./a.out arg1 arg2        # run the program "a.out" with two args
                               #    and monitor its performance with hpm

Here's a sample hpm report for a short SV1 job (note the new "Cache hits" fields):

  Group 0:  CPU seconds   :   24.71210      CP executing     :     7413630804

  Million inst/sec (MIPS) :      45.31      Instructions     :     1119652779
  Avg. clock periods/inst :       6.62
  % CP holding issue      :      78.90      CP holding issue :     5849042552
  Inst.buffer fetches/sec :       0.10M     Inst.buf. fetches:        2490999
  Floating adds/sec       :      53.90M     F.P. adds        :     1331938466
  Floating multiplies/sec :      53.80M     F.P. multiplies  :     1329512427
  Floating reciprocal/sec :       7.91M     F.P. reciprocals :      195569490
  Cache hits/sec          :      65.52M     Cache hits       :     1619092080
  CPU mem. references/sec :     226.76M     CPU references   :     5603742221

  Floating ops/CPU second :     115.61M
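
For readers keeping a performance history, it may help to see how the rate column of the report follows from the raw counts in the right-hand column. The short Python sketch below recomputes several of the derived figures from the sample report above. The variable names are ours, and the final "hit fraction" is not an hpm field: it is simply cache hits divided by CPU memory references, offered here as an assumption about one way to summarize cache use, not as an official metric.

```python
# Raw counters copied from the sample hpm report above.
cpu_seconds      = 24.71210
cp_executing     = 7413630804   # clock periods spent executing
instructions     = 1119652779
cp_holding_issue = 5849042552
fp_adds          = 1331938466
fp_multiplies    = 1329512427
fp_reciprocals   = 195569490
cache_hits       = 1619092080
cpu_references   = 5603742221


def millions_per_second(count, seconds=cpu_seconds):
    """Convert a raw event count into millions of events per CPU second."""
    return count / seconds / 1.0e6


# Figures that appear in the report's left-hand column.
mips        = millions_per_second(instructions)
cp_per_inst = cp_executing / instructions
pct_holding = 100.0 * cp_holding_issue / cp_executing
mflops      = millions_per_second(fp_adds + fp_multiplies + fp_reciprocals)

# Figures that do NOT appear in the report (our own derived values).
clock_mhz    = millions_per_second(cp_executing)  # implied clock rate
hit_fraction = cache_hits / cpu_references        # assumption: rough cache-use summary

print(f"Million inst/sec (MIPS) : {mips:10.2f}")
print(f"Avg. clock periods/inst : {cp_per_inst:10.2f}")
print(f"% CP holding issue      : {pct_holding:10.2f}")
print(f"Floating ops/CPU second : {mflops:10.2f}M")
print(f"Implied clock (MHz)     : {clock_mhz:10.2f}")
print(f"Cache hits / references : {hit_fraction:10.3f}")
```

The first four values reproduce the report's own figures, which is a useful sanity check when copying counters into a spreadsheet or log. Tracking the floating-point rate and the cache ratio from run to run is one simple way to notice when a code change or a change in problem size has altered performance.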

Quick-Tip Q & A

A:[[ My Fortran 90 program calls several subroutines and functions. I
  [[ wanted to inline them, so I recompiled everything with "-Oinline3".
  [[ Performance was NOT improved but the friendly compiler told me this:
  [[    cf90-1548 f90: WARNING in command line
  [[      -O inline3 is no longer the most aggressive form of inlining.  
  [[ and "explain cf90-1548" told me this:
  [[    The most aggressive form of inlining is now obtained thru
  [[    -Oinline4.  -Oinline3 is a new form of inlining.  -Oinline3
  [[    invokes leaf routine inlining.  A leaf routine is a routine which
  [[    calls no other routines.  With -Oinline3 only leaf routines are
  [[    expanded inline in the program.
  [[ I wish folks would spell "through" correctly, but... back to
  [[ my story...
  [[ I was, of course, very excited to try "-Oinline4", and recompiled
  [[ everything again.  Disappointment!  Performance was NOT improved.
  [[ Next, I recompiled for flow tracing, "f90 -ef".  "flowview" showed
  [[ the most frequently called subroutine.  It was practical, so I
  [[ inlined it MANUALLY and, lo and behold, performance improved
  [[ significantly.
  [[ What am I missing, here?

The compiler needs access to the subroutine source code. If the source
doesn't appear in the same file as the calling subroutine, then you must
tell it where to look using the "-O inlinefrom" option:

  f90 -Oinlinefrom=<FILENAME>

To find out which routines are not being inlined, and why, instruct the
compiler to create a listing file using one of the many report options.
For instance:

  f90 -r2 

Q: Speaking of inlining, it won't work if the subroutine to be
   inlined contains Fortran 90 "OPTIONAL" arguments.  Unfortunately,
   my most-frequently called subroutine does indeed have an OPTIONAL
   argument, named "MASK":


   I've thus replaced SUB_ORIG with two subroutines: one which requires,
   and one which lacks, the OPTIONAL argument.  The code changes were
   trivial, and the new declarations look like this:

        SUBROUTINE SUB_nom  (FIELD)             ! no mask 
        SUBROUTINE SUB_msk  (FIELD, MASK)       ! mask is required

   Now the hard part.  

   The original subroutine was called ~360 times, in some 100 source
   files (of 300 total) which reside in 3 different source directories,
   and it's called in two different ways, depending on the need for the
   optional argument. The original calls look, for example, like this:

        CALL SUB_ORIG (G_BASAL_SALN (:,:)) 
        CALL SUB_ORIG (O_DP_U (:,:,K,LMN), MASK = O_MASK_U (:,:))

   I need to update all these calls with the appropriate replacements:

        CALL SUB_nom (G_BASAL_SALN (:,:)) 
        CALL SUB_msk (O_DP_U (:,:,K,LMN), O_MASK_U (:,:))

   How would you manage this? 

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives: Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.