ARSC HPC Users' Newsletter 202, August 18, 2000

Compiling and Optimizing on the CRAY J90

[[ This is the first in a two part series. Thanks to Jeff McAllister of ARSC for this contribution. ]]

J90 Parallelization with f90

Optimization on Chilkoot, as for any system, means learning some details about the compiler and the code it produces. However, the compiler plays a larger role in code performance here than on the T3E. T3E parallelization is a separate issue, involving a lot of separate work with some explicit sharing scheme (such as MPI or SHMEM) or task/memory distribution according to some standard like HPF.

Chilkoot includes vectorization as an additional level of optimization, but the work of parallelizing and vectorizing is to a large degree automated by the J90 compiler. In addition, vectorization involves many of the same concerns and compiler options, so parallelized code usually gets vectorized in the bargain (though the opposite is not always true).

The main idea to keep in mind when thinking about how your code will work on Chilkoot is how the J90 achieves scalable performance. In J90 parallelized code, one processor sets itself up as a sort of master broker, dividing the work into whatever units the compiler has broken the code into -- often quite small -- and negotiating with the processors it has 'acquired' to find somewhere for the pieces to execute. In short, whatever the compiler can do to split the code into independent chunks will help the code perform better at runtime.

How the J90 parallelizes code is determined by

  • Environment variables
  • Compiler command-line options and compiler directives
  • Factors that make the compiler decide not to parallelize a loop, such as dependencies between iterations and output.

In this, the first of two articles, I'll discuss the first area.

Environment variables

Several environment variables are used by the PVP compiler and executables produced by it. That is, they will only have an effect on the J90.

After some experimentation, I've found that MP_HOLDTIME and NCPUS are the most important:

MP_HOLDTIME defines the number of cycles a processor is held before being released to work on other jobs. The default is 150, which is not very long. A running J90 job does not grab a processor and hold it until the job is completed. Instead, a processor does the section it is given, which may be quite a minuscule section of code, waits MP_HOLDTIME cycles, and is then released to work on other jobs. Re-acquiring the processor (that is, making it available to work on your code again) can add quite a bit of overhead over the life of the process. However, results for individual codes may vary. The ATExpert reports indicate when increasing MP_HOLDTIME may help.

NCPUS defines the number of CPUs the code will request to run on. Generally this value is only used at runtime. However, it appears that this value can affect how the code integrates with some of the tools (such as ATExpert). This is somewhat quirky: sometimes code compiled with the -eX option will crash if NCPUS is not set to 1, but often it works regardless of the setting. To be safe, we recommend setting NCPUS to 1 before compiling with -eX.

NCPUS is used mainly at runtime, though the number of processors "assigned" to your job may be lower under high system loads. At the current time, all CPUs (12) are available in batch and interactive mode, so NCPUS can be anywhere in the range of 1-12. Small codes may get all processors requested, but it is highly unlikely that a large job will acquire all 12. The NCPUS setting will not affect how your job is queued as this is based on time and memory use on the J90.

"Great!" you may think. "Effortless parallelization! I'll be able to get 3x the work done within my limit by setting NCPUS to 12 all the time." Alas, with this ease come hidden costs. Some codes actually run slower, even in wallclock time, on more processors. Also, increasing the number of processors will not speed up unparallelizable sections of code. (I'll get to how the compiler decides which regions of code can be partitioned to an arbitrary number of processors in the next section.)
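To put a rough number on the last point, here is a minimal sketch of the usual back-of-the-envelope bound (commonly known as Amdahl's law, which the article itself does not name). The function name and the 90% parallel fraction are assumptions chosen only for illustration. The model ignores the processor acquisition overhead described above, so real results on Chilkoot will generally be worse than this ideal.

```python
def ideal_speedup(parallel_fraction, ncpus):
    """Upper bound on speedup when only part of the run time can be spread
    across processors (hypothetical helper, not part of any Cray tool)."""
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time no matter how many CPUs are used;
    # only the parallel part shrinks as ncpus grows.
    return 1.0 / (serial_fraction + parallel_fraction / ncpus)

# Assumed example: a code in which 90% of the serial run time parallelizes.
for ncpus in (1, 2, 4, 8, 12):
    print(f"NCPUS={ncpus:2d}  ideal speedup = {ideal_speedup(0.9, ncpus):.2f}")
```

Under that assumption, going from 8 to 12 processors raises the ideal speedup only from about 4.7 to about 5.7, and no number of processors can push it past 10.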

You can see how effectively the code is parallelized by including 'ja' in your submission script.

Here's an example including setting the environment variables.

#QSUB -q batch               # Pipe queue, "batch" required
#QSUB -lT 600                # limit the overall time to 600 CPU seconds
#QSUB -lM 45mw               # limit the overall memory to 45mw

cd /u1/uaf/mcallist/Test

setenv NCPUS 8
setenv MP_HOLDTIME 10000000

ja -csft

Output (excerpted):

partest results

(Concurrent CPUs * Connect seconds = CPU seconds)
 ---------------   ---------------   -----------

               1 *          0.2600 =      0.2600
               2 *          0.0500 =      0.1000
               3 *          0.4800 =      1.4400
               4 *          0.4600 =      1.8400
               5 *          0.2500 =      1.2500
               6 *          0.0700 =      0.4200
               7 *          0.2900 =      2.0300
               8 *          2.1300 =     17.0400

(Concurrent CPUs * Connect seconds = CPU seconds)
      (Avg.)           (total)         (total)
 ---------------      --------------   -----------

            6.11 *          3.9900 =     24.3800

As you can see, "partest," which is nothing but a parallelized loop, spends most of its time connected to all 8 processors. In codes that are not parallelized well, the majority of time would be listed as connected to only 1 processor.

(An easy mistake to make reading ja output is to interpret the "CPU seconds" column as total time for processor #1, total time for #2, etc. "CPU seconds" is really reporting the amount of time spent connected to n processors, where "n" is given in the "Concurrent CPUs" column. Thus, in the "partest" results, a total of 17.04 CPU seconds were accumulated while the program was running concurrently on 8 processors--and it accumulated 2.13 seconds on each.)
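The arithmetic behind the ja report can be checked by hand. The sketch below is not part of ja; it simply recomputes the summary line from the eight rows of the "partest" excerpt, and the variable and function names are made up for illustration.

```python
# (concurrent CPUs, connect seconds) pairs copied from the ja excerpt above
rows = [
    (1, 0.26), (2, 0.05), (3, 0.48), (4, 0.46),
    (5, 0.25), (6, 0.07), (7, 0.29), (8, 2.13),
]

def summarize(rows):
    """Return (total connect seconds, total CPU seconds, average concurrent CPUs)."""
    connect_total = sum(connect for _, connect in rows)
    # Each row contributes (number of CPUs) * (seconds connected to that many CPUs).
    cpu_total = sum(ncpus * connect for ncpus, connect in rows)
    return connect_total, cpu_total, cpu_total / connect_total

connect_total, cpu_total, avg_cpus = summarize(rows)
print(f"connect {connect_total:.2f} s, CPU {cpu_total:.2f} s, "
      f"average {avg_cpus:.2f} concurrent CPUs")
```

The totals agree with the last line of the report: 3.99 connect seconds, 24.38 CPU seconds, and an average of 6.11 concurrent CPUs, which is simply the total CPU seconds divided by the total connect seconds.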

In this case, the relatively small amount of time spent working on fewer processors is "noise" caused by the presence of many jobs competing for the same resources. In an ideal world, where your job was the only one on the system, you would see only serial time (1 processor) and parallel time (8 processors). However, the usual reality is pandemonium: when the master processor attempts to assign tasks, far fewer processors than the number requested are actually available.

A high concurrent CPU average can be used as an indicator of how much parallelization is happening. However, this can vary quite a lot between runs. In the tests I've made with this program, the exact same code and compiler options have achieved anywhere from 3 to 7 concurrent CPUs on average.

Another concern is that while increasing NCPUS sometimes decreases wallclock time, it ALWAYS increases CPU time. (CPU time also varies widely with system load, though, and this can mask the overhead saved by a smaller NCPUS.) In terms of CPU time, the most efficient way to run a J90 program is at NCPUS=1. As Chilkoot is primarily a batch system, it makes sense to keep this in mind when attempting to set the time limits (which are in CPU time) low to get scheduled faster. Strange as it may seem, a job that takes longer in wallclock time may actually finish sooner, because it spends less time waiting to be scheduled.

The ATExpert tool provides diagnostics and graphs which show where the best NCPUS setting is. ARSC's current default setting for NCPUS is 4, but you may override this at any time. We encourage all users to experiment with smaller NCPUS settings and to use ATExpert.

In the next issue, I'll discuss how the compiler can increase parallelization via command-line options, directives, and removal of factors that eliminate a loop from consideration (so it stays scalar). The good news is that the concurrency you see now as you increase NCPUS is likely to get better with some simple changes.

CUG SV1 Workshop, Oct 23-25

CUG Fall 2000 Workshop, October 23-25, 2000, Minneapolis, Minnesota

Workshop Goals

We anticipate an exciting workshop in Minneapolis that will give you the opportunity to share the latest information on the HPC solutions our vendors provide and how CUG members make use of those resources. Our focus will be the Cray SV1. We encourage you to submit an abstract of a presentation (see Call for Papers below).

Where and When

The CUG Fall 2000 Workshop will be held in Minneapolis, Minnesota October 23-25, 2000. There will be a workshop dinner on October 24th. Sponsors for the Conference are CUG and Cray Inc.

Technical Program

The meeting will be one-track for two and one half days starting at 8:30 on Monday, October 23rd. Planned sessions include:

  • User Applications
  • Tutorials
  • Optimization
  • Roadmap and Software Update
  • Libraries and other third party projects

Call for Papers

You are welcome to submit an abstract of a paper to present at the workshop. For information on topics for your paper, please contact Gary Jensen. When you are ready to submit your abstract, please use our online system. The deadline for submitting an abstract is August 31, 2000.

Quick-Tip Q & A

A:[[ When I try to compile my old reliable Fortran 90 code on my SGI
  [[ workstation I get a message like:
  [[     sgi%  f90 -o test test.f
  [[       "test.f": Warning: Stack frame size (157784752) larger than 
  [[       system limit (67108864)
  [[   and the executable produced is unrunnable.
  [[   What's wrong, and what can I do to get my code to compile?

  Thanks to Ted Mansell (NOAA) and George Petit (NUWC) for sending in
  answers, and to Jeff McAllister (ARSC) for the question.

  Here's George's answer:
  Recompile with the "-static" option:

    sgi% f90 -static -o test test.f

  This will allocate all local storage into memory instead of on the stack.

  If this doesn't work, execute the "limit" command to see what your
  personal maximum stack frame size is set to.  You can reset it to
  "unlimited" which will actually set it to the system-defined limit.

  If it still doesn't work, see your friendly local SGI System
  Administrator about upping the system limit.

Q: Tabs are an annoyance.  Somehow they get in my source code files, and
   I want to get rid of them. 
   A simple search and replace, inserting 8 spaces per tab, won't work
   because spaces may be already hidden in front of the tabs.  A given
   tab may need to be replaced by anywhere from 1 to 8 spaces, depending on how many
   spaces lurk in front of it.

   Is there a decent way to convert tabs to spaces?

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.