ARSC HPC Users' Newsletter 203, August 31, 2000

Chilkoot Upgrade to SV1

The Arctic Region Supercomputing Center (ARSC) is pleased to announce that "chilkoot", our parallel vector compute platform, is to be upgraded in September, 2000.

Chilkoot will retain its internet name, but the current Cray J932se will be replaced by a 32-processor Cray SV1. The SV1, like the J932se, is a parallel vector system, and all codes which run on the J932se should compile and run without change on the SV1.

The SV1 will have 4 GW of memory and approximately 2 TB of disk storage. Each of the 32 processors has a peak potential performance of 1.2 GFLOPS, for a total of 38.4 GFLOPS (compared with the 12-CPU J932se, which has a peak performance of 2.4 GFLOPS).

Installation will require considerable downtime, on both chilkoot and yukon (which relies on chilkoot for DMF support), during the last two weeks of September. If you need results from these systems during that time period, be sure to read "news downtime" and the "Message of the Day" login screen for announcements.

Once this upgrade is complete, chilkoot will boast a 4x increase in memory, 4x increase in disk capacity, and 16x increase in vector processing power.

For more on the CRAY SV1, visit:

Compiling and Optimizing on the CRAY J90, Part II

[[ This is the second in a two part series. Thanks to Jeff McAllister of ARSC for this contribution. ]]

The concepts presented here in terms of J90 programming should hold true for the new SV1 as well. The recent programming environment upgrade means that code (and compilers) should work similarly enough to provide a smooth transition. Learning more about the compiler's strategies for optimizing the code, as well as what it looks for in the code itself, can be valuable in making the most of the resources of the SV1.

Part 1 (previous issue) covered:

  • Environment variables

Part 2 covers:

  • PVP F90 compiler command-line options and compiler directives
  • Factors that make the compiler decide not to parallelize a loop, such as dependencies between iterations and I/O statements.

Command-line options

Deciding which compiler options to use is not as simple as it may seem. A given optimization option might increase performance, or decrease it, or alter the program's results.

Changes in performance are highly dependent upon an algorithm and how it is expressed in the code. An option that makes one program (or even a part of a program) much faster may make another slower. We suggest assessing the amount of time spent in various code sections with:

  • timer statements inserted into the code (see T3D newsletters 12 and 99)
  • performance tools, like atexpert, showing time used by subroutine (T3E newsletter 149)

This will allow a clearer perspective on the costs and benefits of each approach. (And note, you can compile different subroutines with different options, assuming they're stored in different source files.)
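
As a minimal sketch of the first approach, the standard Fortran 90 intrinsic SYSTEM_CLOCK can bracket a section of code. The program name "timesec" and the loop being timed are made up for illustration. SYSTEM_CLOCK reports wall-clock time, so on a busy system the figure includes time spent waiting for a CPU.

```fortran
      PROGRAM timesec
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000000
      REAL, DIMENSION(N) :: X
      INTEGER :: I, T0, T1, RATE

      ! COUNT_RATE is the number of clock ticks per second.
      CALL SYSTEM_CLOCK(COUNT=T0, COUNT_RATE=RATE)

      ! Section being timed (an arbitrary stand-in loop).
      DO I = 1, N
         X(I) = SQRT(REAL(I))
      END DO

      CALL SYSTEM_CLOCK(COUNT=T1)

      IF (RATE > 0) THEN
         PRINT *, 'Loop time (s): ', REAL(T1 - T0)/REAL(RATE)
      ELSE
         PRINT *, 'No system clock available'
      END IF

      ! Use the result so the loop is not optimized away.
      PRINT *, X(N)
      END PROGRAM timesec
```

The clock count wraps around at COUNT_MAX, so a very long section could yield a negative difference; for short sections this can usually be ignored.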

Many of the optimizations given below are not used by default because they make compilation take longer -- and they can change results. Just because the compiler has done something behind the scenes and you have not touched the code does not mean that the results are safe. We recommend that you always check results when adding new compiler options.

That said, here is a list of f90 options to consider using:

f90 -r2 -O aggress,loopalign,taskinner,inline4,task3,vector3,scalar3

-r2 provides a listing of how the compiler views the program, showing which parts are currently using parallel and vector capabilities. For example, this listing segment,

13 P-- v -------- do i=1,n
14 P   v            X(i)=i
15 P-- v -------> end do

shows that the compiler "sees" this loop as both parallelizable and vectorizable. This is a valuable tool, allowing you to match what the compiler does with what you think it should do. The listing is written to programname.lst; for example, the listing for f90 -r2 partest.f is 'partest.lst'.

-O aggress raises the limits for internal tables, allowing larger loops to fit within an optimization region -- and to be considered for optimization.

-O loopalign tells the compiler to count the number of generated instructions in a loop, thus making it more likely that an entire loop will fit within the same internal buffer, decreasing the number of memory swaps during execution. This can save a lot of time if a program is dominated by a particular loop which can be made to "fit" better in the subset of memory retrieved into local caches.

-O taskinner allows inner loops to be examined for autotasking. If there will be a benefit to partitioning these loops, they will be autotasked; if not, they will be left alone.

-O inline4 instructs the compiler to do as much as possible to decrease subroutine calls by "inserting" the subroutine code in place of the call. Code which has been inlined, creating a loop without subroutine calls, becomes eligible for vectorization.
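
To illustrate what inlining does, here is a minimal sketch in which the same computation is written once with a function call and once with the call replaced by hand. The names "inltest" and HALFROOT are invented for this example. Whether the compiler actually inlines a given call depends on the options used and on the code, so the -r2 listing is the place to check.

```fortran
      PROGRAM inltest
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000
      REAL, DIMENSION(N) :: X, Y
      INTEGER :: I

      ! As written: every iteration makes a function call.
      DO I = 1, N
         X(I) = HALFROOT(REAL(I))
      END DO

      ! Inlined by hand: the call is replaced by the function body.
      DO I = 1, N
         Y(I) = 0.5*SQRT(REAL(I))
      END DO

      PRINT *, 'Largest difference: ', MAXVAL(ABS(X - Y))

      CONTAINS

      ! Small function, a natural candidate for inlining.
      REAL FUNCTION HALFROOT(V)
      REAL, INTENT(IN) :: V
      HALFROOT = 0.5*SQRT(V)
      END FUNCTION HALFROOT

      END PROGRAM inltest
```

The second loop contains no call, which is the form the compiler needs before it can consider the loop for vectorization.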

-O task3 tells the compiler to do everything it can to make loops autotask well, including loop nest restructuring. Some of the optimization strategies employed have the potential to change results.

-O vector3 restructures loops to take full advantage of vectorization.

-O scalar3 restructures and bottom-loads loops to provide the fastest possible scalar operation.

Compiler directives

The CF90 compiler supports the OpenMP API, which offers a greater level of control over which sections of the code various optimization techniques will be applied to. Directives are inserted into the code in this format:

!$OMP directive[clause[[,] clause] ...]

and are ignored as comments on non-OpenMP compilers.

Here is some -r2 output to demonstrate how directives can alter compilation:

9                             !$OMP PARALLEL DO
10             P-- v --------       DO I=1,N
11             P   v                   X(I)=I
12             P-- v ------->       END DO
13                            !$OMP END PARALLEL DO
16                 v --------       DO I=1,N
17                 v                   Y(I)=I
18                 v ------->       END DO

Parallelization via command-line option would apply the same criteria to all loops. Here explicit control is provided by PARALLEL DO/END PARALLEL DO.

Obviously there is much more that can be done with directives. Even a brief introduction to their full range of functionality would take several of these columns. At this point the intent is to show the basic concept: more explicit control. However, this control comes with a burden as well: it is easy to force a loop to execute in parallel when it really has dependencies between iterations, and the results will then be incorrect.
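
As one small illustration of directive clauses, the sketch below uses the standard OpenMP PRIVATE and REDUCTION clauses. It assumes the installed compiler accepts these clauses; the program name "omptest" and its variables are made up for this example. The running sum S is carried from one iteration to the next, and the REDUCTION clause tells the compiler how to combine per-processor partial sums so that running the loop in parallel is still safe.

```fortran
      PROGRAM omptest
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000
      REAL, DIMENSION(N) :: X
      REAL :: S, T
      INTEGER :: I

      DO I = 1, N
         X(I) = I
      END DO

      S = 0.0
      ! T is scratch space, so each thread needs its own copy.
      ! S is accumulated, so partial sums are combined at the end.
!$OMP PARALLEL DO PRIVATE(T) REDUCTION(+:S)
      DO I = 1, N
         T = 0.5*X(I)
         S = S + T
      END DO
!$OMP END PARALLEL DO

      PRINT *, S
      END PROGRAM omptest
```

The sum should be 250250 whether or not the directives are honored. Omitting REDUCTION(+:S) while keeping PARALLEL DO would be an example of forcing a loop parallel in spite of a real dependency.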

Altering the code

In many cases the best and most reliable way to make sections of code become parallel is to remove the factors that inhibit automatic parallelization. (This technique is also used to improve vectorization.)

The compiler normally won't parallelize loops which contain:

  1. Dependencies between iterations
  2. I/O (read,write,print)
  3. Subroutine/function calls
  4. Branches into a loop

though it can be forced to with directives. There are good reasons why parallel/vector mode is avoided in these cases. Parallelization depends on the assumption that the iterations of a loop can be executed in arbitrary order.
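
The arbitrary-order assumption can be checked by hand on a small case. The following sketch (the program name "ordtest" and the arrays are illustrative only) runs an independent loop and a loop with a dependency, each forward and in reverse, and compares the results.

```fortran
      PROGRAM ordtest
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 8
      REAL, DIMENSION(N) :: A, B
      INTEGER :: I

      ! Independent iterations: forward and reverse order agree.
      DO I = 1, N
         A(I) = 2.0*I
      END DO
      DO I = N, 1, -1
         B(I) = 2.0*I
      END DO
      PRINT *, 'independent, same result: ', ALL(A == B)

      ! Recurrence: iteration I reads the value written by I-1.
      A = 1.0
      B = 1.0
      DO I = 2, N
         A(I) = A(I) + A(I-1)
      END DO
      DO I = N, 2, -1
         B(I) = B(I) + B(I-1)
      END DO
      PRINT *, 'recurrence, same result:  ', ALL(A == B)
      END PROGRAM ordtest
```

The first comparison is true and the second is false: run forward, the recurrence gives 1, 2, 3, ..., while in reverse every element after the first becomes 2. A loop of the second kind cannot safely be split among processors.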

It also depends on the code within the loop being vectorizable. Calls to non-inlined subroutines or functions are disallowed, as well as many intrinsic functions and library calls. However, if a vectorized version of the function exists (as with common functions like sqrt) loops using that function are not disqualified.

Loops often specify not only that a set of operations needs to be done for each element of an array, but also the order in which these operations must be done to achieve correct results. In the following example, the order of iterations is irrelevant in the first loop, but it matters in the second, where each iteration with X(I)<50 reads X(I-1), the value written by the previous iteration. Thus the compiler is able to parallelize only the first.

2                                   PROGRAM partest
3                                   IMPLICIT NONE
5                                   INTEGER,PARAMETER:: N=30000000
6                                   REAL,DIMENSION(N)::X
7                                   INTEGER I
9              P-- v --------       DO I=1,N
10             P   v                   X(I)=I
11             P-- v ------->       END DO
13                 1 --------       DO I=2,N
14                 1                   IF (X(I)<50) THEN
15                 1                      X(I)=.5*sqrt(X(I)/X(I-1))
16                 1                   ELSE
17                 1                      X(I)=.5*sqrt(X(I))
18                 1                   END IF
19                 1 ------->       END DO
21                                  PRINT *,X(N)
23                                  END

To improve parallelization, rewrite loops so that the independent steps are separated from the dependent steps.

2                                   PROGRAM partest2
3                                   IMPLICIT NONE
5                                   INTEGER I,N,ASTAT
6                                   REAL,dimension(:),allocatable::X
8                                   N=30000000
10                                  IF (N<2) THEN
11                                     PRINT *,"N must be 2 or more"
12                                     STOP
13                                  END IF
15                                  ALLOCATE(X(N),STAT=ASTAT)
17                                  IF (ASTAT/=0) THEN
18                                     PRINT *,"NOT ENOUGH MEMORY!"
19                                     STOP
20                                  END IF
23             P-- v --------       DO I=1,N
24             P   v                   X(I)=I
25             P-- v ------->       END DO
27                                  IF (N<50) THEN
28                 1 --------          DO I=2,N
29                 1                      X(I)=.5*sqrt(X(I)/X(I-1))
30                 1 ------->          END DO
31                                  ELSE
32                 1 --------          DO I=2,49
33                 1                      X(I)=.5*sqrt(X(I)/X(I-1))
34                 1 ------->          END DO
36             P-- v --------          DO I=50,N
37             P   v                      X(I)=.5*sqrt(X(I))
38             P-- v ------->          END DO
39                                  END IF
41                                  print *,X(N)                        
42                                  END

(P = parallelized, v = vectorized, 1 = unoptimized loop at nest level 1)

This is a deliberately "cooked" example, designed to maximize the time difference between parallel/vectorized and plain loops. (It's interesting to consider what would happen with changes in N.) The extent of the performance increase between the two categories should always be kept in mind. With this simple change:

  • execution time dropped from 57 seconds to 2 seconds
  • CPU time dropped from 56.7 to 4.8 seconds
  • MFLOPS went from 12.2 to 144.5

The dependency which blocked the parallelization of the main loop also blocked its vectorization. This barrier to fast performance could only be overcome by changing the code to better fit the platform executing it. This is often the case: though the compiler can be very clever, there is no substitute for personally evaluating slow sections of code with an understanding of how they will be executed.
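
When code is restructured by hand like this, it is worth confirming that the new form computes the same values as the old one. The sketch below does so for every N from 2 to 1000. The program name "chktest" is invented, and MIN(N,49) is used as a compact stand-in for the IF (N<50) test in partest2.

```fortran
      PROGRAM chktest
      IMPLICIT NONE
      INTEGER, PARAMETER :: NMAX = 1000
      REAL, DIMENSION(NMAX) :: X, Y
      INTEGER :: I, N

      DO N = 2, NMAX
         DO I = 1, N
            X(I) = I
            Y(I) = I
         END DO

         ! Original form: branch and dependency in one loop.
         DO I = 2, N
            IF (X(I) < 50) THEN
               X(I) = .5*SQRT(X(I)/X(I-1))
            ELSE
               X(I) = .5*SQRT(X(I))
            END IF
         END DO

         ! Restructured form: dependent part, then independent part.
         DO I = 2, MIN(N, 49)
            Y(I) = .5*SQRT(Y(I)/Y(I-1))
         END DO
         DO I = 50, N
            Y(I) = .5*SQRT(Y(I))
         END DO

         IF (ANY(X(1:N) /= Y(1:N))) THEN
            PRINT *, 'MISMATCH AT N = ', N
            STOP
         END IF
      END DO
      PRINT *, 'ALL CASES MATCH'
      END PROGRAM chktest
```

The comparison here is exact. If the scalar and vector versions of SQRT round differently on a given system, a small tolerance may be needed instead.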

New release of the ZPL Compiler and Runtime for T3E

[[ Thanks to Brad Chamberlain of the University of Washington. ]]

The ZPL Project is pleased to announce version 1.16.59 of the ZPL compiler and runtime. ZPL is an array-based parallel programming language that was developed at the University of Washington and has been featured in previous ARSC newsletters (most recently in the NAS MG study, issues 188 and 189). This release contains the latest features and optimizations available in ZPL and was used in obtaining the results for a forthcoming SC2000 paper that contains expanded results from the NAS MG study mentioned above.

The release is installed on yukon, ARSC's Cray T3E, for use by ARSC users. The documentation for using this installation can be found at:

Users who wish to download their own version of ZPL or find out more about the language are encouraged to visit the ZPL website at:

What's New in ZPL?

This release adds support for:

  • complex types
  • the new wrap-@ operator
  • config vars of type "file"
  • external constants

New optimizations include:

  • automatic optimization of @-based stencil operations
  • automatic optimization of flood array usage
  • an optimized implementation of @-communication

In addition, there are a number of significant bug fixes.

This release also contains new example programs, including a better jacobi (jacobi2.z), the SUMMA matrix multiplication algorithm, and the ZPL implementation of NAS MG.

Also, several language features are supported in an initial form that will be refined in future releases:

  • multidirections, multiregions, and multiarrays
  • support for processor grids of rank greater than 2
  • "grid dimensions" that allow for privatized per-processor values
  • pipelined wavefront computations

Please contact the ZPL group with any questions:

ARSC Viz Lab Open Houses

You are invited to open houses from 12-3 pm in the four ARSC access/visualization labs on the following dates:

Natural Sciences Facility (rm 161) - Thursday, Sept. 21,
Duckering (rm 234) - Friday, Sept. 22,
Elvey (rm 221) - Monday, Sept. 25,
Butrovich (rm 007) - Tuesday, Sept. 26.

Software demonstrations, practice sessions, visualization examples, and Q&A time with ARSC staff will be available.

Quick-Tip Q & A

A:[[ Tabs are an annoyance.  Somehow they get in my source code files,
  [[ and I want to get rid of them. 
  [[ A simple search and replace, inserting 8 spaces per tab, won't work
  [[ because spaces may be already hidden in front of the tabs.  A given
  [[ tab may need to be replaced by from 1-8 spaces, depending on how many
  [[ spaces lurk in front of it.
  [[ Is there a decent way to convert tabs to spaces?

  Thanks to Richard Griswold:
  Check out the GNU textutils "expand" and "unexpand" commands.

    expand - convert tabs to spaces
    unexpand - convert spaces to tabs

  Editor's Note: (un)expand are available on the Crays and SGIs.

  For perl aficionados, thanks to Stephan Pickles for his solution:
  Try this perl script:

  #!/usr/local/bin/perl -w
  # Expand each run of tabs to spaces, assuming tab stops every 8 columns.
  while (<>) {
    1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
    print $_;
  }

  Make sure the shebang line has the correct path to your perl.

  If the script is in a file called detab, one way to invoke it (there
  are several) is:

    detab input_file >output_file

Q: How can I find out what the current UNICOS/mk version number is?  
   I use three different T3Es and they're always upgrading the OS
   whenever they darn well feel like it.  If I don't start logging this
   stuff I'm gonna be in deep sauerkraut!

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.