ARSC HPC Users' Newsletter 203, August 31, 2000
Chilkoot Upgrade to SV1
The Arctic Region Supercomputing Center (ARSC) is pleased to announce that "chilkoot", our parallel vector compute platform, will be upgraded in September 2000.
Chilkoot will retain its internet name (chilkoot.arsc.edu), but the current J932se will be replaced by a 32-processor Cray SV1. The SV1, like the J932se, is a parallel vector system, and all codes that run on the J932se should compile and run without change on the SV1.
The SV1 will have 4 GW of memory and approximately 2 TB of disk storage. Each of the 32 processors has a peak potential performance of 1.2 GFLOPS, for a total of 38.4 GFLOPS (compared with the 12-CPU J932se, which has a peak performance of 2.4 GFLOPS).
Installation will require considerable downtime, on both chilkoot and yukon (which relies on chilkoot for DMF support), during the last two weeks of September. If you need results from these systems during that time period, be sure to read "news downtime" and the "Message of the Day" login screen for announcements.
Once this upgrade is complete, chilkoot will boast a 4x increase in memory, 4x increase in disk capacity, and 16x increase in vector processing power.
For more on the CRAY SV1, visit:
Compiling and Optimizing on the CRAY J90, Part II
[[ This is the second in a two part series. Thanks to Jeff McAllister of ARSC for this contribution. ]]
The concepts presented here in terms of J90 programming should hold true for the new SV1 as well. The recent programming environment upgrade means that code (and compilers) should work similarly enough to provide a smooth transition. Learning more about the compiler's strategies for optimizing the code, as well as what it looks for in the code itself, can be valuable in making the most of the resources of the SV1.
Part 1 (previous issue) covered:
- Environment variables
Part 2 covers:
- PVP F90 compiler command-line options and compiler directives
- Factors that make the compiler decide not to parallelize a loop, such as dependencies between iterations and I/O statements.
Command-line options
Deciding which compiler options to use is not as simple as it may seem. A given optimization option might increase performance, or decrease it, or alter the program's results.
Changes in performance depend heavily on the algorithm and how it is expressed in the code. An option that makes one program (or even a part of a program) much faster may make another slower. We suggest assessing the amount of time spent in various code sections with:
- timer statements inserted into the code (see T3D newsletters 12 and 99)
- performance tools, like atexpert, showing time used by subroutine (T3E newsletter 149)
This will allow a clearer perspective on the costs and benefits of each approach. (And note, you can compile different subroutines with different options, assuming they're stored in different source files.)
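As a rough illustration of the first approach, the sketch below brackets a loop with calls to the standard Fortran 90 intrinsic SYSTEM_CLOCK. The program name, array, and loop body are made up for this example. It also assumes that the system provides a clock (a nonzero count rate) and that the counter does not wrap during the measurement.

```fortran
PROGRAM timesketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000000
  REAL, DIMENSION(:), ALLOCATABLE :: X
  INTEGER :: I, T0, T1, RATE
  REAL :: ELAPSED

  ALLOCATE(X(N))

  ! Read the wall clock just before the section being measured.
  ! RATE is the number of clock counts per second.
  CALL SYSTEM_CLOCK(T0, RATE)

  ! Hypothetical work loop standing in for the code section of interest.
  DO I = 1, N
     X(I) = 0.5*SQRT(REAL(I))
  END DO

  ! Read the clock again and convert the difference to seconds.
  CALL SYSTEM_CLOCK(T1)
  ELAPSED = REAL(T1 - T0) / REAL(RATE)

  PRINT *, 'Loop time (seconds): ', ELAPSED
  ! Printing a result keeps the compiler from discarding the loop.
  PRINT *, 'X(N) = ', X(N)
END PROGRAM timesketch
```

Moving the two SYSTEM_CLOCK calls around different sections, and recompiling with different options, gives a quick picture of where the time goes and which options help.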
Many of the optimizations given below are not used by default because they make compilation take longer -- and they can change results. Just because the compiler has done something behind the scenes and you have not touched the code does not mean that the results are safe. We recommend that you always check results when adding new compiler options.
That said, here is a list of f90 options to consider using:
f90 -r2 -O aggress,loopalign,taskinner,inline4,task3,vector3,scalar3
-r2 provides a listing of how the compiler views the program, showing which parts are currently using parallel and vector capabilities. For example, this listing segment

  13  P-- v --------  do i=1,n
  14  P   v             X(i)=i
  15  P-- v ------->  end do

shows that the compiler "sees" this loop as both parallelizable and vectorizable. This is a valuable tool, allowing you to match what the compiler does with what you think it should do. Output can be found in programname.lst. For example, the listing for f90 -r2 partest.f is 'partest.lst'.
-O aggress raises the limits for internal tables, allowing larger loops to fit within an optimization region -- and to be considered for optimization.
-O loopalign tells the compiler to count the generated instructions in a loop and align the loop accordingly, making it more likely that the entire loop fits within a single internal buffer and reducing the number of memory swaps during execution. This can save a lot of time if a program is dominated by a particular loop which can be made to "fit" better in the subset of memory retrieved into local caches.
-O taskinner allows inner loops to be examined for autotasking. If there will be a benefit to partitioning these loops, they will be autotasked; if not, they will be left alone.
-O inline4 instructs the compiler to do as much as possible to decrease subroutine calls by "inserting" the subroutine code in place of the call. Code which has been inlined, creating a loop without subroutine calls, becomes eligible for vectorization.
-O task3 tells the compiler to do everything it can to make loops autotask well, including loop nest restructuring. Some of the optimization strategies employed have the potential to change results.
-O vector3 restructures loops to take full advantage of vectorization.
-O scalar3 restructures and bottom-loads loops to provide the fastest possible scalar operation.
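To make the -O inline4 description more concrete, here is a small sketch. HALFROOT and the other names are hypothetical, and whether the compiler actually inlines a given call depends on the code and the options, so check the -r2 listing rather than assuming. The first computational loop contains a function call; the second shows what the same computation looks like once the call has been replaced by the function body.

```fortran
MODULE halfmod
CONTAINS
  ! Hypothetical helper, small enough to be a candidate for inlining.
  REAL FUNCTION HALFROOT(V)
    IMPLICIT NONE
    REAL, INTENT(IN) :: V
    HALFROOT = 0.5*SQRT(V)
  END FUNCTION HALFROOT
END MODULE halfmod

PROGRAM inlinesketch
  USE halfmod
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100000
  REAL, DIMENSION(N) :: X, Y
  INTEGER :: I

  DO I = 1, N
     X(I) = I
  END DO

  ! Loop containing a function call. As written, the call stands in
  ! the way of vectorization unless the compiler inlines it.
  DO I = 1, N
     Y(I) = HALFROOT(X(I))
  END DO

  ! The same computation inlined by hand. Only arithmetic and the
  ! SQRT intrinsic remain inside the loop.
  DO I = 1, N
     X(I) = 0.5*SQRT(X(I))
  END DO

  PRINT *, Y(N), X(N)
END PROGRAM inlinesketch
```

Both loops compute the same values, so the two numbers printed should agree.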
Compiler directives
The CF90 compiler supports the OpenMP API, which offers a greater level of control over which sections of the code various optimization techniques will be applied to. Directives are inserted into the code in this format:
!$OMP directive[clause[[,] clause] ...]
and are ignored as comments on non-OpenMP compilers.
Here is some -r2 output to demonstrate how directives can alter compilation:
   9                  !$OMP PARALLEL DO
  10  P-- v --------  DO I=1,N
  11  P   v             X(I)=I
  12  P-- v ------->  END DO
  13                  !$OMP END PARALLEL DO
  14
  15
  16      v --------  DO I=1,N
  17      v             Y(I)=I
  18      v ------->  END DO
Parallelization via command-line option would apply the same criteria to all loops. Here explicit control is provided by PARALLEL DO/END PARALLEL DO.
Much more can be done with directives; even a brief introduction to their full range of functionality would take several of these columns. The intent here is only to show the basic concept: more explicit control. That control carries a burden, however. It is easy, for example, to force a loop to execute in parallel when it contains dependencies, and ignoring those dependencies will make the results incorrect.
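As one small illustration of the burden that comes with explicit control, consider a summation loop. Every iteration updates the same variable S, so forcing the loop parallel with a bare PARALLEL DO would be unsafe. The sketch below uses the standard OpenMP REDUCTION clause to declare the accumulation. The program and variable names are invented for this example, and support for this clause in any particular CF90 release is an assumption to check against the compiler documentation.

```fortran
PROGRAM redsketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  REAL, DIMENSION(N) :: X
  REAL :: S
  INTEGER :: I

  ! Independent iterations: safe to run in any order.
  DO I = 1, N
     X(I) = 1.0
  END DO

  ! Each iteration reads and writes S. REDUCTION(+:S) gives every
  ! thread a private partial sum and combines them at the end.
  S = 0.0
!$OMP PARALLEL DO REDUCTION(+:S)
  DO I = 1, N
     S = S + X(I)
  END DO
!$OMP END PARALLEL DO

  PRINT *, S
END PROGRAM redsketch
```

A reduction may add the terms in a different order than the serial loop does, so in general the result can differ in the last digits. This is one more reason to check results whenever a loop is made parallel.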
Altering the code
In many cases the best and most reliable way to make sections of code become parallel is to remove the factors that inhibit automatic parallelization. (This technique is also used to improve vectorization.)
The compiler normally won't parallelize loops which contain:
- Dependencies between iterations
- I/O (read,write,print)
- Subroutine/function calls
- Branches into a loop
though it can be forced to with directives. There are good reasons why parallel/vector mode is avoided in these cases. Parallelization depends on the assumption that the iterations of a loop can be executed in arbitrary order.
It also depends on the code within the loop being vectorizable. Calls to non-inlined subroutines or functions are disallowed, as are many intrinsic functions and library calls. However, if a vectorized version of a function exists (as with common functions like sqrt), loops using that function are not disqualified.
Loops often specify not only that a set of operations needs to be done for each element of an array, but also the order in which those operations must be done to achieve correct results. In the following example, the order of iterations is irrelevant in the first loop but matters in the second, where X(I) may depend on the X(I-1) computed by the previous iteration. Thus the compiler is able to parallelize only the first.
   2                  PROGRAM partest
   3                  IMPLICIT NONE
   4
   5                  INTEGER,PARAMETER:: N=30000000
   6                  REAL,DIMENSION(N)::X
   7                  INTEGER I
   8
   9  P-- v --------  DO I=1,N
  10  P   v             X(I)=I
  11  P-- v ------->  END DO
  12
  13      1 --------  DO I=2,N
  14      1             IF (X(I)<50) THEN
  15      1               X(I)=.5*sqrt(X(I)/X(I-1))
  16      1             ELSE
  17      1               X(I)=.5*sqrt(X(I))
  18      1             END IF
  19      1 ------->  END DO
  20
  21                  PRINT *,X(N)
  22
  23                  END
To improve parallelization, rewrite loops so that the independent steps are separated from the dependent steps.
   2                  PROGRAM partest2
   3                  IMPLICIT NONE
   4
   5                  INTEGER I,N,ASTAT
   6                  REAL,dimension(:),allocatable::X
   7
   8                  N=30000000
   9
  10                  IF (N<2) THEN
  11                    PRINT *,"N must be 2 or more"
  12                    STOP
  13                  END IF
  14
  15                  ALLOCATE(X(N),STAT=ASTAT)
  16
  17                  IF (ASTAT/=0) THEN
  18                    PRINT *,"NOT ENOUGH MEMORY!"
  19                    STOP
  20                  END IF
  21
  22
  23  P-- v --------  DO I=1,N
  24  P   v             X(I)=I
  25  P-- v ------->  END DO
  26
  27                  IF (N<50) THEN
  28      1 --------    DO I=2,N
  29      1               X(I)=.5*sqrt(X(I)/X(I-1))
  30      1 ------->    END DO
  31                  ELSE
  32      1 --------    DO I=2,49
  33      1               X(I)=.5*sqrt(X(I)/X(I-1))
  34      1 ------->    END DO
  35
  36  P-- v --------    DO I=50,N
  37  P   v               X(I)=.5*sqrt(X(I))
  38  P-- v ------->    END DO
  39                  END IF
  40
  41                  print *,X(N)
  42                  END

(P = parallelized, v = vectorized, 1 = unoptimized loop nest level 1)
This is a deliberately "cooked" example, designed to maximize the time difference between parallel/vectorized and plain loops. (It's interesting to consider what would happen with changes in N.) The extent of the performance increase between the two categories should always be kept in mind. With this simple change:
- execution time dropped from 57 seconds to 2 seconds
- CPU time dropped from 56.7 to 4.8 seconds
- MFLOPS went from 12.2 to 144.5
The dependency that blocked parallelization of the main loop also blocked its vectorization. This barrier to fast performance could be overcome only by changing the code to better fit the platform executing it. This is often the case: though the compiler can be very clever, there is no substitute for personally evaluating slow sections of code with an understanding of how they will be executed.
New release of the ZPL Compiler and Runtime for T3E
[[ Thanks to Brad Chamberlain of the University of Washington. ]]
The ZPL Project is pleased to announce version 1.16.59 of the ZPL compiler and runtime. ZPL is an array-based parallel programming language that was developed at the University of Washington and has been featured in previous ARSC newsletters (most recently in the NAS MG study, issues 188 and 189). This release contains the latest features and optimizations available in ZPL and was used in obtaining the results for a forthcoming SC2000 paper that contains expanded results from the NAS MG study mentioned above.
The release is installed on yukon, ARSC's Cray T3E, for use by ARSC users. The documentation for using this installation can be found at:
http://www.cs.washington.edu/research/zpl/install/supported/arsc.t3e.html
Users who wish to download their own version of ZPL or find out more about the language are encouraged to visit the ZPL website at:
http://www.cs.washington.edu/research/zpl
What's New in ZPL?
This release adds support for:
- complex types
- the new wrap-@ operator
- config vars of type "file"
- external constants
New optimizations include:
- automatic optimization of @-based stencil operations
- automatic optimization of flood array usage
- an optimized implementation of @-communication
In addition, there are a number of significant bug fixes.
This release also contains new example programs, including a better jacobi (jacobi2.z), the SUMMA matrix multiplication algorithm, and the ZPL implementation of NAS MG.
Also, several language features are supported in an initial form that will be refined in future releases:
- multidirections, multiregions, and multiarrays
- support for processor grids of rank greater than 2
- "grid dimensions" that allow for privatized per-processor values
- pipelined wavefront computations
Please contact the ZPL group with any questions:
ARSC Viz Lab Open Houses
You are invited to open houses from 12-3 pm in the four ARSC access/visualization labs on the following dates:
- Natural Sciences Facility (rm 161) - Thursday, Sept. 21,
- Duckering (rm 234) - Friday, Sept. 22,
- Elvey (rm 221) - Monday, Sept. 25,
- Butrovich (rm 007) - Tuesday, Sept. 26.
Software demonstrations, practice sessions, visualization examples, and Q&A time with ARSC staff will be available.
Quick-Tip Q & A
A:[[ Tabs are an annoyance. Somehow they get in my source code files,
[[ and I want to get rid of them.
[[
[[ A simple search and replace, inserting 8 spaces per tab, won't work
[[ because spaces may be already hidden in front of the tabs. A given
[[ tab may need to be replaced by from 1-8 spaces, depending on how many
[[ spaces lurk in front of it.
[[
[[ Is there a decent way to convert tabs to spaces?
Thanks to Richard Griswold:
---------------------------
Check out the GNU textutils "expand" and "unexpand" commands.
expand - convert tabs to spaces
unexpand - convert spaces to tabs
http://www.gnu.org/software/textutils/textutils.html
Editor's Note:
(un)expand are available on the Crays and SGIs.
For perl aficionados, thanks to Stephan Pickles for his solution:
--------------------------------------------------------------------
Try this perl script:
#!/usr/local/bin/perl -w
while (<>) {
1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
print $_;
}
Make sure the shebang line has the correct path to your perl
interpreter.
If the script is in a file called detab, one way to invoke it (there
are several) is:
detab input_file >output_file
Q: How can I find out what the current UNICOS/mk version number is?
I use three different T3Es and they're always upgrading the OS
whenever they darn well feel like it. If I don't start logging this
stuff I'm gonna be in deep sauerkraut!
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
