ARSC T3E Users' Newsletter 149, August 21, 1998
Arctic Science Conference of the AAAS -- October Meeting
The 49th Arctic Science Conference of the American Association for the Advancement of Science will take place on the University of Alaska Fairbanks campus October 25-28, 1998. ARSC staff will be conducting tours of our facility daily at 1 pm during the conference. Details about the conference are available at:
http://www.gi.alaska.edu/aaas/index.html
or by sending email to fnmrf@uaf.edu.
ARSC encourages users to participate in the conference and to submit abstracts for poster sessions.
The 1998 AAAS Arctic Division annual conference has been designed around the theme of international cooperation in arctic research. It will provide a forum for scientists from around the world to come together to discuss important issues concerning global climate change and its impacts in the western arctic. It is widely accepted that such global change will be observed first in the arctic and sub-arctic regions, with serious implications for the rest of the world. The newly established International Arctic Research Center (IARC) will provide state-of-the-art facilities and opportunities for scientists to study these regions.
The conference format consists of two plenary sessions each day featuring internationally-known speakers, including keynote speaker Rita Colwell, Director-designate of the National Science Foundation. Between the morning and afternoon sessions each day there will be a poster session (abstracts invited). The conference will also serve as a Wadati Conference on Global Change; speakers for this conference are noted in the programs. In addition, tours of the IARC building and ARSC will provide participants with the chance to view some of the research facilities available at the University of Alaska Fairbanks.
Parallelizing Codes for the J90
Several users have asked about running in parallel on the J90.
In short, there are several easy ways to improve performance. Compiler options are the simplest place to start. Inspecting performance, using simple measurements or tools, also points the way to further optimization.
Compiling
The J90 is a shared-memory vector system. When optimizing code, it is necessary to consider both the vector operations and how many processors can work on those operations through the shared memory.
Cray originally supported both macrotasking and microtasking; the latest compilers merge both into the compiler's autotasking option. (Macrotasking allowed the user to parallelize operations at the subroutine level, microtasking at the loop level.) Both used directives, and the user was responsible for the correct execution of the code. Autotasking is, as the name suggests, automatic and, like microtasking, exploits parallelism at the loop level.
The -O3 compiler option includes autotasking. (This can also be set explicitly using -O with task2 or task3; -O3 sets the task2 level of optimization.) As with all options, it is worth applying this only to the expensive areas of code in which the most time is spent; global application to an entire code is not advised. And, as with all optimizations, particularly those which perform major transformations, checking results against unoptimized code for correctness is strongly advised.
For diagnosing codes, a useful compiler option is -r. This prints out various information about your code, including information on the vectorization and parallelization of loops. For example, when -r2 is used on the following code, the report includes this listing:
PROG1A prog_do.f 14:42 Wed Aug12,1998
Page 1
------
3 program prog1a
4
5
6 parameter (nsize=1024*1024*4)
7
8
9 real a1,b1,c1,a2,b2,c2
10
11
12 common /d1/ a1(nsize),b1(nsize),c1(nsize)
13
14 common /d2/ a2(nsize),b2(nsize),c2(nsize)
15
16
17
18
19 iticks=irtc_rate()
20
21 write(6,*) ' data size is ',nsize
22
23 P-- v -------- do n=1,nsize
24 P v a1(n)=n*1.0
25 P v b1(n)=n*2.5
26 P v c1(n)=(nsize-n)*1.0
27 P v a2(n)=a1(n)
28 P v b2(n)=b1(n)
29 P v c2(n)=c1(n)
30 P-- v -------> enddo
31
32
33 i4b=irtc()
34
PROG1A prog_do.f 14:42 Wed Aug12,1998
Page 2
------
35 P-- v -------- do n=1,nsize
36 P v a2(n)=b2(n)*(b2(n)*(b2(n)*(b2(n)*(b2(n)+1)+1)+1)+1)+1
37 P-- v -------> enddo
38
39
40
41 i4c=irtc()
42
43 write(6,*) ' best flops is ',9*nsize*(iticks()/(i4c-i4b))
44 write(7,*) a2(nsize)
45
46
47 end
This shows that the loops at lines 23 and 35 are both fully vectorized and parallelized. The -r option causes the compiler to print an additional summary report regarding these same loops:
f90 Compiler - 4 messages:
1) <cf90-6403,Tasking,Line=23> A loop starting at line 23 was tasked.
2) <cf90-6204,Vector,Line=23> A loop starting at line 23 was vectorized.
3) <cf90-6204,Vector,Line=35> A loop starting at line 35 was vectorized.
4) <cf90-6403,Tasking,Line=35> A loop starting at line 35 was tasked.
Inspection
Since the ARSC J90 system is shared by many users, and some timers report only the total CPU time, a user running in parallel might observe an increase in the total CPU time rather than a decrease. This is because the same number of floating-point operations is performed, but the overhead of managing several processors has been added.
In the example below, the timer irtc is used to get the wall-clock time and to compute the actual flop rate (using manually counted loop operations). This is checked by comparing the totals with the counts from the hardware performance monitor, hpm. Care is needed in choosing a suitable time basis and in interpreting the results.
The aim is to use parallelism to reduce the overall wall-clock time.
More Advanced Tools: atexpert
A graphical tool, atexpert, can be used to predict the parallel performance of a code. Compile with both -O3 and -eX, run the code to generate performance measurements, and then run atexpert to see how your code performs.
chilkoot% f90 -O3 -eX -o prog_do_at prog_do.f
chilkoot% ./prog_do_at
data size is 4194304
best flops is 113246208
chilkoot% atexpert
This will generate a predicted performance graph, extrapolated from the data gathered by running the instrumented code on a single processor. The following graph (figure 1) is such a prediction, and was produced by atexpert:
Figure 1: Performance prediction produced by atexpert.
As you can see, the simple example code below gives a very high level of parallel performance since it is perfectly parallel. Real codes are more likely to show performance that falls off rapidly as processors are added, and atexpert allows users to investigate the parallel performance of each subroutine so that poorly performing areas of code can be improved.
One advantage of multiprocessor shared memory systems is that a performance improvement can be obtained by parallelizing only the computationally intensive parts of the code.
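The benefit of partial parallelization can be estimated with Amdahl's law. It is used here only as a simple stand-in and is not necessarily the model that atexpert applies. In this minimal sketch, written in Python for convenience, parallel_fraction is a hypothetical measure of how much of the single-processor run time is spent in tasked loops, and the fractions tried are chosen for illustration only.

```python
def amdahl_speedup(parallel_fraction, ncpus):
    """Ideal speedup when only parallel_fraction of the run time is tasked."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / ncpus)


# Hypothetical parallel fractions, not measurements from any real code.
for fraction in (0.50, 0.90, 0.99):
    speedups = [round(amdahl_speedup(fraction, n), 2) for n in (1, 2, 4, 8)]
    print(f"{fraction:.2f} parallel: {speedups}")
```

Under this model a code that is 90% parallel gains only about a factor of 3 on 4 processors, which is why effort is best spent on the routines that dominate the run time.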
While there is much debate on the need to ensure that a single code doesn't hog the entire memory while only using one processor, partial parallelization can be beneficial to both the user running in parallel and other users who gain access to resources sooner. As with all optimization, users should try to concentrate effort on those routines which take the most time and perform relatively badly.
Applying compiler options is only a small step. Code modification might be needed and atexpert makes observations on the routines which are likely to be improved. Sometimes the algorithm must even be replaced.
An Example.
The following code is taken from one of my class examples on how to optimize code for the T3E. This is a simple, naturally parallel loop which both vectorizes and can be tasked across several processors.
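      program prog1a
!     Times a simple, fully parallel loop with the real-time clock irtc.

      parameter (nsize=1024*1024*4)

      real a1,b1,c1,a2,b2,c2
      common /d1/ a1(nsize),b1(nsize),c1(nsize)
      common /d2/ a2(nsize),b2(nsize),c2(nsize)

!     Clock ticks per second, used to convert ticks to a rate.
      iticks=irtc_rate()

      write(6,*) ' data size is ',nsize

!     Initialize the arrays (not timed).
      do n=1,nsize
        a1(n)=n*1.0
        b1(n)=n*2.5
        c1(n)=(nsize-n)*1.0
        a2(n)=a1(n)
        b2(n)=b1(n)
        c2(n)=c1(n)
      enddo

!     Timed loop: 9 floating-point operations per iteration
!     (4 multiplies and 5 adds).
      i4b=irtc()
      do n=1,nsize
        a2(n)=b2(n)*(b2(n)*(b2(n)*(b2(n)*(b2(n)+1)+1)+1)+1)+1
      enddo
      i4c=irtc()

!     The ratio is formed in integer arithmetic, so it is
!     truncated before being scaled by 9*nsize.
      write(6,*) ' best flops is ',9*nsize*(iticks()/(i4c-i4b))
!     Write one result to unit 7.
      write(7,*) a2(nsize)

      end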
If this is compiled with -O2 we get:
chilkoot% f90 -o prog_do_o2 -O2 prog_do.f
chilkoot% hpm ./prog_do_o2
data size is 4194304
best flops is 113246208
Group 0: CPU seconds    :    0.58     CP executing     :  116900662
Million inst/sec (MIPS) :    6.86     Instructions     :    4011169
Avg. clock periods/inst :   29.14
% CP holding issue      :   95.45     CP holding issue :  111586438
Inst.buffer fetches/sec :    0.00M    Inst.buf. fetches:       1796
Floating adds/sec       :   57.41M    F.P. adds        :   33554657
Floating multiplies/sec :   35.88M    F.P. multiplies  :   20971910
Floating reciprocal/sec :    0.00M    F.P. reciprocals :          1
Cache hits/sec          :    0.01M    Cache hits       :       5527
CPU mem. references/sec :   57.49M    CPU references   :   33602005
Floating ops/CPU second :   93.29M
chilkoot%
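The hpm counters in the -O2 report can be compared with the manual count used in the program (9 floating-point operations per iteration of the timed loop). The following is a minimal sketch in Python with the counter values copied from that report; the variable names are illustrative only.

```python
NSIZE = 1024 * 1024 * 4          # array length from the program
FLOPS_PER_ITERATION = 9          # 4 multiplies + 5 adds in the timed loop

# Counters copied from the hpm report for the -O2 run.
fp_adds = 33554657
fp_multiplies = 20971910
fp_reciprocals = 1
cpu_seconds = 0.58               # as printed by hpm, rounded to two places

hpm_total = fp_adds + fp_multiplies + fp_reciprocals
manual_timed_loop = FLOPS_PER_ITERATION * NSIZE
outside_timed_loop = hpm_total - manual_timed_loop

print("hpm floating ops, whole program:", hpm_total)
print("manual count, timed loop only  :", manual_timed_loop)
print("counted outside the timed loop :", outside_timed_loop)
print("Mflops per CPU second          : %.1f" % (hpm_total / cpu_seconds / 1e6))
```

hpm counts the whole program, so its total exceeds the manual count by roughly 4 operations per array element; these are presumably the operations in the initialization loop, which lies outside the irtc timing calls. The rate computed here comes out near 94 Mflops rather than the 93.29M that hpm prints, because cpu_seconds is the rounded value.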
Moving up to -O3 we get:
chilkoot% f90 -O3 -o prog_do_o3 prog_do.f
chilkoot% hpm ./prog_do_o3
data size is 4194304
best flops is 490733568
Group 0: CPU seconds    :    0.60     CP executing     :  119366230
Million inst/sec (MIPS) :    7.85     Instructions     :    4687445
Avg. clock periods/inst :   25.47
% CP holding issue      :   94.74     CP holding issue :  113082842
Inst.buffer fetches/sec :    0.00M    Inst.buf. fetches:       2207
Floating adds/sec       :   56.22M    F.P. adds        :   33554658
Floating multiplies/sec :   35.14M    F.P. multiplies  :   20971911
Floating reciprocal/sec :    0.00M    F.P. reciprocals :          1
Cache hits/sec          :    0.01M    Cache hits       :       3462
CPU mem. references/sec :   56.34M    CPU references   :   33628114
Floating ops/CPU second :   91.36M
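The "best flops" figure, which is based on wall-clock time, rises from 113246208 to 490733568, while the rate per CPU second reported by hpm is essentially unchanged. This happens because the default value for NCPUS on chilkoot is 4. If we set NCPUS to 3, the performance is: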
chilkoot% setenv NCPUS 3
chilkoot% hpm ./prog_do_o3
data size is 4194304
best flops is 377487360
Group 0: CPU seconds    :    0.59     CP executing     :  117534552
Million inst/sec (MIPS) :    7.96     Instructions     :    4680314
Avg. clock periods/inst :   25.11
% CP holding issue      :   94.69     CP holding issue :  111291466
Inst.buffer fetches/sec :    0.00M    Inst.buf. fetches:       2074
Floating adds/sec       :   57.10M    F.P. adds        :   33554658
Floating multiplies/sec :   35.69M    F.P. multiplies  :   20971905
Floating reciprocal/sec :    0.00M    F.P. reciprocals :          1
Cache hits/sec          :    0.01M    Cache hits       :       3462
CPU mem. references/sec :   57.21M    CPU references   :   33621009
Floating ops/CPU second :   92.78M
chilkoot%
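The three "best flops" values printed in the runs above are exact integer multiples of 9*nsize. This suggests that the ratio of the tick rate to the elapsed ticks is evaluated in integer arithmetic and so is truncated to a whole number before it is scaled. The short Python check below verifies only the arithmetic; the interpretation is an inference from the printed numbers, not something the compiler or hpm output states.

```python
NSIZE = 1024 * 1024 * 4
FLOPS_IN_TIMED_LOOP = 9 * NSIZE

# "best flops" values copied from the three runs shown above.
reported = {
    "-O2": 113246208,
    "-O3, NCPUS=4 (default)": 490733568,
    "-O3, NCPUS=3": 377487360,
}

multiples = {}
for label, value in reported.items():
    multiple, remainder = divmod(value, FLOPS_IN_TIMED_LOOP)
    multiples[label] = (multiple, remainder)
    print(f"{label}: {multiple} x 9*nsize, remainder {remainder}")
```

A truncated ratio of 3 only says that the timed loop took between 1/4 and 1/3 of a second of wall-clock time, so the printed rates are coarse. Forming the ratio in floating point would give a finer figure.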
Note the actual elapsed time is as follows:
chilkoot% env NCPUS=4 time ./bprog_do_o3
seconds clocks
elapsed 12.76160 1276159892
user 46.45744 4645744272
sys 1.18272 118272061
chilkoot% env NCPUS=3 time ./bprog_do_o3
seconds clocks
elapsed 16.70126 1670126109
user 46.99182 4699182046
sys 0.89066 89066264
chilkoot% env NCPUS=2 time ./bprog_do_o3
seconds clocks
elapsed 24.13021 2413021230
user 46.36825 4636825171
sys 0.73992 73992258
chilkoot% env NCPUS=1 time ./bprog_do_o3
seconds clocks
elapsed 46.82797 4682797159
user 46.19941 4619941115
sys 0.44435 44435070
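The effect of tasking on CPU time can be read from these four runs. The sketch below, in Python with the figures copied from the time output, adds user and sys time for each run and divides by the elapsed time to estimate how many processors were busy on average; the function and variable names are illustrative only.

```python
# (elapsed, user, sys) seconds copied from the time output, keyed by NCPUS.
timings = {
    4: (12.76160, 46.45744, 1.18272),
    3: (16.70126, 46.99182, 0.89066),
    2: (24.13021, 46.36825, 0.73992),
    1: (46.82797, 46.19941, 0.44435),
}


def cpu_summary(elapsed, user, system):
    """Return total CPU seconds and the average number of CPUs kept busy."""
    total_cpu = user + system
    return total_cpu, total_cpu / elapsed


for ncpus in sorted(timings):
    total_cpu, busy = cpu_summary(*timings[ncpus])
    print(f"NCPUS={ncpus}  total CPU={total_cpu:6.2f} s  average CPUs busy={busy:4.2f}")
```

Total CPU time is slightly higher in every tasked run than in the single-processor run, while the elapsed time drops sharply. A timer that reports only CPU time would therefore show no gain from tasking.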
The above shows good speedup in the wall-clock times. However, these results took advantage of a time when the system was relatively idle. Trying larger values of NCPUS during the same relatively idle period does not yield such good speedups because, unlike on the T3E and other MPP systems, processors are not dedicated to users but are shared with system and other user activity.
chilkoot% env NCPUS=12 time ./bprog_do_o3
seconds clocks
elapsed 5.58927 558926937
user 45.74897 4574896788
sys 1.70961 170961040
chilkoot% env NCPUS=8 time ./bprog_do_o3
seconds clocks
elapsed 6.58672 658671930
user 46.48904 4648903562
sys 0.74683 74682794
Note that when trying to use a large number of processors on an active system, contention with other users results in less than optimal speedups. On the ARSC J90, which has 12 processors, users are limited to a maximum of 4 processors at present. Users should determine which number of processors actually gives the best performance and set NCPUS accordingly (with setenv NCPUS) in their scripts.
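The elapsed times reported above can be turned into speedup and parallel efficiency figures. This is a minimal sketch in Python with the elapsed seconds copied from the time output; efficiency is taken here as speedup divided by NCPUS, and the function name is illustrative only.

```python
# Elapsed (wall-clock) seconds copied from the time output, keyed by NCPUS.
elapsed = {
    1: 46.82797,
    2: 24.13021,
    3: 16.70126,
    4: 12.76160,
    8: 6.58672,
    12: 5.58927,
}


def speedup_table(elapsed_by_ncpus):
    """Return (ncpus, speedup, efficiency) rows relative to the 1-CPU run."""
    baseline = elapsed_by_ncpus[1]
    rows = []
    for ncpus in sorted(elapsed_by_ncpus):
        speedup = baseline / elapsed_by_ncpus[ncpus]
        rows.append((ncpus, speedup, speedup / ncpus))
    return rows


for ncpus, speedup, efficiency in speedup_table(elapsed):
    print(f"NCPUS={ncpus:2d}  speedup={speedup:5.2f}  efficiency={efficiency:4.2f}")
```

For these runs the efficiency stays above 0.9 up to 4 processors and falls to about 0.7 at 12, consistent with the less than optimal speedups seen at large values of NCPUS.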
Documentation.
There is introductory documentation on autotasking in the manual pages for the f90 compilers and the atexpert tool online. Atexpert contains a demo which is worth looking at in terms of the different outputs possible. A Cray document SR-2182, A Guide to Parallel Vector Applications, is also good reading on the various tools and can be found online at the ARSC web site for access by ARSC users.
Center for Research on Parallel Computation -- Newsletter
> The Spring/Summer 1998 issue of Parallel Computing Research, the
> quarterly newsletter of the Center for Research on Parallel
> Computation, is now available at:
>
> http://www.crpc.rice.edu/CRPC/newsletters/sum98/
>
> Previous issues and articles can be found at:
>
> http://www.crpc.rice.edu/CRPC/newsletters/index.html .
>
> If you have any difficulties accessing materials, please contact Kathy
> El-Messidi at elmessy@rice.edu. If you do not have a Web browser, write
> Kathy at the same address to request specific articles from the list
> below this message.
>
> To subscribe or unsubscribe, mail requests to pcr@cs.rice.edu .
Co-Array Fortran Paper Available
[ We received the following announcement from John Reid. ]
The paper defining Co-Array Fortran is available, and will be published in the next issue of Fortran Forum.
This is the abstract:
Co-Array Fortran, formerly known as F--, is a small extension of Fortran 95 for parallel processing. A Co-Array Fortran program is interpreted as if it were replicated a number of times and all copies were executed asynchronously. Each copy has its own set of data objects and is termed an image. The array syntax of Fortran 95 is extended with additional trailing subscripts in square brackets to give a clear and straightforward representation of any access to data that is spread across images.
References without square brackets are to local data, so code that can run independently is uncluttered. Only where there are square brackets, or where there is a procedure call and the procedure contains square brackets, is communication between images involved.
There are intrinsic procedures to synchronize images, return the number of images, and return the index of the current image.
We introduce the extension; give examples to illustrate how clear, powerful, and flexible it can be; and provide a technical definition.
Significant recent changes involve synchronization and I/O.
The paper is available as the report RAL-TR-1998-060, by ftp from: matisa.cc.rl.ac.uk
in the file: pub/reports/nrRAL98060.ps.gz
Parallel Programmer Sought for Position at UAF
>
> POSTDOCTORAL FELLOW
> PETROLEUM DEVELOPMENT LABORATORY
> UNIVERSITY OF ALASKA FAIRBANKS
>
> The Petroleum Development Laboratory at the University of Alaska
> Fairbanks is seeking a postdoctoral fellow with an interest in
> massively parallel computations and large-scale petroleum reservoir
> simulations. The candidate will participate in NSF-funded parallel
> programming research projects developing parallel data and computation
> on distribution algorithms, computer-aided parallelizers for numerical
> simulation codes and apply parallel codes to multi-million grid-cell
> reservoirs on high performance computers (CRAY T3E, clusters of
> Workstations).
>
> A Ph.D. degree in Applied Mathematics, Computational Physics, Computer
> Science, Fluid Mechanics, Engineering or closely related field is
> required. Candidate must demonstrate ability to write parallel
> computational codes in C and FORTRAN. Preference will be given to
> candidates with experience in PVM and MPI. The appointment will be
> initially for 1 year, beginning September 1, 1998, and may be renewable
> for up to 2 years depending on availability of funds.
>
>
> Send an updated curriculum vitae and names and phone numbers of three
> references to Professor David O. Ogbe, 437 Duckering Building P.O. Box
> 755880 University of Alaska Fairbanks, Alaska 99775-5880. Fax: (907)
> 474-5912 Email: ffdoo@uaf.edu
>
Quick-Tip Q & A
A: {{ How do I determine which version of f90 I am using? }}
f90 -V
"-V" works for other products as well:
pghpf -V
cc -V
CC -V
Q: ARSC permits the "chaining" of NQS jobs, as long as the
new job goes to the end of the queues. This can increase system
utilization.
Put another way, at the end of your qsub script, you may include a
qsub call which submits your next job--provided that jobs in all
other queues, including lower priority queues, get a chance to run
first.
What is a safe method to implement such chaining which is fair to
other users?
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
