ARSC HPC Users' Newsletter 283, December 19, 2003
Iceberg Status Update
As reported in issues 265 and 279, ARSC is installing a large IBM p655+/p690+ cluster, "iceberg," with IBM's new Federation Switch technology for the server interconnect. The switch was successfully installed early this month and is now being tested by ARSC staff.
A schedule for pioneer user access will be announced later.
A "Universal" High Performance Code?
[ Thanks to Jeff McAllister for this article and code. ]
Much is said about the benefits of one architecture over another. However, from the standpoint of writing code, extensive optimization for a specific machine is often not the best use of effort.
I hoped to find some concepts which could endure beyond a product lifecycle. I set out to write a simple, portable, distributed memory multiprocessor code free of specific architecture optimizations yet able to achieve near-peak performance. For inspiration I started with Guy Robinson's hard-to-beat "gflop" code from HPC Newsletter 213.
What I ended up with is an MPI midpoint-rule integral solver. Code is included below. The function this version integrates is y=.5x+1 from 0 to 2000. The result should be very close to 1002000, regardless of the number of processors used. (Integrals make nice performance tests because they can generate a lot of work, each step can be done independently, and results are easily predictable.)
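As a quick sanity check of the expected answer, the following sketch applies the midpoint rule to y=.5x+1 on [0, 2000] with awk and compares it with the closed form 0.25*b*b + b. The step count of 2000 is an assumption chosen so the check runs instantly. It is far coarser than the interval used in the MPI program, but the midpoint rule is exact for a linear function, so the result is the same.

```shell
#!/bin/sh
# Midpoint-rule check of the integral of y = 0.5x + 1 on [0, 2000].
# n = 2000 steps is an assumption made for speed, not the program's step count.
midpoint=$(awk 'BEGIN {
    a = 0; b = 2000; n = 2000
    h = (b - a) / n                      # width of one step
    for (i = 0; i < n; i++) {
        xmid = a + (i + 0.5) * h         # midpoint of step i
        sum += (0.5 * xmid + 1.0) * h    # area of one rectangle
    }
    printf "%.1f", sum
}')

# Closed form: integral of 0.5x + 1 from 0 to b is 0.25*b*b + b.
exact=$(awk 'BEGIN { b = 2000; printf "%.1f", 0.25 * b * b + b }')

echo "midpoint: $midpoint  exact: $exact"
```

Both values should print as 1002000.0.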
As I was looking for operations/second, I manually counted 12 adds/multiplies per loop, including incrementing the counter. Compiler optimizations like unrolling will affect the actual instruction count, so the compute rates the program reports should be considered estimates. However, the results are usually close to the actual CPU counter values reported by the various vendor tools; see:
Cray X1 "pat_hwpc", below; IBM's "hpmcount", HPC newsletter 251; Cray PVP "hpm", HPC newsletter 207; Cray MPP "pat", T3E newsletter 172.
Timing, always a sticky issue, is handled with MPI_Wtime. This might not always provide the best granularity possible, but it's portable.
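The rate the program reports follows directly from the hand count and the timer: loops times operations per loop, divided by elapsed seconds. The sketch below repeats that arithmetic with awk. The elapsed time of 4.0 seconds is a hypothetical value for illustration, not a measured result.

```shell
#!/bin/sh
# Estimated MFLOPS = (loop iterations * ops per iteration) / seconds / 1e6.
# elapsed=4.0 is a hypothetical timing, not a measurement from any system.
elapsed=4.0
mflops=$(awk -v t="$elapsed" 'BEGIN {
    loops = (2000.0 - 0.0) / 0.000001   # (b - a) / interval, as in the program
    ops   = loops * 12                  # 12 hand-counted adds/multiplies per loop
    printf "%.1f", ops / t / 1000000.0
}')
echo "estimated MFLOPS: $mflops"
```

With these inputs the estimate is 6000.0 MFLOPS. Substituting a real MPI_Wtime difference for elapsed gives the figure the program prints.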
As you can see, the attempt at universal performance had mixed results:
                                 1 CPU MFLOPS
                              -------------------
system     CPU                achieved       peak    % of peak achieved
========   ===============    ========    =======    ==================
iceflyer   1.7 GHz Power4         6066       6800            89
klondike   X1 MSP                 8248      12800            64
klondike   X1 SSP                 2255       3200            70
chilkoot   500 MHz SV1ex          1513       2000            76
yukon      450 MHz DEC             678        900            75
ambler     300 MHz R12k            493        600            82
quest      333 MHz Pent II         144        333            43
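The last column is simply achieved MFLOPS divided by peak MFLOPS. A minimal sketch that recomputes it for three of the rows (the function name pct_of_peak is made up for this example):

```shell
#!/bin/sh
# Recompute "% of peak" as achieved / peak * 100, rounded to an integer.
# pct_of_peak is a hypothetical helper name, not part of any tool.
pct_of_peak() {
    awk -v a="$1" -v p="$2" 'BEGIN { printf "%.0f", 100 * a / p }'
}

echo "iceflyer:     $(pct_of_peak 6066 6800)%"
echo "klondike MSP: $(pct_of_peak 8248 12800)%"
echo "quest:        $(pct_of_peak 144 333)%"
```

This prints 89%, 64% and 43%, matching the table.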
Consistent high performance is elusive. The code is quite portable and should run wherever Fortran 90 and MPI are available. IO and memory access are not bottlenecks in this code. On vector machines this code vectorizes perfectly. On cache machines it has about as much locality as you can get. The number of memory locations necessary is so low the variables should remain in CPU registers without ever needing to access even L1 cache. Even so, peak is still far away for some platforms.
Compiler options may help. On the Power4 machines, for example, this code runs almost twice as fast when compiled with -O4 as with -O3. The -O5 option is not best in this case. And on the T3E, performance improved from 207 to 678 MFLOPS with: "ftn -Oscalar3,aggress,bl,pipeline3,split2,unroll2 integral_mpi.f90". (My default on all the other platforms is -O3.) More time with the compiler options may lead to similar improvement, especially in the cases farthest from peak.
However, I'm not convinced it will be so easy. While originally developed for our Cray X1, this code's performance showed a definite spike when moved to the IBM systems. (For other codes, the reverse could just as easily be true.) Probably the main reason this program does so well on the Power4 chips has to do with a lucky match between the architecture and the algorithm. As this is a midpoint integral solver, the main kernel has a lot of multiplies and adds in succession:
do i=0,nsteps
   x1=(i*interval)+a1
   x2=((i+1)*interval)+a1
   xmid=(x1+x2)*.5
   y=(xmid*.5)+1.0
   sum1=sum1+(y*interval)
end do
Multiply-add (FMA) just happens to be a single hardware operation on the Power4. When the compiler can represent code with this instruction, two operations occur in one cycle.
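The FMA also accounts for the peak figure quoted for iceflyer in the table. The sketch below assumes two FMA-capable floating-point units per Power4 CPU, a detail the article does not state, and counts each FMA as two floating-point operations.

```shell
#!/bin/sh
# Assumed, not stated in the article: two FMA-capable floating-point
# units per Power4 CPU.
clock_mhz=1700      # 1.7 GHz iceflyer CPU, from the table
fma_units=2         # assumption
flops_per_fma=2     # one multiply plus one add
peak_mflops=$(( clock_mhz * fma_units * flops_per_fma ))
echo "peak: $peak_mflops MFLOPS"
```

Under that assumption the result is 6800 MFLOPS, the peak listed for the 1.7 GHz Power4. A kernel made of back-to-back multiply-add pairs, such as y=(xmid*.5)+1.0, is the kind of code that can approach it.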
Clearly, some algorithms are a better fundamental match to some architectures than others. Guy Robinson's original "gflop" still gets closer to peak on the Cray vector systems, though this code performs better on the IBMs (and, presumably, on a wider variety of MPP and vector systems). As another argument in this code's favor, it could be more easily rewritten to match the special features of any platform, possibly by just changing the function it integrates.
The search for a universal strategy to demonstrate and achieve high performance continues. Fortunately, just as there is a wide variety of codes, there is a wide variety of machines to run them.
Here is the program:
program integral
implicit none
include 'mpif.h'
!-----------------------------------------
! declare variables
!-----------------------------------------
integer::nsteps,i,nparts,part,master,mype,ierr,totpes,ops_per_loop
real(kind=8)::sum1,interval,x1,x2,xmid,y,a,b,area,a1,b1,fullsum
real(kind=4)::sumbuf
real(kind=4),allocatable,dimension(:)::psum
integer,allocatable,dimension(:)::pstart,pend
double precision::time,time1, time2,total_ops,total_loops
!-----------------------------------------
! initialize MPI
!-----------------------------------------
master=0
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world, mype, ierr)
call mpi_comm_size(mpi_comm_world, totpes, ierr)
!-----------------------------------------
! partition the integration steps
!-----------------------------------------
nparts=totpes
part=mype
allocate(pstart(0:nparts-1),pend(0:nparts-1))
interval=.000001
a=0.0
b=2000.0
call partitiontable(int(b-a),nparts,pstart,pend)
a1=pstart(part)+a-1
b1=pend(part)+a
nsteps=((b1-a1)/interval)
!-----------------------------------------
! KERNEL
! each proc calculates the integral
! from a1 to b1 -- only this is timed
!-----------------------------------------
sum1=0
ops_per_loop=12
time1=mpi_wtime()
do i=0,nsteps
   x1=(i*interval)+a1
   x2=((i+1)*interval)+a1
   xmid=(x1+x2)*.5
   !-----------------------
   ! function integrated
   !-----------------------
   y=(xmid*.5)+1.0
   sum1=sum1+(y*interval)
end do
time2=mpi_wtime()
!-----------------------------------------
! END KERNEL
!-----------------------------------------
!-----------------------------------------
! partial sums are gathered to master
!-----------------------------------------
if (mype==master) allocate(psum(totpes))
sumbuf=sum1
call mpi_gather(sumbuf,1,mpi_real,psum,1,mpi_real,master,mpi_comm_world,ierr)
!-----------------------------------------
! master prints results
!-----------------------------------------
if (mype==master) then
print ("(A16,x,F15.3)"),"result:",sum(psum)
total_loops=(b-a)/interval
total_ops=total_loops*ops_per_loop
print ("(A16,x,F15.0)"),"total ops:",total_ops
time=time2-time1
print ("(A16,x,F15.1)"),"elapsed time:",time
print ("(A16,x,F15.1)"),"MFLOPS:",total_ops/time/1000000.0
end if
call mpi_finalize(ierr)
end program integral
! ----------------------------------------------
! This is a generic subroutine to generate a
! partition table over the range 1:MAX.
! There are probably shorter ways to do this.
! ----------------------------------------------
SUBROUTINE partitiontable(MAX,totalPEs,Pstart,Pend)
IMPLICIT NONE
! -------- SCALAR VARIABLES -------
INTEGER MAX
INTEGER totalPEs
INTEGER I
INTEGER DIFF1,DIFF2
INTEGER LastOne
REAL LOOP_INC
REAL POS
! -------- DIMENSIONED VARIABLES-----
INTEGER Pstart(0:totalPEs-1)
INTEGER Pend(0:totalPEs-1)
! Set up the partition increments
pos=0
loop_inc=real(MAX-1)/real(totalPEs)
LastOne=totalPEs-1
! Create the initial partition table
DO I=0,LastOne
pos=pos+1.
Pstart(I)=pos
pos=pos+loop_inc-1
Pend(I)=pos
ENDDO
! Correct for imbalances caused by truncation of division
! for the increments by widening the first partition by one
IF (totalpes.ge.2) THEN
Pend(0)=Pend(0)+1
DO I=1,LastOne
Pstart(i)=Pstart(i)+1
Pend(i)=Pend(i)+1
ENDdo
ENDif
if (Pend(LastOne).ne.MAX) Pend(LastOne)=MAX
RETURN
END SUBROUTINE partitiontable
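One possible shorter way to build such a table is plain integer arithmetic. The sketch below is a shell illustration of that idea, not a translation of the subroutine above. The function name partition is made up, and when MAX does not divide evenly it may place block boundaries differently from partitiontable.

```shell
#!/bin/sh
# Print "start end" for each of n contiguous blocks covering 1..max.
# Integer division spreads any remainder across the blocks.
# partition is a hypothetical name for this example.
partition() {
    max=$1; n=$2; k=0
    while [ "$k" -lt "$n" ]; do
        start=$(( k * max / n + 1 ))
        end=$(( (k + 1) * max / n ))
        echo "$start $end"
        k=$(( k + 1 ))
    done
}

partition 2000 4
```

For 2000 steps on 4 processors this prints the blocks 1-500, 501-1000, 1001-1500 and 1501-2000.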
Basic X1 Optimization Tools: pat_hwpc, loopmarks, cray_pat
Once your code is running correctly on the X1, you may want to assess its performance to determine whether it needs to be sped up.
The three basic X1 performance analysis tools are:
- pat_hwpc
- compiler loopmark listing
- cray_pat
pat_hwpc
This tool reports data from the X1 hardware performance counters. The data is collected over the entire run of the program, and thus can't point you to a specific subroutine or loop which might need attention. It has no effect on the performance of your code, and running it requires no recompilation or relinking. To use it, preface your execution command with "pat_hwpc," as follows:
% pat_hwpc ./a.out
or for an MPI job:
% pat_hwpc mpirun -np 8 ./a.out
This tool works in most cases, is trivial to use, and provides invaluable data.
The following pat_hwpc output is from an actual user application (140k lines of source). It shows us that 98% of the operations are vector (this is good!), vector length is 44 (okay), computational intensity is 4.5 flops per load (good), and performance is 1.4 GFLOPS (okay). This is acceptable, but if the user were planning multiple long production runs we'd want to dig deeper for possible improvement.
Here's the pat_hwpc output:
Exit status 0
Host name & type klondike crayx1 400 MHz
Operating system UNICOS/mp 2.3.13 12011516
Text page size 16 Mbytes
Other page size 16 Mbytes
Start time Fri Dec 19 14:46:09 2003
End time Fri Dec 19 14:54:28 2003
Elapsed time 499.016 seconds
User time 425.327 seconds 85%
System time 58.844 seconds 12%
Logical pe: 0 Node: 16 PID: 2414
Process resource usage:
User time 425.320431 seconds
System time 58.765827 seconds
P counter data
CPU Seconds 462.657200 sec
Cycles 1286.753M/sec 595325673264 cycles
Instructions graduated 204.432M/sec 94582156093 instr
Branches & Jumps 8.882M/sec 4109224077 instr
Branches mispredicted 0.651M/sec 301007611 misses 7.325%
Correctly predicted 8.231M/sec 3808216466 misses 92.675%
Vector instructions 37.922M/sec 17544896116 instr 18.550%
Scalar instructions 166.510M/sec 77037259977 instr 81.450%
Vector ops 1680.548M/sec 777517473692 ops
Vector FP adds 670.028M/sec 309993331718 ops
Vector FP multiplies 679.187M/sec 314230840083 ops
Vector FP divides etc 6.804M/sec 3147863184 ops
Vector FP misc 11.129M/sec 5148866098 ops
Vector FP ops 1367.148M/sec 632520901083 ops 98.244%
Scalar FP ops 24.432M/sec 11303869190 ops 1.756%
Total FP ops 1391.581M/sec 643824770273 ops
FP ops per load 4.532 flops/load
Scalar integer ops 26.255M/sec 12146964198 ops
Scalar memory refs 29.890M/sec 13828976669 refs 9.735%
Vector TLB misses 0.000M/sec 2645 misses
Scalar TLB misses 0.000M/sec 356 misses
Instr TLB misses 0.000M/sec 627 misses
Total TLB misses 0.000M/sec 3628 misses
Dcache references 25.188M/sec 11653277955 refs 84.267%
Dcache bypass refs 4.703M/sec 2175698714 refs 15.733%
Dcache misses 6.320M/sec 2923896184 misses
Vector integer adds 4.292M/sec 1985815905 ops
Vector logical ops 8.002M/sec 3702187534 ops
Vector shifts 5.657M/sec 2617061278 ops
Vector int ops 17.951M/sec 8305064717 ops
Vector loads 233.222M/sec 107901643404 refs
Vector stores 43.923M/sec 20321299783 refs
Vector memory refs 277.145M/sec 128222943187 refs 90.265%
Scalar memory refs 29.890M/sec 13828976669 refs 9.735%
Total memory refs 307.035M/sec 142051919856 refs
Average vector length 44.316
A-reg Instr 60.553M/sec 28015211191 instr
Scalar FP Instr 24.432M/sec 11303869190 instr
Syncs Instr 4.864M/sec 2250427345 instr
Stall VLSU 665.889secs 266355799373 clks
Stall VU 1037.923secs 415169288518 clks
Vector Load Alloc 205.741M/sec 95187337274 refs
Vector Load Index 1.947M/sec 900922851 refs
Vector Load Stride 25.407M/sec 11754815962 refs
Vector Store Alloc 43.853M/sec 20289135770 refs
Vector Store Stride 1.356M/sec 627230085 refs
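The summary figures quoted before the listing can be recomputed from the raw counters. The sketch below uses awk on values copied from the output above. Note that the reported 4.532 "FP ops per load" is reproduced by dividing total FP ops by total memory refs. That is an observation about these numbers, not a statement of how pat_hwpc defines the metric.

```shell
#!/bin/sh
# Counter values copied from the pat_hwpc listing above.
vector_fp=632520901083     # Vector FP ops
total_fp=643824770273      # Total FP ops
mem_refs=142051919856      # Total memory refs
cpu_secs=462.657200        # CPU Seconds

# Share of floating-point work done in vector mode, in percent.
vector_pct=$(awk -v v="$vector_fp" -v t="$total_fp" \
    'BEGIN { printf "%.3f", 100 * v / t }')

# Floating-point operations per memory reference.
flops_per_ref=$(awk -v t="$total_fp" -v m="$mem_refs" \
    'BEGIN { printf "%.3f", t / m }')

# Sustained rate in GFLOPS.
gflops=$(awk -v t="$total_fp" -v s="$cpu_secs" \
    'BEGIN { printf "%.1f", t / s / 1000000000 }')

echo "vector FP share: $vector_pct%"
echo "flops per ref:   $flops_per_ref"
echo "GFLOPS:          $gflops"
```

The three results are 98.244%, 4.532 and 1.4, the same figures cited in the discussion of this run.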
Compiler Loopmark Listing
For Fortran codes, add "-rm" to your list of "ftn" options. E.g.:
% ftn -O3 -rm -c mySubroutine.f
ftn will compile the source and create a listing file (giving it the ".lst" suffix), like this:
mySubroutine.lst
The .lst file shows the source code and optimizations. A legend at the top of the file explains the various symbols.
Here's an example of what you want to see: the loops where all the work is happening are marked with "MV", meaning the compiler successfully vectorized and streamed them.
%%%    L o o p m a r k   L e g e n d    %%%

Primary Loop Type        Modifiers
------- ---- ----        ---------
A - Pattern matched      b - blocked
C - Collapsed            f - fused
D - Deleted              i - interchanged
E - Cloned               m - streamed but not partitioned
I - Inlined              p - conditional, partial and/or computed
M - Multistreamed        r - unrolled
P - Parallel/Tasked      s - shortloop
V - Vectorized           t - array syntax temp used
W - Unwound              w - unwound

588. 1-----< do ispin = 1,1
589. 1 MV--< do n=1,nplwv
590. 1 MV rvxc(n,ispin) = rvxc(n,ispin) + real(cexf(n,ispin))
591. 1 MV--> enddo
592. 1-----> enddo
593.
594.
595. 1-----< do ispin=1,1
596. 1 MV--< do n=1,nplwv
597. 1 MV xcenc = xcenc + (xcend(n,ispin)-rvxc(n,ispin))
598. 1 MV $ *density(n,ispin)
599. 1 MV ecorec = ecorec +rvxc(n,ispin)*dencore(n)/float(1)
600. 1 MV--> enddo
601. 1-----> enddo
At the end of the ".lst" file you'll find messages explaining why each loop was or wasn't vectorized or streamed (for instance, there was a dependency on variable "X", a non-vectorizable function call, etc.). Loopmark listing is now available for C programs, too.
Cray_pat
Loopmark listing (above) is only really useful if you know where the code spends its time. (A loop which accounts for 0.1% of the time can be ignored, even if it doesn't vectorize.)
Profiling your code helps focus your optimization efforts. Cray_pat is like Unix prof or gprof, and will show the percentage of time spent in each subroutine or function (or loop, if needed).
Here's how to get a basic profile. First, compile your code as usual, with all desired optimizations. Then:
Step 1:
"Instrument" the executable file for profiling.
The exact object files used when the file was linked must be available in their original locations because "instrumenting" the code automatically relinks it as well. The "pat_build" tool performs the task. In this example, a.out is a pre-existing executable file, and a.out.inst will be generated:
% pat_build a.out a.out.inst
Step 2:
Run the instrumented binary exactly as you'd run the original. This will produce an experiment file (with the suffix, .xf), containing output statistics for the run.
% ./a.out.inst
Step 3:
Generate a human-readable report from the .xf file using a second tool, "pat_report." E.g.:
% pat_report -i a.out.inst -o a.out.pat_report <.xf file>
Step 4:
View the report:
% more a.out.pat_report
The report will give you a table similar to this:
100.0%  100.0%  2386567  Total
---------------------------------------------------
 39.8%   39.8%   950376  count_pair_position_
 17.1%   56.9%   406924  vdw_compute_insert_
  6.8%   63.7%   163462  compute_cb_environment_
  4.5%   68.2%   107056  evaluate_envpair_
  3.9%   72.1%    92691  __bcopy_prv
  3.7%   75.8%    89230  vdw_compute_reset_
  3.7%   79.5%    87765  setup_atom_type_
  2.7%   82.2%    64119  bcmp
  2.6%   84.8%    62419  evaluate_ss_
  2.0%   86.8%    48213  _F90_FCD_ASG
  1.8%   88.6%    42818  refold_coordinates_
  1.8%   90.4%    41899  _F90_FCD_CMP_EQ
  1.4%   91.8%    33712  memcmp
  0.9%   92.7%    21880  setup_allatom_list_
  0.8%   93.5%    18594  name_from_num_
  0.8%   94.3%    18371  res1_from_num_
  0.7%   95.0%    17858  __cis
Given this table, you know which loopmark listing file to examine first: in this case, the one containing the subroutine "count_pair_position".
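For a long report it can help to cut the table down mechanically. The sketch below feeds the first eight rows of the report above through awk and keeps the routines whose cumulative percentage (the second value in each row) is at or below a cutoff. The 80% cutoff and the variable names are assumptions for illustration, and the input here is retyped rows, not a real pat_report file.

```shell
#!/bin/sh
# Rows copied from the report above: percent, cumulative percent, count, name.
rows='39.8% 39.8% 950376 count_pair_position_
17.1% 56.9% 406924 vdw_compute_insert_
6.8% 63.7% 163462 compute_cb_environment_
4.5% 68.2% 107056 evaluate_envpair_
3.9% 72.1% 92691 __bcopy_prv
3.7% 75.8% 89230 vdw_compute_reset_
3.7% 79.5% 87765 setup_atom_type_
2.7% 82.2% 64119 bcmp'

# Keep routines until the cumulative share passes the (assumed) 80% cutoff.
# Adding 0 makes awk read "39.8%" as the number 39.8.
hot=$(printf '%s\n' "$rows" | awk '$2 + 0 <= 80 { print $4 }')

echo "$hot"
```

With this cutoff the list holds seven routines, starting with count_pair_position_.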
If you need help with any of these tools, contact ARSC consulting (consult@arsc.edu). Also see our "getting started" document for the X1:
http://www.arsc.edu/support/howtos/usingx1.html
Quick-Tip Q & A
A:[[ I'm finally appreciating the benefits of the "find" command, but
[[ here's a problem.
[[
[[ When I use grep from a find command, grep doesn't tell me the names
[[ of the files! Sure I've got hits, but what good is it if I can't
[[ tell what files they're in?
[[
[[ % find . -name "*.f" -exec grep -i flush6 {} \;
[[ include(flush6)
[[ !!dvo!! include(flush6)
[[ !!dvo!! include(flush6)
[[ include(flush6)
[[
[[ Any suggestions?
#
# Many thanks to nine (yes, 9) responders. There was duplication, so here
# are 5 responses which cover the range of answers.
#
#
# John Skinner
#
You have to add an extra filename for grep. /dev/null works best:
% find . -name "*.f" -exec grep -i "program rir" /dev/null {} \;
This is needed because grep won't list the filename of a match when
given only one file on the command line or when a wildcard like *.f only
expands to one filename. Since find's -exec option runs grep on only one
filename at a time, grep never gets two or more files on its command
line. Add /dev/null to get 2 files each time grep is run, with one of
them guaranteed to NEVER match.
You can also "turn around" the find/grep,
% grep -i "program rir" `find . -name "*.f"`
but check this out when *.f winds up being only one filename:
% ls *.f
r.f
What the heck! Where's my filename, with either method??
% grep -i "program rir" `find . -name "*.f"`
program rir
% find . -name "*.f" -print | xargs grep -i "program rir"
program rir
Again, add an extra filename for grep:
% grep -i "program rir" `find . -name "*.f"` /dev/null
./r.f: program rir
% find . -name "*.f" -print | xargs grep -i "program rir" /dev/null
./r.f: program rir
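The effect is easy to reproduce in a scratch directory. This sketch creates a single file named r.f, as in the example above, and runs the find command with and without /dev/null. The temporary directory and the file contents are made up for the demonstration.

```shell
#!/bin/sh
# Scratch directory holding one Fortran file (contents are made up).
dir=$(mktemp -d)
printf '      program rir\n      end\n' > "$dir/r.f"
cd "$dir" || exit 1

# One file per grep invocation: the match is printed without a filename.
without=$(find . -name "*.f" -exec grep -i "program rir" {} \;)

# /dev/null as an extra file operand makes grep prefix the filename.
with=$(find . -name "*.f" -exec grep -i "program rir" /dev/null {} \;)

echo "without: $without"
echo "with:    $with"

# Clean up the scratch directory.
cd / && rm -rf "$dir"
```

The second line of output carries the ./r.f: prefix and the first does not.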
#
# Brad Chamberlain
#
The key is to find the flag on your grep command that prints the filename,
since find will call grep on each file one by one. On my desktop
systems (linux-based), it's --with-filename, so I use:
find . -name "*.txt" -exec grep --with-filename ZPL {} \;
#
# Daniel Kidger
#
Many versions of grep (e.g., GNU) have a -H option. This prefixes the
output with the filename. The -n option of grep is handy too - it shows
the line number in the file. Also I generally prefer to use 'xargs'
rather than the slightly clumsy '-exec' option of find. (The -l option of
xargs feeds one line at a time to whatever command follows.)
Hence
$ find . -name "*.f" | xargs -l grep -inH getarg
./danmung.f:91: call getarg(1,file_in)
./danmung.f:92: call getarg(2,file_out)
./danfe.f:529:! .. cf use of GETARG, & if NARG = 0.
(Note, in years gone by 'find' often needed a '-print' option in the
above.)
#
# Jed Brown
#
You are probably looking for the -H option for grep (most versions).
Otherwise, you can use:
% grep -i flush6 `find . -name "*.f"`
since usually, grep prints the name of the file if it receives several
arguments on the command line. If this does not work or if it exceeds
the maximum number of command line arguments, you can always do
something like:
% echo 'a=$1; shift; for f in $*; do grep $a $f | sed "s|^|$f:\t|"; done' > mygrep
% chmod a+x mygrep && find . -name "*.f" -exec ./mygrep "-i flush6" {} \;
#
# Kurt Carlson
#
In ksh syntax:
find . -name "*.f" -print | while read F; do
  grep -i flush6 $F >/dev/null; if [ 0 = $? ]; then echo "# $F"; fi
done
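When only the names of the matching files are wanted, as in the loop above, grep's -l option prints each matching file name once. Find's "{} +" form of -exec passes file names to grep in batches rather than one at a time, which also avoids the command-line length limit mentioned earlier. A minimal sketch in a scratch directory, with made-up file names and contents:

```shell
#!/bin/sh
# Scratch tree with one matching and one non-matching file (made up).
dir=$(mktemp -d)
mkdir "$dir/sub"
printf '      include(flush6)\n' > "$dir/a.f"
printf '      call other\n' > "$dir/sub/b.f"
cd "$dir" || exit 1

# grep -l prints only the names of files that contain a match;
# "{} +" lets find hand its results to grep in batches.
names=$(find . -name "*.f" -exec grep -il flush6 {} +)
echo "$names"

# Clean up the scratch directory.
cd / && rm -rf "$dir"
```

Only ./a.f is listed, since it is the only file containing flush6.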
Q: Are data written from a fortran "implied do" incompatible with a
regular "read"? If so, is there a way to make them compatible,
without rewriting the code?
I just want to read data elements one item at a time from a
previously written file. Here's a test program which attempts
to show the problem:
iceflyer 56% cat unformatted_io.f
program unformatted_io
implicit none
integer, parameter :: SZ=10000, NF=111
real, dimension (SZ) :: z
real :: z_item, zsum
integer :: k
zsum = 0.0
do k=1,SZ
call random_number (z(k))
zsum = zsum + z(k)
enddo
print*,"SUM BEFORE: ", zsum
open(NF,file='test.out',form='unformatted',status='new')
write(NF) (z(k),k=1,SZ)
close (NF)
zsum=0.0
print*,"SUM DURING: ", zsum
open(NF,file='test.out',form='unformatted',status='old')
do k=1,SZ
read(NF) z_item
zsum = zsum + z_item
enddo
close (NF)
print*,"SUM AFTER: ", zsum
end
iceflyer 57% xlf90 unformatted_io.f -o unformatted_io
** unformatted_io === End of Compilation 1 ===
1501-510 Compilation successful for file unformatted_io.f.
iceflyer 58% ./unformatted_io
SUM BEFORE: 5018.278320
SUM DURING: 0.0000000000E+00
1525-001 The READ statement on the file test.out cannot be completed
because the end of the file was reached. The program will stop.
iceflyer 59%
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven      ARSC HPC Specialist             ph: 907-450-8669
Kate Hedstrom    ARSC Oceanographic Specialist   ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020

E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
