ARSC HPC Users' Newsletter 235, December 14, 2001
QUIZ: Vectorization
Here's a simple program to implement trapezoidal integration. Can I get the loop to vectorize on the SV1ex, and improve its performance?
CHILKOOT$ cat trap.serial.f
!******************************************************
! serial.f -- calculate definite integral using trapezoidal rule.
!
! The function f(x) is hardwired.
! Input: From file 'trap.input': a, b, n
! Output: estimate of integral from a to b of f(x)
! using n trapezoids.
!******************************************************
      PROGRAM serial
      implicit none
      real :: integral   ! accumulates sum of trapezoids
      real :: a          ! lower value of interval
      real :: b          ! upper value of interval
      integer :: n       ! number of trapezoids
      real :: h          ! width of trapezoid
      real :: side(0:1)  ! sides of trapezoid
      integer:: i        ! which trapezoid computing
      real :: f          ! real valued function integrating

      open (unit=44, file='trap.input', status='old')
      read (44, '(2f8.5,i12)') a, b, n
      close (44)

      h = (b-a)/n
      integral = 0

! Left side of first trapezoid
      side(1) = f(a+0*h)
      do i = 0 , n-1

! Right side of current trapezoid. Left side of next.
         side(MOD(i,2)) = f(a+(i+1)*h)

         integral = integral + h*(side(0) + side(1)) / 2.0
      enddo

      print *,'With n =', n,' trapezoids, our estimate'
      print *,'of the integral from ', a, ' to ',b, ' = ' , integral
      end
!******************************************************
      real function f(x)
      real x
      f = 4.0 / (x**2 + 1)
      return
      end
!******************************************************
When compiled for loopmark listing, as follows:
f90 -O3 -rm -o trap.serial trap.serial.f
we can see in the listing file that the loop is marked with "1's" and not "V's", which tells us it's not vectorized. From the listing file:
32. 1--< do i = 0 , n-1
33. 1
34. 1 ! Right side of current trapezoid. Left side of next.
35. 1 side(MOD(i,2)) = f(a+(i+1)*h)
36. 1
37. 1 integral = integral + h*(side(0) + side(1)) / 2.0
38. 1--> enddo
The run takes a long time, and it only gets 24 MFLOPS. Here's output including hpm statistics:
CHILKOOT$ hpm ./trap.serial
With n = 50000000 trapezoids, our estimate
of the integral from 0.E+0 to 1. = 3.141592457548199
Group 0:
CPU seconds             : 20.90579    CP executing      : 10452894570
Million inst/sec (MIPS) : 193.76      Instructions      : 4050623322
Avg. clock periods/inst : 2.58
% CP holding issue      : 45.93       CP holding issue  : 4801389871
Inst.buffer fetches/sec : 0.00M       Inst.buf. fetches : 9512
Floating adds/sec       : 9.57M       F.P. adds         : 200000514
Floating multiplies/sec : 11.96M      F.P. multiplies   : 250000430
Floating reciprocal/sec : 2.39M       F.P. reciprocals  : 50000002
Cache hits/sec          : 9.59M       Cache hits        : 200397353
CPU mem. references/sec : 28.72M      CPU references    : 600421500
Floating ops/CPU second : 23.92M
The QUIZ: How can we speed this up?
Tuning a C++ MPI Code with VAMPIR: Part II
[ Part II of III. Thanks to Jim Long of ARSC for this series of articles. ]
In part I, we described a port of the UAF Institute of Arctic Biology's Terrestrial Ecosystem Model (TEM) to the Cray T3E and a Linux cluster, and examined performance using VAMPIR. In this article, we explore an optimization to the communication algorithm and discuss performance on ARSC's IBM SP3.
As shown in part I, VAMPIR images suggested that TEM might be tuned by:
- overlapping computation on the master and slaves, and
- having the slaves begin computing as soon as they receive new data.
The relevant abstracted code section from the original implementation is:
if (mype == 0){
   currentPE = 1;
   while (currentPE<totpes){
      READ CLIMATE DATA FOR CURRENT SLAVE (if available)
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Send CLIMATE DATA TO CURRENT SLAVE (many MPI_Send calls)
      currentPE++;
   }
}
else {
   for (currentPE = 1; currentPE < totpes; currentPE++){
      MPI_Barrier(MPI_COMM_WORLD);
      if (mype == currentPE) MPI_Recv DATA FROM MASTER (many MPI_Recv calls)
   }
   COMPUTE WITH MY DATA
}
The MPI_Barrier call serves to synchronize the two loops so as not to overload MPI buffering. This could be a problem because the code above is inside a loop that can read files for hundreds of years into the future.
The barriers also mimic the situation that would exist in a synchronous coupling with a climate model, i.e., when there is no new climate data for the master to read until the slaves have computed and sent their data to the climate model. In a synchronous coupling, the master must wait until a new climate is computed.
In a sensitivity analysis for an uncoupled TEM, however, the climate might well be prescribed (as it is now), and the master can read the next year's data and have it ready for the slaves when they need it. This addresses issue 1, above.
The fact that no slave can begin computation until all slaves receive their data was recognized in the original implementation, but was left unchanged since it mimics the worst case scenario that would exist in a global run with many slaves trying to read/write their data at the same time. Worst case simulation is not necessary, however, when a sensitivity analysis is desired for only Arctic latitudes. This addresses issue 2.
Thus, it was safe to tune the code by simply removing the barrier calls. This eliminates the "for" loop in the "else" clause. The first in the series of MPI_Sends was replaced with an MPI_Ssend. MPI_Ssend is a synchronous send that guarantees that the send will not return until the destination begins to receive the message. This effectively implements a barrier between the master and one slave only, when that slave begins to receive, instead of having to stop at an explicit barrier when each slave is receiving. A slave may now begin computation as soon as it receives its data. The tuned code looks like:
if (mype == 0){
   currentPE = 1;
   while (currentPE<totpes){
      READ CLIMATE DATA FOR CURRENT SLAVE (if available)
      MPI_Ssend FIRST PIECE OF CLIMATE DATA TO CURRENT SLAVE (replaces the first MPI_Send)
      MPI_Send REMAINING CLIMATE DATA TO CURRENT SLAVE (many MPI_Send calls)
      currentPE++;
   }
}
else {
   MPI_Recv DATA FROM MASTER (many MPI_Recv calls)
   COMPUTE WITH MY DATA
}
Results:
The general lesson here is to avoid global barriers if at all possible.
[Figure 1]
Figure 1 gives two VAMPIR images, comparing old vs new communication patterns for the T3E during equal timeslices of the TEM transient portion. The T3E showed a roughly 10% reduction in time for the transient portion of the run, which shows up as a reduction in the amount of time spent in (red) MPI calls for the slaves in the VAMPIR output. (In all of these VAMPIR images, green, which shows time spent doing computation, is good, while red, which shows necessary, but unproductive, time in the communication library, is bad.)
Figure 2 shows the communication pattern on the ARSC Linux cluster, using Ethernet, where an impressive 40% reduction in time during the transient portion is realized.
Of all platforms tested, MPI latency and bandwidth are worst on the cluster using Ethernet, so it's no surprise that the benefit from tuning the communication algorithm is most dramatic here.
[Figure 3]
Figure 3 shows additional results from the ARSC Linux cluster, but this time, using the Myrinet network. This comparison shows about a 15% reduction in time during the transient portion.
Figure 4 is the promised look at results on ARSC's IBM SP3 (Icehawk) for an equal timeslice of the transient portion.
The original code ran in a blazing 9:55 (9 minutes, 55 seconds) total, while the tuned code ran in 8:31. The two transient portions ran in 3:50 and 2:25 respectively, a roughly 35% improvement in the tuned version for transient performance.
Since the compute time per time step is so low for the SP3, the MPI portion was a large percentage of the action, and hence a reduction in MPI time results in a large percentage improvement. The IBM SP3 is essentially a cluster technology, 4 CPUs per shared memory node, with nodes interconnected by a high speed switch. Each CPU has 8MB L2 cache, and so we have the combined benefits for this code of large cache and high performance CPUs.
In the next (and final) installment in this series, we address the question raised in part I. The problem is naturally parallel, so why doesn't it scale better? Is the tuned code more scalable?
CUG Call for papers
The CUG SUMMIT 2002 on high-performance computation and visualization will be held from May 20 through 24, 2002, in Manchester, United Kingdom. Our host will be the University of Manchester.
For further details about the CUG SUMMIT 2002 and electronic abstract submission, please visit the CUG home page at URL:
http://www.cug.org/
The deadline for electronic abstract submissions is January 25, 2002.
ANSWER: Vectorization
Here's one answer...
First, ask the compiler why the loop didn't vectorize. Add the "-Onegmsgs" option (for "negative messages") to learn why desirable optimizations, like vectorization and tasking, were not applied:
f90 -O3,negmsgs -rm -o trap.serial trap.serial.f
Excerpts from the listing file, trap.serial.lst:
32. 1--< do i = 0 , n-1
33. 1
34. 1 ! Right side of current trapezoid. Left side of next.
35. 1 side(MOD(i,2)) = f(a+(i+1)*h)
36. 1
37. 1 integral = integral + h*(side(0) + side(1)) / 2.0
38. 1--> enddo
f90-6287 f90: VECTOR File = trap.serial.f, Line = 32
A loop starting at line 32 was not vectorized because it contains a call to
function "F" on line 35.
Ahhhh... We knew that! Function and subroutine calls inhibit vectorization. Recompile with inlining to eliminate the function call. As described in Quick-Tip #207, if the function were defined in a separate source file, we'd use "-Oinlinefrom=<FNM>". In this case, use -Oinline4:
f90 -O3,inline4 -rm -o trap.serial trap.serial.f
The result:
31. I----<> side(1) = f(a+0*h)
32. Vp----< do i = 0 , n-1
33. Vp
34. Vp ! Right side of current trapezoid. Left side of next.
35. Vp I-<> side(MOD(i,2)) = f(a+(i+1)*h)
36. Vp
37. Vp integral = integral + h*(side(0) + side(1)) / 2.0
38. Vp----> enddo
"Vp" indicates the loop was "partially vectorized," which is encouraging. How much did this help?
CHILKOOT$ hpm ./trap.serial
With n = 50000000 trapezoids, our estimate
of the integral from 0.E+0 to 1. = 3.141592425474428
Group 0:
CPU seconds             : 8.02257     CP executing      : 4011283880
Million inst/sec (MIPS) : 215.56      Instructions      : 1729334273
Avg. clock periods/inst : 2.32
% CP holding issue      : 44.27       CP holding issue  : 1775631714
Inst.buffer fetches/sec : 0.00M       Inst.buf. fetches : 9477
Floating adds/sec       : 31.16M      F.P. adds         : 250000513
Floating multiplies/sec : 37.39M      F.P. multiplies   : 300000430
Floating reciprocal/sec : 6.23M       F.P. reciprocals  : 50000002
Cache hits/sec          : 19.53M      Cache hits        : 156646453
CPU mem. references/sec : 31.99M      CPU references    : 256671541
Floating ops/CPU second : 74.79M
We got a roughly 3-fold improvement in MFLOPS (and a 2.6-fold reduction in CPU seconds), but 75 MFLOPS is still disappointing. Recompile again with "negative messages" to get guidance from the compiler:
f90 -O3,negmsgs,inline4 -rm -o trap.serial trap.serial.f
And the listing file shows:
f90-1204 f90: INLINE File = trap.serial.f, Line = 31
The call to F was inlined.
f90-6209 f90: VECTOR File = trap.serial.f, Line = 32
A loop starting at line 32 was partially vectorized.
f90-6511 f90: TASKING File = trap.serial.f, Line = 32
A loop starting at line 32 was not tasked because a recurrence was
found on "SIDE" between lines 35 and 37.
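OF COURSE! There's a dependency in this loop. Each iteration stores one element of "side" and then reads both elements to update "integral", so every iteration uses a value that the previous iteration stored. This recurrence is probably inhibiting full vectorization as well as parallelization.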
It was clever to reuse the value of "side" for two adjacent trapezoids, but let's go back to the simplest coding of trapezoidal integration, and see what happens. Replacing the loop with this:
integral = 0
do i = 0 , n-1
integral = integral + h*( f(a+i*h) + f(a+(i+1)*h) )/2.0
enddo
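should remove the recurrence on "side". (The running sum into "integral" remains, but that is a reduction, which the compiler can vectorize.) Recompile with: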
f90 -O3,negmsgs,inline4 -rm -o trap.serial trap.serial.f
and we see this in the loopmark listing file:
25. integral = 0
26. V------< do i = 0 , n-1
27. V I I-<> integral = integral + h*( f(a+i*h) + f(a+(i+1)*h) )/2.0
28. V------> enddo
f90-6204 f90: VECTOR File = trap.serial.f, Line = 26
A loop starting at line 26 was vectorized.
f90-1204 f90: INLINE File = trap.serial.f, Line = 27
The call to F was inlined.
The fully vectorized version runs at 1600 MFLOPS and completes in about 0.66 CPU seconds. This is a roughly 12-fold speedup in CPU seconds over the partially vectorized version (8.02 seconds down to 0.66):
CHILKOOT$ hpm ./trap.serial
With n = 50000000 trapezoids, our estimate
of the integral from 0.E+0 to 1. = 3.141592650025629
Group 0:
CPU seconds             : 0.65617     CP executing      : 328084890
Million inst/sec (MIPS) : 48.57       Instructions      : 31873220
Avg. clock periods/inst : 10.29
% CP holding issue      : 88.36       CP holding issue  : 289879773
Inst.buffer fetches/sec : 0.01M       Inst.buf. fetches : 9461
Floating adds/sec       : 609.60M     F.P. adds         : 400000584
Floating multiplies/sec : 838.20M     F.P. multiplies   : 550000426
Floating reciprocal/sec : 152.40M     F.P. reciprocals  : 100000001
Cache hits/sec          : 0.61M       Cache hits        : 397575
CPU mem. references/sec : 0.64M       CPU references    : 421673
Floating ops/CPU second : 1600.20M
Can you add anything to this discussion? Feel free to comment.
Next Newsletter
For you Santas out there, your cards from North Pole are on the way.
We're taking Dec 28th off, and will produce the next newsletter on Jan 4. Also, we're updating our technical reading list and plan to print it in the next issue. If you'd like to recommend a book, let us know.
A safe and happy holiday to everyone!
Quick-Tip Q & A
A:[[ As I migrate my code between Crays, IBMs, and SGIs, I assume
  [[ I can just stick with the default optimization levels. Is this a
  [[ good assumption?

Nope. Okay on Crays and IBMs, but on SGIs, default optimization is NO optimization. Try -O2 on the SGIs for starters.

Also, see the Quiz answer, above. If you're going into production, the compiler is your friend. It can really pay to analyze your code.

Q: What are your "New Years 'Computing' Resolutions" ??? For example, "I resolve to learn python, change all my passwords, and ???"

(Anonymity will be preserved when we list these in the Jan 4th issue.)
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven    ARSC HPC Specialist            ph: 907-450-8669
Kate Hedstrom  ARSC Oceanographic Specialist  ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020

E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
