ARSC HPC Users' Newsletter 235, December 14, 2001

QUIZ: Vectorization

Here's a simple program to implement trapezoidal integration. Can I get the loop to vectorize on the SV1ex, and improve its performance?


CHILKOOT$ cat trap.serial.f                                       
!******************************************************
! serial.f -- calculate definite integral using trapezoidal rule.
!
! The function f(x) is hardwired.
! Input: From file 'trap.input':  a, b, n
! Output: estimate of integral from a to b of f(x)
!    using n trapezoids.
!******************************************************

      PROGRAM serial
      implicit none
      real :: integral      ! accumulates sum of trapezoids
      real :: a             ! lower value of interval
      real :: b             ! upper value of interval
      integer :: n          ! number of trapezoids
      real :: h             ! width of trapezoid
      real :: side(0:1)     ! sides of trapezoid
      integer:: i           ! which trapezoid computing

      real :: f             ! real valued function integrating

      open (unit=44, file='trap.input', status='old') 
      read (44, '(2f8.5,i12)') a,  b,  n
      close (44)

      h = (b-a)/n

      integral = 0

      ! Left side of first trapezoid
      side(1) = f(a+0*h)       
      do i = 0 , n-1

          ! Right side of current trapezoid. Left side of next.
          side(MOD(i,2)) = f(a+(i+1)*h)  

          integral = integral + h*(side(0) + side(1)) / 2.0
      enddo

      print *,'With n =', n,' trapezoids, our estimate'
      print *,'of the integral from ', a, ' to ',b, ' = ' , integral
      end

!******************************************************
      real function f(x)
      real x

      f = 4.0 / (x**2  + 1)

      return
      end
!******************************************************
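
For reference, the loop is accumulating the composite trapezoidal sum. With the hardwired f(x) and the limits a = 0, b = 1 used in the runs below, the exact value of the integral is pi, which is why the printed estimates hover around 3.14159:

\[
  \int_a^b f(x)\,dx \;\approx\; \sum_{i=0}^{n-1} \frac{h}{2}
    \bigl( f(a + ih) + f(a + (i+1)h) \bigr),
  \qquad h = \frac{b-a}{n}
\]
\[
  \int_0^1 \frac{4}{x^2+1}\,dx \;=\; 4\arctan(1) \;=\; \pi
\]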

When compiled for loopmark listing, as follows:


  f90 -O3 -rm -o trap.serial trap.serial.f

we can see in the listing file that the loop is marked with "1's" and not "V's", which tells us it's not vectorized. From the listing file:


   32.  1--<       do i = 0 , n-1
   33.  1    
   34.  1              ! Right side of current trapezoid. Left side of next.
   35.  1              side(MOD(i,2)) = f(a+(i+1)*h)  
   36.  1    
   37.  1              integral = integral + h*(side(0) + side(1)) / 2.0
   38.  1-->       enddo

The run takes a long time, and it gets only 24 MFLOPS. Here's the output, including hpm statistics:

  CHILKOOT$ hpm ./trap.serial
   With n = 50000000  trapezoids, our estimate
   of the integral from  0.E+0  to  1.  =  3.141592457548199
  
  Group 0:CPU seconds   : 20.90579      CP executing     :  10452894570
  
  Million inst/sec (MIPS) :   193.76      Instructions     :   4050623322
  Avg. clock periods/inst :     2.58
  % CP holding issue      :    45.93      CP holding issue :   4801389871
  Inst.buffer fetches/sec :     0.00M     Inst.buf. fetches:         9512
  Floating adds/sec       :     9.57M     F.P. adds        :    200000514
  Floating multiplies/sec :    11.96M     F.P. multiplies  :    250000430
  Floating reciprocal/sec :     2.39M     F.P. reciprocals :     50000002
  Cache hits/sec          :     9.59M     Cache hits       :    200397353
  CPU mem. references/sec :    28.72M     CPU references   :    600421500
  
  Floating ops/CPU second :    23.92M

The QUIZ: How can we speed this up?

Tuning a C++ MPI Code with VAMPIR: Part II

[ Part II of III. Thanks to Jim Long of ARSC for this series of articles. ]

In part I, we described a port of the UAF Institute of Arctic Biology's Terrestrial Ecosystem Model (TEM) to the Cray T3E and a linux cluster, and examined performance using VAMPIR. In this article, we explore an optimization to the communication algorithm and discuss performance on ARSC's IBM SP3.

As shown in part I, VAMPIR images suggested that TEM might be tuned by:

  1. overlapping computation on the master and slaves, and
  2. having the slaves begin computing as soon as they receive new data.

The relevant abstracted code section from the original implementation is:


if (mype == 0){
   currentPE = 1;
   while (currentPE<totpes){
     READ CLIMATE DATA FOR CURRENT SLAVE (if available)
     MPI_Barrier(MPI_COMM_WORLD);
     MPI_Send CLIMATE DATA TO CURRENT SLAVE (many MPI_Send calls)
     currentPE++;
   }
}
else {
   for (currentPE = 1; currentPE < totpes; currentPE++){
     MPI_Barrier(MPI_COMM_WORLD);
     if (mype == currentPE) MPI_Recv DATA FROM MASTER (many MPI_Recv calls)
   }
   COMPUTE WITH MY DATA
}

The MPI_Barrier call synchronizes the two loops so that the master does not overrun MPI's internal buffering. Without it, that could be a real risk, because the code above sits inside an outer loop that may read climate files spanning hundreds of simulated years.

The barriers also mimic the situation that would exist in a synchronous coupling with a climate model, i.e., when there is no new climate data for the master to read until the slaves have computed and sent their data to the climate model. In a synchronous coupling, the master must wait until a new climate is computed.

In a sensitivity analysis for an uncoupled TEM, however, the climate might well be prescribed (as it is now), and the master can read the next year's data and have it ready for the slaves when they need it. This addresses issue 1, above.

The fact that no slave can begin computation until all slaves receive their data was recognized in the original implementation, but was left unchanged since it mimics the worst case scenario that would exist in a global run with many slaves trying to read/write their data at the same time. Worst case simulation is not necessary, however, when a sensitivity analysis is desired for only Arctic latitudes. This addresses issue 2.

Thus, it was safe to tune the code by simply removing the barrier calls, which also eliminates the "for" loop in the "else" clause. The first MPI_Send in the series was replaced with an MPI_Ssend, a synchronous send that does not return until the destination has begun to receive the message. This acts as a pairwise barrier between the master and the one slave currently being served, rather than forcing every process to stop at an explicit global barrier for each slave in turn. A slave may now begin computation as soon as it receives its data. The tuned code looks like:


if (mype == 0){
   currentPE = 1;
   while (currentPE<totpes){
     READ CLIMATE DATA FOR CURRENT SLAVE (if available)
     MPI_Ssend FIRST BLOCK OF CLIMATE DATA TO CURRENT SLAVE (synchronous)
     MPI_Send REMAINING CLIMATE DATA TO CURRENT SLAVE (many MPI_Send calls)
     currentPE++;
   }
}
else {
   MPI_Recv DATA FROM MASTER (many MPI_Recv calls)
   COMPUTE WITH MY DATA
}
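
To make the tuned pattern concrete, here is a minimal, self-contained C sketch of the handshake described above. This is not the actual TEM source: the message length NVALS, the tags, and the array names are invented for illustration, and the file reading and computation are left as comments.

#include <mpi.h>

#define NVALS 1000   /* illustrative message length, not TEM's real size */

int main(int argc, char **argv)
{
    int mype, totpes, currentPE;
    double climate[NVALS], extra[NVALS];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mype);
    MPI_Comm_size(MPI_COMM_WORLD, &totpes);

    if (mype == 0) {
        for (currentPE = 1; currentPE < totpes; currentPE++) {
            /* ... read climate data for currentPE into climate[], extra[] ... */

            /* First message is a synchronous send: it does not return until
               this slave has begun its matching receive, so it acts as a
               pairwise barrier with that one slave only. */
            MPI_Ssend(climate, NVALS, MPI_DOUBLE, currentPE, 100, MPI_COMM_WORLD);

            /* Remaining messages are ordinary sends. */
            MPI_Send(extra, NVALS, MPI_DOUBLE, currentPE, 101, MPI_COMM_WORLD);
        }
    } else {
        /* Each slave receives its own data and starts computing immediately;
           no global barrier is involved. */
        MPI_Recv(climate, NVALS, MPI_DOUBLE, 0, 100, MPI_COMM_WORLD, &status);
        MPI_Recv(extra, NVALS, MPI_DOUBLE, 0, 101, MPI_COMM_WORLD, &status);

        /* ... compute with my data ... */
    }

    MPI_Finalize();
    return 0;
}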

Results:

The general lesson here is to avoid global barriers if at all possible.

Figure 1

Figure 1 gives two VAMPIR images, comparing old vs. new communication patterns on the T3E during equal timeslices of the TEM transient portion. The T3E showed a roughly 10% reduction in time for the transient portion of the run, which shows up as a reduction in the time the slaves spend in (red) MPI calls in the VAMPIR output. (In all of these VAMPIR images, green, which shows time spent computing, is good, while red, which shows necessary but unproductive time in the communication library, is bad.)

Figure 2

Figure 2 shows the communication pattern on the ARSC linux cluster, using ethernet, where an impressive 40% reduction in time during the transient portion is realized.

Of all the platforms tested, MPI latency and bandwidth are worst on the cluster using ethernet, so it's no surprise that the benefit from tuning the communication algorithm is most dramatic here.

Figure 3

Figure 3 shows additional results from the ARSC linux cluster, but this time, using the myrinet network. This comparison shows about a 15% reduction in time during the transient portion.

Figure 4

Figure 4 is the promised look at results on ARSC's IBM SP3 (Icehawk) for an equal timeslice of the transient portion.

The original code ran in a blazing 9:55 (9 minutes, 55 seconds) total, while the tuned code ran in 8:31. The two transient portions ran in 3:50 and 2:25 respectively, roughly a 37% improvement in transient performance for the tuned version.

Since the compute time per time step is so low on the SP3, the MPI portion was a large fraction of the action, and hence a reduction in MPI time yields a large percentage improvement. The IBM SP3 is essentially cluster technology: 4 CPUs per shared-memory node, with nodes interconnected by a high-speed switch. Each CPU has an 8 MB L2 cache, so this code gets the combined benefits of large caches and high-performance CPUs.

In the next (and final) installment in this series, we address the question raised in part I. The problem is naturally parallel, so why doesn't it scale better? Is the tuned code more scalable?

CUG Call for papers

CUG SUMMIT 2002, Manchester, United Kingdom, 20th to 24th May 2002: Call for Papers

The CUG SUMMIT 2002 on high-performance computation and visualization will be held from May 20 through 24, 2002, in Manchester, United Kingdom. Our host will be the University of Manchester.

For further details about the CUG SUMMIT 2002 and electronic abstract submission, please visit the CUG home page at URL:

http://www.cug.org/

The deadline for electronic abstract submissions is January 25, 2002.

ANSWER: Vectorization

Here's one answer...

First, ask the compiler why the loop didn't vectorize. Add the "-Onegmsgs" option (for "negative messages") to learn why desirable optimizations, like vectorization and tasking, were not applied:


  f90 -O3,negmsgs -rm -o trap.serial trap.serial.f

Excerpts from the listing file, trap.serial.lst:


   32.  1--<       do i = 0 , n-1
   33.  1    
   34.  1              ! Right side of current trapezoid. Left side of next.
   35.  1              side(MOD(i,2)) = f(a+(i+1)*h)  
   36.  1    
   37.  1              integral = integral + h*(side(0) + side(1)) / 2.0
   38.  1-->       enddo

  f90-6287 f90: VECTOR File = trap.serial.f, Line = 32 
    A loop starting at line 32 was not vectorized because it contains a call to
    function "F" on line 35.

Ahhhh... We knew that! Function and subroutine calls inhibit vectorization. Recompile with inlining to eliminate the function call. As described in Quick-Tip #207, if the function were defined in a separate source file, we'd use "-Oinlinefrom=<FNM>". In this case, use "-Oinline4":


  f90 -O3,inline4 -rm -o trap.serial trap.serial.f

The result:

   31.  I----<>       side(1) = f(a+0*h)       
   32.  Vp----<       do i = 0 , n-1
   33.  Vp      
   34.  Vp                ! Right side of current trapezoid. Left side of next.
   35.  Vp I-<>           side(MOD(i,2)) = f(a+(i+1)*h)  
   36.  Vp      
   37.  Vp                integral = integral + h*(side(0) + side(1)) / 2.0
   38.  Vp---->       enddo
"Vp" indicates the loop was "partially vectorized," which is encouraging. How much did this help?

  CHILKOOT$ hpm ./trap.serial
   With n = 50000000  trapezoids, our estimate
   of the integral from  0.E+0  to  1.  =  3.141592425474428

  Group 0:  CPU seconds   :    8.02257      CP executing     :     4011283880
  
  Million inst/sec (MIPS) :     215.56      Instructions     :     1729334273
  Avg. clock periods/inst :       2.32
  % CP holding issue      :      44.27      CP holding issue :     1775631714
  Inst.buffer fetches/sec :       0.00M     Inst.buf. fetches:           9477
  Floating adds/sec       :      31.16M     F.P. adds        :      250000513
  Floating multiplies/sec :      37.39M     F.P. multiplies  :      300000430
  Floating reciprocal/sec :       6.23M     F.P. reciprocals :       50000002
  Cache hits/sec          :      19.53M     Cache hits       :      156646453
  CPU mem. references/sec :      31.99M     CPU references   :      256671541
  
  Floating ops/CPU second :      74.79M

We got roughly a 3-fold improvement in MFLOPS, but 75 MFLOPS is still disappointing. Recompile again with "negative messages" to get guidance from the compiler:


  f90 -O3,negmsgs,inline4 -rm -o trap.serial trap.serial.f

And the listing file shows:


  f90-1204 f90: INLINE File = trap.serial.f, Line = 31 
    The call to F was inlined.
  
  f90-6209 f90: VECTOR File = trap.serial.f, Line = 32 
    A loop starting at line 32 was partially vectorized.
  
  f90-6511 f90: TASKING File = trap.serial.f, Line = 32 
    A loop starting at line 32 was not tasked because a recurrence was
    found on "SIDE" between lines 35 and 37.

OF COURSE! There's a dependency in this loop: the "side" value written in one iteration is read again in the next, and within each iteration "side" must be computed before "integral" can be updated. This recurrence is probably inhibiting vectorization as well as parallelization.

It was clever to reuse the value of "side" for two adjacent trapezoids, but let's go back to the simplest coding of trapezoidal integration, and see what happens. Replacing the loop with this:


      integral = 0
      do i = 0 , n-1
          integral = integral + h*( f(a+i*h) + f(a+(i+1)*h) )/2.0
      enddo

should remove the recurrence; the only remaining loop-carried operation is the sum into "integral", which the compiler can vectorize as a reduction. Recompile with:


  f90 -O3,negmsgs,inline4 -rm -o trap.serial trap.serial.f

and we see this in the loopmark listing file:


   25.                 integral = 0
   26.  V------<       do i = 0 , n-1
   27.  V I I-<>           integral = integral + h*( f(a+i*h) + f(a+(i+1)*h) )/2.0
   28.  V------>       enddo

  f90-6204 f90: VECTOR File = trap.serial.f, Line = 26 
    A loop starting at line 26 was vectorized.
  
  f90-1204 f90: INLINE File = trap.serial.f, Line = 27 
    The call to F was inlined.

The fully vectorized version runs at 1600 MFLOPS and completes in about 0.66 CPU seconds. That is better than a 12-fold speedup in CPU seconds over the partially vectorized version, even though f is now evaluated twice per trapezoid:


  CHILKOOT$ hpm ./trap.serial     
   With n = 50000000  trapezoids, our estimate
   of the integral from  0.E+0  to  1.  =  3.141592650025629

  Group 0:  CPU seconds   :    0.65617      CP executing     :      328084890
  
  Million inst/sec (MIPS) :      48.57      Instructions     :       31873220
  Avg. clock periods/inst :      10.29
  % CP holding issue      :      88.36      CP holding issue :      289879773
  Inst.buffer fetches/sec :       0.01M     Inst.buf. fetches:           9461
  Floating adds/sec       :     609.60M     F.P. adds        :      400000584
  Floating multiplies/sec :     838.20M     F.P. multiplies  :      550000426
  Floating reciprocal/sec :     152.40M     F.P. reciprocals :      100000001
  Cache hits/sec          :       0.61M     Cache hits       :         397575
  CPU mem. references/sec :       0.64M     CPU references   :         421673
  
  Floating ops/CPU second :    1600.20M

Can you add anything to this discussion? Feel free to comment.

Next Newsletter

For you Santas out there, your cards from North Pole are on the way.

We're taking Dec 28th off, and will produce the next newsletter on Jan 4. Also, we're updating our technical reading list and plan to print it in the next issue. If you'd like to recommend a book, let us know.

A safe and happy holiday to everyone!

Quick-Tip Q & A


A:[[ As I migrate my code between Crays, IBMs, and SGIs, I assume
  [[ I can just stick with the default optimization levels.  Is this a
  [[ good assumption?


  Nope.  Okay on Crays and IBMs, but on SGIs, default optimization is NO
  optimization.  Try -O2 on the SGIs for starters.  Also, see the Quiz
  answer, above.  

  If you're going into production, the compiler is your friend.  It can
  really pay to analyze your code.
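
  For example, on an SGI you might request optimization explicitly (a
  generic illustration only; "myprog.f" is a placeholder, and the best
  level depends on your code and compiler version):

    f90 -O2 -o myprog myprog.f

  Then compare timings, and answers, against a higher level such as -O3
  before settling on flags for production runs.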



Q: What are your "New Years 'Computing' Resolutions" ???
   
   For example, "I resolve to learn python, change all my 
   passwords, and ???"
 

   (Anonymity will be preserved when we list these in the Jan 4th
   issue.)

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.