ARSC HPC Users' Newsletter 217, April 5, 2001

SV1e Upgrade Postponed

Chilkoot's processor upgrade to an SV1e has been temporarily postponed.

Chilkoot users should watch the MOTD ("message of the day" -- that valuable information that appears every time you log on) for an announcement of the updated schedule. The upgrade will likely occur during a normal Wednesday night downtime. We regret this delay, but expect to proceed with the upgrade soon.

Portable MPI?

[Thanks to Jeff McAllister of ARSC for this contribution.]

Background

Depending on the system, MPI might not provide the best performance or the easiest way to parallelize codes. However, MPI is emerging as just about the only common element between HPC architectures (i.e. it will work on shared AND distributed memory platforms). As ARSC now has several systems on which to test MPI codes, it seems only natural to seek the common elements between the various implementations.

I set myself the goal of writing a platform-independent example that could pass arbitrarily large array sections in a way that could be expanded to many variables. Basically, I was trying to produce a reusable strategy for a pretty common case. In the past I'd written many "toy" MPI programs and played with lots of examples. I'd even gone through the effort of re-engineering larger scientific codes for MPI parallelism. (One works on the T3E, but it's not truly portable yet. The other still doesn't work correctly in its parallel version.) So I thought: aha! Figuring out which MPI routines will actually work portably and scalably would be valuable knowledge...

I'm not going to say that I'm an MPI authority -- the purpose of this article is merely to communicate what I've observed. Someone with more MPI programming experience, or even a different programming style, may be able to come up with another (and perhaps better) set of MPI commands and strategies that work everywhere.

Testing the Standard-ness of MPI

Just because a program uses MPI doesn't mean it's portable. Just as there are differences between compilers, there are differences between MPI implementations. My theory is that some ways of programming in MPI are "more portable" than others. (High-level languages have the same problem; for instance, there are lots of ways to use Fortran common blocks in completely nonportable ways.) So it makes sense to learn the dangerous parts of MPI and avoid using them.
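
To illustrate the kind of nonportable common block usage alluded to above, here is a hypothetical (and deliberately bad) sketch, not taken from any real code: the same named common block is declared with different types in two routines, so the value read back depends on storage association and on whether INTEGER and REAL happen to be the same size on a given system.

      subroutine setval
      integer ibuf(4)
      common /shared/ ibuf      ! stored as default INTEGERs
      ibuf(1)=42
      return
      end

      subroutine getval
      real rbuf(4)              ! same block, declared as REALs:
      common /shared/ rbuf      ! what comes back here depends on
      print *,rbuf(1)           ! word sizes, not on the standard
      return
      end

This compiles everywhere, but the answer it prints is platform-dependent -- exactly the kind of trap the "dangerous parts" of MPI can set as well.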

The requirements of my portability project:

  1. transfer any part of an array between two processors,
  2. have the transfer work regardless of size, without changing code, and
  3. handle a domain decomposed into arbitrarily-sized subsections.

Furthermore, this approach should be general enough to work in a wide variety of settings. For example, one application I'm hoping to put this to is transferring the boundaries for a finite-difference solver, though in other cases I'd like to be able to transfer an entire array partition. The goal is to find the most generic way to transfer data in MPI so that I can practice code reuse instead of reinventing the wheel for each new application.
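
As a concrete (and hypothetical) illustration of that boundary-transfer case, here is a minimal sketch -- not part of the example developed below -- that passes one boundary column between neighboring PEs in a 1-D decomposition, using only plain send/recv and contiguous buffers. The program name, column length, and tag are invented for illustration.

      program haloswap
      implicit none

      include 'mpif.h'

      integer, parameter :: mj=8        ! rows in a boundary column
      real lcol(mj),rbuf(mj)            ! local column, recv buffer
      integer status(MPI_STATUS_SIZE)
      integer mype,totpes,ierr,left,right,tag,j

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, totpes, ierr)

c fill the boundary column with easily recognizable values
      do j=1,mj
         lcol(j)=real(mype*100+j)
      end do

c neighbors in a simple non-periodic 1-D decomposition
      left=mype-1
      right=mype+1
      tag=7

c pass boundary columns to the right: one plain send and one
c plain recv per PE; the buffers are already contiguous
      if (right<totpes) then
         call MPI_SEND(lcol,mj,MPI_REAL,right,tag,
     *        MPI_COMM_WORLD,ierr)
      end if
      if (left>=0) then
         call MPI_RECV(rbuf,mj,MPI_REAL,left,tag,
     *        MPI_COMM_WORLD,status,ierr)
         print *,"PE ",mype," got boundary from PE ",left,
     *        " first value ",rbuf(1)
      end if

      call MPI_FINALIZE(ierr)

      end

A real solver would exchange in both directions, but even this stripped-down version exercises the send/recv pairing that has to work everywhere.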

Delving into this project further, some suspicions I'd had from earlier work were confirmed:

  • The most basic subset of MPI subroutines is the best bet for portability: init, comm_rank, comm_size, send, recv, isend, irecv, wait, barrier, and finalize. (wtick and wtime may also belong on this list, though testing how widely the granularity varies between systems suggests another project...)

  • probe and waitall, though they would be very nice to have, are unfortunately not on the above list, along with testall, testany, waitany, iprobe, and the more esoteric variations of send/recv. The way these behave seems to vary widely between systems.

  • Directly sending/receiving array subsections (e.g. mpi_send(A(1:mj,kstart(mype):kend(mype)... ) doesn't work reliably. On some platforms/compilers it works; on others it is unstable. The alternative -- which does work everywhere -- is copying the data through a contiguous buffer: pack the subsection into a buffer before sending, and copy from the receive buffer into the desired array subsection afterward. An additional step of allocating buffers to the exact count specified in the MPI call also seems to be required for full portability. (A condensed sketch of this pattern follows this list.)
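
To make the buffer-copy idea concrete before the full program, here is a condensed sketch of the pack-and-send step in isolation. The subroutine name and argument list are invented for this illustration (and it assumes MPI has already been initialized by the caller); the full example below does the same thing with isend/irecv plus a separate header message.

      subroutine packsend(A,mj,mk,kstart,kend,dest,tag)
      implicit none

      include 'mpif.h'

      integer mj,mk,kstart,kend,dest,tag
      real A(mj,mk)

      real,dimension(:),allocatable::buf
      integer j,k,sz,ierr,error

c allocate the buffer to exactly the count passed to MPI
      allocate(buf(mj*(kend-kstart+1)),stat=error)
      if (error /= 0) print *,"error allocating buf"

c copy the array subsection into the contiguous buffer
      sz=0
      do k=kstart,kend
         do j=1,mj
            sz=sz+1
            buf(sz)=A(j,k)
         end do
      end do

c send the buffer rather than A(1:mj,kstart:kend) itself
      call MPI_SEND(buf,sz,MPI_REAL,dest,tag,
     *     MPI_COMM_WORLD,ierr)
      if (ierr /= MPI_SUCCESS) print *,"buffer send unsuccessful"

      deallocate(buf)

      return
      end subroutine

The receiving side mirrors this: allocate a buffer of the same count, recv into it, and copy into the target array section, exactly as the master does in the full program.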

Summary

After testing many approaches to fulfill my requirements on ARSC's Cray T3E & SV1, IBM SP, Linux cluster, and SGI Origin 3800, I arrived at the following code. Here's a quick summary of how it works:

  • A "partition table" is created to drive domain deconstruction based on an arbitrary array size and an arbitrary number of processors. It's crude: # of procs can't be greater than # of columns and load balancing isn't considered.

  • Each processor iterates over its own section of the array in the "computational kernel".

  • Depending on where you end the outer loop, either (A) all computations happen on each processor's subsection with a single transfer at the end, or (B) each processor's entire array subsection is transferred on every iteration. The results should be the same, but the second option is a stress test for the message-passing strategy.

  • Message passing is accomplished with a two-message scheme. The first message is short, containing information about what to do with the next (longer) message. This relies on the standard's guarantee that messages from a particular source arrive in order, which seems to be preserved on the systems listed, even under stress tests with tens of thousands of loops.

  • Data is received into a temporary buffer created at exactly the right size for the message, then copied into the target array, after which the buffer is deallocated.


      program gridMPI
      implicit none

      include 'mpif.h'

      integer mj,mk

      real,dimension(:,:),allocatable:: A
      real,dimension(:),allocatable::sendbuffer,recvbuffer

      real x,y,z,minv,rad,inc,iso

      integer status(MPI_STATUS_SIZE) ! some MPI implementations don't deal
                                      ! well with MPI_STATUS_IGNORE

      integer I,K,J,L,numloops,error
      integer mype,totpes,ierr,master,tag,source,sz,tmp

      integer req1,req2,req3,req4
      integer msg(4)

      integer, DIMENSION(:),ALLOCATABLE ::Kstart,Kend
      character, dimension(:),allocatable::outline

 
      ! setup mpi
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, totpes, ierr)

      numloops=100

      master=0

      mj=60
      mk=mj

      ALLOCATE(outline(mj))
      ALLOCATE(Kstart(0:totpes-1))
      ALLOCATE(Kend(0:totpes-1))

c create a partition table mapping out the subsections
c to be handled by each processor
      CALL PartitionTable(1,mk,totpes,0,Kstart,Kend)

      print *,"PE ",mype,":",Kstart(mype),Kend(mype)

      ALLOCATE(A(mj,mk))

      A=0


c a sphere equation solver.  Put here to use up some time and
c generate an easily verifiable set of values to transfer.
c Output should reflect the heights of a hemisphere which
c changes size with each timestep.  Output can be checked
c visually or vs scalar output (see below to disable MPI
c execution).  As the sphere is slightly off-center, testing output
c will show transposition errors or missing communications from other
c processors.

      rad=1.0
      inc=.3

      call mpi_barrier(MPI_COMM_WORLD,ierr)
      do L=1,numloops
        
         rad=rad+inc
         if ((rad>real(mj*2)).or.(rad<1.0)) inc=inc*(-1.0)

         do k=kstart(mype),kend(mype)
            do j=1,mj

               x=real(k)
               y=real(j)

               z=(rad**2-(x-mk/2)**2-(y-mk/2)**2)
               if (z>0) then
                  z=sqrt(z)
               else
                  z=0
               end if              


               a(j,k)=z/1000

            end do
         end do
      end do

c     all send their section of the array to master

c set this to .false. to turn off message passing     
      if (.true.) then



         ! yes, some of this information is redundant
         ! and could be obtained from status
         msg(1)=mype
         msg(2)=kstart(mype)
         msg(3)=kend(mype)
         msg(4)=mj


         tag=1
         call MPI_ISEND(msg,size(msg),MPI_INTEGER,
     *        master,tag,MPI_COMM_WORLD,req1,ierr)


         if (ierr /= MPI_SUCCESS) then
            print *, "initial msg send unsuccessful on PE ",mype
         end if

         sz=0
         allocate(sendbuffer(mj*(kend(mype)-kstart(mype)+1)),
     *        stat=error)

         if (error /=0) then
            print *,"error allocating sendbuffer in loop ",L
         end if

         do k=kstart(mype),kend(mype)
            do j=1,mj
               sz=sz+1
               sendbuffer(sz)=a(j,k)
            end do
         end do

         tag=999
         call MPI_ISEND(sendbuffer,sz,MPI_real,
     *        master,tag,MPI_COMM_WORLD,req2,ierr)


         if (ierr /= MPI_SUCCESS) then
            print *, "main buffer send unsuccessful on PE ",mype
         end if

         if (mype==master) then
            do i=1,totpes

               tag=1
               call MPI_IRECV(msg,size(msg),MPI_INTEGER,
     *              MPI_ANY_SOURCE,tag,MPI_COMM_WORLD,req3,ierr)


               call MPI_WAIT(req3,status,ierr)

               if (ierr /= MPI_SUCCESS) then
                  print *, "init msg recv unsuccessful from PE ",
     *                 status(MPI_SOURCE)
               end if


               tag=999
               source=msg(1)
               sz=mj*(msg(3)-msg(2)+1)
               allocate(recvbuffer(sz),stat=error)
               if (error /=0) then
                  print *,"error allocating recvbuffer in loop ",L
               end if


               call MPI_IRECV(recvbuffer,sz,MPI_real,
     *              source,tag,MPI_COMM_WORLD,req4,ierr)

               call MPI_WAIT(req4,status,ierr)


               if (ierr /= MPI_SUCCESS) then
                  print *, "main msg recv unsuccessful from PE ",
     *                 status(MPI_SOURCE)
               end if


               sz=0
               do k=msg(2),msg(3)
                  do j=1,msg(4)
                     sz=sz+1
                     a(j,k)=recvbuffer(sz)
                  end do
               end do

               deallocate(recvbuffer,stat=error)
               if (error /=0) then
                  print *,"error deallocating recvbuffer in loop ",L
               end if
            end do
         end if

         call mpi_wait(req1,status,ierr)
         call mpi_wait(req2,status,ierr)
         deallocate(sendbuffer,stat=error)
         if (error /=0) then
            print *,"error deallocating sendbuffer in loop ",L
         end if

         call mpi_barrier(MPI_COMM_WORLD,ierr)
      end if

!  if you'd like to see a program with a really pathological
!  comp time/comm time ratio, uncomment the next line and
!  comment the last line of the "computational kernel"
!  the results should be the same but execution time will
!  increase significantly with each new processor
c      end do



      if (mype==master) then
         minv=minval(A)
         iso=(maxval(A)-minv)/9
       

         do j=1,mj
            do k=1,mk
               if (A(j,k)>0) then
                  outline(k:k)=char(int((A(j,k)-minv)/iso)+48)
               else
                  outline(k:k)=" "
               end if
            end do
            print *,outline
         end do
      end if

      call MPI_FINALIZE(ierr)

      end


       SUBROUTINE PartitionTable(start,MAX,nPEs,overlap,
     *                Pstart,Pend)

       IMPLICIT NONE

C -------- SCALAR VARIABLES -------
       integer MAX,start   ! partitioned range=start:Max
       integer nPEs,overlap ! divided over nPEs with specified overlap
C -------- DIMENSIONED VARIABLES-----
       integer Pstart(0:nPEs-1) ! the partition start and end vars
       integer Pend(0:nPEs-1)
C -------- local variables ----------
       integer I
       integer R,B
       integer POS,Size,Tsize


       ! get the base partition, B (i.e. how this space would be
       ! partitioned if it were divided evenly between the PEs) and
       ! the remainder, R.  R PEs will each do 1 extra loop.

       Tsize=(Max-start)+1
       if (Tsize<nPEs) then
          print *,"Cannot partition:  Max-start is less than nPEs"
          STOP
       end if

       B=Tsize/nPEs-1
       R=mod(Tsize,nPEs)

       POS=start-1
       DO I=0,nPEs-1
          POS=POS+1                ! advance to next partition
          Size=B                   ! set partition size
          if (I<R) Size=Size+1     ! 1 extra loop for remainder PEs
          Pstart(I)=POS            ! Set Pstart to partition start
          POS=POS+Size
          Pend(I)=POS              ! Set Pend to partition end

          Pstart(I)=Pstart(I)-overlap  ! adjust for overlaps
          if (Pstart(I)<start) Pstart(I)=start

          Pend(I)=Pend(I)+overlap
          if (Pend(I)>MAX) Pend(I)=MAX
       END DO

       RETURN
       END SUBROUTINE

Conclusions

MPI can be a "portable" standard, though (as with other standards, including programming languages) just because a code is written in MPI doesn't make it portable. Portability can be increased by:

  • Testing code on a wide variety of systems during development.

  • Using the simplest subset of MPI possible.

As you can see from the above code, though, the result of trying to be as general and as portable as possible is not compact, elegant, or optimal in terms of performance. We're back to the old problem of performance being at odds with portability (and good programming practice in general). However, this example (unlike many of the other example codes out there) is actually scalable to "real problem" size on many systems. This is just a beginning. It has often been said that MPI is too hard to use; I hope this experiment, in some small way, makes it easier.

[ Editor's note: comments and other experiences of MPI portability are invited. ]

ARSC Training, Next Two Weeks

For details on ARSC's spring 2001 offering of short courses, visit:

http://www.arsc.edu/user/Classes.html

Final course of the semester:

Visualization with Vis5D, Part I

Wednesday, April 11, 2001, 2-4pm

Visualization with Vis5D, Part II

Wednesday, April 18, 2001, 2-4pm

Please let us know what topics you'd like us to cover next fall. Also, ARSC staff are generally available for one-on-one consultation. ARSC users may always contact consult@arsc.edu for help with programming, optimization, parallelization, visualization, or anything else, and we'll try to connect you with an appropriate staff member.

Quick-Tip Q & A



A:[[ Is there a way to peek at my NQS job's stdout and stderr files 
  [[ (.o and .e files), while the job is still running?  
  [[
  [[  I'm in debug mode, and wasting a lot of CPU time because these jobs
  [[  must run to completion before I can see ANY output.  Sometimes, I
  [[  could qdel them based on debugging output emitted early in the run.


  From "man qsub", 

     -ro     Writes the standard output file of the request to its
             destination while the request executes.

     -re     <ditto, for stderr>
  
  Thus, the answer is to add this to your qsub script:

    #QSUB -ro
    #QSUB -re

  Experiments on the T3E and SV1 reveal a minor problem, however. If I
  don't combine stderr and stdout (using "-eo") and do specify "-re",
  the stderr file gets stored to my home directory, rather than the
  normal destination, the directory from which the qsub script is
  submitted.

  Everything works fine for the stdout file, whether or not "-eo" is
  given.  Of course, you should only do read operations, like "cat," on
  these output files while they're still being written by the NQS job.




Q: Help!  I've got myself stuck on 4 (FOUR!) list servers, and can't
   unsubscribe! It was 3, but last month I joined a support group for
   people addicted to listservers!  Ahhhh!  That made 4.

   Problem seems to be I'm subscribed as (making something up here),

     "myname@my_old_host.frazzle.com"

   but they've retired "my_old_host" and moved me to a new workstation
   and now, when I send an "unsubscribe" message, I get a reply that,

     "myname@my_new_host.frazzle.com"    is not subscribed  

   How can I get my name off these lists?

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.