[Menu Bar] Resourses at ARSC Science at ARSC Newsroom Support About ARSC ARSC Home

 

ARSC T3E Users' Newsletter 141, April 17, 1998

Newsletter Index Quick-Tip Index Search Newsletters

Storage Upgrade for Yukon Users

As announced in newsletter #139, ARSC is upgrading its storage environment. The upgrade will occur in phases, with the first taking place in a couple of weeks, concurrent with the cut-over from denali to chilkoot. Watch news items for the exact date.

Following the cut-over:

For more information visit our transition pages, at:

http://www.arsc.edu/pubs/bulletins/Transition.html

MPI Collective Communications

MPI provides a set of functions which perform collective communications between processor groups. This includes the following functions.

Sometimes it is asked, why are all these routines needed, since everything can be done with send and recv? Perhaps the simplest answer is that these frequently used data movement operations can be tailored to particular systems communication topologies, memory, or other hardware features and like many other libraries are tested to ensure correct behavior. This provides the user with both correct code and higher performance.

For example consider a simple broadcast operation. In the code below the broadcast operation is hand-coded using a simple serial send and recv, and this is compared with a call to MPI_BCAST which results in the same data transfer.

      program prog1

      implicit none

      include 'mpif.h'

! information for local and global conditions.
      integer mype
      integer totpes
      integer mproc

! general mpi error flag
      integer ierr
      integer istatus(MPI_STATUS_SIZE)

! dataspaces
      integer, allocatable, dimension(:) :: ia,ib
      integer isize

      integer i,is,ip
      integer ndata,itag,mrproc

!timing
      integer isrtc,ifrtc,isbtc,ifbtc

! setup mpi
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, totpes, ierr)
      write(6,*) ' mype is ',mype,' of total ',totpes

      call MPI_BARRIER(MPI_COMM_WORLD,ierr)

      mproc=0

      isize=1
      do i=1,20

! setup data for testing
      allocate (ia(isize))
      allocate (ib(isize))

      do is=1,isize
        ia(is)=i
        ib(is)=i
      enddo

! broadcast using send/recv

      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      isrtc=irtc()


      if(mype.eq.mproc) then
        ndata=isize
        itag=i
        do ip=0,totpes-1
          if(ip.ne.mproc) then
            call MPI_SEND(ia,ndata,MPI_INTEGER,ip,itag,
     $                         MPI_COMM_WORLD,ierr)
          endif
        enddo

      else

        ndata=isize
        itag=i
        mrproc=mproc
        call MPI_RECV(ia,ndata,MPI_INTEGER,mrproc,itag,
     $                MPI_COMM_WORLD,istatus,ierr)

      endif

      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      ifrtc=irtc()



! broadcast using broadcast
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      isbtc=irtc()

      call MPI_BCAST(ib,isize,MPI_INTEGER,mproc,
     $      MPI_COMM_WORLD,ierr)

      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      ifbtc=irtc()



! report timing

      if(mype.eq.mproc) then
        write(6,*) ' send/recv ',isize,' took ',
     $       ifrtc-isrtc,ifbtc-isbtc,ia(isize),ib(isize)
        write(6,*)
     $       totpes,isize,ifrtc-isrtc,ifbtc-isbtc
      endif

      write(10+mype,*) ia(isize),ib(isize)

!increase size
      isize=isize*2
      deallocate(ia)
      deallocate(ib)
      enddo


      call MPI_FINALIZE(ierr)

      end

Note that, for both versions, the processor which hosts the data is identified as mproc and the data is distributed to arrays of the same name on all the processors. The routine MPI_BCAST is called on all processors. If one of the processors in the communicator MPI_COMM_WORLD does not call MPI_BCAST the code will deadlock.

By using MPI_BCAST in one line, the code has done what took tens of lines in user code with MPI_SEND/RECV.

Also the user has coded the broadcast operation so that the master sends data to each slave. The master's linear progression through the other PEs is clearly visualized in the VAMPIR graph shown.

VAMPIR Trace

VAMPIR Graph
(click on image for larger view)

(For details on VAMPIR see newsletter 129: http://www.arsc.edu/support/news/T3Enews/T3Enews129.shtml.)

A simple optimization here is to use a store and forward mechanism whereby each slave may send data to other slaves to reduce the reliance on one processor. However the user would have to reconsider the algorithm with each change of architecture or even for different data sizes. (A simple store forward algorithm takes nearly 100 lines of code.)

By using MPI_BCAST it is hoped that the vendor or supplier of the MPI library will do this work for us! In fact Cray MPI_BCAST uses an unbalanced tree algorithm to broadcast the message when the group is MPI_COMM_WORLD.

Below are some data transfer times comparing a simple hand-code send/recv and MPI_BCAST, where,

  N:      Number of integers broadcast.
  s/r:    Indicates send/recv
  bcast:  Indicates MPI_BCAST


       |       4 PEs         |     16 PEs            |      32 PEs
       |                     |                       |                    
N      | s/r      | bcast    | s/r       | bcast     | s/r       | bcast
-------|----------|----------|-----------|-----------|-----------|-----------
1      | 19157    | 14170    | 100203    | 24761     | 217539    | 31113
32     | 211910   | 91972    | 843725    | 166380    | 1721043   | 210290
1024   | 614992   | 282801   | 2446954   | 539433    | 4793081   | 677011
524288 | 78409833 | 55263055 | 375497958 | 110121727 | 772674099 | 137610295

A graph of the full data set, which shows an interesting stair-step pattern (discussed below), is shown:

MPI_BCAST vs send-recv

MPI_CAST vs Send/Recv
(click on image for larger view)

For both the serial send/recv and MPI_BCAST, the time taken increases both with the number of processors and the number of data items being broadcast. For the simple serial send this is linear with processor number, for the MPI_BCAST the stepped structure comes from the use of a tree structure. The data is sent to a few processors first which then send data to other processors which repeat this process until all processors have received the data. The time taken is determined by the number of levels and the steps in the performance curve occur when the addition of one processor causes the next level to be occupied. As more processors are involved the width of the tree at the base level increases.

In conclusion MPI_BCAST is faster than a simple sequential send/recv, the performance advantage being greater the more processors are involved.

!! Skipping Next Issue of Newsletter !!

Next issue: May 15, 1998. We're very busy bringing chilkoot on-line!

Quick-Tip Q & A

A: {{ When I first learned to program, I used "bubble-sort".  Now I use
      "quick-sort".  Is there any faster-yet sorting algorithm? }}

   It depends on your data.  Sorting is a big subject, but assuming
   your data is stored in arrays, here are some ideas.

   Given general data and no advance knowledge of the degree to which
   it is pre-sorted, stick with quicksort.  It sorts in order N*log(N)
   time (where N is the number of elements to sort).

   If you're *sure* that the data is only out of place by a few
   elements, insertion sort will beat quicksort.  This might happen if
   you added a few elements, re-sorted, added a few more, etc.  But be
   careful: in the general case, it's an O(N^2) sort.

   If your data is of an ordinal type (integer or character, for
   instance) and has a sufficiently limited range (integers between 0
   and 1,000,000, for instance), try the bucket sort.  It works in O(N)
   time.

                         |
                         |                                             
Q: Can "vi" insert a "|" |character into the same column over a specific
   range of rows?        |
                         |
   Even if some of the ro|ws are empty?  
                         |
   Or short?             |
                         |
   Say you wanted to crea|te a crude table (see the last article) or
   delimit a specific col|umn number (column #33, as shown here, for
   instance).            |
                         |
                         |

[ Answers, questions, and tips graciously accepted. ]

 


Current Editors:
Thomas J. Baring ARSC Web Specialist ph: 907-450-8619
Donald Bahls ARSC User Consultant ph: 907-450-8674
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Contact:
Send comments and questions to the current editors using this Contact Form.
Email Subscriptions: Archives:

 

Newsletter Index Quick-Tip Index Search Newsletters

 

Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775 | voice: 907-450-8600 | email:

home | search | about | support | news | science | resources