ARSC T3E Users' Newsletter 141, April 17, 1998
Storage Upgrade for Yukon Users
As announced in newsletter #139, ARSC is upgrading its storage environment. The upgrade will occur in phases, with the first taking place in a couple of weeks, concurrent with the cut-over from denali to chilkoot. Watch news items for the exact date.
Following the cut-over:
- CRL will be accessed from chilkoot rather than denali.
- Two new file systems will be cross-mounted between yukon, chilkoot, and the ARSC network of SGI visualization hosts. They are:
  - /viztmp: 14 GB;
    - under DMF control;
    - 10 GB quota (per user) for all files (migrated and on-line);
    - no backups;
    - files over 10 days old purged;
    - exported to yukon and ARSC SGIs.
  - /allsys: 37 GB;
    - under DMF control;
    - 10 GB quota (per user) for on-line (unmigrated) files;
    - no quota for migrated files;
    - nightly backups;
    - exported to yukon and ARSC SGIs.
For more information visit our transition pages, at:
http://www.arsc.edu/pubs/bulletins/Transition.html
MPI Collective Communications
MPI provides a set of functions that perform collective communication among the processors in a group. These include:
- Barrier synchronization.
- Global communication functions such as broadcast and gather/scatter of data across processor sets.
- Global reduction operations, which gather data from all processors and combine it with an operation such as sum, maximum, or minimum.
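As a rough mental model only (plain Python, not MPI calls), these data movements can be pictured as operations on a list that holds one entry per processor. The function names below are made up for illustration and are not MPI routines; a real code would call the MPI library.

```python
from functools import reduce
import operator

def broadcast(per_pe, root):
    """Every PE ends up holding the root's value."""
    return [per_pe[root] for _ in per_pe]

def scatter(root_data, npes):
    """The root's array is cut into equal chunks, one chunk per PE."""
    chunk = len(root_data) // npes
    return [root_data[pe * chunk:(pe + 1) * chunk] for pe in range(npes)]

def gather(per_pe):
    """The root collects the chunks of all PEs in rank order."""
    return [item for chunk in per_pe for item in chunk]

def reduce_to_root(per_pe, op):
    """The root receives the values of all PEs combined with op."""
    return reduce(op, per_pe)

per_pe = [10, 20, 30, 40]                     # one value on each of 4 PEs
print(broadcast(per_pe, root=0))              # all PEs hold PE 0's value
chunks = scatter(list(range(8)), npes=4)
print(chunks)                                 # two items per PE
print(gather(chunks))                         # original array back on root
print(reduce_to_root(per_pe, operator.add))   # global sum
print(reduce_to_root(per_pe, max))            # global maximum
```

The model ignores where and when the data travels, which is exactly the part a tuned library routine can optimize.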
A common question is why all these routines are needed, since everything can be done with send and recv. Perhaps the simplest answer is that these frequently used data movement operations can be tailored to a particular system's communication topology, memory, or other hardware features and, like many other libraries, are tested to ensure correct behavior. This gives the user both correct code and higher performance.
For example, consider a simple broadcast operation. In the code below, the broadcast is first hand-coded using simple serial sends and a recv, and is then compared with a call to MPI_BCAST, which results in the same data transfer.
      program prog1
      implicit none
      include 'mpif.h'

! information for local and global conditions.
      integer mype
      integer totpes
      integer mproc

! general mpi error flag
      integer ierr
      integer istatus(MPI_STATUS_SIZE)

! dataspaces
      integer, allocatable, dimension(:) :: ia,ib
      integer isize
      integer i,is,ip
      integer ndata,itag,mrproc

! timing: start and finish clock values returned by irtc()
      integer isrtc,ifrtc,isbtc,ifbtc

! setup mpi
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, totpes, ierr)

      write(6,*) ' mype is ',mype,' of total ',totpes
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)

! mproc is the PE that holds the data to be broadcast
      mproc=0
      isize=1

      do i=1,20

! setup data for testing
        allocate (ia(isize))
        allocate (ib(isize))
        do is=1,isize
          ia(is)=i
          ib(is)=i
        enddo

! broadcast using send/recv
        call MPI_BARRIER(MPI_COMM_WORLD,ierr)
        isrtc=irtc()
        if(mype.eq.mproc) then
! master: send to every other PE, one after another
          ndata=isize
          itag=i
          do ip=0,totpes-1
            if(ip.ne.mproc) then
              call MPI_SEND(ia,ndata,MPI_INTEGER,ip,itag,
     $             MPI_COMM_WORLD,ierr)
            endif
          enddo
        else
! all other PEs: receive from the master
          ndata=isize
          itag=i
          mrproc=mproc
          call MPI_RECV(ia,ndata,MPI_INTEGER,mrproc,itag,
     $         MPI_COMM_WORLD,istatus,ierr)
        endif
        call MPI_BARRIER(MPI_COMM_WORLD,ierr)
        ifrtc=irtc()

! broadcast using broadcast
        call MPI_BARRIER(MPI_COMM_WORLD,ierr)
        isbtc=irtc()
        call MPI_BCAST(ib,isize,MPI_INTEGER,mproc,
     $       MPI_COMM_WORLD,ierr)
        call MPI_BARRIER(MPI_COMM_WORLD,ierr)
        ifbtc=irtc()

! report timing
        if(mype.eq.mproc) then
          write(6,*) ' send/recv ',isize,' took ',
     $         ifrtc-isrtc,ifbtc-isbtc,ia(isize),ib(isize)
          write(6,*)
     $         totpes,isize,ifrtc-isrtc,ifbtc-isbtc
        endif
        write(10+mype,*) ia(isize),ib(isize)

! increase size
        isize=isize*2
        deallocate(ia)
        deallocate(ib)

      enddo

      call MPI_FINALIZE(ierr)
      end
Note that, for both versions, the processor which hosts the data is identified as mproc and the data is distributed to arrays of the same name on all the processors. The routine MPI_BCAST is called on all processors. If one of the processors in the communicator MPI_COMM_WORLD does not call MPI_BCAST the code will deadlock.
A one-line call to MPI_BCAST does what took over a dozen lines of user code with MPI_SEND and MPI_RECV.
The hand-coded broadcast also has the master send the data to each slave in turn. The master's linear progression through the other PEs is clearly visible in the VAMPIR graph.
Figure: VAMPIR graph
(For details on VAMPIR see newsletter 129: /arsc/support/news/t3enews/t3enews129/index.xml .)
A simple optimization here is to use a store-and-forward mechanism, whereby each slave that has received the data may send it on to other slaves, reducing the reliance on one processor. However, the user would have to reconsider the algorithm with each change of architecture, or even for different data sizes. (A simple store-and-forward algorithm takes nearly 100 lines of code.)
By using MPI_BCAST, we hope that the vendor or supplier of the MPI library will do this work for us! In fact, Cray's MPI_BCAST uses an unbalanced tree algorithm to broadcast the message when the group is MPI_COMM_WORLD.
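The store-and-forward idea can be sketched without any MPI at all. The short Python model below is only an illustration: it assumes the root is PE 0 and uses a simple doubling scheme in which every PE that already holds the data forwards it to one new PE per round. This is not Cray's algorithm, whose details are not given here, and the function name is made up for the example.

```python
def store_and_forward_schedule(npes):
    """Return a list of rounds; each round is a list of (sender, receiver).

    Assumption for illustration: PE 0 is the root, and in each round
    every PE that already holds the data sends it to one PE that does not.
    """
    rounds = []
    have = 1  # PEs 0 .. have-1 currently hold the data
    while have < npes:
        sends = [(src, src + have) for src in range(have) if src + have < npes]
        rounds.append(sends)
        have *= 2  # the number of PEs holding the data doubles each round
    return rounds

for npes in (4, 16, 32):
    rounds = store_and_forward_schedule(npes)
    print(npes, "PEs:", len(rounds), "rounds, versus", npes - 1, "serial sends")

print(store_and_forward_schedule(4))
```

In this model the master needs only as many rounds as it takes to double up to the PE count, rather than one send per slave, which is why the approach reduces the reliance on one processor.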
Below are some data transfer times comparing a simple hand-coded send/recv and MPI_BCAST. The times are the differences between irtc() values printed by the program above, where,
N:     Number of integers broadcast.
s/r:   Indicates send/recv
bcast: Indicates MPI_BCAST

                 4 PEs                 16 PEs                32 PEs
      N        s/r      bcast        s/r      bcast        s/r      bcast
-------  ---------  ---------  ---------  ---------  ---------  ---------
      1      19157      14170     100203      24761     217539      31113
     32     211910      91972     843725     166380    1721043     210290
   1024     614992     282801    2446954     539433    4793081     677011
 524288   78409833   55263055  375497958  110121727  772674099  137610295
A graph of the full data set, which shows an interesting stair-step pattern (discussed below), is shown:
Figure: MPI_BCAST vs send/recv
For both the serial send/recv and MPI_BCAST, the time taken increases with both the number of processors and the number of data items broadcast. For the simple serial send, the increase is linear in the number of processors, because the master sends to each of the other processors in turn. For MPI_BCAST, the stepped structure comes from the use of a tree. The data is sent first to a few processors, which then send it on to other processors, and the process repeats until all processors have received the data. The time taken is determined by the number of levels in the tree, and the steps in the performance curve occur when the addition of one processor causes the next level to be occupied. As more processors are involved, the width of the tree at the base level increases.
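The origin of the steps can be illustrated with a toy model. The Python sketch below assumes a complete tree with a fixed fan-out (the `fanout` parameter is hypothetical; the shape of Cray's unbalanced tree is not described here) and counts how many levels below the root are needed for a given number of PEs.

```python
def tree_levels(npes, fanout=2):
    """Levels below the root needed to hold npes PEs in a complete tree.

    Toy model: level 0 is the root alone, and each further level is
    `fanout` times wider than the one above it.
    """
    levels, capacity, width = 0, 1, 1
    while capacity < npes:
        width *= fanout    # the base of the tree gets wider
        capacity += width  # PEs that fit in the tree so far
        levels += 1
    return levels

previous = None
for npes in range(1, 33):
    levels = tree_levels(npes)
    marker = "  <- new level" if levels != previous else ""
    print(f"{npes:3d} PEs: {levels} levels{marker}")
    previous = levels
```

In this model the level count stays flat over a range of PE counts and then jumps by one when a single extra PE spills into a new level, which is the kind of stair-step described above.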
In conclusion, MPI_BCAST is faster than a simple sequential send/recv, and its advantage grows with the number of processors. In the table above, MPI_BCAST is about 1.4 to 2.3 times faster on 4 PEs and about 5.6 to 8.2 times faster on 32 PEs.
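For readers who want to check the comparison themselves, the short Python sketch below recomputes the ratio of send/recv time to MPI_BCAST time from the four table rows given earlier. The dictionary layout and the function name are just one convenient arrangement chosen for this example.

```python
# Times copied from the table: N -> {PEs: (send/recv, bcast)}
times = {
    1:      {4: (19157, 14170),
             16: (100203, 24761),
             32: (217539, 31113)},
    32:     {4: (211910, 91972),
             16: (843725, 166380),
             32: (1721043, 210290)},
    1024:   {4: (614992, 282801),
             16: (2446954, 539433),
             32: (4793081, 677011)},
    524288: {4: (78409833, 55263055),
             16: (375497958, 110121727),
             32: (772674099, 137610295)},
}

def speedup(n, npes):
    """How many times faster MPI_BCAST was than the hand-coded send/recv."""
    send_recv, bcast = times[n][npes]
    return send_recv / bcast

for n in times:
    ratios = ["%.1f" % speedup(n, npes) for npes in (4, 16, 32)]
    print("N =", n, " ratio on 4/16/32 PEs:", ratios)
```

The ratios grow from left to right in every row, which is the trend the conclusion describes.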
!! Skipping Next Issue of Newsletter !!
Next issue: May 15, 1998. We're very busy bringing chilkoot on-line!
Quick-Tip Q & A
A: {{ When I first learned to program, I used "bubble-sort". Now I use
"quick-sort". Is there any faster-yet sorting algorithm? }}
It depends on your data. Sorting is a big subject, but assuming
your data is stored in arrays, here are some ideas.
Given general data and no advance knowledge of the degree to which
it is pre-sorted, stick with quicksort. On average it sorts in order
N*log(N) time (where N is the number of elements to sort), although
its worst case is order N^2.
If you're *sure* that the data is only out of place by a few
elements, insertion sort will beat quicksort. This might happen if
you added a few elements, re-sorted, added a few more, etc. But be
careful: in the general case, it's an O(N^2) sort.
If your data is of an ordinal type (integer or character, for
instance) and has a sufficiently limited range (integers between 0
and 1,000,000, for instance), try the bucket sort. It works in O(N)
time, plus the time to step through the range of possible values.
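As a small illustration of the last idea, here is a minimal Python sketch of a bucket sort that uses one bucket per possible value (often called a counting sort). It assumes non-negative integers no larger than `max_value`; both the function name and that parameter are chosen for this example.

```python
def bucket_sort(values, max_value):
    """Sort integers in the range 0..max_value, one bucket per value."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1               # drop each value into its bucket
    result = []
    for v, count in enumerate(counts):
        result.extend([v] * count)   # read the buckets back in order
    return result

data = [5, 3, 9, 3, 0, 7, 5]
print(bucket_sort(data, max_value=9))
```

Note that the bucket array grows with the range of the values, so the method only pays off when that range is modest compared with the amount of data.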
          Q: Can "vi" insert a "|" character into the same column over a specific
          range of rows?
          Even if some of the ro|ws are empty?
          Or short?
          Say you wanted to crea|te a crude table (see the last article) or
          delimit a specific col|umn number (column #33, as shown here, for
          instance).
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
