ARSC T3E Users' Newsletter 158, December 18, 1998
ARSC Upgrades to VAMPIR 2.0 and Revamps Tutorial
VAMPIR is a graphical tool for analyzing the performance and message passing characteristics of parallel programs that use the MPI communication library. There are three steps to using VAMPIR: 1) compile your T3E MPI code for tracing, 2) run the executable, 3) analyze the resulting .bpv file on an SGI workstation.
VAMPIRtrace has been upgraded to version 1.5.1 on yukon and VAMPIR has been upgraded to version 2.0 on the ARSC SGIs. VAMPIR 2.0 offers several new features and a significantly improved user interface.
The location of the libraries and license files has been changed to conform with ARSC's standard method of installing third-party packages.
We have extracted the two-part tutorial given in issues #146 and #147, brought it up to date for the new version of VAMPIR, and put it on-line at:
http://www.arsc.edu/support/howtos/usingvampir.html
This tutorial tells everything you need to know in order to use this powerful tool. Alternatively, ARSC users can read "news VAMPIR" for the nitty-gritty. Enjoy!
MPICH-T3E Installed on Yukon
The MPICH-T3E implementation of MPI is now available on yukon in the directory:
/usr/local/pkg/mpich/current/
MPICH-T3E is the port of MPICH-1.1.0 to the Cray T3E supercomputer. This port was developed by the High Performance Computing Lab at Mississippi State University. For more information, visit:
http://www.erc.msstate.edu/mpi/cray-t3x/mpich-t3e.html
To access MPICH-T3E:
Add to your include path when you compile:
-I /usr/local/pkg/mpich/current/include
And add to your library path:
-L /usr/local/pkg/mpich/current/lib/cray_t3e/t3e/
And specify the mpi libraries:
Fortran: -lfmpi -lmpi
C: -lmpi
For instance:
cc prog.c -I /usr/local/pkg/mpich/current/include \
-L /usr/local/pkg/mpich/current/lib/cray_t3e/t3e/ -lmpi
f90 prog.f -I /usr/local/pkg/mpich/current/include \
-L /usr/local/pkg/mpich/current/lib/cray_t3e/t3e/ -lfmpi -lmpi
Below are times from the "ring" program given first in newsletter #66. This program was modified to include larger buffers and to use MPI_REAL instead of MPI_REAL4 buffers.
The program times a buffer as it is passed around a ring of PEs using MPI_Send and MPI_Recv.
The table below reports transfer times in microseconds per MPI_REAL buffer. It shows runs compiled with both the MPICH-T3E 1.1.0 and MPT 1.2.0.2 versions of MPI. This was run on a T3E-900. Your mileage may vary, but if your program passes a lot of messages, you might try MPICH. Be sure to do some validation runs, and let us know what you find.
buffer    mpich      mpt
  size   (usec)   (usec)
     1     9.85    17.35
     2    10.45    17.5
     3    10.5     17.6
     4    10.65    21.6
     7    11.15    22.65
     8    10.7     23.05
    15    12.85    22.45
    16    12.15    22.05
    31    14.7     24.4
    32    13.7     24.45
    63    15.35    27.9
    64    15.1     27.3
   127    16.75    35.45
   128    16.6     34.6
   255    20.15    49.35
   256    20.2     47.9
   511    28.6     69.8
   512    27.8     67.3
  1023    42.3    109.45
  1024    40.9     95.2
  2047    66.7    145.15
  2048    67.35   144.55
  4095   118.5    247.75
  4096   120.8    247.4
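For readers who do not have newsletter #66 to hand, the kind of ring timing described above can be sketched roughly as follows. This is not the original "ring" program: the program name, the buffer length (nwords), and the repetition count (nreps) are assumptions chosen for illustration. Only standard MPI-1 calls are used.

```fortran
program ring_sketch
  ! Hypothetical sketch of a ring timing test; not the program from issue #66.
  implicit none
  include 'mpif.h'
  integer, parameter :: nreps  = 100    ! laps around the ring (assumed value)
  integer, parameter :: nwords = 1024   ! MPI_REAL words per buffer (assumed value)
  real :: buf(nwords)
  integer :: me, npes, left, right, irep, ierr
  integer :: status(MPI_STATUS_SIZE)
  double precision :: t0, t1

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
  if (npes < 2) then
     ! A blocking send to oneself could hang, so require a real ring.
     if (me == 0) print *, 'run on at least 2 PEs'
     call MPI_FINALIZE(ierr)
     stop
  end if

  right = mod(me + 1, npes)          ! neighbor we send to
  left  = mod(me + npes - 1, npes)   ! neighbor we receive from
  buf = real(me)

  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  t0 = MPI_WTIME()
  do irep = 1, nreps
     if (me == 0) then
        ! PE 0 starts each lap and receives the buffer when it comes back.
        call MPI_SEND(buf, nwords, MPI_REAL, right, 0, MPI_COMM_WORLD, ierr)
        call MPI_RECV(buf, nwords, MPI_REAL, left, 0, MPI_COMM_WORLD, status, ierr)
     else
        ! Every other PE waits for the buffer, then passes it along.
        call MPI_RECV(buf, nwords, MPI_REAL, left, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_SEND(buf, nwords, MPI_REAL, right, 0, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_WTIME()

  ! Each lap consists of npes transfers.
  if (me == 0) print *, 'usec per transfer:', &
       (t1 - t0) * 1.0d6 / (dble(nreps) * dble(npes))
  call MPI_FINALIZE(ierr)
end program ring_sketch
```

Building the same source once against each MPI library and running it on the same number of PEs gives a like-for-like comparison for one buffer size.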
Another Example of Post-Processing: Co-array Fortran
A strength of Co-array Fortran (CAF) is that it's such a simple extension to a well-known language. We thought we might learn something by using CAF to rewrite the post-processing program given in the last issue ( /arsc/support/news/t3enews/t3enews157/index.xml ).
As a reminder, the problem is data reduction of data stored across many files. We read the files on all PEs, extract particular fields from them, combine the data to a "master" PE, sum it, and store these results.
Here's the code that combines and sums:
MPI version:
! global averages
      call MPI_REDUCE(aglb,apglb,ndata,MPI_REAL,MPI_SUM,mroot,
     &                MPI_COMM_WORLD,ierr)
CAF version #1:
      apglb = aglb
      call sync_images ()
      if ( myimg .eq. master ) then
! Sum arrays from all images on master
         do imgn = 2, nimgs
            apglb(:) = apglb(:) + apglb(:)[imgn]
         enddo
      endif
The CAF version was easy to write, the logic is transparent, and, given a little background, would make sense to most Fortran programmers. On the other hand, someone who knows MPI could write the MPI version easily and the MPI_REDUCE call conveys a lot of meaning by encapsulating several actions. Also, the MPI version is faster.
Addressing the performance issue, in the CAF version, only the master PE does work. It gets data from the other PEs, one at a time, while they sit idle. We can distribute this work better, using a tree to gather data to the master:
1   2   3   4   5   6   7   8   9   10  11
1       3       5       7       9       11
1               5               9
1                               9
1
This reduces the number of passes through the above loop from nimgs-1 to ceiling(log2 (nimgs)).
CAF version #2:
! Sum arrays from all images on master
      apglb = aglb
! Tree algorithm for combining results. In the first stage, every odd
! numbered image gets data from image to its right. In the second stage,
! images 1,5,9, etc... get data from images 3,7,11, etc...
! Total number of stages, or levels, in binary tree is
! the ceiling of log base-2 of the number of images.
      nstages = ceiling (alog (real (nimgs)) / alog (2.0))
      do istage = 1, nstages
         call sync_images ()
         if ( mod (myimg - 1, 2**istage) .eq. 0) then
            mypartner = myimg + 2**(istage - 1)
            if (mypartner .le. nimgs) then
               apglb(:) = apglb(:) + apglb(:)[mypartner]
            endif
         endif
      enddo
CAF version #2 is almost as fast as the MPI version. It's certainly not as easy to write or read as CAF version #1.
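The stage and partner arithmetic of the tree can be checked on a single processor before it is trusted in parallel. The sketch below is a serial stand-in, not part of the original program: the module and routine names (tree_sketch, tree_stages, tree_sum) are hypothetical, and vals(i) simply plays the role of the data held by image i. The stage count is computed with integer arithmetic here, an alternative to the logarithm that gives the same result.

```fortran
module tree_sketch
  implicit none
contains
  ! Number of stages in the binary tree: smallest n with 2**n >= nimgs.
  integer function tree_stages(nimgs)
    integer, intent(in) :: nimgs
    tree_stages = 0
    do while (2**tree_stages < nimgs)
       tree_stages = tree_stages + 1
    end do
  end function tree_stages

  ! Serially mimic the tree: vals(i) stands in for the data on image i.
  ! On return, vals(1) holds the sum of all the original entries.
  subroutine tree_sum(vals)
    real, intent(inout) :: vals(:)
    integer :: nimgs, istage, img, partner
    nimgs = size(vals)
    do istage = 1, tree_stages(nimgs)
       do img = 1, nimgs
          ! Same test and partner formula as the co-array loop.
          if (mod(img - 1, 2**istage) == 0) then
             partner = img + 2**(istage - 1)
             if (partner <= nimgs) vals(img) = vals(img) + vals(partner)
          end if
       end do
    end do
  end subroutine tree_sum
end module tree_sketch

program tree_demo
  use tree_sketch
  implicit none
  real :: vals(11)
  integer :: i
  vals = (/ (real(i), i = 1, 11) /)   ! image i starts out holding the value i
  call tree_sum(vals)
  print *, 'stages for 11 images:', tree_stages(11)
  print *, 'sum held by image 1: ', vals(1)
end program tree_demo
```

For 11 images this reports 4 stages and a sum of 66 (1+2+...+11) on image 1, compared with the 10 passes that CAF version #1 makes on the master.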
The graphs in figure 1 give timing comparisons between the two CAF versions and the MPI version. The execution times include the time to read the files, combine the results, and reduce the data.
Figure 1
The first graph considers a problem size that scales as we add PEs. This is a common scenario--people want to solve bigger problems, not just the same problems faster. The inefficiency of the CAF #1 implementation is apparent in this graph as the number of PEs increases.
The second graph considers a fixed problem size. We see the same problem with CAF #1. It's also apparent that the overall rate of disk I/O stops improving once 8 or more PEs are reading at the same time. (This was the conclusion of the article in the previous issue.)
It's interesting that the three traces converge at 8 PEs. Given the file I/O constraint, 8 PEs seems appropriate, regardless of algorithm. Thus, in this situation, the simple CAF program is perfectly serviceable, and (depending on your MPI experience) might be easier to write.
Here is the code for CAF #1. (Send e-mail if you'd like to see all of CAF #2.)
!****************************************************************************
      program avg_caf
!
! Co-array Fortran version of multiple-file read and data reduce post-
! processing example.
!
! ARSC, December 1998
!****************************************************************************
      implicit none
      include 'mpif.h'
      integer myimg, nimgs, imgn, master
      integer ierr
! data for file read
      integer nfiles[*],ndata[*]
! filename
      character*80 myfilename
! file channel
      integer myread
! data for processing
      real, allocatable, dimension(:)[:] :: apglb, apmax
      real, allocatable, dimension(:) :: aread, aglb, apglb_tmp
      integer iloc(1)
! loops
      integer iread, ido
      integer idata
! timers
      double precision :: io_read[*]
      double precision :: io_start,io_end,mio_read
      double precision :: start,end
!--------------------------------------------------
! setup MPI
      call MPI_INIT( ierr )
      myimg = this_image()
      nimgs = num_images()
      print *, "image", myimg, " of ", nimgs, " is alive"
      master = 1
      if (myimg .eq. master) then
         write(6,*) ' enter number of files '
         read(5,99) nfiles
         write(6,*) ' reading ',nfiles,' files '
         write(6,*) ' enter number of data items in each file '
         read(5,99) ndata
         write(6,*) ' reading ',ndata,' data items '
 99      format(i10)
      endif
! copy values to all images
      call sync_images ()
      nfiles = nfiles[master]
      ndata = ndata[master]
      start=MPI_WTIME()
!--------------------------------------------------
! data to store global average
      allocate ( aread(ndata) )
      allocate ( aglb(ndata) )
      allocate ( apmax(nfiles)[*] )
      aglb=0.0
! set separate channel for each processor
      myread=30+myimg
! initialize read time
      io_read=0.0
! read files as round robin.
      do iread=1,nfiles
         ido=mod(iread-myimg-1,nimgs)
         if(ido .eq. 0) then
! write(6,*) ' image ',myimg,
! & ' reading from file number ',iread
            write(myfilename,"(a,i6.6)") '/tmp/baring/D/data',iread
! write(6,*) myfilename
! open file
            open(unit=myread,file=myfilename,form='unformatted')
! read data
            io_start=MPI_WTIME()
            read(myread) aread
            io_end=MPI_WTIME()
            io_read=io_read+io_end-io_start
! sum data read locally thus far
            aglb=aglb+aread
! copy maximum from this file to master array
            apmax(iread)[master] = maxval(aread)
! close this file
            close(myread)
         endif
      enddo
!--------------------------------------------------
! Done reading local files. Gather results to master
      allocate (apglb(ndata)[*])
      apglb = aglb
      call sync_images ()
      if ( myimg .eq. master ) then
! Sum arrays from all images on master
         do imgn = 2, nimgs
            apglb(:) = apglb(:) + apglb(:)[imgn]
         enddo
! Sum io_read times from all images on master
         mio_read = io_read
         do imgn = 2, nimgs
            mio_read = mio_read + io_read[imgn]
         enddo
! Compute average
         apglb=apglb/real(nfiles)
! Print results
         iloc=maxloc(apglb)
         write(6,*) ' maximum average is at ',iloc(1),apglb(iloc(1))
         iloc=maxloc(apmax)
         write(6,*) ' maximum value is in file number ',
     &      iloc(1),apmax(iloc(1))
         end=MPI_WTIME()
         write(6,*) ' took ',end-start,' seconds ',
     &      ' io total time ',mio_read
      endif
      call MPI_FINALIZE(ierr)
      stop
      end
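The round-robin test in the read loop above, mod(iread-myimg-1,nimgs), is easy to misread, so it can help to tabulate serially which image ends up reading which file. The following is only an illustrative sketch: the module and function names (round_robin_sketch, reader_of) and the choice of 4 images and 10 files are assumptions, while the test expression itself is copied from the listing.

```fortran
module round_robin_sketch
  implicit none
contains
  ! Return the image that reads file iread, using the same test as the listing.
  ! The values iread-myimg-1 for myimg = 1..nimgs are nimgs consecutive
  ! integers, so exactly one image passes the test for each file.
  integer function reader_of(iread, nimgs)
    integer, intent(in) :: iread, nimgs
    integer :: myimg
    reader_of = 0
    do myimg = 1, nimgs
       if (mod(iread - myimg - 1, nimgs) == 0) reader_of = myimg
    end do
  end function reader_of
end module round_robin_sketch

program round_robin_demo
  use round_robin_sketch
  implicit none
  integer :: iread
  ! Assumed example sizes: 10 files shared among 4 images.
  do iread = 1, 10
     print *, 'file', iread, ' is read by image', reader_of(iread, 4)
  end do
end program round_robin_demo
```

With 4 images, file 1 goes to image 4, file 2 to image 1, file 3 to image 2, and so on, so each file is read exactly once and the work is spread evenly.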
Call for CUG Technical Paper Abstracts by 8 Jan 1999
[ Received recently... ]
> You are invited to submit a technical paper for the 41st CUG Conference
> (Supercomputing Summit) in Minneapolis, Minnesota USA during 24-28 May
> 1999.
>
> CUG is your SGI Cray systems user forum for high-performance
> computation and visualization, and your CUG Program Committee is
> working diligently to maximize that value for you and your colleagues
> at supercomputing sites like yours across the world.
>
> That's why we're asking you to consider submitting a technical paper
> abstract for the next Supercomputing Summit sponsored by CUG. All the
> information you need is on-line at either of the following URLs.
>
> http://www.cug.org/
> or
> http://www.fpes.com/cug_abstracts/call.html
>
> As it was with the 40 technical programs before it, the key to our next
> conference is a broad variety of high quality technical papers from SGI
> Cray contacts and from CUG site colleagues like yourself. Exchanging
> supercomputing insights is the essence of the technical presentations
> and their publication in the conference proceedings on CD-ROM. So, we
> have a great need for your suggestions on how to make the next
> conference the best one, yet.
>
> And, we hope that you will please consider submitting an abstract for a
> technical paper!
>
> In any case, you will want to mark your calendar and plan to register
> for the conference. You can find out more about the conference on-line
> at our CUG home page.
>
> http://www.cug.org/
>
> There will be many formal and informal opportunities for you to share
> challenges and exchange information with your colleagues from other CUG
> sites and with SGI Cray technical experts, so you won't want to miss
> this meeting! And, we don't want you to miss the opportunity to submit
> an abstract for a technical paper. The deadline for submitting an
> abstract is Friday, 8 Jan 1999.
>
> Please check the WWW information on the CUG home page today!
>
>
> With regards,
>
> Sam Milosevich (sam@lilly.com)
> CUG Vice President and Program Committee Chair
Next Issue: Jan. 8, 1999
Happy Holidays, Everyone!
Quick-Tip Q & A
A: {{ Fortran again. ALOG10 is an intrinsic CF90 function. The man
page even says so:
DESCRIPTION
LOG10 is the generic function name. ALOG10 and DLOG10 are
intrinsic for the CF90 compiler.
But this program won't compile:
!----------------------------------------
program test
write (6,*) "alog10 (1000) :", alog10 (1000)
end
!----------------------------------------
Here's the error message:
yukon% f90 test.f
write (6,*) "alog10 (1000) :", alog10 (1000)
^
cf90-700 f90: ERROR TEST, File = junk.f, Line = 3, Column = 35
No specific intrinsic exists for the intrinsic call "ALOG10".
What's wrong? }}
Thanks to two readers:
######################
Here is my two-minute fix; insert a decimal point into 1000 to make
it a floating point number, which is the argument that the intrinsic
function alog10 expects.
###
I think there are two problems here, i) an imprecise error message
and ii) ugly Fortran programming. The compiler tries to tell you it has
no integer intrinsic for the log10. If you replace alog10(1000) with
log10(1000) you get the same error message. There is no ilog10, as
documented in the man page.
The second problem is people's bad habit of using specific intrinsics
rather than generic. I think generic intrinsics are one of the nice
things with Fortran. The compiler looks at the data type of the
argument and chooses by itself, making code cleaner. If you need or
want to direct it towards a specific intrinsic this is done by type
casting the argument. In this case you could for example type
write (6,*) "alog10 (1000) :", log10 (real(1000))
and everything will work as expected.
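Both readers' fixes can be put side by side in a small test program. This is a sketch written for illustration rather than either reader's code; the program name and the variable n are made up for it.

```fortran
program log10_demo
  implicit none
  integer :: n
  n = 1000
  ! Generic name with a real literal: the compiler selects the real version.
  write (6,*) 'log10 (1000.)   :', log10 (1000.)
  ! Generic name with an integer converted to real first.
  write (6,*) 'log10 (real(n)) :', log10 (real(n))
  ! The specific name is also accepted once the argument is real.
  write (6,*) 'alog10 (1000.)  :', alog10 (1000.)
end program log10_demo
```

All three lines should print a value of 3 (to within rounding). The generic form has the advantage that the same source line keeps working if the argument later becomes double precision.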
Q: (Finally, one for the C programmers!) Will C correctly cast
a pointer reference value? For instance, given this declaration:
unsigned temp = *uint;
will "temp" be set as expected even if "uint" is an int pointer?
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
