ARSC T3E Users' Newsletter 158, December 18, 1998

ARSC Upgrades to VAMPIR 2.0 and Revamps Tutorial

VAMPIR is a graphical tool for analyzing the performance and message passing characteristics of parallel programs that use the MPI communication library. There are three steps to using VAMPIR: 1) compile your T3E MPI code for tracing, 2) run the executable, 3) analyze the resulting .bpv file on an SGI workstation.

VAMPIRtrace has been upgraded to version 1.5.1 on yukon and VAMPIR has been upgraded to version 2.0 on the ARSC SGIs. VAMPIR 2.0 offers several new features and a significantly improved user interface.

The location of the libraries and license files has been changed to conform with ARSC's standard method of installing third-party packages.

We have extracted the 2-part tutorial given in issues #146 and #147, brought it up to date for the new version of VAMPIR, and put it on-line.

This tutorial tells you everything you need to know to use this powerful tool. Alternatively, ARSC users can read "news VAMPIR" for the nitty-gritty. Enjoy!

MPICH-T3E Installed on Yukon

The MPICH-T3E implementation of MPI is now available on yukon.

MPICH-T3E is the port of MPICH-1.1.0 to the Cray T3E supercomputer. This port was developed by the High Performance Computing Lab at Mississippi State University.

To access MPICH-T3E:

  Add to your include path when you compile:
    -I /usr/local/pkg/mpich/current/include

  And add to your library path:
    -L /usr/local/pkg/mpich/current/lib/cray_t3e/t3e/

  And specify the mpi libraries:
    Fortran: -lfmpi -lmpi
    C:       -lmpi

  For instance:
    cc prog.c   -I /usr/local/pkg/mpich/current/include               \
        -L /usr/local/pkg/mpich/current/lib/cray_t3e/t3e/ -lmpi

    f90 prog.f   -I /usr/local/pkg/mpich/current/include              \
        -L /usr/local/pkg/mpich/current/lib/cray_t3e/t3e/ -lfmpi -lmpi

Below are times from the "ring" program first given in newsletter #66. This program was modified to include larger buffers and to use MPI_REAL instead of MPI_REAL4 buffers.

The program times a buffer as it is passed around a ring of PEs using MPI_Send and MPI_Recv.

The table below reports transfer times in microseconds per MPI_REAL buffer for the same program built with MPICH-T3E 1.1.0 and with the MPT version of MPI. All runs were made on a T3E-900. Your mileage may vary, but if your program passes a lot of messages, you might try MPICH. Be sure to do some validation runs, and let us know what you find.

 buffer   mpich        mpt
 size     (usec)      (usec)

  1        9.85        17.35
  2        10.45       17.5
  3        10.5        17.6
  4        10.65       21.6
  7        11.15       22.65
  8        10.7        23.05
  15       12.85       22.45
  16       12.15       22.05
  31       14.7        24.4
  32       13.7        24.45
  63       15.35       27.9
  64       15.1        27.3
  127      16.75       35.45
  128      16.6        34.6
  255      20.15       49.35
  256      20.2        47.9
  511      28.6        69.8
  512      27.8        67.3
  1023     42.3        109.45
  1024     40.9        95.2
  2047     66.7        145.15
  2048     67.35       144.55
  4095     118.5       247.75
  4096     120.8       247.4

Another Example of Post-Processing: Co-array Fortran

A strength of Co-array Fortran (CAF) is that it's such a simple extension to a well-known language. We thought we might learn something by using CAF to rewrite the post-processing program given in the last issue (#157).

As a reminder, the problem is the reduction of data stored across many files. We read the files on all PEs, extract particular fields from them, combine the data on a "master" PE, sum it, and store the results.

Here's the code that combines and sums:

MPI version:

! global averages

      call MPI_REDUCE(aglb,apglb,ndata,MPI_REAL,MPI_SUM,mroot,
     &     MPI_COMM_WORLD,ierr)

CAF version #1:

      apglb = aglb

      call sync_images ()

      if ( myimg .eq. master ) then

! Sum arrays from all images on master

        do imgn = 2, nimgs
          apglb(:) = apglb(:) + apglb(:)[imgn]
        enddo

      endif

The CAF version was easy to write, the logic is transparent, and, given a little background, would make sense to most Fortran programmers. On the other hand, someone who knows MPI could write the MPI version easily and the MPI_REDUCE call conveys a lot of meaning by encapsulating several actions. Also, the MPI version is faster.

Addressing the performance issue, in the CAF version, only the master PE does work. It gets data from the other PEs, one at a time, while they sit idle. We can distribute this work better, using a tree to gather data to the master:

  1 2 3 4 5 6 7 8 9 10 11
  1   3   5   7   9    11
  1       5       9
  1               9
  1

This reduces the number of passes through the above loop from nimgs-1 to ceiling(log2 (nimgs)).

CAF version #2:

! Sum arrays from all images on master
      apglb = aglb

! Tree algorithm for combining results. In the first stage, every odd
! numbered image gets data from image to its right. In the second stage,
! images 1,5,9, etc... get data from images 3,7,11, etc...

      ! Total number of stages, or levels, in binary tree is
      ! the ceiling of log base-2 of the number of images.
      nstages = ceiling (alog (real (nimgs)) / alog (2.0))

      do istage = 1, nstages

        call sync_images ()

        if ( mod (myimg - 1, 2**istage) .eq. 0) then
          mypartner = myimg + 2**(istage - 1)

          if (mypartner .le. nimgs) then
            apglb(:) = apglb(:) + apglb(:)[mypartner]
          endif
        endif

      enddo

CAF version #2 is almost as fast as the MPI version. It's certainly not as easy to write or read as CAF version #1.

The graphs in figure 1 give timing comparisons between the two CAF versions and the MPI version. The execution times include the time to read the files, combine the results, and reduce the data.

Figure 1

The first graph considers a problem size that scales as we add PEs. This is a common scenario--people want to solve bigger problems, not just the same problems faster. The inefficiency of the CAF #1 implementation becomes apparent in this graph as the number of PEs increases.

The second graph considers a fixed problem size. We see the same problem with CAF #1. It's also apparent that the overall rate of disk I/O stops improving once 8 or more PEs are reading at the same time. (This was the conclusion of the article in the previous issue.)

It's interesting that the three traces converge at 8 PEs. Given the file I/O constraint, 8 PEs seems appropriate, regardless of algorithm. Thus, in this situation, the simple CAF program is perfectly serviceable, and (depending on your MPI experience) might be easier to write.

Here is the code for CAF #1. (Send e-mail if you'd like to see all of CAF #2.)

      program avg_caf
! Co-array Fortran version of multiple-file read and data reduce post-
! processing example.
! ARSC, December 1998

      implicit none

      include 'mpif.h'

      integer myimg, nimgs, imgn, master
      integer ierr

! data for file read
      integer nfiles[*],ndata[*]

! filename
      character*80 myfilename

! file channel
      integer myread

! data for processing
      real, allocatable, dimension(:)[:] :: apglb, apmax
      real, allocatable, dimension(:) :: aread, aglb, apglb_tmp

      integer iloc(1)

! loops
      integer iread, ido
      integer idata

! timers 
      double precision :: io_read[*]
      double precision :: io_start,io_end,mio_read
      double precision :: start,end


! setup MPI
      call MPI_INIT( ierr )

      myimg = this_image()
      nimgs = num_images()
      print *, "image", myimg, " of ", nimgs, " is alive"

      master = 1

      if (myimg .eq. master) then

        write(6,*) ' enter number of files '
        read(5,99) nfiles
        write(6,*) ' reading ',nfiles,' files '
        write(6,*) ' enter number of data items in each file '
        read(5,99) ndata
        write(6,*) ' reading ',ndata,' data items '
 99     format(i10)

      endif

! copy values to all images
      call sync_images ()
      nfiles = nfiles[master]
      ndata = ndata[master]



! data to store global average
      allocate ( aread(ndata) )
      allocate ( aglb(ndata) )
      allocate ( apmax(nfiles)[*] )


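! set separate channel for each processor
! (assumed: any unit number unique to the image will do)
      myread = 10 + myimg

! initialize read time and the local sum
      io_read = 0.0d0
      aglb = 0.0

! start overall timer (assumed: MPI_WTIME, declared in mpif.h)
      start = MPI_WTIME()

! read files as round robin.
      do iread=1,nfiles

! assumed round-robin test: image myimg reads files myimg,
! myimg+nimgs, myimg+2*nimgs, ...
        ido = mod(iread - myimg, nimgs)

        if(ido .eq. 0) then

!          write(6,*) ' image ',myimg,
!     &      ' reading from file number ',iread

          write(myfilename,"(a,i6.6)") '/tmp/baring/D/data',iread
!          write(6,*) myfilename

! open file (assumed unformatted, to match the read below)
          io_start = MPI_WTIME()
          open(myread, file=myfilename, form='unformatted')

! read data
          read(myread) aread

! sum data read locally thus far
          aglb = aglb + aread

! copy maximum from this file to master array
          apmax(iread)[master] = maxval(aread)

! close this file and accumulate the I/O time
          close(myread)
          io_end = MPI_WTIME()
          io_read = io_read + (io_end - io_start)

        endif

      enddo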

! Done reading local files.  Gather results to master

      allocate (apglb(ndata)[*])
      apglb = aglb

      call sync_images ()

      if ( myimg .eq. master ) then

! Sum arrays from all images on master

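        do imgn = 2, nimgs
          apglb(:) = apglb(:) + apglb(:)[imgn]
        enddo

! Sum io_read times from all images on master

        mio_read = io_read
        do imgn = 2, nimgs
          mio_read = mio_read + io_read[imgn]
        enddo

! Compute average (assumed: average over the number of files)
        apglb = apglb / nfiles

! Print results (assumed: locations found with maxloc)
        iloc = maxloc(apglb)
        write(6,*) ' maximum average is at ',iloc(1),apglb(iloc(1))

        iloc = maxloc(apmax)
        write(6,*) ' maximum value is in file number ',
     &      iloc(1),apmax(iloc(1))

! stop overall timer (assumed: MPI_WTIME)
        end = MPI_WTIME()
        write(6,*) ' took ',end-start,' seconds ',
     &      ' io total time ',mio_read

      endif

      call MPI_FINALIZE(ierr)

      end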

Call for CUG Technical Paper Abstracts by 8 Jan 1999

[ Received recently... ]

> You are invited to submit a technical paper for the 41st CUG Conference
> (Supercomputing Summit) in Minneapolis, Minnesota USA during 24-28 May
> 1999.
> CUG is your SGI Cray systems user forum for high-performance
> computation and visualization, and your CUG Program Committee is
> working diligently to maximize that value for you and your colleagues
> at supercomputing sites like yours across the world.
> That's why we're asking you to consider submitting a technical paper
> abstract for the next Supercomputing Summit sponsored by CUG. All the
> information you need is on-line.

> As it was with the 40 technical programs before it, the key to our next
> conference is a broad variety of high quality technical papers from SGI
> Cray contacts and from CUG site colleagues like yourself. Exchanging
> supercomputing insights is the essence of the technical presentations
> and their publication in the conference proceedings on CD-ROM. So, we
> have a great need for your suggestions on how to make the next
> conference the best one, yet.
> And, we hope that you will please consider submitting an abstract for a
> technical paper!
> In any case, you will want to mark your calendar and plan to register
> for the conference. You can find out more about the conference on-line
> at our CUG home page.

> There will be many formal and informal opportunities for you to share
> challenges and exchange information with your colleagues from other CUG
> sites and with SGI Cray technical experts, so you won't want to miss
> this meeting!  And, we don't want you to miss the opportunity to submit
> an abstract for a technical paper. The deadline for submitting an
> abstract is Friday, 8 Jan 1999.
> Please check the WWW information on the CUG home page today!
> With regards,
> Sam Milosevich
> CUG Vice President and Program Committee Chair

Next Issue: Jan. 8, 1999

Happy Holidays, Everyone!

Quick-Tip Q & A

A: {{ Fortran again. ALOG10 is an intrinsic CF90 function.  The man
      page even says so:

       LOG10 is the generic function name.  ALOG10 and DLOG10 are
       intrinsic for the CF90 compiler.

    But this program won't compile:
        program test
          write (6,*) "alog10 (1000)  :",  alog10 (1000)
        end

    Here's the error message:  
      yukon% f90 test.f

      write (6,*) "alog10 (1000)  :",  alog10 (1000)
      cf90-700 f90: ERROR TEST, File = junk.f, Line = 3, Column = 35 
        No specific intrinsic exists for the intrinsic call "ALOG10".

    What's wrong? }}

  Thanks to two readers:

  Here is my two-minute fix; insert a decimal point into 1000 to make
  it a floating point number, which is the argument that the intrinsic
  function alog10 expects.


  I think there are two problems here: i) an imprecise error message
  and ii) ugly Fortran programming. The compiler is trying to tell you
  that it has no integer intrinsic for log10. If you replace
  alog10(1000) with log10(1000) you get the same error message. There
  is no ilog10, as documented in the man page.

  The second problem is people's bad habit of using specific intrinsics
  rather than generic ones. I think generic intrinsics are one of the
  nice things about Fortran. The compiler looks at the data type of the
  argument and chooses the specific intrinsic by itself, making code
  cleaner. If you need or want to direct it towards a specific
  intrinsic, you do so by converting the type of the argument. In this
  case you could, for example, type

        write (6,*) "alog10 (1000)  :",  log10 (real(1000))

  and everything will work as expected.

Q: (Finally, one for the C programmers!)  Will C correctly cast 
   a pointer reference value?  For instance, given this declaration:
     unsigned temp = *uint;
   will "temp" be set as expected even if "uint" is an int pointer?

[ Answers, questions, and tips graciously accepted. ]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.