ARSC T3E Users' Newsletter 164, March 19, 1999


MPI provides a wide range of collective communications. These send/receive data to/from a group of processors from/to a single processor. This single processor is referred to as the root processor. A simple MPI_GATHER/MPI_SCATTER operation receives/sends the same amount of data from/to each processor. As mentioned last week, MPI_GATHERV/MPI_SCATTERV extend this functionality by allowing both the amount of data received/sent to each processor to vary and for an offset in the location on the root processor where data is written to/read from.

The code below gives an example of the use of MPI_SCATTER and MPI_GATHERV along with a simple parallel IO optimization. The root computes task sizes and uses MPI_SCATTER to distribute them to all processors. Here's an example with the differences between task sizes exaggerated:

  Total data items to be computed:           23
  Number of processors:                      6
  Distribution of tasks computed on root as: 6+2+2+5+7+1 = 23
  Task sizes stored to an array of length 6: /622571/
  Array then scattered from root...
  <after MPI_SCATTER>

  Processor number:  0       1       2       3       4       5
  Array sizes:       1       1       1       1       1       1
  Scattered Array:   /6/     /2/     /2/     /5/     /7/     /1/

After the processors have completed their independent computations, MPI_GATHERV sends the different sized chunks of results back to the root:

  Processor number:  0         1         2         3         4         5
  Array sizes:       6         2         2         5         7         1
  Result Arrays:     /mmmmmm/  /nn/      /oo/      /ppppp/   /qqqqqqq/ /r/
  Array then MPI_GATHERV'ed to root...

  <after MPI_GATHERV on root>

  Array size:        23
  Gathered array:    /mmmmmmnnoopppppqqqqqqqr/

In greater detail, the program:

  1. Starts MPI and sets a main root processor, mroot.
  2. For each work variable it then sets a different root processor for each variable to be gathered to. (Note use of MOD here to ensure processors are valid.)
  3. A simple counting method is used to set roughly equal data sizes on each processor. Tables for use in MPI_GATHERV of data size, indata, and displacement, iofs, are created.
  4. Table and offsets are broadcast, to all processors, and MPI_SCATTER is used to send each processor the size of data it will work with.
  5. Each processor, if it has a non-zero worksize, allocated space and does work required.
  6. Root processors allocate space to GATHER data into.
  7. All processors call MPI_GATHERV, 4 times to collect each variable.
  8. Root processors now dump data to disk.
  9. MPI services closed.

Points to note:

  • The problem size, processor worksize table, and offsets are broadcast so that any processor could work as a root. These values are only required on the processors which actually gather data, i.e. the roots themselves. The only data items which need to exist on the non-root processors are the sending buffer, the size of data to be sent, the type of data being sent, the identity of the root processor, and the communicator to use. (This broadcast could have been refined to only send data to the processors acting as roots, see newsletter #146:


    for details on the use of MPI communicators to restrict MPI activity to processor subsets.)

  • Note that the offsets are how far from the start of the array the data should be stored, not pointers or array addresses. So, to put data at the start of the array the offset value is 0, not 1!
  • In this example MPI_GATHERV is used to create an array in which data from processors is stored in order without gaps in array xglb. Both the order and spacing of data could be varied by use of the offset values. However it is an error for MPI_GATHERV to attempt to write to any location in the receiving array more than once.
  • By gathering data to different root processors parallel IO is achieved since each of the root processors calls the routine var_dump at the same time but to different files with the local variable.

Note all processors complete all 4 calls to MPI_GATHERV before starting IO. If the call to var_dump was made after each MPI_GATHERV the processor writing would be the only processor working, as others waited for it to complete writing the file and play its part in the MPI_GATHERV operation. (For another example on parallel IO see newsletter #157:

/arsc/support/news/t3enews/t3enews157/index.xml )

The code:

c gatherv demo
c written by G Robinson, ARSC 17,18,19th March 1999.
      program main

      implicit none
      include 'mpif.h'

! mpi admin variable
      integer myid, numprocs, mroot,ierr
      common /par_info/ myid, numprocs, mroot

! distribution tables
      integer, dimension(:), allocatable :: indata, ipdata, iofs
      integer ipglb,ipset

! which processors does which glb storage
      integer Uroot,Vroot,Wroot,Troot
      integer iuchan,ivchan,iwchan,itchan

! merger array
      integer, dimension(:), allocatable :: uglb,vglb,wglb,tglb

! processors workdata
      integer i,ipdo
      integer, dimension(:), allocatable :: u,v,w,t
! number of different distributions
      integer nvar

!loop and counters
      integer ip,ipsize
      integer is,if

!! start MPI
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      print *, "Process ", myid, " of ", numprocs, " is alive"

      mroot = 0

!set glb workspace roots

!set datafile channel output

      write(6,*) ' processor ',myid,' roots are ',
     $      Uroot,Vroot,Wroot,Troot

!! for remote roots table sizes are globally allocated

!! read problem size

      if(myid.eq.mroot) then

        write(6,*) ' input problem size '
        read(5,*) ipglb
        write(6,*) ' size input: ',ipglb

!! determine distribution
!! simple counting method for distribution
        do ip=1,ipglb

!! set pointers for gather operation
        do ip=1,numprocs

!! for remote roots broadcast tables.
      call MPI_BCAST(ipglb, nvar, MPI_INTEGER, mroot,
     $      MPI_COMM_WORLD, ierr)
      call MPI_BCAST(indata, numprocs, MPI_INTEGER, mroot,
     $      MPI_COMM_WORLD, ierr)
      call MPI_BCAST(iofs, numprocs, MPI_INTEGER, mroot,
     $      MPI_COMM_WORLD, ierr)

!! distribute to each processor worksize
      call MPI_SCATTER(indata,nvar,MPI_INTEGER,
     $      ipdata,nvar,MPI_INTEGER,mroot,MPI_COMM_WORLD,ierr)

      write(6,*) ' on ',myid,' working on ',ipdata(nvar)

!! allocate work array, U,V,W,T
      if( then

!! do something simple
        do i=1,ipsize


!! gatherv results to different processors
!! only allocate space on root processors
      if(myid.eq.Uroot) allocate(uglb(ipglb))
      if(myid.eq.Vroot) allocate(vglb(ipglb))
      if(myid.eq.Wroot) allocate(wglb(ipglb))
      if(myid.eq.Troot) allocate(tglb(ipglb))

! call gatherv on ALL processors.
      call MPI_GATHERV(u,ipdata(nvar),MPI_INTEGER,
     $      uglb,indata,iofs,MPI_INTEGER,Uroot, MPI_COMM_WORLD,
     $      ierr)

      call MPI_GATHERV(v,ipdata(nvar),MPI_INTEGER,
     $      vglb,indata,iofs,MPI_INTEGER,Vroot, MPI_COMM_WORLD,
     $      ierr)

      call MPI_GATHERV(w,ipdata(nvar),MPI_INTEGER,
     $      wglb,indata,iofs,MPI_INTEGER,Wroot, MPI_COMM_WORLD,
     $      ierr)

      call MPI_GATHERV(t,ipdata(nvar),MPI_INTEGER,
     $      tglb,indata,iofs,MPI_INTEGER,Troot, MPI_COMM_WORLD,
     $      ierr)


!! write data out on host processor.
      if(myid.eq.Uroot) call var_dump(iuchan,ipglb,uglb)
      if(myid.eq.Vroot) call var_dump(ivchan,ipglb,vglb)
      if(myid.eq.Wroot) call var_dump(iwchan,ipglb,wglb)
      if(myid.eq.Troot) call var_dump(itchan,ipglb,tglb)

      if(myid.eq.Uroot) then
        do ip=1,numprocs
          write(6,*) ' on ',myid,' from  ',ip,' data start is at ',
     $     is,' value ',uglb(is),' end ',if,' value ',uglb(if)

!! quit MPI
 399  call MPI_FINALIZE(ierr)
      subroutine var_dump(ic,in,idata)

! input channel
      integer ic
! input number of data items
      integer in
! input data to write
      integer, dimension(in) :: idata

!par common
      integer myid, numprocs, mroot
      common /par_info/ myid, numprocs, mroot

      write(6,*) ' dumping on ',myid,' channel ',ic,in

      write(ic) idata


"Restart" vs "Rerun" and Preventing the Latter

In NQS terminology, "restart" and "rerun" are distinct terms, as follows:


the NQS request runs for a while, is interrupted, and a checkpoint image is created. Later, using the checkpoint image, it starts up again from the the point at which it was interrupted. Wall-clock time is lost, but no CPU time is lost.


the request runs for a while and is interrupted. Later, it starts again FROM THE BEGINNING. Wall-clock time is lost, and generally CPU time is lost as well.

"Reruns" are uncommon. When a controlled shutdown occurs or running jobs are held, the T3E system operator will hold and checkpoint all jobs first. When he/she releases the jobs, they "restart" from where they were when halted. However, if the system crashes (takes "unexpected downtime") the system will either "rerun" the work from the beginning, or if the job was checkpointed at some time in its past, resume processing from there.

In some cases, jobs should NOT be rerun. For example consider a job which works on a database updating entries. Being rerun may repeat some operations and would not be correct. (Ever wondered why you got billed twice for something?)

Jobs which require manual user intervention between runs should generally not be rerun either. For instance, a program which overwrites result files from prior runs might overwrite useful output generated prior to the interruption.

Fortunately, the "-nr" QSUB options allows you to prevent your job from rerunning. From the qsub man page:

-nr Specifies that the batch request cannot be rerun.

If your job should not be rerun, it's your responsibility to specify qsub -nr.

Computer Art Show Opens at UAF

The first art student at UAF to complete a Bachelor of Fine Arts entirely through creation of computer graphics will present his senior portfolio to the public at an art exhibit in the UAF Art Gallery 6 p.m. Monday, March 22 in the Fine Arts Complex.

Ben Barton used Macintosh, PC, and SGI computers with off-the-shelf software to create large, colorful images of reflective, refractive objects in a 3D environment. His show, titled "RayFract," will be on display 8 am - 5 pm in the UAF Art Gallery through Friday, March 26.

"They're all raytraced images," Barton said. "I set up the parameters of a 3D environment, and create the 3D shapes, colors, textures and reflections."

Barton is the first student at UAF to go through the formal BFA process in Computer Art, a degree program that was just accepted last year, said Bill Brody, Chairman of the UAF art department. "Ben has chosen to construct these imaginary scenes with a devotion to detail that is really quite remarkable" Brody said. "I'm not familiar with any other people working like this. The resolution, the image quality is much, much higher than anybody does in movies. Most people don't have the patience to spend as much time on a computer image as you would on a painting, and Ben does."

Editor's note: Ben Barton is an ARSC student assistant and uses ARSC visualization equipment in his work. Beginning next week, he expects to have some of his images on the web, at:

Congratulations, Ben!


Quick-Tip Q & A

A:{{ You may have different storage quotas for files residing on disk
     and those that have been migrated by DMF to tape (this is true for
     ARSC users).  How can you determine your current storage volume
     separated into disk and DMF components? }}

    du -m

    The "du" command summarizes disk usage.  From the man page, here are
    three especially useful options:

       -k   Expresses all block counts in terms of 1024-byte units, rather
            than the default 512-byte units.

       -m   Reports block counts for migrated (offline) files and nonmigrated
            (online) files.  Valid only with Cray Research file systems.  See
            the EXAMPLES section for the format of this output.

       -s   Reports only the grand total for each of the specified file

    And an example, showing the effect of "-m":

      yukon:baring$ du -sk ~  
      216320  /u1/uaf/baring

      yukon:baring$ du -msk ~
      216320     on-line   844744     off-line /u1/uaf/baring

Q: Here's a fun one...

   Sourdough Sam sells moose for $10 each, reindeer for $3 each, and
   ducks for $0.50 each.  He floats his raft down the Yukon River to
   the Chilkoot Pass Trading Post one spring morning and sells exactly
   100 animals for exactly $100, selling at least one of each species.
   How many of each did he sell?

[ Answers, questions, and tips graciously accepted. ]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
Back to Top