[Menu Bar] Resourses at ARSC Science at ARSC Newsroom Support About ARSC ARSC Home

 

ARSC T3E Users' Newsletter 164, March 19, 1999

Newsletter Index Quick-Tip Index Search Newsletters

MPI_GATHERV Example

MPI provides a wide range of collective communications. These send/receive data to/from a group of processors from/to a single processor. This single processor is referred to as the root processor. A simple MPI_GATHER/MPI_SCATTER operation receives/sends the same amount of data from/to each processor. As mentioned last week, MPI_GATHERV/MPI_SCATTERV extend this functionality by allowing both the amount of data received/sent to each processor to vary and for an offset in the location on the root processor where data is written to/read from.

The code below gives an example of the use of MPI_SCATTER and MPI_GATHERV along with a simple parallel IO optimization. The root computes task sizes and uses MPI_SCATTER to distribute them to all processors. Here's an example with the differences between task sizes exaggerated:

  Total data items to be computed:           23
  Number of processors:                      6
  Distribution of tasks computed on root as: 6+2+2+5+7+1 = 23
  Task sizes stored to an array of length 6: /622571/
  Array then scattered from root...
             
  <after MPI_SCATTER>

  Processor number:  0       1       2       3       4       5
  Array sizes:       1       1       1       1       1       1
  Scattered Array:   /6/     /2/     /2/     /5/     /7/     /1/

After the processors have completed their independent computations, MPI_GATHERV sends the different sized chunks of results back to the root:

  Processor number:  0         1         2         3         4         5
  Array sizes:       6         2         2         5         7         1
  Result Arrays:     /mmmmmm/  /nn/      /oo/      /ppppp/   /qqqqqqq/ /r/
  Array then MPI_GATHERV'ed to root...

  <after MPI_GATHERV on root>

  Array size:        23
  Gathered array:    /mmmmmmnnoopppppqqqqqqqr/

In greater detail, the program:

  1. Starts MPI and sets a main root processor, mroot.

  2. For each work variable it then sets a different root processor for each variable to be gathered to. (Note use of MOD here to ensure processors are valid.)

  3. A simple counting method is used to set roughly equal data sizes on each processor. Tables for use in MPI_GATHERV of data size, indata, and displacement, iofs, are created.

  4. Table and offsets are broadcast, to all processors, and MPI_SCATTER is used to send each processor the size of data it will work with.

  5. Each processor, if it has a non-zero worksize, allocated space and does work required.

  6. Root processors allocate space to GATHER data into.

  7. All processors call MPI_GATHERV, 4 times to collect each variable.

  8. Root processors now dump data to disk.

  9. MPI services closed.

Points to note:

Note all processors complete all 4 calls to MPI_GATHERV before starting IO. If the call to var_dump was made after each MPI_GATHERV the processor writing would be the only processor working, as others waited for it to complete writing the file and play its part in the MPI_GATHERV operation. (For another example on parallel IO see newsletter #157:

http://www.arsc.edu/support/news/T3Enews/T3Enews157.shtml )

The code:


c**********************************************************************
c gatherv demo
c written by G Robinson, ARSC 17,18,19th March 1999.
c**********************************************************************
      program main


      implicit none
      include 'mpif.h'

! mpi admin variable
      integer myid, numprocs, mroot,ierr
      common /par_info/ myid, numprocs, mroot

! distribution tables
      integer, dimension(:), allocatable :: indata, ipdata, iofs
      integer ipglb,ipset

! which processors does which glb storage
      integer Uroot,Vroot,Wroot,Troot
      integer iuchan,ivchan,iwchan,itchan

! merger array
      integer, dimension(:), allocatable :: uglb,vglb,wglb,tglb

! processors workdata
      integer i,ipdo
      integer, dimension(:), allocatable :: u,v,w,t
! number of different distributions
      integer nvar

!loop and counters
      integer ip,ipsize
      integer is,if

!! start MPI
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      print *, "Process ", myid, " of ", numprocs, " is alive"
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)

      mroot = 0

!set glb workspace roots
      Uroot=mod(0,numprocs)
      Vroot=mod(1,numprocs)
      Wroot=mod(2,numprocs)
      Troot=mod(3,numprocs)

!set datafile channel output
      iuchan=20;ivchan=21;iwchan=22;itchan=25

      write(6,*) ' processor ',myid,' roots are ',
     $      Uroot,Vroot,Wroot,Troot

!! for remote roots table sizes are globally allocated
      allocate(indata(numprocs))
      allocate(iofs(numprocs))

!! read problem size

      if(myid.eq.mroot) then

        write(6,*) ' input problem size '
        read(5,*) ipglb
        write(6,*) ' size input: ',ipglb

        indata=0
!! determine distribution
!! simple counting method for distribution
        ipset=1
        do ip=1,ipglb
          indata(ipset)=indata(ipset)+1
          ipset=mod(ipset,numprocs)+1
        enddo

!! set pointers for gather operation
        ipset=0
        do ip=1,numprocs
          iofs(ip)=ipset
          ipset=ipset+indata(ip)
        enddo
      endif

!! for remote roots broadcast tables.
      nvar=1
      call MPI_BCAST(ipglb, nvar, MPI_INTEGER, mroot,
     $      MPI_COMM_WORLD, ierr)
      call MPI_BCAST(indata, numprocs, MPI_INTEGER, mroot,
     $      MPI_COMM_WORLD, ierr)
      call MPI_BCAST(iofs, numprocs, MPI_INTEGER, mroot,
     $      MPI_COMM_WORLD, ierr)


!! distribute to each processor worksize
      nvar=1
      allocate(ipdata(nvar))
      call MPI_SCATTER(indata,nvar,MPI_INTEGER,
     $      ipdata,nvar,MPI_INTEGER,mroot,MPI_COMM_WORLD,ierr)



      write(6,*) ' on ',myid,' working on ',ipdata(nvar)

!! allocate work array, U,V,W,T
      ipsize=ipdata(nvar)
      if(ipsize.gt.0) then
        allocate(u(ipsize))
        allocate(v(ipsize))
        allocate(w(ipsize))
        allocate(t(ipsize))

!! do something simple
        do i=1,ipsize
          u(i)=i+100000*(myid+1)
        enddo
        v=2*u;w=3*u;t=w+u

      endif

!! gatherv results to different processors
!! only allocate space on root processors
      if(myid.eq.Uroot) allocate(uglb(ipglb))
      if(myid.eq.Vroot) allocate(vglb(ipglb))
      if(myid.eq.Wroot) allocate(wglb(ipglb))
      if(myid.eq.Troot) allocate(tglb(ipglb))

! call gatherv on ALL processors.
      call MPI_GATHERV(u,ipdata(nvar),MPI_INTEGER,
     $      uglb,indata,iofs,MPI_INTEGER,Uroot, MPI_COMM_WORLD,
     $      ierr)

      call MPI_GATHERV(v,ipdata(nvar),MPI_INTEGER,
     $      vglb,indata,iofs,MPI_INTEGER,Vroot, MPI_COMM_WORLD,
     $      ierr)

      call MPI_GATHERV(w,ipdata(nvar),MPI_INTEGER,
     $      wglb,indata,iofs,MPI_INTEGER,Wroot, MPI_COMM_WORLD,
     $      ierr)

      call MPI_GATHERV(t,ipdata(nvar),MPI_INTEGER,
     $      tglb,indata,iofs,MPI_INTEGER,Troot, MPI_COMM_WORLD,
     $      ierr)


      call MPI_BARRIER(MPI_COMM_WORLD,ierr)

!! write data out on host processor.
      if(myid.eq.Uroot) call var_dump(iuchan,ipglb,uglb)
      if(myid.eq.Vroot) call var_dump(ivchan,ipglb,vglb)
      if(myid.eq.Wroot) call var_dump(iwchan,ipglb,wglb)
      if(myid.eq.Troot) call var_dump(itchan,ipglb,tglb)


!!
      if(myid.eq.Uroot) then
        do ip=1,numprocs
          is=iofs(ip)+1
          if=is+indata(ip)-1
          write(6,*) ' on ',myid,' from  ',ip,' data start is at ',
     $     is,' value ',uglb(is),' end ',if,' value ',uglb(if)
        enddo
      endif

!! quit MPI
 399  call MPI_FINALIZE(ierr)
      stop
      end
c**********************************************************************
      subroutine var_dump(ic,in,idata)

! input channel
      integer ic
! input number of data items
      integer in
! input data to write
      integer, dimension(in) :: idata

!par common
      integer myid, numprocs, mroot
      common /par_info/ myid, numprocs, mroot

      write(6,*) ' dumping on ',myid,' channel ',ic,in

      write(ic) idata

      return
      end
c**********************************************************************

"Restart" vs "Rerun" and Preventing the Latter

In NQS terminology, "restart" and "rerun" are distinct terms, as follows:

restart:

the NQS request runs for a while, is interrupted, and a checkpoint image is created. Later, using the checkpoint image, it starts up again from the the point at which it was interrupted. Wall-clock time is lost, but no CPU time is lost.

rerun:

the request runs for a while and is interrupted. Later, it starts again FROM THE BEGINNING. Wall-clock time is lost, and generally CPU time is lost as well.

"Reruns" are uncommon. When a controlled shutdown occurs or running jobs are held, the T3E system operator will hold and checkpoint all jobs first. When he/she releases the jobs, they "restart" from where they were when halted. However, if the system crashes (takes "unexpected downtime") the system will either "rerun" the work from the beginning, or if the job was checkpointed at some time in its past, resume processing from there.

In some cases, jobs should NOT be rerun. For example consider a job which works on a database updating entries. Being rerun may repeat some operations and would not be correct. (Ever wondered why you got billed twice for something?)

Jobs which require manual user intervention between runs should generally not be rerun either. For instance, a program which overwrites result files from prior runs might overwrite useful output generated prior to the interruption.

Fortunately, the "-nr" QSUB options allows you to prevent your job from rerunning. From the qsub man page:

-nr Specifies that the batch request cannot be rerun.

If your job should not be rerun, it's your responsibility to specify qsub -nr.

Computer Art Show Opens at UAF

The first art student at UAF to complete a Bachelor of Fine Arts entirely through creation of computer graphics will present his senior portfolio to the public at an art exhibit in the UAF Art Gallery 6 p.m. Monday, March 22 in the Fine Arts Complex.

Ben Barton used Macintosh, PC, and SGI computers with off-the-shelf software to create large, colorful images of reflective, refractive objects in a 3D environment. His show, titled "RayFract," will be on display 8 am - 5 pm in the UAF Art Gallery through Friday, March 26.

"They're all raytraced images," Barton said. "I set up the parameters of a 3D environment, and create the 3D shapes, colors, textures and reflections."

Barton is the first student at UAF to go through the formal BFA process in Computer Art, a degree program that was just accepted last year, said Bill Brody, Chairman of the UAF art department. "Ben has chosen to construct these imaginary scenes with a devotion to detail that is really quite remarkable" Brody said. "I'm not familiar with any other people working like this. The resolution, the image quality is much, much higher than anybody does in movies. Most people don't have the patience to spend as much time on a computer image as you would on a painting, and Ben does."

Editor's note: Ben Barton is an ARSC student assistant and uses ARSC visualization equipment in his work. Beginning next week, he expects to have some of his images on the web, at:

http://www.art.uaf.edu/benb/

Congratulations, Ben!

Quick-Tip Q & A


A:{{ You may have different storage quotas for files residing on disk
     and those that have been migrated by DMF to tape (this is true for
     ARSC users).  How can you determine your current storage volume
     separated into disk and DMF components? }}

    du -m

  
    The "du" command summarizes disk usage.  From the man page, here are
    three especially useful options:

       -k   Expresses all block counts in terms of 1024-byte units, rather
            than the default 512-byte units.

       -m   Reports block counts for migrated (offline) files and nonmigrated
            (online) files.  Valid only with Cray Research file systems.  See
            the EXAMPLES section for the format of this output.

       -s   Reports only the grand total for each of the specified file
            operands.


    And an example, showing the effect of "-m":

      yukon:baring$ du -sk ~  
      216320  /u1/uaf/baring

      yukon:baring$ du -msk ~
      216320     on-line   844744     off-line /u1/uaf/baring
  

Q: Here's a fun one...

   Sourdough Sam sells moose for $10 each, reindeer for $3 each, and
   ducks for $0.50 each.  He floats his raft down the Yukon River to
   the Chilkoot Pass Trading Post one spring morning and sells exactly
   100 animals for exactly $100, selling at least one of each species.
   How many of each did he sell?

[ Answers, questions, and tips graciously accepted. ]

 


Current Editors:
Donald Bahls ARSC User Consultant ph: 907-450-8674
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Contact:
Send comments and questions to the current editors using this Contact Form.
E-mail Subscriptions: Archives:

 

Newsletter Index Quick-Tip Index Search Newsletters

 

Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775 | voice: 907-450-8600 | email:

home | search | about | support | news | science | resources