ARSC T3E Users' Newsletter 157, December 4, 1998

Example of Parallel Post-Processing

Parallel programs often generate large amounts of data which must then be processed to extract the important information; this post-processing can sometimes be the slowest, most time-consuming part of the work. In the example below, a simple framework code is supplied for the parallel analysis of a large number of data files.

In this simple example code, a number of files are assumed to exist with names of the form data<number>, all of the same size and structure and each containing a single array. The goal is to find which file holds the maximum value and to construct an average for each array entry.

The code starts MPI, and then one processor reads from standard input the number of files to process and the number of data items in each. This information is then sent to all processors with MPI_BCAST. Space is then allocated on each processor: aread to hold the data read from each file, aglb to store the working average, and amax to hold the per-file maxima. (Note that amax is set to -huge(amax), a useful F90 way to initialize such data.)

The code now loops over the files, with each processor reading data files in a round-robin assignment. Note that each processor reads from a different channel number, derived from the processor id. After reading the data, each processor adds it to its own running average, aglb, and finds the maximum value in the dataset, storing it in the array amax according to the file number being processed.

Once all files have been read, the data is collected on one processor for final processing. MPI_REDUCE is used with the MPI_SUM operator to sum all the different copies of aglb into one array on the mroot processor. Other possible operators are listed here:


  MPI_MAX                 maximum
  MPI_MIN                 minimum
  MPI_SUM                 summation
  MPI_PROD                product
  MPI_MAXLOC, MPI_MINLOC  maximum/minimum and its location
                          (see the sketch after this list)
  MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR
                          logical and bitwise AND, OR, and exclusive-OR
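
MPI_MAXLOC and MPI_MINLOC operate on (value, location) pairs rather than plain arrays. As a minimal sketch (reusing amax, myid, mroot, and ierr from the code below; vin and vout are introduced here only for illustration), the global maximum and the rank that read it could be found with:

      real vin(2), vout(2)
c pack the largest value this processor has seen and its rank
      vin(1) = maxval(amax)
      vin(2) = real(myid)
c one (value,location) pair: the count is 1 and the type is MPI_2REAL
      call MPI_REDUCE(vin, vout, 1, MPI_2REAL, MPI_MAXLOC, mroot,
     &     MPI_COMM_WORLD, ierr)
c on mroot, vout(1) is the global maximum and vout(2) the rank
c that holds it
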
The MPI_MAX operator is used for the per-file maxima; the array was initially set to a huge negative value so that any value read will replace it. Note that MPI_REDUCE stores the desired results on the mroot processor only. (MPI_ALLREDUCE can be used if the result is to be available on all processors. This is equivalent to an MPI_REDUCE followed by an MPI_BCAST.)
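
As a minimal sketch, reusing names from the code below (and noting that apglb would then have to be allocated on every processor, not just on mroot), the same summation with MPI_ALLREDUCE would read:

c every processor receives the summed result, so no root argument
      call MPI_ALLREDUCE(aglb, apglb, ndata, MPI_REAL, MPI_SUM,
     &     MPI_COMM_WORLD, ierr)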

After the results are collated on the mroot processor, some simple facts are printed and an array is written out.

Limitations of this code

There are several limitations to this code:
  • It assumes that the data being extracted from the files and the processed results can all be stored on one processor. If this is not the case, the post-processing code should consider using the same parallel data distribution as the computational code employed.
  • A round-robin scheme is used to read data from the files; this assumes an equal workload for each file. If this were not the case, a more complex task-farming approach could be used to distribute files to workers (a sketch follows this list).
  • Since each processor reads a file, the scaling of this code is more likely to depend on file system performance than on any other factor.
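
A minimal sketch of such a task farm follows. It is not part of the code below and assumes at least two processors; istatus, idummy, and inext are introduced here only for illustration. Processor mroot hands out file numbers one at a time as workers ask for them, sends 0 to tell a worker to stop, and in this scheme only coordinates rather than reading files itself:

      integer istatus(MPI_STATUS_SIZE), idummy, inext

      if (myid .eq. mroot) then
c master: answer each request with the next file number, or 0 to stop
        do iread = 1, nfiles + numprocs - 1
          call MPI_RECV(idummy, 1, MPI_INTEGER, MPI_ANY_SOURCE,
     &         1, MPI_COMM_WORLD, istatus, ierr)
          if (iread .le. nfiles) then
            inext = iread
          else
            inext = 0
          endif
          call MPI_SEND(inext, 1, MPI_INTEGER,
     &         istatus(MPI_SOURCE), 2, MPI_COMM_WORLD, ierr)
        enddo
      else
c worker: ask mroot for work, process the file received, ask again
 20     call MPI_SEND(myid, 1, MPI_INTEGER, mroot, 1,
     &       MPI_COMM_WORLD, ierr)
        call MPI_RECV(inext, 1, MPI_INTEGER, mroot, 2,
     &       MPI_COMM_WORLD, istatus, ierr)
        if (inext .ne. 0) then
c         ... open, read, and process file number inext as in the
c         round-robin version ...
          goto 20
        endif
      endif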

A more realistic post-processor might read filenames from an input list and process many variables with a more complex analysis. It is hoped this example provides a framework from which users can build a simple, yet functional, parallel post-processor.
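
As a minimal sketch of the filename-list variant, again reusing names from the code below and assuming a hypothetical formatted file 'filelist' (read on unit 20) that holds the file count followed by one filename per line, the names could be read and broadcast like this; the array fnames is introduced only for illustration:

      character*80, allocatable, dimension(:) :: fnames

      if (myid .eq. mroot) then
        open(unit=20, file='filelist', form='formatted')
        read(20,*) nfiles
      endif
      call MPI_BCAST(nfiles,1,MPI_INTEGER,mroot,MPI_COMM_WORLD,ierr)
      allocate (fnames(nfiles))
      if (myid .eq. mroot) then
        do iread = 1, nfiles
          read(20,'(a)') fnames(iread)
        enddo
        close(20)
      endif
c each name is 80 characters, so the count is 80*nfiles
      call MPI_BCAST(fnames,80*nfiles,MPI_CHARACTER,mroot,
     &     MPI_COMM_WORLD,ierr)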

Example timings follow for the MPI version when processing 32 files of 2,000,000 entries each.


NPES    time to complete         total time spent in read
            (secs)          (secs, summed over all processors)

  1          53.8                        47.4
  2          27.6                        47.4
  4          14.4                        47.6
  6          11.9                        49.8
  8           9.8                        53.6
 10          14.4                        72.4
 12          13.4                        95.2
 16           9.2                       105.5
 20           9.4                       116.7

As can be seen, the speedup is good for relatively small numbers of processors; above 8 processors, the need to read many files at once becomes a bottleneck.

The total time all processors spend reading data is roughly constant for small processor counts, but as more processors attempt to access the file system it becomes saturated and the time taken to read each file increases dramatically.

Here is the code:


c**********************************************************************
c simple multifile post processor
c guy robinson, v1.0 dec 1st 1998.
c
c****************************************************************************
      program main

      implicit none

      include 'mpif.h'
      integer  myid, numprocs, mroot
      integer ierr

! data for file read
      integer nfiles,ndata
! filename
      character*80 myfilename
! file channel
      integer myread

! data for processing
      real, allocatable, dimension(:) :: aread, aglb, amax
! data for master only
      real, allocatable, dimension(:) :: apglb, apmax
      integer iloc(1)

! loops
      integer iread, ido
      integer idata


! timers
      double precision :: start,end
      double precision :: io_start, io_end, io_read,mio_read
! setup MPI
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      print *, "Process ", myid, " of ", numprocs, " is alive"

      mroot = 0
 10   if ( myid .eq. mroot ) then

        write(6,*) ' enter number of files '
        read(5,99) nfiles
        write(6,*) ' reading ',nfiles,' files '
        write(6,*) ' enter number of data items in each file '
        read(5,99) ndata
        write(6,*) ' reading ',ndata,' data items '
 99     format(i10)

      endif

      call MPI_BCAST(nfiles,1,MPI_INTEGER,mroot,MPI_COMM_WORLD,ierr)

      call MPI_BCAST(ndata,1,MPI_INTEGER,mroot,MPI_COMM_WORLD,ierr)

!get workspaces
! data to store global average
      allocate (aglb(ndata))
      allocate (aread(ndata))

      allocate (amax(nfiles))

      aglb=0.0
      amax=-huge(amax)

!set separate channel for each processor
      myread=30+myid
      io_read=0.0

      start=MPI_WTIME()
! read files as round robin.

      do iread=1,nfiles

        ido=mod(iread-myid-1,numprocs)

        if(ido.eq.0) then
! read from file on this processor
          write(6,*) ' processor ',myid,
     &      ' reading from file number ',iread

          write(myfilename,"(a,i6.6)") '/tmp/robinson/data',iread
          write(6,*) myfilename
! open file
          open(unit=myread,file=myfilename,form='unformatted')

! read data
          io_start=MPI_WTIME()
          read(myread) aread
          io_end=MPI_WTIME()
          io_read=io_read+io_end-io_start
! sum data
          aglb=aglb+aread

! find maximum
          amax(iread)=maxval(aread)

! close file
          close(myread)

        endif

      enddo


! gather results

      if ( myid .eq. mroot ) then
        allocate (apglb(ndata))
        allocate (apmax(nfiles))
      endif
! global averages

      call MPI_REDUCE(aglb,apglb,ndata,MPI_REAL,MPI_SUM,mroot,
     &     MPI_COMM_WORLD,ierr)

! collect data for a graph of maximum
      call MPI_REDUCE(amax,apmax,nfiles,MPI_REAL,MPI_MAX,mroot,
     &     MPI_COMM_WORLD,ierr)

      if ( myid .eq. mroot ) then
! take average
        apglb=apglb/real(nfiles)

        iloc=maxloc(apglb)
        write(6,*) ' maximum average is at ',iloc(1),apglb(iloc(1))
        iloc=maxloc(apmax)
        write(6,*) ' maximum value is in file number ',
     &      iloc(1),apmax(iloc(1))

        do idata=1,nfiles
          write(6,*) idata,apmax(idata)
        enddo
      endif

      end=MPI_WTIME()

      call MPI_REDUCE(io_read,mio_read,1,MPI_DOUBLE_PRECISION,MPI_SUM,
     &     mroot,MPI_COMM_WORLD,ierr)

      if ( myid .eq. mroot ) then
        write(6,*) ' took ',end-start,
     &      ' seconds, io total time ',mio_read
      endif

      call MPI_FINALIZE(ierr)
      stop
      end

Sparse '99 Conference Announcement

[ We received this from Michael Olesen, conference administrator. ]

> 
>  TITLE:    1999 INTERNATIONAL CONFERENCE  ON PRECONDITIONING TECHNIQUES
>            FOR  LARGE SPARSE MATRIX PROBLEMS IN INDUSTRIAL APPLICATIONS
> 
>  DATE:     June 10 - 12, 1999                                        
> 
>  PLACE:    University of Minnesota, Hubert H. Humphrey Institute,  
>            Minneapolis, Minnesota                                 
> 
>  Please consult the web-site
>   http://www2.msi.umn.edu/Symposia/sparse99 for detailed
>   information.
> 
> CONFERENCE TOPICS:
> 
>       o Incomplete factorization preconditioners
>       o Domain decomposition  preconditioners
>       o Approximate  inverse  preconditioners
>       o Multi-level preconditioners
>       o Preconditioning techniques in optimization problems
>       o Preconditioning techniques in finite  element problems
>       o Preconditioning techniques in image processing
>       o Applications in Computational Fluid Dynamics  (CFD)
>       o Applications in computational finance
>       o Multiphase  subsurface  flow applications
>       o Applications in petroleum industry
>       o Applications in semiconductor device simulation
> 
> IMPORTANT DATES:
> 
>  * February 26th, 1999 : deadline for submission of extended
>    abstracts. 
> 

[ ... rest deleted ... ]

Quick-Tip Q & A


A: {{ The syntax:  (\  STUFF  \) , appears in the "sync_images" article
      sample code, above.  What does this kind of bracket do and
      where's it from?  Is it Fortran? }}


  # 
  # Thanks to Alan Wallcraft of NRL for his response:
  # 

  This should be (/ STUFF /).  The backslash is not in the Fortran
  character set.

  It is a Fortran 90 "array constructor" that contains a list of any
  mixture of scalars, arrays, and/or implied-do specifications.  It
  creates a rank-one array.  I think two-character delimiters are used
  to keep Fortran's character set small.  In different contexts Fortran
  uses other pairs of symbols, e.g. //, =>, ==, >=, <=.  I initially
  had trouble with the implied-do syntax because of all the
  parentheses:

      I_1TO10 = (/ (I, I=1,10) /)  ! not (/ I, I=1,10 /)

  which is equivalent to:

      I_1TO10 = (/ 1,2,3,4,5,6,7,8,9,10 /)

  A "gotcha" is how to construct a zero-sized array.  The following
  does not work:

      I_0 = (/ /)  ! illegal, because (/ /) has no type

  but this does:

      I_0 = (/ (I, I=1,0) /)  ! zero-length implied-do

  Otherwise array constructors work as expected.  They can be used with
  RESHAPE to construct higher rank arrays.





Q: Fortran again. ALOG10 is an intrinsic CF90 function.  The man page 
   even says so:

     DESCRIPTION
       LOG10 is the generic function name.  ALOG10 and DLOG10 are
       intrinsic for the CF90 compiler.

    But this program won't compile:

      !----------------------------------------
        program test
          write (6,*) "alog10 (1000)  :",  alog10 (1000)
        end
      !----------------------------------------


    Here's the error message:  

      yukon% f90 test.f

      write (6,*) "alog10 (1000)  :",  alog10 (1000)
                                         ^             
      cf90-700 f90: ERROR TEST, File = junk.f, Line = 3, Column = 35 
        No specific intrinsic exists for the intrinsic call "ALOG10".


    What's wrong?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.