ARSC T3E Users' Newsletter 146, July 10, 1998

MPI Communicators for Coupled T3E Codes

[ Thanks to Don Morton of the University of Montana for this contribution. Don is a visiting researcher at ARSC for his fifth summer. ]

Sometimes we find it advantageous to combine pieces of computer models in order to construct more realistic representations of the real world.

For example, researchers at UAF's Water Research Center have developed both a thermal and a hydrologic model for simulating various physical processes in the arctic.

The thermal model predicts soil temperatures and thaw depths (depth from soil surface to the underlying permafrost) in a watershed as a function of various meteorological data, and, among other things, available soil moisture. The hydrologic model predicts runoff and soil moisture storage as a function of various meteorological data, and, among other things, the thaw depth. The thermal model currently estimates soil moisture storage as a constant in time and space, clearly neglecting the heterogeneous nature of this parameter. Likewise, the hydrologic model estimates thaw depth empirically as a function of the day of the year, assigning a single value to the entire spatial domain, again failing to capture any heterogeneity.

Coupling of these two models, by having the thermal model send accurate thaw depth data to the hydrologic model, and having the hydrologic model send accurate soil moisture storage data to the thermal model, would capture the inherent heterogeneity while more closely representing the various feedback loops that exist in the real-world system.

Both models had been parallelized for Cray MPP architectures, and the goal was to construct a system in which both parallel models would execute simultaneously on the same architecture, periodically communicating with each other to exchange coupling data. The coupled system was intended to run on the Cray T3E, necessitating a Single Program Multiple Data (SPMD) paradigm, in which identical copies of the executable were loaded on each PE. Consolidating two separate codes into a single code is usually accomplished by transforming each of the original programs into subroutines, then constructing a driving program which calls the subroutines. This has been discussed in previous newsletters.

Because both codes had already been parallelized, it was desirable that each code continue to run as before, but this time simultaneously with periodic communication between the two codes. This was facilitated through the use of various MPI constructs that allow for creating separate communications groups and for creating an "intercommunicator" to send messages between processes in separate groups. A sample code follows, illustrating the basic infrastructure needed to have two parallel codes execute simultaneously with periodic exchange of data.

The code was constructed from two separate parallel programs - "program hydro" and "program thermal." When run independently, each code utilized the default MPI_COMM_WORLD communicator to send messages among the processes.

In order to create an SPMD version which would include both codes, each program was transformed into a subroutine and a higher-level driving program was created to call the appropriate subroutine, depending on processor number. Additionally, new communication groups were created, and the MPI_COMM_WORLD references in the MPI subroutine calls were replaced with the new communicator for the group. Finally, an intercommunicator was also created, allowing the master processes in each group to communicate with each other.

Once the groups and communicators have been set up, the appropriate hydro or thermal subroutine is called. The hydro and thermal subroutines illustrate some basic communication between processors in the same group, and between the two groups. The hydro subroutine has the hydro processors send their processor numbers to the master hydro process, which sums these values and then sends the sum to the thermal master process in the other group. Then the hydro master process waits for a message from the thermal master process and, when it is received, broadcasts it to the rest of the hydro processes.

Note - additional background information on this work can be found at


C      Don Morton, June 1998
C      Department of Computer Science
C      The University of Montana
C      Missoula, Montana 59812, USA
C      Email:
C     This is a sample code which illustrates the basic
C     mechanism used to couple two previously existing
C     parallel codes.

      program couple
      implicit none

      include 'mpif.h'

      integer myid,                   ! Rank of a process
     &        numprocs,               ! Total number of processes
     &        i,                      ! Loop counter
     &        rc, ierr,               ! Return values for MPI subroutines
     &        hydro_numprocs,         ! Number of processes for hydro code
     &        therm_numprocs,         ! Number of processes for thermal code
     &        ranks_in_old_group(0:127)  ! List of ranks in old group

      integer     !MPI_Comm  -- handles for MPI communicators
     &     world_comm_handle,
     &     hydro_comm_handle,
     &     therm_comm_handle,
     &     inter_comm_handle 

      integer     !MPI_Group -- handles for MPI groups
     &     world_group_handle,
     &     hydro_group_handle,
     &     therm_group_handle

ccc   Initialization of each process into MPI_COMM_WORLD
      call MPI_INIT(ierr)

ccc   Find out what my processor number is
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

ccc   Find out how many total processors are running
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

      print *, 'PE ', myid, ': Hello, World' 

ccc   Wait here until everyone has reached this point
      call MPI_Barrier(MPI_COMM_WORLD, ierr)

ccc   Assign roughly half of total processors to each of the two models
      hydro_numprocs = numprocs/2
      therm_numprocs = numprocs - hydro_numprocs

ccc   Obtain my "group" handle for use in creating new groups
      call MPI_Comm_group(MPI_COMM_WORLD, world_group_handle, ierr)

ccc   Create hydro group and communicator  cccccccccccccccccccccc    

ccc   First, get a list of the ranks I held in the "global" group
      do i=0,hydro_numprocs-1
        ranks_in_old_group(i) = i
      end do

ccc   Set up hydro processes and create a handle for the new hydro group
      call MPI_Group_incl(world_group_handle, hydro_numprocs,
     &         ranks_in_old_group, hydro_group_handle, ierr)

      call MPI_Comm_create(MPI_COMM_WORLD, hydro_group_handle,
     &         hydro_comm_handle, ierr)

ccc   Set up thermal processes and create a handle for the new thermal group

      do i=0,therm_numprocs-1
        ranks_in_old_group(i) = i+hydro_numprocs
      end do

      call MPI_Group_incl(world_group_handle, therm_numprocs,
     &         ranks_in_old_group, therm_group_handle, ierr)

      call MPI_Comm_create(MPI_COMM_WORLD, therm_group_handle,
     &         therm_comm_handle, ierr)

ccc   create intercommunicator and then proceed with hydro and thermal codes
      if (myid .lt. hydro_numprocs) then   ! I am a hydro process
         call MPI_Intercomm_create(hydro_comm_handle, 0, MPI_COMM_WORLD,
     &          hydro_numprocs, 0, inter_comm_handle, ierr)
         call hydro(hydro_comm_handle, inter_comm_handle)
      else  ! I am a thermal process
         call MPI_Intercomm_create(therm_comm_handle, 0, MPI_COMM_WORLD,
     &          0, 0, inter_comm_handle, ierr)
         call thermal(therm_comm_handle, inter_comm_handle)
      end if

      call MPI_Barrier(MPI_COMM_WORLD, ierr)

      call MPI_FINALIZE(rc)

      end



      subroutine hydro(hydro_comm_handle, inter_comm_handle)
      implicit none

      integer hydro_comm_handle,   ! "handles" used for communication
     &        inter_comm_handle

      include 'mpif.h'

      integer hydro_myrank, 
     &        hydro_numprocs,
     &        sum_of_ranks,
     &        thermal_number, 
     &        status(MPI_STATUS_SIZE),
     &        ierr

ccc   Find my rank and number of processes in the hydro group
      call MPI_COMM_RANK(hydro_comm_handle, hydro_myrank, ierr)
      print *, 'Rank number in hydro group: ', hydro_myrank 

      call MPI_COMM_SIZE(hydro_comm_handle, hydro_numprocs, ierr)
      print *, 'Number of processes in hydro group: ', hydro_numprocs 

      call MPI_Barrier(hydro_comm_handle, ierr)

ccc   Gather some numbers from all the hydro processes, and "do 
ccc   something" with them to form a single number.  In this
ccc   trivial example, each process simply sends its processor
ccc   number to the master hydro process, which in turn adds them
ccc   all together to produce a single value
      call MPI_Reduce(hydro_myrank, sum_of_ranks, 1, MPI_INTEGER,
     &                MPI_SUM, 0, hydro_comm_handle, ierr)

ccc   The hydro master process, having obtained and summed all the
ccc   numbers from the other hydro processes, prints the value
      if (hydro_myrank .eq. 0) then
         print *, 'Hydro0: Reduced value is: ', sum_of_ranks
      end if

ccc   Send number from hydro master process to thermal master process
      if (hydro_myrank .eq. 0) then
         call MPI_Send(sum_of_ranks, 1, MPI_INTEGER, 0,
     &                 0, inter_comm_handle, ierr)
      end if

ccc   OK, we've sent out our value to the thermal processes.  Now
ccc   we're going to receive something from the thermal processes.

ccc   Receive number from thermal master process, and print
      if (hydro_myrank .eq. 0) then
         call MPI_Recv(thermal_number, 1, MPI_INTEGER, 0,
     &                 0, inter_comm_handle, status, ierr)
         print *, 'Hydro0: Thermal value is: ', thermal_number
      end if
ccc   Distribute the thermal number to all the other hydro processes
      call MPI_Bcast(thermal_number, 1, MPI_INTEGER, 0, 
     &               hydro_comm_handle, ierr)

      print *, 'Hydro', hydro_myrank, ': Thermal value is - ', 
     &          thermal_number

      call MPI_Barrier(hydro_comm_handle, ierr)

      end


      subroutine thermal(therm_comm_handle, inter_comm_handle)
      implicit none
      integer therm_comm_handle,
     &        inter_comm_handle

      include 'mpif.h'

      integer therm_myrank, 
     &        therm_numprocs,
     &        thermal_number,
     &        sum_of_ranks,
     &        status(MPI_STATUS_SIZE),
     &        ierr

ccc   Find my rank and number of processes in the thermal group
      call MPI_COMM_RANK(therm_comm_handle, therm_myrank, ierr)
      print *, 'Rank number in thermal group:', therm_myrank 

      call MPI_COMM_SIZE(therm_comm_handle, therm_numprocs, ierr)
      print *, 'Number of processes in therm group:', therm_numprocs 

      call MPI_Barrier(therm_comm_handle, ierr)

ccc   Receive some value from the hydro master process, then print it.
      if (therm_myrank .eq. 0) then
         call MPI_Recv(sum_of_ranks, 1, MPI_INTEGER, 0,
     &                 0, inter_comm_handle, status, ierr)
         print *, 'Therm0: Received value is: ', sum_of_ranks
      end if

ccc   Send some arbitrary value to the hydro master process
      if (therm_myrank .eq. 0) then
         thermal_number = 1000
         call MPI_Send(thermal_number, 1, MPI_INTEGER, 0,
     &                 0, inter_comm_handle, ierr)
      end if

      call MPI_Barrier(therm_comm_handle, ierr)

      end

MPI Performance Tuning -- VAMPIR Tutorial

I recently had this e-mail exchange with an ARSC MPI user:

      > Have you tried VAMPIR yet?
      Not yet, the name is too scary.  :)

Well, that's probably true. But the little bat icons are smiling, VAMPIR is frighteningly powerful, and once bitten--I mean, smitten--your enthusiasm will be infectious.

So. What is VAMPIR?

VAMPIR (introduced in issue #129) is a tool which shows how YOUR code exchanges MPI messages. It makes pretty graphs. It shows you message bottlenecks. It helps you eliminate time wasted at barriers, receives, etc.

Don't be afraid of VAMPIR! This tutorial should help you get started:

  1. On the T3E, define these environment variables:
       PAL_LICENSEFILE=/usr/local/pkg/VAMPIRtrace/etc/license.dat
       PAL_ROOT=/usr/local/pkg/VAMPIRtrace
  2. On the T3E, recompile your MPI code with this option:
       -I/usr/local/pkg/VAMPIRtrace/include
     And link the code with these options:
       -L/usr/local/pkg/VAMPIRtrace/lib -lVT -lpmpi -lmpi
     For instance, given the source files vt.c or vt.f, you could
     compile/link with one of the commands:
       yukon$ cc vt.c -o vt -I/usr/local/pkg/VAMPIRtrace/include \
              -L/usr/local/pkg/VAMPIRtrace/lib -lVT -lpmpi -lmpi
       yukon$ f90 vt.f -o vt -I/usr/local/pkg/VAMPIRtrace/include \
              -L/usr/local/pkg/VAMPIRtrace/lib -lVT -lpmpi -lmpi
  3. Run the program on the T3E, interactively or through NQS:
       yukon$ mpprun -n6 ./vt
  4. It produces the normal output plus a ".bpv" file:
       yukon$ ls -l *.bpv
       -rw------- 1 baring staff 32563 Jul 10 09:48 vt.bpv
  5. Copy the .bpv file to the ARSC SGIs (use rcp or ftp):
       sgi$ rcp yukon:Progs/Vamptest/vt.bpv ./
  6. On the SGIs, define these environment variables:
       PAL_LICENSEFILE=/usr/local/vampir/etc/license.dat
       PAL_ROOT=/usr/local/vampir
     And add to your PATH environment variable:
       /usr/local/vampir/bin/R4K-OS5
     (Remote users also need to export the SGI display to their local
     workstation.)
  7. On the SGIs, run vampir with the .bpv file for input:
       sgi$ vampir vt.bpv
  8. Make graphs and charts, zoom in and out, etc. Below is a sample session based on a code (given at the end of this article) which does the following:
    1. "ping-pongs" messages between pairs of PEs, using send/recv.
    2. Broadcasts to all PEs, once for each PE, using Bcast.
    3. Has PE 0 send to all other PEs.
    4. Has all PEs except 0 send to PE 0.
    5. Has all PEs exit, calling Finalize.

    Here are instructions for reproducing the sample session, plus the
    graphs and charts produced:

    8.1) In the VAMPIR main menu, select:
           Global Displays:Global Timeline
           Settings:Colors
         <change the "MPI" color to red, the "Application" color to green>
         This produces the display shown in figure 1. The x-axis is
         execution time. Each message is drawn as a black line. Time
         spent in an MPI call is shown in red, application work is in
         green. Thus, red indicates non-productive time.
         Figure 1 (click on image for larger view)

    8.2) Zoom in on a segment of the run by drawing a rectangle (using
         the left mouse button) inside the global timeline graph.
         Figure 2 shows the "zoomed" graph. PE 0 sends one message, in
         PE order, to the other, blocked PEs, and then receives a message
         from each other PE, in any order. Note that each PE calls
         MPI_Finalize as soon as it has sent its message to PE 0. We
         could zoom in again and again to separate the MPI_Finalize calls
         from the send/recvs.
         Figure 2 (click on image for larger view)

    8.3) Make pie-charts, showing the percentage of time that each PE
         spent running application code versus MPI code. In the VAMPIR
         main menu, select:
           Global Displays:Chart View
         Figure 3 (click on image for larger view)

    8.4) Make histograms, showing the total number of MPI calls of each
         type, made by each PE. Inside the "chart view" window, click
         the RIGHT mouse button. In the menu which appears, select these
         options:
           mode:histogram
           count:occurrences
           display:MPI
           options:absolute scale
         Figure 4 (click on image for larger view)

    8.5) Make pie-charts, showing the percentage of time spent in
         different types of MPI calls, for each PE. Inside the "chart
         view" window (which still shows histograms), click the RIGHT
         mouse button. In the menu which appears, select these options:
           mode:pie
           count:times
         Figure 5 (click on image for larger view)
  • Some gotchas, and fixes:
    1. If VAMPIR was working, but now everything seems broken, try deleting (or moving) the directory ~/.VAMPIR_defaults.
    2. In the "global timeline" display, if you "undo zoom" too far, and lose the data off screen, click with the right mouse button to bring up a local menu, then select: Window options:Adapt

      If message lines seem hidden, click with the right mouse button to bring up a local menu, then select: Components:Message Lines

    3. Remember, you can control the display colors, and some defaults aren't too good (see step 8.1 above). Make sure you're using different colors for different types of objects.
    4. If your code doesn't produce a .bpv file, make sure all PEs are calling MPI_Finalize before they exit. Also, you can't use MPICH with VAMPIR (yet).
  • Here's the sample code:

    /* Test code for VAMPIR demonstration. */
    #include <stdio.h>
    #include "mpi.h"

    #define MAX_ORDER 100000
    #define NPASSES 4

    int main(int argc, char* argv[]) {
        int    npes;
        int    my_rank;
        int    test;
        int    min_size = 0;
        int    max_size = MAX_ORDER;
        int    incr;
        float  x[MAX_ORDER];
        int    size;
        int    pass;
        int    dest, source;
        MPI_Status  status;
        MPI_Comm    comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &npes);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_dup(MPI_COMM_WORLD, &comm);

        if (npes % 2 != 0) {
            printf ("ERROR: Even number of PEs required\n");
            goto BAILOUT;
        }

        /* Ping-pong in pairs */
        if (my_rank % 2 == 0) {
            incr = MAX_ORDER/(my_rank+1);
            for (test = 0, size = min_size;
                    size <= max_size; size = size + incr, test++) {
                printf ("Even PE %d starting ping-pong test %d\n",
                        my_rank, test);
                for (pass = 0; pass < NPASSES; pass++) {
                    MPI_Send(x, size, MPI_FLOAT, my_rank + 1, 0, comm);
                    MPI_Recv(x, size, MPI_FLOAT, my_rank + 1, 0, comm,
                             &status);
                }
            }
        }
        else {
            incr = MAX_ORDER/(my_rank);
            for (test = 0, size = min_size;
                    size <= max_size; size = size + incr, test++) {
                for (pass = 0; pass < NPASSES; pass++) {
                    MPI_Recv(x, size, MPI_FLOAT, my_rank - 1, 0, comm,
                             &status);
                    MPI_Send(x, size, MPI_FLOAT, my_rank - 1, 0, comm);
                }
            }
        }

        /* All processors broadcast */
        for (source = 0; source < npes; source++) {
            if (source == my_rank)
                printf ("PE %d initiates Bcast\n", my_rank);
            MPI_Bcast (x, max_size, MPI_FLOAT, source, comm);
        }

        /* Master sends individual messages to all */
        if (my_rank == 0) {
            printf("starting batch send from master \n");
            for (dest = 1; dest < npes; dest++) {
                MPI_Send(x, max_size, MPI_FLOAT, dest, 0, comm);
            }
        }
        else {
            MPI_Recv(x, max_size, MPI_FLOAT, 0, 0, comm, &status);
            printf ("PE %d received from 0\n", my_rank);
        }

        /* Master receives from all in any order */
        if (my_rank == 0) {
            printf("starting batch receive in any order \n");
            for (source = 1; source < npes; source++) {
                MPI_Recv(x, max_size, MPI_FLOAT, MPI_ANY_SOURCE, 0, comm,
                         &status);
            }
            printf ("Master received from all\n");
        }
        else {
            MPI_Send(x, max_size, MPI_FLOAT, 0, 0, comm);
        }

    BAILOUT:
        MPI_Finalize();
        return 0;
    }

    Web-Journal on Performance Evaluation and Modeling

    We have recently learned of the "Journal of Performance Evaluation and Modeling for Computer Systems" (PEMCS). It is at:

    and currently has these postings:

    1. PERFORM - A Fast Simulator For Estimating Program Execution Time, by Alistair Dunlop and Tony Hey, Department of Electronics and Computer Science, The University of Southampton, Southampton, SO17 1BJ, U.K.
    2. Performance Comparison of MPI, PGHPF/CRAFT and HPF Implementations of the Cholesky Factorization on the Cray T3E and IBM SP-2, by Glenn R. Luecke and Ying Li, Iowa State University, Ames, Iowa, 50011-2251, USA.
    3. Comparing the Performance of MPI on the Cray T3E-900, the Cray Origin2000 and the IBM P2SC, by Glenn R. Luecke and James J. Coyle, Iowa State University, Ames, Iowa 50011-2251, USA.
    4. EuroBen Experiences with the SGI Origin 2000 and the Cray T3E, by A.J. van der Steen, Computational Physics, Utrecht University, Holland.

    The following introduction is taken from the PEMCS site:

    Starting at the beginning of 1997, the journal aims to publish--on the Web--high quality, peer reviewed, original scientific papers alongside review articles and short notes in the rapidly developing area of performance evaluation and modeling of computer systems with special emphasis on high performance computing.

    The rush for higher and higher performance has always been one of the main goals of electronic computers. Currently, high performance computing is moving rapidly from an era of `Big Iron' to a future that will be dominated by systems built from commodity components. Very soon, users will be able to construct high-performance systems by clustering off-the-shelf processing modules using widely available high-speed communication switches. Alternatively, the World Wide Web itself represents the largest available `parallel' computer, with more than 20 million potential nodes worldwide. This makes the innovative Web technologies particularly attractive for distributed computing. Equally exciting is the goal of achieving Petaflop computing rates on real production codes.

    All this makes the performance evaluation and modeling of emerging hybrid shared/distributed memory parallel architectures with complex memory hierarchies and corresponding applications a natural area of priority for science, research and development.

    The main objectives of this journal are, therefore, to provide a focus for performance evaluation activities and to establish a flexible environment on the Web for reporting and discussing performance modeling and Petaflop computing.

    Quick-Tip Q & A

    A: {{ What's your favorite shell alias? }}
    Thanks to all!
    On the T3E it is:
    alias pss '(ps -do "ruser,pid,ppid,himem,npes,vpe,addr,time,command" \\
     | egrep -v "root | 1    0 .../....    00:00:0[0-9] " \\
     | grep "^.\{39\}[P ][E03]" \\
     | sort -k 5n,5n -k 1,1 \\
      ; date)'
    This is how it appears in my .cshrc, the \\ gets converted to \ in the
    actual alias.  It lists all user commands that have used at least 10 
    seconds of processor time, and for multi-PE commands it only lists 
    PE 0 and PE 3.
    For example:
    seymour 3> pss
     smith     92836  92590   42363   32    0 180/0000    00:02:08 sdd
     smith    113188 113186   42347   32    3 183/0000    00:04:31 sdd
     jones     89995  89985   16447   48    0  d0/0000    02:32:39 a.out
     jones     91354  91348   16457   48    0 100/0000    01:59:35 a.out
     jones    112695  89995   16447   48    3  d3/0000    02:32:28 a.out
     jones    113000  91354   16441   48    3 103/0000    01:59:53 a.out
    Fri Jun 26 21:41:38 CDT 1998
    Regarding your Q&A tips, here are my two favorite aliases:
      alias x 'xterm -sb -sk -sl 3000 -geometry 80x45 -n \!$ -title \!$  -e telnet \!$ &'
      alias quake 'finger'
    I use my "x" alias (and a similar "ssh" alias) more than any other,
    to open an xterm window on a particular machine. The ssh alias is the
    same except that "-e telnet" becomes "-e ssh". Both run xterm with
    all my preferences (e.g., scroll bars, a large number of lines in the
    scrolling buffer, the geometry of the window, the font, a window
    title, icon name, and more). Type, for example, "x yukon" (telnet) or
    "ssh yukon" (ssh) to open a session on a machine called yukon.
    The quake alias displays the latest northern California earthquake
    information.  Sites for other areas are described in the output.
    I have rmcore to get rid of corefiles.
      alias rmcore 'rm -f core'
    Note that you can say "rmcore *" and delete all files!  Danger!
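A less glob-prone variant (a sketch from the editors' side, not from the contributor) lets find locate the core files instead, so a stray "*" cannot pull unrelated files into the rm command:

```shell
# Delete files literally named "core" below the current directory.
# Nothing here depends on shell glob expansion.
find . -name core -type f -print -exec rm -f {} +
```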
      psg='ps -eaf | sort -u +0 -1'
    These two korn shell aliases extract information from "qstat -f". The
    output records look like this:
    NQS BATCH REQUEST: qsub.cmd             Status:         RUNNING 
            NQS Identifier: 21712.yukon             Target User:    smith
            Created:        Wed Jul  8 1998         Queued:         Wed Jul  8 1998
                            16:04:47 AKDT                           16:04:47 AKDT  
            Name:           xlarge@yukon               Priority:       61 
            MPP Processor Elements                        50              50
            MPP Time Limit            <28800sec>       28800sec        24365sec
    The first alias reports on all running NQS requests, the second on all
    other requests (waiting, queued, checkpointed, etc.).
      qfr='(qstat -f $(qstat -sa | egrep "^[0-9].*[R]..$" | cut -f1 -d" ") | egrep "BATCH R|MPP Time|MPP Proc|NQS Id")'
      qfq='(qstat -f $(qstat -sa | egrep "^[0-9].*[^R]..$" | cut -f1 -d" ") | egrep "BATCH R|MPP Time|MPP Proc|NQS Id")'
    Q: When I use:
         chmod -R go+rx ./
       it makes EVERYTHING group/other executable, even text files!  
       Is there a way to add the execute permissions to only those files
       that were originally executable?   (This drives me batty!)

    [ Answers, questions, and tips graciously accepted. ]

    Current Editors:
    Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
    Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
    Arctic Region Supercomputing Center
    University of Alaska Fairbanks
    PO Box 756020
    Fairbanks AK 99775-6020
    E-mail Subscriptions: Archives:
      Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.