ARSC T3E Users' Newsletter 146, July 10, 1998
MPI Communicators for Coupled T3E Codes
[ Thanks to Don Morton of the University of Montana for this contribution. Don is a visiting researcher, at ARSC for his fifth summer. ]
Sometimes we find it advantageous to combine pieces of computer models in order to construct more realistic representations of the real world.
For example, researchers at UAF's Water Research Center have developed both a thermal and a hydrologic model for simulating various physical processes in the arctic.
The thermal model predicts soil temperatures and thaw depths (depth from soil surface to the underlying permafrost) in a watershed as a function of various meteorological data, and, among other things, available soil moisture. The hydrologic model predicts runoff and soil moisture storage as a function of various meteorological data, and, among other things, the thaw depth. The thermal model currently estimates soil moisture storage as a constant in time and space, clearly neglecting the heterogeneous nature of this parameter. Likewise, the hydrologic model estimates thaw depth empirically as a function of the day of the year, assigning a single value to the entire spatial domain, again failing to capture any heterogeneity.
Coupling of these two models, by having the thermal model send accurate thaw depth data to the hydrologic model, and having the hydrologic model send accurate soil moisture storage data to the thermal model, would capture the inherent heterogeneity while more closely representing the various feedback loops that exist in the real-world system.
Both models had been parallelized for Cray MPP architectures, and the goal was to construct a system in which both parallel models would execute simultaneously on the same architecture, periodically communicating with each other to exchange coupling data. The coupled system was intended to run on the Cray T3E, necessitating a Single Program Multiple Data (SPMD) paradigm, in which identical copies of the executable were loaded on each PE. Consolidating two separate codes into a single code is usually accomplished by transforming each of the original programs into subroutines, then constructing a driving program which calls the subroutines. This has been discussed in previous newsletters.
Because both codes had already been parallelized, it was desirable that each code continue to run as before, but this time simultaneously with periodic communication between the two codes. This was facilitated through the use of various MPI constructs that allow for creating separate communications groups and for creating an "intercommunicator" to send messages between processes in separate groups. A sample code follows, illustrating the basic infrastructure needed to have two parallel codes execute simultaneously with periodic exchange of data.
The code was constructed from two separate parallel programs - "program hydro" and "program thermal." When run independently, each code utilized the default MPI_COMM_WORLD communicator to send messages among the processes.
In order to create an SPMD version which would include both codes, each program was transformed into a subroutine and a higher-level driving program was created to call the appropriate subroutine, depending on processor number. Additionally, new communication groups were created, and the MPI_COMM_WORLD references in the MPI subroutine calls were replaced with the new communicator for the group. Finally, an intercommunicator was also created, allowing the master processes in each group to communicate with each other.
Once the groups and communicators have been set up, the appropriate hydro or thermal subroutine is called. The hydro and thermal subroutines illustrate some basic communication between processors in the same group, and between the two groups. In the hydro subroutine, each hydro process contributes its rank to the master hydro process, which sums these values (via MPI_Reduce) and sends the total to the thermal master process in the other group. The hydro master process then waits for a message from the thermal master process and, once it arrives, broadcasts it to the rest of the hydro processes.
Note - additional background information on this work can be found at:
http://www.arsc.edu/~morton/geocomp98
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
C
C Don Morton, June 1998
C Department of Computer Science
C The University of Montana
C Missoula, Montana 59812, USA
C Email: morton@cs.umt.edu
C
C This is a sample code which illustrates the basic
C mechanism used to couple two previously existing
C parallel codes.
C
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
program couple
implicit none
include 'mpif.h'
integer myid, ! Rank of a process
& numprocs, ! Total number processes
& i, ! Loops counter
& rc, ierr ! Return values for MPI subroutines
integer
& hydro_numprocs, ! Number processes for hydro code
& therm_numprocs, ! Number processes for thermal code
& ranks_in_old_group(0:127) ! list of ranks in old group
integer !MPI_Comm -- handles for MPI communicators
& world_comm_handle,
& hydro_comm_handle,
& therm_comm_handle,
& inter_comm_handle
integer !MPI_Group -- handles for MPI groups
& world_group_handle,
& hydro_group_handle,
& therm_group_handle
ccc Initialization of each process into MPI_COMM_WORLD
call MPI_INIT(ierr)
ccc Find out what my processor number is
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
ccc Find out how many total processors are running
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
print *, 'PE ', myid, ': Hello, World'
ccc Wait here until everyone has reached this point
call MPI_Barrier(MPI_COMM_WORLD, ierr)
ccc Assign roughly half of total processors to each of the two models
hydro_numprocs = numprocs/2
therm_numprocs = numprocs - hydro_numprocs
ccc Obtain my "group" handle for use in creating new groups
call MPI_Comm_group(MPI_COMM_WORLD, world_group_handle, ierr)
ccc Create hydro group and communicator cccccccccccccccccccccc
ccc First, get a list of the ranks I held in the "global" group
do i=0,hydro_numprocs-1
ranks_in_old_group(i) = i
enddo
ccc Set up hydro processes and create a handle for the new hydro group
call MPI_Group_incl(world_group_handle, hydro_numprocs,
& ranks_in_old_group, hydro_group_handle, ierr)
call MPI_Comm_create(MPI_COMM_WORLD, hydro_group_handle,
& hydro_comm_handle, ierr)
ccc Obtain my "group" handle for use in creating new groups
call MPI_Comm_group(MPI_COMM_WORLD, world_group_handle, ierr)
ccc Set up thermal processes and create a handle for the new thermal group
do i=0,therm_numprocs-1
ranks_in_old_group(i) = i+hydro_numprocs
enddo
call MPI_Group_incl(world_group_handle, therm_numprocs,
& ranks_in_old_group, therm_group_handle, ierr)
call MPI_Comm_create(MPI_COMM_WORLD, therm_group_handle,
& therm_comm_handle, ierr)
ccc create intercommunicator and then proceed with hydro and thermal codes
if (myid .lt. hydro_numprocs) then ! I am a hydro process
call MPI_Intercomm_create(hydro_comm_handle, 0, MPI_COMM_WORLD,
& hydro_numprocs, 0, inter_comm_handle, ierr)
call hydro(hydro_comm_handle, inter_comm_handle)
else ! I am a thermal process
call MPI_Intercomm_create(therm_comm_handle, 0, MPI_COMM_WORLD,
& 0, 0, inter_comm_handle, ierr)
call thermal(therm_comm_handle, inter_comm_handle)
endif
call MPI_Barrier(MPI_COMM_WORLD, ierr)
call MPI_FINALIZE(rc)
end
ccc==============================================================
subroutine hydro(hydro_comm_handle, inter_comm_handle)
implicit none
integer hydro_comm_handle, ! "handles" used for communication
& inter_comm_handle
include 'mpif.h'
integer hydro_myrank,
& hydro_numprocs,
& sum_of_ranks,
& thermal_number,
& status(MPI_STATUS_SIZE),
& ierr
ccc Find my rank and number of processes in the hydro group
call MPI_COMM_RANK(hydro_comm_handle, hydro_myrank, ierr)
print *, 'Rank number in hydro group: ', hydro_myrank
call MPI_COMM_SIZE(hydro_comm_handle, hydro_numprocs, ierr)
print *, 'Number of processes in hydro group: ', hydro_numprocs
call MPI_Barrier(hydro_comm_handle, ierr)
ccc Gather some numbers from all the hydro processes, and "do
ccc something" with them to form a single number. In this
ccc trivial example, each process simply sends its processor
ccc number to the master hydro process, which in turn adds them
ccc all together to produce a single value
call MPI_Reduce(hydro_myrank, sum_of_ranks, 1, MPI_INTEGER,
& MPI_SUM, 0, hydro_comm_handle, ierr)
ccc The hydro master process, having obtained and summed all the
ccc numbers from the other hydro processes, prints the value
if (hydro_myrank .eq. 0) then
print *, 'Hydro0: Reduced value is: ', sum_of_ranks
endif
ccc Send number from hydro master process to thermal master process
if (hydro_myrank .eq. 0) then
call MPI_Send(sum_of_ranks, 1, MPI_INTEGER, 0,
& 0, inter_comm_handle, ierr)
endif
ccc OK, we've sent out our value to the thermal processes. Now
ccc we're going to receive something from the thermal processes.
ccc Receive number from thermal master process, and print
if (hydro_myrank .eq. 0) then
call MPI_Recv(thermal_number, 1, MPI_INTEGER, 0,
& 0, inter_comm_handle, status, ierr)
print *, 'Hydro0: Thermal value is: ', thermal_number
endif
ccc Distribute the thermal number to all the other hydro processes
call MPI_Bcast(thermal_number, 1, MPI_INTEGER, 0,
& hydro_comm_handle, ierr)
print *, 'Hydro', hydro_myrank, ': Thermal value is - ',
& thermal_number
call MPI_Barrier(hydro_comm_handle, ierr)
return
end
ccc==============================================================
subroutine thermal(therm_comm_handle, inter_comm_handle)
implicit none
integer therm_comm_handle,
& inter_comm_handle
include 'mpif.h'
integer therm_myrank,
& therm_numprocs,
& thermal_number,
& sum_of_ranks,
& status(MPI_STATUS_SIZE),
& ierr
ccc Find my rank and number of processes in the thermal group
call MPI_COMM_RANK(therm_comm_handle, therm_myrank, ierr)
print *, 'Rank number in thermal group:', therm_myrank
call MPI_COMM_SIZE(therm_comm_handle, therm_numprocs, ierr)
print *, 'Number of processes in therm group:', therm_numprocs
call MPI_Barrier(therm_comm_handle, ierr)
ccc Receive some value from the hydro master process, then print it.
if (therm_myrank .eq. 0) then
call MPI_Recv(sum_of_ranks, 1, MPI_INTEGER, 0,
& 0, inter_comm_handle, status, ierr)
print *, 'Therm0: Received value is: ', sum_of_ranks
endif
ccc Send some arbitrary value to the hydro master process
if (therm_myrank .eq. 0) then
thermal_number = 1000
call MPI_Send(thermal_number, 1, MPI_INTEGER, 0,
& 0, inter_comm_handle, ierr)
endif
call MPI_Barrier(therm_comm_handle, ierr)
return
end
MPI Performance Tuning -- VAMPIR Tutorial
I recently had this e-mail exchange with an ARSC MPI user:
ARSC:
> Have you tried VAMPIR yet?
MPI USER:
Not yet, the name is too scary. :)
Well, that's probably true. But the little bat icons are smiling, VAMPIR is frighteningly powerful, and once bitten--I mean, smitten--your enthusiasm will be infectious.
So. What is VAMPIR?
VAMPIR (introduced in issue #129) is a tool which shows how YOUR code exchanges MPI messages. It makes pretty graphs. It shows you message bottlenecks. It helps you eliminate time wasted at barriers, receives, etc.
Don't be afraid of VAMPIR! This tutorial should help you get started:
1. On the T3E, define these environment variables:
   PAL_LICENSEFILE=/usr/local/pkg/VAMPIRtrace/etc/license.dat
   PAL_ROOT=/usr/local/pkg/VAMPIRtrace
2. On the T3E, recompile your MPI code with this option:
   -I/usr/local/pkg/VAMPIRtrace/include
   And link the code with these options:
   -L/usr/local/pkg/VAMPIRtrace/lib -lVT -lpmpi -lmpi
   For instance, given the source files vt.c or vt.f, you could compile/link with one of the commands:
   yukon$ cc vt.c -o vt -I/usr/local/pkg/VAMPIRtrace/include \
     -L/usr/local/pkg/VAMPIRtrace/lib -lVT -lpmpi -lmpi
   yukon$ f90 vt.f -o vt -I/usr/local/pkg/VAMPIRtrace/include \
     -L/usr/local/pkg/VAMPIRtrace/lib -lVT -lpmpi -lmpi
3. Run the program on the T3E, interactively or through NQS:
   yukon$ mpprun -n6 ./vt
4. It produces the normal output plus a ".bpv" file:
   yukon$ ls -l *.bpv
   -rw------- 1 baring staff 32563 Jul 10 09:48 vt.bpv
5. Copy the .bpv file to the ARSC SGIs (use rcp or ftp):
   sgi$ rcp yukon:Progs/Vamptest/vt.bpv ./
6. On the SGIs, define these environment variables:
   PAL_LICENSEFILE=/usr/local/vampir/etc/license.dat
   PAL_ROOT=/usr/local/vampir
   And add to your PATH environment variable:
   /usr/local/vampir/bin/R4K-OS5
   (Remote users also need to export the SGI display to their local workstation.)
7. On the SGIs, run vampir with the .bpv file for input:
   sgi$ vampir vt.bpv
8. Make graphs and charts, zoom in and out, etc.
Below is a sample session based on a code (given at the end of this article) which does the following:
- "ping-pongs" messages between pairs of PEs, using send/recv.
- Broadcasts to all PEs, once for each PE, using Bcast.
- Has PE 0 send to all other PEs.
- Has all PEs except 0 send to PE 0.
- Has all PEs exit, calling Finalize.
Here are instructions for reproducing the sample session, plus the graphs and charts produced:
8.1) In the VAMPIR main menu, select:
Global Displays:Global Timeline
Settings:Colors
<change the "MPI" color to red, the "Application" color to green>
This produces the display shown in figure 1. The x-axis is execution time. Each message is drawn as a black line. Time spent in an MPI call is shown in red, application work is in green. Thus, red indicates non-productive time.
[ Figure 1 ]
8.2) Zoom in on a segment of the run by drawing a rectangle (using the left mouse button) inside the global timeline graph.
Figure 2 shows the "zoomed" graph. PE 0 sends one message, in PE order, to the other, blocked PEs, and then receives a message from each other PE, in any order. Note that each PE calls MPI_Finalize as soon as it has sent its message to PE 0. We could zoom in again and again to separate the MPI_Finalize calls from the send/recvs.
- If VAMPIR was working, but now everything seems broken, you might delete (or move) the directory ~/.VAMPIR_defaults.
- In the "global timeline" display, if you "undo zoom" too far, and lose the data off screen, click with the right mouse button to bring up a local menu, then select:
  Window options:Adapt
  If message lines seem hidden, click with the right mouse button to bring up a local menu, then select:
  Components:Message Lines
- Remember, you can control the display colors, and some defaults aren't too good (see step 8.1 above). Make sure you're using different colors for different types of objects.
- If your code doesn't produce a .bpv file, make sure all PEs are calling MPI_Finalize before they exit. Also, you can't use MPICH with VAMPIR (yet).
Here's the sample code:
/*============================================================
Test code for VAMPIR demonstration.
============================================================*/
#include <stdio.h>
#include "mpi.h"
#define MAX_ORDER 100000
#define NPASSES 4
int main(int argc, char* argv[]) {
int npes;
int my_rank;
int test;
int flag;
int min_size = 0;
int max_size = MAX_ORDER;
int incr;
float x[MAX_ORDER];
int size;
int pass;
int dest, source;
MPI_Status status;
int i;
MPI_Comm comm;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_dup(MPI_COMM_WORLD, &comm);
if (npes % 2 != 0) {
printf ("ERROR: Even number of PEs required\n");
goto BAILOUT;
}
MPI_Barrier(MPI_COMM_WORLD);
/* Ping-pong in pairs */
if (my_rank % 2 == 0) {
incr = MAX_ORDER/(my_rank+1);
for (test = 0, size = min_size;
size <= max_size; size = size + incr, test++) {
printf ("Even PE %d starting ping-pong test %d\n", my_rank, test);
for (pass = 0; pass < NPASSES; pass++) {
MPI_Send(x, size, MPI_FLOAT, my_rank + 1, 0, comm);
MPI_Recv(x, size, MPI_FLOAT, my_rank + 1, 0, comm,
&status);
}
}
}
else {
incr = MAX_ORDER/(my_rank);
for (test = 0, size = min_size;
size <= max_size; size = size + incr, test++) {
for (pass = 0; pass < NPASSES; pass++) {
MPI_Recv(x, size, MPI_FLOAT, my_rank - 1, 0, comm,
&status);
MPI_Send(x, size, MPI_FLOAT, my_rank - 1, 0, comm);
}
}
}
/* All processors broadcast */
for (source = 0; source < npes; source++) {
if (source == my_rank)
printf ("PE %d initiates Bcast\n", my_rank);
MPI_Bcast (x, max_size, MPI_FLOAT, source, comm);
}
/* Master sends individual messages to all */
if (my_rank == 0) {
printf("starting batch send from master \n");
for (dest = 1; dest < npes; dest++) {
MPI_Send(x, max_size, MPI_FLOAT, dest, 0, comm);
}
}
else {
MPI_Recv(x, max_size, MPI_FLOAT, 0, 0, comm, &status);
printf ("PE %d received from 0\n", my_rank);
}
/* Master receives from all in any order */
if (my_rank == 0) {
printf("starting batch receive in any order \n");
for (source = 1; source < npes; source++) {
MPI_Recv(x, max_size, MPI_FLOAT, MPI_ANY_SOURCE, 0, comm, &status);
}
printf ("Master received from all\n");
}
else {
MPI_Send(x, max_size, MPI_FLOAT, 0, 0, comm);
}
BAILOUT:
MPI_Finalize();
}
Web-Journal on Performance Evaluation and Modeling
We have recently learned of the "Journal of Performance Evaluation and Modeling for Computer Systems" (PEMCS). It is at:
http://hpc-journals.ecs.soton.ac.uk/PEMCS/
and currently has these postings:
- PERFORM - A Fast Simulator For Estimating Program Execution Time, by Alistair Dunlop and Tony Hey, Department of Electronics and Computer Science, The University of Southampton, Southampton, SO17 1BJ, U.K.
- Performance Comparison of MPI, PGHPF/CRAFT and HPF Implementations of the Cholesky Factorization on the Cray T3E and IBM SP-2, by Glenn R. Luecke and Ying Li, Iowa State University, Ames, Iowa, 50011-2251, USA.
- Comparing The Performance of MPI on the Cray T3E-900, The Cray Origin2000 And The IBM P2SC, by Glenn R. Luecke and James J. Coyle, Iowa State University, Ames, Iowa 50011-2251, USA.
- EuroBen Experiences with the SGI Origin 2000 and the Cray T3E, by A.J. van der Steen, Computational Physics, Utrecht University, Holland.
The following introduction is taken from the PEMCS site:
Starting at the beginning of 1997, the journal aims to publish--on the Web--high quality, peer reviewed, original scientific papers alongside review articles and short notes in the rapidly developing area of performance evaluation and modeling of computer systems with special emphasis on high performance computing.
The rush for higher and higher performance has always been one of the main goals of electronic computers. Currently, high performance computing is moving rapidly from an era of `Big Iron' to a future that will be dominated by systems built from commodity components. Very soon, users will be able to construct high-performance systems by clustering off-the-shelf processing modules using widely available high-speed communication switches. Alternatively, the World Wide Web itself represents the largest available `parallel' computer, with more than 20 million potential nodes worldwide. This makes the innovative Web technologies particularly attractive for distributed computing. Equally exciting is the goal to achieving Petaflop computing rates on real production codes.
All this makes the performance evaluation and modeling of emerging hybrid shared/distributed memory parallel architectures with complex memory hierarchies and corresponding applications a natural area of priority for science, research and development.
The main objectives of this journal are, therefore, to provide a focus for performance evaluation activities and to establish a flexible environment on the Web for reporting and discussing performance modeling and Petaflop computing.
Quick-Tip Q & A
A: {{ What's your favorite shell alias? }}
Thanks to all!
---
On the T3E it is:
alias pss '(ps -do "ruser,pid,ppid,himem,npes,vpe,addr,time,command" \\
| egrep -v "root|daemon|<defunct>| 1 0 .../.... 00:00:0[0-9] " \\
| grep "^.\{39\}[P ][E03]" \\
| sort -k 5n,5n -k 1,1 \\
; date)'
This is how it appears in my .cshrc, the \\ gets converted to \ in the
actual alias. It lists all user commands that have used at least 10
seconds of processor time, and for multi-PE commands it only lists
PE 0 and PE 3.
For example:
seymour 3> pss
RUSER PID PPID HIMEM NPES VPE ADDRESS TIME COMMAND
smith 92836 92590 42363 32 0 180/0000 00:02:08 sdd
smith 113188 113186 42347 32 3 183/0000 00:04:31 sdd
jones 89995 89985 16447 48 0 d0/0000 02:32:39 a.out
jones 91354 91348 16457 48 0 100/0000 01:59:35 a.out
jones 112695 89995 16447 48 3 d3/0000 02:32:28 a.out
jones 113000 91354 16441 48 3 103/0000 01:59:53 a.out
Fri Jun 26 21:41:38 CDT 1998
---
Regarding your Q&A tips, my favorite 2 aliases:
alias x 'xterm -sb -sk -sl 3000 -geometry 80x45 -n \!$ -title \!$ -e telnet \!$ &'
alias quake 'finger quake@andreas.wr.usgs.gov'
I use my "x" (and a similar "ssh" alias) alias more than any other alias, to
open an xterm window on a particular machine. My ssh alias is similar, where
I change my x alias "-e telnet" to "-e ssh". The x and ssh aliases run xterm
with all my preferences (e.g., scroll bars, a large number of lines in the
scrolling buffer, the geometry of the window, the font, a window title, icon
name, and more). It is used by typing, for example, "x yukon" (telnet) or
"ssh yukon" (ssh) to open a session on a machine called yukon.
The quake alias displays the latest northern California earthquake
information. Sites for other areas are described in the output.
---
I have rmcore to get rid of corefiles.
alias rmcore rm -f core
note you can say rmcore * and delete all files! Danger!
---
whom='who | sort -u +0 -1'
psg='ps -eaf | egrep'
---
These two korn shell aliases extract information from "qstat -f". The
output records look like this:
NQS 3.2.1.4 BATCH REQUEST: qsub.cmd Status: RUNNING
NQS Identifier: 21712.yukon Target User: smith
Created: Wed Jul 8 1998 Queued: Wed Jul 8 1998
16:04:47 AKDT 16:04:47 AKDT
Name: xlarge@yukon Priority: 61
MPP Processor Elements 50 50
MPP Time Limit <28800sec> 28800sec 24365sec
The first alias reports on all running NQS requests, the second on all
other requests (waiting, queued, checkpointed, etc.).
qfr='(qstat -f $(qstat -sa | egrep "^[0-9].*[R]..$" | cut -f1 -d" ") |
egrep "BATCH R|MPP Time|MPP Proc|Created|:..:|NQS Id|Name:")'
qfq='(qstat -f $(qstat -sa | egrep "^[0-9].*[^R]..$" | cut -f1 -d" ") |
egrep "BATCH R|MPP Time|MPP Proc|Created|:..:|NQS Id|Name:")'
Q: When I use:
chmod -R go+rx ./
it makes EVERYTHING group/other executable, even text files!
Is there a way to add the execute permissions to only those files
that were originally executable? (This drives me batty!)
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
