ARSC T3E Users' Newsletter 138, March 6, 1998

On-Line T3E Tutorial

ARSC's self-guided T3E user tutorial is now on-line, at:

The tutorial is appropriate for new users, or experienced users who are novices in a particular skill area, such as MPI or SHMEM. Everyone is welcome to try some of the labs, and all feedback is appreciated.

Monte Carlo Methods: A Parallel Implementation

[[ This is the third in a four part series on Monte Carlo methods. ]]

In the previous two issues, Giray Okten introduced the history and theory behind Monte Carlo methods. The goal of the second half of this series is to introduce parallel implementation.

A parallel implementation must:

  • distribute tasks and recombine results
  • generate non-correlated sequences of "random" numbers on the processors.
  • and, ideally, produce identical results in serial and parallel execution.

The example below is a simple SPMD code consisting of two modules written in C. The main module is a front-end which handles the I/O and parallel work. It reads the user's request for the overall number of simulation events and, using a formula printed in newsletter #132:

/arsc/support/news/t3enews/t3enews132/index.xml ,

apportions these across the PEs. All PEs, including the "master," call the "simulate" module, which simulates the events and returns the result. Given that the tasks are nearly identical in size, and the work required to farm them out and recombine them is absolutely minimal, a "master-slave" approach would be of no benefit and would lead to lots of idle time on the "master." (In programs like this, where all PEs do similar work, but one of them is singled out for additional managerial activities, that PE is often referred to as the "overseer" instead of the "master.")

Once the "simulate" modules return, the PEs synchronize. The results are gathered to the "master," which computes the global average and prints a report for the user, and all PEs loop to process more simulations.

Starting with the main module, one can easily swap "simulate" modules. These can test pseudo- versus quasi-random sequences, try different schemes for avoiding correlation between sequences, and (the fun part), solve different problems.

The "simulate" module in this example computes PI using Buffon's needle experiment (see Giray's article, newsletter #136 ). As a reminder, this is a 2-dimensional model which requires two random variables for each event simulation. The example uses Cray's drand48() linear congruential pseudo-random number generator and parallelizes it using the "leap-frog" approach mentioned in newsletter #133 .

An important feature of the sample is that it produces identical results in serial or parallel runs. For this to occur, it is necessary that all PEs use the same seed value, and that elements from the pseudo-random number sequence be paired identically, whether running in serial or parallel. The following table gives an example of possible correct pairing of random values:

  event   rand vals    serial  parallel (NPES=3)
  -----   ---------    ------  -----------------
     1       r[0]        PE 0     PE 0 
     2       r[2]        PE 0     PE 1
     3       r[4]        PE 0     PE 2 
     4       r[6]        PE 0     PE 0

Each PE grabs its pair of random values for its current event, "leap-frogs" over those to be used by the other PEs, grabs the next pair, etc. Unfortunately, this is accomplished by actually computing the intervening values and throwing them away, which, for a relatively trivial computation like the buffon experiment, significantly decreases efficiency. For this reason, "leap-frog" is a poor solution for this problem, and speed-up on double the number of processors is on the order of 60%. For a simple example, however, it seemed a good place to start.

"Sequence splitting," as described in-depth in newsletter #133 would be a better solution. A quasi-random sequence, in which one can compute the n-th element in the sequence without computing (and throwing away) the n-1 preceding elements, would also be better. (Reread Giray's articles for more on quasi-random sequences.)

Here is a sample run of the program. The parameters specify 10 reports per run, 100,000 events per report, for a total of 1,000,000 events, and a random seed of 1. A run on one processor ("mpprun -n1") is, again, identical.

  yukon$ mpprun -n7 ./pbuffon 100000 10 1
  [0] 100000 31825.000000: avg:  0.318250000
  [1] 200000 63713.000000: avg:  0.318565000
  [2] 300000 95753.000000: avg:  0.319176667
  [3] 400000 127498.000000: avg:  0.318745000
  [4] 500000 159238.000000: avg:  0.318476000
  [5] 600000 191120.000000: avg:  0.318533333
  [6] 700000 223020.000000: avg:  0.318600000
  [7] 800000 254518.000000: avg:  0.318147500
  [8] 900000 286358.000000: avg:  0.318175556
  [9] 1000000 318063.000000: avg:  0.318063000

To estimate PI from this simulation, one simply substitutes the result from the experiment (0.3180630) into the analytical solution (see newsletter #136 ):

    P = (2 * L) / (PI * D).

and solves for PI.

This yields an estimate for PI of: 3.14403.

This is certainly not the best way to estimate PI, but it's of historical interest, as a number of early experimenters did indeed conduct the physical version of this experiment, for that purpose.

Here's the C-code, starting with the main module. The "simulate" module starts with the comment: "FILE: pbuffon.c".

FILE: pmc.c

  Parallel front end for simple monte-carlo simulation functions.

  Sets up an environment on the Cray T3E to run a single function
  on a variable number of processors and accumulate and print results
  in stages.

  The function, called "simulate", is expected to make the number 
  of attempts, or event simulations prescribed by user arguments.
  It must return the total value, not the average, of "successes." 
  For example, "simulate" might return the value 100 after 600 
  throws of a single die, if a success were defined as a "1" coming 
  up.   All averaging is done later in this main module, after 
  results from all processors are accumulated.
  The simulate function must itself obtain independent sequences
  of "random" values on each PE.  This front end passes through 
  the PE number, total PEs, and a user supplied "seed" which may be 

  1) Total number of trials to make per step (or line of output).
  2) Total number of steps.
  3) Random seed.  

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <math.h>

/* DIE quits all PEs providing a bit of feedback. */
#define DIE(str,mype) error_fatal(__LINE__,__FILE__,str,mype)

void error_fatal(int line, char *file, char *msg, int mype );
int simulate (int mype, int totpes, int usrSeed, int nruns, double *resTot);

main (int argc, char **argv) {
  int mype,totpes,mproc;
  int err,usrSeed;
  int myRunsPerRep,runsPerRep,nReps,penum,subtot;
  int repNum,runsDone; 
  int *nrunsArr,nrunsAccum;
  double resTotal, *resArr,resAccum,resAvg;
  int n;

  /* Set up MPI */
  if (MPI_Init(&argc,&argv) != 0) 
    DIE ("MPI_Init err.",mype);
  if (MPI_Comm_rank(MPI_COMM_WORLD, &mype) != 0) 
    DIE ("MPI_Comm_rank err.",mype);

  if (MPI_Comm_size(MPI_COMM_WORLD, &totpes) != 0) 
    DIE ("MPI_Comm_size err.",mype);

  /* define master PE */

  /* verify correct number command line args */
  if (mype==mproc && argc!=4)
    DIE ("SYNTAX ERR, EXPECTED: runsPerRep nReps usrSeed",mype);

  /* All processors synchronize */
  if (MPI_Barrier (MPI_COMM_WORLD) != 0) 
    DIE ("MPI_Barrier err.",mype);

  sscanf (argv[1], "%d", &runsPerRep);
  sscanf (argv[2], "%d", &nReps);
  sscanf (argv[3], "%d", &usrSeed);

  /* Each PE Calculates number of actual runs it will make per report */
  myRunsPerRep = (runsPerRep + totpes - mype - 1) / totpes;

  /* Master creates arrays for gathering results */
  if (mype==mproc) {
    if ((nrunsArr = (int *) malloc (totpes * sizeof (int))) == NULL) 
      DIE ("Malloc failed",mype);

    if ((resArr = (double *) malloc (totpes * sizeof (double))) == NULL) 
      DIE ("Malloc failed",mype);

    /* Initialize result accumulators */
    resAccum = 0.0;
    nrunsAccum = 0;

  /* Iterate through requeste reports */
  for (repNum=0; repNum<nReps; repNum++)  {

    /* Each PE does its share of the simulations for the 
    *  current report */
    runsDone = simulate (mype, totpes, usrSeed, myRunsPerRep, &resTotal);

    /* Gather results back to master */
    if (MPI_Barrier (MPI_COMM_WORLD) != 0) 
      DIE ("MPI_Barrier err.",mype);
    if (MPI_Gather (&runsDone,1,MPI_INT,
        nrunsArr,1,MPI_INT,mproc,MPI_COMM_WORLD) != 0) 
      DIE ("MPI_Gather err.",mype);

    if (MPI_Gather (&resTotal,1,MPI_DOUBLE,
       resArr,1,MPI_DOUBLE,mproc,MPI_COMM_WORLD) != 0) 
      DIE ("MPI_Gather err.",mype);

    /* master computes totals and prints report */
    if (mype==mproc) {

      for (n=0; n<totpes; n++) {
        resAccum +=   *(resArr + n);
        nrunsAccum += *(nrunsArr + n);

      resAvg = resAccum / nrunsAccum;
      printf ("[%d] %d %lf: avg: %12.9lf\n", 
                        repNum, nrunsAccum, resAccum, resAvg);

  /* All done... */
  if (MPI_Finalize() != 0) 
    DIE ("MPI_Finalize err.",mype);

error_fatal:    cleanly exits all PEs, prints diagnostic info.
error_fatal(int line, char *file, char *msg,int mype )
  fprintf (stderr,
    "FATAL ERROR in file: %s line: %d PE: %d  message:\n*****\"%s\"\n",
    file, line, mype, msg);

  fflush (stderr);

  globalexit (1);

FILE:  pbuffon.c

  Use simulation to try to obtain a solution to the Buffon needle problem.  
  A needle of length l is tossed into a grid of parallel lines with spacing
  d, and the probability of an intersection computed.


  This model assumes: 
  d = 2*l, and uses a length, l, of 1.
  Simulation method:
  The displacement of the center of the needle from the closest line
  is chosen from a random distribution on the range [0..1].  Then 
  the angle theta of the needle with respect to a perpendicular to the
  parallel lines is chosen, from the range PI/2.  A "success" occurs
  when if the intersection of the resulting line and the parallel grid
  line is at a distance less than half the length of the needle.

  "Simulate" function intended as computation module for "pmc.c".
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <math.h>

#pragma _CRI inline (get_next_rand_pair)

/* globals variables to communicate pseudo-random numbers. One
   per dimension. */
static double rand1, rand2;

/* Seed the Cray pseudo-random number generator. */
void seed_generator (int usrSeed) {
  unsigned short seed[3];

  seed[0] = seed[1] = seed[2] = usrSeed;
  (void) seed48 (seed); 

/* Each PE extracts adjacent pairs of pseudo-random numbers from the sequence
*  and throws away the pairs used by the other PEs.  This function must 
*  be called first and only once.  It advances to mype's first  pair */
void skip_to_first_rand_pair (int mype) { 
  int n;  
  for (n = 0; n < mype ; n++)  {
    (void) drand48();
    (void) drand48();

/* Assumes that "skip_to_first_rand_pair" has been called.  Returns the 
*  next pair of numbers from the pseudo-random sequence, and advances
*  to the next pair.  */
void get_next_rand_pair (int totpes) {
  int n;  
  double my_rand;

  rand1 = drand48();
  rand2 = drand48();

  for (n = 0; n < totpes - 1; n++) {
    (void) drand48();
    (void) drand48();



  [input]  mype, totpes: can run on a serial machine... set to 0 and 1.
  [input]  usrSeed: All PEs use same seed.
  [input]  nruns:  Total number of attempts to make.
  [output] resTot: Total number of successes.

  Total number of attempts (should equal input value of nruns). 
int simulate (int mype, int totpes, int usrSeed, int nruns, double *resTot) {
  static int doInit = 1;
  int runN;

  /* On the first call to simulate, initialize the psuedo-random number
  *  sequence.  */
  if (doInit == 1) {
    seed_generator (usrSeed);
    skip_to_first_rand_pair (mype);
    doInit = 0;

  for (runN=0; runN<nruns; runN++) {
    double h, theta;

    get_next_rand_pair (totpes); 

    h = rand1;
    theta = rand2 * M_PI / 2.0;

    if (h <= cos (theta) * (1.0/2.0))
      *resTot += 1.0;

  return nruns;

Next Week: "Monte Carlo Methods: Scientific Applications and Advanced MPI Support"


[[ We're forwarding this excerpt from a message received from Greg Morrow and Robert van de Geijn "for the PLAPACK team." PLAPACK is not installed at ARSC, but interested ARSC users are encouraged to submit a software request form:

or e-mail " ". ]]


 During the last two years the PLAPACK project at UT-Austin has

 developed an MPI based Parallel Linear Algebra Package (PLAPACK)

 designed to provide a user friendly infrastructure for building

 parallel dense linear algebra libraries.  Now that this infrastructure

 has stabilized, we have started developing libraries.  The first

 solver available, PLA_Gesv, solves general linear systems.


 PLAPACK provides a high level abstraction for coding dense linear

 algebra algorithms.  It is often believed that such a high level

 abstraction comes at the price of lower performance.  We are happy to

 report that while this is temporarily true for smaller problem sizes

 (where the fact that we have not yet optimized the infrastructure is

 still noticeable), the higher level of abstraction allows for more

 sophisticated algorithms to be implemented, which attain higher

 performance for larger problem sizes.


 More information: 


 The PLAPACK Users' Guide is available from The MIT Press:

        Robert van de Geijn, 

        "Using PLAPACK: Parallel Linear Algebra Package",

        The MIT Press, 1997

Quick-Tip Q & A

A: {{ You use "qsub my_baby" to submit a job to NQS... 
      and it vanishes! }}

   If the normal NQS channels seem broken, check e-mail on the T3E
   host.  NQS sends e-mail (all by itself) !!! A common situation is
   when your request exceeds the restrictions on all queues.

   Here is a quick NQS tutorial.  Issuing this command:

       t3e%  qstat

   shows all queue names.  Executing:

       t3e%  qstat -f <queue-name>

   shows the restrictions on the named queue.  On yukon, the "grand"
   queue is the largest (though not the longest), and the command:

        yukon%  qstat -f grand

   shows (among other things):

        ----------------------   -----------     -----------
        MPP Processor Elements:                    90
        MPP Time Limit:             14400sec       14400sec
   If, on yukon, I made this (unsatisfiable) request:

        qsub -q mpp -l mpp_p=128  -l mpp_t=4:00:00   ./my_baby
   NQS would e-mail the (indecipherable) job log and this (reasonably
   articulate) explanation:
        The following request limit(s) are too large
        for any destination queue:
          Per-request MPP processor element limit;
          Last destination queue attempted: <grand@yukon>
   This indicates that NQS tested my request against every destination
   queue serviced by the pipe queue, "mpp."  The last queue it
   attempted was "grand@yukon,"  and my "Per-request MPP processor
   element limit" of 128 PEs was too large for any of the queues.

   If you get a message like this, and don't understand it, forward it to
   your center's help desk. 

Q: You've ftp'd a file to the Cray from your PC.  There is a load of ^M
   characters at the ends of the lines.  How can you get rid of these?

[ Answers, questions, and tips graciously accepted. ]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
Back to Top