ARSC T3E Users' Newsletter 138, March 6, 1998
OnLine T3E Tutorial
ARSC's self-guided T3E user tutorial is now online, at:
http://www.arsc.edu/support/howtos/t3e/T3ETutorial.html
The tutorial is appropriate for new users, or experienced users who are novices in a particular skill area, such as MPI or SHMEM. Everyone is welcome to try some of the labs, and all feedback is appreciated.
Monte Carlo Methods: A Parallel Implementation
[[ This is the third in a four-part series on Monte Carlo methods. ]]
In the previous two issues, Giray Okten introduced the history and theory behind Monte Carlo methods. The goal of the second half of this series is to introduce parallel implementation.
A parallel implementation must:
  - distribute tasks and recombine results,
  - generate non-correlated sequences of "random" numbers on the processors,
  - and, ideally, produce identical results in serial and parallel execution.
The example below is a simple SPMD code consisting of two modules written in C. The main module is a front end which handles the I/O and parallel work. It reads the user's request for the overall number of simulation events and, using a formula printed in newsletter #132:
/arsc/support/news/t3enews/t3enews132/index.xml ,
apportions these across the PEs. All PEs, including the "master," call the "simulate" module, which simulates the events and returns the result. Given that the tasks are nearly identical in size, and the work required to farm them out and recombine them is absolutely minimal, a "master-slave" approach would be of no benefit and would lead to lots of idle time on the "master." (In programs like this, where all PEs do similar work, but one of them is singled out for additional managerial activities, that PE is often referred to as the "overseer" instead of the "master.")
Once the "simulate" modules return, the PEs synchronize. The results are gathered to the "master," which computes the global average and prints a report for the user, and all PEs loop to process more simulations.
Starting with the main module, one can easily swap "simulate" modules. These can test pseudo- versus quasi-random sequences, try different schemes for avoiding correlation between sequences, and (the fun part), solve different problems.
The "simulate" module in this example computes PI using Buffon's needle experiment (see Giray's article, newsletter #136). As a reminder, this is a 2-dimensional model which requires two random variables for each event simulation. The example uses Cray's drand48() linear congruential pseudorandom number generator and parallelizes it using the "leapfrog" approach mentioned in newsletter #133.
An important feature of the sample is that it produces identical results in serial or parallel runs. For this to occur, it is necessary that all PEs use the same seed value, and that elements from the pseudorandom number sequence be paired identically, whether running in serial or parallel. The following table gives an example of possible correct pairing of random values:
  event   rand vals   serial   parallel (NPES=3)
  -----   ---------   ------   -----------------
    1     r[0]        PE 0     PE 0
          r[1]
    2     r[2]        PE 0     PE 1
          r[3]
    3     r[4]        PE 0     PE 2
          r[5]
    4     r[6]        PE 0     PE 0
          r[7]
Each PE grabs its pair of random values for its current event, "leapfrogs" over those to be used by the other PEs, grabs the next pair, etc. Unfortunately, this is accomplished by actually computing the intervening values and throwing them away, which, for a relatively trivial computation like the Buffon experiment, significantly decreases efficiency. For this reason, "leapfrog" is a poor solution for this problem, and speedup on double the number of processors is on the order of 60%. For a simple example, however, it seemed a good place to start.
"Sequence splitting," as described in-depth in newsletter #133, would be a better solution. A quasi-random sequence, in which one can compute the nth element in the sequence without computing (and throwing away) the n-1 preceding elements, would also be better. (Reread Giray's articles for more on quasi-random sequences.)
Here is a sample run of the program. The parameters specify 10 reports per run, 100,000 events per report, for a total of 1,000,000 events, and a random seed of 1. A run on one processor ("mpprun -n1") is, again, identical.
  yukon$ mpprun -n7 ./pbuffon 100000 10 1
  [0]  100000  31825.000000: avg:  0.318250000
  [1]  200000  63713.000000: avg:  0.318565000
  [2]  300000  95753.000000: avg:  0.319176667
  [3]  400000 127498.000000: avg:  0.318745000
  [4]  500000 159238.000000: avg:  0.318476000
  [5]  600000 191120.000000: avg:  0.318533333
  [6]  700000 223020.000000: avg:  0.318600000
  [7]  800000 254518.000000: avg:  0.318147500
  [8]  900000 286358.000000: avg:  0.318175556
  [9] 1000000 318063.000000: avg:  0.318063000
To estimate PI from this simulation, one simply substitutes the result from the experiment (0.3180630) into the analytical solution (see newsletter #136 ):
  P = (2 * L) / (PI * D),
and solves for PI.
This yields an estimate for PI of: 3.14403.
This is certainly not the best way to estimate PI, but it's of historical interest, as a number of early experimenters did indeed conduct the physical version of this experiment, for that purpose.
Here's the C code, starting with the main module. The "simulate" module starts with the comment: "FILE: pbuffon.c".
  /* FILE: pmc.c
   *
   * PURPOSE:
   *   Parallel front end for simple Monte Carlo simulation functions.
   *
   * DESCRIPTION:
   *   Sets up an environment on the Cray T3E to run a single function on
   *   a variable number of processors and accumulate and print results
   *   in stages.
   *
   *   The function, called "simulate", is expected to make the number of
   *   attempts, or event simulations, prescribed by user arguments.  It
   *   must return the total value, not the average, of "successes."  For
   *   example, "simulate" might return the value 100 after 600 throws of
   *   a single die, if a success were defined as a "1" coming up.  All
   *   averaging is done later in this main module, after results from
   *   all processors are accumulated.
   *
   *   The simulate function must itself obtain independent sequences of
   *   "random" values on each PE.  This front end passes through the PE
   *   number, total PEs, and a user-supplied "seed" which may be used.
   *
   * COMMAND LINE ARGS:
   *   1) Total number of trials to make per step (or line of output).
   *   2) Total number of steps.
   *   3) Random seed.
   */

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>
  #include <math.h>

  /* DIE quits all PEs providing a bit of feedback. */
  #define DIE(str,mype) error_fatal(__LINE__,__FILE__,str,mype)

  void error_fatal (int line, char *file, char *msg, int mype);
  int simulate (int mype, int totpes, int usrSeed, int nruns,
                double *resTot);

  int
  main (int argc, char **argv)
  {
    int mype = -1, totpes, mproc;
    int usrSeed;
    int myRunsPerRep, runsPerRep, nReps;
    int repNum, runsDone;
    int *nrunsArr, nrunsAccum;
    double resTotal, *resArr, resAccum, resAvg;
    int n;

    /* Set up MPI */
    if (MPI_Init (&argc, &argv) != 0)
      DIE ("MPI_Init err.", mype);
    if (MPI_Comm_rank (MPI_COMM_WORLD, &mype) != 0)
      DIE ("MPI_Comm_rank err.", mype);
    if (MPI_Comm_size (MPI_COMM_WORLD, &totpes) != 0)
      DIE ("MPI_Comm_size err.", mype);

    /* define master PE */
    mproc = 0;

    /* verify correct number of command line args */
    if (mype == mproc && argc != 4)
      DIE ("SYNTAX ERR, EXPECTED: runsPerRep nReps usrSeed", mype);

    /* All processors synchronize */
    if (MPI_Barrier (MPI_COMM_WORLD) != 0)
      DIE ("MPI_Barrier err.", mype);

    sscanf (argv[1], "%d", &runsPerRep);
    sscanf (argv[2], "%d", &nReps);
    sscanf (argv[3], "%d", &usrSeed);

    /* Each PE calculates number of actual runs it will make per report */
    myRunsPerRep = (runsPerRep + totpes - mype - 1) / totpes;

    /* Master creates arrays for gathering results */
    if (mype == mproc) {
      if ((nrunsArr = (int *) malloc (totpes * sizeof (int))) == NULL)
        DIE ("Malloc failed", mype);
      if ((resArr = (double *) malloc (totpes * sizeof (double))) == NULL)
        DIE ("Malloc failed", mype);

      /* Initialize result accumulators */
      resAccum = 0.0;
      nrunsAccum = 0;
    }

    /* Iterate through requested reports */
    for (repNum = 0; repNum < nReps; repNum++) {

      /* Each PE does its share of the simulations for the
       * current report */
      runsDone = simulate (mype, totpes, usrSeed, myRunsPerRep,
                           &resTotal);

      /* Gather results back to master */
      if (MPI_Barrier (MPI_COMM_WORLD) != 0)
        DIE ("MPI_Barrier err.", mype);

      if (MPI_Gather (&runsDone, 1, MPI_INT,
                      nrunsArr, 1, MPI_INT, mproc, MPI_COMM_WORLD) != 0)
        DIE ("MPI_Gather err.", mype);

      if (MPI_Gather (&resTotal, 1, MPI_DOUBLE,
                      resArr, 1, MPI_DOUBLE, mproc, MPI_COMM_WORLD) != 0)
        DIE ("MPI_Gather err.", mype);

      /* master computes totals and prints report */
      if (mype == mproc) {
        for (n = 0; n < totpes; n++) {
          resAccum += *(resArr + n);
          nrunsAccum += *(nrunsArr + n);
        }

        resAvg = resAccum / nrunsAccum;
        printf ("[%d] %d %lf: avg: %12.9lf\n",
                repNum, nrunsAccum, resAccum, resAvg);
      }
    }

    /* All done... */
    if (MPI_Finalize () != 0)
      DIE ("MPI_Finalize err.", mype);

    return 0;
  }

  /* error_fatal: cleanly exits all PEs, prints diagnostic info. */
  void
  error_fatal (int line, char *file, char *msg, int mype)
  {
    fprintf (stderr,
             "FATAL ERROR in file: %s line: %d PE: %d message:\n"
             "*****\"%s\"\n",
             file, line, mype, msg);
    fflush (stderr);
    globalexit (1);
  }


  /* FILE: pbuffon.c
   *
   * PURPOSE:
   *   Use simulation to try to obtain a solution to the Buffon needle
   *   problem.  A needle of length l is tossed into a grid of parallel
   *   lines with spacing d, and the probability of an intersection
   *   computed.
   *
   * DESCRIPTION:
   *   This model assumes:
   *   ===================
   *     d = 2*l, and uses a length, l, of 1.
   *
   *   Simulation method:
   *   ==================
   *     The displacement of the center of the needle from the closest
   *     line is chosen from a random distribution on the range [0..1].
   *     Then the angle theta of the needle with respect to a
   *     perpendicular to the parallel lines is chosen, from the range
   *     [0..PI/2].  A "success" occurs if the intersection of the
   *     resulting line and the parallel grid line is at a distance less
   *     than half the length of the needle.
   *
   * USAGE:
   *   "Simulate" function intended as computation module for "pmc.c".
   */

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>
  #include <math.h>

  #pragma _CRI inline (get_next_rand_pair)

  /* global variables to communicate pseudorandom numbers.  One per
   * dimension. */
  static double rand1, rand2;

  /* Seed the Cray pseudorandom number generator. */
  void
  seed_generator (int usrSeed)
  {
    unsigned short seed[3];

    seed[0] = seed[1] = seed[2] = usrSeed;
    (void) seed48 (seed);
  }

  /* Each PE extracts adjacent pairs of pseudorandom numbers from the
   * sequence and throws away the pairs used by the other PEs.  This
   * function must be called first and only once.  It advances to mype's
   * first pair. */
  void
  skip_to_first_rand_pair (int mype)
  {
    int n;

    for (n = 0; n < mype; n++) {
      (void) drand48 ();
      (void) drand48 ();
    }
  }

  /* Assumes that "skip_to_first_rand_pair" has been called.  Returns
   * the next pair of numbers from the pseudorandom sequence, and
   * advances to the next pair. */
  void
  get_next_rand_pair (int totpes)
  {
    int n;

    rand1 = drand48 ();
    rand2 = drand48 ();

    for (n = 0; n < totpes - 1; n++) {
      (void) drand48 ();
      (void) drand48 ();
    }
  }

  /* Simulate:
   * PARAMS:
   *   [input]  mype, totpes: can run on a serial machine... set to
   *            0 and 1.
   *   [input]  usrSeed: All PEs use same seed.
   *   [input]  nruns: Total number of attempts to make.
   *   [output] resTot: Total number of successes.
   * RETURNS:
   *   Total number of attempts (should equal input value of nruns). */
  int
  simulate (int mype, int totpes, int usrSeed, int nruns, double *resTot)
  {
    static int doInit = 1;
    int runN;

    /* On the first call to simulate, initialize the pseudorandom
     * number sequence. */
    if (doInit == 1) {
      seed_generator (usrSeed);
      skip_to_first_rand_pair (mype);
      doInit = 0;
    }

    *resTot = 0;
    for (runN = 0; runN < nruns; runN++) {
      double h, theta;

      get_next_rand_pair (totpes);
      h = rand1;
      theta = rand2 * M_PI / 2.0;

      if (h <= cos (theta) * (1.0 / 2.0))
        *resTot += 1.0;
    }

    return nruns;
  }
Next Week: "Monte Carlo Methods: Scientific Applications and Advanced MPI Support"
PLAPACK
[[ We're forwarding this excerpt from a message received from Greg Morrow and Robert van de Geijn "for the PLAPACK team." PLAPACK is not installed at ARSC, but interested ARSC users are encouraged to submit a software request form:
http://www.arsc.edu/support/resources/softwarerequest.html
or email "consult@arsc.edu". ]]
During the last two years the PLAPACK project at UT-Austin has developed an MPI-based Parallel Linear Algebra Package (PLAPACK) designed to provide a user-friendly infrastructure for building parallel dense linear algebra libraries. Now that this infrastructure has stabilized, we have started developing libraries. The first solver available, PLA_Gesv, solves general linear systems.

PLAPACK provides a high-level abstraction for coding dense linear algebra algorithms. It is often believed that such a high-level abstraction comes at the price of lower performance. We are happy to report that while this is temporarily true for smaller problem sizes (where the fact that we have not yet optimized the infrastructure is still noticeable), the higher level of abstraction allows more sophisticated algorithms to be implemented, which attain higher performance for larger problem sizes.

More information:

  http://www.cs.utexas.edu/users/plapack
  plapack@cs.utexas.edu

The PLAPACK Users' Guide is available from The MIT Press:

  Robert van de Geijn, "Using PLAPACK: Parallel Linear Algebra Package",
  The MIT Press, 1997.
  http://mitpress.mit.edu/
QuickTip Q & A
A: {{ You use "qsub my_baby" to submit a job to NQS... and it vanishes! }}

If the normal NQS channels seem broken, check email on the T3E host. NQS sends email (all by itself)! A common situation is when your request exceeds the restrictions on all queues.

Here is a quick NQS tutorial. Issuing this command:

  t3e% qstat

shows all queue names. Executing:

  t3e% qstat -f <queuename>

shows the restrictions on the named queue. On yukon, the "grand" queue is the largest (though not the longest), and the command:

  yukon% qstat -f grand

shows (among other things):

  <REQUEST LIMITS>              PER-PROCESS    PER-REQUEST
    MPP Processor Elements:                            90
    MPP Time Limit:                14400sec       14400sec

If, on yukon, I made this (unsatisfiable) request:

  qsub -q mpp -l mpp_p=128 -l mpp_t=4:00:00 ./my_baby

NQS would email the (indecipherable) job log and this (reasonably articulate) explanation:

  The following request limit(s) are too large for any
  destination queue:
    Per-request MPP processor element limit;
  Last destination queue attempted: <grand@yukon>

This indicates that NQS tested my request against every destination queue serviced by the pipe queue, "mpp." The last queue it attempted was "grand@yukon," and my "Per-request MPP processor element limit" of 128 PEs was too large for any of the queues. If you get a message like this and don't understand it, forward it to your center's help desk.

Q: You've ftp'd a file to the Cray from your PC. There is a load of ^M characters at the ends of the lines. How can you get rid of these?
[ Answers, questions, and tips graciously accepted. ]
Current Editors:

  Ed Kornkven       ARSC HPC Specialist            ph: 907-450-8669
  Kate Hedstrom     ARSC Oceanographic Specialist  ph: 907-450-8678

  Arctic Region Supercomputing Center
  University of Alaska Fairbanks
  PO Box 756020
  Fairbanks AK 99775-6020

Email Subscriptions:

Subscribe to (or unsubscribe from) the email edition of the
ARSC HPC Users' Newsletter.

Back issues of the ASCII email edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.