ARSC T3E Users' Newsletter 136, February 06, 1998

Beowulf Seminar

Thomas Sterling of JPL/Caltech will be the featured guest at a seminar on Beowulf, to be held at ARSC on March 5th and 6th. More information to follow.

ARSC Spring Training Schedule

        February 11: Introduction to ARSC: Is there a supercomputer in 
                     your future?
        February 18: Introduction to the Y-MP
        February 25: Introduction to Parallel Programming
        March 4:     Introduction to NQS
        March 11:    Introduction to CRL
        March 25:    Scientific Visualization at ARSC

For details and registration forms, see: .

Monte Carlo Methods: Introduction

This is the first in a four-part series. The series will introduce the following aspects of Monte Carlo methods: history and mathematical theory (contributed by Dr. Giray Okten); a sample parallel implementation (Tom Baring); and scientific applications and advanced MPI support (Guy Robinson).

[ Many thanks to Dr. Giray Okten, Visiting Professor of Mathematics, University of Alaska, Fairbanks, for contributing the first two articles in this series. ]

Monte Carlo methods were mainly developed in the 1940s by mathematicians and scientists, including von Neumann, Fermi, Ulam, and Metropolis, working on the development of nuclear weapons at Los Alamos during World War II. It is believed that the term "Monte Carlo" was coined for these methods as a code word, by von Neumann and Ulam, suggesting their "probabilistic" nature. Indeed, these methods were first applied to problems with an apparent probabilistic structure.

The first documented application, where we see the ideas behind Monte Carlo methods, appears in a paper by G. Comte de Buffon in 1777. He described an experiment in which a needle of length "L" is thrown randomly onto a floor ruled with parallel straight lines a distance "D" apart. He then suggested estimating the probability "P," that the needle will intersect a line, by throwing the needle many times and calculating the ratio of the number of throws intersecting a line to the total number of throws. He also carried out the mathematical analysis and calculated the exact probability.

Later, Laplace suggested that this idea could be used to evaluate "PI" from the throws of the needle. This is referred to as "Buffon's Needle Experiment," and proceeds as follows:

For L <= D, the exact probability, P, that the needle intersects one of the parallel lines is:

    P = (2 * L) / (PI * D).

If we drop the needle (or simulate dropping it) N times, and count R intersections, we obtain an experimental estimate for P,

    P = (R / N),

and thus, an estimate for PI:

    PI = (2 * L * N) / (R * D).

As explained below, the estimate improves as the number of throws increases.

(Visit the web site:

if you want to try this experiment on-line, and see what the estimate for PI is for various numbers of throws!)

Monte Carlo methods can be used not only in naturally stochastic problems, but also in deterministic problems. In general, any quantity that can be written as the expected value of a random variable defined on a probability space can be estimated by these methods, by calculating the average value of the random variable.

Using less mathematical language, we can restate this in the following way. Suppose you have a problem and you want to estimate a certain quantity. Simulate the problem using random numbers (we will discuss random numbers later) and obtain an estimate for the quantity of interest. Repeat this simulation a number of times and take the average of the estimates produced by these simulations. This average will be a good estimate if the number of simulations (the sample size) is large.

The reason we can use averages to estimate the quantity of interest comes from the strong law of large numbers, which states that the sequence of sample means of a random variable converges to its expected value almost surely, provided the expected value is finite. In addition, from the central limit theorem, the error of this approximation satisfies the probabilistic bound O(1/Sqrt(N)), where N is the sample size.

A simple application is numerical integration, where one uses the averages of function values evaluated at random coordinates to estimate the definite integral of the function, say, on the unit cube. By "random" numbers, we mean independent and identically distributed realizations of the random variable with the uniform distribution on the unit cube.

The concepts random variable, independence, and identically distributed are well-defined concepts of probability theory. However, when we carry these concepts to the physical world, especially in an attempt to generate random numbers, we face several difficulties, confusion, and an ongoing debate. When we think of a random process, we often think of events whose outcomes cannot be determined, such as the scattering of an electron by an atom, or simply the outcome of flipping a fair coin. However, the fact that we cannot determine the outcome of flipping a fair coin is not a property of the physical process itself, but merely an indication of our understanding of the process, which makes it difficult to define uncertainty, or randomness, in the physical world. In spite of this difficulty (or maybe, because of it), several mathematicians and philosophers, including Kolmogorov, Martin-Lof, Chaitin, and Popper, have suggested various definitions of random numbers.

Whatever our understanding of a random event or a random sequence of numbers might be, we go ahead and try to find algorithms, mainly using arithmetical operations, that generate numbers that look random. These imitations are called pseudorandom numbers, and the success of the imitation is judged by statistical tests, often called tests of randomness. Some of the methods that generate pseudorandom numbers, as well as some statistical tests, can be found at:

There are also other interesting links on this page.

"Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number-there are only methods to produce random numbers, and a strict arithmetical procedure is of course not such a method. (It is true that a problem we suspect of being solvable by random methods may be solvable by some rigorously defined sequence, but this is a deeper mathematical question than we can go into now.)" [von Neumann]

Perhaps what von Neumann stated as a deeper mathematical question was solved, at least partly, by the theory of uniform distribution (a branch of number theory), which claims that the reason our random methods work has nothing to do with their randomness, but with their uniformity. The uniformity of a sequence is a number-theoretic concept: it is defined without using probabilistic concepts. This leads us to the notion of quasi-Monte Carlo methods, or number-theoretic methods.

Next week: "Monte Carlo Methods: Quasi-Monte Carlo and Low-Discrepancy Sequences"

Data Types and Sizes

One common problem encountered when porting code between systems is the size of the data representations used. Language standards and definitions rarely specify the number of bytes that make up each data type. Indeed, with the rapid pace at which processors change and memory sizes increase, this is arguably not the concern of a language. However, a language must provide a mechanism by which the user can specify, or at least determine, the number of bytes in a particular data type, and this is typically where confusion occurs. Below are listed the sizes of various data types on some common systems:

Fortran                 C               Number of bytes
=================       ==========      ===============

  CRAY Y-MP/C90/J90
INTEGER                 int             8
                        long            8

REAL                    float           8
                        double          8

DOUBLE PRECISION        long double     16

  CRAY T3E
INTEGER                 int             8
                        long            8

REAL*4                  float           4
REAL                    double          8

  SGI Workstations
INTEGER                 int             4
                        long            4

REAL*4                  float           4
REAL                    double          8

DOUBLE PRECISION        long double     NOT AVAILABLE. (double assumed.)

When is the size of the type important?

  1. If the code passes data between C and Fortran, the machine representation (including size in bytes) must be the same. On the Y-MP/C90/J90 systems, code may exchange data between C and Fortran routines as Fortran type REAL and C type float or double. The same code will get into trouble on the T3E, where C floats are only 4 bytes long.
  2. If the code reads an unformatted data file: if 4-byte floats were written, then 4-byte floats must be read.
  3. If the code requires 64 bits of numeric precision, for either internal stability, precision of results, accuracy of results, or all three.
  4. If the code is optimized for efficient memory usage by specifying 4-byte floats. In some cases, doubling the memory requirements (by doubling the precision) would exceed available memory.

Users occasionally ask about the sizes of various types on ARSC systems. The most satisfying answer is obtained by running the tests oneself. The codes below report the sizes of common types in Fortran and C, and can be run on any system.

C code
#include <stdio.h>

int main(void)
{
    printf(" sizes are int         %d\n", (int) sizeof(int));
    printf(" sizes are long        %d\n", (int) sizeof(long));
    printf(" sizes are float       %d\n", (int) sizeof(float));
    printf(" sizes are double      %d\n", (int) sizeof(double));
    printf(" sizes are long double %d\n", (int) sizeof(long double));
    return 0;
}


Fortran Code
      program prog

      implicit none

      integer i
      real r
      double precision d
      logical j

      write(6,*)  ' integers are ',kind(i)
      write(6,*)  ' reals are ',kind(r)
      write(6,*)  ' double precision are ',kind(d)
      write(6,*)  ' logicals are ',kind(j)

      end program prog


Fortran users might like to inspect the size of logicals on various systems to see how much space can be saved by using these in a program!

Side note:

There is often a compiler flag which controls the default size of a data type. For example, with the f90 compiler on the T3E, users can specify -i 64 to make all default integers 64 bits, or -i 32 to make them 32 bits. This is most useful for programs with many large integer arrays, where there would otherwise be a significant increase in storage when moving between a workstation and a Y-MP/C90/J90/T3E. However, users must be careful that 32 bits provide sufficient range.


[ One of our readers requested some hard-core information on E-registers.

    Having perused the literature, he claimed that, while often touted,
    E-registers were inadequately explained.  This article, we hope,
    shines more light into this "black-box."

    The following is an excerpt from the article,
      "Performance of the CRAY T3E Multiprocessor"

      by Ed Anderson, Jeff Brooks, Charles Grassl, and Steve Scott,

    which appears on the SC97 CD-ROM (under Ed Anderson's talk) and is
    available on ARSC's ftp server, at:

    The T3E benchmarker's guide has a little E-register discussion as well.

    Thanks to Jeff Brooks for his help. ]

3.3 E-registers

As mentioned in Section 2, all remote communication and synchronization in the CRAY T3E is done between E-registers and memory. E-registers provide two primary benefits over a more conventional load/store mechanism for accessing global memory: they extend the physical address space of the microprocessor to allow a very large, directly addressable memory, and they dramatically increase the degree of pipelining attainable for global memory requests. They also provide efficient non-unit-stride memory access and gather/scatter capabilities.

3.3.1 Description

E-registers employ a novel address translation mechanism. The microprocessor implements a cacheable memory space and a non-cacheable I/O space, distinguished by the most significant bit of the physical address. Local memory loads and stores use cacheable memory space. Address translation takes place on the processor in the usual fashion, and physical addresses are passed to the memory. Memory-mapped registers, including the E-registers, use noncacheable I/O space.

There are two primary types of operations that can be performed on E-registers:

  • Direct loads and stores between E-registers and processor registers.
  • Global E-register operations.

Direct loads/stores are used to store operands into E-registers and load results from E-registers. Global E-register operations are used to transfer data to/from global (meaning remote or local) memory. The global operations to read memory into E-registers or write E-registers to memory are called Gets and Puts, respectively. Gets and Puts are triggered via I/O space stores, in which the address supplies the command and target/source E-register, and the data bus supplies the global memory address. Virtual-to-physical address translation thus occurs outside the processor in the system logic. Complete details are provided elsewhere [7].

Access to E-registers is implicitly synchronized by a set of state flags, one per E-register. A Get operation marks the target E-register(s) empty until the requested data arrives from memory, at which time they are marked full. A load from an E-register will stall if the E-register is empty until the data arrives. A Put from an E-register will also stall if the E-register is empty until the data becomes available. A global memory copy routine might perform Gets from the source area of memory into a block of E-registers and subsequently Put the data from the E-registers to the target memory area. The implicit state flag synchronization protects against the read-after-write hazard in the E-registers.

Since there are a large number of E-registers (512 user + 128 system), Gets and Puts may be highly pipelined. A large number of Gets can be issued, for example, before the first result is accessed. The processor bus interface allows up to four properly-aligned Get or Put commands to be issued in a two-cycle bus transaction, allowing 256 bytes worth of Gets or Puts to be issued in 26.7 ns. This issue bandwidth is far greater than the sustainable data transfer bandwidth, so the processor is not a bottleneck. Data in a memory-to-memory transfer using E-registers does not cross the processor bus; it flows from memory into E-registers and out to memory again.

In addition to providing a highly-pipelined memory interface, the E-registers provide special support for single-word load bandwidth. A variant of Gets and Puts moves a series of eight 64-bit words separated by an arbitrary stride between memory and E-registers. Thus, row accesses in Fortran, for example, can be fetched into contiguous E-registers using strided Gets. The resulting blocks of E-registers can then be loaded "broadside" into the processor, making significantly more efficient use of the bus than would be possible with normal cache line fills, in which 7/8 of the loaded data would not be used. Single-word Gets and Puts are also provided, which can be used for gather/scatter operations.

When E-register operations are used to transfer data to/from local memory, the cache is bypassed. Non-cached data access may be preferred for copying blocks of data that are too large to fit in the cache and for any large-stride or irregular access patterns that would bring in unwanted data if loaded a cache line at a time. The use of E-register operations for local data access can be suggested to the compiler by means of the loop-level CACHE_BYPASS directive, or it can be done directly by calls to Shmem library routines, specifying the local processor number instead of a remote processor number in the argument list.

Book Review: "Scientific Visualization - Overviews, Methodologies, Techniques"

"Scientific Visualization - Overviews, Methodologies, Techniques" Editors: Gregory M. Nielson, Hans Hagan, Heinrich Muller IEEE Computer Society (1997) 577 pages ISBN 0-8186-7777-5

This book comprises 21 papers on state-of-the-art scientific visualization topics, written by 45 contributing authors. Contents are organized in three sections:

Part 1 - Overviews and Surveys
Part 2 - Frameworks and Methodologies
Part 3 - Techniques and Algorithms

Part 1 contains overview and survey papers on some important visualization topics. The first paper gives a concise historical overview of methods for multivariate, multidimensional data, from the pencil and paper era to the present. The authors (Pak Chung Wong and R. Daniel Bergeron) divide visualization history into four stages: (1) pre-1976, Searching; (2) 1977-1985, Awakening; (3) 1986-1991, Discovery; and (4) 1992-present, Elaboration and Assessment. The late 1980s were a period of huge growth in the field, fueled by advances in algorithms and software, and in supercomputing, computer graphics, and data storage hardware. The remaining chapters in Part 1 cover immersive investigation techniques, grid generation methods, computational steering, and methods for large-scale unsteady fluid flows.

Chapter 3, "Survey of Grid Generation Methodologies and Scientific Visualization Efforts" (Bernd Hamann and Robert J. Moorhead II), was of particular interest. The authors describe visualization work underway at the ERC-CFS at Mississippi State University. They document the development and use of several interesting methods and tools for oceanographic data (OVIRT, SCIRT, AGP). They also discuss data compression issues for oceanographic data sets, and describe a wavelet-based representation scheme that is fast, efficient, and has good data reconstruction qualities. The authors also discuss interactive, concurrent visualization techniques, including a variation of the FAST scheme adapted for their CFS codes.

Part 2 contains papers on formal foundations: integrated volume rendering and data analysis in wavelet space, intelligent design of color visualizations, engineering perceptually effective methods for abstract data, comparative visualization of flow features, systematic analysis for design, and controlled interpolation.

Part 3 contains in-depth review articles on both established and emerging algorithms and techniques. Topics include: feature visualization, flow surface probes for vector fields, particle tracing on 3D curvilinear grids, adaptive generation of volume contours, deformation tensor fields, a primer on reaction-diffusion patterns, relativistic strings, hierarchical representation of large data sets using wavelets, tools for triangulations and tetrahedrizations, and tools for computing tangent curves and topological graphs. Feature extraction and analysis, and data compression/reconstruction, are two important emerging techniques, particularly in the treatment of gigabyte and terabyte data sets.

In summary, the book is an excellent general introduction to state-of-the-art visualization methods for practicing scientists, as well as a valuable reference for programmers and computer scientists.


Quick-Tip Q & A

A: {{ Say you submit a single, 8-hour batch request using qsub and that
      the qsub script launches a series of short jobs. Is there a way
      to prevent one of the short jobs from using more than its share
      of the 8 hours?  }}

Assuming that each job is entitled to about the same amount of time,
use NQS.

In your qsub script, the option

   #QSUB -l mpp_t=hh:mm:ss

specifies the limit to the total execution time of the entire NQS
request (which is the sum of the time of the included processes).  The

   #QSUB -l p_mpp_t=hh:mm:ss

however, limits the time of each process. The following example assumes
that a normal run of "myprog" takes something less than 30 minutes.
The user expects that none of the four processes started will reach the
30 minute per process time limit and that the entire request will
complete in under 2 hours.  However, he/she may have inherited the
code and worry that it doesn't handle exceptions well, and that one of
the data sets might drop it into an infinite loop.  Were this to happen,
NQS would interrupt the offending process after 30 minutes, thus
leaving the other processes with their share of the 2 hours.

  #QSUB -q mpp
  #QSUB -l mpp_p=8
  #QSUB -l mpp_t=2:00:00
  #QSUB -l p_mpp_t=30:00
  cd ${TMPDIR}/progs
  mpprun -n8 ./myprog data_set.1
  mpprun -n8 ./myprog data_set.2
  mpprun -n8 ./myprog data_set.3
  mpprun -n8 ./myprog data_set.4

Q: You want to compile your parallel application and then run it on N
   T3E application PEs. What's the lazy man's (or woman's) way to
   determine if adequate memory exists on those N PEs to run the
   program?  (Assume static allocation of all memory and that you're
   compiling/running on the same system.)

[ Answers, questions, and tips graciously accepted. ]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
Back to Top