ARSC T3E Users' Newsletter 137, February 20, 1998

Hold, Checkpoint, and Restart of NQS Jobs on Yukon

ARSC holds jobs and stops queues on an occasional basis, for the following reasons:

  • To perform system maintenance during scheduled (or unscheduled) downtime.
  • To permit the timely execution of xlarge and grand jobs.
  • On rare occasion, to enable a large interactive session.

We will continue to contact all users whose jobs have been held, checkpointed, and restarted so that they may verify the results of their runs.

If you notice that the queues have been held, please submit NQS jobs as normal. The appearance of jobs in the queues is an important indication that users are waiting and demonstrates the nature of the load. We need this information!

You may monitor the status of your jobs using the UNICOS command, qstat -a. The field titled, ST, contains a three character job status code. Although all possible codes are explained in man qstat(1), the most common codes with regards to hold, checkpoint, and restart are:

The job had been running, but was checkpointed, held, and released. However, although it is ready to restart from the checkpoint image, it is now blocked waiting for PEs.
The job has not run yet and is waiting in a queue, but the queue has been stopped by the sys admin. Thus, no jobs in the queue will be allowed to run until the sys admin restarts the queue.
The job has been held by the operator (sys admin). Held jobs are prevented from making the transition to another state. Typically, this means that the job has not run yet, is waiting in a queue, but rather than the queue having been stopped (which would block all jobs in the queue), the particular job has been held.
The job was held by the operator, the system was later shut down and brought back up, and the job is still being held. Initially, the job would have had Hop status, but after the reboot, the system changed the status to Hsh from Hop to indicate the intervening shut down.

Monte Carlo Methods: Quasi-Monte Carlo & Low-Discrepancy Sequences

This is the second in a four part series on Monte Carlo methods.

[[ Many thanks to Dr. Giray Okten, Visiting Professor of Mathematics, University of Alaska, Fairbanks, for contributing the first two articles in this series. ]]

Quasi-Monte Carlo methods are often described as deterministic versions of Monte Carlo methods, although in practice, Monte Carlo methods are also strictly speaking, deterministic. Quasi-Monte Carlo methods emphasize "uniformity" of a number sequence, instead of its "randomness."

"It is suggested that, instead of clinging to vague concepts of randomness, it might be better to aim at working with sequences making no presence of random origin, but so devised as to give the best possible guarantee of accuracy in computations. Such methods of computation could be described as quasi-Monte Carlo." [Zaremba]

The theoretical basis of these methods is the Koksma-Hlawka inequality, which gives a "deterministic" upper bound for the absolute value of the error, when an integral is approximated by the average value of the integrand evaluated on a sequence of "deterministic" vectors {q(n)}, n=1,2,... . The error of this approximation is bounded by a product of two terms; the first term is the variation of the integrand, which measures the "smoothness" of it, and the second term is the discrepancy of the underlying number sequence q(n), which measures the deviation of that sequence from the "perfect" sequence which would be distributed as evenly as possible in the unit cube.

We wish not to go into technical definitions, but you are probably not happy with this vague description of the "perfect" sequence!

To elaborate on this point, consider the unit interval [0,1]. Suppose you have N numbers, q(1),...,q(N), in this interval, and you want to know how to determine the discrepancy of these numbers, i.e., their deviation from the "perfect" sequence. Choose any number, x, from the interval and consider the sub-interval [0,x). How many of your N numbers should lie in this sub-interval, if you want to claim that your numbers are perfectly evenly distributed? If x were equal to 1/2, then a "perfect sequence" would have exactly N/2 terms of its first N terms lying in the first half of the interval. In general, for any x, the perfect sequence would have x*N of its first N terms lying in the sub-interval [0,x).

Thus, if we count the number of q(i)'s that actually lie in this subinterval and then take its difference from x*N, we will have an idea of the deviation of our numbers from the perfect distribution. If we take the absolute value of this difference, divide it by N, and then find the greatest such number as x takes values between 0 and 1, we finally obtain the discrepancy of our numbers, q(1),...,q(N). The challenge is to find sequences with the smallest possible discrepancy. such sequences are called "low-discrepancy" sequences.

Some of the classical low-discrepancy sequences are the van der Corput, Halton, Sobol, Faure and Niederreiter sequences, named after their inventors. The discrepancy of these sequences satisfies O((log N)^s /N) where s is the dimension of the numbers (vectors) you generate. The following Mathematica code generates the van der Corput sequence in any base b. These numbers are one-dimensional numbers, i.e., numbers in [0,1).

  vanderCorput[n_Integer, b_Integer] :=
         digitexp = Reverse[IntegerDigits[n, b]];
           Inner[((#1*b^(-#2)))&, digitexp, Range[Length[digitexp]]]    ];

Here are the first 15 van der Corput numbers in base 2:

  1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, 1/16, 9/16, 5/16, 
   13/16, 3/16, 11/16, 7/16, 15/16

And here are the first 15 van der Corput numbers in base 3:

  1/3, 2/3, 1/9, 4/9, 7/9, 2/9, 5/9, 8/9, 1/27, 10/27, 
   19/27, 4/27, 13/27, 22/27, 7/27

To generate higher dimensional vectors, we can use different van der Corput numbers in pairwise relatively prime bases, which gives the Halton sequence. The pairwise relatively prime requirement can be easily satisfied if we choose prime numbers to be our bases. For example to generate two-dimensional Halton vectors, we will generate van der Corput numbers in base 2 and base 3 and then construct vectors whose first components are given by the base 2 van der Corput numbers and the second components given by the base 3 van der Corput numbers. The following is a Mathematica code that generates s-dimensional Halton vectors:


Here are the first six 2-dimensional Halton vectors. They are obtained simply by coupling the above van der Corput numbers we generated in bases 2 and 3:

  {1/2, 1/3},{1/4, 2/3},{3/4, 1/9}, {1/8, 4/9}, {5/8, 7/9}, {3/8, 2/9}

We can see the difference between the two-dimensional Halton and pseudorandom vectors by plotting them. A graph of the first 1000 2-dimensional Halton vectors plotted in the unit square is shown here:

[1000 2-dimensional Halton vectors]

1000 2-dimensional Halton vectors

A graph of the first 1000 pseudorandom vectors:

[1000 pseudorandom vectors]

1000 psuedorandom vectors

These plots, when compared, show that the Halton vectors are distributed more evenly than the pseudorandom vectors. It is indeed this evenness that makes low-discrepancy vectors work better than the pseudorandom vectors in general.

The advantage of quasi-Monte Carlo methods is that they avoid the difficulties associated with "randomness." Moreover, they provide deterministic error bounds, as given by the Koksma-Hlawka inequality, with a deterministic order O((log N)^s /N). For Monte Carlo methods, however, the error bounds are probabilistic with order O(1/Sqrt(N)).

This means that asymptotically, low-discrepancy vectors give better estimates than the pseudorandom vectors. Moreover, unlike Monte Carlo estimates, the estimates produced by the low-discrepancy sequences satisfy deterministic error bounds. However, quasi-Monte Carlo methods have their own limitations and Monte Carlo methods have some desirable properties themselves!

Recently, several researchers have designed techniques to capture the best features of these methods. These techniques are usually called "hybrid-Monte Carlo methods." For additional information on hybrid-Monte Carlo methods, as well as some Monte Carlo and quasi-Monte Carlo related links, see my home page:


Monte Carlo and quasi-Monte Carlo methods are used by numerous disciplines, from computational physics to economic modeling & finance, and from operational research to life sciences. Advances made by researchers in the underlying theory of these methods provide more accurate solution techniques that would benefit all the application areas. Maybe because of this feature, the theory of Monte Carlo and quasi-Monte Carlo methods enjoys a very rich input of ideas coming from mathematicians and scientists. Interested researchers can follow the latest developments and present their ideas in the international conferences devoted to these methods. Information on the next conference can be found at:

Next issue:
"Monte Carlo Methods: A Parallel Implementation"

Debugging Advice for Programmers

[ Few of us have the luxury of writing code from scratch. More often, we inherit a code which must be parallelized or otherwise modified. This reality accepted, it may still be useful to step back from the minutiae of debugging for a moment. In newsletter #135: we offered "Debugging Tips for F90 Programmers." This article is a follow-up, and if "tips" are to "advice" as "trees" are to "forests", then the title says it all. This article is an excerpt from: Ellis, Philips, and Lahey. Fortran 90 Programming. Reading, MA: Addison-Wesley, 1994. ISBN 0-201-54446-6. p. 308. Reprinted by permission of Addison Wesley Longman Ltd. ]

"... you should always adopt an incremental approach to both writing and debugging a program. In other words, you should always break your code into small procedures, each one of which has a logically coherent, single, purpose. Do not write procedures that do too many unrelated tasks. The motivation is to keep different parts of a program from interacting with each other in subtle and obscure ways. Breaking a program into procedures means that such interactions can occur in a controlled way, only via procedure calling sequences. A good rule of thumb here is to keep procedures to no more than 50 lines of code. In that way, they can be printed on a single sheet of paper and more readily understood.

To debug a program incrementally, each procedure should be thoroughly tested by itself. This means that input to the procedure is generated by another part of the program, or by hand, and the output examined for correctness. The set of inputs used to test a procedure should exercise all branches of the code it contains.

It is often very tempting not to test every procedure but, instead, to start trying to make a complete program function. This almost always results in errors being looked for in the wrong place--which is probably the most time-consuming and frustrating part of debugging a program.

Do not build your house without foundations. The incremental debugging approach means that the lowest-level procedures are tested first, then the procedures that use those procedures, and so on until the whole program is verified. The process of developing a set of test problems can often take half of the debugging effort.

Finally, you should keep all the test problems and the results produced during testing. In the course of time almost all non-trivial programs will be modified, and when that day comes it will both save time, and provide an added degree of confidence, if the modified program can be shown to perform correctly on the same sets of test data as the original version. This does not mean however, that you should not develop additional tests for an new features added to the program. Thus over the life of a program the test suite will gradually grow with each new modification."

Book Review: "Parallel Programming with MPI"

By: Peter S. Pacheco
Morgan Kaufmann Publishers, Inc., San Francisco (1997), 418 pages
ISBN 1-55860-339-5

Well-written in a friendly style, "Parallel Programming with MPI" (PPMPI) provides an introduction to both parallel programming and MPI. From the preface: "It was written for students and professionals who have no prior experience in programming parallel systems. It was designed for use both as a self-paced tutorial and as a text in a more conventional classroom/computer laboratory setting."

PPMPI opens with a brief motivation and overview of parallel computation, and proceeds quickly to its first sample MPI program, parallel "hello world."

Subsequent chapters introduce the essential MPI routines concurrent with parallel programming concepts, and make use of increasingly sophisticated sample programs. For instance, as an introduction to the MPI_Bcast routine, the book discusses why a tree-structured rather than list-structured algorithm for data distribution might be desirable. It then codes it using the simple point-to-point routines, MPI_Send and MPI_Recv, introduced in the previous chapters. This incremental approach flows logically and helps dispel the mystery of collective operations.

Consistent with PPMPI suggested use as a textbook, every chapter ends with exercises and programming assignments and, for code samples, it uses C exclusively.

There is, however, a nice concession to the Fortran world, as source code for all examples given in the book is available on-line in both C and Fortran (visit the URL given above).

I down loaded both the C and Fortran archives, randomly picked chapter 12's sample program, ping_pong.f and compiled/ran it on the T3E with absolutely no glitches. These on-line archives stand alone as an excellent collection of introductory level MPI programs, but better yet, they are a companion to the book.

PPMPI concludes with a few chapters which brush the surface of more advanced subjects: non-blocking routines, debugging, performance, Amdahl's law, scalability, parallel libraries, and parallel algorithms (as distinct from parallelized serial algorithms). An appendix provides the C syntax--but no Fortran--for all routines in the MPI standard. If you are interested, you are encouraged to visit the PPMPI home page given above.

Quick-Tip Q & A

A: {{ You want to compile your parallel application and then run it on
      N T3E application PEs. What's the lazy man (or woman's) way to
      determine, at compile time, if adequate memory exists on those N
      PEs to run the program.  (Assume static allocation of all memory
      and that you're compiling/running on the same system.)  }}
    Compile it as a fixed (non-malleable) executable for N PEs.  The
    compiler verifies that N PEs (on the current host) would provide
    sufficient memory.


      f90 -X N prog.f
      cc  -X N proc.c
      CC  -X N proc.c++

Q: You use "qsub my_baby" to submit a job to NQS... and it vanishes!

   qsub acts innocent, returning, not an error message, but a
   request-id number (as if everything were "normal").  Posthaste, you
   run "qstat -a".  Nothing.  With growing alarm, you type, "qstat -f
   <request-id>".  Your job, with the id number JUST returned by qsub,
   is missing!  Starting to panic (but feigning nonchalance) you count
   slowly to 10 and run "ls my_baby.*<request-id>", expecting to see
   the job's stdout/stderr files,  but there is no trace.  What
   happened?  Was it the butler?

[ Answers, questions, and tips graciously accepted. ]

Back to Top