Many users of HPC systems require multiple long runs to complete a single simulation or experiment, and, often, the separate runs must be processed in sequence.
One method for creating a sequence of batch jobs is what ARSC has in the past termed, "chaining." (See issues: 176, 259, and 297.) To create a "chain" of jobs, each batch script, as its final act before terminating, executes the "qsub" or "llsubmit" command to submit its successor. We strongly discourage recursive, or "self-submitting," scripts because we've seen them go awry too many times and flood the system. Instead we recommend a simple, finite chain in which "A" submits "B," "B" submits "C", and "C" stops. (This could go up to "Z" or even "ZZ," of course.)
For some jobs, chaining isn't an option. It's an unpleasant way to end a run, but codes which write their own frequent restart files are sometimes allowed to simply run out of time. The batch system kills them when they hit the time limit, and the user submits a follow-on job which picks up again at the most recent restart file. Chaining won't work for such jobs because the batch script is halted when the application code is killed. Thus, subsequent script commands, such as the "qsub" or "llsubmit" that would create the chain, are never executed.
Fortunately, LoadLeveler and PBS both allow users to move the logic for chaining out of the script and into the scheduler. The LoadLeveler feature was discussed in the article "Using LoadLeveler Job Steps" in issue #307:
http://www.arsc.edu/support/news/HPCnews/HPCnews307.shtml
In PBS on the X1, you use the "qsub -W depend=..." option to create dependencies between jobs.
The three most useful types of supported dependencies are probably "afterany," "afterok," and "afternotok." These are used as follows, where "<JOB-ID>" is the PBS ID number of a previously submitted job, and "<QSUB SCRIPT>" is a regular qsub script:
  qsub -W depend=afterany:<JOB-ID> <QSUB SCRIPT>
  qsub -W depend=afterok:<JOB-ID> <QSUB SCRIPT>
  qsub -W depend=afternotok:<JOB-ID> <QSUB SCRIPT>
From "man qsub," here's the description of these attributes:
  afterany:jobid[:jobid...]
       This job may be scheduled for execution
       after jobs jobid have terminated, with or
       without errors.
  afterok:jobid[:jobid...]
       This job may be scheduled for execution only
       after jobs jobid have terminated with no
       errors. See the csh warning under "Extended
       Description".
  afternotok:jobid[:jobid...]
       This job may be scheduled for execution only
       after jobs jobid have terminated with
       errors. See the csh warning under "Extended
       Description".
From these descriptions, it's obvious that you can use the error condition of the predecessor job to either halt or perpetuate the sequencing of jobs, as needed. In the scenario described earlier, it is planned that jobs will run into the time limit and be killed (which creates an error condition). Presumably, if a job completed without error, it would signify the clean end of the entire sequence of runs (e.g., the solution converged, the final timestep was processed, etc.). Thus the goal would be to continue the sequence on error but halt it if there's no error.
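As a rough illustration of the continue-on-error idea, here is a minimal sketch that queues a finite chain in which each job depends on its predecessor through "afternotok". It assumes that "qsub" prints the new job's ID on standard output. The function name "submit_chain", the "QSUB" override variable, and the script name "run.pbs" are hypothetical, introduced only for this sketch. How the scheduler treats the remaining queued jobs once a predecessor ends cleanly may vary by site, so check the local behavior before relying on it.

```shell
#!/bin/sh
# Sketch only: queue a finite chain of runs in which each run may start
# only if its predecessor ended with an error (for example, because it
# was killed at the time limit).
# QSUB is a hypothetical override for dry runs; it defaults to "qsub".
# Assumption: qsub prints the ID of the submitted job on standard output.

submit_chain() {
    script=$1                 # batch script used for every link
    links=$2                  # total number of jobs (keeps the chain finite)
    qsub_cmd=${QSUB:-qsub}

    # The first link has no dependency.
    prev=$($qsub_cmd "$script") || return 1
    echo "$prev"

    i=2
    while [ "$i" -le "$links" ]; do
        # Each later link waits for the previous job and is eligible
        # to run only if that job terminated with errors.
        prev=$($qsub_cmd -W depend=afternotok:"$prev" "$script") || return 1
        echo "$prev"
        i=$((i + 1))
    done
}
```

Calling "submit_chain run.pbs 5" would queue at most five runs and print each job ID as it is submitted. Fixing the number of links up front keeps the chain finite, in the same spirit as the "A" submits "B" submits "C" recommendation.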
In the next issue, I'll give more details and an example.
ARSC is currently deploying checkpointing functionality on our two IBM systems: iceberg and iceflyer. Checkpointing allows the state of a running program to be saved to a file. At a later time the program can be restarted from that point of execution using the checkpoint file. Checkpointing should allow for increased utilization of the systems prior to downtimes and benefit long jobs which have no built-in checkpointing facilities.
The LoadLeveler keyword 'checkpoint' specifies whether or not a job should be considered for checkpointing.
E.g.
  # The following specifies that a job can be checkpointed.
  # @ checkpoint = yes
Unlike standard LoadLeveler scripts, job scripts with checkpointing enabled must be executable. Scripts that are not executable and request checkpointing will be rejected by LoadLeveler.
Below are a few limitations to checkpointing which may apply in particular to codes running on ARSC IBM systems.
We are looking for several volunteers to help test the checkpointing functionality. Please contact the ARSC help desk (consult@arsc.edu) for more details. Projects with little or no remaining allocation are especially encouraged to inquire.
In May, ARSC began installing a Cray XD1 system. The system, named Nelchina, is currently being configured for use as an academic resource and should be available by the end of the year. Nelchina consists of 3 chassis with 6 nodes in each. Each node has two 2.4 GHz Opteron 250 processors and 4 GB of RAM. Additionally, one chassis has 6 field programmable gate arrays (FPGAs).
This summer several Ph.D. candidates from George Washington University are visiting the Arctic Region Supercomputing Center to investigate the FPGA technology on the Cray XD1. From their experiences we hope to get a better understanding of the problems that the system is best suited to solve.
A:[[ Sometimes I'll run an X1 PBS script interactively instead of
[[ through "qsub," to test the basic syntax of the shell script.
[[ Here's a sample script (with just the basics remaining):
[[
[[ #PBS -l walltime=4:00:00
[[ #PBS -l mppe=8
[[ #PBS -q default
[[
[[ cd $PBS_O_WORKDIR
[[ aprun -n 8 ./a.out
[[
[[ I'm totally annoyed, though, because I usually forget that in an
[[ interactive run, the PBS variable PBS_O_WORKDIR doesn't get set!
[[ So, when the script hits this line:
[[
[[ cd $PBS_O_WORKDIR
[[
[[ it cd's my session to my home directory and everything fails until
[[ I remember to go back and comment out the "cd $PBS_O_WORKDIR".
[[ Then, of course, when I'm done with interactive tests and submit
[[ the real, batch, run, I forget to UNcomment the "cd", and everything
[[ fails again!
[[
[[ Any ideas to help me out?
#
# Martin Luthi
#
A very simple way to do this would be to test for the program name,
available in the script as $0. The exact syntax depends on the shell
language, but here is an example for Bourne-Shell:
======================
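#!/bin/sh
# Assumes $0 is "qsub" when the script runs under the batch system,
# as suggested above; verify what $0 contains on your system.
if [ "$0" = "qsub" ]; then
    cd "$PBS_O_WORKDIR"
fi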
======================
#
# Lee Higbie
#
In the script you use to enter your interactive session, or as soon as
you enter it, type:
  export PBS_O_WORKDIR=`pwd`
Depending on the system and shell, you may have to type setenv instead
of export, or set the variable and export it on two separate
lines.
#
# Ed Kornkven
#
We can test $PBS_O_WORKDIR to see if it is a non-empty string. If it
is, we assume that it contains the directory that we want to change to.
If $PBS_O_WORKDIR is not set then the test fails and the "cd" is not
executed. Using the usual block-if, we write:
  if [[ -n "$PBS_O_WORKDIR" ]] ; then
      cd "$PBS_O_WORKDIR"
  fi
Alternatively, in the Korn shell (and probably others), one can put the
command in a list construct where the first list item is the test which,
if successful, allows the command to execute:
  [[ -n "$PBS_O_WORKDIR" ]] && cd "$PBS_O_WORKDIR"
Q: Here's a question from former editor, Guy Robinson: Read any good
computation/parallel programming/science books recently? If so,
send title and short review.
[[ Answers, Questions, and Tips Graciously Accepted ]]
Contact:
Thomas J. Baring
ARSC Web Specialist
ph: 907-450-8619

Donald Bahls
ARSC User Consultant
ph: 907-450-8674

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020

Send comments and questions to the current editors using this Contact Form.
Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775
voice: 907-474-6935