ARSC T3E Users' Newsletter 171, June 18, 1999
UNICOS/mk Upgrade to 2.0.4, on Yukon, June 22
ARSC users, read news UNICOSmk_2.0.4 on yukon for more information.
Visiting Alaska? ARSC Summer Tours, Wednesdays at 2:00
ARSC staff are giving tours to the general public at 2:00 every Wednesday, through Aug. 25. Just show up. Bring your uncle and your kids!
The tours start in the basement of the Butrovich building, at the kiosk in front of the machine room viewing windows.
We start with an introduction to ARSC, Arctic science and engineering, and the supercomputers. We then proceed into the video production and training lab, give an ImmersaDesk demo using weather, Alaska terrain, or other data sets, and give a wild-card demo suited to the group and the guide.
ARSC tours are part of UAF's larger tour program, which includes the Poker Flat Research Range (launching barium to the ionosphere), the Large Animal Research Station (muskoxen! Rudolph!), the Geophysical Institute (earthquakes, volcanoes, SAR) and the Botanical Garden (how will Alaska feed itself after Y2K? Can you say, "Sauerkraut & Spuds?").
For UAF tour schedules, see:
http://www.uaf.edu/univrel/Tour/tours.html
Restarting Previously Checkpointed NQS Jobs After a Crash
In issue #164 ( /arsc/support/news/t3enews/t3enews164/index.xml ), we ran the article "Restart" vs "Rerun" and Preventing the Latter, and defined these terms:
restart:
the NQS request runs for a while, is held, and a checkpoint image is created. Later, using the checkpoint image, it starts up again from the point at which it was interrupted.
rerun:
the request runs for a while and is interrupted by a crash. Later, it starts again FROM THE BEGINNING.
Occasionally, we experience a third situation, as follows:
A job runs for a while, is held and checkpointed, is later released, restarts, runs for a while, but then, while the job is still running, the system crashes!
To understand what happens next, you must know that NQS always saves the latest checkpoint image until the job exits the system.
Thus, when the T3E is rebooted, NQS discovers the old checkpoint image and, instead of rerunning from the beginning, restarts from that image. For clarity, here's a depiction of the sequence, where:
______ indicates job is running,
...... indicates job not running:

job               check-      1st       crash        2nd
starts            pointed     restart   ;-(          restart
__________________............__________.............______->
( Wall-Clock Time increasing to right )
By reusing the old checkpoint file, NQS has saved all the initial work. From the program's point of view, here's the sequence:
job                      check-
starts                   pointed
_________________________
                         1st       crash
                         restart   ;-(
                         __________
                         2nd
                         restart
                         _____________________________->
( Simulated Time increasing to right )
This seems good, but, depending on the code, trouble may lurk in the repeated section of the run:
- If the code continuously updates a results file (on every iteration, for instance), it may duplicate some output, thus corrupting the file.
- If the code uses its own system of restart files to create contiguous runs, then some of the work completed between "1st restart" and "crash" might have been salvageable. This effort will have been wasted.
- Other??? Again, code dependent.
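One way to guard against the first problem is to make each write to the results file safe to repeat. What follows is a minimal sh sketch, not taken from any ARSC or user code: the function name record_result, the RESULTS variable, and the "iteration value" one-line-per-iteration file format are all assumptions made for illustration. It appends a line only if that iteration number has not already been recorded, so a repeated section of the run leaves the file unchanged.

```shell
#!/bin/sh
# Hypothetical results file; override by setting RESULTS before calling.
RESULTS=${RESULTS:-results.txt}

# record_result ITER VALUE
# Append "ITER VALUE" to $RESULTS unless a line for ITER already exists.
# Assumes ITER is a plain integer (it is used inside a grep pattern).
record_result() {
    iter=$1
    value=$2
    # A line starting with "ITER " means this iteration was written
    # before the restart, so skip the duplicate write.
    if [ -f "$RESULTS" ] && grep -q "^${iter} " "$RESULTS"; then
        return 0
    fi
    printf '%s %s\n' "$iter" "$value" >> "$RESULTS"
}
```

A job script might call "record_result 42 0.731" once per iteration. The grep rescans the whole file on each call, which is fine for short results files but may be too slow for very long ones.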
Your individual control lies in two qsub flags (and their qalter analogues):
#QSUB -nr <or> qalter -r n
This prevents the job from ever rerunning from the beginning. This wouldn't prevent the situation described above because the 2nd restart is from a checkpoint file, not from the beginning.
#QSUB -nc <or> qalter -c n
This blocks any attempt to checkpoint the job. This would prevent the situation, but would also force the system administrators to kill your job every time they would normally just hold it. Jobs are held at scheduled test-time, and also whenever the system is dedicated to very large jobs -- which, at ARSC, happens every evening and sometimes during the day (see: news chkpnt_sched ).
We don't recommend use of qsub -nc.
The more common problem is the simple rerun. Thus, we do recommend use of "qsub -nr", unless you're confident that your job can repeat portions of its run safely.
The qalter command can be executed from within the executable portion of your qsub script, or from the command line, to temporarily disallow checkpointing or restarting. See: man qalter .
New TOP500 list published
http://www.top500.org/lists/1999/06/top500.list.html shows that, after recent upgrades, ARSC is now listed as number 45, alongside several other sites with the same equipment.
The list of 500 systems is well worth a glance. It covers an interesting mix of sites: universities, national labs, and commercial sites including banks, airlines, and online services. Worth a longer look is the commentary on the latest list and on how computing has changed since the first TOP500 list was published in 1993. This, plus much other useful information, can be found at: http://www.top500.org
Orbs On the Cray: Distributed Objects Programming
On Wednesday, June 23rd, from 1:30-2:30 pm, in Butrovich 109, Mr. Logan Colby will describe distributed programming, CORBA, and his work porting two different CORBA implementations to the Cray J90 at ARSC. Logan recently earned his MS in computer science at the University of Alaska Fairbanks.
Abstract:
Distributed programming is finally coming into its own, especially as described by the CORBA (Common Object Request Broker Architecture) specifications.
In this talk, Logan Colby will describe distributed computing and show how it is realized by CORBA. He'll present the results of his work during the last year porting two implementations of CORBA (DSTC's Fnorb and Xerox-Parc's ILU) to ARSC's Cray supercomputers. He'll also demonstrate a couple of applications developed using the Cray orbs.
Quick-Tip Q & A
A:{{ C header files always end in ".h". Why the different conventions
for Fortran? For instance:
include 'mpif.h'
include 'VT.inc'
include "mpp/shmem.fh"
}}
Here's a longer list of packages on the T3E which use the three different
conventions:
*.inc : netcdf, HDF, VAMPIR
*f.h : MPI, pghpf
*.fh : SHMEM, PAT
My search of Fortran texts and a number of web docs revealed no
suggestions for naming include files. And no one made it easy by
sending us "the answer." So... the question remains open.
Q: Sometimes I need to duplicate entire "branches" of an existing
directory tree. For instance, I might need this structure:
./v12/19990201/weather/Burmuda/restart/2301/
./v12/19990201/weather/Triangle/restart/2303/
As you might guess, it's really annoying to type these nine
commands:
mkdir ./v12
mkdir ./v12/19990201
mkdir ./v12/19990201/weather
mkdir ./v12/19990201/weather/Burmuda
mkdir ./v12/19990201/weather/Burmuda/restart
mkdir ./v12/19990201/weather/Burmuda/restart/2301
mkdir ./v12/19990201/weather/Triangle
mkdir ./v12/19990201/weather/Triangle/restart
mkdir ./v12/19990201/weather/Triangle/restart/2303
Is there an easier way?
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
