ARSC T3E Users' Newsletter 171, June 18, 1999

UNICOS/mk Upgrade to 2.0.4, on Yukon, June 22

ARSC users, read news UNICOSmk_2.0.4 on yukon for more information.

Visiting Alaska? ARSC Summer Tours, Wednesdays at 2:00

ARSC staff are giving tours to the general public at 2:00 every Wednesday, through Aug. 25. Just show up. Bring your uncle and your kids!

The tours start in the basement of the Butrovich building, at the kiosk in front of the machine room viewing windows.

We start with an introduction to ARSC, Arctic science and engineering, and the supercomputers. We then proceed into the video production and training lab, give an ImmersaDesk demo using weather, Alaska terrain, or other data sets, and give a wild-card demo suited to the group and the guide.

ARSC tours are part of UAF's larger tour program, which includes the Poker Flat Rocket Range (launching barium into the ionosphere), the Large Animal Research Station (muskoxen! Rudolph!), the Geophysical Institute (earthquakes, volcanoes, SAR), and the Botanical Gardens (how will Alaska feed itself after Y2K? Can you say, "Sauerkraut & Spuds?").

For UAF tour schedules, see:

http://www.uaf.edu/univrel/Tour/tours.html

Restarting Previously Checkpointed NQS Jobs After a Crash

In issue #164 (/arsc/support/news/t3enews/t3enews164/index.xml), we ran the article "Restart" vs. "Rerun" and Preventing the Latter, and defined these terms:

restart:

the NQS request runs for a while, is held, and a checkpoint image is created. Later, using the checkpoint image, it starts up again from the point at which it was interrupted.

rerun:

the request runs for a while and is interrupted by a crash. Later, it starts again FROM THE BEGINNING.

Occasionally, we experience a third situation, as follows:

A job runs for a while, is held and checkpointed, is later released, restarts, runs for a while, but then, while the job is still running, the system crashes!

To understand what happens next, you must know that NQS always saves the latest checkpoint image until the job exits the system.

Thus, when the T3E is rebooted, NQS discovers the old checkpoint image and, instead of rerunning from the beginning, restarts from that image. For clarity, here's a depiction of the sequence, where:


      ______   indicates job is running,
      ......   indicates job not running:


   job              check-         1st       crash          2nd
  starts            pointed      restart      ;-(         restart

    __________________............__________.............______->

               ( Wall-Clock Time increasing to right )

By reusing the old checkpoint file, NQS has saved all the initial work. From the program's point of view, here's the sequence:



   job                     check-
  starts                   pointed
    
_________________________


                             1st       crash
                           restart      ;-(
                              
__________


                             2nd             
                           restart                
                              
_____________________________->


    
( Simulated Time increasing to right )
This seems good, but, depending on the code, trouble may lurk in the repeated section of the run:
  1. If the code continuously updates a results file (on every iteration, for instance), it may duplicate some output, thus corrupting the file.
  2. If the code uses its own system of restart files to create contiguous runs, then some of the work completed between "1st restart" and "crash" might have been salvageable. This effort will have been wasted.
  3. Other??? Again, code dependent.
The combination of events leading to this situation is quite uncommon. This is fortunate, because there's no good way for you to prevent it. You'll likely be relying on the T3E system administrators to notice it, manually hold your job after the crash, and contact you for a decision on the 2nd restart. They will ask you either to qdel the job yourself or to approve releasing it so that it proceeds with the restart from where it was checkpointed.

Your individual control lies in two qsub flags (and their qalter analogues):

#QSUB -nr <or> qalter -r n

This prevents the job from ever rerunning from the beginning. This wouldn't prevent the situation described above because the 2nd restart is from a checkpoint file, not from the beginning.

#QSUB -nc <or> qalter -c n

This blocks any attempt to checkpoint the job. This would prevent the situation, but would also force the system administrators to kill your job every time they would normally just hold it. This happens at scheduled test-time. It also happens when the system is dedicated to very large jobs -- which, at ARSC, happens every evening and sometimes during the day (see: news chkpnt_sched ).

We don't recommend use of qsub -nc.

The more common problem is the simple rerun. Thus, we do recommend use of "qsub -nr", unless you're confident that your job can repeat portions of its run safely.
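
For example, here is a minimal sketch of a qsub script header using this flag. Only the -nr line is the point; the queue name, resource limits, working-directory variable, and program name are illustrative assumptions, so substitute the values appropriate for your own jobs (see man qsub):

      # NQS directives (queue name and limits below are hypothetical):
      #QSUB -q mpp
      #QSUB -l mpp_p=32
      #QSUB -l mpp_t=4:00:00
      #QSUB -nr

      # Executable portion of the script:
      cd $QSUB_WORKDIR          # assumes NQS's usual working-directory variable
      ./myprog                  # hypothetical executable name

With -nr set, a crash will not cause the request to be rerun from the beginning; as described above, a restart from an existing checkpoint image can still occur.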

The qalter command can be executed from within the executable portion of your qsub script, or from the command line, to temporarily disallow checkpointing or rerunning. See: man qalter.
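
For example (a sketch; the request-id form and the QSUB_REQID environment variable are assumptions to verify against man qstat and man qsub on yukon):

      # From the command line, using a request-id reported by qstat
      # (the id below is hypothetical):
      qalter -r n 12345.yukon

      # Or from within the executable portion of the qsub script itself,
      # assuming NQS exports this request's id in QSUB_REQID:
      qalter -r n $QSUB_REQID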

New TOP500 list published

http://www.top500.org/lists/1999/06/top500.list.html shows that after recent upgrades, ARSC is now listed as number 45 alongside several other sites with the same equipment.

This list of 500 systems is well worth a glance through: it covers an interesting collection of sites, from universities and national labs to commercial operations such as banks, airlines, and online services. Worthy of longer inspection is some very interesting commentary on the latest list and on how computing has changed since the first TOP500 list was published in 1993. This, plus much other useful information, can be found at: http://www.top500.org

Orbs On the Cray: Distributed Objects Programming

On Wednesday, June 23rd, from 1:30-2:30 pm, in Butrovich 109, Mr. Logan Colby will describe distributed programming, CORBA, and his work porting two different CORBA implementations to the Cray J90 at ARSC. Logan recently earned his MS in computer science at the University of Alaska Fairbanks.

Abstract:

Distributed programming is finally coming into its own, especially as described by the CORBA (Common Object Request Broker Architecture) specifications.

In this talk, Logan Colby will describe distributed computing and show how it is realized by CORBA. He'll present the results of his work during the last year porting two implementations of CORBA (DSTC's Fnorb and Xerox PARC's ILU) to ARSC's Cray supercomputers. He'll also demonstrate a couple of applications developed using the Cray ORBs.

Quick-Tip Q & A



A:{{ C header files always end in ".h".  Why the different conventions
     for Fortran?  For instance:

      include 'mpif.h'
      include 'VT.inc'
      include "mpp/shmem.fh"
  }}



  Here's a longer list of packages on the T3E which use the three different
  conventions:
    *.inc  :  netcdf, HDF, VAMPIR
    *f.h   :  MPI, pghpf
    *.fh   :  SHMEM, PAT

  My search of Fortran texts and a number of web docs revealed no
  suggestions for naming include files.  And no one made it easy by
  sending us "the answer."  So... the question remains open.




Q:  Sometimes I need to duplicate entire "branches" of an existing
    directory tree.  For instance, I might need this structure:

      ./v12/19990201/weather/Bermuda/restart/2301/
      ./v12/19990201/weather/Triangle/restart/2303/

    As you might guess, it's really annoying to type these nine
    commands:

      mkdir ./v12
      mkdir ./v12/19990201
      mkdir ./v12/19990201/weather
      mkdir ./v12/19990201/weather/Bermuda
      mkdir ./v12/19990201/weather/Bermuda/restart
      mkdir ./v12/19990201/weather/Bermuda/restart/2301
      mkdir ./v12/19990201/weather/Triangle
      mkdir ./v12/19990201/weather/Triangle/restart
      mkdir ./v12/19990201/weather/Triangle/restart/2303

    Is there an easier way?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.