ARSC T3E Users' Newsletter 135, January 23, 1998

Job Migration

ARSC upgraded yukon's OS to UNICOS/mk 2.0 on Dec. 11, 1997. This version provides reliable migration of jobs within the torus. We use this facility regularly.

Migration is the process of shifting an executing job from one block of PEs to a different block of PEs elsewhere on the torus. When a migrate command is issued, the system shuts the job down, determines its best possible new location (although the sysadmin has the option of specifying a target location), uses the T3E's fast internal network to copy process images from their original logical PEs to the corresponding destination PEs, and restarts the job in its new location. Only one job can undergo migration at a time.

There is one primary use of migration:

To reduce fragmentation of available portions of the torus, gathering as many available application PEs as possible into a single contiguous block.

UNICOS/mk requires that the logical PEs used by a single application be contiguous within the torus. Thus, it is possible for a job to be blocked, not because insufficient PEs are available, but because the available PEs are scattered about. Migration of all executing jobs into one contiguous block within the torus can allow such a job to run in the freed region.

For instance, here's the current job configuration on yukon (at the moment of writing):


       UID  USERNAME SIZE BasePE      COMMAND       
       ==== ======== ==== =========== ==============
   A - 4336 user1    40   0   [    0] a.out                         
   B - 4262 user2    10   44  [ 0x2c] a.out                         
   C - 4162 user3    20   54  [ 0x36] a.out                         
  --------------------------------------------------
    0   A  A  A  A  A  A  A  A
    8   A  A  A  A  A  A  A  A
   16   A  A  A  A  A  A  A  A
   24   A  A  A  A  A  A  A  A
   32   A  A  A  A  A  A  A  A
   40   -  -  -  -  B  B  B  B
   48   B  B  B  B  B  B  C  C
   56   C  C  C  C  C  C  C  C
   64   C  C  C  C  C  C  C  C
   72   C  C  -  -  -  -  -  -
   80   -  -  -  -  -  -  -  -
   88   -  -  -  -  -  -  -  -
   96   -  -  -  -  -  -  -  -
  --------------------------------------------------

Imagine the configuration if jobs A and C terminated at once. We'd be left with this:


  --------------------------------------------------
    0   -  -  -  -  -  -  -  -
    8   -  -  -  -  -  -  -  -
   16   -  -  -  -  -  -  -  -
   24   -  -  -  -  -  -  -  -
   32   -  -  -  -  -  -  -  -
   40   -  -  -  -  B  B  B  B
   48   B  B  B  B  B  B  -  -
   56   -  -  -  -  -  -  -  -
   64   -  -  -  -  -  -  -  -
   72   -  -  -  -  -  -  -  -
   80   -  -  -  -  -  -  -  -
   88   -  -  -  -  -  -  -  -
   96   -  -  -  -  -  -  -  -
  --------------------------------------------------

On yukon, logical PEs 88-104 are OS and CMD PEs, so the largest contiguous block of available application PEs would be PEs 0-43, a block of size 44. Any job requiring more than 44 PEs would be blocked, even though 78 application PEs (44 + 34) are available in total.
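The arithmetic behind those two numbers can be checked with a short sketch. It is only an illustration in Python, not anything the system runs: the function name free_blocks and the dictionary layout are made up for this example, and it assumes the application PEs are logical PEs 0-87, with job B at base PE 44 and size 10, as in the map above.

```python
def free_blocks(jobs, n_app_pes):
    """Return (base, size) for each maximal run of free application PEs.

    jobs maps a job label to (base PE, size); n_app_pes is the number of
    application PEs, assumed to be numbered 0 .. n_app_pes - 1.
    """
    used = [False] * n_app_pes
    for base, size in jobs.values():
        for pe in range(base, base + size):
            used[pe] = True

    blocks, start = [], None
    # Scan one position past the end so a trailing free run is closed off.
    for pe in range(n_app_pes + 1):
        is_free = pe < n_app_pes and not used[pe]
        if is_free and start is None:
            start = pe
        elif not is_free and start is not None:
            blocks.append((start, pe - start))
            start = None
    return blocks


N_APP_PES = 88                 # assumption: logical PEs 0-87 run applications
jobs = {"B": (44, 10)}         # only job B is left after A and C terminate

blocks = free_blocks(jobs, N_APP_PES)
total_free = sum(size for _, size in blocks)
largest = max(size for _, size in blocks)

print(blocks)       # free runs as (base, size) pairs
print(total_free)   # all free application PEs
print(largest)      # biggest job that could start without migration
```

The gap between total_free and largest is exactly the fragmentation that migration is meant to remove.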

Migration to the rescue!


  --------------------------------------------------
    0   B  B  B  B  B  B  B  B
    8   B  B  -  -  -  -  -  -
   16   -  -  -  -  -  -  -  -
   24   -  -  -  -  -  -  -  -
   32   -  -  -  -  -  -  -  -
   40   -  -  -  -  -  -  -  -
   48   -  -  -  -  -  -  -  -
   56   -  -  -  -  -  -  -  -
   64   -  -  -  -  -  -  -  -
   72   -  -  -  -  -  -  -  -
   80   -  -  -  -  -  -  -  -
   88   -  -  -  -  -  -  -  -
   96   -  -  -  -  -  -  -  -
  --------------------------------------------------

By migrating job B to logical PE 0, we have created a block of 78 contiguous, available application PEs. Thus, we could run a job from the xlarge or Qxlarge queue:


  --------------------------------------------------
    0   B  B  B  B  B  B  B  B
    8   B  B  C  C  C  C  C  C
   16   C  C  C  C  C  C  C  C
   24   C  C  C  C  C  C  C  C
   32   C  C  C  C  C  C  C  C
   40   C  C  C  C  C  C  C  C
   48   C  C  C  C  C  C  C  C
   56   C  C  C  C  C  C  C  C
   64   C  C  C  C  C  C  C  C
   72   C  C  C  C  C  C  C  C
   80   C  C  C  C  C  C  C  C
   88   -  -  -  -  -  -  -  -
   96   -  -  -  -  -  -  -  -
  --------------------------------------------------
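The effect of the migration in the maps above can be modelled as packing the running jobs toward logical PE 0. The Python sketch below is a toy model only. The name compact is hypothetical, it assumes 88 application PEs (0-87), and it says nothing about how UNICOS/mk actually picks a destination (the system chooses its own "best possible" location, and only one job migrates at a time).

```python
def compact(jobs):
    """Pack jobs toward logical PE 0, keeping their current order.

    jobs maps a job label to (base PE, size).  Returns the new layout
    and the first free PE after the packed jobs.
    """
    packed, next_base = {}, 0
    # Visit jobs in order of their current base PE.
    for name, (base, size) in sorted(jobs.items(), key=lambda kv: kv[1][0]):
        packed[name] = (next_base, size)
        next_base += size
    return packed, next_base


N_APP_PES = 88                 # assumption: logical PEs 0-87 run applications
jobs = {"B": (44, 10)}         # job B before migration

packed, first_free = compact(jobs)
largest_free = N_APP_PES - first_free

print(packed)         # job B now starts at logical PE 0
print(largest_free)   # one contiguous free block after compaction
```

After packing, every free application PE lies in a single block, so the largest job that can start equals the total number of free PEs.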

The migration occurs memory-to-memory, entirely within the T3E's internal network, and is fast. However, you might see it in progress if you execute grmview at the right moment. The grmview Note field, new in UNICOS/mk 2.0, indicates, among other things, whether a job is currently being migrated. For instance:


  yukon$ grmview -q
  
  Exec Queue: 3 entries total. 1 running, 0 queued
    uid   gid  acid    Label Size  BasePE   ApId Command       Note    
   4262  4207  4207     -      10    -     69116 a.out       Migrating    

At ARSC, we have a cron job that analyzes GRM output and NQS loading, looking for holes in the PE allocation. Under the right circumstances, it will call on the system to migrate jobs.
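One plausible test for "the right circumstances" is sketched below in Python. This is not ARSC's script, and the function name and arguments are invented for illustration. The idea is simply that migration is worth requesting only when a waiting job would fit in the free PEs taken together but does not fit in any single free block.

```python
def migration_would_help(requested_pes, free_block_sizes):
    """Return True if compaction could let a waiting job start.

    requested_pes:     PEs needed by the waiting job
    free_block_sizes:  sizes of the current contiguous free blocks
    """
    total_free = sum(free_block_sizes)
    largest = max(free_block_sizes, default=0)
    # Fits overall, but not in any one contiguous block.
    return largest < requested_pes <= total_free


# Free blocks of 44 and 34 PEs, as in the fragmented example above.
print(migration_would_help(64, [44, 34]))   # fits only after compaction
print(migration_would_help(40, [44, 34]))   # already fits in the 44-PE block
print(migration_would_help(80, [44, 34]))   # more than all free PEs combined
```

A real script would also have to weigh the cost of stopping and copying the jobs being moved, which this check ignores.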

Debugging Tips For F90 Programmers

Your program dumps core or garbage, and you don't know why.

The first tool to consider is the compiler itself. Cray's Fortran compilers are rich with options to help you isolate problems. Here are our favorites:


f90 -g   

    -g is the "debugging" flag.  The compiler includes information 
    in the executable so that you may:
  
    1) Do post-mortem analysis of core files.  After crashing, run:
      debugview -c core <Executable Name>.  See "man debugview."
    
    2) Run the executable from the debugger, "totalview." This lets you
      (among other things) execute your program one step at a time and
      check the values of variables on different PEs.
  
f90 -ei  

    -ei  generates a run-time error when an uninitialized local real 
    or integer variable is used in a floating-point operation or array
    subscript.

f90 -eI 

    -eI  applies the IMPLICIT NONE constraint to all files.  This will
    catch typos that would otherwise be silently accepted as new
    variables under Fortran's implicit typing rules.

f90 -R abcns

    -R abcns  Enables several run-time checks of your program.

               The runchk arguments are as follows:

               runchk    Checking performed

               a         Compares the number and types of arguments passed
                         to a procedure with the number expected.

               b         Enables checking of array bounds.

               c         Enables conformance checking of array operands in
                         array expressions.

               n         Compares the number of arguments passed to a
                         procedure with the number expected.  Does not make
                         comparisons with regard to argument data type (see
                         -R a).

               s         Enables checking of character substring bounds.


f90 -rl

    -rl  enables cflint.  The cflint command checks Fortran programs
    for program constructs that may need further investigation (for
    example, arguments that are never used or local variables used
    before they are defined).  Here's the type of output you'll get:

      85) <42>  No references to the INCLUDEd PARAMETER "MPI_CART"
      86) <11>  No references to COMMON /CPROCI/ value
      89) <2>   Local Variable "MPIREQ" is declared but never used
      91) <1>   Local Variable "MYPE" may be used before it is 
               assigned a value

    

f90 -r0

    -r0  enables cflist. The cflist command lists a Fortran program
    with cross-references, loop and parallel indicators, and reports
    from a static call-tree analysis.  It includes the cflint output
    and more.  For example:

      f90 Compiler - 3 messages:
  
       1) <cf90-6001,Scalar,Line=51> An exponentiation was replaced by
       optimization.  This may cause numerical differences.
    
       2) <cf90-6001,Scalar,Line=52> An exponentiation was replaced by
       optimization.  This may cause numerical differences.
    
       3) <cf90-1110,Warning,Line=75> DOUBLE PRECISION is not supported
       on this platform.  REAL will be used.

   And:


            LastMod:  14:17 Thu Nov06,1997
        Compiled by:  f90

             1  Subprogram
             0  Syntax errors
             2  COMMON Blocks
            10  Externals
             1  INCLUDE file (1 Ref)
            85  Lines  (2321 Chars)
            52  Statements
                   32  Executable Statements
                    7  Assignment Statements
                    4  DO-Loops
                   13  CALL Statements
                    4  I/O Statements
                    3  IF Statements
                    0  CASE Constructs
                    1  STOP Statement
                    0  GOTO Statements
             4  DO-Loops
                   (   0 Vector         0 Parallel)
                   (   2 Innermost   0.0% Vector  )
                   (   3 MaxNest     1.75 AvgNest )
                   (  31 MaxLines   20.00 AvgLines)
             1  Statement Label
                   (   0 DO-Terms       1 FORMAT  )
            25  Comments (251 Chars, 10.8%)
             0  Directives
            92  Subprogram considerata - from 1 Subprogram
             0  Global call-chain considerata


f90 -Omsgs 

    -Omsgs writes optimization messages during compilation.  

    It is important to remember that several of the above options (-g,
    for instance) inhibit optimization. Thus, beware of bugs that
    occur only with optimization enabled. These may be due to the
    compiler itself or to minor numerical differences caused by the
    reordering of operations.

    For information on what the compiler is doing as it optimizes your
    code, use the -Omsgs option. For instance, f90 -Omsgs -O3 prog.f:


            b1(n)=n*2.5
     cf90-6009 f90: SCALAR PROG1, File = prog1.f, Line = 27
      A floating point expression involving an induction variable was
      strength reduced by optimization.  This may cause numerical
      differences.

          b1b(1:nsize)=b1
     cf90-6004 f90: SCALAR PROG1, File = prog1.f, Line = 35
      A loop starting at line 35 was fused with the loop starting at
      line 34.

          xresp=1.0/x
     cf90-6010 f90: SCALAR PROG1, File = prog1.f, Line = 146
      A divide was turned into a multiply by a reciprocal

This output shows that two loops have been fused and that two scalar optimizations have been applied: a strength reduction, and the replacement of a divide by a multiply by the reciprocal. (On the Alpha processor, a floating-point divide is much slower than a multiply.) Note also the warning about possible numerical differences.

Book Review:

"Debugging and Performance Tuning for Parallel Computing Systems"

[ This article initiates a new, semi-regular feature of the T3E Newsletter, reviews of what we hope are relevant books. ]


Title:    Debugging and Performance Tuning for Parallel Computing Systems.
Editors:  Simmons, Hayes, Brown and Reed.
Published by IEEE, ISBN 0-8186-7412-1

This book presents a collection of papers from a workshop at which tool developers met with both vendors and application programmers to discuss the state of the art. The papers are usefully grouped into research, vendor, and application sections. The introduction notes that these groups speak different languages and that one goal of the meeting was to bring them together to develop real commercial tools that solve the problems hindering the exploitation of highly parallel computing.

The first group of papers presents research work on tools and on how to deal with the large volume of information generated when debugging and analyzing programs on teraflop-scale computers. These are mostly of use to those planning to develop tools of their own or to monitor their own programs internally. Several researchers comment on the perceived lack of interest in tool usage amongst application programmers and present this as the motivation for improvements to simplify use.

The second group of papers comes from vendors and describes the various products available. Since the workshop was held, some vendors and tools have changed significantly, and it is interesting to see the changes.

In the final sections, programmers present their experiences of using tools with large applications. The papers range from a list of guidelines for tool usage, together with the requirements of an admittedly skilled and knowledgeable programming team working on a single set of applications, to the tools used by an analyst at a computer center who might be faced with a wide range of applications to assess. Here the discussion broadens into general project management issues such as source code management and validation.

In a series of short sections toward the end of the book, a number of topics associated with MPP systems are discussed. A particularly useful section details the differences between workstation clusters and MPP systems, both in terms of performance and debugging/tuning needs.

The work presented is selected from numerous projects of the past five to ten years. It shows both the improvements made in this area and that finding bugs and tuning programs still rely on a set of fundamental principles and skills, the careful and conscientious application of which leads to successful programming.

Several of the featured tools are acknowledged as making life easier for the programmer, and the need for a consistent interface across platforms is recognized. The editors conclude that there is still much work needed in the development and acceptance of tools within the parallel programming community.

Quick-Tip Q & A


A: {{ In F90, how can you check whether an ALLOCATE has completed
      correctly or prevent your program from halting if it attempts to
      claim more memory than is available? }}

ALLOCATE and DEALLOCATE can return error status. For example,

       ALLOCATE (a(100),STAT=istatus)

All is well if istatus is zero. Positive values indicate that there has
been an error. If there is no STAT= variable, the program terminates
when an error is encountered.

The numerical value returned can be decoded using "explain." For example,
the two most likely error numbers are 1205 and 1411. These decode as:

  yukon% explain lib-1205
  
    The program was unable to request more memory space.

    A routine in the run-time library was unable to request additional
    memory space.  This usually occurs when the process has exceeded
    its memory allocation.

    Review the use of memory space by your program, and if possible,
    reduce its memory use.

    See the description of the limit(1) command in the UNICOS User
    Commands Reference Manual, publication SR-2011.

    The error class is UNRECOVERABLE (issued by the run-time library).


  yukon% explain lib-1411
  
    An allocatable array in the ALLOCATE statement is already
    allocated.

    An allocatable array in an ALLOCATE statement must not be currently
    allocated.  The ALLOCATED intrinsic function can be used to
    determine if the allocatable array is currently allocated.

    See the description of the ALLOCATE statement in the Fortran
    reference manual.

    The error class is UNRECOVERABLE (issued by the run-time library).

Q: Say you submit a single, 8-hour batch request using qsub and that the 
   qsub script launches a series of short jobs. Is there a way to
   prevent one of the short jobs from using more than its share of the
   8 hours?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.