ARSC T3E Users' Newsletter 135, January 23, 1998
Job Migration
ARSC upgraded yukon's OS to UNICOS/mk 2.0 on Dec. 11, 1997. This version provides reliable migration of jobs within the torus. We use this facility regularly.
Migration is the process of shifting an executing job from one block of PEs to a different block elsewhere on the torus. When a migrate command is issued, the system stops the job, determines its best possible new location (the sysadmin may instead specify a target location), uses the T3E's fast internal network to copy process images from their original logical PEs to the corresponding destination PEs, and restarts the job in its new location. Only one job can undergo migration at a time.
There is one primary use of migration:
To reduce fragmentation of available portions of the torus, gathering as many available application PEs as possible into a single contiguous block.
UNICOS/mk requires that the logical PEs used by a single application be contiguous within the torus. Thus, it is possible for a job to be blocked, not because insufficient PEs are available, but because the available PEs are scattered about. Migration of all executing jobs into one contiguous block within the torus can allow such a job to run in the freed region.
For instance, here's the current job configuration on yukon (at the moment of writing):
     UID   USERNAME  SIZE  BasePE        COMMAND
     ====  ========  ====  ===========   ==============
A -  4336  user1       40   0 [    0]    a.out
B -  4262  user2       10  44 [ 0x2c]    a.out
C -  4162  user3       20  54 [ 0x36]    a.out
--------------------------------------------------
0
A....A....A....A....A....A....A....A..
8
..A....A....A....A....A....A....A....A..
16
..A....A....A....A....A....A....A....A..
24
..A....A....A....A....A....A....A....A..
32
..A....A....A....A....A....A....A....A
40
-....-....-....- B....B....B....B..
48
..B....B....B....B....B....B C....C..
56
..C....C....C....C....C....C....C....C..
64
..C....C....C....C....C....C....C....C..
72
..C....C -....-....-....-....-....-..
80
..-....-....-....-....-....-....-....-..
88
..-....-....-....-....-....-....-....-..
96
..-....-....-....-....-....-....-....-
--------------------------------------------------
Imagine the configuration if jobs A and C terminated at once. We'd be left with this:
--------------------------------------------------
0
-....-....-....-....-....-....-....-..
8
..-....-....-....-....-....-....-....-..
16
..-....-....-....-....-....-....-....-..
24
..-....-....-....-....-....-....-....-..
32
..-....-....-....-....-....-....-....-
40
-....-....-....- B....B....B....B..
48
..B....B....B....B....B....B -....-..
56
..-....-....-....-....-....-....-....-..
64
..-....-....-....-....-....-....-....-..
72
..-....-....-....-....-....-....-....-..
80
..-....-....-....-....-....-....-....-..
88
..-....-....-....-....-....-....-....-..
96
..-....-....-....-....-....-....-....-
--------------------------------------------------
On yukon, logical PEs 88-104 are OS and CMD PEs, so the largest contiguous block of available application PEs would be of size 44. Any job requiring more than 44 PEs would be blocked, although 78 application PEs are available.
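The arithmetic above can be checked with a short script. What follows is a minimal sketch in Python, not a tool that exists on yukon: the function name free_blocks and the (base PE, size) job list are illustrative assumptions, and the application region is taken to be logical PEs 0-87, as described above.

```python
# Hypothetical sketch: measure fragmentation of the application PEs.
# Assumption: application PEs are logical PEs 0..87 (88 PEs); jobs are
# given as (base_pe, size) pairs taken from the example above.

APP_PES = 88

def free_blocks(jobs, n_pes=APP_PES):
    """Return a list of (start, length) for each contiguous run of free PEs."""
    busy = [False] * n_pes
    for base, size in jobs:
        for pe in range(base, base + size):
            busy[pe] = True

    blocks = []
    start = None
    for pe in range(n_pes):
        if not busy[pe] and start is None:
            start = pe                      # a free run begins here
        elif busy[pe] and start is not None:
            blocks.append((start, pe - start))
            start = None
    if start is not None:                   # free run extends to the end
        blocks.append((start, n_pes - start))
    return blocks

# Only job B (10 PEs at base PE 44) remains after A and C terminate.
blocks = free_blocks([(44, 10)])
total_free = sum(length for _, length in blocks)
largest = max(length for _, length in blocks)
print(blocks)        # [(0, 44), (54, 34)]
print(total_free)    # 78
print(largest)       # 44
```

The gap between the two numbers (78 free in total, 44 in the largest block) is the fragmentation that migration removes.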
Migration to the rescue!
--------------------------------------------------
0
B....B....B....B....B....B....B....B..
8
..B....B....-....-....-....-....-....-..
16
..-....-....-....-....-....-....-....-..
24
..-....-....-....-....-....-....-....-..
32
..-....-....-....-....-....-....-....-..
40
..-....-....-....-....-....-....-....-..
48
..-....-....-....-....-....-....-....-..
56
..-....-....-....-....-....-....-....-..
64
..-....-....-....-....-....-....-....-..
72
..-....-....-....-....-....-....-....-..
80
..-....-....-....-....-....-....-....-..
88
..-....-....-....-....-....-....-....-..
96
..-....-....-....-....-....-....-....-
--------------------------------------------------
By migrating job B to logical PE 0, we have created a block of 78 contiguous, available application PEs. Thus, we could run a job from the xlarge or Qxlarge queue:
--------------------------------------------------
0
B....B....B....B....B....B....B....B..
8
..B....B C....C....C....C....C....C..
16
..C....C....C....C....C....C....C....C..
24
..C....C....C....C....C....C....C....C..
32
..C....C....C....C....C....C....C....C..
40
..C....C....C....C....C....C....C....C..
48
..C....C....C....C....C....C....C....C..
56
..C....C....C....C....C....C....C....C..
64
..C....C....C....C....C....C....C....C..
72
..C....C....C....C....C....C....C....C..
80
..C....C....C....C....C....C....C....C..
88
..-....-....-....-....-....-....-....-..
96
..-....-....-....-....-....-....-....-
--------------------------------------------------
The migration takes place memory-to-memory, entirely within the T3E's internal network, and is fast. You might still catch it in progress if you run grmview at the right moment. The grmview Note field is new in UNICOS/mk 2.0 and indicates, among other things, whether a job is currently being migrated. For instance:
yukon$ grmview -q
Exec Queue: 3 entries total. 1 running, 0 queued
uid gid acid Label Size BasePE ApId Command Note
4262 4207 4207 - 10 - 69116 a.out Migrating
At ARSC, we have a cron job which analyzes GRM output and NQS loading, seeking allocation holes. Under the right circumstances, it will call on the system to migrate jobs.
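The kind of test such a script might apply can be sketched as follows. This is a hypothetical illustration in Python, not ARSC's actual cron job: the function names, the (base PE, size) job representation, and the 88-PE application region are all assumptions, and nothing here calls a real GRM or NQS interface. The idea is that migration is worth requesting only when a waiting job would fit in the total free PEs but not in any single existing hole.

```python
# Hypothetical sketch of a "would migration help?" check.
# Assumptions: 88 application PEs (logical PEs 0..87); running jobs are
# non-overlapping (base_pe, size) pairs; nothing here talks to GRM or NQS.

APP_PES = 88

def largest_hole(jobs, n_pes=APP_PES):
    """Size of the largest contiguous run of free PEs."""
    largest = 0
    cursor = 0                       # first PE not yet accounted for
    for base, size in sorted(jobs):
        largest = max(largest, base - cursor)
        cursor = base + size
    return max(largest, n_pes - cursor)

def pack(jobs):
    """New (base_pe, size) layout with jobs packed toward PE 0, order kept."""
    packed = []
    cursor = 0
    for _, size in sorted(jobs):
        packed.append((cursor, size))
        cursor += size
    return packed

def migration_helps(jobs, request, n_pes=APP_PES):
    """True if `request` PEs fit only after packing the running jobs."""
    fits_now = largest_hole(jobs, n_pes) >= request
    fits_packed = largest_hole(pack(jobs), n_pes) >= request
    return fits_packed and not fits_now

running = [(44, 10)]                 # job B from the example above
print(migration_helps(running, 78))  # True: 78 PEs fit only after packing
print(migration_helps(running, 40))  # False: a 44-PE hole already exists
print(pack(running))                 # [(0, 10)]
```

Packing toward PE 0 is only one possible policy. Since only one job can migrate at a time, a real script would presumably also weigh how many jobs have to move.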
Debugging Tips For F90 Programmers
Your program dumps core or garbage, and you don't know why.
The first tool to consider is the compiler itself. Cray's Fortran compilers are rich with options to help you isolate problems. Here are our favorites:
f90 -g
-g is the "debugging" flag. The compiler includes information
in the executable so that you may:
1) Do post-mortem analysis of core files. After crashing, run:
debugview -c core <Executable Name>. See "man debugview."
2) Run the executable from the debugger, "totalview." This lets you
(among other things) execute your program 1 step at a time and check
the values of variables on different PEs.
f90 -ei
-ei generates a run-time error when an uninitialized local real
or integer variable is used in a floating-point operation or array
subscript.
f90 -eI
-eI applies the IMPLICIT NONE constraint to all files. This will
catch typos that would otherwise be silently accepted as new
variables under Fortran's implicit typing rules.
f90 -R abcns
-R abcns enables several run-time checks of your program.
The runchk arguments are as follows:

  runchk   Checking performed
  a        Compares the number and types of arguments passed
           to a procedure with the number expected.
  b        Enables checking of array bounds.
  c        Enables conformance checking of array operands in
           array expressions.
  n        Compares the number of arguments passed to a
           procedure with the number expected. Does not make
           comparisons with regard to argument data type (see
           -R a).
  s        Enables checking of character substring bounds.
f90 -rl
-rl enables cflint. The cflint command checks Fortran programs
for program constructs that may need further investigation (for
example, arguments that are never used or local variables used
before they are defined). Here's the type of output you'll get:
85) <42> No references to the INCLUDEd PARAMETER "MPI_CART"
86) <11> No references to COMMON /CPROCI/ value
89) <2> Local Variable "MPIREQ" is declared but never used
91) <1> Local Variable "MYPE" may be used before it is
assigned a value
f90 -r0
-r0 enables cflist. The cflist command lists a Fortran program
with cross-references, loop and parallel indicators, and reports
from a static call-tree analysis. It includes the cflint output
and more. For example:
f90 Compiler - 3 messages:
1) <cf90-6001,Scalar,Line=51> An exponentiation was replaced by
optimization. This may cause numerical differences.
2) <cf90-6001,Scalar,Line=52> An exponentiation was replaced by
optimization. This may cause numerical differences.
3) <cf90-1110,Warning,Line=75> DOUBLE PRECISION is not supported
on this platform. REAL will be used.
And:
LastMod: 14:17 Thu Nov06,1997
Compiled by: f90
1 Subprogram
0 Syntax errors
2 COMMON Blocks
10 Externals
1 INCLUDE file (1 Ref)
85 Lines (2321 Chars)
52 Statements
32 Executable Statements
7 Assignment Statements
4 DO-Loops
13 CALL Statements
4 I/O Statements
3 IF Statements
0 CASE Constructs
1 STOP Statement
0 GOTO Statements
4 DO-Loops
( 0 Vector 0 Parallel)
( 2 Innermost 0.0% Vector )
( 3 MaxNest 1.75 AvgNest )
( 31 MaxLines 20.00 AvgLines)
1 Statement Label
( 0 DO-Terms 1 FORMAT )
25 Comments (251 Chars, 10.8%)
0 Directives
92 Subprogram considerata - from 1 Subprogram
0 Global call-chain considerata
f90 -Omsgs
-Omsgs writes optimization messages during compilation.
Remember that several of the above options (-g, for instance)
inhibit optimization. Thus, beware of bugs that occur only with
optimization enabled. These may be due to the compiler itself or to
minor numerical differences caused by the reordering of operations.
For information on what the compiler is doing as it optimizes your
code, use the -Omsgs option. For instance, f90 -Omsgs -O3 prog1.f
produces:
b1(n)=n*2.5
cf90-6009 f90: SCALAR PROG1, File = prog1.f, Line = 27
A floating point expression involving an induction variable was
strength reduced by optimization. This may cause numerical
differences.
b1b(1:nsize)=b1
cf90-6004 f90: SCALAR PROG1, File = prog1.f, Line = 35
A loop starting at line 35 was fused with the loop starting at
line 34.
xresp=1.0/x
cf90-6010 f90: SCALAR PROG1, File = prog1.f, Line = 146
A divide was turned into a multiply by a reciprocal
This output shows that two loops have been fused and that two scalar optimizations have been applied: a strength reduction and the replacement of a divide by a multiply by a reciprocal. (The Alpha processor has no divide hardware.) Note also the warning about possible numerical differences.
Book Review:
"Debugging and Performance Tuning for Parallel Computing Systems"
[ This article initiates a new, semi-regular feature of the T3E Newsletter, reviews of what we hope are relevant books. ]
Title: Debugging and Performance Tuning for Parallel Computing Systems. Editors: Simmons, Hayes, Brown and Reed. Published by IEEE, ISBN 0-8186-7412-1.
This book presents a collection of papers from a workshop at which tool developers met with vendors and application programmers to discuss the state of the art. The papers are usefully grouped into research, vendor, and application sections. The introduction notes that these groups speak different languages, and that one goal of the meeting was to bring them together to develop real commercial tools for the problems of exploiting highly parallel computing.
The first group of papers presents research on tools and on proposals for handling the large volume of information generated when debugging and analyzing programs on teraflop-scale computers. These papers are mostly of use to those planning to develop tools of their own or to monitor their own programs internally. Several researchers comment on the perceived lack of interest in tools among application programmers, and present this as the motivation for making tools simpler to use.
The second group of papers is from vendors and describes the various products available. Since the workshop was held, some vendors and tools have changed significantly, and it is interesting to see the changes.
In the final sections, programmers present their experiences of using tools with large applications. The papers range from a list of tool-usage guidelines and requirements, drawn up by an admittedly skilled and knowledgeable programming team working on a single set of applications, to a description of the tools used by an analyst at a computer center who might be asked to assess a wide range of applications. Here the discussion broadens into general project management issues such as source code management and validation.
In a series of short sections toward the end of the book, a number of topics associated with MPP systems are discussed. A particularly useful section details the differences between workstation clusters and MPP systems, both in performance and in debugging and tuning needs.
The work presented is selected from numerous projects of the past five to ten years. This serves to show both the improvements made in this area and that finding bugs and tuning programs still rest on a set of fundamental principles, the careful and conscientious application of which leads to successful programming.
Several of the tools featured are credited with making life easier for the programmer, and the need for a consistent interface across platforms is acknowledged. The editors conclude that much work is still needed in the development and acceptance of tools within the parallel programming community.
Quick-Tip Q & A
A: {{ In F90, how can you check whether an ALLOCATE has completed
correctly or prevent your program from halting if it attempts to
claim more memory than is available? }}
ALLOCATE and DEALLOCATE can return error status. For example,
ALLOCATE (a(100),STAT=istatus)
All is well if istatus is zero. Positive values indicate that there has
been an error. If there is no STAT= variable, the program terminates
when an error is encountered.
The numerical value returned can be decoded using "explain." For example,
the two most likely error numbers are 1205 and 1411, which decode as follows:
yukon% explain lib-1205
The program was unable to request more memory space.
A routine in the run-time library was unable to request additional
memory space. This usually occurs when the process has exceeded
its memory allocation.
Review the use of memory space by your program, and if possible,
reduce its memory use.
See the description of the limit(1) command in the UNICOS User
Commands Reference Manual, publication SR-2011.
The error class is UNRECOVERABLE (issued by the run-time library).
yukon% explain lib-1411
An allocatable array in the ALLOCATE statement is already
allocated.
An allocatable array in an ALLOCATE statement must not be currently
allocated. The ALLOCATED intrinsic function can be used to
determine if the allocatable array is currently allocated.
See the description of the ALLOCATE statement in the Fortran
reference manual.
The error class is UNRECOVERABLE (issued by the run-time library).
stdin: END
Q: Say you submit a single, 8-hour batch request using qsub and that the
qsub script launches a series of short jobs. Is there a way to
prevent one of the short jobs from using more than its share of the
8 hours?
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
