ARSC HPC Users' Newsletter 293, June 11, 2004

Tracking Memory Use On the IBMs

[ Thanks to Jeff McAllister. ]

The amount of memory a job needs is often a more important consideration than CPU performance. Fortunately there are several ways to track a job's memory use on the IBM machines.

One tool all IBM users should be familiar with is ARSC's "llmap" utility, recently rewritten and improved by Shawn Houston.

Among the many useful things llmap summarizes are "minMB" and "maxMB", the minimum and maximum memory use, by node. In the following sample output, job J, run by user jimjoe, is running on 25 nodes. The total memory per node claimed by his job ranges from 3946 to 4043 MB. About 10 GB remains available on each node. Thus, if this were an MPI job, each processor could comfortably increase its memory usage by 1 GB.

iceberg1 627% llmap
    jid      uname      class        usage        #n    ld     minMB    maxMB
A   13720    bigbob     single       shared       1     8.4    6323     6323
B   13747    bigbob     single       shared       1     8      7031     7031
C   13845    bigbob     single       shared       1     8      7513     7513
D   13908    honey      standard     shared       6     8.7    7475     7651
E   13924    punkin     standard     not_shared   4     8.1    4932     4967
F   13926    punkin     standard     not_shared   4     8.2    3490     3552
G   13937    honey      standard     shared       6     8.2    7524     8553
H   13938    slim       standard     shared       4     8      5185     5216
I   13939    slim       standard     shared       4     8.2    5169     5522
J   13940    jimjoe     standard     shared       25    6.8    3946     4043
K   13941    jimjoe     standard     shared       9     0      2210     2418
L   13942    jimjoe     standard     shared       9     0      2100     2739

[ ... output truncated. See the next issue for more on llmap ... ]
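
For readers who want to post-process llmap output, here is a minimal sketch in Python. It assumes the column layout shown above (a one-letter label followed by eight whitespace-separated fields); the names parse_llmap and LlmapRow are hypothetical and are not part of llmap itself.

```python
from typing import List, NamedTuple


class LlmapRow(NamedTuple):
    label: str
    jid: int
    uname: str
    job_class: str
    usage: str
    nodes: int
    load: float
    min_mb: int
    max_mb: int


def parse_llmap(text: str) -> List[LlmapRow]:
    """Parse llmap job rows, assuming the column layout shown above."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Job rows have 9 fields and a numeric job id; the header has 8.
        if len(fields) != 9 or not fields[1].isdigit():
            continue
        label, jid, uname, job_class, usage, nodes, load, min_mb, max_mb = fields
        rows.append(LlmapRow(label, int(jid), uname, job_class, usage,
                             int(nodes), float(load), int(min_mb), int(max_mb)))
    return rows


# Three rows copied from the sample output above.
SAMPLE = """\
    jid      uname      class        usage        #n    ld minMB    maxMB
D   13908    honey      standard     shared       6    8.7 7475     7651
J   13940    jimjoe     standard     shared       25   6.8 3946     4043
K   13941    jimjoe     standard     shared       9    0 2210     2418
"""

for row in parse_llmap(SAMPLE):
    if row.uname == "jimjoe":
        spread = row.max_mb - row.min_mb
        print(f"job {row.jid}: {row.nodes} nodes, "
              f"{row.min_mb}-{row.max_mb} MB per node (spread {spread} MB)")
```

A large spread between minMB and maxMB would suggest an uneven memory load across the job's nodes.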

For assessing memory usage, llmap is an excellent, quick start. However, it uses data obtained from LoadLeveler, which is not tied very closely to what's happening on the nodes. Thus, llmap can't provide the greatest accuracy.

A better estimate can be obtained from IBM's tool, "hpmcount." (See Newsletters 281 and 251 for more info on this utility.)

For single-processor codes, the following command will run hpmcount against your executable:

    hpmcount ./a.out

For MPI codes, use this command:

    poe hpmcount ./a.out -procs N

In a LoadLeveler script the number of processors is already known to POE, so the -procs argument is unnecessary.

When a job runs on multiple processors, each process creates its own hpmcount output. Setting the "MP_STDOUTMODE" environment variable to "ordered" ensures that the outputs appear in order instead of interleaved, as they do when several processes try to write simultaneously. The following commands will produce ordered hpmcount output from a LoadLeveler job:

    setenv MP_STDOUTMODE ordered
    poe hpmcount ./a.out

Alternatively, each process can write to a separate file (see Newsletter 281).

The hpmcount utility outputs an astounding variety of performance-related counters and metrics. For the purpose of this article, we're only interested in one: memory usage (by processor). This value is reported as "Maximum resident set size."

In the example below, I've run a code known to use about 1.3 GB total across 8 processors on a single node:

    b1n1 315% cat Output | grep 'Maximum resident set size'
     Maximum resident set size                    : 156912 Kbytes
     Maximum resident set size                    : 159736 Kbytes
     Maximum resident set size                    : 161364 Kbytes
     Maximum resident set size                    : 161364 Kbytes
     Maximum resident set size                    : 161372 Kbytes
     Maximum resident set size                    : 161368 Kbytes
     Maximum resident set size                    : 162668 Kbytes
     Maximum resident set size                    : 162664 Kbytes

8 * ~160 MB ~= 1280 MB
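
The same total can be computed exactly rather than eyeballed. The following is a minimal sketch in Python, assuming the hpmcount lines have the format shown above; the function name rss_kbytes is hypothetical.

```python
import re

# Matches lines such as:
#   Maximum resident set size                    : 156912 Kbytes
RSS_LINE = re.compile(r"Maximum resident set size\s*:\s*(\d+)\s*Kbytes")


def rss_kbytes(text):
    """Return the per-process resident set sizes (in Kbytes), in file order."""
    return [int(m.group(1)) for m in RSS_LINE.finditer(text)]


# The eight values from the example above.
SAMPLE = """\
     Maximum resident set size                    : 156912 Kbytes
     Maximum resident set size                    : 159736 Kbytes
     Maximum resident set size                    : 161364 Kbytes
     Maximum resident set size                    : 161364 Kbytes
     Maximum resident set size                    : 161372 Kbytes
     Maximum resident set size                    : 161368 Kbytes
     Maximum resident set size                    : 162668 Kbytes
     Maximum resident set size                    : 162664 Kbytes
"""

values = rss_kbytes(SAMPLE)
total_mb = sum(values) / 1024  # assumes 1 MB = 1024 Kbytes
print(f"{len(values)} processes, total {total_mb:.0f} MB, "
      f"mean {total_mb / len(values):.0f} MB per process")
```

With 1 MB taken as 1024 Kbytes, the exact sum comes to about 1257 MB, so the eyeballed figure above is within a few percent.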

Using this method to reconstruct activity across several nodes is slightly more complicated. In theory, if you set MP_STDOUTMODE to "ordered", the hpmcount output will be written in the task order assigned by LoadLeveler. That order appears under the 'Task' section in the output of this command:

    llq -l jobid

Here is the task numbering scheme for a 2-node, 2 tasks/node job:


      Num Task Inst: 4
      Task Instance: b7n1:0:(2474,MPI,US,1M)
      Task Instance: b7n1:1:(2472,MPI,US,1M)
      Task Instance: b7n4:2:(2476,MPI,US,1M)
      Task Instance: b7n4:3:(2472,MPI,US,1M)

Of course, if memory use is fairly constant, just multiplying an eyeballed average by the number of processors will give a close enough estimate.
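
When memory use is not constant, the task numbering can be used to total memory by node. Here is a minimal sketch in Python. It assumes the "Task Instance" line format shown above and that the i-th ordered hpmcount value belongs to task i, which, as noted, holds only in theory. The function names and the per-task values are hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:  Task Instance: b7n1:0:(2474,MPI,US,1M)
TASK_LINE = re.compile(r"Task Instance:\s*(\w+):(\d+):")


def task_to_node(llq_text):
    """Map task number -> node name from `llq -l` Task Instance lines."""
    return {int(m.group(2)): m.group(1) for m in TASK_LINE.finditer(llq_text)}


def per_node_kbytes(llq_text, rss_by_task):
    """Sum per-task values by node; rss_by_task[i] is assumed to be task i."""
    nodes = task_to_node(llq_text)
    totals = defaultdict(int)
    for task, kbytes in enumerate(rss_by_task):
        totals[nodes[task]] += kbytes
    return dict(totals)


# Task section copied from the 2-node, 2 tasks/node example above.
LLQ_SAMPLE = """\
      Num Task Inst: 4
      Task Instance: b7n1:0:(2474,MPI,US,1M)
      Task Instance: b7n1:1:(2472,MPI,US,1M)
      Task Instance: b7n4:2:(2476,MPI,US,1M)
      Task Instance: b7n4:3:(2472,MPI,US,1M)
"""

# Made-up per-task sizes in Kbytes, in task order (illustration only).
RSS_BY_TASK = [100, 110, 120, 130]

for node, kbytes in per_node_kbytes(LLQ_SAMPLE, RSS_BY_TASK).items():
    print(f"{node}: {kbytes} Kbytes")
```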

Memory Limits on the IBMs

[ Another thanks to Jeff.]

As a quick reminder, there are two barriers to using the full memory of the IBM machines. You must deal with both:

  1. The "data" shell user limit. Increase this limit with the correct command for your shell:
              limit data unlimited (csh/tcsh)
              ulimit -dS unlimited (ksh/sh)
  2. The executable file's data address space limit. Increase this limit in one of the following two ways:
    1. Compile with either of the compiler options: -bmaxdata:2000000000 or -q64
    2. Use the ldedit command to alter the limit on an existing executable file, as shown here:
              ldedit -bmaxdata:2000000000 <<NAME OF EXECUTABLE>>

Note, the above limits apply to codes using primarily dynamic/allocated memory. If arrays are declared with "hard-coded" sizes, the stack limits should be adjusted instead.

Applications running into any of these limits will appear to be out of memory, even if plenty of physical memory actually remains available on the nodes.
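
To confirm which shell limits a process actually inherits, the limits can be read from inside the job. The sketch below uses Python's standard "resource" module and assumes Python is available on the system, which this article does not state. It reports only the first barrier (the shell limits), not the executable's maxdata setting, and the helper names are hypothetical.

```python
import resource


def describe(limit):
    """Format a limit given in bytes, or report it as unlimited."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit / 2**20:.0f} MB"


def limits_report():
    """Return {limit name: (soft, hard)} for the data and stack limits."""
    report = {}
    for name in ("RLIMIT_DATA", "RLIMIT_STACK"):
        soft, hard = resource.getrlimit(getattr(resource, name))
        report[name] = (describe(soft), describe(hard))
    return report


for name, (soft, hard) in limits_report().items():
    print(f"{name}: soft={soft} hard={hard}")
```

Running this from within the batch script, after the limit or ulimit command, shows whether the change took effect for the job.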

If you're having trouble getting adequate memory for your job, contact ARSC consulting for assistance.

Debugging a Program Started by aprun on the Cray X1

[ Thanks to Ed Kornkven of ARSC! ]

TotalView is a well-known debugger made by Etnus and available on Klondike, ARSC's Cray X1. It is a full-featured debugger, with both a GUI-oriented mode and a command-oriented mode. The former is selected by entering the command "totalview", while the command-line interface is executed by the "totalviewcli" command.

TotalView documentation includes a reference manual and user guide, both accessible from the CrayDocs web pages, as well as man pages available on Klondike via "man totalview" or "man totalviewcli".

Hidden in those man pages is a trick for debugging a program started by aprun. The basic idea is to pause the executing program long enough to allow for starting a TotalView session and connecting to the aprun'd process. To do this, we will use two windows -- one for the aprun command and the second for the totalviewcli command (assuming we want to use the command-line interface of TotalView):

Window 1:

1a. Using the command appropriate for your shell, set the SMA_TOTALVIEW environment variable to cause the zero-ranked process to pause for 30 seconds after startup. (The ">" symbol below is the shell prompt.)

For csh-like shells:

    > setenv SMA_TOTALVIEW "0,30 aprun ./a.out"

For ksh-like shells:

    > export SMA_TOTALVIEW="0,30 aprun ./a.out"

1b. Run the program:

    > aprun -n 2 ./a.out

Upon issuing these two commands, instructions will be displayed showing what to type in the other window in order to start TotalView. Type "totalviewcli" instead of "totalview" for the command-line interface.

Window 2:

2a. Run TotalView as prompted in Window 1:

    > totalviewcli -pid <pid from Window 1> ./a.out

2b. Continue execution with the TotalView "continue" command. The "d1.<>" is the TotalView prompt and indicates that the debugger is currently "focused" on process 1 (and thread 1).

    d1.<> dcont

Window 1:

1c. If your program requires input from the terminal on startup, the prompt for it will appear in this window. Enter the input here.

If the program crashes, you can examine it using TotalView in the other window.

Window 2:

2c. To get a stack trace of where the program crashed, enter:

    d1.<> dwhere -a

The "-a" means to show the parameters as well as the functions on the call stack.

2d. To change "focus" to another process or thread, e.g. to see the stack trace of another thread, use the "dfocus" command:

    d1.<> dfocus 2.1
    d2.1 <> dwhere -a

These two commands will display the stack trace of thread 1 in process 2. We could have simply said "dfocus 2" since process 2 has only one thread in this example.

If any readers have any more tips for debugging on the X1, please share them!

Betty Studebaker Retirement

Most anyone who's visited ARSC's offices has met our office manager, Betty Studebaker. Since ARSC's inception 11 years ago, Betty has helped keep the center running in more ways than most of us can imagine. Prior to starting at ARSC, she worked for 17 years at the University of Alaska Computer Network.

Among her accomplishments, Betty won an NSF grant in 2003 to establish an HPC internship program that draws students annually from around the country. The 13 interns for 2004 are already at ARSC.

Betty's final day at UAF was today. Local folks will wish her a happy retirement at a going-away picnic, tomorrow. Others can reach her at:

Quick-Tip Q & A

A:[[ How can I find out what this means? 

     ftn-6205 ftn: VECTOR File = potnl1temp.F, Line = 392
       A loop starting at line 392 was vectorized with a single vector
       iteration.

     ftn-6003 ftn: SCALAR File = potnl1temp.F, Line = 393
       A loop starting at line 393 was collapsed into the loop starting
       at line 392.

  (This was output by Cray ftn on my code, when I gave ftn the "-rm"
  option, for loopmark listings.)

  # Thanks to Brad Chamberlain.

  For a quick start at understanding them, try:

          explain ftn-6205
          explain ftn-6003

  from an X1 command line.

  # Editor's note:

  Cray users should keep an eye open for "explain" codes. They're offered
  by many tools, not just ftn.  For the curious, here are the
  "explanations" for the above:

  klondike%       explain ftn-6205 

    VECTOR:  A loop starting at line %s was vectorized with a single
    vector iteration.

    Also known as a short vector loop, a loop with a single vector
    iteration is one in which the code to initialize the loop counter at
    the top of the loop and the code to test the loop counter at the
    bottom of the loop are eliminated, because it is known at compile
    time that only one iteration through the loop needs to be executed.
    This is true when there is a iteration count known at compile 
    time to be less than or equal to the maximum vector length of the
    target machine.  For example, the following would cause a short
    vector loop to be generated:

      DO I = 1,20
        A(I) = B(I) 

  klondike%        explain ftn-6003

    OPT_INFO:  A loop starting at line %s was collapsed into the loop
    starting at line num.

    The optimizer was able to replace a nest of two or more loops with a
    single loop.  The collapse eliminates some loop overhead, improving
    the overall performance of scalar and vector loops, and can greatly
    increase the speed of some vector loops by allowing greater vector
    lengths to be used.  A simple example of a collapsable loop nest:

      DO I = 1,30
        DO J = 1,20
          A(I,J) = B(I,J)
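
For a loopmark listing with many messages, it can help to collect the distinct message IDs first and look each one up once. This is a minimal sketch in Python, assuming the IDs take the form "ftn-NNNN" as in the two messages above; the function name message_ids is hypothetical.

```python
import re

# Message IDs are assumed to look like "ftn-6205".
MESSAGE_ID = re.compile(r"\b(ftn-\d+)\b")


def message_ids(listing):
    """Return the distinct message IDs in order of first appearance."""
    seen = []
    for mid in MESSAGE_ID.findall(listing):
        if mid not in seen:
            seen.append(mid)
    return seen


# Header lines copied from the compiler messages quoted in the question.
LISTING = """\
ftn-6205 ftn: VECTOR File = potnl1temp.F, Line = 392
ftn-6003 ftn: SCALAR File = potnl1temp.F, Line = 393
"""

# Print one explain command per distinct ID, ready to paste at the prompt.
for mid in message_ids(LISTING):
    print(f"explain {mid}")
```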

Q: How can I determine which shell I'm using?  Nothing seems to work! 

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.