ARSC HPC Users' Newsletter 293, June 11, 2004
- Tracking Memory Use On the IBMs
- Memory Limits on the IBMs
- Debugging a Program Started by aprun on the Cray X1
- Betty Studebaker Retirement
- Quick-Tip Q & A
Tracking Memory Use On the IBMs
[ Thanks to Jeff McAllister. ]
The amount of memory a job needs is often a more important consideration than CPU performance. Fortunately there are several ways to track a job's memory use on the IBM machines.
One tool all IBM users should be familiar with is ARSC's "llmap" utility, recently rewritten and improved by Shawn Houston.
Among the many useful things llmap summarizes is "minMB" and "maxMB", the minimum and maximum memory use, by node. In the following sample output, job J run by user jimjoe is running on 25 nodes. The total memory per node claimed by his job ranges from 3946 to 4043 MB. About 10 GB remains available on each node. Thus, if this were an MPI job, each processor could comfortably increase its memory usage by 1 GB.
  iceberg1 627% llmap
  RUNNING JOBS:
      jid    uname   class     usage       #n   ld   minMB  maxMB
  A  13720  bigbob  single    shared        1  8.4    6323   6323
  B  13747  bigbob  single    shared        1  8      7031   7031
  C  13845  bigbob  single    shared        1  8      7513   7513
  D  13908  honey   standard  shared        6  8.7    7475   7651
  E  13924  punkin  standard  not_shared    4  8.1    4932   4967
  F  13926  punkin  standard  not_shared    4  8.2    3490   3552
  G  13937  honey   standard  shared        6  8.2    7524   8553
  H  13938  slim    standard  shared        4  8      5185   5216
  I  13939  slim    standard  shared        4  8.2    5169   5522
  J  13940  jimjoe  standard  shared       25  6.8    3946   4043
  K  13941  jimjoe  standard  shared        9  0      2210   2418
  L  13942  jimjoe  standard  shared        9  0      2100   2739
  [ ... output truncated. See the next issue for more on llmap ... ]
For assessing memory usage, llmap is an excellent, quick start. However, it relies on data obtained from LoadLeveler, which tracks only loosely what is actually happening on the nodes, so llmap's numbers are approximate.
For single-processor codes, the following command will run hpmcount against your executable:

  hpmcount ./a.out
For MPI codes, use this command:
  poe hpmcount ./a.out -procs N
In a LoadLeveler script the number of processors is already known to POE, so the -procs argument is unnecessary.
When run on multiple processors, each task produces its own hpmcount output. Setting the "MP_STDOUTMODE" environment variable to "ordered" ensures that the outputs appear one after another instead of interlaced, as they are when several processors write simultaneously. The following commands will produce ordered hpmcount output from a LoadLeveler job:
  setenv MP_STDOUTMODE ordered
  poe hpmcount ./a.out
Alternatively, each process can write its output to a separate file (see Newsletter 281).
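Pulling these pieces together, a LoadLeveler script for this kind of run might look like the sketch below. The class name, node counts, and file names are placeholder values for illustration, not site-specific recommendations:

```shell
#!/bin/csh
# Sketch of a LoadLeveler batch script for running an MPI code under
# hpmcount.  All directive values here are placeholders.
# @ job_type       = parallel
# @ class          = standard
# @ node           = 2
# @ tasks_per_node = 8
# @ output         = hpm.$(jobid).out
# @ error          = hpm.$(jobid).err
# @ queue

# Keep each task's hpmcount report contiguous rather than interlaced.
setenv MP_STDOUTMODE ordered

# POE gets the task count from LoadLeveler, so -procs is not needed.
poe hpmcount ./a.out
```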
The hpmcount utility outputs an astounding variety of performance-related counters and metrics. For the purpose of this article, we're only interested in one: memory usage (by processor). This value is reported as "Maximum resident set size."
In the example below, I've run a code known to use about 1.3 GB total across 8 processors on a single node:
  b1n1 315% cat Output | grep 'Maximum resident set size'
  Maximum resident set size  : 156912 Kbytes
  Maximum resident set size  : 159736 Kbytes
  Maximum resident set size  : 161364 Kbytes
  Maximum resident set size  : 161364 Kbytes
  Maximum resident set size  : 161372 Kbytes
  Maximum resident set size  : 161368 Kbytes
  Maximum resident set size  : 162668 Kbytes
  Maximum resident set size  : 162664 Kbytes
8 * ~160 MB ~= 1280 MB
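With many tasks, the arithmetic is easy to automate. A small awk sketch, here fed the eight values above via a here-document (in practice you would read the captured output file instead):

```shell
# Sum the per-task "Maximum resident set size" values (in Kbytes) and
# print the approximate job total in MB.  The sixth whitespace-separated
# field on each matching line is the Kbyte figure.
awk '/Maximum resident set size/ { kb += $6 }
     END { printf "total: %.0f MB\n", kb/1024 }' <<'EOF'
Maximum resident set size  : 156912 Kbytes
Maximum resident set size  : 159736 Kbytes
Maximum resident set size  : 161364 Kbytes
Maximum resident set size  : 161364 Kbytes
Maximum resident set size  : 161372 Kbytes
Maximum resident set size  : 161368 Kbytes
Maximum resident set size  : 162668 Kbytes
Maximum resident set size  : 162664 Kbytes
EOF
# prints: total: 1257 MB
```

which agrees with the eyeballed estimate of about 1280 MB.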
Using this method to reconstruct activity across several nodes is slightly more complicated. In theory, if you set MP_STDOUTMODE, the hpmcount output will be written in the order assigned by LoadLeveler, which is printed under the 'Task' section of the output of this command:

  llq -l <jobid>
Here is the task numbering scheme for a 2-node, 2 tasks/node job:
  Task
  ----
    Num Task Inst: 4
    Task Instance: b7n1:0:(2474,MPI,US,1M)
    Task Instance: b7n1:1:(2472,MPI,US,1M)
    Task Instance: b7n4:2:(2476,MPI,US,1M)
    Task Instance: b7n4:3:(2472,MPI,US,1M)
Of course, if memory use is fairly constant, just multiplying an eyeballed average by the number of processors is a close enough estimate.
Memory Limits on the IBMs
[ Another thanks to Jeff.]
As a quick reminder, there are two barriers to using the full memory of the IBM machines. You must deal with both:
The "data" shell user limit.
Increase this limit with the correct command for your shell:
  [csh/tcsh]: limit data unlimited
  [sh/ksh]:   ulimit -dS unlimited
The executable file's data address space limit.
Increase this limit in one of the following two ways:
- Compile with either of the compiler options: -bmaxdata:2000000000 or -q64
- Use the ldedit command to alter the limit on an existing executable file, as shown here:
  ldedit -bmaxdata:2000000000 <<NAME OF EXECUTABLE>>
Note, the above limits apply to codes using primarily dynamic/allocated memory. If arrays are declared with "hard-coded" sizes, the stack limits should be adjusted instead.
Applications running into either of these limits will appear to be out of memory, even if plenty of physical memory remains available on the nodes.
If you're having trouble getting adequate memory for your job, contact ARSC consulting for assistance.
Debugging a Program Started by aprun on the Cray X1
[ Thanks to Ed Kornkven of ARSC! ]
TotalView is a well-known debugger made by Etnus and available on Klondike, ARSC's Cray X1. It is a full-featured debugger, with both a GUI-oriented mode and a command-oriented mode. The former is selected by entering the command "totalview", while the command-line interface is executed by the "totalviewcli" command.
TotalView documentation includes a reference manual and user guide, both accessible from the CrayDocs web pages, as well as man pages available on Klondike via "man totalview" or "man totalviewcli".
Hidden in those man pages is a trick for debugging a program started by aprun. The basic idea is to pause the executing program long enough to start a TotalView session and attach to the aprun'd process. To do this, we will use two windows -- one for the aprun command and the second for the totalviewcli command (assuming we want TotalView's command-line interface).

Window 1:
1a. Using the command appropriate for your shell, set the SMA_TOTALVIEW environment variable to cause the zero-ranked process to pause for 30 seconds after startup. (The ">" symbol below is the shell prompt.)
For csh-like shells:
  > setenv SMA_TOTALVIEW "0,30 aprun ./a.out"
For ksh-like shells:
  > export SMA_TOTALVIEW="0,30 aprun ./a.out"
1b. Run the program:
  > aprun -n 2 ./a.out
Upon issuing these two commands, instructions will be displayed showing what to type in the other window in order to start TotalView. Type "totalviewcli" instead of "totalview" for the command-line interface.

Window 2:
2a. Run TotalView as prompted in Window 1. > totalviewcli -pid <pid from Window 1> ./a.out
2b. Continue execution with the TotalView "continue" command. The "d1.<>" is the TotalView prompt and indicates that the debugger is currently "focused" on process 1 (and thread 1).
  d1.<> dcont

Window 1:
1c. If your program requires input from the terminal on startup, the prompt for it will appear in this window. Enter the input here.
If the program crashes, you can examine it using TotalView in the other window.

Window 2:
2c. To get a stack trace of where the program crashed, enter:
  d1.<> dwhere -a
The "-a" means to show the parameters as well as the functions on the call stack.
2d. To change "focus" to another process or thread, e.g., to see the stack trace of another thread, use the "dfocus" command:
  d1.<> dfocus 2.1
  d2.1<> dwhere -a
These two commands will display the stack trace of thread 1 in process 2. We could have simply said "dfocus 2" since process 2 has only one thread in this example.
If any readers have any more tips for debugging on the X1, please share them!
Betty Studebaker Retirement
Most anyone who's visited ARSC's offices has met our office manager, Betty Studebaker. Since its inception 11 years ago, Betty has helped keep ARSC running in more ways than most of us can imagine. Prior to starting at ARSC, she worked for 17 years at the University of Alaska Computer Network.
Among her accomplishments, Betty won an NSF grant in 2003 to establish an HPC internship program that draws students annually from around the country. The 13 interns for 2004 are already at ARSC.
Betty's final day at UAF was today. Local folks will wish her a happy retirement at a going-away picnic, tomorrow. Others can reach her at: firstname.lastname@example.org.
Quick-Tip Q & A
A:[[ How can I find out what this means?

       ftn-6205 ftn: VECTOR
         File = potnl1temp.F, Line = 392
         A loop starting at line 392 was vectorized with a single
         vector iteration.

       ftn-6003 ftn: SCALAR
         File = potnl1temp.F, Line = 393
         A loop starting at line 393 was collapsed into the loop
         starting at line 392.

     (This was output by Cray ftn on my code, when I gave ftn the
     "-rm" option, for loopmark listings.)
  ]]

  # Thanks to Brad Chamberlain.

  For a quick start at understanding them, try:

    explain ftn-6205
    explain ftn-6003

  from an X1 command line.

  # Editor's note:
  #
  # Cray users should keep an eye open for "explain" codes. They're
  # offered by many tools, not just ftn. For the curious, here're the
  # "explanations" for the above:

  klondike% explain ftn-6205

  VECTOR: A loop starting at line %s was vectorized with a single
  vector iteration.

  Also known as a short vector loop, a loop with a single vector
  iteration is one in which the code to initialize the loop counter at
  the top of the loop and the code to test the loop counter at the
  bottom of the loop are eliminated, because it is known at compile
  time that only one iteration through the loop needs to be executed.
  This is true when there is an iteration count known at compile time
  to be less than or equal to the maximum vector length of the target
  machine.  For example, the following would cause a short vector loop
  to be generated:

        DO I = 1,20
          A(I) = B(I)
        ENDDO

  klondike% explain ftn-6003

  OPT_INFO: A loop starting at line %s was collapsed into the loop
  starting at line num.

  The optimizer was able to replace a nest of two or more loops with a
  single loop.  The collapse eliminates some loop overhead, improving
  the overall performance of scalar and vector loops, and can greatly
  increase the speed of some vector loops by allowing greater vector
  lengths to be used.  A simple example of a collapsable loop nest:

        DO I = 1,30
          DO J = 1,20
            A(I,J) = B(I,J)
          ENDDO
        ENDDO

Q: How can I determine which shell I'm using? Nothing seems to work! Nothing!
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020, Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.