ARSC HPC Users' Newsletter 244, April 24, 2002
Malloc Debugging with Electric Fence
[ Jim Long of ARSC contributed this article. ]
This article looks at a method of debugging memory problems with malloc-ed memory using a library called "Electric Fence", available free from: http://perens.com/FreeSoftware
From the README for Electric Fence:
Electric Fence is a different kind of malloc() debugger. It uses the virtual memory hardware of your system to detect when software overruns the boundaries of a malloc() buffer. It will also detect any accesses of memory that have been released by free(). Because it uses the VM hardware for detection, Electric Fence stops your program on the first instruction that causes a bounds violation. It's then trivial to use a debugger to display the offending statement.
This version will run on:
- Linux kernel version 1.1.83 and above. Earlier kernels have problems with the memory protection implementation.
- All System V Revision 4 platforms (and possibly earlier revisions) including:
- Every 386 System V I've heard of.
- Solaris 2.x
- SGI IRIX 5.0 (but not 4.x)
- IBM AIX on the RS/6000.
- SunOS 4.X (using an ANSI C compiler and probably static linking).
- HP/UX 9.01, and possibly earlier versions.
- OSF 1.3 (and possibly earlier versions) on a DECalpha.
On icehawk, ARSC's IBM SP system (RS/6000), libefence.a compiled successfully with both gcc version 2.95.2 and IBM's xlc version 5. For gcc, the only change to the Makefile is CC=gcc. For xlc, set CC=xlc and uncomment the additional CFLAGS, which must also be used when compiling your own code.
Let's look at a simple example of how it works (this assumes the user, "username," has stored libefence.a in his or her /u1/uaf/username/lib directory):
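/* e.c - to test Electric Fence */
#include <stdio.h>
#include <stdlib.h>   /* declares malloc() */
int main()
{
    int i;
    int * ptr = (int *) malloc(10*sizeof(int));
    for(i=0; i<11; i++)   /* deliberate bug: i reaches 10, one past the end */
        ptr[i] = 2*i;
    return 0;
}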
icehawk1:> gcc -g e.c -L/u1/uaf/username/lib -lefence
icehawk1:> ./a.out
Electric Fence 2.0.5 Copyright (C) 1987-1995 Bruce Perens.
Segmentation fault
icehawk1:>
Electric Fence placed a "fence" of inaccessible memory immediately after the block allocated by malloc, so the write to ptr[10] caused a segmentation fault and a core dump. Invoking dbx on the core file reveals the offending statement and the value of i:
icehawk1:> dbx a.out core
Type 'help' for help.
warning: The core file is truncated. You may need to increase the
ulimit for file and coredump, or free some space on the filesystem.
reading symbolic information ...internal error: unexpected value 120
at line 400
5 in file stabstring.c
[using memory image in core]
Segmentation fault in main at line 9 in file "e.c"
9 ptr[i] = 2*i;
(dbx) print i
10
(dbx)
Important environment variables available with Electric Fence:
- EF_PROTECT_BELOW
- The default behavior of Electric Fence is to only place the inaccessible page after the allocated memory. To instead place the inaccessible page before the allocated memory, set the environment variable EF_PROTECT_BELOW to 1.
- EF_PROTECT_FREE
- If you suspect that freed memory is being accessed, set EF_PROTECT_FREE to 1. This keeps Electric Fence from ever re-allocating memory once it has been freed, so any statement that later touches that memory gets a segmentation fault.
- EF_ALLOW_MALLOC_0
- Electric Fence normally traps calls to malloc with size zero, since they often indicate a bug. To allow malloc calls of size 0, set EF_ALLOW_MALLOC_0 to a non-zero value.
- EF_ALIGNMENT
- EF_ALIGNMENT sets the word alignment used for allocations; see the man page for details.
A few caveats apply when using Electric Fence. It can create huge core dumps, so it may be better to run your code in the debugger without a core file. Electric Fence can also consume a large amount of memory for big dynamic structures, so you may have to add swap space. All in all, though, Electric Fence is a great free tool for those occasional nasty debugging sessions.
"llmap": Handy IBM SP LoadLeveler Utility
Jeff McAllister of ARSC has written a Perl script, called "llmap," to display LoadLeveler work on the IBM SP. This is similar to "grmap" for the Cray T3E. "llmap" is available in /usr/local/bin.
Here's sample output (names have been changed, of course) from a run on icehawk. It's pretty self-explanatory:
ICEHAWK1$ llmap
NONRUNNING JOBS:
  JID   uname   jname          usage       min  max  status
  4121  bigbob  icehawk1.4121  not_shared   32   32  Idle
RUNNING JOBS:
  JID   uname   jname          usage       #n   ld  minMB  maxMB
A 4104  honey   myjob.x        not_shared  12  4.1   1556   1717
B 4105  punkin  icehawk2.4105  not_shared  13  4.1    164    365
C 4116  bigbob  icehawk1.4116  not_shared   8  4.1    163    357
D 4127  slim    myjob          not_shared   8  0.2     97    291
           free load              free load              free load
node   job   MB  avg   node   job   MB  avg   node   job   MB  avg
--------------------------------------------------------------------
i1n1   B   1575  4.1   i2n17  B   1401  4.0   i3n33  B   1504  4.1
i1n2   D   1449  0.9   i2n18  A     23  4.1   i3n34  C   1459  4.2
i1n3                   i2n19  A     27  4.1   i3n35  C   1389  4.2
i1n4   B   1420  4.0   i2n20                  i3n36  C   1400  4.1
i1n5   A     23  4.1   i2n21  A     23  4.2   i3n37  C   1402  4.1
i1n6   B   1489  4.0   i2n22  A    101  4.0   i3n38  C   1383  4.1
i1n7                   i2n23  A     50  4.1   i3n39  D   1495  0.1
i1n8   B   1510  4.1   i2n24  A     23  4.2   i3n40  D   1519  0.1
i1n9   B   1410  4.0   i2n25  B   1576  4.0   i3n41  C   1492  4.1
i1n10  C   1561  4.0   i2n26                  i3n42  C   1577  4.0
i1n11  A    184  4.0   i2n27  B   1401  4.0   i3n43  D   1643  0.1
i1n12  B   1375  4.0   i2n28  A     23  4.1   i3n44  D   1623  0.1
i1n13  B   1413  4.0   i2n29  A    105  4.1   i3n45  D   1532  0.1
i1n14  D   1629  0.1   i2n30  B   1419  4.4   i3n46  D   1622  0.2
i1n15                  i2n31  A     23  4.0
i1n16  A     23  4.1   i2n32  B   1405  4.1
--------------------------------------------------------------------
46 nodes
5 free, 41 working, 0 sharing
If you've submitted a LoadLeveler job, you might run llmap to see whether it has started, how much memory it's using, etc. If it hasn't started, llmap might give you clues as to why, but this can be tricky. To start a big, high-priority job, LoadLeveler might be in the process of draining nodes, so the appearance of idle nodes doesn't necessarily mean that your job is just about to start. (As a reminder, specifying the time and minimum number of nodes your job will need can help your job start sooner under LoadLeveler's backfill algorithm.)
Please send feedback on llmap.
Any additional features you need or want?
Cray Support for HDF and HDF5 May Be Dropped
The latest HDF newsletter from NCSA, available at:
http://hdf.ncsa.uiuc.edu/newsletters/newsletter67.html#cray
contained this note:
Cray Support:
The Cray computers that we use for building HDF/HDF5 are being
retired. We are therefore considering dropping support for the
Crays. If you are a Cray user and you use HDF or HDF5, please let us
know as soon as possible.
Contact info is given on the above page. ARSC users concerned about this should also let us know by email at consult@arsc.edu.
Rob Bell Talk Rescheduled to Monday April 29
Our NEC of the Woods: Oz, CSIRO, HPCC and NEC
Dr. Robert Bell
Deputy Manager
CSIRO High Performance Computing and Communications Centre
Monday, April 29th
2:30pm - 4:00pm
109 Butrovich, UAF Campus
Details:
http://www.arsc.edu/pubs/bulletins/HPCCCTalk2002.shtml
Irregular "Schedule" of this Newsletter
This "Bi-Weekly on Fridays" newsletter has been a bit irregular lately.
We hope this isn't too troublesome to you!
Travel by the editors is often the cause, as it is this week (we attend CUG, SPSCICOMP, SC, and DoD User Group meetings, among others, and even take vacations).
Sometimes, we confess... we just miss the "deadline." Articles submitted by readers are a big help, and you're ALWAYS invited to share your work, knowledge, discoveries, and, of course, tips.
Quick-Tip Q & A
A:[[ Am I going nuts?
[[
[[ $ ls -d DATA
[[ DATA
[[ $ ls DATA
[[ D2001 index.txt
[[ $ cd DATA
[[ sh-56 ksh: DATA: not found.
[[
[[ Why is it doing this to me?
#
# Thanks to Richard Griswold:
#
Something similar to this happens to me when I do not have execute
permission to the directory. However, on Linux, Solaris, and HP-UX I
get the message "permission denied" instead of "not found". The message
may be different on other OSes. See if you have execute permission for DATA
using
ls -ld DATA
#
# From editors:
#
$ ls -ld DATA
drw------- 3 nutso thegroup 4096 Apr 11 11:19 DATA
Given read permission on a directory, ls can open the directory inode
and read the names of the files within. Lacking execute permission on
the directory, however, ls can't open the files' inodes, and thus,
can't access the extra information about the files provided by, for
instance, "ls -l". In addition, cd is prohibited from making a
non-executable directory the present working directory.
Note, the error messages reported in this question are different in
detail on different systems.
Q:From time to time I'm instructed to append some path to "PATH",
some file to "LM_LICENSE_FILE", or some such thing. For instance,
in the last issue, regarding Totalview:
"To use it, add this:
/usr/local/adm/pkg/flexlm/license.dat
to the settings of your LM_LICENSE_FILE environment variable."
How, specifically, would you suggest I do that?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
