ARSC T3D Users' Newsletter 7, October 7, 1994
List of Differences Between T3D and Y-MP
I'm assembling a list of differences between the T3D and the Y-MP for users' reference. The current list looks like:
- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (this Newsletter #7)
The Effect of the -a Static Compiler Switch
On the Y-MP, the compiler switch -a static is usually benign and sometimes very useful, but on the T3D it must be handled carefully. By default, the allocation method on both the T3D and the Y-MP is "stack mode": local variables are allocated on the stack when a function or routine is entered, and the stack space is returned when the function or routine exits. The default can be overridden with the -a static flag. In this "static mode", each local variable is allocated in the data partition of memory and keeps that address for the life of the program, rather than occupying a stack address only for the duration of the call.
The -a static flag is useful for initializing local variables to zero on the Y-MP (see Newsletter #6) and for preserving the value of a local variable between calls. Many old Fortran codes actually depend on this "feature". Even several of the LAPACK routines do not perform correctly unless compiled with the -a static flag.
While working with LAPACK public domain sources, I noticed big changes in the compile times and the library size when using the -a static flag.
                              -a stack (the default)      -a static
Y-MP
  compile time (seconds)
    user                                 537                   508
    system                               156                   155
  library size (bytes)             2,996,428             3,435,124
T3D
  compile time (seconds)
    user                                1444                  2474
    system                               275                   373
  library size (bytes)             4,596,972            52,303,988
On the T3D, the -a static flag increases the user compile time by about 70% (from 1444 to 2474 seconds). The big surprise is the size of the library, which grows by more than a factor of 11.
The effect of the -a static flag on execution speed is also dramatic. Here are some execution times for the Livermore Loops on the Y-MP and the T3D:
                              -a stack (the default)      -a static
Y-MP (M98)
  program time (seconds)               46.30                 47.24
  harmonic mean (mflops)               15.69                 15.91
T3D
  program time (seconds)               53.75                149.78
  harmonic mean (mflops)                8.33                  2.35
So whereas the -a static flag had little effect on compile times, library sizes and execution speeds on the Y-MP, it has large negative effects on the T3D and should be used carefully. But it is also important to get the right answers regardless of the cost.
Y-MP System Activity Generated on the T3D.
At least three users in the past two weeks have experienced the somewhat hidden effects of the client for a T3D job. Each T3D job is launched from the Y-MP by one job running on the Y-MP, and several of these Y-MP jobs may service the T3D job throughout its execution. These jobs are called the clients. The clients provide several services to the T3D job:
- Launch
- All I/O activity
  - file I/O
  - I/O to the user's screen
- PVM communication between partitions
- CPU limit enforcement
- Termination
- Probably a lot more
denali : Top Process Display 09:11:43 Interval: 10 Passes: INDEF
Times : System(=) User(*)
Total (all CPUs) 397% +0--------20--------40--------60--------80------100+
mppexec 35228 31% M
============***
mppexec 35227 31% M
============***
lapwit 63724 27% M
=*****
mppexec 5225 26% M
===========**
a.out 15349 24%
=******
m553 34867 24%
=******
11oneway.e 43838 24%
=******
irs 26457 12%
*******
NOMAD3DNEW 25099 9%
****
nfc3d.x 500 6% M
=**
acous 81366 6%
***
na631D_hib6E 16309 6% M
***
p133B_hib6Ew 53498 6% M
***
na631D_hib6E 16273 5% M
**
standard.x 58444 5%
**
dtns3d.test 28340 5%
**
p133B_hib6Ew 53499 5% M
**
p131A_tib6Ep 17655 5% M
**
p131A_tib6Ep 17656 3% M
=
cylf2d 27158 1%
=
p131A_tib6Ep 17649 1% M
(To get this display you must enter an x once the initial display is shown; a q will exit from the display.)
Using csam, a user can optimize an application to reduce the system overhead on the Y-MP. This is important because the Y-MP at ARSC is already close to 400% utilization. I went over my own PVM application and made several changes that reduced its Y-MP activity from as much as 40% of a CPU to 1 or 2% of a CPU. Here are the changes I made:
- The master program of the master/slave implementation was moved from the Y-MP to the T3D. This reduces the number of slaves by one, but it moves work load from the Y-MP to the T3D.
- The number of pvm_probe calls was reduced by putting a sleep before each probe, which lowers the number of probes per second.
- Originally the I/O to an output file and to the screen was unbuffered with the setbuf call. This was removed, and some of the capability was restored with the fflush call.
Memory Usage on the T3D
In the last newsletter, we looked at the two fixed size partitions of the Private PE memory model, code and data. These two areas do not change size during execution, so tools such as mppsize and memsize are very useful for monitoring them. There are two additional areas of the Private PE memory model, the stack and the heap. These two partitions can change size during the execution. Below is a picture copied from the Cray MPP Loader User's Guide SG-2514 1.1, showing the layout of memory and the directions that the heap and the stack grow.
--------------------------
           Code
--------------------------
           Data
--------------------------
           Heap
             |
          growth
             |
             v
 ---  ---  ---  ---  ---
             ^
             |
          growth
             |
           Stack
--------------------------
The region between the heap and the stack is the unused memory of the PE; there is no fixed boundary between the two partitions.
In the default stack mode of compiling, local variables will be allocated on the stack during the execution of a routine or function. So in the example below, the storage for the array a[ ] will come from the stack partition when the function fromstack() is invoked and will be returned to the stack on exit from fromstack().
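#define XXXX 1000000
void fromstack(void);
void blahblahblah(void) { }   /* placeholder for the real work */
int main(void)
{
    fromstack();
    return 0;
}
/* a[] occupies stack space only while fromstack() is executing */
void fromstack(void)
{
    int a[ XXXX ];
    blahblahblah();
}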
Using a script similar to the one in the last newsletter, we find that the array a[] can have about 1,500,000 elements; beyond that the program fails with the error message:
User core dump completed (./mppcore) Operand range error
Local variables in main are also allocated from the stack, but they remain there for the entire execution. Variables allocated with malloc (and its relatives) are placed on the heap.
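#include <stdio.h>
#include <stdlib.h>
#define XXX 1000000   /* number of integers to allocate */
void fromheap(void);
int main(void)
{
    fromheap();
    return 0;
}
void fromheap(void)
{
    int *a;
    int i;
    long sum = 0;   /* wider than int on many systems; the sum is about 5e11 */
    a = (int *)malloc( sizeof( int ) * XXX );
    if( a != NULL ) {
        /* a[ 0 ] is left unused, as in the original loop bounds */
        for( i = 1; i < XXX; i++ ) a[ i ] = i;
        for( i = 1; i < XXX; i++ ) sum = sum + a[ i ];
        printf( "Sum of first %d integers = %ld\n", XXX - 1, sum );
        free( a );
    } else {
        printf( "Could not allocate enough space for %d integers\n", XXX );
    }
}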
We can investigate the behavior of allocation from the heap using the above program and a shell script similar to the one used in the last newsletter. Again we find that about 1,500,000 integer values can be allocated on the 2MW T3D node. This method of allocation has the benefit that a user is notified within the program that memory is exhausted and it can react before the program aborts.
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center, University of Alaska Fairbanks, PO Box 756020, Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
