ARSC T3D Users' Newsletter 7, October 7, 1994

List of Differences Between T3D and Y-MP

I'm assembling a list of differences between the T3D and the Y-MP for users' reference. The current list looks like:

  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (this Newsletter #7)
I encourage users to e-mail in their favorite difference so we all can benefit from each other's experience.

The Effect of the -a Static Compiler Switch

On the Y-MP, the compiler switch -a static is usually benign and sometimes very useful, but on the T3D it must be handled carefully. By default, the allocation method on both the T3D and the Y-MP is "stack mode": local variables are allocated on the stack when a function or routine is executed, and the stack space is returned on exit from the function or routine. The default can be overridden with the -a static flag. In this "static mode", each local variable is allocated in the data partition of memory and keeps that address for the life of the program, rather than occupying a stack address only for the duration of the call.

The -a static flag is useful for initializing local variables with the value zero on the Y-MP (see Newsletter #6), and for preserving a local variable between calls. Many old Fortran codes actually depend on this "feature". Even several of the LAPACK routines do not perform correctly unless compiled with the -a static flag.
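
The difference between the two modes can be sketched in C, where the static keyword does for one variable roughly what -a static does for every local variable of a Fortran routine. This is only an analogy, not the compiler's mechanism: the function names are made up for illustration, and the stack variable is explicitly initialized so that the example stays well defined.

```c
#include <assert.h>
#include <stdio.h>

/* Static allocation: calls lives in the data area and keeps its value
   between calls, as every local variable does under -a static. */
int count_calls_static(void)
{
    static int calls = 0;   /* initialized once, before the program runs */
    calls++;
    return calls;
}

/* Stack allocation: calls is created anew on every entry and its
   storage is released on return. */
int count_calls_stack(void)
{
    int calls = 0;          /* re-initialized on every call */
    calls++;
    return calls;
}

/* Call each routine three times and print the last value returned. */
void show_difference(void)
{
    int s = 0, k = 0, i;

    for (i = 0; i < 3; i++) {
        s = count_calls_static();
        k = count_calls_stack();
    }
    printf("static: %d   stack: %d\n", s, k);
    assert(k == 1);   /* the stack copy never remembers earlier calls */
}
```

A code that relies on the first behavior without saying so is the kind of old Fortran that needs -a static to run correctly.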

While working with LAPACK public domain sources, I noticed big changes in the compile times and the library size when using the -a static flag.


                  -a stack (the default)       -a static

  Y-MP
  compile time (seconds)
        user                         537             508
        system                       156             155
  library size (bytes)         2,996,428       3,435,124

  T3D
  compile time (seconds)
        user                        1444            2474
        system                       275             373
  library size (bytes)         4,596,972      52,303,988
On the T3D, -a static increases the user compile time by about 70%, while on the Y-MP it has almost no effect. The big surprise is the size of the T3D library, which grows by a factor of more than 11.
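
The size of the effect on the T3D can be checked with a few lines of C. This is just the arithmetic on the table above; the function names are made up, and the stack-mode library size is assumed to be 4,596,972 bytes.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of the -a static measurement to the -a stack measurement. */
double static_to_stack(double stack_value, double static_value)
{
    assert(stack_value > 0.0);
    return static_value / stack_value;
}

/* Print the T3D ratios computed from the table above. */
void print_t3d_ratios(void)
{
    printf("user compile time:   %.2f\n", static_to_stack(1444.0, 2474.0));
    printf("system compile time: %.2f\n", static_to_stack(275.0, 373.0));
    printf("library size:        %.2f\n", static_to_stack(4596972.0, 52303988.0));
}
```

The three ratios come out at about 1.71, 1.36 and 11.38.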

Also in execution speeds the effect of the -a static flag is dramatic. Here are some execution times for the Livermore Loops on the Y-MP and the T3D:


                  -a stack (the default)       -a static

  Y-MP (M98)
  program time (seconds)           46.30           47.24
  harmonic mean (mflops)           15.69           15.91

  T3D
  program time (seconds)           53.75          149.78
  harmonic mean (mflops)            8.33            2.35
So whereas -a static has little effect on compile times, program sizes and execution speeds on the Y-MP, it has large negative effects on the T3D and should be used carefully. But it is also important to get the right answers, regardless of the cost.
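
A harmonic mean is dominated by the slowest kernels, which is part of why it is a severe summary of a set of loops. Below is a minimal sketch of the calculation, assuming an unweighted mean over one Mflops rate per kernel; the function name is made up, and the rates in the usage note are illustrative values, not Livermore Loops results.

```c
#include <assert.h>
#include <stddef.h>

/* Unweighted harmonic mean of n positive rates: n / sum(1 / rate[i]). */
double harmonic_mean(const double *rate, size_t n)
{
    double sum_of_reciprocals = 0.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        assert(rate[i] > 0.0);
        sum_of_reciprocals += 1.0 / rate[i];
    }
    return (double)n / sum_of_reciprocals;
}
```

For the three rates 1.0, 4.0 and 4.0 the harmonic mean is 2.0, while the arithmetic mean is 3.0: the one slow kernel pulls the figure down much further.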

Y-MP System Activity Generated on the T3D

At least three users in the past two weeks have experienced the somewhat hidden effects of the client for a T3D job. Each T3D job is launched from the Y-MP by one job running on the Y-MP, and several such jobs may service the T3D job throughout its execution. These jobs are called the clients. The clients provide several services to the T3D job:
  1. Launch
  2. All I/O activity
    1. file I/O
    2. I/O to the user's screen
  3. PVM communication between partitions
  4. CPU limit enforcement
  5. Termination
  6. Probably a lot more
With PVM and reads/writes on the T3D, the T3D processors can generate a lot of system requests that must be serviced by the clients. Mike Dority, a CRI analyst at ARSC, showed me how to monitor this system activity with the command "csam". It provides a detailed breakdown of users' jobs as a percentage of a single CPU and refreshes the display every 10 seconds. It also breaks down the CPU usage into user and system charges. The clients all execute under the name "mppexec", but csam also displays the pid, so each T3D user can watch their client's activity while their T3D program is executing. Here is a sample of csam output showing clients taking up more than 60% of one CPU:

  denali        : Top Process Display     09:11:43  Interval:  10  Passes: INDEF

  Times : System(=) User(*)
  Total (all CPUs)    397%  +0--------20--------40--------60--------80------100+
  mppexec       35228  31% M ============***
  mppexec       35227  31% M ============***
  lapwit        63724  27% M =*****
  mppexec        5225  26% M ===========**
  a.out         15349  24%   =******
  m553          34867  24%   =******
  11oneway.e    43838  24%   =******
  irs           26457  12%   *******
  NOMAD3DNEW    25099   9%   ****
  nfc3d.x         500   6% M =**
  acous         81366   6%   ***
  na631D_hib6E  16309   6% M ***
  p133B_hib6Ew  53498   6% M ***
  na631D_hib6E  16273   5% M **
  standard.x    58444   5%   **
  dtns3d.test   28340   5%   **
  p133B_hib6Ew  53499   5% M **
  p131A_tib6Ep  17655   5% M **
  p131A_tib6Ep  17656   3% M =
  cylf2d        27158   1%   =
  p131A_tib6Ep  17649   1% M

(To get this display, enter an x once the initial display is shown; a q exits from the display.) Using csam, a user can optimize an application to reduce the system overhead on the Y-MP. This is important because the Y-MP at ARSC is already close to 400% utilization. I went over my own PVM application and made several changes that reduced the activity from sometimes 40% of a CPU to 1 or 2% of a CPU. Here are the changes I made:
  1. The master program of the master/slave implementation was moved from the Y-MP to the T3D. This reduces the number of slaves by one, but it moves work load from the Y-MP to the T3D.
  2. The number of pvm_probe calls per second was reduced by putting a sleep before each probe.
  3. Originally the I/O to an output file and to the screen was unbuffered with the setbuf call. This was removed, and some of the capability was restored with calls to fflush.
These changes weren't too hard, and they would not have been necessary in a PVM environment of networked workstations. The PVM environment on the T3D, with its reliance on clients running on the Y-MP, should be handled differently. In particular, using the Y-MP as the master is discouraged.
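
Changes 2 and 3 can be sketched as follows. This is not the code of my application: the names are made up for illustration, the probe is passed in as a function pointer so that the sketch does not depend on the PVM library (a real program would wrap pvm_probe there), and sleep() is assumed to be available from unistd.h.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>   /* sleep() */

/* Hypothetical stand-in for a PVM probe: returns nonzero when a
   message is waiting. */
typedef int (*probe_fn)(void);

/* Toy probe for demonstration: reports a message from its third call on. */
int ready_on_third_probe(void)
{
    static int calls = 0;
    return ++calls >= 3;
}

/* Change 2: sleep before each probe, so the client on the Y-MP sees
   one request per pause instead of a tight polling loop.
   Returns the number of probes issued, or -1 if nothing arrived. */
int wait_for_message(probe_fn probe, unsigned pause_seconds, int max_probes)
{
    int n;

    assert(probe != NULL);
    for (n = 1; n <= max_probes; n++) {
        sleep(pause_seconds);
        if (probe())
            return n;
    }
    return -1;
}

/* Change 3: leave stdio buffering on and flush only every flush_every
   records, rather than unbuffering the stream with setbuf().
   Returns the number of explicit flushes. */
int write_records(FILE *out, int nrecords, int flush_every)
{
    int i, flushes = 0;

    assert(out != NULL && flush_every > 0);
    for (i = 1; i <= nrecords; i++) {
        fprintf(out, "record %d\n", i);
        if (i % flush_every == 0) {
            fflush(out);
            flushes++;
        }
    }
    return flushes;
}
```

The pause length and the flush interval are the tuning knobs: longer values mean fewer requests for the client to service, at the price of slower reaction to messages and staler output.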

Memory Usage on the T3D

In the last newsletter, we looked at the two fixed size partitions of the Private PE memory model, code and data. These two areas do not change size during execution, so tools such as mppsize and memsize are very useful for monitoring them. There are two additional areas of the Private PE memory model, the stack and the heap. These two partitions can change size during the execution. Below is a picture copied from the Cray MPP Loader User's Guide SG-2514 1.1, showing the layout of memory and the directions that the heap and the stack grow.

  --------------------------
          Code
  --------------------------
          Data
  --------------------------
          Heap
            g
            r
            o
            w
            t
            h
           \ /
            v

  ---                    ---

            ^
           / \
            g
            r
            o
            w
            t
            h
          Stack
  --------------------------
The boundary between the heap and stack partitions is actually the unused memory of the PE.

In the default stack mode of compiling, local variables will be allocated on the stack during the execution of a routine or function. So in the example below, the storage for the array a[ ] will come from the stack partition when the function fromstack() is invoked and will be returned to the stack on exit from fromstack().


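  #define XXXX 1000000

  void fromstack( void );
  void blahblahblah( void );   /* placeholder for any routine that does the work */

  int main( void )
  {
    fromstack();
    return 0;
  }

  void fromstack( void )
  {
    int a[ XXXX ];             /* taken from the stack on entry, returned on exit */

    blahblahblah();
  }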
Using a script similar to the one in the last newsletter, we find that the array a[ ] can have about 1,500,000 elements; beyond that, the program fails with the error message:

  User core dump completed (./mppcore)
  Operand range error
Local variables in main also are allocated from the stack but they remain there for the entire execution. Variables allocated with malloc (and its relatives) are placed on the heap.


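  #include <stdio.h>
  #include <stdlib.h>

  #define XXX 1000000   /* number of integers to try */

  void fromheap( void );

  int main( void )
  {
    fromheap( );
    return 0;
  }

  void fromheap( void )
  {
    int *a;
    int i;
    long sum = 0;       /* the sum does not fit in 32 bits when XXX is 1000000 */

    a = (int *)malloc( sizeof( int ) * XXX );   /* taken from the heap */
    if( a != NULL ) {
      for( i = 0; i < XXX; i++ ) a[ i ] = i;
      for( i = 0; i < XXX; i++ ) sum = sum + a[ i ];
      printf( "Sum of first %d integers = %ld\n", XXX, sum );
      free( a );
    } else {
      printf( "Could not allocate enough space for %d integers\n", XXX );
    }
  }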
We can investigate the behavior of allocation from the heap using the above program and a shell script similar to the one used in the last newsletter. Again we find that about 1,500,000 integer values can be allocated on the 2MW T3D node. This method of allocation has the benefit that the program itself is notified when memory is exhausted (malloc returns NULL), so it can react instead of aborting.
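
One way a program might react is to retry with a smaller request. Below is a minimal sketch of that idea, assuming that halving the request is acceptable to the application. The function name is made up, and how cleanly malloc reports exhaustion depends on the system.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Try to allocate room for wanted ints; on failure, halve the request
   and try again.  Stores the element count actually obtained in *got
   and returns the block, or NULL if not even one int fits. */
int *alloc_ints_with_fallback(size_t wanted, size_t *got)
{
    size_t n;

    assert(got != NULL);
    for (n = wanted; n > 0; n /= 2) {
        int *block;

        if (n > SIZE_MAX / sizeof(int))
            continue;                 /* n * sizeof(int) would overflow */
        block = malloc(n * sizeof(int));
        if (block != NULL) {
            *got = n;
            return block;
        }
    }
    *got = 0;
    return NULL;
}
```

The caller checks the count stored in got, works with whatever was obtained, and releases the block with free when done. Nothing comparable is possible for a local array that overflows the stack, where the first sign of trouble is the operand range error.
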
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.