ARSC T3D Users' Newsletter 44, July 14, 1995

More on Fixed and Plastic Executables

In last week's newsletter, I described one Craft Fortran program that did not show much of a speedup when compiled as a fixed executable as opposed to a "plastic" executable. Of course, a possible speed improvement is only one benefit of such a switch. Dr. Ming Jiang, a member of the UAF faculty and the ARSC staff, sends in this further note:


  > > ARSC T3D Users' Newsletter Number 43 07/07/95
  > >    
  > >    plastic     128  0.011290  0.040066  0.011630
  > >    with -X128  128  0.011002  0.037600  0.010887
  > > 
  > > With this switch the timings are consistently affected in the
  > > right direction  but the effect is minimal. 
  > > 
  > > If any users have similar experiences in "switchology", I'd be
  > > happy to pass them on through this newsletter.
  > 
  > Mike,
  > 
  > -X pes will produce a smaller executable.
  > It doesn't require re-load every time, while "plastic" one calls
  > mppldr each time you run it.
  > 
  > Dr. Ming Jiang
  > Associate Professor
  > Dept. of Math Sciences
  > University of Alaska            Tel: (907) 474-6666 ext 3744
  > Fairbanks, AK 99775             Fax: (907) 474-5394
I also thought that the -X npes switch would make more efficient use of memory, but from the test case below, it seems that the plastic executable is implemented as efficiently as the fixed executable. To check how much larger a problem the fixed executable can hold than the plastic executable, I devised this small program:

          parameter( NMAX = 2048 )
          real a( NMAX, NMAX )
  cdir$ shared a( :, :block )
          real pesum( 0:127 )
  cdir$ shared pesum( :block )
          intrinsic my_pe
          me = my_pe()
          call barrier()
          if( me .eq. 0 ) t1 = second() 
  cdir$ doshared( j ) on a( i, j )
          do 10 j = 1, NMAX
             do 10 i = 1, NMAX
                a( i, j ) = i + ( j-1 ) * NMAX
    10    continue
          call barrier()
          if( me .eq. 0 ) t1 = second() - t1
          call barrier()
          if( me .eq. 0 ) t2 = second()
          pesum( me ) = 0.0
  cdir$ doshared( j ) on a( i, j )
          do 20 j = 1, NMAX
             do 20 i = 1, NMAX
                pesum( me ) = pesum( me ) + a( i, j )
    20    continue
          call barrier()
          if( me .eq. 0 ) then
            t2 = second() - t2
            sum = 0.0
            do 30 i = 0, n$pes-1
               sum = sum + pesum( i )
    30      continue
            w = NMAX * NMAX / 1000000.0
            write(6,600)n$pes,w/t1,w/t2,sum,(NMAX*NMAX+1)*NMAX*NMAX/2
   600      format( i6, f10.3, f10.3, f20.1, i20 )
          endif
          end
This program has one large two-dimensional array of size NMAX*NMAX which contains the integers 1, 2, 3, ..., NMAX*NMAX. The program uses as many processors as it can to sum the elements of the two-dimensional array and then checks the result against the identity:

  1 + 2 + 3 + ... + n = ( n + 1 ) * n / 2
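The check can be reproduced on a small scale. The following is a minimal Python sketch, not the Craft Fortran program itself: it assumes that `:block` hands each PE one contiguous block of columns and that the number of PEs divides NMAX. The names `block_column_sum` and `closed_form` are made up for this illustration.

```python
def block_column_sum(nmax, npes):
    """Per-PE partial sums of a(i,j) = i + (j-1)*nmax.

    Assumption: columns are dealt out in contiguous blocks, one block per PE,
    and npes divides nmax evenly (both are powers of 2 in the newsletter runs).
    """
    cols_per_pe = nmax // npes
    pesum = [0] * npes
    for pe in range(npes):
        first = pe * cols_per_pe + 1          # first column owned by this PE
        for j in range(first, first + cols_per_pe):
            for i in range(1, nmax + 1):
                pesum[pe] += i + (j - 1) * nmax
    return pesum


def closed_form(nmax):
    """1 + 2 + ... + n with n = nmax*nmax, using (n + 1) * n / 2."""
    n = nmax * nmax
    return (n + 1) * n // 2


NMAX, NPES = 64, 8                            # small values so the sketch runs quickly
partial = block_column_sum(NMAX, NPES)
total = sum(partial)                          # the role PE 0 plays in loop 30
print(total, closed_form(NMAX))
```

The two printed numbers should agree, which is the same comparison the program makes in its final write statement.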
By varying the value of NMAX, we can see how large a two-dimensional array is allowed when the program is compiled for a fixed executable. Finding the maximum value for NMAX doesn't require much searching because Craft Fortran requires that the shared array "a" have a leading dimension that is a power of 2. Just for fun, I also computed the speed of the shared initialization loop and summation loop. The table below summarizes the results:

  Results for plastic and fixed executables on the T3D: problem sizes and execution speeds
  
   Number of PEs           1      2      4      8      16      32      64      128
  
   Total physical memory   8     16     32     64     128     256     512     1024
   (in millions of 64-bit words)
  
  
  Compiled as a plastic executable:
  
   Maximum Value        2048   2048   4096   4096    8192    8192   16384    16384
    for NMAX
  
   Size of array "a"     4.2    4.2   16.8   16.8    67.1    67.1   268.4    268.4
   (in millions of
     64 bit words)  
  
     Speed of 
   initialization        6.4   12.3   24.7   49.4   103.5   207.1   414.1    828.5
     (millions of
    results/second)
  
     Speed of
    summation            9.8   18.7   37.4   74.9   156.9   313.9   628.0   1255.1
     (millions of
     adds/second)
  
  
  Compiled for a fixed number of PEs:
  
   Maximum Value        2048   2048   4096   4096    8192    8192   16384    16384
    for NMAX
  
     Speed of 
   initialization        6.4   14.2   24.7   49.4   103.5   237.9   475.9    951.3
  
     Speed of
    summation            9.8   18.7   37.4   74.9   156.8   313.6   627.9   1256.0
Because each doubling of NMAX causes a quadrupling of the memory required by array "a", the pattern of largest problems that fit makes sense. The runs compiled with the -X npes flag show that the largest problem that physically fits in memory can be solved. And for this problem, the current implementation of a "plastic" executable is as efficient in memory use as an executable compiled for a fixed number of PEs. But as in the last newsletter, there is a speed improvement for executables targeted at a fixed number of PEs: initialization runs about 15% faster at 2, 32, 64, and 128 PEs, while the summation speeds are essentially unchanged.
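The pattern in the "Maximum Value for NMAX" rows can be reproduced with a few lines of arithmetic. This is a minimal Python sketch under two assumptions that are not stated in the measurements: each PE has 8388608 words (the figure quoted in the compiler warning for PE_PRIVATE arrays), and the array must be strictly smaller than the total memory because the program and system need some room too. The name `largest_nmax` is made up for this illustration.

```python
WORDS_PER_PE = 8 * 2**20   # 8388608 words per PE (assumed from the compiler warning)


def largest_nmax(npes, words_per_pe=WORDS_PER_PE):
    """Largest power-of-2 NMAX whose NMAX*NMAX array is strictly smaller
    than the combined memory of npes PEs (strictness is an assumption)."""
    total = npes * words_per_pe
    nmax = 1
    while (2 * nmax) ** 2 < total:
        nmax *= 2
    return nmax


table = {npes: largest_nmax(npes) for npes in (1, 2, 4, 8, 16, 32, 64, 128)}
for npes, nmax in table.items():
    print(npes, nmax, round(nmax * nmax / 1.0e6, 1))
```

Under these assumptions the next larger array at 2, 8, 32, and 128 PEs would fill the combined memory exactly, which is why the maximum NMAX only doubles at every second doubling of the PE count.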

There is a significant advantage for the fixed executable in that its a.out is less than half the size of the plastic executable's. For this program, the executable sizes are:


   985696 Jul 11 14:21 suma128 <- fixed executable
  2189104 Jul 11 14:23 sumb128 <- plastic executable
and the size of the fixed executable was the same whatever the value of npes in the compilation command cf77 -X npes ...

On the way to producing the above tables I ran into several interesting error messages:

  1. Operand Range Error. When compiling a problem larger than what will physically fit, either as a fixed or a plastic executable, there is no error message (even though CRI sells only 8MW nodes, so there is a known limit in the case of fixed executables). Only at execution time does the user get the catch-all "Operand Range Error". Increasing NMAX until this error message appears is how I determined the largest problem that fits.
  2. The fixed limit on PE_PRIVATE arrays. If the array "a" is not declared SHARED, then it is a PE_PRIVATE array and will be contained in the memory of each PE. In compiling both the plastic and fixed executables, the compiler gives a good message for arrays too large for a single PE:
    
      2      2.         real a( nmax, nmax )
      cft77-424 cf77: WARNING $MAIN, Line = 2, File = suma.f, Line = 2
      Array "A" exceeds CPU targeted memory size of 8388608 words.
    
  3. Limit on the size of a shared array. There is a compiler limit on the largest shared array. If the array "a" is shared and NMAX = 262144 (the array "a" is 68.72GW), then the compilation aborts with:
    
      Total size of memory segment 03 exceeds compiler limit.
      cft77-9  cf77: CFT77 COMPILATION ABORTED
    
    So the size of a shared array seems to have no practical limit, except perhaps for artificial programs like the "linpeak" benchmark.
  4. Fixed objects take precedence over plastic objects
    1. When linking object files of which some were compiled as fixed and some as plastic, the executable that mppldr produces is always fixed. mppldr assumes the user knows what he is doing, because there is no warning message.
    2. When the "fixedness" varies among objects, that is, when objects were compiled for different numbers of PEs, mppldr knows the user needs help and gives an appropriate error message:
      
        /mpp/bin/cf77 -X1 second.f
        /mpp/bin/cf77 -X128 suma.f second.o -o suma
        mppldr-302 cf77: WARNING 
        The number of PEs compiled into module 'SECOND' (1) differs
        from the number of PEs compiled into a prior module (128).
        mppldr-112 cf77: WARNING 
        Because of previous errors, file 'suma' is not executable.
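The two cases in item 4 can be summarized as a small decision rule. The Python sketch below is only a toy model of the behavior observed above, not mppldr's actual logic; the function name `link_mode` and the use of `None` to stand for a plastic object are assumptions made for this illustration.

```python
def link_mode(pe_counts):
    """Toy model of the observed linking behavior.

    pe_counts has one entry per object file: None for a plastic object,
    or the integer n given to -X n for a fixed object.
    """
    fixed = {n for n in pe_counts if n is not None}
    if not fixed:
        return "plastic"                      # nothing was compiled with -X
    if len(fixed) > 1:
        # Corresponds to the mppldr-302 / mppldr-112 messages shown above.
        raise ValueError("objects compiled for different numbers of PEs: "
                         + ", ".join(str(n) for n in sorted(fixed)))
    return "fixed for %d PEs" % fixed.pop()   # fixed wins over plastic, silently


print(link_mode([None, None]))
print(link_mode([128, None]))
```

Calling `link_mode([128, 1])` raises the error, mirroring the suma.f and second.f example above.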
      

Release 1.2.2 of CrayLibs

ARSC is planning to make available the 1.2.2 release of CrayLibs as soon as it is released. Watch this newsletter for further details.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
  12. RANF() and its manpage differ between machines (Newsletter #37)
  13. CRAY2IEG is available only on the Y-MP (Newsletter #40)
  14. Missing sort routines on the T3D (Newsletter #41)

I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives: Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.