ARSC T3E Users' Newsletter 120, May 30, 1997

Purging of T3E core/mppcore files

During test time on June 10, ARSC will launch a daily cron job that deletes core and mppcore files in user directories on yukon's /u1 and /u2 disks which have not been accessed in seven or more days. The seven-day /tmp purger has been running on yukon for a week now; since it deletes every file on /tmp older than seven days, it removes old core and mppcore files as well. To save important core files on /u1 and /u2 (as if there were a lot of these hanging around!), you can simply rename them.

Handling Premature PE Exits on T3D/E

If we run the following code on 4 processors on the T3D, what do you think will happen? And on the T3E?


ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      program prog
      implicit none

      integer, parameter :: maxproc =  256

      include "mpp/shmem.fh"

! local identity
      integer my_pe
      integer num_pe

      integer shmem_my_pe, shmem_n_pes
      integer mproc

! processor to stop
      integer hproc

! find out how many processors
      num_pe = shmem_n_pes()
! which one am i?
      my_pe = shmem_my_pe()

      write(6,*) '  pe ',my_pe,' of ',num_pe

      hproc=2

      if(my_pe.eq.hproc)  then
        write(6,*) ' processor ',hproc,' stopping '
        stop
      endif

      write(6,*) '  pe ',my_pe,' of ',num_pe,' wait on barrier '
      call barrier()

      end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

Here is the output from a run on the ARSC T3D:


  c-denali<121> ./prog -npes 4
     pe  3  of  4
     pe  0  of  4
     pe  3  of  4  wait on barrier
     pe  0  of  4  wait on barrier
     pe  2  of  4
     pe  1  of  4
    processor  2  stopping
     pe  1  of  4  wait on barrier
   STOP (PE 2)   executed at line 39 in Fortran routine 'PROG'

At this point, the job will wait until the user's resources are consumed and then be terminated by the system, occupying the four PEs all the while.

Here is the output from a run on the ARSC T3E:


  yukon% mpprun -n 4 ./prog
     pe  1  of  4
     pe  0  of  4
     pe  1  of  4  wait on barrier
     pe  3  of  4
     pe  2  of  4
     pe  0  of  4  wait on barrier
     pe  3  of  4  wait on barrier
    processor  2  stopping
   STOP (PE 2)   executed at line 39 in Fortran routine 'PROG'
   FATAL ERROR IN BARRIER: User deadlock detected.
      Number of PEs that have exited:        1
      Number of PEs waiting for barriers:    3
      Total PEs:                             4
   Status of each PE:
      PE   0 is waiting for a hardware barrier
      PE   1 is waiting for a hardware barrier
      PE   2 has exited
      PE   3 is waiting for a hardware barrier
  SIGNAL: Abort ( from process 7959 )

   Beginning of Traceback (PE 3):
    Interrupt at address 0x800032130 in routine '_lwp_kill'.
    Called from line 30 (address 0x800031930) in routine 'raise'.
    Called from line 125 (address 0x800009784) in routine 'abort'.
    Called from line 321 (address 0x8000c5c20) in routine
  '_sma_register_blocked_pe'.
    Called from line 80 (address 0x8000c52e8) in routine '_sma_deadlock_wait'.
    Called from line 111 (address 0x8000c63c0) in routine 'barrier'.
    Called from line 43 (address 0x80000129c) in routine 'PROG'.
    Called from line 433 (address 0x800000b3c) in routine '$START$'.
   End of Traceback.
  Abort (core dumped)
  yukon%

The T3E handles the situation much better than the T3D. Naturally, the programmer should not have an exit or stop statement that bypasses a global synchronization. A more realistic situation is where each processor is reading from a different file and one encounters an error and stops, or each is malloc'ing and one runs out of memory and stops, and the programmer has done a poor job of exception handling. (In the case of the file read problem, Fortran provides an easy solution. Stay tuned...)
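
Here is one hedged sketch of how the file-read case might be coordinated (not necessarily the easy Fortran solution teased above): each PE opens its file with an IOSTAT= specifier instead of letting the failure kill it, then all PEs share the outcome with a SHMEM maximum reduction, so that if any open fails, every PE learns of it and they all stop together instead of leaving the others waiting at a barrier. The file names, the unit number, and the use of shmem_int8_max_to_all (which assumes Cray's default 64-bit integers and the reduction work/sync constants from mpp/shmem.fh) are assumptions for this illustration only.


ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      program safeopen
      implicit none

      include "mpp/shmem.fh"

      integer shmem_my_pe, shmem_n_pes
      integer my_pe, num_pe, ios, i
      character*16 fname

! flag and symmetric work space required by the SHMEM reduction
      integer myflag, allflag
      integer pwrk(SHMEM_REDUCE_MIN_WRKDATA_SIZE)
      integer psync(SHMEM_REDUCE_SYNC_SIZE)
      common /redwrk/ myflag, allflag, pwrk, psync

      my_pe  = shmem_my_pe()
      num_pe = shmem_n_pes()

! initialize psync on every PE before the first reduction
      do i = 1, SHMEM_REDUCE_SYNC_SIZE
        psync(i) = SHMEM_SYNC_VALUE
      enddo
      call barrier()

! each PE opens its own (hypothetical) input file; IOSTAT= keeps a
! failed open from aborting this PE on its own
      write(fname,'(a,i3.3)') 'input.', my_pe
      open(10, file=fname, status='old', iostat=ios)
      myflag = 0
      if (ios .ne. 0) myflag = 1

! global max: allflag is nonzero on every PE if any open failed
      call shmem_int8_max_to_all(allflag, myflag, 1, 0, 0, num_pe,
     &                           pwrk, psync)

      if (allflag .ne. 0) then
        if (my_pe .eq. 0) write(6,*) ' an open failed; all PEs stop'
        stop
      endif

! ... normal work; later barriers are now safe ...
      call barrier()
      end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

The same pattern covers the out-of-memory case: set the flag when an allocation fails rather than when an open fails.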

cache_bypass and shmem_put

At the CUG, Jeff Brooks mentioned a few features of the v3.0 f90 T3E compiler (see last week's newsletter), one of which is a compiler directive to bypass the cache, thus using the e-registers for memory transfers. For example:


!dir$ cache_bypass c,a
       do  i=1,n
        c(i) = a(i) 
       enddo

This directive is not available in v2 f90, but we can test the effect by using shmem_put on the same PE. The following call has the same effect as the loop above:


       call shmem_put (c, a, n, mype)

This is Cray-specific code, of course, but if you are using shmem routines anyway and can't wait for f90 v3.0, it might be worth it. The benefit increases with the volume of data being copied: in a simple test run on the T3E, shmem_put copied large arrays more than four times as fast as a loop. The same test suggests that on the T3D you should stick with loops and array operations. Here's the test code, followed by sample output:


cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
       program prog
       implicit none
       integer, parameter::SZ=1000000
       real c(SZ), a(SZ) 
       integer i, t1, t2, t3, t4, t5, t6
       integer irtc, ierr, index, iclktck, my_pe, shmem_my_pe
       integer copycnt, copysz

       ! Get machine clock ticks per second
       call pxfconst ('CLK_TCK',index,ierr)
       call pxfsysconf (index, iclktck, ierr)

       ! Get my pe
       my_pe = shmem_my_pe()


       do copycnt=1,6
         copysz=10**copycnt
         c = 1

! copy arrays using shmem_put.
         a = 0
         t1 = irtc ()
         call shmem_put (a, c, copysz, my_pe)
         t2 = irtc ()
         call dummy (a, c) 

! Verify copy completed.
         do  i=copysz,1,-1
           if (c(i) .NE. a(i)) stop "copy failed"
         enddo

! copy arrays using f90 array operation
         a = 0
         t3 = irtc ()
         a(1:copysz) = c(1:copysz)
         t4 = irtc ()
         call dummy (a, c) 

! copy arrays using loop
         a = 0
         t5 = irtc ()
         do  i=1,copysz
          a(i) = c(i)
         enddo
         t6 = irtc ()
         call dummy (a, c) 
  
         write (6,'("Copying ", i8, " words. Words per sec:")') copysz
         write (6,'("shmem_put ",f12.0)') copysz/((t2-t1)/real(iclktck))
         write (6,'("array op  ",f12.0)') copysz/((t4-t3)/real(iclktck))
         write (6,'("loop      ",f12.0)') copysz/((t6-t5)/real(iclktck))
         write (6,*) 
  
       enddo

       end

       subroutine dummy (a, c)
       ! Keeps the compiler from optimizing away the copy loops
       real a(*), c(*)
       end
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

Here is output from this code on the T3E:


  yukon$ mpprun -n1 ./a.out
  Copying       10 words. Words per sec:
  shmem_put     1143729.
  array op      3064351.
  loop          2909796.
   
  Copying      100 words. Words per sec:
  shmem_put     8498584.
  array op      7767996.
  loop          8136697.
   
  Copying     1000 words. Words per sec:
  shmem_put    30618494.
  array op      9596622.
  loop          9015778.
   
  Copying    10000 words. Words per sec:
  shmem_put    43273184.
  array op      9065170.
  loop          9780493.
   
  Copying   100000 words. Words per sec:
  shmem_put    43497488.
  array op     10582787.
  loop         10324527.
   
  Copying  1000000 words. Words per sec:
  shmem_put    44401595.
  array op     10590687.
  loop         10429912.
     
  yukon$ 
  

At about 100 array elements, the cache_bypass/shmem_put/e-register method surpasses the loop and array-operation methods, and it never looks back.

The best rate of about 44 Mwords/sec translates to about 355 Mbytes/sec, surpassing the best shmem inter-PE bandwidth of about 330 Mbytes/sec obtained in earlier bandwidth tests and reported in newsletter 117. These rates fluctuate slightly, on the order of a percent, from run to run.

Here is output from this code on the T3D:


  denali$ mppexec ./a.out -npes 1
  Copying       10 words. Words per sec:
  shmem_put     1116183.
  array op      6173457.
  loop          5415704.
   
  Copying      100 words. Words per sec:
  shmem_put     5331023.
  array op     13182338.
  loop         12789003.
   
  Copying     1000 words. Words per sec:
  shmem_put     8108481.
  array op     10638607.
  loop         14611376.
   
  Copying    10000 words. Words per sec:
  shmem_put     9269283.
  array op      8931485.
  loop          8660224.
   
  Copying   100000 words. Words per sec:
  shmem_put     9786806.
  array op      8633544.
  loop          8607606.
   
  Copying  1000000 words. Words per sec:
  shmem_put     9770439.
  array op      8604423.
  loop          8601747.
   
  denali$ 
  

Programming Clusters of SMP Nodes


> [ One of our readers, David Bader, sent in this announcement. ]
> 
> ----------------------------------------------------------------------
>  SIMPLE: A Methodology for Programming High Performance Algorithms on
>           Clusters of Symmetric Multiprocessors (SMPs)
> ----------------------------------------------------------------------
> 
> We have released our technical report entitled ``SIMPLE: A Methodology
> for Programming High Performance Algorithms on Clusters of Symmetric
> Multiprocessors (SMPs),'' by David A. Bader and Joseph Ja'Ja'.
> Technical Report Number: CS-TR-3798 and UMIACS-TR-97-48. Institute for
> Advanced Computer Studies (UMIACS), University of Maryland, College
> Park, May 1997.
> 
> This report is available in PostScript format via the WWW:
> 
>   http://www.umiacs.umd.edu/research/EXPAR
>   http://www.umiacs.umd.edu/research/EXPAR/papers/3798.html
> 
> or via anonymous ftp from these locations:
> 
>   ftp://ftp.cs.umd.edu/pub/papers/papers/3798/3798.ps.Z
>   ftp://ftp.umiacs.umd.edu/pub/EXPAR/papers/3798.ps.gz
>   ftp://ftp.umiacs.umd.edu/pub/EXPAR/papers/3798.ps.Z
> 
> If you prefer a hardcopy, please reply to this message and send me
> your mailing address.
> 
> ABSTRACT:
> 
>       We describe a methodology for developing high performance programs
>    running on clusters of SMP nodes. Our methodology is based on a small
>    kernel (SIMPLE) of collective communication primitives that make
>    efficient use of the hybrid shared and message passing environment. We
>    illustrate the power of our methodology by presenting experimental
>    results for sorting integers, two-dimensional fast Fourier transforms
>    (FFT), and constraint-satisfied searching. Our testbed is a cluster of
>    DEC AlphaServer 2100 4/275 nodes interconnected by an ATM switch.
> 
> ---
> 
> David A. Bader, Ph.D.                           Office: 301-405-6755   
> Institute for Advanced Computer Studies         FAX:    301-314-9658
> A.V. Williams Building               Internet: dbader@umiacs.umd.edu
> University of Maryland        WWW: http://www.umiacs.umd.edu/~dbader
> College Park, MD 20742
> 

Quick-Tip Q & A


A: {{ Is diamond a good conductor (of heat? of electricity?). How about
      silicon?  }}

  Diamond has nice properties for hardware engineers.  It is a
  fantastic conductor of heat but an electrical insulator.

  A speaker at the CUG described shrinking a 4-processor, 1-Gbyte
  J90's weight by 83% and volume by 80% for an embedded application.
  Among other things, they used a substrate of synthetic diamond on the
  boards to aid heat dispersion.  They also used a coolant atomizer to
  directly spray them.  The coolant phase change keeps the boards at
  essentially constant temperature without the bulk of chillers.  The
  vapor is reclaimed into liquid via a simple radiator.

  How about silicon?  As an aside, ARSC staff has a tradition of buying
  interesting little items for each other whenever we are on vacation
  or a business trip (and think of it).  Dale Clark returned from
  Silicon Valley with pieces of silicon for everyone.  They look like
  shiny metallic chunks of coal, which makes sense, as Si is in the
  same valence group as C.  Silicon, as we all know, is a
  semi-conductor.

Q: On the T3E, how can you limit the size of core files?


[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.