ARSC T3E Users' Newsletter 119, May 16, 1997

Yukon /tmp Purging Begins Next Week

Users helping with the testing of ARSC's T3E should be interested in this.

As noted in the motd on yukon, we plan to begin purging files in the /tmp file system on yukon on Tuesday, May 20. Any file in /tmp (excluding /tmp/window) not accessed or modified in 7 days will be deleted; /tmp/window will be subject to a 21-day purge, as on denali. The script that will do this is already in the yukon root crontab, but for now it only lists the files that would be deleted rather than actually deleting them.

Most people active on yukon are likely to have something purged. Beyond individual users, directories such as /tmp/dump, /tmp/window/udb, /tmp/window/dot.files, and /tmp/window/disk also contain many files that will go away.

T3E Adaptive Routing and 'shmem_fence'

The previous newsletter quoted some bandwidth figures for PUT and GET on the T3D and T3E. The code given used SHMEM_WAIT to verify that all of the data had arrived.

On the T3D, routing of messages is deterministic: data arrives in the order it was sent. The T3E has a system configuration option that enables adaptive routing; as the name suggests, when this is enabled, data may take different routes through the torus, avoiding hotspots. Data therefore need not arrive in order, and testing the last data item is no longer a correct test for arrival of the complete message.
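To see why testing only the last element is unsafe, here is a toy illustration (plain Python, not SHMEM, with a contrived arrival order chosen for the example): the final word of a "message" is delivered while an earlier word is still in flight, so a wait on the last element succeeds before the buffer is complete.

```python
msize = 8
message = [msize] * msize
buffer = [-1] * msize            # -1 marks "not yet arrived"

# With adaptive routing, words may be delivered in any order.  Here the
# final word (index 7) arrives first, while index 6 is still in flight.
arrival_order = [7, 0, 1, 2, 3, 4, 5]
for i in arrival_order:
    buffer[i] = message[i]

last_arrived = buffer[msize - 1] != -1       # what a wait on the last word sees
all_arrived = all(b != -1 for b in buffer)   # what correctness actually requires

print(last_arrived, all_arrived)             # True False: the test passes too early
```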

The following code uses shmem_put and shmem_wait to transfer arrays. It was run on the ARSC T3E, which has adaptive routing enabled. It counts the number of data items that have not yet arrived when the shmem_wait test on the last array element completes, and prints out these "error" counts.


Example code 1.
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      program band
        
      include "mpp/shmem.fh"

      parameter (maxsize=1024*1024)
      common /a/ itdata(maxsize),isdata(maxsize)
      integer num_pe,my_pe
      integer shmem_my_pe, shmem_n_pes


      num_pe = shmem_n_pes();my_pe = shmem_my_pe()

! set master processor
      mproc=0;mtarg=1

! how many repeats
      ntdo=10

! count errors
      iperr=0;igerr=0;icerr=0

      do matarg=0,num_pe-1

        mtarg=mod(matarg+my_pe,num_pe)
        mbtarg=mod(matarg-my_pe+num_pe,num_pe)
      
        msize=maxsize
  
 99     continue

        itget=0;itrecv=0;itput=0

        do ido=1,ntdo

          call barrier()
          isdata(1:msize)=msize;itdata(1:msize)=-1

          call barrier()
          its=irtc()
          call shmem_put(itdata,isdata,msize,mtarg)
          itf=irtc();itput=itput+itf-its
          call shmem_wait(itdata(msize),-1)

          do ic=1,msize
            if(itdata(ic).ne.msize) iperr=iperr+1
          enddo
          icerr=icerr+msize

          do ic=1,msize-1
            call shmem_wait(itdata(ic),-1)
          enddo
          itf=irtc();itrecv=itrecv+itf-its

          isdata(1:msize)=msize;itdata(1:msize)=-1
          call barrier()
          its=irtc()
          call shmem_get(itdata,isdata,msize,mbtarg)

          do ic=1,msize
            if(itdata(ic).ne.msize) igerr=igerr+1
          enddo
          icerr=icerr+msize

          itf=irtc();itget=itget+itf-its
        enddo

        itget=itget/ntdo;itrecv=itrecv/ntdo;itput=itput/ntdo
        if(mproc.eq.my_pe) then
          write(6,*) ' put/get times ',msize, ' were ',
     !      itput,itrecv,itget
          write(6,*) ' bandwidth ',matarg,
     !                           float(msize)/(float(itput)/300.0E6),
     !                           float(msize)/(float(itget)/300.0E6)
 2000     format(1x,i10,i10,e13.6,e13.6)
        endif


        if(msize.gt.1) then
          msize=msize/2
          goto 99
        endif


      enddo

      write(6,*) ' number of data errors ',my_pe,' is put/get ',
     !       iperr,igerr,
     !       ' in ',icerr

      stop 
      end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
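For reference, the bandwidth figures printed by the code convert irtc() clock ticks to a rate using the 300 MHz T3E clock (the 300.0E6 constant in the source). A sketch of the arithmetic in Python, using a made-up tick count, and assuming 8-byte words for the MB/s conversion:

```python
CLOCK_HZ = 300.0e6        # T3E processor clock; irtc() counts these ticks

def words_per_second(msize, ticks):
    """Transfer rate as computed in the example code: array elements per second."""
    return msize / (ticks / CLOCK_HZ)

# Hypothetical numbers: a 1M-element put taking 12,000,000 ticks (0.04 s).
msize = 1024 * 1024
ticks = 12_000_000
rate = words_per_second(msize, ticks)
print(rate)               # elements per second
print(rate * 8 / 1e6)     # MB/s, assuming 8-byte words
```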

This code passes data from each processor to a processor a fixed stride away. A test on arrival is necessary to know when all puts have finished; here WAIT is used instead of BARRIER since it allows computation to proceed as soon as the data has arrived, rather than waiting for all data exchanges to complete. Note that the WAIT tests the arrival of data being PUT from another processor. The code loops over all strides, which generates a very busy network and makes adaptive routing of messages more likely. An error count is kept, and the results below are from a run on 32 processors.

Example error counts.


  number of data errors  0  is put/get  1,  0  in  1342176640
  number of data errors  16  is put/get  2*0  in  1342176640
  number of data errors  5  is put/get  2*0  in  1342176640
  number of data errors  1  is put/get  2*0  in  1342176640
  number of data errors  15  is put/get  2*0  in  1342176640
  number of data errors  28  is put/get  33,  0  in  1342176640
  number of data errors  2  is put/get  2*0  in  1342176640
  number of data errors  21  is put/get  41,  0  in  1342176640
  number of data errors  11  is put/get  2*0  in  1342176640
  number of data errors  25  is put/get  1,  0  in  1342176640
  number of data errors  9  is put/get  2*0  in  1342176640
  number of data errors  23  is put/get  2,  0  in  1342176640
  number of data errors  14  is put/get  2*0  in  1342176640
  number of data errors  4  is put/get  2*0  in  1342176640
  number of data errors  18  is put/get  2*0  in  1342176640
  number of data errors  10  is put/get  10,  0  in  1342176640
  number of data errors  3  is put/get  6,  0  in  1342176640
  number of data errors  22  is put/get  2*0  in  1342176640
  number of data errors  6  is put/get  2*0  in  1342176640
  number of data errors  26  is put/get  18,  0  in  1342176640
  number of data errors  24  is put/get  2*0  in  1342176640
  number of data errors  17  is put/get  2*0  in  1342176640
  number of data errors  31  is put/get  2*0  in  1342176640
  number of data errors  20  is put/get  11,  0  in  1342176640
  number of data errors  27  is put/get  2*0  in  1342176640
  number of data errors  13  is put/get  2*0  in  1342176640
  number of data errors  30  is put/get  1,  0  in  1342176640
  number of data errors  29  is put/get  2,  0  in  1342176640
  number of data errors  8  is put/get  2*0  in  1342176640
  number of data errors  19  is put/get  8,  0  in  1342176640
  number of data errors  7  is put/get  1,  0  in  1342176640
  number of data errors  12  is put/get  15,  0  in  1342176640

As can be seen, there are only a small number of data errors; however, even a few could matter in your program.

A simple solution is to wait for every element of the array to have changed, but this is an expensive option for large messages. Instead, the new SHMEM 'fence' routine ensures that puts are delivered in order; used together with the exchange of a token, it makes it possible to confirm that all data has arrived.

The following code uses shmem_fence to guarantee the arrival of the entire message and then tests it, as in the previous example.


Example code 2. 
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      program band
        
      include "mpp/shmem.fh"

      parameter (maxsize=1024*1024)
      common /a/ itdata(maxsize),isdata(maxsize)

      parameter (mfsize=1)
      common /af/ itfdata(mfsize),isfdata(mfsize)

      integer num_pe,my_pe
      integer shmem_my_pe, shmem_n_pes

      num_pe = shmem_n_pes();my_pe = shmem_my_pe()

! set master processor
      mproc=0;mtarg=1

! how many repeats
      ntdo=10

!count errors
      iperr=0;igerr=0;icerr=0

      do matarg=0,num_pe-1

        mtarg=mod(matarg+my_pe,num_pe)
        mbtarg=mod(matarg-my_pe+num_pe,num_pe)
      
        msize=maxsize

 99     continue

        itget=0;itrecv=0;itput=0

        do ido=1,ntdo

          call barrier()
          isdata(1:msize)=msize;itdata(1:msize)=-1

          itfdata=-1;isfdata=msize
          call barrier()

          its=irtc()

          call shmem_put(itdata,isdata,msize,mtarg)
          call shmem_fence
          call shmem_put(itfdata,isfdata,mfsize,mtarg)
          call shmem_wait(itfdata(mfsize),-1)

          itf=irtc();itput=itput+itf-its

          do ic=1,msize
            if(itdata(ic).ne.msize) iperr=iperr+1
          enddo
          icerr=icerr+msize

          itf=irtc();itrecv=itrecv+itf-its

          isdata(1:msize)=msize;itdata(1:msize)=-1
          call barrier()
          its=irtc()
          call shmem_get(itdata,isdata,msize,mbtarg)

          do ic=1,msize
            if(itdata(ic).ne.msize) igerr=igerr+1
          enddo
          icerr=icerr+msize

          itf=irtc();itget=itget+itf-its

        enddo

        itget=itget/ntdo;itrecv=itrecv/ntdo;itput=itput/ntdo
  
        if(mproc.eq.my_pe) then
          write(6,*) ' put/get times ',msize, ' were ',
     !      itput,itrecv,itget
          write(6,*) ' bandwidth ',matarg,
     !                           float(msize)/(float(itput)/300.0E6),
     !                           float(msize)/(float(itget)/300.0E6)
 2000     format(1x,i10,i10,e13.6,e13.6)
        endif


        if(msize.gt.1) then
          msize=msize/2
          goto 99
        endif


      enddo


      write(6,*) ' number of data errors ',my_pe,' is put/get ',
     !       iperr,igerr,
     !       ' in ',icerr


      stop 
      end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

Output from the version using shmem_fence:


  number of data errors  0  is put/get  2*0  in  1342176640
  number of data errors  18  is put/get  2*0  in  1342176640
  number of data errors  2  is put/get  2*0  in  1342176640
  number of data errors  5  is put/get  2*0  in  1342176640
  number of data errors  24  is put/get  2*0  in  1342176640
  number of data errors  6  is put/get  2*0  in  1342176640
  number of data errors  22  is put/get  2*0  in  1342176640
  number of data errors  11  is put/get  2*0  in  1342176640
  number of data errors  7  is put/get  2*0  in  1342176640
  number of data errors  29  is put/get  2*0  in  1342176640
  number of data errors  8  is put/get  2*0  in  1342176640
  number of data errors  27  is put/get  2*0  in  1342176640
  number of data errors  14  is put/get  2*0  in  1342176640
  number of data errors  19  is put/get  2*0  in  1342176640
  number of data errors  3  is put/get  2*0  in  1342176640
  number of data errors  9  is put/get  2*0  in  1342176640
  number of data errors  1  is put/get  2*0  in  1342176640
  number of data errors  17  is put/get  2*0  in  1342176640
  number of data errors  15  is put/get  2*0  in  1342176640
  number of data errors  4  is put/get  2*0  in  1342176640
  number of data errors  16  is put/get  2*0  in  1342176640
  number of data errors  26  is put/get  2*0  in  1342176640
  number of data errors  28  is put/get  2*0  in  1342176640
  number of data errors  30  is put/get  2*0  in  1342176640
  number of data errors  21  is put/get  2*0  in  1342176640
  number of data errors  31  is put/get  2*0  in  1342176640
  number of data errors  13  is put/get  2*0  in  1342176640
  number of data errors  10  is put/get  2*0  in  1342176640
  number of data errors  23  is put/get  2*0  in  1342176640
  number of data errors  25  is put/get  2*0  in  1342176640
  number of data errors  20  is put/get  2*0  in  1342176640
  number of data errors  12  is put/get  2*0  in  1342176640

The calls to shmem_fence ensure that data is passed in order. This refers only to the sequence of messages, not to the data within a message. In this case we pass two messages, and the call to SHMEM_FENCE ensures that the second is not delivered until the first is complete. The arrival of the second, single-item message therefore indicates that the first message has been received in full. Without the call to shmem_fence, data from the preceding "put" could still be in transit.
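The ordering guarantee can be modeled abstractly (plain Python, not SHMEM; the "epoch" model is a simplification): puts to the same PE may be delivered in any order between fences, but no put crosses a fence, so a token put after the fence is always delivered last.

```python
import random

def delivery_order(ops):
    """Model put ordering to one PE.  'ops' is a list of puts and 'FENCE'
    markers.  Puts between fences may be delivered in any order, but no
    put crosses a fence in either direction."""
    delivered, epoch = [], []
    for op in ops:
        if op == "FENCE":
            random.shuffle(epoch)     # adaptive routing reorders within an epoch
            delivered += epoch
            epoch = []
        else:
            epoch.append(op)
    random.shuffle(epoch)
    delivered += epoch
    return delivered

random.seed(0)
ops = ["data[0]", "data[1]", "data[2]", "FENCE", "token"]
order = delivery_order(ops)
# The token is issued after the fence, so it is always delivered last:
# a wait on the token implies the whole array is present.
print(order[-1])   # token
```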

How might a problem with adaptive routing show up in a program? Firstly, different runs might give different results, since the data used in a computation might not yet be correct when execution proceeds past a wait. (Note: users who synchronize with BARRIER don't have to worry; the processor putting the data will only join the barrier once the data transfer is complete.)

Does adaptive routing improve bandwidth? It does not improve the peak performance of the system, but it does help avoid problems with torus congestion. This was generally not a problem on the T3D, since the torus was used only by user programs and carried a small system load. The T3E, however, is a stand-alone system whose processors perform operating system, user, and application tasks, so there is much more traffic on the torus and a greater potential for hotspots.

Here is a bit of the 'man' page:



 

 SHMEM_FENCE(3)                 CrayLibs 2.0                     SR-2165 2.0

 NAME
      shmem_fence - Assures ordering of delivery of puts

 SYNOPSIS
      C:

        #include <mpp/shmem.h>
        void shmem_fence(void);

      Fortran:

        CALL SHMEM_FENCE

 IMPLEMENTATION
      All Cray Research systems

 DESCRIPTION
      This function ensures ordering of put (remote write) operations.  All
      put operations issued to a particular processing element (PE) prior to
      the call to shmem_fence are guaranteed to be delivered before any
      subsequent put operations to the same PE which follow the call to
      shmem_fence.

CUG Spring '97 Report

The Silicon Valley CUG, hosted last week in San Jose by NASA Ames and Sterling Software, was pleasantly upbeat, especially after the previous CUG, which had an air of anxiety due, perhaps, to uncertainty surrounding both the future of CUG and the future of CRI (which had just been acquired by SGI). At this point, CUG's future seems secure, though its direction is still under debate (there will be one meeting per year rather than two, and that decision, at least, seems to sit well with everyone). And the future of Cray also looks good.

A milestone has been reached at Cray: for the first time ever, revenue from the sale of Cray MPP systems exceeded revenue from Cray PVP systems (during one quarter only... but still...).

The SGI/CRI merger seems to be going smoothly, and SGI seems to treat Cray quite generously. In his address to the CUG, Ed McCracken, CEO of Cray/SGI, stated: "You [CUG attendees] have more impact on us by being here in Silicon Valley than we have on you."

This theme was echoed by others. In general, SGI seemed downright grateful to be teamed with responsible, reputable, grey-haired Cray. They have promoted Robert Ewald, former CEO of CRI, to Executive Director of Computer Systems (I didn't get all of these titles exact, but they're close) for all of SGI/Cray, and have relocated him to Mountain View. They have promoted Mick Dungworth, formerly Director of Customer Service (?) for CRI, to the same position for all of SGI/Cray. They have moved development of the Origin2000 (the top of SGI's line) to Eagan, where Irene Qualters, former CRI VP of System Development (?) under Robert Ewald, is now Executive Director of SGI/Cray for Supercomputer Development (?) and responsible for the Eagan facility.

What follows are notes from some presentations relevant to this Newsletter:



----------------------------------------------------------------------
Jeff Brooks of SGI/Cray: T3E Single PE Optimization

-100 Mflops on one 300 MHz PE is a worthy goal.
-Latency to memory is a FACT (like death and taxes) to be dealt with.
-C will never beat FORTRAN for compiler optimization.

Streams:
-after 3 cache misses, starts pre-fetching from remote PE's memory 
 using stream register.
-6 stream buffers (LHS takes 1)
-use MPI to avoid conflicts between e-registers and cache.

f90 v3.0:
-This compiler has new optimization options.

-New Math Library:
  use -lmfastv to get optimized "vector" math library.

-The speaker's favorite optimization options:
   f90 -O3,unroll2,pipeline2,apad -lmfastv

-Cache Bypass using e-register:
  -Using cache will slow down array initialization.  Do this to bypass 
   cache and use e-registers:

!dir$ cache_bypass c,a
       do  i=1,n
         c(i) = a(i)
       enddo

Unrolling:
       do i=1,n,1
         t(i) = t(i) * t(i)      ! expect 1 result per 4 clock periods
       enddo

       do i=1,n,4                ! expect 4 results per 4 clock periods
         t(i+0) = t(i+0) * t(i+0)     
         t(i+1) = t(i+1) * t(i+1)    
         t(i+2) = t(i+2) * t(i+2)   
         t(i+3) = t(i+3) * t(i+3)  
       enddo
----------------------------------------------------------------------
SGI/Cray CEO Ed McCracken's Address:

-Quote: "California attitude: 'If it exists, it's obsolete.'"

-Purpose of SGI/Cray: "unleash human creativity."

-Culture of SGI/Cray:
  -innovative:  "white corpuscles rise up to kill boring projects."
  -respect honesty toward clients.
  -Integrity: "honoring commitments."
  -Passion to excel.
  -Deliver results:  it's meaningless w/out delivery.

-Strengths:
  -systems: SGI creates entire systems (gets components from specialists).
  -"man-machine" interface.
  -customer relations.

-Coming Changes:
  -10^3 speedup in computers by 2010.
  -Drastic speedup in networks.

-Effects of these changes:
  -Recentralization of computing; application servers; growth in large
   network hubs; resurgence of supercomputing centers.

-Current market:
  -33% (!!) of SGI systems in network hubs, now.   
     -Managers of network hubs want a few huge systems rather than 
      100's of small systems.
  -33% in supercomputing.
  -33% in graphics.

-Current competitive areas in the SGI world:
  -Unix.
  -PC's.
  -Network.
  -Consumer electronics.

-Relationship with user groups:
 -Catering too much to desires of today's users can impede progress, as
  users want things to stay the same. Need to look to tomorrow's users.

----------------------------------------------------------------------
Richard Langstrom, SGI/Cray: T3E 

-Resource Scheduling: To be in coming U/mk releases:
  -Gang scheduler (moves groups of apps into run-status; swaps groups).
  -Load Balancer  (shifts apps between PEs).

-Quotas: Coming in v 1.5.2 (July?).

----------------------------------------------------------------------
Kent Koeninger, SGI/Cray:  Cray product comparison:

T90: 880 Gb/sec -- mem-processor max bandwidth
     1.6 Gflop per processor max

J90: Compatible w/ Y-MP
     Good price/perf ratio

T3E: Most scalable

Origin2000: independent processor price/perf ratio comparable w/ 
     T90 and T3E.  SMP or MPP style of programming.

     Each node is PVP w/ shared memory.  Nodes connected MPP style
     for globally addressable memory.  
----------------------------------------------------------------------

Quick-Tip Q & A


A: {{ Can you tell NQS to not restart your job from the beginning after
      a failed checkpoint at shutdown or after a crash?  You might want
      to do this if restarting the job would overwrite work already
      completed. }}

     # Use the qsub "no rerun" option, -nr.  From the qsub man page:
     #
     #   -nr     Specifies that the batch request cannot be rerun.
     #
     
Q: Is diamond a good conductor (of heat? of electricity?). How about
   silicon?    


[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.