ARSC T3D Users' Newsletter 36, May 17, 1995

New T3D Batch Queues

The T3D batch queues were changed on May 16, 1995. The current T3D queues are:

Always on:

   16pe_24h   1 job using at most  16 PEs for 24 hours
   32pe_24h   1 job using at most  32 PEs for 24 hours
   64pe_24h   1 job using at most  64 PEs for 24 hours
   64pe_10m   1 job using at most  64 PEs for 10 minutes
  128pe_5m    1 job using at most 128 PEs for  5 minutes
There is one additional queue that is enabled on Friday at 6:00 PM and disabled at 4:00 AM on Sunday:

  128pe_8h    1 job using at most 128 PEs for  8 hours
A request made to these queues will be run as soon as enough PEs are available to satisfy the request. The intent of this change is to provide more production access to users who are moving from development work to production runs. We will be closely monitoring how this new queue structure is working and we may need to modify it in the future. Please contact Mike Ess if you have any concerns about the batch queues.

Users' UDBSEE Limits

Most T3D users currently have a limit of 32 PEs for batch access. Users can check their limits with the udbsee command:

  udbsee | grep jpelimit
The output shows the limits for interactive (i) and batch (b) use. For example:

  jpelimit[b]     :32:
  jpelimit[i]     :8:
If your batch PE limit is too small to access these new NQS queues and you would like to use them, please contact Mike Ess, either by phone at 907-474-5404 or by email to ess@arsc.edu, to have your batch PE limit increased.

Users can query the NQS batch system with the command:

  qstat -a
to see which NQS T3D jobs are scheduled to run. The utility mppmon shows which jobs are currently running on the T3D. T3D jobs are scheduled on a "first fit" basis and run to completion without interruption.

A Barrier Routine with a Fixed Delay

In developing code for the T3D it often happens that not all PEs reach a barrier point. When this occurs, the program appears hung: the PEs that have reached the barrier spin-wait, while the PE(s) that haven't are holding everyone up.

One of our users, Dr. Alan Wallcraft, a scientist with the Naval Research Laboratory in Stennis, Mississippi, and I needed a barrier function that waits at most a specified number of seconds. With this function, if the time delay is exceeded, the T3D job is aborted and the user can tell which PEs reached the barrier and which one(s) were holding up the show.

Below are two implementations of this "barrier with delay". The first uses PVM and the SET_BARRIER/TEST_BARRIER functions; this version can be used with Fortran 90. The second version is implemented in Craft Fortran. A driver program and the complete source files are listed below. (These routines are also available in /usr/local/examples/mpp/src as barrier1.f and barrier2.f.)


      program test
      real a( 100000 )
      intrinsic irtc
      include '/usr/include/mpp/fpvm3.h'
c
      call pvmfmytid(itid)
      call pvmfgetpe(itid, mype)
c
      irtc0 = irtc()
c
c loop on idelay.
c
      do idelay= 5,1,-1
c
c generate some unequal size tasks
c
         do i = 1, 5000 * (mype+1)
            a(i) = sqrt( real(i+idelay)**3)
         enddo
         do k= 1,35
            do i = 2, 5000 * (mype+1)
               a(i) = a(i-1) + a(i) + sqrt( real(i)**3 )
            enddo
         enddo
c
c call a barrier that aborts if any processor waits more than idelay seconds
c
         call debug_barrier(idelay)
         if     (mype.eq.0) then
           write(6,*) 'idelay=',idelay,' ok at ',
     +                (irtc()-irtc0)/150000000.0 ,' sec'
           call flush(6)
         endif
c
c a test that is always .false., to prevent optimizing away a(:)
c
         if     (a(5000*(mype+1)).eq.-999.9) then
           write(6,*) a(1),a(99),a(5000*(mype+1))
         endif
      enddo
      end
      SUBROUTINE DEBUG_BARRIER(IDELAY)
      IMPLICIT NONE
      INTEGER IDELAY
C
C     A VERSION OF BARRIER THAT ABORTS AFTER IDELAY SECONDS.
C
      INTRINSIC IRTC
      INTEGER   IRTC
      LOGICAL   TEST_BARRIER
C
      INTEGER   ITICK,NTICK, ITID,MYPE
C
      INCLUDE '/usr/include/mpp/fpvm3.h'
C
      CALL SET_BARRIER()
      IF     (TEST_BARRIER()) THEN
        RETURN
      ENDIF
C
      ITICK = IRTC()
      NTICK = ITICK + IDELAY*150000000
C
      DO WHILE (ITICK.LE.NTICK)
        IF     (TEST_BARRIER()) THEN
          RETURN
        ELSE
          ITICK = IRTC()
        ENDIF
      ENDDO
C
C     ONLY GET HERE AFTER IDELAY SECONDS.
C
      CALL PVMFMYTID(ITID)
      CALL PVMFGETPE(ITID, MYPE)
      WRITE(0,*) 'ERROR  -  DEBUG_BARRIER(',IDELAY,
     +           ') TIMED OUT ON PE ',MYPE
      CALL FLUSH(0)
      CALL ABORT()
      STOP
C     END OF DEBUG_BARRIER.
      END
A version using Craft Fortran:

      program test
      real a( 100000 )
      intrinsic irtc
      intrinsic my_pe
c
c loop on idelay.
c
      call mybarriersetup
c
      mype  = my_pe()
      irtc0 = irtc()
c
      do idelay= 5,1,-1
c
c generate some unequal size tasks
c
         do i = 1, 5000 * (mype+1)
            a(i) = sqrt( real(i+idelay)**3)
         enddo
         do k= 1,35
            do i = 2, 5000 * (mype+1)
               a(i) = a(i-1) + a(i) + sqrt( real(i)**3 )
            enddo
         enddo
c
c call a barrier that aborts if any processor waits more than idelay seconds
c
         delay = idelay
         call mybarrier(delay)
         if     (mype.eq.0) then
           write(6,*) 'idelay=',idelay,' ok at ',
     +                (irtc()-irtc0)/150000000.0 ,' sec'
           call flush(6)
         endif
         if     (a(5000*(mype+1)).eq.-999.9) then
           write(6,*) a(1),a(99),a(5000*(mype+1))
         endif
      enddo
      end
      subroutine mybarrier(delay)
c
c this subroutine is a replacement for the standard barrier() routine.
c if any processor waits at a barrier more than delay seconds, a call
c to abort is made and all PEs dump core.
c
      integer flags( 0:127 )
      common /mine/ flags
CDIR$ shared flags(:block)
      intrinsic my_pe
      mype = my_pe()
      flags( mype ) = flags( mype ) + 1
      t1 = real( irtc( ) ) / 150000000.0
  10  continue
         et = real( irtc( ) ) / 150000000.0 - t1
         if( et .gt. delay ) then
            write(0,*) 'error  -  mybarrier(',delay,
     +                 ') timed out on pe ',mype
            do i = 0, N$PES - 1
               if( flags( i ) .lt. flags( mype ) ) then
                  write(0,*) 'pe ',i,' not at the barrier'
               endif
            enddo 
            call flush(0)
            call abort()
         endif
         do i = 0, N$PES - 1
            if( flags( i ) .lt. flags( mype ) ) then  ! barrier
               goto 10
            endif
         enddo 
      end
      subroutine mybarriersetup()
c
c an initialization routine for the status flags
c
      integer flags( 0:127 )
      common /mine/ flags
CDIR$ shared flags(:block)
      intrinsic my_pe
      flags( my_pe() ) = 0
      call barrier()                     ! make sure we're in sync
      end
With the 1.2 PE version of totalview, a user can invoke these test programs as:

  totalview a.out
When executed from within totalview, the progress of each PE is shown at the time of the abort initiated by the PE that has waited at the barrier for more than "delay" seconds.

If you know of similar techniques on the T3D please send them to me and I'll pass them on to the readers of the ARSC T3D newsletter.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.