ARSC T3D Users' Newsletter 12, November 11, 1994

What Exact Versions Are We Running Now?

Last week I showed the verbose output from the "what" command to see exactly what is in a T3D library in the directory /mpp/lib. A more succinct report comes from:


  what /mpp/lib/*.a | grep version

Currently on denali, this produces:

  libcomm_version 1.0.0.0 06/08/94 09:41:28
  libf_version    8.2.0.0 06/08/94 09:43:07
  libfi_version   8.2.0.0 06/08/94 10:00:14
  libm_version    8.1.0.1 06/15/94 16:43:37
  libpvm3_version 1.1.0.0 06/08/94 09:04:01
  libsci_version  1.1.0.0 06/08/94 09:09:10
  libsma_version  1.1.0.0 06/08/94 09:39:24
  libu_version    8.2.0.0 06/08/94 10:14:54
The version numbers listed above identify what is in the current CrayLib_M. The new version of CrayLib_M, scheduled for December 4th, produces:

  libcomm_version 1.0.0.8 10/12/94 08:39:18
  libf_version    8.2.0.9 10/12/94 08:43:15
  libfi_version   8.2.0.9 10/14/94 10:37:51
  libm_version    8.1.0.2 09/08/94 18:20:35
  libpvm3_version 1.1.0.7 09/09/94 10:52:22
  libsci_version  1.1.0.9 10/12/94 07:25:20
  libsma_version  1.1.0.12        10/12/94 08:33:06
  libu_version    8.2.0.9 10/12/94 10:09:48
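
Since the loader copies these identification strings into an executable along with the routines it links in, the same one-liner, pointed at a program instead of the libraries, should show which library versions that program was actually built with (a.out here is just a placeholder name):

  what a.out | grep version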

Timing SHMEM Calls and Compiler Optimization

Another exercise from last week's T3D class showed some interesting effects of compiler optimization choices on shmem_get speeds. There is a very comprehensive man page on denali for the shmem_get and shmem_put calls. (Just try "man shmem".) For the shmem_get call, the user specifies a local target array and a length, and the routine fills the target with data from an array on another PE.
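
For the record, the general form of the Fortran call is (the array and variable names here are just placeholders):

      call shmem_get( target, source, nwords, pe )

This copies nwords 64-bit words of the array source, as it exists on processor pe, into the local array target.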

Below is a simple program for timing the speed of shmem_get calls for messages of increasing size. It times shmem_get calls of 6 sizes: 8, 80, 800, 8000, 80000, and 800000 bytes. From these times and sizes we can compute the speed of the shmem_get call as a function of message size. The program does all of its timing on PE0 and times shmem_get calls with the source data both on and off PE0. (A shmem_get from local memory would be better done with a simple copy.)


      program sh
      parameter( NMAX = 100000 )
      real a( NMAX ), b( NMAX ), c( NMAX )
      INTRINSIC MY_PE
      integer DONE
CDIR$ SHARED DONE
      DONE = 0
      do 10 i = 1, NMAX
        a( i ) = i
   10 continue
c     synchronize so that a (and the DONE flag) is set on every PE
c     before PE0 starts its timings
      call barrier( )
      me = MY_PE()
      n = 1 
      if( me .eq. 0 ) then
      do 100 i = 1, 6
        t1 = second( )
c       get n words of a from PE 0 (a local transfer) into b
        call shmem_get( b, a, n, 0 )
        t2 = second( )
c       get n words of a from PE 1 (a remote transfer) into c
        call shmem_get( c, a, n, 1 )
        t3 = second( )
c       message size in megabytes (8 bytes per word)
        rmb = 8*n/1000000.0
        write(6,600)i,n,t2-t1,t3-t2,rmb/(t2-t1),rmb/(t3-t2)
        n = n * 10
  100 continue
      DONE = 1
      else
    1 continue
      if( DONE .eq. 0 ) goto 1
      endif
  600 format( i4, i7, f10.6, f10.6, f10.2, f10.2 ) 
c     only PE0 received data, so only PE0 checks the transfers
      if( me .eq. 0 ) then
        do 200 i = 1, NMAX
          if( i .ne. b( i ) ) then
            write( 6, 601 ) i, a( i ), b( i )
            stop
          endif
          if( i .ne. c( i ) ) then
            write( 6, 602 ) i, a( i ), c( i )
            stop
          endif
  200   continue
      endif
  601 format( "b is bad ", i10, f10.1, f10.1 )
  602 format( "c is bad ", i10, f10.1, f10.1 )
      end

      real function second( )
c     irtc( ) returns ticks of the free-running real-time clock; the
c     T3D clock runs at 150 MHz, so this converts to wallclock seconds
      second = float( irtc( ) ) / 150000000.0
      end
The typical output for running this program is:

  /mpp/bin/cf77 -Wf"  " -c shmem.f
  mppldr shmem.o
  a.out
  1      1  0.000016  0.000009      0.50      0.91
  2     10  0.000011  0.000011      7.13      7.49
  3    100  0.000017  0.000029     47.92     27.27
  4   1000  0.000073  0.000211    109.51     37.93
  5  10000  0.000692  0.002137    115.66     37.44
  6 100000  0.006769  0.020570    118.19     38.89
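As a check on the rate columns, the last line transfers 100000 64-bit words, or 0.8 MB, and the local get took 0.006769 seconds, so the rate is

  0.8 MB / 0.006769 s = 118.19 MB/s

which is the value in the fifth column. The sixth column is the same calculation for the off-PE transfer.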
The major problem with this result is that the program hangs. As the program was written, all PEs other than PE0 just spin, waiting for the DONE variable to change from 0 to 1. At the default level of compiler optimization, the compiler doesn't recognize that the variable DONE is SHARED and might be changed by another PE; in effect, DONE is read once and the spin loop never re-reads it, so the loop has no way of being stopped. Compiling with optimization off produces this output:

  /mpp/bin/cf77 -Wf" -o off " -c shmem.f
  mppldr shmem.o
  a.out
  1      1  0.000019  0.000009      0.43      0.88
  2     10  0.000012  0.000011      6.66      7.08
  3    100  0.000020  0.000031     40.40     25.67
  4   1000  0.000092  0.000223     86.58     35.82
  5  10000  0.000860  0.002107     93.00     37.96
  6 100000  0.008549  0.021146     93.57     37.83
This time the program ends properly, but the shmem_get performance has taken a nasty hit on the local PE transfer. It has been suggested that with the CDIR$ SUPPRESS directive the compiler will not optimize away the correct behavior but will retain its speed elsewhere in the code. I changed the class exercise to:

       
      .
      .
      .
      else
    1 continue
CDIR$ SUPPRESS DONE
      if( DONE .eq. 0 ) goto 1
      endif
      .
      .
      .
The program now runs correctly (i.e., it doesn't hang and it transfers the data correctly), but the performance is another surprise:

  /mpp/bin/cf77 -Wf" " -c shmemq.f
  mppldr shmemq.o
  a.out
  1      1  0.000020  0.000011      0.39      0.74
  2     10  0.000015  0.000014      5.42      5.81
  3    100  0.000027  0.000046     29.99     17.28
  4   1000  0.000159  0.000375     50.24     21.30
  5  10000  0.001154  0.002900     69.33     27.58
  6 100000  0.011401  0.028644     70.17     27.93
This time even the transfers off PE0 took a performance hit, which is counterintuitive. The CDIR$ SUPPRESS directive worked, but it delivers lower performance than simply compiling with optimization off.

In the general case on the T3D, I think it is safest to develop code with optimization off, and then turn optimization on once the program behaves correctly. If the program doesn't work correctly with optimization turned on, the sensitive sections can be isolated in separate compilation units and compiled without optimization, as sketched below. In this one case, CDIR$ SUPPRESS didn't seem to be very useful.
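
For example, if the spin-wait code were split out into its own file (the file name spin.f here is just made up for illustration), the build might look something like this:

  /mpp/bin/cf77 -Wf" -o off " -c spin.f
  /mpp/bin/cf77 -Wf"  " -c shmem.f
  mppldr shmem.o spin.o
  a.out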

Timers on the T3D

This subject seems more complicated than it needs to be, and I'll be putting out more detailed information in the future. For now, here is a summary of the timers available on the T3D:

  timer         Wallclock  Fortran  T3D or  Granularity      Resolution
                or CPU     or C     Y-MP    T3D      Y-MP    T3D      Y-MP

  irtc          wallclock  Fortran  both    ~.15um   ~.2um
  rtc           wallclock  Fortran  both    ~.30um   ~.3um
  tsecnd        CPU        Fortran  both                     10000um  3um
  gettimeofday  wallclock  C        both    ~2500um  ~30um
  second        CPU        Fortran  Y-MP                     1um      5um
notes:
  1. Anything in Fortran or C is callable from the other language but with some additional overhead
  2. um = microseconds (.000001 of a second)
  3. Without multiprogramming on the T3D, wallclock time is the same as CPU time
  4. Resolution is from the Livermore Loops benchmark
  5. Granularity is measured with the program in Newsletter #11
  6. There are man pages on each of these functions
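
As a small illustration of the Fortran wallclock and CPU timers from the table, here is a minimal sketch; it assumes the 150 MHz T3D clock rate used by the second() function above, and the program and variable names are just placeholders:

      program tmr
      real tcpu0, tcpu1, wall
      integer iwall0, iwall1
c     irtc( ) returns ticks of the free-running real-time clock;
c     on the 150 MHz T3D one tick is 1/150000000 of a second
      iwall0 = irtc( )
c     tsecnd( ) returns the CPU seconds used so far
      tcpu0  = tsecnd( )
c     ... the section of code to be timed goes here ...
      tcpu1  = tsecnd( )
      iwall1 = irtc( )
      wall   = float( iwall1 - iwall0 ) / 150000000.0
      write( 6, * ) 'wallclock seconds: ', wall
      write( 6, * ) 'CPU seconds:       ', tcpu1 - tcpu0
      end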

Reminders

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.

Upcoming Upgrade on ARSC T3D Software

ARSC will be upgrading the T3D software to CrayLib_M 1.1.1.2, MAX 1.1.0.4 and SCC_M 4.0.2.11 on December 4th.

We are asking users to submit their codes to be part of our regression test suite. We are looking for relatively small, short-running, self-contained programs in source form that are currently running correctly. The advantage of submitting your code is that it will be one of the first programs run on a new T3D release, and if it doesn't work we won't install the new T3D software. Otherwise, your program may fail only after the new T3D software has become the default.

PE Limits

At ARSC, most production runs can be handled by the 32 PE NQS queue. The use of large partitions of 64 or 128 PEs seems restricted to research problems of the "what if..." type. Basically, development is done on smaller partitions, and the larger partitions are then used to complete a table of performance results or problem sizes.

We have been accommodating this situation by individually changing a user's interactive permissions, with the understanding that users would share their results with us. We are in the process of notifying users of a change in this policy. In the future we will reset everyone back to the default settings and then make it easy for users to request changes that will be in effect for a short time period (perhaps a week per request).

Next Week's Newsletter

Next week I will be at Supercomputing '94, so there will be no newsletter. I'd like to meet the members of the ARSC T3D users group in person, so if you stop by the ARSC booth we can talk or arrange a meeting. I also have poster #23 in the poster session; you can catch me there as well.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.