ARSC T3D Users' Newsletter 29, March 31, 1995

BENCHLIB for the T3D

At the Denver CUG, Jeff Brooks gave a two-hour presentation on Single PE Optimization on the CRAY T3D that generated a lot of interest in BENCHLIB. BENCHLIB is a collection of very fast, unsupported routines for the T3D. Jeff has let ARSC distribute the BENCHLIB source through the ARSC ftp server. Using anonymous login, you can find the BENCHLIB source, along with PostScript files of Jeff's paper and slides from the Denver CUG. The ftp site is ftp.arsc.edu and the files are in the directory pub/submissions:


  -rw-r--r--   ftp    other  1634893 Mar 24 13:33 cug_slides.ps.Z
  -rw-r--r--   ftp    other   144153 Mar 24 13:17 libbnch.tar.Z
  -rw-r--r--   ftp    other    69077 Mar 24 13:33 t3d_opt.ps.Z
I'm interested in what's in there, and I'll be trying the routines over the next few weeks. This newsletter gives a short overview; future newsletters will include some examples.

BENCHLIB consists of six libraries of optimized routines:


  lib_32.a
  lib_scalar.a
  lib_util.a
  lib_random.a
  lib_tri.a
  lib_vect.a
These libraries are available on Denali in the directory:

  /usr/local/examples/mpp/lib
The sources are available in:

  /usr/local/examples/mpp/src
but remember there is a COPYRIGHT notice that goes with this library:

  >                      COPYRIGHT (C)1995
  > 
  > The UNICOS operating system is derived from the  AT&T  UNIX
  > System  V  operating system. UNICOS is also based in part on
  > the Fourth Berkeley Software Distribution under license from
  > the The Regents of the University of California.
  > 
  > UNIX is a trademark of AT&T.
  > 
  > This software is provided "AS IS", without warranty  of  any
  > kind,  either  expressed  or  implied,  including,  but  not
  > limited to, the implied  warranties  of  merchantability  or
  > fitness for a particular purpose. Cray Research, Inc. incurs
  > no obligation to support this software.
The operative words here are "AS IS". These are NOT supported products from CRI (or ARSC). If they work for you, well and good; if you have problems, I suggest you report them to me but NOT wait for a solution.

The README in the src directory is:


  > T3D libraries for scalar and "vector" intrinsics and for
  > utilities.
  > 
  > lib32:  Contains some 32-bit intrinsic functions.
  > 
  > random:  Contains fast random number utilities for T3D
  > 
  > scalar_fastmath:  Contains faster versions of 64-bit intrinsic
  >                   functions. This library can simply be linked
  >                   with existing .o files to pick these up.
  > 
  > trisol:  Fast tri-diagonal solver
  > 
  > util:  Some assembler utilities. Prefetch, setting the annex,
  >        etc.
  > 
  > vect_fastmath:  Very fast versions of intrinsics that operate
  >                 on a vector of operands.
  > 
Below is the description provided with each library. Where no description is available, we can get a feel for what is in the library by listing the files in its directory:

  lib_32.a (no README file)

   6341 Feb 13 14:07 hexp_replace.s
   4485 Feb 13 14:07 hlog_replace.s
   3518 Feb 13 14:07 hsqrt_replace.s
    526 Mar 29 09:45 makefile
   7329 Feb 13 14:07 sqrt32iv.s
   8904 Feb 13 14:07 sqrtv32.s
   6521 Feb 13 14:07 table_alog.f
   2382 Feb 13 14:07 table_exp.f
   3932 Feb 13 14:07 table_sqrt.f
  12857 Feb 13 14:07 table_sqrt32.f
  12884 Feb 13 14:07 table_sqrt32i.f

  lib_util.a (no README file)
This library contains the "pref" call described in Jeff's paper; pref prefetches operands into cache before they are used.

   263 Mar 29 09:47 makefile
   358 Feb 13 14:07 mem_quiet.s
   816 Feb 13 14:07 mpp_annex.s
  1259 Feb 13 14:07 pref.s

  lib_random.a

   341 Mar 29 09:46 makefile
   676 Feb 13 14:07 ranfadv.f
  2551 Feb 13 14:07 rantom.man
  3856 Feb 13 14:07 rantom.s
  2847 Feb 13 14:07 rantomf.f

  from rantom.man:

  > Fast 64bit Linear Congruential Random number generator for T3D.
  > 
  > T. Hewitt
  > CRI.    Dec 27 1993.
  > 
  > I have built a package for providing FAST random numbers for the T3D.
  > The package contains the following entry points:
  >   rantom()      : return a single pseudo-random number range (0.,1.)
  >   rantget(iseed): Obtain the current seed
  >   rantset(iseed): Reset the seed
  >   rantadv(n)    : Advance the seed by N random numbers.
  >                   NOTE:      dummy=rantom() is equivalent to
  >                              call rantadv(1)
  >   rantomn(n,array): Fill an array with Pseudo random numbers in the
  >                     range (0 to 1).
  > 
  > 
  > I have two files that contain the source:
  >   rantomf.f - FORTRAN source for T3D
  >   rantom.s  - T3D assembler for rantom() and rantomn(n,array)
  >               Note you need rantomf.f
  >               If you load rantom.s before rantomf you will get the
  >               FAST assembler versions.
  >               You don't need rantom.s, ALL its entry points are contained
  >               in rantomf.f, it merely has FASTER implementations of them.
  > 
  > Also available is:
  >   rantom.ymp.f - Fortran source for Y-MP/C90 architecture.
  >                  This contains the same entry points as described above.
  >                  Results are identical to T3D version, except that the
  >                  low order bits that you would get on T3D are truncated.
  >                  IEEE floating format has 6 more bits of precision.
  > 
  > Why you should use RANTOM instead of RANF.
  >   (1) Speed   - rantom() is faster 75   clocks (vrs) ??? for ranf()
  >                 rantomn can fill an array at one result per 25 clocks.
  >   (2) Period  - the period of rantom() is 2**62 versus 2**46 for ranf().
  >                 It would take a 1024PE 20years to cycle through ALL the
  >                 numbers available to rantomn!
  >   (3) Randomness - the low order bits from ranf are not random (but instead
  >                 are always 0). All mantissa bits from rantomn are "random".
  >                   Note you will never get identically 1.0 or 0.
  >                 The max output value is 1.0 -1ulp.
  >                 Approx 1/2**53 results will be approx 1.e-60.
  > 
  > To use on multiple PEs do:
  > 
  >   determine K such that:
  >   K is larger than the max number of rand numbers you will use on any PE.
  >   Avoid K being near a (large) power of 2.
  >   init by
  >   call rantset(my_pe*k)
  > 
  >   NOTE n$cpus*k < 2**62    or same rand numbers may be seen an different PEs.
  > 
  >   Tom Hewitt
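
The entry points described above (draw, get/set the seed, advance by N, and per-PE offsets) can be illustrated with a small Python sketch of a 64-bit linear congruential generator. This is not the RANTOM code: the multiplier and increment are assumed (Knuth's MMIX constants), the modulus is taken as 2**64, and the class and function names are hypothetical. The point is only to show why advancing by n can be made equivalent to n draws, and how an offset of my_pe*k gives each PE its own stretch of the sequence.

```python
MASK = (1 << 64) - 1

# Assumed constants (Knuth's MMIX LCG), NOT the ones used by RANTOM;
# the real values are in rantomf.f.
A = 6364136223846793005
C = 1442695040888963407


class Lcg64:
    """Toy 64-bit LCG: state <- (A*state + C) mod 2**64."""

    def __init__(self, seed=0):
        self.state = seed & MASK

    def get(self):
        """Return the current seed (compare rantget)."""
        return self.state

    def set(self, seed):
        """Reset the seed (compare rantset)."""
        self.state = seed & MASK

    def next(self):
        """Step once and return a float in [0, 1) built from the top 53 bits."""
        self.state = (A * self.state + C) & MASK
        return (self.state >> 11) / float(1 << 53)

    def advance(self, n):
        """Jump ahead n steps in O(log n) time (compare rantadv).

        One step is the affine map s -> A*s + C. Squaring that map
        repeatedly and combining the pieces selected by the bits of n
        gives the n-step map without generating the numbers in between.
        """
        acc_mult, acc_add = 1, 0      # identity map
        cur_mult, cur_add = A, C      # map for 2**j steps, j = 0 initially
        while n > 0:
            if n & 1:
                acc_mult = (acc_mult * cur_mult) & MASK
                acc_add = (acc_add * cur_mult + cur_add) & MASK
            cur_add = ((cur_mult + 1) * cur_add) & MASK
            cur_mult = (cur_mult * cur_mult) & MASK
            n >>= 1
        self.state = (acc_mult * self.state + acc_add) & MASK


def make_pe_stream(my_pe, k, seed=0):
    """Hypothetical helper: give PE number my_pe a generator that starts
    my_pe*k steps into the common sequence, so PEs do not overlap as long
    as each one draws fewer than k numbers."""
    gen = Lcg64(seed)
    gen.advance(my_pe * k)
    return gen
```

With this layout, PE 0 reaches the starting point of PE 1 only after drawing k numbers, which is why k must exceed the largest number of draws on any PE.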

  lib_tri.a

   743 Feb 13 14:07 README
   379 Mar 29 09:46 makefile
  1354 Feb 13 14:07 tester.f
  8595 Feb 13 14:07 trisol.divs.s
  8513 Mar 24 13:05 trisol.s
  1655 Feb 13 14:07 trisol_f.f
  8513 Feb 13 14:07 trisol_s.s

  from README:

  > Single PE Tridiagonal system solver on T3D.
  > 
  > I have written a routine, which uses the Burn On Both Ends methods to
  > solve a single (scalar) tridiagonal system on the T3D.
  > Three versions are available:
  >   (1)   trisol.f          - All Fortran implementation   12.7 Mflops
  >   (2)   trisol.s          - Assembler of above           18.8 Mflops
  >   (3)   trisol.divs.s     - trisol.s by divt=>divs       31.3 Mflops
  >    The third version is less accurate, as it uses 32bit IEEE divides
  > instead of 64bit IEEE divides. As Tri-diagonal systems are nearly always
  > used as part of iterative methods, this version should be adequate for
  > most uses.
  > 
  > The codes are available on dione or ferrari in the directory
  > ~hewitt/mpp_funcs
  > 
  > Tom Hewitt
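
For readers unfamiliar with the method named in the README, here is a minimal Python sketch of a "burn at both ends" tridiagonal solve. It is an illustration of the general technique, not a transcription of trisol: the function name and argument layout are hypothetical, there is no pivoting, and it assumes a well-conditioned (for example, diagonally dominant) system. Elimination runs downward from the first row and upward from the last row, the two sweeps meet at a 2x2 system in the middle, and back-substitution then runs outward in both directions.

```python
def trisol_babe(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], i = 0..n-1.

    a = sub-diagonal (a[0] unused), b = diagonal,
    c = super-diagonal (c[n-1] unused), d = right-hand side.
    Returns the solution as a new list; the inputs are not modified.
    """
    n = len(b)
    b = list(b)
    d = list(d)
    if n == 1:
        return [d[0] / b[0]]

    m = n // 2  # rows 0..m-1 are swept from the top, rows m..n-1 from the bottom

    # Top sweep: eliminate the sub-diagonal in rows 1..m-1.
    for i in range(1, m):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]

    # Bottom sweep: eliminate the super-diagonal in rows n-2 down to m.
    for i in range(n - 2, m - 1, -1):
        w = c[i] / b[i + 1]
        b[i] -= w * a[i + 1]
        d[i] -= w * d[i + 1]

    # Rows m-1 and m now couple only x[m-1] and x[m]: solve the 2x2 system.
    det = b[m - 1] * b[m] - c[m - 1] * a[m]
    x = [0.0] * n
    x[m - 1] = (d[m - 1] * b[m] - c[m - 1] * d[m]) / det
    x[m] = (b[m - 1] * d[m] - a[m] * d[m - 1]) / det

    # Back-substitute upward from the middle...
    for i in range(m - 2, -1, -1):
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i]
    # ...and downward from the middle.
    for i in range(m + 1, n):
        x[i] = (d[i] - a[i] * x[i - 1]) / b[i]
    return x
```

The two sweeps do not depend on each other, so an implementation can interleave them; that is presumably what makes the method attractive on a single pipelined PE, although the README does not say so.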


  lib_scalar.a (no README file)

   4652 Feb 13 14:07 alog_s.s
   4992 Feb 13 14:07 cos_s.s
   4918 Feb 13 14:07 coss_s.s
    202 Feb 13 14:07 croot_table.f
   1261 Feb 13 14:07 cuberoot.f
   6166 Feb 13 14:07 exp_s.s
   2422 Feb 13 14:07 exp_table.f
   6561 Feb 13 14:07 log_table.f
    721 Mar 29 09:46 makefile
  12097 Feb 13 14:07 rtor_s.s
    129 Feb 13 14:07 rtorss.f
   4951 Feb 13 14:07 sin_s.s
   4566 Feb 13 14:07 sincos.table.f
   3751 Feb 13 14:07 sqrt_s.s
   3864 Feb 13 14:07 sqrti.s
   3972 Feb 13 14:07 table_sqrt.f
   1310 Feb 13 14:07 tester.f

  lib_vect.a

    588 Feb 13 14:07 COPYRIGHT
    847 Mar 24 08:43 README
   2704 Feb 13 14:07 aln2.s
   9716 Feb 13 14:07 aln2_v.s
   9977 Feb 13 14:07 alog_v.s
   9809 Feb 13 14:07 atan_v.s
  15019 Feb 13 14:07 cos_v.s
  14128 Feb 13 14:07 coss_v.s
  13447 Feb 13 14:07 exp_v.s
    801 Mar 29 09:47 makefile
   7093 Feb 13 14:07 oneover_v.s
    460 Feb 13 14:07 rtor_v.f
  15762 Feb 13 14:07 sin_v.s
   8369 Feb 13 14:07 sqrt_v.s
   9348 Feb 13 14:07 sqrti_v.s
   6733 Feb 13 14:07 table_2x.f
   6674 Feb 13 14:07 table_atan.f
   6478 Feb 13 14:07 table_inv64.f
  14355 Feb 13 14:07 table_ln2.f
   4535 Feb 13 14:07 table_sincos.f
  13078 Feb 13 14:07 table_sqrt.f
  13080 Feb 13 14:07 table_sqrti.f
   8130 Feb 13 14:07 tester.f
   2744 Feb 13 14:07 twotox.s
   8622 Feb 13 14:07 twotox_v.s
   3766 Feb 13 14:07 vscale.s
    946 Feb 13 14:07 vset.s
   3274 Feb 13 14:07 zcopy.s

  from the README file:

  > Contents:
  > 
  > 
  > aln2.s          Log base2 (scalar)
  > aln2_v.s        Log base2 (vector)
  > alog_v.s        natural log (vector)
  > atan_v.s        Tangent (vector)
  > cos_v.s         Cosine (vector)
  > coss_v.s        Coss (vector)
  > exp_v.s         Exp (vector)
  > oneover_v.s     Computes 1/x(i) for a vector x(i)
  > sin_v.s         Sin (vector)
  > sqrt_v.s        Sqrt (vector)
  > sqrti_v.s       Computes 1/sqrt(x(i)) (vector)
  > twotox.s        Computes 2 **x (scalar)
  > twotox_v.s      Computes 2** x(i) (vector)
  > vscale.s        Scales vector
  > vset.s          Sets vector
  > zcopy.s         Copies vector
  > 
  > All vector routines have calling syntax:
  > 
  > Call routine (number, input, output)
  > 
  > where number is the size of the vector, input is the input array and
  > output is the output array.  The input array can be the same as the
  > output array.
  > 
  > The exception to this is 
  > 
  > rtor_v.f
  > 
  > call rtor_v(n,x,p,y)
  > 
  >         n = number
  >         x = x array
  >         p = array of powers
  >         y = output array
  > 
  > computes y(i) = x(i)**p(i)
  > 
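
The calling convention in the README can be mimicked with plain Python stand-ins, which may be handy as a slow reference when checking results from the fast routines. These are not the BENCHLIB routines: they only borrow the names and the (number, input, output) argument order from the README, and they use Python lists in place of Fortran arrays.

```python
import math


def sqrti_v(n, x, y):
    """Reference for sqrti_v: y(i) = 1/sqrt(x(i)) for the first n elements."""
    for i in range(n):
        y[i] = 1.0 / math.sqrt(x[i])


def oneover_v(n, x, y):
    """Reference for oneover_v: y(i) = 1/x(i) for the first n elements."""
    for i in range(n):
        y[i] = 1.0 / x[i]


def rtor_v(n, x, p, y):
    """Reference for rtor_v: y(i) = x(i)**p(i) for the first n elements."""
    for i in range(n):
        y[i] = x[i] ** p[i]


# The input array can also serve as the output array, as the README allows.
v = [4.0, 16.0, 64.0]
sqrti_v(3, v, v)      # v now holds 1/sqrt of its old contents
```

Because each element is read before it is overwritten, passing the same list as input and output is safe in these stand-ins.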
Because the source is available, these routines come with the ultimate in documentation, and the source is where I would look for additional information.

ARSC T3D Future Upgrades

We are testing the upgrade to the T3D 1.2 Programming Environment (libraries, tools, and compilers). The new compilers and libraries are available for users to try out. On Denali, the command:

  news compilers | more
will provide the names and paths of these new versions. This upgrade includes the new Fortran 90 and C++ compilers for the T3D. If you have any problems or find any differences, please contact Mike Ess.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
I encourage users to e-mail me any differences they find, so that we can all benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.