ARSC T3D Users' Newsletter 35, May 12, 1995

Correction of Tuesday Test Times

Last week I explained that T3D jobs do not survive downtimes on the Y-MP. Virginia Bedford, the ARSC Systems Manager, corrects my information: on Tuesdays when the machine is taken down for testing, it is shut down at 6:00 PM; the 5:30 message is only a warning.

NAG for the T3D

Last week from HPCwire, I got this notice:

  > Subject: 5448 Cray Research/NAG in Agreement for PVM & Fortran 77 Libraries May 1
  > 
  > CONTRACTS/VENTURES                                                 HPCwire
  > =============================================================================
  > 
  >   Eagan, Minn. -- A five-year agreement has been inked between Cray Research,
  > Inc. and the Numerical Algorithms Group, Ltd. (NAG) of Oxford, UK, to bundle
  > NAG library software with Cray's CRAY T3D Massively Parallel Processing (MPP)
  > supercomputers and successor systems.
  > 
  >   As part of the agreement, all new and existing CRAY T3D system customers
  > will be given a free, one year license for both NAG's new parallelized PVM
  > library and its widely used Fortran 77 library. The agreement covers
  > successor MPP systems from Cray as well as new versions of the NAG software
  > until the year 2000.
  > 
  >   Through this agreement, customers who use existing NAG library-based
  > applications on other computer systems can more easily transfer them to the
  > CRAY T3D and future Cray Research MPP systems. NAG and Cray officials expect
  > development and porting efforts for new and existing MPP applications to be
  > accelerated with the use of these libraries on the CRAY T3D and successor
  > Cray MPP systems.
  > 
  >   "The new PVM Library is the ideal vehicle for CRAY T3D system users to
  > drive the parallel architecture hard in computationally demanding
  > applications," noted Brian Ford, director of NAG. "The agreement strengthens
  > our relationship with Cray Research and will help keep NAG at the forefront
  > of solutions for high-performance computing."
  > 
  >   According to Irene Qualters, senior vice president of software for Cray
  > Research and acting head of Supercomputing Systems Division, "Cray Research
  > provides the most powerful and efficient parallel application development
  > environment in the industry. We actively foster the development of parallel
  > libraries which we consider to be essential tools in expanding the usefulness
  > of parallel systems for customers' applications. This agreement with NAG will
  > enhance our parallel environment and the value of our MPP systems."
  > 
  >   The NAG Fortran 77 library features over 1,100 user-callable routines
  > covering all the important areas of mathematical and statistical computation.
  > The latest version is at Mark 16 and the agreement assures users receive
  > regular upgrades to future releases.
  > 
  >   The NAG PVM library has been specifically produced for distributed memory
  > parallel machines and workstation clusters. It offers considerably greater
  > speed of execution over conventional sequential numerical software. The PVM
  > library makes use of the Basic Linear Algebra Communications Subprograms
  > (BLACS) and PVM for message passing and includes areas such as optimization,
  > dense linear algebra (including ScaLAPACK), sparse linear algebra, random
  > number generators, and quadrature. The library is aimed at users with large
  > problems that make efficient use of the increased processing power and memory
  > capacity of multiple processor systems. The math implemented in this library
  > is used for high-performance computing in Grand Challenge research, as well
  > as by NAG's customers in the aerospace, financial, petroleum and utilities
  > industries.
  > 
  >   For more information, contact Mardi Larson of Cray Research at 612/683-3538
  > or Terry Burgess of NAG at 011 44 1 865 511245.
  > 
  > Copyright 1995 HPCwire.
I have contacted CRI and am waiting for details that should be out soon.

Tridiagonal Solvers on the T3D

In newsletter #29 (3/31/95), I announced the availability of benchlib on the ARSC T3D. The sources for these libraries are available on the ARSC ftp server in the file:

  pub/submissions/libbnch.tar.Z
The compiled libraries are also available on Denali in:

  /usr/local/examples/mpp/lib/lib_32.a
  /usr/local/examples/mpp/lib/lib_scalar.a
  /usr/local/examples/mpp/lib/lib_util.a
  /usr/local/examples/mpp/lib/lib_random.a  
  /usr/local/examples/mpp/lib/lib_tri.a 
  /usr/local/examples/mpp/lib/lib_vect.a
and the sources are available in:

  /usr/local/examples/mpp/src.
In previous newsletters, I've described the contents of some of the libraries:

  #30 (4/7/95)  - the "pref" routine of lib_util.a
  #33 (4/28/95) - the fast scalar math routines in lib_scalar.a
  #34 (5/05/95) - the fast vector math routines in lib_vect.a
In this newsletter, I will describe the routines in lib_tri.a and compare them to the tridiagonal solvers in LINPACK and LAPACK that run on the T3D.

Background

Many applications require the solution to a tridiagonal system of linear equations. Rather than program this task yourself, it might be more efficient to use a library routine available on the T3D. A system of linear equations that is "tridiagonal" looks something like the following matrix equation:

               A                    *   x   =   b
where with matrix and vectors expanded:

  d1   u1    0   0   0   0   0   0      x1      b1
  l2   d2   u2   0   0   0   0   0      x2      b2
   0   l3   d3  u3   0   0   0   0      x3      b3
   0    0   l4  d4  u4   0   0   0  *   x4  =   b4
   0    0    0  l5  d5  u5   0   0      x5      b5
   0    0    0   0  l6  d6  u6   0      x6      b6
   0    0    0   0   0  l7  d7  u7      x7      b7
   0    0    0   0   0   0  l8  d8      x8      b8
The task for such problems is to find values for the vector x that solve the matrix equation. Because of the large number of zeros and the regular pattern of nonzero elements in the matrix A, special algorithms can be used to reduce the storage and the number of floating point operations required to solve this equation. In this case, both the storage requirements and the number of floating point operations are proportional to the number of unknowns. In contrast, using the full dense LINPACK solver for the same matrix equation requires storage proportional to the square of the number of unknowns and floating point operations proportional to the cube of the number of unknowns.

It is the usual convention that:


  vector b (b1,b2,b3,...) is called the right hand side
  vector x (x1,x2,x3,...) is called the solution vector
  vector l (l2,l3,l4,...) is called the subdiagonal
  vector d (d1,d2,d3,...) is called the main diagonal
  vector u (u1,u2,u3,...) is called the supradiagonal.
This matrix is called symmetric if l[i + 1] = u[i] for all values of i. For a problem where the values of l, d and u may be arbitrary, the algorithm for solving the matrix equation should have a pivoting strategy to ensure as numerically accurate a solution as possible. But in many important problems the values of l, d and u are such that no pivoting is required for a numerically accurate solution. One such important case is when the matrix is symmetric and the values of the diagonal "dominate" the off-diagonal elements. For tridiagonal matrices this is when:

  abs(d[i]) > abs(u[i - 1]) + abs(l[i + 1])
and

  abs(d[i]) > abs(u[i]) + abs(l[i])
Sometimes, when tridiagonal solvers are implemented in Fortran, we have a two-dimensional array A declared as:

  REAL A( N, 3 )
where each diagonal of the matrix corresponds to a column of the array:

  A( 2:N, 1 )   is the subdiagonal     l
  A( 1:N, 2 )   is the main diagonal   d
  A( 1:N-1, 3 ) is the supradiagonal   u

The Routines Available

There are two routines for tridiagonal systems in LINPACK, SGTSL and SPTSL. SGTSL solves the general case with pivoting, and SPTSL handles the special case where the matrix is symmetric positive definite and so requires no pivoting. Typical calls to these routines look like:

  SGTSL( N, A(1,1), A(1,2), A(1,3), RHS, INFO )
and

  SPTSL( N, A(1,1), A(1,2), RHS )
where RHS holds the right hand side on input and the solution vector on output. There are man pages for both of these routines on Denali, but because LINPACK is not part of /mpp/lib/libsci.a (Newsletter #25), these routines must be obtained in source form from netlib (ftp address: netlib2.cs.utk.edu).

The SPTSL algorithm has another nice feature in that it is a "BABE" algorithm. BABE stands for "Burn At Both Ends", meaning that the Gaussian Elimination starts at both corners of the matrix and proceeds to the middle of the matrix. This reduces the overhead of loop control by a factor of two and gives the compiler or assembly language programmer more instructions to schedule.

In LAPACK, there are routines that solve the same problems as the above LINPACK routines but with more features. The main extension is that the LAPACK routines divide the algorithm into two steps:

  1. Forming an LU factorization of the matrix
  2. Using the factorization to solve for multiple right hand sides
So now a typical solution requires two calls, for the general case:

  CALL SGTTRF(N,A(1,1),A(1,2),A(1,3),D2,IPIV,INFO1)           !factor
  CALL SGTTRS('N',N,1,A(1,1),A(1,2),A(1,3),D2,IPIV,S,N,INFO2) !solve
and for the symmetric case without pivoting:

  CALL SPTTRF( N, A( 1, 2 ), A( 1, 1 ), INFO1 )               !factor
  CALL SPTTRS( N, 1, A( 1, 2 ), A( 1, 1 ), R, N, INFO2 )      !solve
Again, there are comprehensive man pages on Denali, and these routines are available in both /lib/libsci.a (Y-MP) and /mpp/lib/libsci.a (T3D).

In BENCHLIB, there are three routines for solving the nonsymmetric case that requires no pivoting (for example, the diagonally dominant case). In this sense the problem solved by the BENCHLIB routines lies between the general and symmetric cases described above. (Of course, the symmetric case is the special case of this middle problem where the subdiagonal and supradiagonal are the same.)

The three versions are:


  trisol_f.f      a Fortran version 
  trisol_s.s      an assembly language version
  trisol.divs.s   an assembly language version that uses a
                  faster but less accurate divide algorithm
I was unable to get the last two versions to produce correct answers, so I abandoned them and concentrated on trisol_f.f. A typical call looks like:

  trisol( B, RHS, N, 1. )
where B holds the diagonals in the transpose of the layout used by the LINPACK and LAPACK routines. That is, we have declared storage for the three diagonals as:

  REAL B( 3, N )
This is a really interesting idea. To maintain good cache performance, the three values needed at one step of the Gaussian elimination are either in the same cache line or in adjacent cache lines. In contrast, the storage arrangement in the LINPACK and LAPACK routines keeps the three diagonals at the same offset from three different memory locations. That arrangement opens three streams to memory, and when the BABE algorithm is implemented as in SPTSL, six streams to memory.

I have modified the library in /usr/local/examples/mpp/lib/lib_tri.a to call this Fortran version and I have examples that work for me. (The 1. in the calling sequence seems to be an artifact from debug runs that the developer left in the library routine.) The readme file supplied with the source is in: /usr/local/examples/mpp/lib/src/trisol.

Timing Results

Each of the 5 routines solves a slightly different problem, but to generate some timing comparisons I chose a problem that is not strictly "diagonally dominant". The problem I chose was:

  2   1   0   0   0   0   0   0     x1     3
  1   2   1   0   0   0   0   0     x2     4
  0   1   2   1   0   0   0   0     x3     4
  0   0   1   2   1   0   0   0  *  x4  =  4
  0   0   0   1   2   1   0   0     x5     4
  0   0   0   0   1   2   1   0     x6     4
  0   0   0   0   0   1   2   1     x7     4
  0   0   0   0   0   0   1   2     x8     3
The solution vector should be all ones, and all 5 routines solved a range of problem sizes without noticeable differences in errors. Here is a table giving the times for SPTSL, with the times of the other 4 routines expressed as ratios to the SPTSL time.

  Table of comparison of times for tridiagonal solver routines on the
  T3D for one right-hand side and a diagonally dominant matrix

                     <-------------Ratio of Times------------------>

  Order of Time for  SGTSL/ (SPTTRF+SPTTRS) (SGTTRF+SGTTRS) TRISOLF/
  Tridiag.   SPTSL   SPTSL      /SPTSL          /SPTSL       SPTSL
  System   (seconds)

      1    0.000005   1.07       4.76            5.84         0.17
      2    0.000007   1.25       2.83            4.73         1.03
      3    0.000008   1.21       2.57            4.21         0.88
      4    0.000009   1.35       2.53            4.26         1.15
      5    0.000012   1.16       2.04            3.28         0.85
      6    0.000012   1.25       2.16            3.38         0.99
      7    0.000013   1.29       2.02            3.18         0.94
      8    0.000014   1.22       1.97            3.03         0.91
      9    0.000016   1.25       5.24            2.92         0.87
     10    0.000018   1.25       1.74            2.63         0.85
     20    0.000028   1.32       1.51            2.28         0.82
     30    0.000041   1.31       1.34            2.07         0.76
     40    0.000052   1.34       1.29            1.96         0.84
     50    0.000064   1.36       1.23            1.91         0.87
     60    0.000075   1.37       1.24            1.89         0.87
     70    0.000087   1.35       1.18            1.85         0.88
     80    0.000100   1.35       1.16            1.84         0.83
     90    0.000110   1.37       1.17            1.88         0.88
    100    0.000124   1.35       1.16            1.82         0.84
    200    0.000244   1.34       1.09            1.77         0.81
    300    0.000368   1.35       1.09            1.78         0.79
    400    0.000495   1.36       1.08            1.78         0.79
    500    0.000636   1.33       1.06            1.73         0.79
    600    0.000761   1.35       1.08            1.73         0.82
    700    0.000906   1.34       1.07            1.69         0.82
    800    0.001030   1.36       1.08            1.69         0.82
    900    0.001181   1.34       1.06            1.71         0.81
   1000    0.001382   1.28       1.01            1.56         0.76
   2000    0.002694   1.37       1.03            1.59         0.82
   4000    0.005470   1.35       1.03            1.57         0.81
   8000    0.011042   1.34       1.02            1.55         0.81
  10000    0.014046   1.32       1.00            1.52         0.82
From this table we can make a few observations:
  1. The LAPACK routines have much more functionality than the LINPACK routines, but the price of that functionality is speed.
  2. The TRISOL Fortran routine of BENCHLIB implements the same algorithm as the SPTSL routine of LINPACK, but it runs faster because its data structure for the matrix improves cache performance.
  3. The overhead of a pivoting strategy seems to be about 30% to 50% over the nonpivoting routines.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives: Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.