ARSC T3D Users' Newsletter 40, June 16, 1995

The 2D FFT on the ARSC T3D

Chris Yerkes (yerkes@arsc.edu) of the UAF Electrical Engineering Department is using the ARSC T3D to implement an application that needs a two-dimensional FFT. In newsletter #38 (06/02/95) we reviewed the timings for the one-dimensional complex-to-complex library routine CCFFT, which is the computational kernel of his application. The next important piece is the transpose of the matrix; the kernel of that code is shown below:


  /* p_trans_matrix.c - module to perform matrix transpose on a  */
  /*                      complex distributed 2-dimensional array. */
  /*                      Total array is of dimension p*mdivp by   */
  /*                      p*ndivp (where p is the number of nodes  */
  /*                      upon which the array is distributed).    */
  /*                      Upon entry each node level subarray is   */
  /*                      of dimension p*mdivp by ndivp. Upon      */
  /*                      exit each node level subarray is of      */
  /*                      dimension p*ndivp by mdivp. Uses a       */
  /*                      variation of Eklundh's matrix                */
  /*                      transposition algorithm [1].               */
  /*                                                               */
  /*        [1] J.O. Eklundh,"A Fast Computer Method for Matrix    */
  /*            Transposing," IEEE Trans. on Computers, C-21,      */
  /*            July 1972, pp. 801-803.                               */
  /*                                                             */
  /* Note: Designed for Cray T3D MPP using shared memory message */
  /*         passing (shmem) construct.                               */
  /*                                                             */
  /* parameters:  ndivp  - row dimension of 2-dimensional array  */
  /*                       divided by number of nodes            */
  /*              mdivp  - column dimension of 2-dimensional     */
  /*                       array divided by number of nodes      */
  /*              s_ptr  - first element in distributed subarrays*/
  /*              workspace -  workspace array of size           */
  /*                             mdivp*ndivp                       */
  /*                                                             */
  /* Chris Yerkes - NCCOSC RDTE DIV, UAF School of Engineering,  */
  /* and Arctic Region Supercomputing Center. 5/95               */
  /* e-mail: yerkes@nosc.mil or yerkes@arsc.edu                   */
  
  #include <stdio.h>
  #include "p_2dfft.h"
  #include "const.h"
  #include "t3d_var.h"
  
  void p_trans_matrix(long *ndivp,
                      long *mdivp,
                      struct cimag *s_ptr,
                      struct cimag *workspace)
  {
    long i,j,k,l,m,l_step,m_step;
    int chunk_size;
    long recieve_node,my_piece,other_piece;
    long *sbar;
    sbar = &sbarb[0];
    
    l_step = (*mdivp);
    m_step = (*ndivp);
    chunk_size = (sizeof(struct cimag)/8)*((*ndivp)*(*mdivp));
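    /* Pairwise exchange: on pass i this PE pairs with PE i^me, fetches */
    /* its own block of that PE's subarray with shmem_get, and scatters */
    /* the transpose of that block into the local slot belonging to the */
    /* remote PE.                                                       */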
    for (i=1 ; i< numproc ; i++) {
          recieve_node = i^me;
          my_piece = (recieve_node % numproc)*chunk_size; 
          other_piece = ((i^recieve_node) % numproc)*chunk_size; 
          shmem_barrier(0,0,numproc,sbar);
  /*        shmem_put((long *)workspace,(long *)(s_ptr+my_piece),
                                  chunk_size,recieve_node);  */
            shmem_get((long *)workspace,(long *)(s_ptr+other_piece),
                                  chunk_size,recieve_node);  
          shmem_barrier(0,0,numproc,sbar); 
          l = 0;
          for (j=0 ; j< (*ndivp) ; j++ ){
              m = 0;
              for (k=0 ; k< (*mdivp) ; k++ ) {
                  (*(s_ptr+my_piece+j+m)).rimag=(*(workspace+k+l)).rimag;
                  (*(s_ptr+my_piece+j+m)).iimag=(*(workspace+k+l)).iimag;
                  m += m_step;
              }
              l += l_step;
          } 
    }
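    /* The block this PE would exchange with itself (the diagonal block) */
    /* has not moved yet; transpose it in place, using the workspace as  */
    /* a staging buffer.                                                 */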
    my_piece = me*chunk_size;
    for (j=0 ; j< (*mdivp)*(*ndivp) ; j++) {
          (*(workspace+j)).rimag=(*(s_ptr+my_piece+j)).rimag;
          (*(workspace+j)).iimag=(*(s_ptr+my_piece+j)).iimag;
    }        
    l = 0;
    for (j=0 ; j< (*ndivp) ; j++ ) {
          m = 0;
          for (k=0 ; k< (*mdivp) ; k++ ) {
                  (*(s_ptr+my_piece+j+m)).rimag=(*(workspace+k+l)).rimag;
                  (*(s_ptr+my_piece+j+m)).iimag=(*(workspace+k+l)).iimag;
                  m += m_step;
              }
          l += l_step;
    } 
  }
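
To show where this transpose fits into the overall computation, here is a minimal driver sketch of a distributed 2D FFT built from these pieces. This is not Chris Yerkes' actual driver: fft1d() is a hypothetical stand-in for the libsci CCFFT calls timed in newsletter #38, and the assumption that each PE holds whole lines of the array, stored contiguously, is ours:

  #include "p_2dfft.h"     /* struct cimag, p_trans_matrix() (assumed)       */
  #include "t3d_var.h"     /* numproc, as in p_trans_matrix.c                */

  /* Hypothetical stand-in for the libsci CCFFT calls; not a real routine.   */
  extern void fft1d(struct cimag *line, long len);

  void p_2dfft(long ndivp, long mdivp,
               struct cimag *s_ptr, struct cimag *workspace)
  {
    long i;
    long p = numproc;                          /* number of PEs              */

    /* 1-D FFTs over the ndivp lines (each of length p*mdivp) that are       */
    /* local to this PE.                                                     */
    for (i = 0; i < ndivp; i++)
        fft1d(s_ptr + i*p*mdivp, p*mdivp);

    /* Global transpose: afterwards this PE holds mdivp lines, each of       */
    /* length p*ndivp.  All interprocessor communication happens here.       */
    p_trans_matrix(&ndivp, &mdivp, s_ptr, workspace);

    /* 1-D FFTs over the other dimension, which is now local.                */
    for (i = 0; i < mdivp; i++)
        fft1d(s_ptr + i*p*ndivp, p*ndivp);

    /* A second transpose would restore the original layout if the caller    */
    /* needs it.                                                             */
  }

The point of the sketch is that all interprocessor communication is concentrated in p_trans_matrix(); the 1D FFTs themselves are purely local to each PE.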
On ARSC's 128-PE T3D, where every node has 8 MW of memory, this 2D FFT achieves the following performance on square 2D arrays (the word "memory" means that the problem was too large to fit into memory):

  Performance (MFLOPS) on Chris Yerkes' 2D FFT on ARSC's T3D
  
  Side of
   square  <--------------------number of PEs------------------------->
    array
              1       2       4       8      16      32      64     128
     
      8     5.1     6.0     6.8     6.2     7.6     4.8     2.7     1.4
     16    11.0    16.1    21.1    23.8    20.6    23.1    13.7     7.3
     32    14.4    24.0    39.2    54.8    65.8    56.8    61.3    35.3
     64    16.2    29.0    55.6    92.4   139.1   169.1   145.8   152.3
    128    19.5    37.6    71.9   128.2   228.3   334.2   412.3   352.3
    256    22.8    44.4    86.7   166.8   311.6   519.2   786.0   957.5
    512    21.2    41.8    82.7   162.5   316.0   599.2  1069.2  1620.8
   1024    18.6    36.9    73.5   145.8   288.6   565.1  1087.3  1986.9
   2048  memory    34.8    69.4   138.6   276.0   547.0  1074.8  2062.7
   4096  memory  memory  memory   134.8   269.6   536.8  1066.5  2096.1
   8192  memory  memory  memory  memory  memory   464.8   927.3  1842.3
  16384  memory  memory  memory  memory  memory  memory  memory  1981.8
  32768  memory  memory  memory  memory  memory  memory  memory  memory
 
          
  Wall clock times (seconds) on Chris Yerkes' 2D FFT on ARSC's T3D
  
  Side of
   square       <------------------------number of PEs------------------------->
    array
              1       2       4       8      16      32      64     128
  
      8   .0003   .0003   .0003   .0003   .0002   .0004   .0007   .0014
     16   .0009   .0006   .0004   .0004   .0005   .0004   .0007   .0014
     32   .0035   .0021   .0013   .0009   .0007   .0009   .0008   .0014
     64   .0151   .0084   .0044   .0026   .0017   .0015   .0017   .0016
    128   .0589   .0305   .0159   .0089   .0050   .0034   .0028   .0032
    256   .2303   .1181   .0605   .0314   .0168   .0101   .0067   .0055
    512  1.1132   .5641   .2853   .1452   .0746   .0393   .0220   .0146
   1024  5.6349  2.839   1.4270   .7191   .3632   .1855   .0964   .0528
   2048  memory 13.25    6.6448  3.3276  1.6717   .8434   .4292   .2237
   4096  memory  memory  memory 14.9312  7.4664  3.7503  1.8877   .9605
   8192  memory  memory  memory  memory  memory 18.7711  9.4081  4.7355
  16384  memory  memory  memory  memory  memory  memory  memory 18.9627
  32768  memory  memory  memory  memory  memory  memory  memory  memory
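
The MFLOPS figures appear to be based on the standard operation count for an N x N complex-to-complex 2D FFT (this is our reading of the numbers, not something stated in the code):

  flops(N) = 5 * N**2 * log2(N**2)

For example, for N = 1024 this gives 5 * 1048576 * 20 = 104,857,600 floating point operations; dividing by the 1-PE wall clock time of 5.6349 seconds gives 18.6 MFLOPS, matching the 1-PE entry in the MFLOPS table.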
 
Users interested in more details about this 2D FFT should contact Chris Yerkes at the e-mail addresses above.

Fortran Character Arrays in PVM and MPI

Last week's announcement of the EPCC/CRI MPI at ARSC was mostly taken up with the list of problems in the initial release, largely because the only ASCII file I had about the release was the list of known problems. I think being up front about known problems is a sign of good software; it means the developers value the user's effort. I encourage those interested in MPI to ask for the EPCC/CRI PostScript users' guide, which I can e-mail, and to check out the MPI information available at: http://www.mcs.anl.gov/mpii. MPI may eventually replace PVM, but it shares some of PVM's problems. One common problem is how to send a message that is a Fortran character string (presumably because Cray Fortran passes CHARACTER arguments by descriptor rather than as a simple word address). Vance Shaffer sends in this suggestion:

  >c       Thought you might be interested in this little problem and
  >c       a work around for it.  In MPI, character strings present a problem,
  >c       just like they do with PVM.  To get around the problem you can
  >c       do the same trick as with PVM - equivalence the character string
  >c       you are working with to an integer array and then pass the integer
  >c       address.
  > 
  >c       Vance
  >
  >        PROGRAM TEST1
  >        include 'mpif.h'
  >        integer errcode,rank
  >        integer status(MPI_STATUS_SIZE)
  >        equivalence (buf, iarray)
  >        integer iarray(2)
  >
  >        CHARACTER*16 buf
  >        call MPI_INIT(errcode)
  >        call MPI_COMM_RANK(MPI_COMM_WORLD,rank,errcode)
  >
  >        buf="test1"
  >        if(rank.eq.0) then
  >        call MPI_SSEND(iarray,5,MPI_CHARACTER,1,1,MPI_COMM_WORLD,
  >     1                 errcode)
  >c       call MPI_SSEND(buf,5,MPI_CHARACTER,1,1,MPI_COMM_WORLD,errcode)
  >        call MPI_RECV(iarray,5,MPI_CHARACTER,1,1,MPI_COMM_WORLD,status,
  >     *                errcode)
  >c       call MPI_RECV(buf,5,MPI_CHARACTER,1,1,MPI_COMM_WORLD,status,
  >c    *                errcode)
  >        print *, buf,rank
  >        else
  >        call MPI_RECV(iarray,5,MPI_CHARACTER,0,1,MPI_COMM_WORLD,status,
  >     *                errcode)
  >c       call MPI_RECV(buf,5,MPI_CHARACTER,0,1,MPI_COMM_WORLD,status,
  >c    *                errcode)
  >        print *, buf,rank
  >        buf="test2"
  >        call MPI_SSEND(iarray,5,MPI_CHARACTER,0,1,MPI_COMM_WORLD,
  >     1                 errcode)
  >c       call MPI_SSEND(buf,5,MPI_CHARACTER,0,1,MPI_COMM_WORLD,errcode)
  >        endif
  >
  >        call MPI_FINALIZE(errcode)
  >
  >        end

Limited Availability of CRAY2IEG

At ARSC we have several graphics applications that use the T3D as a computational resource for images that are displayed on a workstation. In these applications the Y-MP is necessary for control and file manipulation, but in general all computational tasks are pushed onto the T3D. There is a very powerful library routine on the Y-MP, CRAY2IEG, which converts between CRI and IEEE data types by bit manipulation (see the man page on Denali). These operations seem to be exactly the kind of work the T3D processors would do well, but the routine CRAY2IEG is available only on the Y-MP. This difference between the machines forces these heterogeneous applications to push some computation onto the less appropriate machine.
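
To give a flavor of the kind of bit manipulation involved, here is a rough sketch of one of the conversions, Cray-format 64-bit single precision to an IEEE 32-bit single. This is not the CRAY2IEG routine itself; it assumes 64-bit unsigned longs (as on the T3D), truncates instead of rounding, and ignores unnormalized values and the many other data types the library routine handles. It only illustrates that the work is shifts and masks:

  /* Sketch only: Cray single precision (1 sign bit, 15-bit exponent      */
  /* biased by 040000 octal, 48-bit coefficient with an explicit leading  */
  /* bit) converted to an IEEE 754 single precision bit pattern, returned */
  /* in the low 32 bits of the result.                                    */
  unsigned long cray_to_ieee_float(unsigned long c)
  {
    unsigned long sign, coeff, frac;
    long e;

    sign = (c >> 63) << 31;
    if ((c & 0x7fffffffffffffffUL) == 0)         /* zero */
        return sign;

    e     = (long)((c >> 48) & 0x7fff) - 16384;  /* unbiased Cray exponent         */
    coeff = c & 0xffffffffffffUL;                /* 48-bit coefficient, bit 47 set */

    /* A Cray value is 0.1fff... * 2^e, i.e. 1.fff... * 2^(e-1) in IEEE terms. */
    frac = (coeff >> 24) & 0x7fffffUL;           /* top 23 bits of the fraction    */

    if (e + 126 <= 0)                            /* underflow -> signed zero       */
        return sign;
    if (e + 126 >= 255)                          /* overflow  -> infinity          */
        return sign | 0x7f800000UL;

    return sign | ((unsigned long)(e + 126) << 23) | frac;
  }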

New Fortran Compiler

An upgraded version of the cf77 compiler is available on Denali at the following paths:

  /mpp/bin/cft77new and /mpp/bin/cf77new
For the default versions we have:

  /mpp/bin/cf77 -V
  Cray CF77_M   Version 6.0.4.1 (6.59)   05/25/95 13:36:39
  Cray GPP_M    Version 6.0.4.1 (6.16)   05/25/95 13:36:39
  Cray CFT77_M  Version 6.2.0.4 (227918) 05/25/95 13:36:39
and for this new version:

  /mpp/bin/cf77new -V
  Cray CF77_M   Version 6.0.4.1 (6.59)   05/25/95 13:37:26
  Cray GPP_M    Version 6.0.4.1 (6.16)   05/25/95 13:37:26
  Cray CFT77_M  Version 6.2.0.9 (259228) 05/25/95 13:37:27
This new compiler fixes a potential race condition in shared memory accesses as well as an inlining problem with the F90 intrinsics MINLOC and MAXLOC.

I have completed my testing of this compiler and it will become the default on June 20. I encourage users to try this compiler before it becomes the default.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
  12. RANF() and its manpage differ between machines (Newsletter #37)
  13. CRAY2IEG is available only on the Y-MP (Newsletter #40)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.