ARSC T3D Users' Newsletter 41, June 23, 1995

The 2D FFT on the ARSC T3D

Srikanth Thirumalai of the Mathematical Software Group at Cray Research sends in the following note about CRI library routines for two-dimensional Fast Fourier Transforms on the T3D:


  > This is just to inform you that Craylibs 1.2.1 has 2
  > routines PCCFFT2D and PCCFFT3D for 2D and 3D parallel
  > complex-to-complex FFTs. I have included the performance
  > numbers for PCCFFT2D for your convenience.
  > 
  > Craylibs 1.2.2 will have similar routines for
  > real-to-complex and complex-to-real FFTs. They have been
  > called PSC(CS)FFT2D(3D).
  > 
  > Performance of PCCFFT2D in Mflops.
  > 
  > Side of
  > square  <---------------number of PEs---------------------->
  >  array
  >          1     2      4      8     16     32      64     128
  >
  >   32  30.5  39.7   60.0   75.9   90.4   90.6    89.1    88.3
  >   64  41.8  59.9  106.1  163.9  237.5  302.4   278.7   278.2
  >  128  42.0  66.7  127.6  226.4  407.6  652.3   834.3   776.9
  >  256  42.4  71.3  138.3  259.9  483.1  870.4  1433.4  1989.6
  >  512  33.2  59.6  116.3  227.2  432.8  808.5  1438.7  2613.9
These numbers are better than those published in the last newsletter. ARSC is moving up to the 1.2.1 release of Craylibs, and ARSC users will be informed through this newsletter when ARSC upgrades.
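One way to read the table is to convert each row to a speedup and a parallel efficiency relative to the single-PE rate in that row. The sketch below does this for the 512 x 512 row. The function names are illustrative and are not part of Craylibs, and the sketch assumes the Mflops figures in a row are directly comparable across PE counts.

```c
#include <stdio.h>

/* Speedup of a p-PE run over the 1-PE run, from two Mflops rates
   measured on the same problem size. */
double speedup(double mflops_p, double mflops_1)
{
    return mflops_p / mflops_1;
}

/* Parallel efficiency: speedup divided by the number of PEs. */
double efficiency(double mflops_p, double mflops_1, int npes)
{
    return speedup(mflops_p, mflops_1) / npes;
}

/* Print speedup and efficiency for the 512 x 512 row of the table. */
void report_512(void)
{
    static const int    pes[8]  = { 1, 2, 4, 8, 16, 32, 64, 128 };
    static const double rate[8] = { 33.2, 59.6, 116.3, 227.2,
                                    432.8, 808.5, 1438.7, 2613.9 };
    int i;

    for (i = 0; i < 8; i++)
        printf("%4d PEs: speedup %6.1f  efficiency %5.1f%%\n", pes[i],
               speedup(rate[i], rate[0]),
               100.0 * efficiency(rate[i], rate[0], pes[i]));
}
```

For the 512 x 512 array on 128 PEs this gives a speedup of about 79 and an efficiency of about 62%, whereas the 32 x 32 array gains less than a factor of 3 on 128 PEs (88.3 / 30.5).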

Small Changes to the ARSC T3D Batch Queues

On June 16th we made some small adjustments to the ARSC T3D batch queues. The T3D queues are now described on Denali with the command:

  news t3dbatch
which produces:

  The description of the ARSC T3D NQS queues is:

  New T3D Batch Queues
  ====================
  The T3D batch queues were changed on June 16th 1995. The
  current T3D queues are:

  Always on:

    m_8pe_24h     2 jobs using at most   8 PEs for 24 hours
    m_16pe_24h    2 jobs using at most  16 PEs for 24 hours
    m_32pe_24h    1 job  using at most  32 PEs for 24 hours
    m_64pe_10m    1 job  using at most  64 PEs for 10 minutes
    m_64pe_24h    1 job  using at most  64 PEs for 24 hours
    m_128pe_5m    1 job  using at most 128 PEs for  5 minutes

  There is one additional queue that is enabled on Friday at
  6PM and disabled at 4AM on Sunday:

    m_128pe_8h    1 job using at most 128 PEs for  8 hours

  A request made to these queues will be run as soon as enough
  PEs are available to satisfy the request.

  User's UDBSEE limits

  Most T3D users currently have a limit of 128 PEs for batch
  access. Users can check their limits with the udbsee command:

    udbsee | grep jpelimit

  The output will indicate their limits in interactive (i) and
  batch (b). For example:

    jpelimit[b]     :128:
    jpelimit[i]     :8:

  If your batch PE limit is too small to access these new NQS
  queues and you would like to use them, please contact Mike
  Ess, either by phone at 907-474-5404 or e-mail to ess@arsc.edu,
  to have your PE batch limits increased.

  Users can query the NQS batch system with the command:

    qstat -a

  to see what other NQS T3D jobs are scheduled to run on the T3D.
  The utility mppmon is available to see what jobs are currently
  running on the T3D. T3D jobs are executed on a "first fit"
  priority and run to completion without interruption.
                                  Mike Ess,  June 19, 1995
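As a reading aid for the queue table above, the sketch below encodes the seven queues and returns the first one, in table order, whose PE and time limits admit a request. The struct, the function name pick_queue and the weekend flag are hypothetical; the sketch says nothing about how NQS itself routes or schedules requests.

```c
#include <stddef.h>

/* One row of the queue table: name, PE limit, time limit in minutes,
   and whether the queue is enabled only on the weekend. */
struct t3d_queue {
    const char *name;
    int max_pes;
    int max_minutes;
    int weekend_only;
};

static const struct t3d_queue queues[] = {
    { "m_8pe_24h",    8, 24 * 60, 0 },
    { "m_16pe_24h",  16, 24 * 60, 0 },
    { "m_32pe_24h",  32, 24 * 60, 0 },
    { "m_64pe_10m",  64,      10, 0 },
    { "m_64pe_24h",  64, 24 * 60, 0 },
    { "m_128pe_5m", 128,       5, 0 },
    { "m_128pe_8h", 128,  8 * 60, 1 },
};

/* Return the first queue in table order whose limits admit the request,
   or NULL if none does. */
const char *pick_queue(int pes, int minutes, int weekend)
{
    size_t i;

    for (i = 0; i < sizeof queues / sizeof queues[0]; i++) {
        if (queues[i].weekend_only && !weekend)
            continue;
        if (pes <= queues[i].max_pes && minutes <= queues[i].max_minutes)
            return queues[i].name;
    }
    return NULL;
}
```

For example, a 64 PE request for one hour maps to m_64pe_24h, while a 128 PE request for one hour fits only m_128pe_8h and therefore only while that weekend queue is enabled.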

Big Jobs in the ARSC T3D NQS Queues

At ARSC it is possible to use all 128 PEs either through the two 128 PE batch queues or for some users during interactive sessions. When submitting such jobs it is a good idea to use the mppmon command to see who is using the machine before you submit your 128 PE job. The two ideas to keep in mind are:
  1. Your job won't run until the machine becomes idle. Even a single PE job will make your 128 PE job wait until the machine is idle.
  2. All jobs submitted after your 128 PE job will wait until your 128 PE job completes. The combination of a long running job (any number of PEs) followed by a 128 PE job (for any amount of time) will effectively close down the T3D until both the long running job AND the 128 PE job finish.
So let's be careful and considerate when submitting 128 PE jobs: the machine is most likely to go idle at night and on the weekends. CRI has plans to implement a job rollin/rollout capability that could help this situation, but that feature isn't available yet.

The EPCC/CRI Version of MPI on the ARSC T3D

The three files necessary for running MPI programs on the ARSC T3D have been put in their default locations:

  mpi.h and mpif.h         go to       /usr/include/mpp
  libmpi.a                 goes to     /mpp/lib
Now (when the environment variable TARGET is set to cray-t3d) MPI C programs can be compiled as:

  cc -c shifter.c
  cc shifter.o -lmpi
and similarly for MPI Fortran programs:

  cf77 -c -I/usr/include/mpp shifterF.f
  cf77 shifterF.o -lmpi
If there are any problems with MPI on the ARSC T3D please contact Mike Ess.

Sorting on the T3D

A user asked: What type of sorting functions are available on the T3D? On the Y-MP, there are several sorting functions: ISORTD, SSORTB, ISORTB and ORDERS. But looking over the decks that are in the MPP Craylibs, I could only find qsort in /mpp/lib/libc.a. qsort requires the user to supply a function that compares the elements of the array to be sorted so I doubt that it can be very fast sorting on a specific instance, like an array of integers.

There can be a big speed difference between sorting functions. To start an investigation, I typed in several functions from "Numerical Recipes in C" by Press, Flannery, Teukolsky & Vetterling and timed them on a single PE of the T3D. The functions in this book are translations from the book "Numerical Recipes in Fortran" and assume that the elements to be sorted are in the array a[1:n]. I changed the functions to sort the array when the elements are in a[0:n-1], which seems more natural in C. Each function takes as inputs the length of an array of integers and the starting address of the array, and overwrites the input array with the sorted array.
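For comparison, the library qsort mentioned in the question can be wrapped to give the same calling interface as the functions described above. This is a minimal sketch in standard C; the names compare_ints and sort_ints are illustrative, and it has not been timed on the T3D. The comparison function is called through a pointer for every pair of elements compared, which is the overhead the question is concerned about.

```c
#include <stdlib.h>

/* Comparison function in the form qsort expects: negative, zero, or
   positive as *pa is less than, equal to, or greater than *pb. */
static int compare_ints(const void *pa, const void *pb)
{
    int a = *(const int *) pa;
    int b = *(const int *) pb;

    return (a > b) - (a < b);    /* avoids the overflow risk of a - b */
}

/* Sort arr[0:n-1] in ascending order with the C library qsort. */
void sort_ints(int n, int arr[])
{
    if (n > 1)
        qsort(arr, (size_t) n, sizeof arr[0], compare_ints);
}
```

A call such as sort_ints( n, b ) could then be timed alongside the other functions in the same way.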

The timing of sorting functions is very sensitive to how "orderly" the input array is, but for input arrays generated with the random number generator RANF (newsletter #37) the sorting times below are typical:


  A timing comparison of sorting functions on the T3D (seconds)

  Number of    insertion    shell      heap     Munstock's
  elements       sort       sort       sort        sort
      1         0.000004   0.000010   0.000005   0.000004
      2         0.000005   0.000012   0.000006   0.000008
      3         0.000006   0.000013   0.000006   0.000008
      4         0.000006   0.000017   0.000007   0.000011
      5         0.000007   0.000018   0.000008   0.000013
     10         0.000009   0.000024   0.000012   0.000021
     20         0.000017   0.000043   0.000022   0.000045
     30         0.000026   0.000072   0.000032   0.000079
     40         0.000041   0.000099   0.000044   0.000123
     50         0.000055   0.000133   0.000056   0.000160
    100         0.000194   0.000327   0.000122   0.000364
    200         0.000645   0.000846   0.000270   0.000873
    300         0.001543   0.001358   0.000423   0.001687
    400         0.002762   0.001795   0.000595   0.002505
    500         0.004387   0.002507   0.000759   0.003231
   1000         0.016640   0.005644   0.001750   0.008732
   2000         0.075755   0.014613   0.004072   0.026065
   3000         0.194306   0.024549   0.007052   0.047113
   4000         0.389964   0.034641   0.010076   0.070686
   5000         0.611296   0.045975   0.013453   0.095750
  10000         2.698027   0.106615   0.032286   0.224263
  20000        11.086379   0.242284   0.077115   0.535524
  30000        25.146869   0.382456   0.126676   0.892756
  40000        44.656612   0.560791   0.180124   1.307390
  50000        69.894340   0.719300   0.235710   1.781510
The table shows that making the right algorithm choice can make a big difference. Below is the source code for the timer/tester and the source for each of the sorting functions. If any user has more results in this area I would be happy to put them in this newsletter.
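A quick way to see the difference in growth is to look at how much the time increases when the number of elements doubles. The sketch below does this for the insertion and shell sort columns using rows copied from the table; the function names are illustrative. A ratio near 4 per doubling suggests quadratic growth, and a ratio a little above 2 suggests growth close to linear.

```c
#include <stdio.h>

#define NROWS 5

/* Rows of the timing table for n = 5000 ... 40000 (seconds). */
static const int    n_elems[NROWS]  = { 5000, 10000, 20000, 30000, 40000 };
static const double t_insert[NROWS] = { 0.611296, 2.698027, 11.086379,
                                        25.146869, 44.656612 };
static const double t_shell[NROWS]  = { 0.045975, 0.106615, 0.242284,
                                        0.382456, 0.560791 };

/* Return t(2*n0)/t(n0), or -1.0 if the table has no row for n0 or
   no row for 2*n0. */
double doubling_ratio(const int n[], const double t[], int rows, int n0)
{
    int i, lo = -1, hi = -1;

    for (i = 0; i < rows; i++) {
        if (n[i] == n0)     lo = i;
        if (n[i] == 2 * n0) hi = i;
    }
    if (lo < 0 || hi < 0)
        return -1.0;
    return t[hi] / t[lo];
}

/* Print the doubling ratios for every row that has a partner at 2n. */
void report_doubling(void)
{
    int i;

    for (i = 0; i < NROWS; i++) {
        double ri = doubling_ratio(n_elems, t_insert, NROWS, n_elems[i]);
        double rs = doubling_ratio(n_elems, t_shell,  NROWS, n_elems[i]);
        if (ri > 0.0)
            printf("%6d -> %6d: insertion x%.2f  shell x%.2f\n",
                   n_elems[i], 2 * n_elems[i], ri, rs);
    }
}
```

On these rows the insertion sort time grows by a factor of about 4.0 to 4.4 per doubling, while the shell sort time grows by a factor of about 2.3.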

Often in sorting it is not useful to have the sorted array overwrite the unsorted input array. It is more useful to have an index array in which entry i holds the position in the input array of the element that belongs at position i of the sorted order. With this index, the sorted array could be printed out as:


  for( i = 0; i < n; i++ ) printf( " %d %d\n", i, a[ index[ i ] ] );
The "Numerical Recipes in C" book shows how to modify the insertion, shell and heap sort functions to return this index that sorts the input array. The function "munstock" is this kind of sorting function: it produces an index to sort the array. (I got this function from Jim Munstock of CRI more than ten years ago in CDC assembler. The translation to C is straightforward and it's a useful utility to keep around.) The source for the timer/tester and the four sorting functions follows:

  #include <stdio.h>
  #include <stdlib.h>
  #define MAXLENGTH 100000
  #define MAXCASE   31
  main()
  {
    int a[ MAXLENGTH ], b[ MAXLENGTH ], c[ MAXLENGTH ], d[ MAXLENGTH ];
    int e[ MAXLENGTH ], indx[ MAXLENGTH ];
    int i, j, n;
    fortran double RANF();
    void insert(), heap(), shell(), munstock();
    double t1, t2, t3, t4, t5, second();
    int ncase,kcase[MAXCASE] = {0,1,2,3,4,5,10,20,30,40,50,100,200,300,400,500,1000,2000,3000,4000,5000,10000,20000,30000,40000,50000,100000,200000,300000,400000,500000};
  
    /* Only the first 27 cases (n = 0 ... 100000) fit in MAXLENGTH. */
    for( ncase = 0; ncase < 27; ncase++ ) {
      n = kcase[ ncase ];
      for( i = 0; i < n; i++ ) {         /* same random input for every sort */
        a[ i ] = n * RANF();
        b[ i ] = a[ i ];
        c[ i ] = a[ i ];
        d[ i ] = a[ i ];
        e[ i ] = a[ i ];
      }
      t1 = second( );
      insert( n, b );
      t2 = second( );
      shell( n, c );
      t3 = second( );
      heap( n, d );
      t4 = second( );
      munstock( n, e, indx );
      t5 = second( );
      if( n > 0 ) {
        for( j = 0; j < n-1; j++ ) {
          if( b[ j ] > b[ j+1 ] ) {
            printf( " failure at insert sort %d %d %d\n", j, a[ j ], b[ j ] );
            for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, a[ i ], b[ i ] );
            exit( -1 );
          }
        }
        for( j = 0; j < n; j++ ) {
          if( c[ j ] != b[ j ] ) {
            printf( " failure at shell sort  %d %d %d\n", j, b[ j ], c[ j ] );
            for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, a[ i ], c[ i ] );
            exit( -1 );
          }
        }
        for( j = 0; j < n; j++ ) {
          if( d[ j ] != b[ j ] ) {
            printf( " failure at heap sort %d %d %d\n", j, b[ j ], d[ j ] );
            for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, a[ i ], d[ i ] );
            exit( -1 );
          }
        }
        for( j = 0; j < n; j++ ) {
          if( e[ indx[ j ] ] != b[ j ] ) {
            printf( " failure at index sort %d %d %d\n",j,indx[j],e[indx[j]]);
            for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, b[i],e[indx[i]]);
            exit( -1 );
          }
        }
      }
      printf( " %2d %6d %12.6f %10.6f %10.6f %10.6f\n",ncase,n,t2-t1,t3-t2,t4-t3,t5-t4);
    }
  }
  void insert( n, arr ) 
  /* sorts the array arr[0:n-1] in ascending order with insertion sort 
       insertion sorting from "Numerical Recipes in C", Press,Flannery,Teukolsky &
       Vetterling, modified for "C" arrays by Mike Ess, 1995 */
  int n;
  int arr[];
  {
    int i, j, a;
  
    if( n < 1 ) {
       printf( "insert sort called with elements to sort less than 1\n" );  
    } else {
      for( j = 1; j < n; j++ ) {           /* pick out each element */
        a = arr[ j ];
        i = j - 1;
        while( i >= 0 && arr[ i ] > a ) {  /* look for place to insert it */
          arr[ i + 1 ] = arr[ i ];
          i--;
        }
        arr[ i + 1 ] = a;                  /* insert it */
      }
    }
  }
  #include <math.h>
  #define ALN2I 1.442695022
  #define TINY 1.0e-5
  void shell( n, arr )
  /* sorts the array arr[0:n-1] in ascending order with shell sort 
       shell sorting from "Numerical Recipes in C", Press,Flannery,Teukolsky &
       Vetterling, modified for "C" arrays by Mike Ess, 1995 */
  int n;
  int arr[];
  {
    int nn, m, j, i, lognb2;
    int t;
  
    if( n < 1 ) {
      printf( "shell  sort called with elements to sort less than 1\n" );
    } else {
      lognb2 = (log((double) n) * ALN2I + TINY);
      m = n;
      for( nn = 1; nn <= lognb2; nn++ ) {   /* Loop over partial sorts */
        m >>= 1;
        for( j = m; j < n ; j++ ) {         /* Outer loop of straight insertion */
          i = j-m;
          t = arr[ j ];
          while( i >= 0 && arr[ i ] > t ) { /* Inner loop of straight insertion */
            arr[ i + m ] = arr[ i ];
            i -= m;
          }
          arr[ i + m ] = t;
        }
      }
    }
  }
  void heap( n, ra )
  /* sorts the array ra[0:n-1] in ascending order with heap sort 
       heap sorting from "Numerical Recipes in C", Press,Flannery,Teukolsky &
       Vetterling, modified for "C" arrays by Mike Ess, 1995 */
  int n;
  int ra[];
  {
    int l, j, ir, i;
    int rra;
  
    if( n < 1 ) {
      printf( "heap   sort called with elements to sort less than 1\n" ); 
    } else {
      l = n / 2 + 1;
      ir = n;
  /* The index l will be decremented from its initial value down to 1 during the 
     "hiring" (heap creation) phase. Once it reaches 1, the index ir will be 
      decremented from its initial value down to 1 during the "retirement-and-
      promotion" (heap selection) phase */
      for( ;; ) {
        if( l > 1 ) {                   /* Still in hiring phase */
          l--;
          rra = ra[ l-1 ];
        } else {                        /* In retirement-and-promotion phase.  */
          rra = ra[ ir-1 ];             /* Clear a space at the end of array.  */
          ra[ ir-1 ] = ra[ 0 ];         /* Retire the top of the heap into it. */
          ir--;                         /* Done with the last promotion.       */
          if( ir == 0 ) {               /* The least competent worker of all!  */
            ra[ 1-1 ] = rra;
            return;
          }
        }
        i = l;                          /* Whether we are in the hiring phase  */
        j = l + l;                      /* or promotion phase, we here set up to*/
        while( j <= ir ) {              /* shift down element rra to its proper */
          if( j < ir ) {                /* level.                               */
            if( ra[ j-1 ] < ra[ j ] ) j++;   /* Compare to the better underling */
          }
          if( rra < ra[ j-1 ] ) {       /* demote rra */
            ra[ i-1 ] = ra[ j-1 ]; 
            i = j;
            j = j + j;
          } else {
            j = ir + 1;                 /* This is rra's level. Set j to term- */
          }                             /* inate the sift-down                 */
        }
        ra[ i-1 ] = rra;                /* Put rra into its slot.              */
      }
    }
  }
  void munstock( length, a, ind )
  /* Sort the array "a" to produce an index "ind" of the sorted array. From
     Jim Munstock, translated to C by Mike Ess 1989 */
  int length;
  int a[ ];
  int ind[ ];
  {
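    int i, ii, ij, j, m, m1, n2;
    int t;
  
    for ( i = 0 ; i < length; i++ ) ind[ i ] = i;   /* start with the identity index */
    m = 1;
    n2 = length / 2;
    m = 2 * m;
    while ( m <= n2 ) m = 2 * m;
    m = m - 1;                /* first gap: 2**k - 1, later gaps are m / 2 */
   three:;                    /* one shell-sort pass over the index with gap m */
    m1 = m + 1;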
    for ( j = m1-1 ; j < length; j++ ) {
      t = a[ ind[ j ] ];
      ij = ind[ j ];
      i = j - m;
      ii = ind[ i ];
   four:;
      if ( t < a[ ii ] ) {
        ind[ i+m ] = ii;
        i = i - m;
        if ( i >= 0 ) {
          ii = ind[ i ];
          goto four;
        }
      }
      ind[ i+m ] = ij;
    }
    m = m / 2;
    if ( m > 0 ) goto three;
    return;
  }
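
An index like the one munstock produces is most useful when several arrays must be reordered together, since the same index can be applied to each of them. The sketch below is self-contained: index_insertion_sort is an illustrative stand-in for munstock with the same argument order, and gather is a hypothetical helper.

```c
/* Fill ind[0:n-1] so that a[ind[0]] <= a[ind[1]] <= ... without
   moving any element of a (insertion sort on the index). */
void index_insertion_sort(int n, const int a[], int ind[])
{
    int i, j, k;

    for (i = 0; i < n; i++) ind[i] = i;
    for (j = 1; j < n; j++) {
        k = ind[j];
        i = j - 1;
        while (i >= 0 && a[ind[i]] > a[k]) {
            ind[i + 1] = ind[i];
            i--;
        }
        ind[i + 1] = k;
    }
}

/* Copy src into dst in index order: dst[i] = src[ind[i]]. */
void gather(int n, const int src[], const int ind[], int dst[])
{
    int i;

    for (i = 0; i < n; i++) dst[i] = src[ind[i]];
}
```

With keys { 30, 10, 20 } and a companion array { 100, 101, 102 }, the index is { 1, 2, 0 }; gathering the keys gives { 10, 20, 30 } and gathering the companion array gives { 101, 102, 100 }, while both input arrays are left unchanged.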

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
  12. RANF() and its manpage differ between machines (Newsletter #37)
  13. CRAY2IEG is available only on the Y-MP (Newsletter #40)
  14. Missing sort routines on the T3D (Newsletter #41)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.