ARSC T3D Users' Newsletter 40, June 16, 1995
The 2D FFT on the ARSC T3D
Chris Yerkes (yerkes@arsc.edu) of the UAF Electrical Engineering Department is using the ARSC T3D to implement an application that needs a two-dimensional FFT. In Newsletter #38 (06/02/95) we reviewed the timings for the one-dimensional complex-to-complex library routine CCFFT, which is the computational kernel of his application. The next important function is the transpose of the matrix; the kernel of that code is below:
/* p_trans_matrix.c - module to perform matrix transpose on a */
/* complex distributed 2-dimensional array. */
/* Total array is of dimension p*mdivp by */
/* p*ndivp (where p is the number of nodes */
/* upon which the array is distributed). */
/* Upon entry each node level subarray is */
/* of dimension p*mdivp by ndivp. Upon */
/* exit each node level subarray is of */
/* dimension p*ndivp by mdivp. Uses a */
/* variation of Eklundh's matrix */
/* transposition algorithm [1]. */
/* */
/* [1] J.O. Eklundh,"A Fast Computer Method for Matrix */
/* Transposing," IEEE Trans. on Computers, C-21, */
/* July 1972, pp. 801-803. */
/* */
/* Note: Designed for Cray T3D MPP using shared memory message */
/* passing (shmem) construct. */
/* */
/* parameters: ndivp - row dimension of 2-dimensional array */
/* divided by number of nodes */
/* mdivp - column dimension of 2-dimensional */
/* array divided by number of nodes */
/* s_ptr - first element in distributed subarrays*/
/* workspace - workspace array of size */
/* mdivp*ndivp */
/* */
/* Chris Yerkes - NCCOSC RDTE DIV, UAF School of Engineering, */
/* and Arctic Region Supercomputing Center. 5/95 */
/* e-mail: yerkes@nosc.mil or yerkes@arsc.edu */
#include <stdio.h>
#include "p_2dfft.h"
#include "const.h"
#include "t3d_var.h"
void p_trans_matrix(long *ndivp,
                    long *mdivp,
                    struct cimag *s_ptr,
                    struct cimag *workspace)
{
    long i,j,k,l,m,l_step,m_step;
    int chunk_size;
    long recieve_node,my_piece,other_piece;
    long *sbar;

    sbar = &sbarb[0];
    l_step = (*mdivp);
    m_step = (*ndivp);
    /* size of one node-level block, counted in 8-byte words */
    chunk_size = (sizeof(struct cimag)/8)*((*ndivp)*(*mdivp));

    /* exchange one block with each of the other nodes in turn */
    for (i=1 ; i< numproc ; i++) {
        recieve_node = i^me;
        my_piece = (recieve_node % numproc)*chunk_size;
        other_piece = ((i^recieve_node) % numproc)*chunk_size;

        shmem_barrier(0,0,numproc,sbar);
/*      shmem_put((long *)workspace,(long *)(s_ptr+my_piece),
                  chunk_size,recieve_node); */
        /* fetch the partner's block into the local workspace */
        shmem_get((long *)workspace,(long *)(s_ptr+other_piece),
                  chunk_size,recieve_node);
        shmem_barrier(0,0,numproc,sbar);

        /* transpose the fetched block into place */
        l = 0;
        for (j=0 ; j< (*ndivp) ; j++ ){
            m = 0;
            for (k=0 ; k< (*mdivp) ; k++ ) {
                (*(s_ptr+my_piece+j+m)).rimag=(*(workspace+k+l)).rimag;
                (*(s_ptr+my_piece+j+m)).iimag=(*(workspace+k+l)).iimag;
                m += m_step;
            }
            l += l_step;
        }
    }

    /* the block that stays on this node: copy it out ... */
    my_piece = me*chunk_size;
    for (j=0 ; j< (*mdivp)*(*ndivp) ; j++) {
        (*(workspace+j)).rimag=(*(s_ptr+my_piece+j)).rimag;
        (*(workspace+j)).iimag=(*(s_ptr+my_piece+j)).iimag;
    }

    /* ... and transpose it back into place */
    l = 0;
    for (j=0 ; j< (*ndivp) ; j++ ) {
        m = 0;
        for (k=0 ; k< (*mdivp) ; k++ ) {
            (*(s_ptr+my_piece+j+m)).rimag=(*(workspace+k+l)).rimag;
            (*(s_ptr+my_piece+j+m)).iimag=(*(workspace+k+l)).iimag;
            m += m_step;
        }
        l += l_step;
    }
}
On the 128-PE T3D, where every node has 8 MW of memory, this 2D FFT achieves the following performance on square 2D arrays (an entry of "memory" means that the problem was too large to fit into memory):
Performance (MFLOPS) on Chris Yerkes' 2D FFT on ARSC's T3D
Side of
square  <-----------------------number of PEs----------------------->
array
            1       2       4       8      16      32      64     128
    8     5.1     6.0     6.8     6.2     7.6     4.8     2.7     1.4
   16    11.0    16.1    21.1    23.8    20.6    23.1    13.7     7.3
   32    14.4    24.0    39.2    54.8    65.8    56.8    61.3    35.3
   64    16.2    29.0    55.6    92.4   139.1   169.1   145.8   152.3
  128    19.5    37.6    71.9   128.2   228.3   334.2   412.3   352.3
  256    22.8    44.4    86.7   166.8   311.6   519.2   786.0   957.5
  512    21.2    41.8    82.7   162.5   316.0   599.2  1069.2  1620.8
 1024    18.6    36.9    73.5   145.8   288.6   565.1  1087.3  1986.9
 2048  memory    34.8    69.4   138.6   276.0   547.0  1074.8  2062.7
 4096  memory  memory  memory   134.8   269.6   536.8  1066.5  2096.1
 8192  memory  memory  memory  memory  memory   464.8   927.3  1842.3
16384  memory  memory  memory  memory  memory  memory  memory  1981.8
32768  memory  memory  memory  memory  memory  memory  memory  memory
Wall clock times (seconds) on Chris Yerkes' 2D FFT on ARSC's T3D
Side of
square  <---------------------------number of PEs--------------------------->
array
             1        2        4        8       16       32       64      128
    8    .0003    .0003    .0003    .0003    .0002    .0004    .0007    .0014
   16    .0009    .0006    .0004    .0004    .0005    .0004    .0007    .0014
   32    .0035    .0021    .0013    .0009    .0007    .0009    .0008    .0014
   64    .0151    .0084    .0044    .0026    .0017    .0015    .0017    .0016
  128    .0589    .0305    .0159    .0089    .0050    .0034    .0028    .0032
  256    .2303    .1181    .0605    .0314    .0168    .0101    .0067    .0055
  512   1.1132    .5641    .2853    .1452    .0746    .0393    .0220    .0146
 1024   5.6349    2.839   1.4270    .7191    .3632    .1855    .0964    .0528
 2048   memory    13.25   6.6448   3.3276   1.6717    .8434    .4292    .2237
 4096   memory   memory   memory  14.9312   7.4664   3.7503   1.8877    .9605
 8192   memory   memory   memory   memory   memory  18.7711   9.4081   4.7355
16384   memory   memory   memory   memory   memory   memory   memory  18.9627
32768   memory   memory   memory   memory   memory   memory   memory   memory
Users interested in more details about this 2D FFT should contact Chris Yerkes at the above e-mail address.
Fortran Character Arrays in PVM and MPI
Last week's announcement of the EPCC/CRI MPI at ARSC was mostly taken up with the list of problems with the initial release. That was mainly because the only ASCII file I had about the release was the list of known problems. I think it's a sign of good software to be up front about known problems; it means the developers value the user's effort. I encourage those interested in MPI to ask for the EPCC/CRI PostScript users guide, which I can e-mail, and to check out the MPI information available at: http://www.mcs.anl.gov/mpii. MPI may be a replacement for PVM, but it has some of the same problems as PVM. One common problem is how to send a message that is a Fortran character string. Vance Shaffer sends in this suggestion:
>c Thought you might be interested in this little problem and
>c a work around for it. In MPI, character strings present a problem,
>c just like they do with PVM. To get around the problem you can
>c do the same trick as with PVM - equivalence the character string
>c you are working with to an integer array and then pass the integer
>c address.
>
>c Vance
>
>      PROGRAM TEST1
>      include 'mpif.h'
>      integer errcode,rank
>      integer status(MPI_STATUS_SIZE)
>      equivalence (buf, iarray)
>      integer iarray(2)
>
>      CHARACTER*16 buf
>      call MPI_INIT(errcode)
>      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,errcode)
>
>      buf="test1"
>      if(rank.eq.0) then
>        call MPI_SSEND(iarray,5,MPI_CHARACTER,1,1,MPI_COMM_WORLD,
>     1       errcode)
>c       call MPI_SSEND(buf,5,MPI_CHARACTER,1,1,MPI_COMM_WORLD,errcode)
>        call MPI_RECV(iarray,5,MPI_CHARACTER,1,1,MPI_COMM_WORLD,status,
>     *       errcode)
>c       call MPI_RECV(buf,5,MPI_CHARACTER,1,1,MPI_COMM_WORLD,status,
>c    *       errcode)
>        print *, buf,rank
>      else
>        call MPI_RECV(iarray,5,MPI_CHARACTER,0,1,MPI_COMM_WORLD,status,
>     *       errcode)
>c       call MPI_RECV(buf,5,MPI_CHARACTER,0,1,MPI_COMM_WORLD,status,
>c    *       errcode)
>        print *, buf,rank
>        buf="test2"
>        call MPI_SSEND(iarray,5,MPI_CHARACTER,0,1,MPI_COMM_WORLD,
>     1       errcode)
>c       call MPI_SSEND(buf,5,MPI_CHARACTER,0,1,MPI_COMM_WORLD,errcode)
>      endif
>
>      call MPI_FINALIZE(errcode)
>
>      end
Limited Availability of CRAY2IEG
At ARSC we have several graphics applications that use the T3D as a computational resource for images that are displayed on a workstation. In these applications, the Y-MP is necessary for control and file manipulation but in general, all computational tasks are pushed onto the T3D. There is a very powerful library routine on the Y-MP, CRAY2IEG, which does bit manipulation between CRI and IEEE data types (see manpage on Denali). These operations seem to be exactly what the T3D processor would do well on. But the routine CRAY2IEG is only available on the Y-MP. This difference between machines makes these heterogeneous applications push some computation onto the least appropriate machine.
New Fortran Compiler
An upgraded version of the cf77 compiler is available on Denali with the paths:
  /mpp/bin/cft77new and /mpp/bin/cf77new
For the default versions we have:
  /mpp/bin/cf77 -V
  Cray CF77_M   Version 6.0.4.1 (6.59)   05/25/95 13:36:39
  Cray GPP_M    Version 6.0.4.1 (6.16)   05/25/95 13:36:39
  Cray CFT77_M  Version 6.2.0.4 (227918) 05/25/95 13:36:39
and for this new version:
  /mpp/bin/cf77new -V
  Cray CF77_M   Version 6.0.4.1 (6.59)   05/25/95 13:37:26
  Cray GPP_M    Version 6.0.4.1 (6.16)   05/25/95 13:37:26
  Cray CFT77_M  Version 6.2.0.9 (259228) 05/25/95 13:37:27
This new compiler fixes a potential race condition in shared memory accesses and also fixes an inlining problem with the F90 intrinsics, MINLOC and MAXLOC.
I have completed my testing of this compiler and it will become the default on June 20. I encourage users to try this compiler before it becomes the default.
List of Differences Between T3D and Y-MP
The current list of differences between the T3D and the Y-MP is:
- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (Newsletter #7)
- There is no GETENV on the T3D (Newsletter #8)
- Missing routine SMACH on T3D (Newsletter #9)
- Different Arithmetics (Newsletter #9)
- Different clock granularities for gettimeofday (Newsletter #11)
- Restrictions on record length for direct I/O files (Newsletter #19)
- Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
- Missing Linpack and Eispack routines in libsci (Newsletter #25)
- F90 manual for Y-MP, no manual for T3D (Newsletter #31)
- RANF() and its manpage differ between machines (Newsletter #37)
- CRAY2IEG is available only on the Y-MP (Newsletter #40)
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
