ARSC T3D Users' Newsletter 38, June 2, 1995
Performance of CCFFT on the T3D
Chris Yerkes (yerkes@arsc.edu) of the UAF Electrical Engineering Department is using the ARSC T3D to implement an application that needs a two-dimensional FFT. As a first step towards implementing his application, he timed the T3D library routine CCFFT, which performs a complex-to-complex FFT. His timing program is a good example of a C program calling a library routine that is usually called from a Fortran program. Here's his code:
/* Test FFT program */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <fortran.h>

#define MAXLEN  32768 /* maximum length of FFT */
#define MAXFFTS 32    /* number of FFTs to do */
#define MAXP2   15    /* log2( MAXLEN ) */

float fortran rtc();
void fortran ccfft();

static double Rdata[2*MAXLEN*MAXFFTS];
static float pad[change];
static double Work[2*MAXLEN];
static double re_table[2*MAXLEN],re_work[4*MAXLEN];

void main()
{
    int i,j,k,l,N,zero,one;
    float t1,t2;
    double et,t11[MAXP2],t21[MAXP2];
    double rone;
    double *Rd,*Wk;

    Rd = &Rdata[0];
    Wk = &Work[0];
    N=2;
    et = 0;
    zero=0;
    one=1;
    rone=1.0;
    for (j = 1; j < MAXP2 ; j++){
        N *= 2;
        ccfft(&zero,&N,&rone,&Work,&Work,&re_table,&re_work,&zero);
        l = 0;
        /* Initialize random vector */
        for (k = 0; k < 2*MAXFFTS*N;k++) (*(Rd+k)) = rand();
        /* Treat as 2d array of vectors of stride MAXFFTS*/
        for (k = 0; k < MAXFFTS ;k++){
            l = 0;
            for (i=0;i < N;i++){
                (*(Wk+(2*i)  )) = (*(Rd+(2*k)  +l));
                (*(Wk+(2*i+1))) = (*(Rd+(2*k+1)+l));
                l += MAXFFTS;
            }
            t1 = rtc();
            ccfft(&one,&N,&rone,&Work,&Work,&re_table,&re_work,&zero);
            t2 = rtc();
            et += (t2-t1)/150000000.0; /*Elapsed time using 150MHz*/
            l = 0;
            for (i=0;i < N;i++){
                (*(Rd+(2*k)  +l)) = (*(Wk+(2*i)  ));
                (*(Rd+(2*k+1)+l)) = (*(Wk+(2*i+1)));
                l += MAXFFTS;
            }
        }
        t11[j-1]=(1.0/(double)MAXFFTS)*et; /* Averaged elapsed time */
        t21[j-1]=1.0/(1000000.*t11[j-1])*5*N*(j+1); /*MFLOPS from ccfft man page*/
        printf(" %4d %10d %e %6.1f\n", j, N, t11[j-1],t21[j-1]);
    }
}
On the 150 MFLOPs peak Alpha processor of the T3D, the performance was about 10% of peak, which was a disappointment. The T3D processor has a direct mapped 8KB cache and shares 2 pages of memory, so there can be significant degradation in performance when the program is missing cache lines or is swapping between pages. This is especially true with the power-of-two FFT, where it is very likely that consecutive loads map to the same cache line.
To check out the possibility that cache misses or page swapping were partly responsible, I added one line to the above program:
static float pad[change];
and allowed the value of 'change' to range from 1 to 12. Here is a table of MFLOPs for different values of change:
           Performance (MFLOPs) of the T3D routine CCFFT
                 for Chris Yerkes' timing program

trans-                  value of the pad (i.e., "change")
form    ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
length     0    1    2    3    4    5    6    7    8    9   10   11   12
------  ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
     4   7.3  7.3  7.3  7.5  7.5  7.6  7.6  7.3  7.2  7.2  7.2  7.5  7.5
     8   8.7  8.8  8.6  8.9  8.8  9.1  9.1  8.7  8.6  8.7  8.7  8.8  8.8
    16  10.9 11.8 11.7 11.3 11.3 11.6 11.6 10.9 10.8 11.7 11.8 11.2 11.2
    32  13.2 14.3 14.2 13.3 13.3 13.3 14.3 12.6 12.6 14.2 14.2 13.3 13.3
    64  15.7 17.1 16.7 15.7 15.7 16.6 17.2 15.3 15.4 17.1 17.1 15.7 15.7
   128  17.6 19.1 18.8 17.4 17.2 19.2 19.5 17.6 17.6 19.0 19.1 17.2 17.4
   256  19.6 21.3 21.3 19.0 18.9 21.5 21.7 19.6 19.5 21.4 21.3 19.0 19.0
   512  18.9 19.6 19.6 18.5 18.5 19.8 19.8 18.9 18.9 19.6 19.6 18.5 18.5
  1024  14.3 16.4 16.4 14.0 14.0 16.6 16.5 14.3 14.3 16.4 16.4 14.0 14.0
  2048  11.4 14.3 14.3 11.2 11.2 14.4 14.3 11.4 11.4 14.3 14.3 11.2 11.2
  4096  10.1 12.9 12.9 10.0 10.0 13.0 13.0 10.1 10.1 12.9 12.9 10.0 10.0
  8192   8.8 11.0 11.0  8.7  8.7 11.1 11.1  8.8  8.8 11.0 11.0  8.7  8.6
 16384   8.4 10.6 10.6  8.3  8.3 10.7 10.7  8.4  8.4 10.6 10.6  8.3  8.3
 32768   8.1 10.1 10.1  8.0  8.0 10.2 10.2  8.1  8.1 10.1 10.1  8.0  8.0
From the table we have several observations:
- Eventually the transform is so large that performance decreases; this must be a loss of cache locality. (It rarely happens in the Y-MP vector world that a larger problem is less efficient than a smaller one.)
- The pad is a multiple of 32 bits but from the performance it looks like the allocation is 64 bits at a time.
- A simple 1 word (64 bits) pad gets about a 20% performance boost for the largest transform (32768) for this timing program.
FFT Operation Counts
In the common man page for FFTs on Denali, the operation count of a power-of-two FFT is approximated as:
5 * n * log2( n )
where:
- each addition and multiplication is one operation
- n is the length of the complex to complex FFT
- log2( n ) is the base 2 logarithm of n
          Operation counts for the CCFFT routine on the Y-MP

Length of   Log2 of      Estimate of   Actual      Actual       Total Actual
Transform   the Length   operations    Additions   Multiplies   Operations
    n       log2(n)      5*n*log2(n)
---------   ----------   -----------   ---------   ----------   ------------
        4            2            40          31            7             38
        8            3           120          81           25            106
       16            4           320         173           49            222
       32            5           800         471          232            703
       64            6          1920        1047          488           1535
      128            7          4480        2379         1086           3465
      256            8         10240        5275         2270           7545
      512            9         23040       12012         5480          17492
     1024           10         51200       26476        11880          38356
     2048           11        112640       58781        27314          86095
     4096           12        245760      127453        58162         185615
     8192           13        532480      279150       132156         411306
    16384           14       1146880      598894       280124         879018
    32768           15       2457600     1295743       625222        1920965
    65536           16       5242880     2754124      1313795        4067919
There are several reasons why the approximation, 5*n*log2(n), is too generous:
- The algorithm implemented is actually a mixed radix-2, radix-4, and radix-8 algorithm, not just a radix-2 algorithm.
- The innermost loops of the implemented algorithm are probably interchanged to ensure a 'good' vector length.
- Trivial operations like multiplies by 1.0 and additions with 0.0 have been optimized out.
Next week we'll have more results on the way to a two-dimensional FFT. There are two- and three-dimensional FFTs scheduled for the 1.2.1 release of the Programming Environment. That release isn't available yet but we may be able to move to it in July.
New T3D Batch PE Limits
In the past week all active users of the ARSC T3D had their batch PE limit increased to 128. This allows these users access to the 128-PE 8-hour queues that run on the weekends. If you need your T3D UDB limits changed, please contact Mike Ess.
New Fortran Compiler
An upgraded version of the cf77 compiler is available on Denali with the paths:
/mpp/bin/cft77new and /mpp/bin/cf77new
For the default versions we have:
/mpp/bin/cf77 -V
Cray CF77_M Version 6.0.4.1 (6.59) 05/25/95 13:36:39
Cray GPP_M Version 6.0.4.1 (6.16) 05/25/95 13:36:39
Cray CFT77_M Version 6.2.0.4 (227918) 05/25/95 13:36:39
and for this new version:
/mpp/bin/cf77new -V
Cray CF77_M Version 6.0.4.1 (6.59) 05/25/95 13:37:26
Cray GPP_M Version 6.0.4.1 (6.16) 05/25/95 13:37:26
Cray CFT77_M Version 6.2.0.9 (259228) 05/25/95 13:37:27
This new compiler fixes a potential race condition in shared memory accesses and also fixes an inlining problem with the F90 intrinsics, MINLOC and MAXLOC.
This compiler will become the default after we finish testing it and users will be notified before that happens. I encourage users to try this compiler before it becomes the default.
List of Differences Between T3D and Y-MP
The current list of differences between the T3D and the Y-MP is:
- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (Newsletter #7)
- There is no GETENV on the T3D (Newsletter #8)
- Missing routine SMACH on T3D (Newsletter #9)
- Different Arithmetics (Newsletter #9)
- Different clock granularities for gettimeofday (Newsletter #11)
- Restrictions on record length for direct I/O files (Newsletter #19)
- Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
- Missing Linpack and Eispack routines in libsci (Newsletter #25)
- F90 manual for Y-MP, no manual for T3D (Newsletter #31)
- RANF() and its manpage differ between machines (Newsletter #37)
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
