ARSC T3D Users' Newsletter 41, June 23, 1995
The 2D FFT on the ARSC T3D
Srikanth Thirumalai of the Mathematical Software Group at Cray Research sends in the following note about CRI library routines for the two dimensional Fast Fourier Transforms on the T3D:
> This is just to inform you that Craylibs 1.2.1 has 2
> routines PCCFFT2D and PCCFFT3D for 2D and 3D parallel
> complex-to-complex FFTs. I have included the performance
> numbers for PCCFFT2D for your convenience.
>
> Craylibs 1.2.2 will have similar routines for
> real-to-complex and complex-to-real FFTs. They have been
> called PSC(CS)FFT2D(3D).
>
> Performance of PCCFFT2D in Mflops.
>
> Side of
> square  <--------------------number of PEs--------------------->
> array
>             1      2      4      8     16     32     64    128
>
>    32    30.5   39.7   60.0   75.9   90.4   90.6   89.1   88.3
>    64    41.8   59.9  106.1  163.9  237.5  302.4  278.7  278.2
>   128    42.0   66.7  127.6  226.4  407.6  652.3  834.3  776.9
>   256    42.4   71.3  138.3  259.9  483.1  870.4 1433.4 1989.6
>   512    33.2   59.6  116.3  227.2  432.8  808.5 1438.7 2613.9

These numbers are better than those published in the last newsletter. ARSC is moving up to the 1.2.1 release of Craylibs, and ARSC users will be informed through this newsletter when ARSC upgrades.
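One way to read the performance table is in terms of parallel speedup and efficiency. The following is only a minimal sketch and is not part of the note from Cray Research. The function names speedup and efficiency are hypothetical, and the sketch assumes that, for a fixed array size, the ratio of Mflops rates can stand in for the ratio of run times.

```c
#include <assert.h>

/* Speedup of a run on several PEs over the 1-PE run for the same array
   size, taken here as the ratio of the two Mflops rates (an assumption:
   the same operation count is used for both rates). */
double speedup(double mflops_p, double mflops_1)
{
    assert(mflops_1 > 0.0);
    return mflops_p / mflops_1;
}

/* Parallel efficiency: the speedup divided by the number of PEs used. */
double efficiency(double mflops_p, double mflops_1, int pes)
{
    assert(pes > 0);
    return speedup(mflops_p, mflops_1) / pes;
}
```

For the 512 row, speedup(2613.9, 33.2) is a little under 79 on 128 PEs, an efficiency of roughly 0.6. For the 32 row the same calculation gives a speedup of less than 3 on 128 PEs, which suggests that the small arrays do not have enough work to keep many PEs busy.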
Small Changes to the ARSC T3D Batch Queues
On June 16th we made some small adjustments to the ARSC T3D batch queues. The T3D queues are now described on Denali with the command:

  news t3dbatch

which produces:
The description of the ARSC T3D NQS queues is:
New T3D Batch Queues
====================
The T3D batch queues were changed on June 16th 1995. The
current T3D queues are:
Always on:
  m_8pe_24h    2 jobs using at most   8 PEs for 24 hours
  m_16pe_24h   2 jobs using at most  16 PEs for 24 hours
  m_32pe_24h   1 job  using at most  32 PEs for 24 hours
  m_64pe_10m   1 job  using at most  64 PEs for 10 minutes
  m_64pe_24h   1 job  using at most  64 PEs for 24 hours
  m_128pe_5m   1 job  using at most 128 PEs for  5 minutes
There is one additional queue that is enabled on Friday at
6PM and disabled at 4AM on Sunday:
  m_128pe_8h   1 job  using at most 128 PEs for  8 hours
A request made to these queues will be run as soon as enough
PEs are available to satisfy the request.
User's UDBSEE limits
Most T3D users currently have a limit of 128 PEs for batch
access. Users can check their limits with the udbsee command:
  udbsee | grep jpelimit
The output will indicate their limits in interactive (i) and
batch (b). For example:
jpelimit[b] :128:
jpelimit[i] :8:
If your batch PE limit is too small to access these new NQS
queues and you would like to use them, please contact Mike
Ess, either by phone at 907-474-5404 or e-mail to ess@arsc.edu,
to have your PE batch limits increased.
Users can query the NQS batch system with the command:
qstat -a
to see what other NQS T3D jobs are scheduled to run on the T3D.
The utility mppmon is available to see what jobs are currently
running on the T3D. T3D jobs are executed on a "first fit"
priority and run to completion without interruption.
Mike Ess, June 19, 1995
Big Jobs in the ARSC T3D NQS Queues
At ARSC it is possible to use all 128 PEs, either through the two 128 PE batch queues or, for some users, during interactive sessions. Before submitting such a job it is a good idea to use the mppmon command to see who is using the machine. The two ideas to keep in mind are:
- Your job won't run until the machine becomes idle. Even a single PE job will make your 128 PE job wait until the machine is idle.
- All jobs submitted after your 128 PE job will wait until your 128 PE job completes. The combination of a long running job (any number of PEs) followed by a 128 PE job (for any amount of time) will effectively close down the T3D until both the long running job AND the 128 PE job finish.
The EPCC/CRI Version of MPI on the ARSC T3D
The three files necessary for running MPI programs on the ARSC T3D have been put in their default locations:

  mpi.h and mpif.h go to /usr/include/mpp
  libmpi.a goes to /mpp/lib

Now (when the environment variable TARGET is set to cray-t3d) MPI C programs can be compiled as:

  cc -c shifter.c
  cc shifter.o -lmpi

and similarly for MPI Fortran programs:

  cf77 -c -I/usr/include/mpp shifterF.f
  cf77 shifterF.o -lmpi

If there are any problems with MPI on the ARSC T3D please contact Mike Ess.
Sorting on the T3D
A user asked: What type of sorting functions are available on the T3D? On the Y-MP, there are several sorting functions: ISORTD, SSORTB, ISORTB and ORDERS. But looking over the decks that are in the MPP Craylibs, I could only find qsort in /mpp/lib/libc.a. qsort requires the user to supply a function that compares the elements of the array to be sorted, so I doubt that it can be very fast at sorting a specific kind of data, like an array of integers.

There can be a big speed difference between sorting functions, and to start an investigation I typed in several functions from "Numerical Recipes in C", by Press, Flannery, Teukolsky & Vetterling, and timed them on a single PE on the T3D. The functions in this book are translations from the book "Numerical Recipes in Fortran" and assume that the elements to be sorted are in the array a[1:n]. I changed the functions to sort the array when the elements are a[0:n-1], which seems more natural in C. Each function takes as input the length of an array of integers and the starting address of the array, and overwrites the input array with the sorted array.
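For reference, the qsort interface mentioned above can be wrapped to match the calling convention of the functions in this newsletter (length first, then the array). This is only a sketch: the names compare_ints and sort_ints are hypothetical, it uses ANSI C prototypes rather than the older style of the listings in this issue, and it has not been timed on the T3D.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison function in the form qsort expects: it receives pointers to
   two elements and returns a negative, zero or positive value. */
static int compare_ints(const void *pa, const void *pb)
{
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    return (a > b) - (a < b);      /* avoids the overflow of a - b */
}

/* Sort arr[0:n-1] in ascending order with the C library qsort. */
void sort_ints(int n, int arr[])
{
    assert(n >= 0);
    qsort(arr, (size_t)n, sizeof arr[0], compare_ints);
}
```

Every comparison goes through a call to compare_ints via a function pointer, which is the overhead the question above is concerned about.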
The timing of sorting functions is very sensitive to how "orderly" the input array is but for input arrays generated with the random generator RANF (newsletter #37) sorting times below are typical:
A timing comparison of sorting functions on the T3D (seconds)

 Number of    insertion        shell    quicksort   Munstock's
  elements         sort         sort                      sort
   to sort
         1     0.000004     0.000010     0.000005     0.000004
         2     0.000005     0.000012     0.000006     0.000008
         3     0.000006     0.000013     0.000006     0.000008
         4     0.000006     0.000017     0.000007     0.000011
         5     0.000007     0.000018     0.000008     0.000013
        10     0.000009     0.000024     0.000012     0.000021
        20     0.000017     0.000043     0.000022     0.000045
        30     0.000026     0.000072     0.000032     0.000079
        40     0.000041     0.000099     0.000044     0.000123
        50     0.000055     0.000133     0.000056     0.000160
       100     0.000194     0.000327     0.000122     0.000364
       200     0.000645     0.000846     0.000270     0.000873
       300     0.001543     0.001358     0.000423     0.001687
       400     0.002762     0.001795     0.000595     0.002505
       500     0.004387     0.002507     0.000759     0.003231
      1000     0.016640     0.005644     0.001750     0.008732
      2000     0.075755     0.014613     0.004072     0.026065
      3000     0.194306     0.024549     0.007052     0.047113
      4000     0.389964     0.034641     0.010076     0.070686
      5000     0.611296     0.045975     0.013453     0.095750
     10000     2.698027     0.106615     0.032286     0.224263
     20000    11.086379     0.242284     0.077115     0.535524
     30000    25.146869     0.382456     0.126676     0.892756
     40000    44.656612     0.560791     0.180124     1.307390
     50000    69.894340     0.719300     0.235710     1.781510
The table shows that making the right algorithm choice can make a big difference. Below is the source code for the timer/tester and the source for each of the sorting functions. If any user has more results in this area I would be happy to put them in this newsletter.
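The remark that sorting times depend on how "orderly" the input is can be checked without a timer by counting work instead. The sketch below follows the structure of the insertion sort in this issue but also counts element moves. The name insert_count_moves is hypothetical and the function is an illustration, not one of the timed routines.

```c
#include <assert.h>

/* Insertion sort of arr[0:n-1] in ascending order that also returns the
   number of element moves made by the inner loop. */
long insert_count_moves(int n, int arr[])
{
    long moves = 0;
    int i, j, a;

    assert(n >= 0);
    for (j = 1; j < n; j++) {          /* pick out each element */
        a = arr[j];
        i = j - 1;
        while (i >= 0 && arr[i] > a) { /* look for the place to insert it */
            arr[i + 1] = arr[i];
            i--;
            moves++;
        }
        arr[i + 1] = a;                /* insert it */
    }
    return moves;
}
```

An array that is already in ascending order needs no moves at all, while an array of n distinct values in descending order needs n(n-1)/2 of them, so the same function can look very fast or very slow depending on its input.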
Often in sorting it is not so useful to have the sorted array overwrite the unsorted input array. It is more useful to have an index array whose i-th value is the position in the input array of the i-th element of the sorted array. With this index, the sorted array could be printed out as:

  for( i = 0; i < n; i++ ) printf( " %d %d\n", i, a[ index[ i ] ] );

The "Numerical Recipes in C" book shows how to modify the insertion, shell and heap sort functions to return this index that sorts the input array. The function "munstock" is this kind of sorting function; it produces an index to sort the array. (I got this function from Jim Munstock of CRI more than ten years ago in CDC assembler. The translation to C is straightforward and it's a useful utility to keep around.) The timer/tester and the sorting functions follow:
#include <stdio.h>
#include <stdlib.h>
#define MAXLENGTH 100000
#define MAXCASE 31
main()
{
  int a[ MAXLENGTH ], b[ MAXLENGTH ], c[ MAXLENGTH ], d[ MAXLENGTH ];
  int e[ MAXLENGTH ], indx[ MAXLENGTH ];
  int i, j, n;
  fortran double RANF();
  void insert(), heap(), shell(), munstock();
  double t1, t2, t3, t4, t5, second();
  int ncase,kcase[MAXCASE] = {0,1,2,3,4,5,10,20,30,40,50,100,200,300,400,500,1000,2000,3000,4000,5000,10000,20000,30000,40000,50000,100000,200000,300000,400000,500000};
  /* only the first 27 entries of kcase fit in arrays of MAXLENGTH elements */
  for( ncase = 0; ncase < 27; ncase++ ) {
    n = kcase[ ncase ];
    /* fill a with random integers and give each sort its own copy */
    for( i = 0; i < n; i++ ) {
      a[ i ] = n * RANF();
      b[ i ] = a[ i ];
      c[ i ] = a[ i ];
      d[ i ] = a[ i ];
      e[ i ] = a[ i ];
    }
    /* time each sort on identical input */
    t1 = second( );
    insert( n, b );
    t2 = second( );
    shell( n, c );
    t3 = second( );
    heap( n, d );
    t4 = second( );
    munstock( n, e, indx );
    t5 = second( );
    /* check the results: b must be in order and the others must match b */
    if( n > 0 ) {
      for( j = 0; j < n-1; j++ ) {
        if( b[ j ] > b[ j+1 ] ) {
          printf( " failure at insert sort %d %d %d\n", j, a[ j ], b[ j ] );
          for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, a[ i ], b[ i ] );
          exit( -1 );
        }
      }
      for( j = 0; j < n; j++ ) {
        if( c[ j ] != b[ j ] ) {
          printf( " failure at shell sort %d %d %d\n", j, b[ j ], c[ j ] );
          for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, a[ i ], c[ i ] );
          exit( -1 );
        }
      }
      for( j = 0; j < n; j++ ) {
        if( d[ j ] != b[ j ] ) {
          printf( " failure at heap sort %d %d %d\n", j, b[ j ], d[ j ] );
          for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, a[ i ], d[ i ] );
          exit( -1 );
        }
      }
      for( j = 0; j < n; j++ ) {
        if( e[ indx[ j ] ] != b[ j ] ) {
          printf( " failure at index sort %d %d %d\n",j,indx[j],e[indx[j]]);
          for( i = 0; i < n; i++ ) printf( " %d %d %d\n", i, b[i],e[indx[i]]);
          exit( -1 );
        }
      }
    }
    printf( " %2d %6d %12.6f %10.6f %10.6f %10.6f\n",ncase,n,t2-t1,t3-t2,t4-t3,t5-t4);
  }
}
void insert( n, arr )
/* sorts the array arr[0:n-1] in ascending order with insertion sort
   insertion sorting from "Numerical Recipes in C", Press,Flannery,Teukolsky &
   Vetterling, modified for "C" arrays by Mike Ess, 1995 */
int n;
int arr[];
{
  int i, j, a;
  if( n < 1 ) {
    printf( "insert sort called with elements to sort less than 1\n" );
  } else {
    for( j = 1; j < n; j++ ) {              /* pick out each element */
      a = arr[ j ];
      i = j - 1;
      while( i >= 0 && arr[ i ] > a ) {     /* look for place to insert it */
        arr[ i + 1 ] = arr[ i ];
        i--;
      }
      arr[ i + 1 ] = a;                     /* insert it */
    }
  }
}
#include <math.h>
#define ALN2I 1.442695022
#define TINY 1.0e-5
void shell( n, arr )
/* sorts the array arr[0:n-1] in ascending order with shell sort
   shell sorting from "Numerical Recipes in C", Press,Flannery,Teukolsky &
   Vetterling, modified for "C" arrays by Mike Ess, 1995 */
int n;
int arr[];
{
  int nn, m, j, i, lognb2;
  int t;
  if( n < 1 ) {
    printf( "shell sort called with elements to sort less than 1\n" );
  } else {
    lognb2 = (log((double) n) * ALN2I + TINY);
    m = n;
    for( nn = 1; nn <= lognb2; nn++ ) {       /* Loop over partial sorts */
      m >>= 1;
      for( j = m; j < n ; j++ ) {             /* Outer loop of straight insertion */
        i = j-m;
        t = arr[ j ];
        while( i >= 0 && arr[ i ] > t ) {     /* Inner loop of straight insertion */
          arr[ i + m ] = arr[ i ];
          i -= m;
        }
        arr[ i + m ] = t;
      }
    }
  }
}
void heap( n, ra )
/* sorts the array ra[0:n-1] in ascending order with heap sort
   heap sorting from "Numerical Recipes in C", Press,Flannery,Teukolsky &
   Vetterling, modified for "C" arrays by Mike Ess, 1995 */
int n;
int ra[];
{
  int l, j, ir, i;
  int rra;
  if( n < 1 ) {
    printf( "heap sort called with elements to sort less than 1\n" );
  } else {
    l = n / 2 + 1;
    ir = n;
    /* The index l will be decremented from its initial value down to 1
       during the "hiring" (heap creation) phase. Once it reaches 1, the
       index ir will be decremented from its initial value down to 1 during
       the "retirement-and-promotion" (heap selection) phase */
    for( ;; ) {
      if( l > 1 ) {                  /* Still in hiring phase */
        l--;
        rra = ra[ l-1 ];
      } else {                       /* In retirement-and-promotion phase. */
        rra = ra[ ir-1 ];            /* Clear a space at the end of array. */
        ra[ ir-1 ] = ra[ 0 ];        /* Retire the top of the heap into it. */
        ir--;                        /* Done with the last promotion. */
        if( ir == 0 ) {              /* The least competent worker of all! */
          ra[ 1-1 ] = rra;
          return;
        }
      }
      i = l;                         /* Whether we are in the hiring phase */
      j = l + l;                     /* or promotion phase, we here set up to */
      while( j <= ir ) {             /* shift down element rra to its proper */
        if( j < ir ) {               /* level. */
          if( ra[ j-1 ] < ra[ j ] ) j++;   /* Compare to the better underling */
        }
        if( rra < ra[ j-1 ] ) {      /* demote rra */
          ra[ i-1 ] = ra[ j-1 ];
          i = j;
          j = j + j;
        } else {
          j = ir + 1;                /* This is rra's level. Set j to */
        }                            /* terminate the sift-down */
      }
      ra[ i-1 ] = rra;               /* Put rra into its slot. */
    }
  }
}
void munstock( length, a, ind )
/* Sort the array "a" to produce an index "ind" of the sorted array. From
   Jim Munstock, translated to C by Mike Ess 1989 */
int length;
int a[ ];
int ind[ ];
{
  int i, ii, ij, j, m, m1, n2;
  int t;
  for ( i = 0 ; i < length; i++ ) ind[ i ] = i;   /* start with the identity index */
  m = 1;
  n2 = length / 2;
  m = 2 * m;
  while ( m <= n2 ) m = 2 * m;
  m = m - 1;                     /* first gap is one less than a power of 2 */
three:;
  m1 = m + 1;
  for ( j = m1-1 ; j < length; j++ ) {
    t = a[ ind[ j ] ];
    ij = ind[ j ];
    i = j - m;
    ii = ind[ i ];
four:;
    if ( t < a[ ii ] ) {         /* move index entries of larger elements up by m */
      ind[ i+m ] = ii;
      i = i - m;
      if ( i >= 0 ) {
        ii = ind[ i ];
        goto four;
      }
    }
    ind[ i+m ] = ij;
  }
  m = m / 2;                     /* halve the gap and make another pass */
  if ( m > 0 ) goto three;
  return;
}
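Once an index such as the one munstock produces is available, it can be used to build a sorted copy of the data or to check the ordering without touching the original array. The following is a minimal sketch of that usage. The names gather_by_index and index_is_sorted are hypothetical, and ANSI C prototypes are used rather than the older style of the listings above.

```c
#include <assert.h>

/* Gather a[] through the index ind[] into out[], so that out[0:n-1] is a
   sorted copy and a[] itself is left untouched. */
void gather_by_index(int n, const int a[], const int ind[], int out[])
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        out[i] = a[ind[i]];
}

/* Return 1 if visiting a[] in the order given by ind[] yields ascending
   values, 0 otherwise. */
int index_is_sorted(int n, const int a[], const int ind[])
{
    int i;

    for (i = 0; i + 1 < n; i++)
        if (a[ind[i]] > a[ind[i + 1]])
            return 0;
    return 1;
}
```

The same index can be applied to several arrays that share the ordering, for example a key array and its associated data, which is one reason an index sort is a handy utility.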
List of Differences Between T3D and Y-MP
The current list of differences between the T3D and the Y-MP is:
- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (Newsletter #7)
- There is no GETENV on the T3D (Newsletter #8)
- Missing routine SMACH on T3D (Newsletter #9)
- Different Arithmetics (Newsletter #9)
- Different clock granularities for gettimeofday (Newsletter #11)
- Restrictions on record length for direct I/O files (Newsletter #19)
- Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
- Missing Linpack and Eispack routines in libsci (Newsletter #25)
- F90 manual for Y-MP, no manual for T3D (Newsletter #31)
- RANF() and its manpage differ between machines (Newsletter #37)
- CRAY2IEG is available only on the Y-MP (Newsletter #40)
- Missing sort routines on the T3D (Newsletter #41)
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020

E-mail Subscriptions:
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
