ARSC T3E Users' Newsletter 132, December 5, 1997
Guy's "T3E Tools" class On mbone - Next Wednesday
Final reminder: ARSC is broadcasting the lecture portion of its T3E Tools class on the mbone. This will occur next Wednesday, December 10th, beginning at 9am Alaska Standard Time and ending at about noon. (9am AST == 10am Pacific Standard Time.) Guy will discuss and demonstrate Totalview, PAT, Apprentice, and VAMPIR on the T3E.
We welcome virtual attendees: For details, look on the mbone for the announcement, "ARSC - T3E Tools Lecture," or on the web for:
http://www.arsc.edu/user/classes/ClassT3ETools.html
--or--
/arsc/support/news/t3enews/t3enews129/index.xml
Note for local attendees: the lecture has been moved from the ARSC facility to the media lab in Rasmussen Library, 3rd floor.
Preprocessing--cpp Style
Some users have reported problems using the C compiler to preprocess Fortran codes under PE3.0.
The PE3.0 distribution notes (see "news CC30.news") state:
3) The cc 6.0 driver will now enforce what has been documented for
some time. Namely, that users who wish to use the "cc" command
to preprocess files that have suffixes that are not recognized as
C or C++ will be passed to the loader and not preprocessed.
Users who wish to preprocess these files should use the "cpp"
command.
Thus, you should invoke the preprocessor, "cpp," directly. Note that the command-line options differ slightly from those used when reaching cpp through cc: under cc, preprocessor options had to be wrapped (as with -Wp, in the example below) so that the driver would pass them through.
As an example, this PE2.0 style command,
cc -E -Wp,"-N" -D CHANGE=4 ./code.F > code.f
in PE3.0 becomes,
cpp -E -N -D CHANGE=4 ./code.F > code.f
(The cpp options: "-E" inserts "#line" directives; "-N" disables insertion of spaces; -D defines a macro.)
Fortran users might prefer to preprocess files using CRI's f90 compiler directly. This example would be equivalent to the above:
f90 -eP -F -D CHANGE=4 ./code.F
mv code.i code.f
This would both preprocess and compile:
f90 -F -D CHANGE=4 ./code.F
(The f90 options: "-F" expands macros anywhere in the code; -D defines a macro; "-eP" enables the preprocessor without doing any subsequent compilation.)
As a reminder, f90 invokes the preprocessor on files with the extensions .F or .F90, but not on those with the .f or .f90 extensions (unless you specify, "-eZ").
How f90 Handles Function Declaration Errors
One Fortran coding error, the incorrect declaration of a function's return type, can cause programs to crash or, worse, give false results under PE3.0 f90.
The subtle version of the problem is an error of omission. It occurs when a program allows implied typing (i.e., does not contain the HIGHLY RECOMMENDED implicit none statement) and fails to declare the type of a function which, by implied typing, is expected to return a value of a different type. The not-so-subtle version occurs if the programmer explicitly declares the wrong type.
For clarity, here's a blatant example:
yukon$ cat bad.f
cccccccccccccccccccccccccccccccccccccccccccccccccc
      program bad
      print*, xgetnum () ! Program expects a real by implied typing.
      end
ccc
      integer function xgetnum() ! xgetnum actually returns integer.
      xgetnum = 7
      return
      end
cccccccccccccccccccccccccccccccccccccccccccccccccc
yukon$ f90 bad.f # compiles without error
yukon$ ./a.out # runs without crashing
5.63415508906672332E-241
Assigning the same function result to an implicitly typed integer variable gives a different result:
yukon$ cat bad.f
cccccccccccccccccccccccccccccccccccccccccccccccccc
      program junk
      i = xgetnum ()
      print*, i
      end
ccc
      integer function xgetnum()
      xgetnum = 7
      return
      end
cccccccccccccccccccccccccccccccccccccccccccccccccc
yukon$ f90 bad.f # compiles without error
yukon$ ./a.out # crashes
SIGNAL: Floating point exception (invalid floating point operation)
Beginning of Traceback (PE 0):
Interrupt at address 0x800001258 in routine 'JUNK'.
Called from line 449 (address 0x800000b58) in routine '$START$'.
End of Traceback.
Floating exception(coredump)
In the next example the problem is with the implied type of a Cray function. Note that the name of the function "cri2ieg" begins with a "c" which implies that it returns a real value. In reality, it returns an integer.
The series of recompiles and runs following the listing shows the effect of the three #ifdefs. The most dangerous combination is the last, in which the function returns an erroneous value and the program doesn't crash.
cccccccccccccccccccccccccccccccccccccccc
      program test2
      integer native, foreign, ierr, DATATYPE
! This declaration would correctly override the implied real type
! of cri2ieg().
#ifdef FIX_DECLARATION
      integer cri2ieg
#endif
! cri2ieg and ieg2cri do not understand a data type of "666". And
! should return error "-2" if it is specified. The data type "2"
! is correct for this example, and corresponds to integers.
#ifdef CREATE_BUG
      DATATYPE=666
#else
      DATATYPE=2
#endif
! This shows the unpredictable outcome of the type mismatch.
#ifdef OBSCURE_THE_PROBLEM
      print*, "Random print statement... DATATYPE=", DATATYPE
#endif
      foreign = 0
      native = 12
ccc convert "native" to "foreign" IEEE 32-bit integer
      ierr = cri2ieg(DATATYPE, 1, foreign, 0, native, 1, 64, 32)
      write(6,*) "cri2ieg() returned error code:", ierr
ccc convert "foreign" back to CRI format
      native = 0
      ierr = ieg2cri(DATATYPE, 1, foreign, 0, native, 1, 64, 32)
      write(6,*) "ieg2cri() returned error code:", ierr
ccc print result of conversion
      write(6,*) "After re-conversion - native is ", native
      end
cccccccccccccccccccccccccccccccccccccccc
$ f90 test2.F
$ ./a.out
SIGNAL: Floating point exception (invalid floating point operation)
[ core dump ]
$ f90 -D FIX_DECLARATION test2.F
$ ./a.out
cri2ieg() returned error code: 0
ieg2cri() returned error code: 0
After re-conversion - native is 12
$ f90 -D CREATE_BUG test2.F
$ ./a.out
SIGNAL: Floating point exception (invalid floating point operation)
[ core dump ]
$ f90 -D FIX_DECLARATION -D CREATE_BUG test2.F
$ ./a.out
cri2ieg() returned error code: -2
ieg2cri() returned error code: -2
After re-conversion - native is 0
$ f90 -D CREATE_BUG -D OBSCURE_THE_PROBLEM test2.F
$ ./a.out
Random print statement... DATATYPE= 666
cri2ieg() returned error code: 0
ieg2cri() returned error code: -2
After re-conversion - native is 0
Solutions?
- IMPLICIT NONE (or the compiler option -eI) will help you avoid errors from incorrect type declaration and typos. It should be used in every new program.
- The Cray tool, cflint, analyzes Fortran source and warns you about possible problems. It is especially good for archival code (FORTRAN is 40 years old this year, 1997).
Here's what cflint says about bad.f:
$ cat bad.f
cccccccccccccccccccccccccccccccccccccccccccccccccc
      program bad
      i = xgetnum ()
      print*, i
      end
ccc
      integer function xgetnum()
      xgetnum = 7
      return
      end
cccccccccccccccccccccccccccccccccccccccccccccccccc
$ f90 -Ca bad.f
$ cflint bad.f
cflint 2.3.0.23: initiated 14:04 Thu Oct23,1997
(created 10:57 Wed Jun25,97) CRAY Research, Inc.
cflint bad.f
#################################################################
Global Call-Chain Considerata
===============================
1) <516> At line 3, BAD() calls XGETNUM(fileline 7):
BAD() expects the return value to be "Real(8),"
but XGETNUM() returns "Integer(8)"
"cflint 2.3.0.23" 14:04 Thu Oct23,1997 2 Subprograms:
0 Local Messages, 1 Global Message
Data Distribution
[ We received four responses to last week's quick-tip question (three of which are given in the usual "quick-tip" section, below). Brad Chamberlain of the University of Washington sent a clearly unique solution and, on our request, extended his "tip" into this short article. ]
I like to formulate data distribution in terms similar to line drawing in computer graphics, because they both ask a similar question: how should one partition K things between M discrete buckets? (assuming K is greater than M)
In graphics, this amounts to drawing a line that has to span K pixel rows in M pixel columns. A good line will have no more than two step sizes that are adjacent integers: ceil(K/M) and floor(K/M). Furthermore, it will distribute these two step sizes in such a way that the line looks as straight and balanced as is possible on a pixelized display.
In parallel processing, we're trying to distribute K data elements (or tasks) across M processors. Again, it seems desirable to have two integer step sizes (e.g. the values "1" and "2" from the quick-tip question's second example). Do we care as much about their relative distribution? I argue yes.
Simple task distribution algorithms will just put all the ceil(K/M) steps in the first elements of N (the array of per-processor counts) and fill the remaining values of N with floor(K/M). My feeling is that a more even distribution shouldn't hurt parallel execution, and may in fact improve it (e.g. if the network traffic is at all related to the distribution, a more even data distribution may result in a better distribution of network traffic).
One of the classic line drawing algorithms in graphics was developed by Bresenham in his 1965 paper "Algorithm for Computer Control of a Digital Plotter". Here I give a more intuitive version for formulating a partition:
#define ROUND(x) ((int)(0.5+(x)))
void partition(int K,int M,int N[]) {
  double slope; /* the slope of our "line" */
  double accum; /* an accumulator for the current "y" value */
  int rounded; /* the rounded-off version of the accumulator */
  int prev; /* the previous value of rounded */
  int i;
  slope = (double)K/M; /* calculate the slope */
  accum = 0.0; /* initialize the accumulator */
  prev = 0; /* initialize the previous value */
  for (i=0;i<M;i++) {
    accum += slope; /* increment the accumulator */
    rounded = ROUND(accum); /* round it to an integer */
    N[i] = rounded - prev; /* store it, minus the prev */
    prev = rounded; /* capture the new previous value */
  }
}
When run on the example in the question, it returns:
N[0] = 2, N[1] = 1, N[2] = 2
A more compelling example would be distributing 10 elements between 6 processors. If we took the simple approach described above, we'd get:
2 2 2 2 1 1
The line-based algorithm will return:
2 1 2 2 1 2
which is obviously more evenly distributed.
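The floating-point accumulator is easy to follow, but its rounding depends on how the repeated additions of K/M accumulate. As a minimal sketch, assuming K >= 0 and M > 0, the same partition can be computed with integer arithmetic only, in the spirit of Bresenham's integer formulation. The name partition_int is hypothetical and not part of the original tip; it rounds (i+1)*K/M half up, which is intended to match ROUND() for non-negative values.

```c
#include <assert.h>

/* Integer-only sketch of the line-based partition (hypothetical name).
 * N[i] is the difference between successive rounded values of (i+1)*K/M,
 * where round(a/M) for a >= 0 is computed as (2*a + M) / (2*M).
 * Assumes K >= 0, M > 0, and that 2*K*M fits in a long long. */
void partition_int(int K, int M, int N[]) {
    long long prev = 0;   /* rounded "y" value after the previous step */
    int i;

    assert(K >= 0 && M > 0);
    for (i = 0; i < M; i++) {
        long long a = (long long)(i + 1) * K;          /* exact numerator  */
        long long rounded = (2 * a + M) / (2LL * M);   /* round half up    */
        N[i] = (int)(rounded - prev);                  /* this PE's share  */
        prev = rounded;
    }
}
```

For K=10 and M=6 this fills N with 2 1 2 2 1 2, and for K=5 and M=3 with 2 1 2, the same results quoted above.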
Editor's Note:
It is interesting how this technique relates to the recursive distribution (*) techniques in which the work is divided as the processors are divided. So to take Brad's example, we would look to divide 10 onto 6 like this:
data  aaaaa AAAAA  ->  aaa bb AAA BB  ->  aa c bb AA C BB
pes     3     3         2  1   2  1       1  1 1  1  1 1
Here the data is divided into two parts at each stage along with the number of processors. Each fraction is then further divided until the processors can not be divided any further, at which point the work is spread onto one processor. A weighted technique might take the following route:
data  aaaaaa AAAA  ->  aaa bbb AA BB  ->  aa c bb d AA BB
pes     4     2         2   2   1  1       1  1 1  1 1  1
As Brad mentions, these examples have distributed the data evenly. When working with real codes, the programmer must also consider the relative costs of the computation, which may not vary linearly with data sizes, as well as the communication pattern/volume a particular data distribution/layout generates. Naturally, this caveat applies to the simpler distribution algorithms as well.
(*) Here's a simple example using recursion:
void distribute (int K, int M, int N[], int i) {
  /* K: number of tasks to distribute
   * M: number of PEs
   * N: result array--number of tasks indexed by PE number
   * i: lowest index (PE number) of the M PEs
   */
  if (M==1)
    N[i] = K;
  else {
    distribute ( K/2, M/2, N, i);
    distribute ( K-K/2, M-M/2, N, i+M/2);
  }
}
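One thing to watch with distribute() is that it always hands K/2 tasks to the left M/2 PEs, even when M is odd and the two groups differ in size; for K=9 and M=3 it produces 4 2 3. The following is a minimal sketch of one possible weighted variant, in which the left group gets a share of K proportional to its share of the PEs. The name distribute_prop and the round-to-nearest rule are assumptions made for illustration, not the scheme drawn in the diagrams above.

```c
#include <assert.h>

/* Hypothetical variant of distribute(): the M PEs are still halved,
 * but the left group receives a share of K proportional to its share
 * of the PEs (rounded to nearest) instead of always K/2.
 *   K: number of tasks to distribute
 *   M: number of PEs
 *   N: result array, number of tasks indexed by PE number
 *   i: lowest index (PE number) of the M PEs */
void distribute_prop(int K, int M, int N[], int i) {
    assert(K >= 0 && M >= 1);
    if (M == 1) {
        N[i] = K;
    } else {
        int Mleft = M / 2;
        /* round(K * Mleft / M) using integer arithmetic only */
        int Kleft = (int)((2LL * K * Mleft + M) / (2LL * M));
        distribute_prop(Kleft, Mleft, N, i);
        distribute_prop(K - Kleft, M - Mleft, N, i + Mleft);
    }
}
```

With K=9 and M=3 this returns 3 3 3, and with K=10 and M=6 it returns 2 2 1 2 2 1.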
Quick-Tip Q & A
A: {{ What is a good formula or algorithm for apportioning K equally
sized, independent tasks among M PEs on the T3E? For instance,
if K==5 and M==3, then the algorithm might return: N(0)=1,
N(1)=1, N(2)=3, which wouldn't be as good as if it returned:
N(0)=2, N(1)=2, N(2)=1. }}
# Thanks to the readers who sent in the following solutions.
#
# Using the example of the previous article, they all distribute 10 tasks
# among 6 PEs as follows:
#
# 2 2 2 2 1 1
#
# For most situations, this is certainly good enough and the choice of
# technique is personal.
#---------------------------------------------------------------------
N(i) = (K + M - i - 1) / M ! integer division
#---------------------------------------------------------------------
# In C, you can use:
block_min = K / M;
block_extra = K % M;
if( _my_pe() < block_extra ) {
  block_size = block_min + 1;
} else {
  block_size = block_min;
}
# In Fortran:
block_min = K / M
block_extra = MOD (K, M)
IF( MY_PE() .LT. block_extra ) THEN
  block_size = block_min + 1
ELSE
  block_size = block_min
ENDIF
#---------------------------------------------------------------------
# Try this:
for (i=0;i<M;i++) N[i] = K / M; /* Integer Division */
for (i=0;i<K-((K/M)*M);i++) N[i]++;
#---------------------------------------------------------------------
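The three reader solutions can be cross-checked against each other. The sketch below restates each one as a plain C function, with the PE number passed as a parameter in place of _my_pe() so that it runs on any machine, and tests that they agree. The function names and the MAX_PES bound are hypothetical and chosen only for this illustration.

```c
#include <assert.h>

#define MAX_PES 64   /* illustrative upper bound on M for this sketch */

/* Closed form: N(i) = (K + M - i - 1) / M, using integer division. */
int count_formula(int K, int M, int i) {
    return (K + M - i - 1) / M;
}

/* Block form: the first K % M PEs each get one extra task. */
int count_block(int K, int M, int pe) {
    int block_min = K / M;
    int block_extra = K % M;
    return (pe < block_extra) ? block_min + 1 : block_min;
}

/* Two-loop form: fill with K / M, then hand out the remainder. */
void fill_loops(int K, int M, int N[]) {
    int i;
    for (i = 0; i < M; i++) N[i] = K / M;
    for (i = 0; i < K - (K / M) * M; i++) N[i]++;
}

/* Returns 1 if all three forms agree on every PE and the counts sum to K. */
int solutions_agree(int K, int M) {
    int N[MAX_PES];
    int i, total = 0;

    assert(K >= 0 && M >= 1 && M <= MAX_PES);
    fill_loops(K, M, N);
    for (i = 0; i < M; i++) {
        if (count_formula(K, M, i) != N[i]) return 0;
        if (count_block(K, M, i) != N[i]) return 0;
        total += N[i];
    }
    return total == K;
}
```

Calling solutions_agree(10, 6) returns 1, with every form giving the 2 2 2 2 1 1 distribution noted above.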
Q: You're writing a Monte Carlo simulation, which will use very long
sequences of pseudo-random numbers, to run on multiple PEs. How can
you ensure that the generated sequences will not overlap?
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
