ARSC T3E Users' Newsletter 132, December 5, 1997

Guy's "T3E Tools" Class on mbone - Next Wednesday

Final reminder: ARSC is broadcasting the lecture portion of its T3E Tools class on the mbone. This will occur next Wednesday, December 10th, beginning at 9am Alaska Standard Time and ending at about noon. (9am AST == 10am Pacific Standard Time.) Guy will discuss and demonstrate Totalview, PAT, Apprentice, and VAMPIR on the T3E.

We welcome virtual attendees: For details, look on the mbone for the announcement, "ARSC - T3E Tools Lecture," or on the web for:



Note for local attendees: the lecture has been moved from the ARSC facility to the media lab in Rasmussen Library, 3rd floor.

Preprocessing--cpp Style

Some users have reported problems using the C compiler to preprocess Fortran codes under PE3.0.

The PE3.0 distribution notes (see "news") state:

   3) The cc 6.0 driver will now enforce what has been documented for
      some time.  Namely, that users who wish to use the "cc" command
      to preprocess files that have suffixes that are not recognized as
      C or C++ will be passed to the loader and not preprocessed.
      Users who wish to preprocess these files should use the "cpp"
      command.

Thus, you should invoke the preprocessor, "cpp," directly, but note that the command line options will be slightly different compared to accessing cpp via cc. This makes sense, as the options under cc had to tell the compiler to pass the code through.

As an example, this PE2.0 style command,

  cc -E -Wp,"-N" -D CHANGE=4 ./code.F > code.f

in PE3.0 becomes,

  cpp -E -N -D CHANGE=4 ./code.F > code.f

(The cpp options: "-E" inserts "#line" directives; "-N" disables insertion of spaces; "-D" defines a macro.)

Fortran users might prefer to preprocess files using CRI's f90 compiler directly. This example would be equivalent to the above:

  f90  -eP -F -D CHANGE=4 ./code.F 
  mv code.i code.f

This would both preprocess and compile:

  f90  -F -D CHANGE=4 ./code.F 

(The f90 options: "-F" expands macros anywhere in the code; "-D" defines a macro; "-eP" enables the preprocessor without doing any subsequent compilation.)

As a reminder, f90 invokes the preprocessor on files with the extensions .F or .F90, but not on those with the .f or .f90 extensions (unless you specify, "-eZ").

How f90 Handles Function Declaration Errors

A common Fortran coding error, declaring the wrong return type for a function (or not declaring it at all), can cause programs to crash or, worse, silently produce false results under PE3.0 f90.

The subtle version of the problem is an error of omission. It occurs when a program allows implied typing (i.e., does not contain the HIGHLY RECOMMENDED implicit none statement) and fails to declare the type of a function which, by implied typing, is expected to return a value of a different type. The not-so-subtle version occurs if the programmer explicitly declares the wrong type.

For clarity, here's a blatant example:

  yukon$ cat bad.f
         program bad
         print*, xgetnum ()   ! Program expects a real by implied typing.
         end
         integer function xgetnum()  ! xgetnum actually returns an integer.
         xgetnum = 7
         end
  yukon$ f90 bad.f                   # compiles without error
  yukon$ ./a.out                     # runs without crashing

And here is a different result:

  yukon$ cat bad.f
         program junk
         i = xgetnum ()
         print*, i
         end
         integer function xgetnum()
         xgetnum = 7
         end
  yukon$ f90 bad.f                   # compiles without error
  yukon$ ./a.out                     # crashes
   SIGNAL: Floating point exception (invalid floating point operation)
    Beginning of Traceback (PE 0):
     Interrupt at address 0x800001258 in routine 'JUNK'.
     Called from line 449 (address 0x800000b58) in routine '$START$'.
    End of Traceback.
   Floating exception(coredump)

In the next example the problem is with the implied type of a Cray library function. The name "cri2ieg" begins with a "c", so implied typing makes its result real; in reality, it returns an integer.
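Under implied typing the result type is decided by the first letter of the name alone: i through n give INTEGER, anything else gives REAL. A toy C function (the name implied_integer is mine, for illustration only) capturing that rule:

```c
/* Fortran implicit typing: a name whose first letter is i-n (either
 * case) is implicitly INTEGER; any other letter makes it REAL.
 * Returns 1 for implicitly INTEGER names, 0 for implicitly REAL. */
int implied_integer(const char *name) {
    char c = name[0] | 0x20;   /* fold ASCII letters to lower case */
    return c >= 'i' && c <= 'n';
}
```

By this rule, implied_integer("cri2ieg") is 0 (real) while implied_integer("ierr") is 1 (integer), which is why ierr is typed correctly by accident but cri2ieg is not.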

The series of re-compiles and runs following the listing show the effect of the three #ifdefs. The most dangerous combination is the last, in which the function returns an erroneous value and the program doesn't crash.

      program test2
      integer native, foreign, ierr, DATATYPE

#ifdef FIX_DECLARATION
! This declaration correctly overrides the implied real type
! of cri2ieg().
      integer cri2ieg
#endif

! cri2ieg and ieg2cri do not understand a data type of "666".  And 
! should return error "-2" if it is specified.  The data type "2" 
! is correct for this example, and corresponds to integers.
#ifdef CREATE_BUG
      DATATYPE = 666
#else
      DATATYPE = 2
#endif

#ifdef ADD_PRINT
! This shows the unpredictable outcome of the type mismatch.
      print*, "Random print statement... DATATYPE=", DATATYPE
#endif

      foreign = 0
      native = 12

ccc   convert "native" to "foreign" IEEE 32-bit integer

      ierr = cri2ieg(DATATYPE, 1, foreign, 0, native, 1, 64, 32)
      write(6,*) "cri2ieg() returned error code:", ierr

ccc   convert "foreign" back to CRI format

      native = 0

      ierr = ieg2cri(DATATYPE, 1, foreign, 0, native, 1, 64, 32)
      write(6,*) "ieg2cri() returned error code:", ierr

ccc   print result of conversion

      write(6,*) "After re-conversion - native is ", native

      end


  $  f90 test2.F
  $  ./a.out
    SIGNAL: Floating point exception (invalid floating point operation)
    [ core dump ]

  $  f90 -D FIX_DECLARATION test2.F
  $  ./a.out
    cri2ieg() returned error code: 0
    ieg2cri() returned error code: 0
    After re-conversion - native is  12

  $  f90 -D CREATE_BUG test2.F
  $  ./a.out
    SIGNAL: Floating point exception (invalid floating point operation)
    [ core dump ]

  $  f90 -D FIX_DECLARATION -D CREATE_BUG test2.F
  $  ./a.out
    cri2ieg() returned error code: -2
    ieg2cri() returned error code: -2
    After re-conversion - native is  0

  $  f90 -D CREATE_BUG -D ADD_PRINT test2.F
  $  ./a.out
    Random print statement... DATATYPE= 666
    cri2ieg() returned error code: 0
    ieg2cri() returned error code: -2
    After re-conversion - native is  0


Two recommendations:

  1. IMPLICIT NONE (or the compiler option -eI) will help you avoid errors from incorrect type declarations and typos. It should be used in every new program.
  2. The Cray tool, cflint, analyzes Fortran source and warns you about possible problems: good for archival code (FORTRAN is 40 years old this year, 1997).

Here's what cflint says about bad.f:

      $ cat bad.f
           program bad

           i = xgetnum ()
           print*, i
           end

           integer function xgetnum()
           xgetnum = 7
           end
      $ f90 -Ca bad.f
      $ cflint bad.f
     cflint  initiated 14:04 Thu Oct23,1997   
                     (created 10:57 Wed Jun25,97) CRAY Research, Inc.
     cflint   bad.f
                  Global Call-Chain Considerata
         1) <516>  At line 3, BAD() calls XGETNUM(fileline 7):
            BAD() expects the return value to be "Real(8),"
            but XGETNUM() returns "Integer(8)"
     "cflint" 14:04 Thu Oct23,1997  2 Subprograms:  
                                  0 Local Messages,  1 Global Message 

Data Distribution

[ We received four responses to last week's quick-tip question (three of which are given in the usual "quick-tip" section, below). Brad Chamberlain of the University of Washington sent a clearly unique solution and, on our request, extended his "tip" into this short article. ]

I like to formulate data distribution in terms similar to line drawing in computer graphics, because they both ask a similar question: how should one partition K things among M discrete buckets (assuming K is greater than M)?

In graphics, this amounts to drawing a line that has to span K pixel rows in M pixel columns. A good line will have no more than two step sizes that are adjacent integers: ceil(K/M) and floor(K/M). Furthermore, it will distribute these two step sizes in such a way that the line looks as straight and balanced as is possible on a pixelized display.

In parallel processing, we're trying to distribute K data elements (or tasks) across M processors. Again, it seems desirable to have two integer step sizes (e.g. the values "1" and "2" from the quick-tip question's second example). Do we care as much about their relative distribution? I argue yes.

Simple task distribution algorithms just put all the ceil(K/M) steps in the first K mod M elements of N and fill the remaining values of N with floor(K/M). My feeling is that a more even distribution shouldn't hurt parallel execution, and may in fact improve it (e.g., if network traffic is at all related to the distribution, a more even data distribution may result in a better distribution of network traffic).
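In code, the simple approach amounts to giving each of the first K mod M processors one extra element. A minimal C sketch (the function name is mine, for illustration):

```c
/* Simple block distribution: N[i] = number of elements on PE i.
 * The first K%M PEs get ceil(K/M) elements; the rest get floor(K/M). */
void simple_partition(int K, int M, int N[]) {
    int i;
    for (i = 0; i < M; i++)
        N[i] = K / M + (i < K % M ? 1 : 0);
}
```

For K=10 and M=6 this yields 2 2 2 2 1 1: balanced in count, but with all the large blocks bunched at the front.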

One of the classic line drawing algorithms in graphics was developed by Bresenham in his 1965 paper "Algorithm for Computer Control of a Digital Plotter". Here I give a more intuitive version for formulating a partition:

       #define ROUND(x)   (int)(0.5+(x))

       void partition(int K,int M,int N[]) {
         double slope;  /* the slope of our "line" */
         double accum;  /* an accumulator for the current "y" value */
         int rounded;   /* the rounded-off version of the accumulator */
         int prev;      /* the previous value of rounded */
         int i;
         slope = (double)K/M;       /* calculate the slope */
         accum = 0.0;               /* initialize the accumulator */
         prev = 0;                  /* initialize the previous value */
         for (i=0;i<M;i++) {
           accum += slope;          /* increment the accumulator */
           rounded = ROUND(accum);  /* round it to an integer */
           N[i] = rounded - prev;   /* store it, minus the prev */
           prev = rounded;          /* capture the new previous value */
         }
       }

When run on the example in the question, it returns:

       N[0] = 2, N[1] = 1, N[2] = 2

A more compelling example is distributing 10 elements among 6 processors. If we took the simple approach described above, we'd get:

       2 2 2 2 1 1

The line-based algorithm will return:

       2 1 2 2 1 2

which is obviously more evenly distributed.
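For those wary of accumulating floating-point error, the same sequence can be produced with integer arithmetic alone, closer to the spirit of Bresenham's original algorithm. This variant is my own sketch, not from Brad's article; it computes each rounded point on the line y = Kx/M directly:

```c
/* Integer-only line-based partition: round(K*(i+1)/M) is computed as
 * floor((2*K*(i+1) + M) / (2*M)), so no doubles are needed.  N[i] is
 * the difference between successive rounded points on the line. */
void partition_int(int K, int M, int N[]) {
    int prev = 0, i;
    for (i = 0; i < M; i++) {
        int rounded = (2 * K * (i + 1) + M) / (2 * M);
        N[i] = rounded - prev;
        prev = rounded;
    }
}
```

For K=10, M=6 this also produces 2 1 2 2 1 2 (and 2 1 2 for the K=5, M=3 quick-tip example), matching the floating-point version while avoiding any rounding concerns.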

Editor's Note:

It is interesting how this technique relates to the recursive distribution (*) techniques in which the work is divided as the processors are divided. So to take Brad's example, we would look to divide 10 onto 6 like this:

    data    aaaaa AAAAA ->  aaa bb AAA BB ->  aa c bb AA C BB
    pes       3     3        2  1   2  1      1  1 1  1  1 1

Here the data is divided into two parts at each stage, along with the number of processors. Each fraction is then further divided until the processors cannot be divided any further, at which point the work is placed on a single processor. A weighted technique might take the following route:

    data    aaaaaa AAAA ->  aaa bbb AA BB ->  aa c bb d  AA BB
    pes       4     2        2   2  1  1      1  1 1  1  1  1

As Brad mentions, these examples have distributed the data evenly. When working with real codes, the programmer must also consider the relative costs of the computation, which may not vary linearly with data sizes, as well as the communication pattern/volume a particular data distribution/layout generates. Naturally, this caveat applies to the simpler distribution algorithms as well.

(*) Here's a simple example using recursion:

      void distribute (int K, int M, int N[], int i) {
        /* K: number of tasks to distribute
        *  M: number of PEs 
        *  N: result array--number of tasks indexed by PE number
        *  i: lowest index (PE number) of the M PEs                
        */
        if (M==1)
          N[i] = K;
        else {
          distribute (  K/2, M/2, N, i);
          distribute (  K-K/2, M-M/2, N, i+M/2);
        }
      }
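A self-contained copy of the routine, for a quick check (the test values are mine):

```c
/* Recursive halving: split the tasks and the PEs in two at each level
 * until a single PE remains, then assign it all remaining tasks. */
void distribute(int K, int M, int N[], int i) {
    if (M == 1)
        N[i] = K;
    else {
        distribute(K / 2, M / 2, N, i);
        distribute(K - K / 2, M - M / 2, N, i + M / 2);
    }
}
```

Interestingly, distribute(10, 6, N, 0) fills N with 2 1 2 2 1 2: on Brad's 10-on-6 example the recursive split happens to produce the same sequence as the line-based algorithm.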

Quick-Tip Q & A

A: {{ What is a good formula or algorithm for apportioning K equally 
      sized, independent tasks among M PEs on the T3E.  For instance,
      if K==5 and M==3, then the algorithm might return: N(0)=1,
      N(1)=1, N(2)=3, which wouldn't be as good as if it returned:
      N(0)=2, N(1)=2, N(2)=1.  }}

# Thanks to the readers who sent in the following solutions.  
# Using the example of the previous article, they all distribute 10 tasks
# among 6 PEs as follows:
#    2 2 2 2 1 1
# For most situations, this is certainly good enough and the choice of
# technique is personal.

         N(i) = (K + M - i - 1) / M        ! integer division


# In C, you can use:
       block_min   = K / M;
       block_extra = K % M;
       if( _my_pe() < block_extra ) {
               block_size = block_min + 1;
       } else {
               block_size = block_min;
       }

# In Fortran:
       block_min   = K / M
       block_extra = MOD (K, M)
       IF( MY_PE() .LT. block_extra ) THEN
               block_size = block_min + 1
       ELSE
               block_size = block_min
       ENDIF


# Try this:
        for (i=0;i<M;i++) N[i] = K / M;        /* Integer Division */
        for (i=0;i<K-((K/M)*M);i++) N[i]++;
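The three solutions above compute the same partition (note that K - (K/M)*M is just K mod M). A small C check, folded into one function (the function name is mine):

```c
/* Evaluate all three reader-submitted formulas for each index i and
 * report whether they agree: 1 if identical for every i, 0 otherwise. */
int solutions_agree(int K, int M) {
    int i, a, b, c;
    for (i = 0; i < M; i++) {
        a = (K + M - i - 1) / M;              /* solution 1: one-liner  */
        b = K / M + (i < K % M ? 1 : 0);      /* solution 2, PE rank i  */
        c = K / M + (i < K - (K / M) * M);    /* solution 3: remainder  */
        if (a != b || b != c)
            return 0;
    }
    return 1;
}
```

For K=10, M=6 all three yield 2 2 2 2 1 1, as stated above.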


Q: You're writing a monte-carlo simulation, which will use very long
   sequences of pseudo-random numbers, to run on multiple PEs.  How can
   you ensure that the generated sequences will not overlap?

[ Answers, questions, and tips graciously accepted. ]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.