ARSC T3D Users' Newsletter 92, June 21, 1996

Meeting of Local ARSC T3D Users Group: A Special Guest

To stay better attuned to the needs of T3D users, and to promote interdisciplinary exchanges, we are going to rejuvenate an old practice at ARSC. We will hold periodic, informal meetings of T3D users. This forum is open to any interested T3D user, and will take place about once a month. If you don't live in Fairbanks, but happen to be passing through at the right time, park the R/V and join us!

Our first meeting will be at 3:00 PM, Thursday, July 11, in Butrovich 107.


Frank Chism, parallel processing and T3D/E consultant at CRI, is planning to be here. If you've been reading these newsletters for long, you will probably recognize Frank's name. This is your chance to ask your questions first hand.

CRI/EPCC MPI 1.5a Available To ARSC Users

The CRI-sponsored implementation of MPI for the T3D was written by the Edinburgh Parallel Computing Centre (EPCC). EPCC has released an upgrade, from version 1.4a to 1.5a, which is now available to ARSC T3D users.

This release is located in the directory:

and is available for user testing. To participate, change your search paths to the MPI library and include files as follows:

On (or soon after) June 27th, we intend to move v1-5a into the default system location, which is:


EPCC Release Notes for MPI v1.5a

These are EPCC's release notes for MPI v1.5a covering "Fixed Problems, Enhancements" and "Known Problems." You may visit EPCC's web pages on MPI at:

Release 1.5a: Fixed Problems, Enhancements

This page provides lists of fixed problems and product enhancements in the major release 1.5a compared to the major release 1.4a.

Fixed Problems: Release and Installation

Corrected problems in the release package and mechanism:
  1. Syntax errors were generated when compiling C++ MPI programs using the CRI C++ compiler, CC. (The MPI C header file violated C++ syntax by using the delete keyword as a function prototype argument.) This problem has been corrected.

MPI Conformance

The MPI for T3D implementation conforms to MPI Standard v1.1, dated 12th June 1995. A number of problems with conformance to the MPI standard have been corrected:
  1. The datatype extent of derived datatypes constructed using MPI_Type_vector and MPI_Type_hvector was incorrect. For Vector datatypes, the extent incorrectly included a trailing stride: extent = extent' * |stride| instead of extent = extent' * (|stride|-1) + len. The extent of Hvector datatypes incorrectly took the stride quantity to be in terms of the extent of the base datatype instead of bytes. Both problems have been corrected. Related to the above problem, MPI_Type_indexed displacements were incorrectly taken to be in terms of bytes, instead of the base datatype extent. This problem has also been corrected.
  2. There were two naming errors in the MPI Fortran binding given below. Both have been corrected.
    • MPI_SENDRECV was incorrectly named MPI_MPI_SENDRECV
    • MPI_STARTALL was incorrectly named MPI_MPI_STARTALL
  3. Although MPI does not define a binding for Fortran 90, it is possible to compile Fortran 77 MPI programs using the F90 compiler. There was a problem with use of the optional datatype MPI_REAL4 stemming from the fact that Fortran language type REAL*4 is represented differently in the Cray F77 (8 byte floating-point) and F90 (4 byte floating point) compilers. The 1.4a MPI Fortran header file contained definitions for F77, so use of MPI_REAL4 from F90 was incorrect. The same problem applied to INTEGER*{1,2,4} and MPI_INTEGER{1,2,4}. This has been resolved by changing the definitions of the MPI datatypes to match the Cray F90 representations as follows. Cray F77 users should be aware that MPI datatypes for 32-bit types should not be used.
      Fortran                 Size    C
      MPI datatype            (bits)  type
      -------------           ------  -------
      MPI_INTEGER             64      int
      MPI_REAL                64      double
      MPI_INTEGER1            32      short
      MPI_INTEGER2            32      short
      MPI_INTEGER4            32      short
      MPI_REAL2               32      float
      MPI_REAL4               32      float
      MPI_REAL8               64      double
  4. The call MPI_Get_elements incorrectly rounded the element count down when the received data count recorded in the passed status object was not a multiple of the passed datatype size, i.e., where the call MPI_Get_count would return count MPI_UNDEFINED. This has been fixed.
  5. Error class enquiry MPI_Error_class returned the incorrect value for many error codes. This has been corrected.
  6. Process topology information was not propagated to duplicated communicators within MPI_Comm_dup. Also, zero-dimensional cartesian topologies and empty graphs were disallowed; this restriction has been relaxed.
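
Item 1 above concerns the extent of vector datatypes. As a rough illustration only (this is not EPCC's code), the sketch below computes the extent that the MPI standard defines for MPI_Type_vector and MPI_Type_hvector by enumerating block displacements. It assumes at least one block and a base datatype without padding, so that its extent equals its size. The function names and the 8-byte example values are hypothetical.

```python
def block_span(starts, block_bytes):
    """Extent in bytes: upper bound minus lower bound over all blocks."""
    lower = min(starts)
    upper = max(s + block_bytes for s in starts)
    return upper - lower


def vector_extent(count, blocklength, stride, base_extent):
    """MPI_Type_vector: stride is counted in units of the base datatype extent."""
    starts = [j * stride * base_extent for j in range(count)]
    return block_span(starts, blocklength * base_extent)


def hvector_extent(count, blocklength, stride_bytes, base_extent):
    """MPI_Type_hvector: stride is given directly in bytes."""
    starts = [j * stride_bytes for j in range(count)]
    return block_span(starts, blocklength * base_extent)


# Hypothetical example: 3 blocks of 2 elements, 8-byte base type.
print(vector_extent(3, 2, 4, 8))     # stride of 4 elements
print(hvector_extent(3, 2, 32, 8))   # the same layout, stride given as 32 bytes
print(vector_extent(3, 2, -4, 8))    # negative stride covers the same span
```

All three calls describe an 80-byte span. An extent that also counted a stride after the last block would come to 3 * 4 * 8 = 96 bytes for the first case, which is the kind of discrepancy the fix addresses.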

Software Faults

Corrected bugs in the MPI implementation:
  1. Communicator constructor operations (MPI_Comm_{create,dup,split} etc.) applied to communicators created using MPI_Comm_create could fail, either causing a hang or a fault. When ranks were not re-ordered, this also applied to communicators generated with process topology, using MPI_{Cart,Graph}_create. This error has been fixed.
  2. The freeing of internal MPI data structures on completion of immediate (non-blocking) sends with small message size (4Kbytes or less) was incorrect, causing a memory leak. This applied to all send modes except Synchronous mode: MPI_Isend, MPI_Ibsend, MPI_Irsend. This resulted in an operand range error within an internal allocation routine after a large number of immediate sends. The problem has been resolved.
  3. Use of logical global reduction operators (MPI_LAND, MPI_LOR, MPI_LXOR) from Fortran using basic datatype MPI_LOGICAL gave incorrect results. This has been corrected.
  4. The MPI abort and fatal error mechanisms could leave the application hanging. This has been corrected: the entire application is now terminated if any process calls MPI_Abort or generates a fatal MPI exception.
  5. There were problems in group reference counting, allowing MPI_Group_free to de-allocate group tables for communicators when obtained using MPI_Comm_group. This has been corrected.
  6. The group range include and exclude operations MPI_Group_range_{incl,excl} failed for zero-length ranges and did not check for invalid ranges. This has been fixed.
  7. Freeing inactive (completed) persistent receive requests using MPI_Request_free could generate a fault. This has been corrected.
  8. Communicator attributes with zero (or NULL) value were not found by MPI_Attr_get. This has been resolved.
  9. Attribute Copy and Delete functions written in Fortran caused a fault when invoked. Also, the Fortran binding to MPI_Attr_put incorrectly stored a reference to the attribute value, instead of caching the value itself. Both problems have been corrected.
  10. Negatively strided datatypes created using MPI_Type_(h)vector were not transferred correctly, causing message corruption or fault. This has been fixed.

MPI 2 Conformance

The MPI 2 One-sided Communications prototype integrated with MPI for Cray T3D aims to implement a subset of the One-sided Communications MPI 2 Forum subcommittee proposal prior to Supercomputing '95 (early December 1995). This interface is in flux and will not stabilise until the draft MPI 2 standard is finalised; this is scheduled to be presented at Supercomputing '96 (late November 1996).

Corrected problems with conformance of the MPI 2 One-sided Communications prototype:

  1. The counting request table passed to MPI2_RMC_init was required to be a valid reference even if the number of counting requests attached to the target window was given as zero. The proposal on which the prototype implementation is based allows a NULL reference to be passed where the table size is zero. (The User documentation supported the proposal rather than the implementation.) The implementation has been corrected.

Enhancements

  1. The MPI abort and exception mechanism has been improved. Error reporting is now output on the standard error stream (stderr) rather than standard output (stdout) to avoid corrupting user output. This stream is character buffered, meaning that error reports from multiple processes may be interleaved. Interleaving can be avoided by switching error reporting back to standard output, which is done by defining the environment variable MPI_ERR_STDOUT to any value.
  2. To avoid taking up filespace, MPI suppresses generation of debug core (mppcore) files on abort or fatal exceptions. Core file generation can be re-enabled by defining the environment variable MPI_ERR_CORE to any value, allowing use of the debugger on the process state at the point of termination.

Release 1.5a MPI for Cray T3D: Known Problems and Support

This page gives an informal list of Known Problems with the current release of CRI/EPCC MPI for Cray T3D. There is a formal page describing problems in more detail, with references to some resources such as revised code modules. The formal page is only accessible by Cray T3D site contacts.

Please refer MPI for T3D problems to your local support group. They can contact the implementors at the following e-mail address if they are unable to resolve the problems.

CRI/EPCC MPI for Cray T3D Product Information

CRI/EPCC MPI for Cray T3D is supported on an informal basis except for those sites engaged in a Support Licence Agreement with Edinburgh Parallel Computing Centre (see Licencing Information).

Release and Installation

Problems with the release and installation mechanism:
  1. Problems relating to limits on the branching distance within an executable can occur with the MPI library. The problem is only prevalent for very large applications, and the symptom is a loader message such as "...calculated a relative branch target at too great a distance". A special library resolving this problem (with small performance overhead) is available to licenced sites.

MPI Conformance

This MPI implementation aims to conform to the MPI Standard v1.1 dated 12 June 1995. Known problems with MPI conformance:
  1. The passing of Fortran CHARACTER arrays and strings as message data to all relevant MPI calls is not possible. This can cause an address error fault or message/memory corruption. It is possible to circumvent this problem by using the Fortran EQUIVALENCE mechanism to access CHARACTER data via a suitably sized array of a safe type (e.g., INTEGER). Refer to Using MPI on the Cray T3D for more information.
  2. Use of derived datatypes of zero overall size will cause an error of class MPI_ERR_DATATYPE to be returned. This case should be equivalent to using zero datatype count.
  3. The persistent communications activation operations MPI_Start and MPI_Startall may change the values of the MPI_Request handle(s) passed.

Software Faults

Known bugs in the MPI implementation:
  1. Message transfers that occupy less than the full receive buffer in an MPI communication, and whose size is not a multiple of the receive MPI datatype size, may be truncated. The amount of message data transferred is rounded down to the nearest datatype multiple. This problem can be avoided by using a receive datatype of smaller size that matches the transfer.
  2. Passing an MPI_Op handle that refers to a freed user operator to MPI global reduction calls (e.g., MPI_Reduce) results in an address error.
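
The truncation described in item 1 of this list amounts to integer division. The following minimal sketch (a hypothetical function and made-up sizes, not the library's code) shows how many bytes would arrive when a transfer is rounded down to a whole number of receive datatypes, and why a smaller receive datatype avoids the loss.

```python
def bytes_delivered(message_bytes, recv_type_bytes):
    """Round the transfer down to a whole number of receive datatypes."""
    return (message_bytes // recv_type_bytes) * recv_type_bytes


# Hypothetical case: a 20-byte message received as 8-byte elements
# loses its last 4 bytes.
print(bytes_delivered(20, 8))

# Receiving the same message as 4-byte elements matches the transfer.
print(bytes_delivered(20, 4))
```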

Software Limitations

Known limitations in the MPI implementation:
  1. The number of outstanding communications relating to any process is limited. Once this limit is exceeded the application will abort with an informative error message. This is not an MPI exception, meaning that it is not handled by an attached user error handler routine. The limit can be controlled using the environment variable MPI_SM_POOL.

MPI 2 Conformance

The MPI 2 One-sided Communications prototype integrated with MPI for Cray T3D aims to implement a subset of the One-sided Communications MPI 2 Forum subcommittee proposal prior to Supercomputing '95 (early December 1995). This interface is in flux and will not stabilise until the draft MPI 2 standard is finalised; this is scheduled to be presented at Supercomputing '96 (late November 1996). There are significant differences between the proposal that is current at the time of writing (June 96) and the 1.5a implementation.

EPCC/MPI And MPICH: Timings and Question Marks

It seems like the time to repeat some MPI timings.

As noted above, we have a new version of EPCC/MPI. I have also downloaded version 1.0.12 of MPICH (ANL and Mississippi State's implementation of MPI -- see:

In Newsletter #66, Mike presented some comparisons between EPCC/MPI 1.4a and MPICH 1.0.11 based on a sample program which measures "the time it takes for a message to be sent around a ring of processors." I recompiled this simple program, ring.f, using each of four MPI versions, on each of two compilers, cf77 and f90, and did my runs. This seemingly simple chore turned out to be problematical (hence, the "Question Marks").

After some experimenting, I came up with a revised program, which I'll call "ring.simple.f". It meets the following goals (when compiled with f90) which "ring.f" did not meet:

  1. does not crash when linked with MPICH
  2. behaves "normally" when linked with EPCC/MPI

The salient features of these timings: EPCC timings have improved at most message sizes; MPICH timings have worsened; and EPCC times still grow faster with message size than MPICH times do.

Table 1

Transfer rates (PE to PE, no ACK) in Microseconds Per REAL*4 Value Obtained from /mpp/bin/f90 "ring.simple.f"

  Size           EPCC      EPCC     MPICH     MPICH
  (Elements)     1.4a      1.5a    1.0.11    1.0.12
    1            38.5      35.4      49.7      57.9  
    2            40.2      36.2      49.7      58.3  
    3            39.3      39.3      54.1      59.1  
    4            46.5      40.0      54.2      59.1  
    7            47.5      41.4      55.2      60.4  
    8            49.0      36.6      54.8      59.9  
   15            51.2      58.9      50.4      54.9  
   16            51.4      54.0      50.5      55.0  
   31            54.8      56.5      50.6      55.9  
   32            55.0      57.1      50.7      56.2  
   63            63.6      64.0      51.6      56.6  
   64            64.2      60.9      51.7      57.0  
  127            80.3      73.2      53.9      59.7  
  128            80.5      69.7      53.9      60.1  
  255           134.8      92.0      58.1      64.2  
  256           134.9      88.8      58.7      64.2  
  511           202.0     140.2      67.7      73.5  
  512           202.2     135.9      67.7      73.9

The difference between "ring.simple.f" and "ring.f" is that "ring.f" sends/receives from/to sequentially increasing block locations within the overall transfer array. By the end of a run, "ring.f" will have sent the entire array, while "ring.simple.f" will have sent the first part, over and over. (See the program listings, below.)
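
To make the difference concrete, here is a small sketch of the loop arithmetic. The message sizes and pass counts follow the N2 and I loops in the listing (NN = 2**N2 + I, NR = 8192/(2**N2)). The block layout for "ring.f" is an assumption: the sketch supposes that pass k starts at element k*NN, which is one plausible reading of "sequentially increasing block locations" and is not taken from the source code. All function names are illustrative.

```python
BUFFER_LEN = 8192  # elements, from the BUFFER declaration in the listing


def message_sizes():
    """Yield (NN, NR): elements per message and passes around the ring."""
    for n2 in range(1, 10):        # DO N2 = 1,9
        for i in (-1, 0):          # DO I = -1,0
            yield 2**n2 + i, BUFFER_LEN // 2**n2


def block_starts(nn, nr, sequential):
    """0-based first element sent on each pass.

    ASSUMPTION: with sequential=True (ring.f), pass k starts at k*nn.
    With sequential=False (ring.simple.f), every pass starts at 0.
    """
    return [k * nn if sequential else 0 for k in range(nr)]


for nn, nr in message_sizes():
    last_simple = block_starts(nn, nr, False)[-1] + nn
    last_sequential = block_starts(nn, nr, True)[-1] + nn
    print(f"NN={nn:4d}  NR={nr:5d}  elements touched: "
          f"simple={last_simple:5d}  sequential={last_sequential:5d}")
```

Under this assumption the 512-element case walks through all 8192 elements in 16 passes, while the simple variant resends elements 1 through 512 each time.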

For Newsletter #66, the EPCC/MPI 1.4a timings were produced from ring.f compiled with cf77, while f90 was used for the MPICH 1.0.11. I've repeated that combination here, but included f90 runs for EPCC 1.5a as well.

The EPCC 1.4a version hangs when compiled under f90 (see "fixed problem" #3, above). Although 1.5a under f90 doesn't hang, it gives me problems on odd array locations, as Table 2 shows. I did not include my cf77 compilations using MPICH but, FYI, regardless of the MPICH version, they crash on the 511-element buffer.

Table 2

Transfer rates (PE to PE, no ACK) in Microseconds Per REAL*4 Value Obtained from /mpp/bin/f90 "ring.f" (Except as noted)

  Size        1.4a      1.5a      1.5a    1.0.11  1.0.12
  (Elements)  cf77       f90      cf77      f90     f90
    1         40.5       36.9     31.4      51.2    62.1  
    2         43.6       37.6     32.2      51.1    62.1  
    3         43.1       51.4     35.7      66.4    74.2  
    4         53.0       41.4     36.5      55.8    62.8  
    7         54.2      379.4     51.3      67.1    75.0  
    8         55.8       39.5     45.4      56.5    63.8  
   15         58.7      840.5     54.8      52.1    57.9  
   16         59.6       41.7     49.7      51.3    57.5  
   31         65.5     1615.4     57.9      53.4    59.2  
   32         65.5       59.8     52.7      52.1    58.9  
   63         76.3     3138.4     62.9      55.0    61.0  
   64         75.2       64.3     57.1      53.4    60.5  
  127         92.3     6190.1     73.6      57.8    63.9  
  128         91.0       72.1     67.6      56.2    64.6  
  255        141.2    12291.7     89.6      63.4    69.3  
  256        140.4       91.9     85.2      61.1    69.0  
  511        208.9    24487.6    140.4      71.5    78.7  
  512        209.3      137.8    135.6      69.3    78.4

My final timings are for a version of "ring.f" which passes REAL*8 values instead of REAL*4 values. EPCC/MPI versions behave "normally" with either f90 or cf77. MPICH versions hang or crash with either f90 or cf77.

Table 3

Transfer rates (PE to PE, no ACK) in Microseconds Per REAL*8 Value Obtained from /mpp/bin/f90 "ring.real8.f"

  Size           EPCC      EPCC     MPICH     MPICH
  (Elements)     1.4a      1.5a    1.0.11    1.0.12
    1            40.3      37.3      51.2      62.7 
    2            43.2      41.8      55.8      63.7 
    3            42.4      41.4      56.0      64.1 
    4            49.9      40.4      56.6      64.6 
    7            51.2      42.5      51.2      56.9 
    8            52.3      42.5      51.6      57.8 
   15            54.8      61.0      52.0      58.8 
   16            54.9      61.0      52.3      59.2 
   31            60.4      64.2      53.4      60.7 
   32            60.5      65.5      53.6      61.1 
   63            69.1      72.7      56.1      63.9 
   64            69.0      71.6      56.0      64.3 
  127            86.2      94.3      60.8      68.7 
  128            84.5      93.3      61.0      69.3 
  255           137.3     139.1      70.1      78.8 
  256           137.7     140.2      69.9      78.6 
  511           206.0     205.1     HANGS  Operand range error
  512           206.1     207.4      ****      ****

I confess to presenting these results rather naively, as this was my first experience with MPI. I'll keep digging, however, and will gladly accept feedback from MPI programmers out there. Eventually, hopefully next week, I'll revisit these timings in this newsletter.
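
One way to read the tables above is to convert a time into an effective transfer rate. The sketch below does this for the 512-element row of Table 1. It rests on two assumptions: that each tabulated time is the time for one PE-to-PE message of the given size (the WRITE statement in the listing divides elapsed time by NR*NPES), and that a REAL*4 occupies 4 bytes under f90, as the release notes state. The function name is illustrative.

```python
def megabytes_per_second(elements, bytes_per_element, microseconds):
    """Bytes per microsecond is numerically megabytes (10**6 bytes) per second."""
    return elements * bytes_per_element / microseconds


# Table 1, 512-element row (f90, ring.simple.f)
epcc_15a = megabytes_per_second(512, 4, 135.9)
mpich_1012 = megabytes_per_second(512, 4, 73.9)

print(f"EPCC 1.5a:    {epcc_15a:.1f} MB/s")
print(f"MPICH 1.0.12: {mpich_1012:.1f} MB/s")
```

The same conversion applied to the 1-element row is dominated by per-message latency rather than by data volume, so the two ends of each table measure different things.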

Here are the sources:

  # epcc mpi 1.4a

  # epcc mpi 1.5a

  # 1.0.11 mpich

  # 1.0.12 mpich

  # FC=TARGET=cray-t3d /mpp/bin/cf77
  FC=TARGET=cray-t3d /mpp/bin/f90

  # SRC=ringI.f
  # SRC=ring8.f

  all:        epcc14 epcc15 mpich11 mpich12

  epcc14:        $(SRC)
          $(FC) -dp -X$(NPES) $(SRC) -I$(MPI_14_INC_PATH) -L$(MPI_14_LIB_PATH) -lmpi
          a.out -npes $(NPES)

  epcc15:        $(SRC)
          $(FC) -dp -X$(NPES) $(SRC) -I$(MPI_15_INC_PATH) -L$(MPI_15_LIB_PATH) -lmpi
          a.out -npes $(NPES)

  mpich11:        $(SRC)
          $(FC) -dp -X$(NPES) $(SRC) -I$(MPICH_11_INC_PATH) -L$(MPICH_11_LIB_PATH) -lmpi
          a.out -npes $(NPES)

  mpich12:        $(SRC)
          $(FC) -dp -X$(NPES) $(SRC) -I$(MPICH_12_INC_PATH) -L$(MPICH_12_LIB_PATH) -lmpi
          a.out -npes $(NPES)

  clean:
          -rm a.out *.o mppcore

  Program:  ring.simple.f
        INTEGER          MPROC,NPROC
        INCLUDE "mpif.h"
  *     REAL*8 MPI_Wtime
        REAL*8 T0,T1
        REAL*4  BUFFER(8192)
        MYPEM = MOD( NPES + MYPE - 1, NPES)
        MYPEP = MOD(        MYPE + 1, NPES)
        IF     (MYPE.EQ.0) THEN
          CALL FLUSH(6)
        DO I= 1,8192
          BUFFER(I) = I
        DO N2= 1,9
          DO I= -1,0
            NN = 2**N2 + I
            NR = 8192/(2**N2)
            T0 = MPI_Wtime()
            DO IRING= 0,NR-1
              IF     (MYPE.EQ.0) THEN
       +                      MYPEP, 9901, MPI_COMM_WORLD,
       +                      MPIERR)
       +                      MYPEM, 9901, MPI_COMM_WORLD,
       +                      MPISTAT, MPIERR)
       +                      MYPEM, 9901, MPI_COMM_WORLD,
       +                      MPISTAT, MPIERR)
       +                      MYPEP, 9901, MPI_COMM_WORLD, 
       +                      MPIERR)
            T1 = MPI_Wtime()
            IF     (MYPE.EQ.0) THEN
              WRITE(6,6000) NN,(T1-T0)*1.0D6/(NR*NPES)
              CALL FLUSH(6)
   6000 FORMAT(' BUFFER = ',I6,'   TIME =',F10.1,' Microsec')
  C     END OF RING.

  Program:  ring.f
  Identical to ring.simple.f except for the arguments to the MPI_SEND and
  MPI_RECV calls.  This version transfers different, sequential segments
  of the entire buffer on each pass.  (This program also appears in 
  Newsletter #66.)

            DO IRING= 0,NR-1
              IF     (MYPE.EQ.0) THEN
       +                      MYPEP, 9901, MPI_COMM_WORLD,
       +                      MPIERR)
       +                      MYPEM, 9901, MPI_COMM_WORLD,
       +                      MPISTAT, MPIERR)
       +                      MYPEM, 9901, MPI_COMM_WORLD,
       +                      MPISTAT, MPIERR)
       +                      MYPEP, 9901, MPI_COMM_WORLD, 
       +                      MPIERR)

  Program:  ring.real8.f
  Identical to ring.f except that it passes 8 byte reals instead of 
  4 byte reals.

        REAL*8  BUFFER(8192)

            DO IRING= 0,NR-1
              IF     (MYPE.EQ.0) THEN
       +                      MYPEP, 9901, MPI_COMM_WORLD,
       +                      MPIERR)
       +                      MYPEM, 9901, MPI_COMM_WORLD,
       +                      MPISTAT, MPIERR)
       +                      MYPEM, 9901, MPI_COMM_WORLD,
       +                      MPISTAT, MPIERR)
       +                      MYPEP, 9901, MPI_COMM_WORLD, 
       +                      MPIERR)

Happy Solstice...

The sun rose at 2:59 AM and it will set tomorrow at 12:48 AM... but it all depends on your point of view.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.