ARSC T3D Users' Newsletter 31, April 14, 1995

Additional NQS Queues on the T3D

There are now five queues on Denali for accessing the T3D. The two newest queues facilitate using large PE configurations on the T3D: they allow one 64 PE job and one 128 PE job to run for 5 minutes and 10 minutes, respectively. The T3D queues on Denali are now:


  QUEUE NAME    LIM TOT ENA STS QUE RUN WAI HLD ARR EXI
  -----------   --- --- --- --- --- --- --- --- --- ---
  m_8pe_24hr      4   0 yes  on   0   0   0   0   0   0
  m_16pe_24hr     2   0 yes  on   0   0   0   0   0   0
  m_32pe_24hr     1   0 yes  on   0   0   0   0   0   0
  m_64pe          1   0 yes  on   0   0   0   0   0   0
  m_128pe         1   0 yes  on   0   0   0   0   0   0
A request made to these queues will be run as soon as enough PEs are available to satisfy the request.

Most T3D users currently have a limit of 32 PEs for batch access. Users can check their limits with the udbsee command:


  udbsee | grep jpelimit
The output will indicate their limits in interactive (i) and batch (b). For example:

  jpelimit[b]     :32:
  jpelimit[i]     :8:
If the batch PE limit is too small to access these new NQS queues, please contact Mike Ess to have the PE batch limits increased. Users can query the NQS batch system with the command:

  qstat -a
to see what other NQS jobs are scheduled to run on the T3D. The mppmon utility shows which jobs are currently running on the T3D. T3D jobs are executed on a "first fit" basis and run to completion without interruption.
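As a rough illustration of the limit check described in this section, the Python sketch below parses jpelimit lines in the format shown in the example output and lists the queues a given batch limit can reach. The PE count per queue is an assumption inferred from the queue names, and the function names are hypothetical; this is not an ARSC or CRI tool.

```python
import re

# Assumption: each queue needs the PE count that appears in its name.
QUEUE_PES = {
    "m_8pe_24hr": 8,
    "m_16pe_24hr": 16,
    "m_32pe_24hr": 32,
    "m_64pe": 64,
    "m_128pe": 128,
}


def parse_jpelimits(text):
    """Return {'b': batch_limit, 'i': interactive_limit} from udbsee-style lines."""
    limits = {}
    for match in re.finditer(r"jpelimit\[([bi])\]\s*:(\d+):", text):
        limits[match.group(1)] = int(match.group(2))
    return limits


def usable_queues(batch_limit):
    """List the queues whose assumed PE requirement fits within the batch limit."""
    return [name for name, pes in QUEUE_PES.items() if pes <= batch_limit]


# Sample text copied from the example output in this section.
sample = "jpelimit[b]     :32:\njpelimit[i]     :8:"
limits = parse_jpelimits(sample)
print(usable_queues(limits["b"]))
# prints ['m_8pe_24hr', 'm_16pe_24hr', 'm_32pe_24hr']
```

With a batch limit of 32, the two new queues are out of reach, which is the case where the limit needs to be raised.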

The Switch to the 1.2 Programming Environment

We are now running on the 1.2 Programming Environment. If you have any problems that you think are related to this upgrade, please contact Mike Ess.

I will be describing the differences between 1.1 and 1.2 in the next few newsletters.

New man Pages

There are now separate man pages for the MPP versions of the cf77, cft77 and f90 products. Users can access these MPP pages with the commands:

  man 1m cf77
  man 1m cft77
  man 1m f90
In this newsletter I have extracted sections from the

          
  Programming Environment 1.2 and Compiler Products Releases letter
  Programming Environment 1.2 Release Letter, RL-1212 1.22
that describe the changes and additions in the 1.2 Programming Environment. These letters cover the Y-MP, T90, T3D and SPARC products, and I have used only the sections related to the T3D. I will e-mail both complete documents on request. The excerpts below cover the changes for Fortran 90 and Craft Fortran for the T3D. All lines from the release documents are prefaced with the "greater than" character (>).

Limited Implementation of F90 on the T3D

From the 1.2 Programming Environment Release Letter from CRI:

  
  > 5.0.5 Limited functionality in CF90 0.1.1 compiler for Cray MPP systems
  > ---------------------------------
  > 
  > The CF90 0.1.1 compiler is not a fully Fortran 90 compliant compiler. The
  > following Fortran 90 features are not supported in this release: modules,
  > double precision (128 bits), double complex (256 bits), internal procedures,
  > array constructors and declarations containing intrinsic functions.
  > Also, CF90 0.1.1 does not support the shared programming model in CRAFT,
  > nor does it support BUFFER IN and BUFFER OUT statements (CF77 extensions
F90 has many new syntax features that seem to lend themselves to multiprocessing, but the F90 currently on the T3D is only a single-PE compiler. Unlike Craft Fortran, it has no mechanism for distributing data, but it does work with the standard message-passing libraries such as PVM and SHMEM. Even when CRI does implement a mechanism for distributing data, the simplest case will be an array. In Craft Fortran, the restrictions on such shared arrays are substantial, and I would expect similar restrictions in a future CRI F90 product.

Currently there is no F90 manual for the T3D, but I have a 4 page ASCII paper describing what is in the release:


  "Signal-Processing(32-bit precision) Support for CRAY T3D Systems" SN-2191 1.2
I believe this is the only document specific to the MPP F90 product other than the manpage on denali.

The New 6.2 Craft Fortran Compiler

The new 6.2 Craft Fortran manual is available through docview on Denali. There is no other documentation on 6.2 Craft Fortran supplied by CRI.

From the Programming Environment 1.2 and Compiler Products Releases letter:


  
  > ---------------------------------
  > 4. CF77_M 6.2
  > 
  > 4.1 Introduction
  > 
  > This purpose of this release is to enhance performance and to add the features;
  > shared WHERE, SERIAL_ONLY and PARALLEL_ONLY directives, and shared Fortran
  > 90 Intrinsics.
  > 
  > 4.2 Features
  > 
  > ---------------------------------
  > 
  > 4.2.1 Unrolling techniques
  > 
  > Description:
  > The compiler will perform loop unrolling to reveal cross-iteration memory
  > optimizations opportunities (read-after-write, read-after-read),
  > increasing opportunities for scheduling by increasing basic block size,
  > reducing loop overhead and improving chances for cache-hits.
  > 
  > The compiler will also perform unroll-and-jam and loop interchange on loop
  > nests to increase temporal and spatial locality of memory references. This
  > should result in increase in percentage of cache hits for the loop.
  > 
  > 
  > The compiler will also now recognize the following compiler directives:
  > 
  > cdir$ UNROLL[n]/NOUNROLL
  > 
  > The directive specifies whether to unroll or not to unroll the next loop.
  > "n" is optional, and specifies the amount of additional copies of the body
  > (note that "n" should be positive if specified). The presence of "n" will
  > disable automatic unrolling consideration of other loop in the same loop
  > nest.
  > 
  > The UNROLL directive can only be used on a loop whose trip count can be calculated
  > before entering the loop. When UNROLL is specified on a non-innermost loop,
  > the inner containing loops of this loop must be tightly nested; the compiler
  > will assume there is no dependence hazard which prevent UNROLL & JAM transformation
  > of this loop.
  > 
  > 
  > The compiler will also recognize the following command-line option:
  > 
  > -o nounroll/unroll[n]
  > 
  > This option controls unrolling of loops. n can be 1, or 2, indicating the interaction
  > with the source directives. "-o nounroll" indicates no automatic unrolling
  > will occur, and all UNROLL directives will be ignored. "-o unroll1" indicates
  > that unrolling will only occur for those loops marked with an UNROLL directive.
  > "-o unroll2" indicates that unrolling will be attempted for all loops. The
  > NOUNROLL compiler directive overrides the "unroll[n]" command-line option.
  > Default: nounroll. Default value of n: 2
  > 
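The interaction between the -o option and the source directives described above can be summarized as a small decision function. This Python sketch is only a reading of the release text, not compiler code: the function name is hypothetical, and treating a bare "unroll" as "unroll2" is an assumption based on the stated default value of n.

```python
def loop_is_unroll_candidate(option="nounroll", directive=None):
    """Decide whether the compiler would consider unrolling a loop.

    option:    "nounroll", "unroll", "unroll1" or "unroll2" (the -o setting)
    directive: None, "UNROLL" or "NOUNROLL" (the cdir$ directive on the loop)
    """
    if option == "unroll":
        option = "unroll2"  # assumption: bare "unroll" takes the default n of 2
    if option == "nounroll":
        return False  # no automatic unrolling, and UNROLL directives are ignored
    if directive == "NOUNROLL":
        return False  # NOUNROLL overrides the unroll[n] command-line option
    if option == "unroll1":
        return directive == "UNROLL"  # only loops marked with UNROLL
    if option == "unroll2":
        return True  # unrolling is attempted for all loops
    raise ValueError("unknown option: " + option)


for opt in ("nounroll", "unroll1", "unroll2"):
    for d in (None, "UNROLL", "NOUNROLL"):
        print(opt, d, loop_is_unroll_candidate(opt, d))
```

Note that under the default (nounroll) an UNROLL directive in the source has no effect until the option is changed.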
  ...
  > External impact:
  > The user could see minor numerical differences due to different order of
  > evaluation for reduction-type loops.
  > 
  > Performance impact:
  > The user should see less execution time for restructured loop nests. Compile
  > time could increase due to extra compiler analysis and extra code from unrolling.
  > 
  > Documentation affected:
  > Cray MPP Fortran Reference Manual, publication SR-2504
  > cf77 man page.
  > 
  > 
  > 4.2.2 Shared Where
  > 
  > Description:
  > User will be able to use explicitly shared data in the mask expression of the
  > WHERE statement. Shared and private data cannot be mixed in the expression.
  > UNKNOWN_SHARED data cannot be referenced in the mask expression.
  > 
  ...
  > 
  > 4.2.3 Array Intrinsics Support for Private Data
  > 
  > Description:
  > The following Fortran 90 array intrinsics in CF77 6.2 for MPP support PRIVATE
  > data: ALL, ANY, COUNT, CSHIFT, DOT_PRODUCT, MATMUL, MAXLOC, MAXVAL, MERGE,
  > MINLOC, MINVAL, PRODUCT, SUM, TRANSPOSE, PACK, RESHAPE, and UNPACK.
  > 
  > The EOSHIFT array intrinsic may be called with PRIVATE data for array arguments
  > with one to seven dimensions.
  > 
  ...
  > 
  > 4.2.4 Array Intrinsics Support for Shared Data
  > 
  > Description:
  > The user can pass shared data to the following array intrinsics: ALL, ANY,
  > COUNT, CSHIFT, DOT_PRODUCT, MATMUL, MAXLOC, MAXVAL, MERGE, MINLOC, MINVAL,
  > PRODUCT, SUM, and TRANSPOSE.
  > 
  > The EOSHIFT array intrinsic may be called with SHARED array arguments with
  > one to three dimensions. SHARED array arguments with greater than three
  > dimensions in a call to EOSHIFT are not supported in CF77 6.2. A diagnostic
  > will be given indicating deferment of the use of more than three dimensions
  > in a SHARED array argument to EOSHIFT.
  > 
  ...
  > 
  > Performance impact:
  > The user may see performance degradation when mixing private and shared
  > data, using expressions or array sections. These features cause the compiler
  > to create private temporaries. The private temporaries force the intrinsics
  > to run redundantly on each PE.
  >
  ...
  > 
  > 4.2.5 Argument Shape Checking
  > 
  > Description:
  > A compiler option (-Rd) was added which causes the compiler to generate runtime
  > code at the beginning of each subroutine to check that the "shape" of the dummy
  > and actual arguments conform to restrictions in the CRAFT programming model.
  > In particular, it checks that the dummy argument has no more dimensions than
  > its actual argument, and that each dimension of the dummy argument (except
  > possibly its last dimension) has the same extents as the corresponding actual
  > argument. If this restriction is violated, the program will print an error
  > message and abort. The code will also check that the last dimension of the
  > dummy argument is less than or equal to its corresponding dimension in the
  > actual argument. If this restriction is violated, it is a strong indicator
  > of over-indexing, and the program will print a one-time warning message
  > and proceed.
  > 
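The checks that -Rd is described as inserting can be modeled in a few lines. The Python sketch below follows the wording of the description above: shapes are tuples of extents, a fatal mismatch raises an exception (standing in for "print an error message and abort"), and a too-large last dimension produces a warning. The names are hypothetical, and the "one-time" nature of the warning is not modeled.

```python
class ShapeError(Exception):
    """Stands in for the runtime error message and abort."""


def check_argument_shape(dummy, actual):
    """Return a list of warnings; raise ShapeError on a fatal mismatch.

    dummy, actual: tuples of extents, e.g. (4, 256) for REAL B(4, 256).
    """
    if len(dummy) > len(actual):
        raise ShapeError("dummy argument has more dimensions than actual argument")
    # Every dummy dimension except the last must match the actual extent.
    for position, (d, a) in enumerate(zip(dummy[:-1], actual), start=1):
        if d != a:
            raise ShapeError("extent mismatch in dimension %d: %d vs %d" % (position, d, a))
    warnings = []
    last = len(dummy) - 1
    if dummy and dummy[last] > actual[last]:
        warnings.append("last dummy dimension exceeds actual: possible over-indexing")
    return warnings


print(check_argument_shape((4, 256), (4, 256)))  # prints []
print(check_argument_shape((4, 512), (4, 256)))  # one warning, execution continues
try:
    check_argument_shape((8, 256), (4, 256))
except ShapeError as err:
    print("abort:", err)
```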
  ...
  > 
  > Performance impact:
  > Use of this option will increase runtime of generated code.
  > 
  ...
  > 
  > 4.2.6 SERIAL_ONLY, PARALLEL_ONLY Directives
  > 
  > Description:
  > 
  > Two new directives:
  > 
  > CDIR$ PARALLEL_ONLY
  > and
  > CDIR$ SERIAL_ONLY
  > 
  > 
  > when appearing within the bounds of a subprogram, assert that only a parallel
  > version, or only a serial version, respectively, needs to be generated for
  > the subprogram. The compiler will issue an error for the following:
  > 
  > a) If either directive appears outside a program unit.
  > b) If either directive appears in the main program.
  > c) If both directives appear within the same program unit.
  > 
  > Utilizing these directives, the user can express whether this routine will
  > only be called from a serial or parallel region.
  > 
  ...
  > 
  > Performance impact:
  > If these directives are utilized, the user could see greatly reduced code
  > size for codes containing CRAFT parallel constructs, and slightly reduced
  > runtime due to reduction in overhead.
  > 
  ...
  > 
  > 4.2.7 Invariant Reciprocal Hoisting
  > 
  > Description:
  > A new command line option has been added: -o [no]ieeedivide. This command
  > line options allows the compiler to decompose divides into multiply by reciprocal
  > in situations where a performance gain can be realized. The default is "-o
  > ieeedivide". This option is available on only Cray T3D platforms.
  > 
  > Platform:
  > Cray T3D systems
  > 
  > Compatibilities and Differences:
  > None
  > External impact:
  > Under "-o noieeedivide", the user may see slight numerical differences
  > at runtime from the same program compiled with "-o ieeedivide". The user
  > may also see numerical differences between instances of the same computation
  > in different contexts of the same program.
  > 
  > Performance impact:
  > Under "-o noieeedivide", the user may see improved performance in codes
  > which contain divides with loop-invariant divisor, or sequences of divides
  > with the same divisor.
  > 
  > Documentation affected:
  > MPP Fortran Reference Manual (SG-2504)
  > 
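To see why "-o noieeedivide" can change results, compare a true divide with a multiply by a precomputed reciprocal. The Python sketch below runs in ordinary IEEE double precision on whatever machine executes it; it illustrates the arithmetic effect only and says nothing about the code the CF77_M compiler actually generates. The function names are hypothetical.

```python
def scale_with_divide(values, divisor):
    """One divide per element, as written in the source loop."""
    return [v / divisor for v in values]


def scale_with_reciprocal(values, divisor):
    """Reciprocal hoisted out of the loop, one multiply per element."""
    reciprocal = 1.0 / divisor
    return [v * reciprocal for v in values]


values = [float(k) for k in range(1, 101)]
exact = scale_with_divide(values, 49.0)
fast = scale_with_reciprocal(values, 49.0)
mismatches = [v for v, a, b in zip(values, exact, fast) if a != b]

print(49.0 / 49.0)          # prints 1.0
print(49.0 * (1.0 / 49.0))  # prints 0.9999999999999999
print(len(mismatches), "of", len(values), "results differ in the last bits")
```

The differences are at the level of rounding error, but they are enough to break bit-for-bit comparisons between runs compiled with the two settings.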
  > ---------------------------------
  > 
  > 4.2.8 Decrease Register spills & fix addressing constants
  > 
  > Description:
  > Code generation optimization occurs by decreasing register spills and
  > fixing addressing constants.
  > 
  ...
  > 
  > Performance impact:
  > Many codes should see performance improvements.
  > 
  ...
  > 
  > 4.2.9 Trusted Arguments
  > 
  > Description:
  > The current syntax and semantics of the SHARED and PE_PRIVATE data distribution
  > directives was changed to allow for an asterisk (*) to appear prior to a variable
  > name if the variable is a dummy argument.
  > 
  > For the SHARED directive the presence of an asterisk will assert that the
  > user declaration can be trusted, and no redistribution is necessary for
  > that argument. The absence of an asterisk will assert that (possibly) redistribution
  > is needed for that argument; this is the default.
  > 
  >       SUBROUTINE JOE(A,B,C)
  >       REAL A(1024), B(4, 256), C(1024)
  > CDIR$ SHARED *A(:BLOCK),B(:,:BLOCK),*C(:BLOCK)
  > 
  > indicates that A and C are "trusted" arguments and B might need to be redistributed.
  > 
  > ****Note, there is a requirement that a "trusted" dummy argument represents
  > ****the base of the array (and not some offset into the array). This prevents
  > ****aligning on "A" but corresponding elements of "C" are on different
  > ****processor. For example:
  > 
  >       REAL X(2048), Y(4, 256), Z(2048)
  > CDIR$ SHARED X(:BLOCK), Y(:BLOCK, :), Z(:BLOCK)
  > 
  >       CALL JOE(X(1), Y, Z(2))
  >       ....
  >       END
  > 
  > ****The call to "JOE" violates the "trusted" assertion because the base
  > ****address of "Z" is not passed as the actual argument and the corresponding
  > ****dummy argument "C" is trusted.
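The base-of-array requirement can be pictured with a simplified ownership model. The Python sketch below assumes that (:BLOCK) splits an array into equal contiguous blocks, one per PE, and uses a hypothetical partition of 8 PEs; neither assumption comes from the release letter. It then asks, for the call JOE(X(1), Y, Z(2)), at which indices i the element behind A(i) and the element behind C(i) live on different PEs in the caller's layout.

```python
NUM_PES = 8    # hypothetical partition size
EXTENT = 2048  # size of X and Z in the calling routine


def block_owner(index, extent=EXTENT, num_pes=NUM_PES):
    """0-based PE owning 1-based element `index` under the assumed BLOCK layout."""
    block = -(-extent // num_pes)  # ceiling division: elements per PE
    return (index - 1) // block


def misaligned_indices(offset, length=1024):
    """Indices i where A(i) = X(i) and C(i) = Z(i + offset) sit on different PEs."""
    return [
        i
        for i in range(1, length + 1)
        if block_owner(i) != block_owner(i + offset)
    ]


print(misaligned_indices(0))  # base address passed: prints []
print(misaligned_indices(1))  # Z(2) passed: prints [256, 512, 768, 1024]
```

With the base address passed, corresponding elements always share a PE; with an offset of one element, every block boundary produces a pair that does not.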
  > 
  > For the PE_PRIVATE directive the presence of an asterisk will assert that
  > the user declaration can be trusted, and no Shared-to-Private-Coercion
  > (S2PC) should be performed for that argument. The absence of an asterisk
  > will assert that (possibly) S2PC is needed for that argument; this is the
  > default.
  > 
  > CDIR$ PE_PRIVATE *U, V
  > 
  > indicates that no shared-to-private-coercion occurs for U, however, S2PC
  > might occur for V.
  > 
  > Command-Line Interaction
  > ------------------------
  > 
  > S2PC is performed by default for all private arguments. It can be disabled
  > by the -dC command-line option; the default setting is -eC. For the PE_PRIVATE
  > directive the presence of an asterisk will assert that shared-to-private
  > coercion will not be performed for that variable.
  > 
  > The "trust me" (no redistribute) assertion can be specified for all objects
  > from the command line with the -dR option. The default setting for this option
  > is "-eR". The "trust me" assertion on the SHARED directive overrides the
  > -eR command-line option.
  > 
  > Along with both of the above command line options, the user can force the compiler
  > to not honor the "trust me" assertion on the SHARED and PE_PRIVATE directives
  > (i.e. redistribute all SHARED arguments, or perform S2PC on all PRIVATE
  > arguments, regardless of whether or not the argument name is preceded by
  > an asterisk on the directive) by specifying "-dA" on the command-line. The
  > default setting for this option will be "-eA".
  > 
  > Platform:
  > Cray MPP systems
  > 
  > Compatibilities and Differences:
  > 
  ...
  > 
  > Performance impact:
  > If the directive or command line option is used, the user could see reduced
  > runtime for subroutines containing explicitly shared dummy arguments.
  > 
  > Documentation affected:
  > Cray MPP Fortran Reference Manual, publication SR-2504
  > 
  > ---------------------------------
  > 
  > 4.2.10 PreFetch Queue Utilization
  > 
  > Description:
  > The compiler hides the latency of remote memory reads through utilization
  > of the prefetch queue mechanism.
  > 
  ...
  > 
  > Performance impact:
  > The user should see less execution time for CRAFT codes containing remote
  > memory reads.
  > 
  
From the 1.2 Programming Environment Release Letter from CRI:
  
  > 1.6.1 CF77_M 6.2 compiler Loop Unrolling Directive
  > 
  > The implementation of the CDIR$ UNROLL directive in the CF77_M 6.2 compiler
  > differs from the CFPP$ UNROLL directive available in current versions of CF77
  > compiler for Cray PVP systems. More specifically, for CF77 on Cray PVP systems
  > the directive "CFPP$ UNROLL (n)" results in the stripmining of the loop by "n"
  > and for CF77_M, "CDIR$ UNROLL (n)" results in the stripmining of the loop
  > by "n+1".
  > 
  > For example, in order to "unroll a loop by 4" in CF77, the user sets the
  > directive argument "n" to 4
  > 
  >     "CFPP$ UNROLL (4)"
  > and in CF77_M, the user must set the directive argument "n" to 3
  >     "CDIR$ UNROLL (3)".
  > 
  > This difference will be removed in the first update of CF77_M, at which point
  > it will adopt the behavior of CF77 on the Cray PVP systems.
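The off-by-one between the two directives is easy to trip over, so here is a small Python sketch of the rule as stated in the excerpt above: the argument n maps to a loop unrolled by n for CFPP$ and by n+1 for CDIR$. The helper names are hypothetical, and the unrolled loop is a generic model (main loop plus cleanup iterations), not the code either compiler emits.

```python
def stripmine_factor(directive, n):
    """Number of body copies per trip of the unrolled loop."""
    if directive == "CFPP$":  # CF77 on Cray PVP systems
        return n
    if directive == "CDIR$":  # CF77_M 6.2 on the T3D
        return n + 1
    raise ValueError("unknown directive prefix: " + directive)


def run_unrolled(body, trip_count, factor):
    """Call body(i) for i in 0..trip_count-1 using a loop unrolled by `factor`."""
    i = 0
    while i + factor <= trip_count:
        for k in range(factor):  # stands in for `factor` textual copies of the body
            body(i + k)
        i += factor
    while i < trip_count:  # cleanup iterations when factor does not divide trip_count
        body(i)
        i += 1


# "Unroll by 4" needs a different argument for each compiler.
print(stripmine_factor("CFPP$", 4))  # prints 4
print(stripmine_factor("CDIR$", 3))  # prints 4

visited = []
run_unrolled(visited.append, 10, stripmine_factor("CDIR$", 3))
print(visited)  # prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```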
In the next newsletter I will discuss the differences and new timings of PVM in the 1.2 PE.

List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
  10. Missing Linpack and Eispack routines in libsci (Newsletter #25)
  11. F90 manual for Y-MP, no manual for T3D (Newsletter #31)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.