ARSC T3E Users' Newsletter 121, June 20, 1997

T3E Optimization Guide Available

Ed Anderson of the benchmarking group at CRI sent in the following technical publication:

  Title:    The Benchmarker's Guide to Single-processor Optimization 
            for CRAY T3E Systems  
  By:       Ed Anderson, Jeff Brooks, and Tom Hewitt; Benchmarking Group,
            Cray Research.
  Abstract: "How to use features of the CRAY T3E series processors 
            from a high-level language to improve the performance of
            application codes."

  Chapters: - Introduction
            - Hardware Overview
            - Functional unit optimization
            - Cache optimizations
            - Stream buffer optimizations
            - E-register optimizations
            - Compiling for performance
            - Case study: the NAS benchmark kernels
            - Performance tools

Every T3E programmer is encouraged to retrieve a copy. It is available in PostScript via anonymous ftp, in the directory pub/mpp/docs.

SHMEM on Origin 2000 & Performance Comparisons of Ocean Model

[ Many thanks to Dr. Alan Wallcraft, a scientist with the Naval Research Center, for sending this in and sharing his experience and results. ]

The SHMEM library is now available on the SGI Origin 2000. Its use is very similar to that on the T3E, except that REAL and INTEGER are 32-bit by default. If you use SAVEd local variables in SHMEM calls, you must compile with -LANG:recursive=on to prevent optimization around BARRIERs:

          f77 -64 -LANG:recursive=on fortran_program.f -lsma

I have found that the 64-bit ABI (f77 -64) is essential for practical cases. The only other differences I have encountered are START_PES to initialize SHMEM (can also be optionally used on T3E), and SHMEM_BARRIER_ALL in place of BARRIER. I have not been able to get T3E-style parallel I/O to work on the Origin and thus modified my code to optionally do all I/O (writes) from PE 0. The other gotcha is that startup involves memory mapping from /tmp. If this is too small, point TMPDIR to a large enough scratch disk.

On my standard ocean model test case, the Origin is almost identical in speed to a T3E-600. Unlike the T3E, however, the Origin does not dedicate processors to a single job. I have seen very little variation in SHMEM performance on "undersubscribed" systems (those with fewer active processes than processors), but I have heard of up to 10x degradation in MPI performance on "oversubscribed" Origins and would expect similar problems with SHMEM.

Most existing Origin systems have 64 or fewer CPUs. Larger machines have been sold, but are currently running in "pieces". The Origin is showing less scalability to a large number of nodes than the T3E-600, and falls over completely on 48 (and 64) nodes. A larger problem might exhibit better scalability above 32 nodes on the Origin, but the T3Es are still showing good scalability beyond 32 nodes even on this relatively small region.


    MACHINE        METHOD   NODES    TIME               SPEEDUP

    Cray C90       VECTOR      1     1.36 hrs    355    (REAL*8)

    Cray T3D       SHMEM       4     5.15 hrs     94    
    Cray T3D       SHMEM       8     2.70 hrs    179    1.91x  4 nodes
    Cray T3D       SHMEM      16     1.41 hrs    342    1.91x  8 nodes
    Cray T3D       SHMEM      32     0.81 hrs    596    1.74x 16 nodes

    Cray T3E-600   SHMEM       4     1.71 hrs    282    3.00x  T3D
    Cray T3E-600   SHMEM       8     0.91 hrs    532    1.89x  4 nodes
    Cray T3E-600   SHMEM      16     0.49 hrs    993    1.87x  8 nodes
    Cray T3E-600   SHMEM      32     0.28 hrs   1752    1.76x 16 nodes
    Cray T3E-600   SHMEM      48     0.20 hrs   2365    2.38x 16 nodes
    Cray T3E-600   SHMEM      88     0.16 hrs   3115    1.32x 48 nodes

    SGI O2000      SHMEM       4     1.70 hrs    284    0.99x  T3E-600
    SGI O2000      SHMEM       8     0.84 hrs    575    2.02x  4 nodes
    SGI O2000      SHMEM      16     0.43 hrs   1123    1.95x  8 nodes
    SGI O2000      SHMEM      32     0.28 hrs   1724    1.54x 16 nodes
    SGI O2000      SHMEM      48     0.27 hrs   1788    1.59x 16 nodes

    Cray T3E-900   SHMEM       4     1.44 hrs    335    1.19x  T3E-600
    Cray T3E-900   SHMEM       8     0.78 hrs    623    1.86x  4 nodes
    Cray T3E-900   SHMEM      16     0.40 hrs   1203    1.93x  8 nodes
    Cray T3E-900   SHMEM      32     0.23 hrs   2109    1.75x 16 nodes
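The scaling claims above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original report: it copies the wall-clock hours from the table and computes, for each machine, the speedup relative to its smallest (4-node) run and a parallel efficiency defined here as speedup divided by the increase in node count. That efficiency definition and the names HOURS and scaling are assumptions made for illustration. Because the table's times are rounded to two decimals, the results can differ slightly from the speedup column.

```python
# Wall-clock hours copied from the table above, keyed by machine and node count.
HOURS = {
    "Cray T3D":     {4: 5.15, 8: 2.70, 16: 1.41, 32: 0.81},
    "Cray T3E-600": {4: 1.71, 8: 0.91, 16: 0.49, 32: 0.28, 48: 0.20, 88: 0.16},
    "SGI O2000":    {4: 1.70, 8: 0.84, 16: 0.43, 32: 0.28, 48: 0.27},
    "Cray T3E-900": {4: 1.44, 8: 0.78, 16: 0.40, 32: 0.23},
}


def scaling(hours_by_nodes):
    """Return (nodes, speedup, efficiency) tuples relative to the smallest run.

    Efficiency is an assumed definition: speedup / (nodes / base_nodes).
    """
    base_nodes = min(hours_by_nodes)
    base_hours = hours_by_nodes[base_nodes]
    rows = []
    for nodes in sorted(hours_by_nodes):
        speedup = base_hours / hours_by_nodes[nodes]
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for machine, hours in HOURS.items():
        print(machine)
        for nodes, speedup, eff in scaling(hours):
            print(f"  {nodes:3d} nodes: {speedup:5.2f}x vs smallest run, "
                  f"efficiency {eff:4.0%}")
```

By this measure the Origin and the T3E-600 are both at roughly three quarters efficiency on 32 nodes, while at 48 nodes the Origin drops to about one half, which matches the remark that it stops scaling there.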


Reading List: High Performance Parallel Processing

Here is the beginning of a reading list. If you have any favorite texts or other sources, please send them in. We will expand this list and reprint it every few months.

MPI information sources:

  MPI: The Complete Reference. Snir, Otto, Huss-Lederman, 
  Walker and Dongarra.  MIT Press.
  ISBN 0 262 69184 1

  Using MPI. Gropp, Lusk, Skjellum. MIT Press.
  ISBN 0 262 57104 8

Parallel Programming Skills/Examples:

  Practical Parallel Programming. Gregory V. Wilson. MIT Press.
  ISBN 0 262 23186 7

  Designing and Building Parallel Programs. Ian Foster. Addison Wesley.
  ISBN 0 201 57594 9

Fortran:

  Fortran90/95 Explained. Metcalf and Reid. Oxford Science Publications.
  ISBN 0 19 851888 9

  Fortran90 Programming. Ellis, Philips, and Lahey. Addison-Wesley. 
  ISBN 0-201-54446-6

C and C++:

  Parallel Programming using C++. G.V. Wilson and P. Lu. MIT Press.
  ISBN 0 262 73118 5

Background information:

  Hal's Legacy: 2001's Computer as a Dream and a Reality.
  ISBN 0 262 19378 7

  High Performance Compilers for Parallel Computing. Michael Wolfe.
  ISBN 0-8053-2730-4

  Supermen. C.J. Murray. Wiley.
  ISBN 0 471 04885 2

ARSC T3E Queue Modifications

Yukon's queue structure has been revised as follows:

  m_64pe_8h@yukon has been removed.
  m_64pe_4h@yukon has been added.

The queue priorities are as follows:

  From 06:00 - 18:00, Alaska time:
    m_32pe_8h@yukon has priority over m_64pe_4h@yukon.

  From 18:00 - 06:00, Alaska time:
    m_64pe_4h@yukon has priority over m_32pe_8h@yukon.
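For scripts that decide where to submit a job, the day/night rule above can be written as a small lookup. This is only an illustrative sketch: the function name priority_order is hypothetical, the caller is assumed to pass the current Alaska local time, and treating 06:00 as the start of the daytime window and 18:00 as the start of the nighttime window is an assumption, since the schedule above does not say which window owns the boundary minute.

```python
from datetime import time

DAY_START = time(6, 0)   # 06:00 Alaska time
DAY_END = time(18, 0)    # 18:00 Alaska time


def priority_order(local_time):
    """Return the two queues, higher priority first, for an Alaska local time.

    Assumption: the daytime window is 06:00 <= t < 18:00.
    """
    if DAY_START <= local_time < DAY_END:
        return ["m_32pe_8h@yukon", "m_64pe_4h@yukon"]
    return ["m_64pe_4h@yukon", "m_32pe_8h@yukon"]


if __name__ == "__main__":
    for t in (time(9, 0), time(21, 0)):
        print(t.strftime("%H:%M"), "->", priority_order(t)[0], "has priority")
```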

ARSC T3E Now Running UNICOS/mk

Based on the 1.5.0 release overview, here is the impact on some user features. This release also has improved stability due to resolution of earlier problems.

Removed Features:
  - The C_JOBPROCS resource was removed as a resource category for the
  getlim system call.
  - The C_JOBPROCS resource was removed as a resource category for the
  setlim system call.
  - jobprocs category was removed from the nlimit -c option

Added Features:
  - grmview(1) changes:
    - -m option displays PE map with compression
    - -M option displays PE map without compression
    - ApId (application ID) and command fields added to queue displays
  - ja -S option produces a summary report in table format.
  - Resource limits
    - nlimit(1) has a new option mppt (MPP time limits) under the resource
    - jstat(1) options added:
      - -m indicates that multi-PE application memory and times are 
      - -M indicates that both command & multi-PE app memory & times are 
      - -t indicates extended time info requested.
    - limit(1) options added:
      - -F option to limit file size a process or job can create
      - -M option to limit MPP memory size
      - Added limit values to command output
    - limit(2) has 2 new options:
      - L_MPPM added to resource option
      - C_APTEAM category added under category option
    - nlimit(3c) added C_APTEAM as a resource category
  - Limited checkpoint and restart support
    - Processes using open files cannot be checkpointed or restarted.
  - System V IPC message queues and semaphores available

Quick-Tip Q & A

A: {{ On the T3E, how can you limit the size of core files? }}

  #   Try the "limit" command.  From the man page:
  #   -d dlim  Indicates a limit on core file sizes.  The
  #            dlim argument refers to memory words and is
  #            rounded up to the nearest click boundary.
  #            There are 512 decimal words per click on
  #            Cray Research systems.  A dlim of 0
  #            indicates the maximum core file size
  #            allowed.  This option is supported only for
  #            processes.
  #   This command seems less flexible than advertised, however.
  #   In multiple intended crashes of a 1 PE test program with a
  #   variety of dlim settings, I found that any dlim value other
  #   than 0 resulted in a core file of 160 bytes.  A dlim value of
  #   0 resulted in core files of 2168992 bytes.  The limit command
  #   acts more like an on/off switch.
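The rounding described in the man page excerpt is easy to work out by hand. The sketch below is only an illustration of that documented arithmetic, not of the on/off behavior actually observed: it rounds a nonzero dlim up to the next 512-word click and converts the result to bytes. The 8 bytes per word figure is an assumption (64-bit words), the function names are hypothetical, and the special meaning of a dlim of 0 (maximum size allowed) is not modeled.

```python
WORDS_PER_CLICK = 512   # from the man page excerpt above
BYTES_PER_WORD = 8      # assumption: 64-bit memory words


def round_up_to_click(dlim_words):
    """Round a word count up to the next multiple of one click (512 words)."""
    clicks = -(-dlim_words // WORDS_PER_CLICK)   # ceiling division
    return clicks * WORDS_PER_CLICK


def words_to_bytes(words):
    """Convert a word count to bytes under the 8-bytes-per-word assumption."""
    return words * BYTES_PER_WORD


if __name__ == "__main__":
    for dlim in (1, 512, 1000):
        rounded = round_up_to_click(dlim)
        print(f"dlim {dlim:5d} words -> {rounded:5d} words "
              f"({words_to_bytes(rounded)} bytes)")
```

Under these assumptions a dlim of 1000 words would round up to 1024 words, or 8192 bytes, which is far from the 160-byte core files reported above for every nonzero setting.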

Q:  Here's a frustrating situation.  You are using the "copy" and
    "paste" features of your workstation's desktop environment to insert
    text from one window into a file you're editing with "vi" in a
    second window. You "copy" the text, then, in the vi window, hit "i"
    for "insert," then you execute the "paste," but the result comes
    out with an extra indent on each successive line.
    What is (probably) wrong?

[ Answers, questions, and tips graciously accepted. ]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.