ARSC HPC Users' Newsletter 205, September 29, 2000

Cartesian Topologies in MPI: Part I

[ Thanks to Dr. Don Morton, University of Montana Missoula, for contributing this series of articles. This is part 1 of 2. ]

In addition to the MPI_COMM_WORLD communicator found in every MPI implementation, the MPI standard provides a mechanism for grouping processes into logical topologies, leaving it to the implementor to decide how best to map them onto a particular architecture. For example, from a programmer's viewpoint, the set of processors may be viewed as a 2D mesh or torus. In most cases, the user has no guarantee that PE's will be allocated in a physical mesh configuration. On a simple Ethernet cluster this is clearly impossible, and even on a machine like the Cray T3E, processors are not necessarily allocated in a manner that guarantees a particular alignment.

However, even if a physical mesh configuration of processors cannot be guaranteed, the paradigm of logically viewing PE's as a mesh can provide an alternative, and sometimes cleaner, way of doing things. This article begins with a brief overview of the MPI functions relating to logical Cartesian geometries, then provides an example of their use in a finite difference code. A useful reference for MPI man pages is:

MPI functions (in C++ syntax) commonly used for working with Cartesian geometries are listed below with brief explanations. More detail is given as they appear in the sample code.

MPI_Cart_create() -
each PE in the current communicator (typically MPI_COMM_WORLD) calls this in order to join a "new" communicator representing the logical mesh configuration. Arguments include the number of dimensions, the number of processors in each dimension, whether wraparound is enabled in each dimension, and whether default reordering of PE numbers should be invoked. A new communicator is created and is used in subsequent Cartesian group operations.
MPI_Cart_coords() -
this is called to determine the coordinates of a specified PE in the new Cartesian communicator. Arguments include the PE's rank in the Cartesian communicator, the number of dimensions in the processor mesh, and an array to be filled with the PE's coordinate in each dimension.
MPI_Cart_rank() -
this is called to determine the rank in the Cartesian communicator of the processor at specified coordinates. Arguments include the Cartesian communicator, an array providing the coordinate in each dimension, and a return argument with the desired rank.
MPI_Cart_shift() -
allows a process to determine the ranks of processes in a particular dimension shifted a specified amount in a specified direction. For example, this can be used to determine the ranks of processes to the left and right, or even of processes two shifts to the left and right. Arguments include the Cartesian communicator, the dimension to shift in, and the amount of shift (with the sign indicating a downwards or upwards shift in the specified dimension).

MPI is famous for having "lots" of functions, and its coverage of Cartesian topologies is no exception. The above functions represent a small subset sufficient for many applications.
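The bookkeeping these routines perform can be sketched in ordinary Python. This is illustrative only: it shows the row-major rank-to-coordinate mapping a typical implementation uses by default, not MPI itself, and all function names here are made up.

```python
# Illustrative sketch (not MPI itself): the rank <-> coordinate
# arithmetic behind MPI_Cart_coords, MPI_Cart_rank and MPI_Cart_shift,
# assuming a row-major numbering of the process grid.

def cart_coords(rank, dims):
    """Row-major coordinates of `rank` in a grid of shape `dims`."""
    coords = []
    for d in reversed(dims):
        coords.append(rank % d)
        rank //= d
    return list(reversed(coords))

def cart_rank(coords, dims):
    """Row-major rank of the process at `coords`."""
    rank = 0
    for c, d in zip(coords, dims):
        rank = rank * d + c
    return rank

def cart_shift(rank, dims, direction, disp, periodic=True):
    """Source and destination ranks for a shift of `disp` along
    dimension `direction`, in the spirit of MPI_Cart_shift."""
    coords = cart_coords(rank, dims)
    def neighbor(offset):
        c = coords[:]
        c[direction] += offset
        if periodic:
            c[direction] %= dims[direction]
        elif not (0 <= c[direction] < dims[direction]):
            return None          # MPI would report MPI_PROC_NULL here
        return cart_rank(c, dims)
    return neighbor(-disp), neighbor(disp)
```

For a 2x3 grid with wraparound, rank 4 sits at coordinates (1,1), and rank 0's left and right neighbors in the second dimension are ranks 2 and 1.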

In part 2 of this series (i.e., in the next issue), a sample program will be presented which solves the time-dependent heat diffusion equation using finite difference methods.

Overview of SV1 Vector Cache Architecture

[ Many thanks to Mike Stewart, the author, and to the National Energy Research Scientific Computing Center (NERSC) for permission to redistribute this article. Mike is a Cray employee currently assigned to NERSC. The article is also available on the NERSC site, at: ]

Traditional Cray vector computers had no intermediate data cache between memory and the vector and scalar registers, so all operands were fetched directly from memory for all operations. With the J90, each CPU had a small 128 word cache for scalar loads and stores, but loads and stores to the vector registers still were not cached.

With the SV1, each CPU has a 32 KWord cache in which both scalar and vector stores and loads are cached. The latency for fetching data in this cache is four to five times less than that of fetches directly from memory. In addition, when a processor fetches data from its own cache instead of from the memory, this reduces contention with other processors for memory access.

One significant difference between the SV1 cache and that of other processors is the size of the cache line. Commonly, when data is fetched into a cache, a set number of words is fetched at once to take advantage of any data locality; this set of words is called a cache line. For example, the DEC Alpha processor of the T3E has a cache line size of four words for its primary cache and eight words for its secondary cache. The SV1, on the other hand, has a cache line size of one word for vector loads and stores, allowing gather/scatter and strided loads and stores without bringing unneeded data into cache.

For certain scalar load operations the SV1 CPU fetches not only the word asked for, but also prefetches seven surrounding words from memory that have the same address except for the lower three bits. This prefetch operation enhances performance on loops where locality of reference is an issue.

This cache is four-way set associative, meaning that a given cache line can reside in any of four locations within the cache. A memory address is mapped to a cache set based on the low 13 bits of its physical address, so addresses 8192 words apart map to the same four-way set. Lines are replaced via a least recently used (LRU) algorithm.
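The address arithmetic described in the last two paragraphs can be spelled out in a few lines. The following Python sketch is illustrative only; the constants are the figures quoted above, and nothing here is SV1 code.

```python
# Illustrative sketch of the SV1 mappings described above.

SETS = 8192            # 2**13 sets, selected by the low 13 address bits
WAYS = 4               # each set holds four one-word lines

def cache_set(word_addr):
    """Set index a word address maps to: its low 13 bits."""
    return word_addr & (SETS - 1)

def prefetch_group(word_addr):
    """The eight words a scalar load prefetches: the same address
    except for the lower three bits."""
    base = word_addr & ~0x7
    return list(range(base, base + 8))
```

Word addresses 0, 8192, 16384, ... all land in the same set, so at most four of them can be cache-resident at once.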

The cache is both "write-through" and "write-allocate". This means that each stored word will always be placed both in the cache and memory. Note that no "read/update/write" sequence is required due to the single-word cache lines. Memory will always be coherent.

If two processors are updating a common location in memory, one of them may obtain a "stale" value from its cache. The autotasking software automatically flushes the cache at the important points in the code, so a user does not need to worry about this coherency issue.

All load requests go through the cache chips. If the cache does not contain the requested word, the cache chips ask memory for it. Note that in the SV1 architecture a processor may have many outstanding memory requests; it is not forced to stall while waiting for a request to be satisfied.

Effective Use of Vector Cache

How can you tell whether a program is using data cache effectively? The hardware performance monitor (hpm) gives you the number of reads that are hitting in the cache (output reformatted for readability):

  % hpm ./a.out
  Group 0:
  CPU seconds             : 0.05702      
  CP executing            : 17107068
  Cache hits/sec          : 70.77M     
  Cache hits              : 4035458 
  CPU mem. references/sec : 233.45M     
  CPU references          : 13312268
  Floating ops/CPU second : 157.83M

You can also see how cache usage is affecting the performance of your code by running with the cache turned off using the -m ecdoff option of the /etc/cpu command. This is the same program as the previous example:

  % /etc/cpu -m ecdoff hpm ./a.out
  Group 0:
  CPU seconds             : 0.10927      
  CP executing            : 32780370
  Cache hits/sec          : 0.01M     
  Cache hits              : 608 
  CPU mem. references/sec : 121.83M     
  CPU references          : 13312268

You can see that cache usage approximately doubled the performance of this program (0.057 vs. 0.109 CPU seconds).

The SV1 has the same memory hardware and memory speed as the J90, despite having a processor with six times the peak performance. This means that memory can be "oversubscribed" by the processors: the performance of a processor running a memory-intensive application will suffer if other processors are also using memory intensively, because of contention for the limited number of physical ports to memory. Because of this, and because of the cache, users should expect much more variability in program timings than on a "traditional" PVP like the C90.

What are some techniques for optimizing cache usage? Avoid memory accesses with strides that are a multiple of 8192, since four such accesses fill all four ways of a cache set. In general, strides that are large powers of two should be avoided: only eight loads of stride 4096, or sixteen of stride 2048, fit in cache at once. Generally speaking, stride-one accesses are optimal, as on all Cray PVPs.
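Those counts follow directly from the set mapping: a sweep of stride s touches 8192/gcd(s, 8192) distinct sets, and each set holds four words. A hypothetical Python sketch of the arithmetic, using the set and way counts quoted above:

```python
from math import gcd

SETS, WAYS = 8192, 4   # SV1 cache: 8192 sets, four one-word lines each

def resident_capacity(stride):
    """How many words of a stride-`stride` sweep can be cache-resident
    at once: distinct sets touched, times four ways per set."""
    distinct_sets = SETS // gcd(stride, SETS)
    return distinct_sets * WAYS
```

Stride 8192 lands every access in one set (4 words resident), stride 4096 in two sets (8 words), stride 2048 in four sets (16 words), while stride one can use the full 32 KWord cache.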

Announcements: Computer Art...SGI User Forum...Visualization

October 2, 2000 Talk by Visiting Computer Artist

ARSC is sponsoring a visit by computer artist, Piero Pierucci. He will speak about his work 2:00-4:00 p.m., this coming Monday, October 2, in the Sherman Carter Room, 204 Butrovich Building.

Mr. Pierucci uses genetic algorithms and chaos theory to create both audio and visual art. You might check out his web site:

November 1, 2000 Fall UAF SGI User's Forum

Date: November 1, 2000
Time: 1:30pm - 3:30pm
Location: UAF, Sherman Carter Conference Room (Butrovich Bldg, Rm 204)
Topic: SGI Origin 3000 architecture


  1. Introductions
  2. SGI training class at UAF
  3. Other issues, questions, & suggestions for the next meeting
  4. Origin 3000 systems available to UAF/ARSC users
  5. Origin 3000 technical presentation from SGI

October 19, 2000 Colloquium: Particle Systems for Visualization

[ You're invited to attend the following colloquium, jointly sponsored by the UAF Department of Mathematical Sciences and ARSC. ]

Professor Edward Angel
Department of Computer Science and Electrical and Computer Engineering
University of New Mexico

Date: 19 October 2000
Time: 2-3pm
Location: UAF, Natural Sciences Facility, room 165
Topic: Particle Systems for Visualization

Although there have been major advances in scientific visualization, the large data sets that arise in simulation and medical diagnosis present significant challenges to the visualization community. For example, data sets from global climate simulations can generate 2.5GB per time step and scientists want to do simulations for time periods up to a century. This talk will discuss our efforts to develop new methods that can deal with such problems.

I will start with an overview of visualization strategies and how they can be implemented on clusters. Then I will discuss visualization methods based upon systems of particles that seek isosurfaces in scalar fields. Finally, I will discuss our efforts to develop generic particle systems that can be used for flow visualization, numerical simulations, and inverse problems.

ARSC SV1 Upgrade News

Chilkoot has been upgraded to a 32-Processor Cray SV1 with 4 GW of memory and approximately 2 TB of disk storage. Access to all users was restored at 7:55 am, September 25th.

The following has been available as "news SV1_transition". Item 4) was updated this morning with the installation of PE3.3.

  1. Files: Your J90 files and directories have been moved to the SV1, including those on /tmp and /viztmp. Your account should look exactly as it did on the J90.

  2. Binaries: The J90 and SV1 are "binary compatible." This means that your J90 executables should run on the SV1 without recompilation and return the same results. However, we strongly recommend recompiling with the SV1 compilers to get optimal benefit from the faster CPUs.

  3. NQS: The maximum run time for all jobs has been reduced to 144000 seconds from 576000 seconds.

    Your qsub time limit requests, "-lT" (and possibly "-lt") must be reduced to 144000 seconds or less. Otherwise NQS will not schedule your job.

  4. Programming Environments: PE 3.4 is the default PrgEnv on the SV1.

    PE 3.3 (which was the default on the J90) has been installed in case it's needed for testing. It is available as PrgEnv.old. To use it, type:

    chilkoot$ module switch PrgEnv PrgEnv.old

  5. Operating System: The SV1 (like the J90) is running under UNICOS.

  6. Please monitor the performance of your jobs using hpm and/or ja. If you have been doing this all along, we would appreciate hearing of the actual performance change you experience on the SV1.

  7. For detailed information on optimizing codes for the SV1, read Cray's document, "Benchmarker's Guide for Cray SV1 Systems":

Quick-Tip Q & A

A:[[ Should I stop using Fortran 90's "WHERE" construct?  It makes my code
  [[ more intelligible, but my sister-in-law's boyfriend told her it's
  [[ very slow.

  Well, the boyfriend (this would be, by the way, my wife's sister's
  boyfriend) does make a good point.

  Hand-written "DO" loops perform consistently better than equivalent
  "WHERE's" in simple tests on the T3E and SV1.  On an SGI O2, results
  are mixed.

  This table shows the CPU time (in seconds) to process a 40000 element
  array using WHERE vs DO loops.  Tests were run on ARSC's Cray T3E-900,
  Cray SV1, and an SGI O2.  As shown, the program was compiled with 
  both default optimization and -O3.

       WHERE-ENDWHERE                 WHERE-ELSEWHERE-ENDWHERE
       ---------------------------    --------------------------
       f90 ..          f90 -O3 ..     f90 ..          f90 -O3 ..
       ==========      ==========     ==========      ==========
       where   do      where   do     where   do      where   do
       ----- ----      ----- ----     ----- ----      ----- ----
  T3E  0.38  0.14      0.37  0.26     0.52  0.15      0.52  0.27
  SV1  .083  .058      .091  .062     0.14  0.10      0.15  0.11
  O2   0.95  0.90      0.35  0.35     1.56  0.93      0.27  0.50

  Here's the basic "WHERE-ENDWHERE" construct.  The array "A" is
  initialized by the RANF() function, which generates a uniform
  distribution of pseudo-random numbers, such that 0<x<1.0.  The integer
  array "P" is initialized to 0 in advance.

       where (A .lt. 0.05)
         P = 1
       endwhere

  Here's the equivalent DO loop:

       do j=1,M
         do i=1,N
           if (A(i,j) .lt. 0.05) then
             P(i,j) = 1
           end if
         end do
       end do

  Here's the WHERE-ELSEWHERE-ENDWHERE construct (note that the entire
  construct was timed):

       where (A .lt. 0.05)
         P = 1
       elsewhere
         P = 0
       endwhere

  Here's the do loop equivalent to the WHERE-ELSEWHERE-ENDWHERE:

       do j=1,M
         do i=1,N
           if (A(i,j) .lt. 0.05) then
             P(i,j) = 1
           else
             P(i,j) = 0
           end if
         end do
       end do

  If anyone wants the complete test code, let me know.

Q: My NQS script is written in "sh" (the first line is, "#!/bin/sh").  
   I'd like to switch programming environments from within this script,
   but it claims it can't find the "module" command:

     CHILKOOT$  cat job.e71167
     sh-56 /usr/spool/nqe/spool/scripts/++GMz+++++0+++[2]: module: not found.

  I already export my environment to the script (#QSUB -x), and 
  in my interactive environment, module gives me no trouble.  Any

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.