ARSC HPC Users' Newsletter 214, February 23, 2001

UAF Colloquium Series: Jack Dongarra, March 5

The UAF Department of Mathematical Sciences and ARSC are jointly sponsoring a Mathematical Modeling, Computational Science, and Supercomputing Colloquium Series. The schedule and abstracts for the '00-'01 academic year are available at:

Next presentation:

Enhancing Performance, Measurement Tools, and the Computational Grid

Dr. Jack Dongarra
University of Tennessee / Oak Ridge National Laboratory

Date: Monday, March 5, 2001
Time: 1:00-2:00 PM
Location: Butrovich 109


In this talk we will look at some methods for automatically generating fast, robust kernels for numerical operations, and at methods for measuring the performance of today's processors.

In addition we will look at a system called NetSolve, which allows users to access computational resources, both hardware and software, distributed across the network. This project was motivated by the need for an easy-to-use, efficient mechanism for using computational resources remotely. Ease of use comes from a choice of interfaces, some of which require no programming effort from the user. Good performance is ensured by a load-balancing policy that enables NetSolve to use the available computational resources as efficiently as possible. NetSolve can look for computational resources on a network, choose the best one available, solve a problem (with retry for fault tolerance), and return the answer to the user.


Jack Dongarra earned a Bachelor of Science in Mathematics from Chicago State University in 1972. A year later he finished a Master of Science in Computer Science from the Illinois Institute of Technology. By this time, he was already involved in the EISPACK project, producing high-quality, portable Fortran implementations of state-of-the-art algorithms for numerical linear algebra. He formally received his Ph.D. in Applied Mathematics from the University of New Mexico in 1980. He worked at Argonne National Laboratory until 1989, where he became a senior scientist. He now holds an appointment as University Distinguished Professor of Computer Science in the Computer Science Department at the University of Tennessee, and is an Adjunct R&D Participant in the Computer Science and Mathematics Division at Oak Ridge National Laboratory (ORNL) and an Adjunct Professor in Computer Science at Rice University.

I/O Algorithms on Cray T3E

[ Thanks to Brad Chamberlain of the University of Washington for this contribution. ]


This article discusses experiences in I/O on the Cray T3E. The experiments were conducted in the context of the ZPL runtime libraries. Specifically, the routines being optimized implement the file I/O that ZPL uses to read sparse array patterns and data from disk.

These routines are implemented in C, perform file I/O using C's fseek() and ftell() routines, and pass data between processors (i.e., the tokens used for coordination) using the SHMEM library's shmem_put() call.

[ Editor's Note: ZPL is a parallel programming language developed at the University of Washington. Compilers are available at ARSC and other HPC Centers. For more, see ARSC Newsletters 122 , 186 , 188 , 189 , and 203 , or visit: ]

Input Files:

I'm trying to read a sparse array from disk. Assume that sparse arrays are a set of indices (row, column) and a set of values. For simplicity, indices and values are stored in two different files, and each is stored in row-major order.

Because we want the same input files to apply to an arbitrary number of processors in an arbitrary topology, we don't split the file into per-processor chunks, but rather leave it as one complete whole.

By default, I chose to represent the indices in plain text so that they would be human-readable, and because there would be no loss of precision. In addition, the text file is smaller than its binary equivalent, because integers of the sizes we're looking at (less than 150,000 x 150,000) can be stored more compactly as text than in their binary representation (for example, 150,000 takes 7 bytes in text--6 digits plus a separator--but 8 bytes as a 64-bit binary integer).

I chose to represent the values in a binary format to avoid precision loss (or the extremely long text formatting that would be required to get all the significant digits).

Input Algorithm:

My algorithm for reading the files was essentially the following: At the start of runtime, each processor will grab a block of the problem space (the 150,000 x 150,000 possible indices) based on the number of processors and the logical processor grid. For example, running on 4 processors, I could organize them as a 1x4 grid, a 2x2 grid, or a 4x1 grid, which would result in blocks of 150,000 x 37,500, 75,000 x 75,000, or 37,500 x 150,000, respectively.


Version 1:

My original approach (for simplicity) was to have each processor read the entire file of indices, storing the ones that fell within its block and dropping the ones that didn't. This approach failed miserably once 16-32 processors were involved: the I/O time grew approximately 2.5 times worse every time the number of processors doubled, resulting in outrageous amounts of time being spent in initialization. Presumably, this was due to contention on the disk I/O channels.

Version 2:

Next, I had the processors read the index file in a coordinated manner. For example, processor 0 would start reading indices until it got to one that fell outside of its block. At that point, it would pass a token to the next processor indicating where in the file it had left off (using C's ftell routine). That processor would set the file to the position indicated, and read until it found an index that fell outside of its block, and so on.

One weird thing about this approach was that once an index fell outside of a processor's block, that processor needed to tell the next processor the file position just prior to reading that index. However, C text files make it hard to back up in a file once something's read. Therefore, each processor would check the current file position before reading *every* index, passing that position along only if the index was outside of its range. I knew that this would result in a lot of wasted checking of file positions.

The result was an algorithm which was slower than version 1 for small numbers of nodes, but scaled much better, ultimately beating version 1 for large numbers of processors. In particular, the time to read the file pretty much leveled off for larger numbers of processors.

Version 3:

This version removed the overhead of using ftell to check the position at every step. The way I chose to do this was to pass not only the file position, but also the last index read with the flow of control from one processor to the next. Thus, the file position only had to be checked after an index fell outside of a processor's block of indices, and the misplaced index would be passed explicitly to the next processor rather than re-read by it. This resulted in a savings of about half over version 2.

Version 4:

My next step was to modify version 3 by using a binary file format for the indices rather than a text format. The tradeoff was that more disk space would be required (and read by the program), but the reading process would be faster: no parsing of a text file by fscanf(), just a quick series of binary reads by fread(). This resulted in another savings of about half over version 3.


At this point, I have an algorithm that scales well and is more than 3 times faster than the first naive algorithm that scales (version 2). Here are some timing summaries for a particular problem size. Note that this isn't a formal experiment (there are some small inconsistencies between the way each run was performed), but the overall picture is correct:

    32 T3E-900 processors, 13,708,072 indices/values

    version 1: 282.847 seconds  (but 128 procs = 1007.47 seconds)
    version 2: 849.207 seconds
    version 3: 429.932 seconds
    version 4: 263.804 seconds


Being lazy about I/O does not pay off as problems and processor sets grow. Coordinating file I/O, though painful, is worthwhile, and binary files, though potentially larger than their text equivalents, will pay off greatly in terms of speed.

In more general terms, the complexity of parallel programming is often such that it's tempting to do the easy or naive thing, cross your fingers, and hope that it works out.

In reality, as you scale up your problem size or number of processors, these choices can start to matter to the point that you're wasting a lot of expensive supercomputing time, time that not only you but others might be able to use productively. (Our hope, of course, is that by solving hard problems well in the compiler, we'll save others both processing and programming time.)

This was a more important lesson than merely, "fseeks are expensive."

ARSC Training, Next Two Weeks

For details on ARSC's spring 2001 offering of short courses, visit:

Here are the courses available in the next two weeks:

User's Introduction to ARSC Supercomputers

Wednesday, Feb 28, 2001, 2-4pm

SGI Origin 2k/3k Series Applications Programming and Optimization

Mon-Wed, March 5-7, 9am-5pm

Quick-Tip Q & A

A:[[ Give an example of parallelism in the real world, and discuss
  [[  briefly with respect to concurrency, scalability, locality.

  ## This "question" drew no responses... 
  ## (By the way, we're like the "Car Guys"... we _really_ dig it when 
  ## people send in puzzlers.)

Q: Here's one for the Fortran 90 programmers out there.

   This situation appeared when porting a good-sized code from the Cray
   SV1 to an Origin 3000.  It's been mightily condensed, but the program
   below duplicates the problem and compiler error message.

   What's wrong on the Origin (or with the code) and how would you 
   fix it?




      ! The kind of this data type must represent integer values N, 
      !  where  -10**16 < N < 10**16

     &         TYPE_INT = SELECTED_INT_KIND (16)




      BASE = BASE + I
      INCR = BASE 


      INCLUDE 'interfaces.h'

      INTEGER(TYPE_INT), SAVE :: J = 0

      PRINT*, "Incremented J : ", INCR (J, 1)


        f90 -c selected_kinds.f
        f90 -c incr.f
        f90 -o t t.f incr.o selected_kinds.o


Make and run on Cray SV1 
chilkoot$ make
        f90 -c selected_kinds.f
        f90 -c incr.f
        f90 -o t t.f incr.o selected_kinds.o
chilkoot$ ./t
 Incremented J :  1


Failed make and explain on SGI Origin 3000 
sard$ make
        f90 -c selected_kinds.f
        f90 -c incr.f
        f90 -o t t.f incr.o selected_kinds.o

      print*, "Incremented j : ", incr (j, 1)
f90-389 f90: ERROR TEST, File = t.f, Line = 9, Column = 35 
  No specific match can be found for the generic subprogram call "INCR".

f90: MIPSpro Fortran 90 Version 7.3  (f61) Thu Feb 22, 2001  11:35:42
f90: 21 source lines
f90: 1 Error(s), 0 Warning(s), 0 Other message(s), 0 ANSI(s)
cf90: "explain cf90-message number" gives more information about each message
*** Error code 2 (bu21)

sard$ explain cf90-389
Error : No specific match can be found for the generic subprogram call "%s".

A function or subroutine call which invokes the name of a generic
interface does not match any specific subprogram interfaces in the
generic interface block.  All dummy arguments that do not have the
OPTIONAL attribute must match exactly all corresponding actual arguments
in type, kind type, and rank.

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.