ARSC T3E Users' Newsletter 193, April 14, 2000

Conquering the I/O Bottleneck

[[ Our thanks to the authors for submitting this article. ]]
Linear scaling during production runs: conquering the I/O bottleneck.

by Joseph Werne, Paul Adams and David Sanders

March 31, 2000

Ever turn off the I/O when benchmarking your code, knowing scaling with the number of processors is ruined by the massive data transfers current MPP platforms can generate? Fear no more. In this article we describe a technique for properly dealing with massively parallel I/O. We call the method Hierarchical Data Structuring (or HDS). Our experience with 500-PE tests on the Cray T3E demonstrates speed-increase factors that range from 1.4 to 2.1 or more, depending on the number of data files created by the application program. The method is simple, and we fully expect similar performance gains on other MPP platforms.

The work we report involves a DoD Challenge effort to simulate atmospheric turbulence in support of the Air Force AirBorne Laser (ABL) program. High-resolution simulations are conducted using a pseudo-spectral Fortran code written by one of us (JW) with as many as 1000 * 360 * 2000 spectral modes. Data-transfer needs for this project are large. A single 20-hour (wallclock), 500-processor run on the ERDC and NAVO DSRC's* T3E's generates 410 GB of data. Each processor opens and writes its own set of direct-access fixed-record-length binary output files. The files are opened just before writing and then closed again after writing is complete to minimize the number of simultaneously open files and to avoid problems with unflushed file buffers when a run is unexpectedly stopped. Write-behind I/O buffering is accomplished via the -F bufa:2000:4 option to the Cray assign command. **In order to facilitate real-time data migration to the ERDC Mass Storage Facility (MSF) or the NAVO Mass Storage and Archive Server (MSAS1), separate sets of files are created throughout the simulation; these sets are bundled together in tar files and moved off-line by a companion shell script as soon as the files are closed by the application program.

Through the course of the 20-hour simulation, 72,000 individual files are written to disk by the T3E. With the aid of our real-time data migration strategy, all of these files do not reside on the T3E's /tmp space at the same time; however, a minimum of 9,000 files do simultaneously occupy the local T3E disks. This number (9,000) is sufficient to adversely impact performance when HDS is not implemented.

The directory structure we initially used before employing HDS for massively parallel disk I/O was inhereted from our code's (author's) shared-memory, vector ancestry. For example, 3D volume files written at time levels 00, 01, etc to the "data" directory were previously (not) organized as follows:

         file1.00.0000, file1.00.0001, ... file1.00.0499,
         file2.00.0000, file2.00.0001, ... file2.00.0499, ...
         file1.01.0000, file1.01.0001, ... file1.01.0499,
         file2.01.0000, file2.01.0001, ... file2.01.0499, ...
In contrast, rather than simply placing all of the output data into a single directory, our new HDS strategy divides and subdivides the output data directory into multiple subdirectories and sub-subdirectories, ending with tails that contain no more than NCPU files (NCPU is the number of processors used for a given calculation). The new directory layout is

      data/00/file1/: file1.00.0000, file1.00.0001, ...
      data/00/file2/: file2.00.0000, file2.00.0001, ...
      data/01/file1/: file1.01.0000, file1.01.0001, ...
      data/01/file2/: file2.01.0000, file2.01.0001, ...
      file2.01.0499, ...
The finer resolution via the HDS directory tree insures that the application program does not spend time wading through an overwhelming number of filenames. For example, for the ABL simulations and data migration described above, the time required to complete 50 timesteps and the associated I/O when HDS is employed is 59 minutes. By comparison, when the older non-HDS directory layout was used, 82 minutes were needed. Furthermore, when data from previous (failed) runs also occupied the directory, as much as 2 hours could be required; and, in an extreme case while performing tests to diagnose the source of the variable execution time, we intentionly allowed the data directory to fill with O(100,000) files from previous failed tests. In this case our 50-timestep test did not complete in 6 hours, and we encountered the error "_sma_deadlock_wait" on the T3E.

With our new HDS strategy, execution time no longer varies. Dedicated 20-hour runs consistently tranfer data off-line every 50 timesteps in 59 minutes +/- 1 minute, regardless of the number of files in the directory tree.

Figure 1 shows a plot of the ABL turbulence-code performance as a function of NCPU. The wallclock time tc required per processor-gridpoint (Ng/NCPU) per timestep is reported. Ng= NX * NY * NZ is the total number of gridpoints. The per-processor problem size (Ng/NCPU) is held fixed as NCPU is varied. Deviation from exactly linear scaling on NCPU (i.e., tc= constant) results from 1) the nonlinear dependence of the FFT cost on NX, NY and NZ (most noteable as NX,NY,NZ tend to 0) and 2) differing cache utilization as the problem shape is varied to keep Ng/NCPU fixed. For the case with NCPU=500, we also include execution times without HDS (indicated by crosses).

Figure 1: Wallclock time tc per processor-gridpoint Ng/NCPU) per timestep for ABL spectral turbulence code. Ng/NCPU is held fixed for all tests. Exact linear scaling on NCPU would result in tc= constant. Circles indicate wallclock time when HDS is used to conduct I/O. Crosses indicate the best-case and typical times required when HDS is not used. They are 1.4 and 2.1 times slower than the same run with HDS. The difference between the two non-HDS timings result from larger numbers of files in the data directory for the slower case. Tests with even more files in the data directory did not finish in six hours (i.e., tc > 31 * 10-5). The dotted line shows the time for the 500-PE HDS case.

Though the execution times quoted here and reported in Figure 1 involve a specific code on the Cray T3E, we expect the performance enhancement exhibited by HDS will be universally applicable to other architectures. Given its simplicity, we believe it should be implemented for all MPP applications that generate large amounts of data and large numbers of individual files. An additional benefit of HDS is the user can easily navigate the directory tree and inspect data as it is being generated by the application program. This is not feasible with a non-HDS directory containing O(10,000) or more files.


Joe Werne is a research scientist at the CoRA division of NWRA. He has 11 years of experience in high-performance computing on shared-memory vector and distributed-memory parallel platforms. He has written highly optimized Fortran programs, including the ABL turbulence code, for both stably and unstably stratified turbulence simulations as part of NSF and DoD Grand Challenge applications since 1993. Paul Adams recently moved from Applications Support Analyst to User Services Manager at the ERDC DSRC. He has 12 years of experience in the computer industry, providing programming and user support on Cray, SGI, and IBM platforms. David Sanders is an Applications Support Analyst at the ERDC DSRC and a Systems Support Engineer with Silicon Graphics, Inc. He has 16 years of programming and user-support experience on shared-memory, distributed-memory parallel, and distributed- shared-memory platforms. David will join Cray, Inc. in April 2000. JW receives support for ABL simulations from AFRL F19628-98-C-0030 and AFOSR F49620-98-C-0029. Related supercomputer work is sponsored by ARO DAAD191-99-C-0037.

To learn more about the ABL turbulence simulations (the highest resolution stratified turbulence simulations on the planet), see the March 2000 issue of the ERDC DSRC Resource Newsletter or the Spring 1999 CEWES Journal. See also the publications listed below to learn more about the exciting and new scientific results from the project.

Questions and comments may be addressed to Joe Werne at:


[1] J.Werne and D.C.Fritts, Geophys. Res. Letters, 26 (1999) 439. [2] J.Werne and D.C.Fritts, Phys. Chem. Earth, (1999) in press. [3] R.J.Hill, D.E.Gibson-Wilde, J.Werne and D.C.Fritts, Earch Planets Space, 51 (1999) 499. [4] D.E.Gibson-Wilde, J.Werne, D.C.Fritts and R.J.Hill, Radio Science, (2000) in press. [5] R.Strelitz and J.Werne, CEWES Journal, (Spring 1999) 4. [6] ``Interview with Joseph Werne'', ERDC Resource Newsletter, (March 2000).


* U.S. Army Engineer Research and Development center (ERDC) and Naval Oceanographic Office (NAVO) DoD Supercomputing Resource Center (DSRC).

** We learned about this technique from David Cole and Mike Patterson at NAVO. Our specific implementation grew out of valuable correspondance with another DoD user, Alan Wallcraft, who conducts global and basin-scale ocean modeling on the NAVO T3E and the new ERDC IBM SMP.

SAR Workshops at UAF

ARSC is co-sponsoring the following workshop in Fairbanks:

> Intermap Technologies, Earthwatch Inc., AeroMap U.S. and ESRI are
> conducting two in-depth workshops on the STAR-3i interferometric
> synthetic aperture radar (IFSAR) system and related data products. The
> workshops will be held in Anchorage Alaska on April 18, 2000 and in
> Fairbanks Alaska on April 19-20, 2000. Participants can register and
> get more information for the Anchorage workshop on-line by following
> the Anchorage link:

> For information and registration for the Fairbanks > workshop please
> follow the Fairbanks link:


New Communication and Scalability Paper on PEMCS Site

See the electronic journal of Performance Evaluation and Modeling for Computer Systems (PEMCS):

To read the following paper:

Comparing the Communication Performance and Scalability of a Linux and an NT Cluster of PCs, a SGI Origin 2000, an IBM SP and a Cray T3E-600, Glenn R. Luecke, Bruno Raffin and James J. Coyle, Iowa Sate University, Ames, Iowa, USA, March, 2000

ARSC Job Vacancy Announcements

For details, see:

Here are the basics:

Visiting Assistant Professor of Computer Science


Teaching of undergraduate and graduate courses at all levels, research (including a 50% research position at ARSC), and service.

Visualization Research Specialist


The responsibilities of the ARSC Visualization Research Specialist are to perform independent research in the area of scientific visualization or human-machine interaction and to provide general and specialized visualization support and training for users and staff.

Quick-Tip Q & A

A:{{ Gol' Dern it an Dang Blast It All!
  {{ I opened an old file, "my_stuff.original", with vi, like this:
  {{   vi my_stuff.original
  {{ Then, before making ANY changes, realized I shouldn't mess with the
  {{ original, but should create and edit a new version.  So I used vi to
  {{ re-save it as a new file:
  {{   :w my_stuff.unoriginal
  {{ Then I typed in some limericks, etc., saved and quit:
  {{   :wq
  {{ But vi pulled a switch-er-roo.  "my_stuff.original" now contains the
  {{ NEW version, and "my_stuff.unoriginal" contains the original
  {{ What happened!!  And, if ":w" isn't right, how can I change the name
  {{ of the file I'm editing?

  We got three responses; here are two of them.  (By the way, 
  ":f <new name>" works in vi as well as vim.)
  On the version of vi I use (VIM 5.4), use :f (or :file) instead of
  :w.  This will change the name of the file you're editing AND save
  all changes to the new file instead of the old file.  If the new file
  already exists, you have to use :w! to save your file when you're
  done editing.

  Looks to me like vi did exactly as you told it to.

  It saved the edit buffer (containing the "original" file) to
  my_stuff.unoriginal and then saved the edited buffer (containing the
  limericks, etc.) to my_stuff.original.

  Why not just quit, do "cp my_stuff.original", and then

Q: When should I use SHMEM_FENCE versus SHMEM_QUIET?  Is there
   any difference between them?

[ Answers, questions, and tips graciously accepted. ]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
Back to Top