[Menu Bar] Resourses at ARSC Science at ARSC Newsroom Support About ARSC ARSC Home

 

ARSC HPC Users' Newsletter 233, November 30, 2001

Newsletter Index Quick-Tip Index Search Newsletters

Contents

 

SV1ex Memory Upgrade at ARSC, Dec. 8

As announced by Cray:

http://www.cray.com/news/0111/sv1ex_enhmem.html
the first "X" memory upgrade to the SV1e has been shipped, and it's coming to ARSC.

Chilkoot will be the first SV1ex with the faster memory.

Our install date is Saturday, Dec. 8th. Chilkoot users should check "news downtime" for the schedule. We encourage you to time your codes now and again after the upgrade, and let us know what improvements you see.

In addition to faster main memory, the memory upgrade includes a 32GB solid-state storage device (SSD). SSD is auxiliary memory hardware used for high-performance temporary storage. It provides file access at about the speed of main memory (rather than at the speed of rotating disks). It will support three functions:

This is an exciting upgrade, and we'll keep you informed via this newsletter. Chilkoot users should (as always!) pay attention to MOTD and news items.

 

New ARSC MPP Specialists

ARSC welcomes Jim Long and Tom Logan, ARSC's new MPP Specialists. They will help users port, optimize, parallelize, and run codes on the IBM SP, T3E, and future ARSC parallel systems. Jim and Tom bring scientific and programming experience, and are welcome additions to ARSC User Services.

We hope they'll be frequent contributers to this newsletter, and, as the next two articles show, they're off to a good start.

 

Fast Fourier Transform Applications

[ Thanks to Tom Logan of ARSC for this contribution. ]

It was Jean Baptiste Joseph Fourier (1768-1830) who first determined that any periodic or any finite function can be represented as a series of sine and cosine terms. The corresponding sum of terms is referred to as the Fourier Series. This series can be calculated using the Fourier transform, which, in essence, decomposes or separates a waveform or function into sinusoids of different frequency which sum to the original waveform.

The Fourier transform has some interesting properties that make it particularly useful. Foremost of these is the Convolution Theorem, which states that convolution in the time domain is equivalent to a multiplication operation in the frequency domain and vice versa. This means that rather than convolving two series in the time domain (O(N^2) operation) we can perform a Fourier transform to the frequency domain, multiply the two series (O(N) operation), then perform an inverse Fourier transform. The problem is that the Fourier transform is itself an O(N^2) operation. Thus enters the Fast Fourier Transform (FFT) which is a clever method of decomposing a discrete Fourier transform (DFT) that reduces the computation to an O(NlogN).

The FFT is the cornerstone of many fields of mathematical research and a basic tool in many branches of engineering. The usefulness of the FFT is based on the transformation of complex functions from the time domain to the frequency domain where some problems become far more tractable. Two operations in particular, convolution and correlation, can be performed in a very simple way using the FFT. The input data are transformed into the frequency domain where the amplitude spectra are multiplied and the phase spectra are added (for convolution) or subtracted (for correlation). The resulting data is inverse transformed back into the time domain to produce the same output that would be produced by a (cyclic) convolution operation in the time domain.

My own background with the FFT has been in SAR signal processing, which requires convolution of a time sampled Radar return with a reference function in two dimensions (both range and azimuth) in order to create a focused image. Also, the FFT proves particularly useful for image co-registration using frequency correlation to determine sub-pixel offsets. This can be applied to both complex images (for instance to create interferograms) or to simple amplitude images (to create time series or seamless mosaicking).

However, there are many other applications of the FFT.

Here is a list of interesting web sites. They show the breadth of FFT applications, along with the specifics of the Fourier transform, DFTs, and FFT implementations.

I give excerpts from most of the sites, and offer my own comments in parentheses:

http://utcsl.phys.utk.edu/~forrest/papers/fourier/ http://astron.berkeley.edu/~jrg/ngst/fft/fourier.html http://www.spd.eee.strath.ac.uk/~interact/fourier/fourier.html http://members.ozemail.com.au/~robgc/visgde.htm http://www.meyersanalytics.com/epfft.htm http://www.ee.bilkent.edu.tr/~haldun/wileybook.html

 

Tuning a C++ MPI Code with VAMPIR

[ Part I of III. Thanks to Jim Long of ARSC for this series of articles. ]

The Terrestrial Ecosystem Model (TEM), used at the UAF Institute of Arctic Biology's Spatial Ecology Lab ( http://picea.sel.uaf.edu ) simulates a continental to global sized ecosystem with up to 48 output variables.

TEM, written in C++, was ported two years ago to run on a Cray T3E. A global run on a 300 MHz workstation at that time took 3 days on 60,000+ gridcells (0.5deg.lat x 0.5deg.lon). On an Alaska dataset using 1,100+ gridcells, speedups on the T3E run as high as ten using 64 slave processors, while the most cost-effective computations occur with 4 slave processors (speedups 3 to 4). The price-performance ratio is maximal for 8 slave processors (speedups 4 to 6). Amount of speedup is proportional to the number of outputs requested for a given number of slaves.

In TEM there are two distinct portions. In the equilibrium portion, MPI calls are first used to initialize the model and bring it to equilibrium. Then in the transient portion, MPI is used to implement a master-slave work sharing algorithm where the master reads input files to obtain climate data for each yearly time step. Future plans include coupling TEM synchronously with a climate model, rather than having the future climate prescribed.

The problem is naturally parallel, and it was felt that the scalability of the implementation should be better than observed, even considering the high bandwidth requirements when many outputs are requested. It was also desired to investigate the performance of the code on other parallel platforms, in particular, ARSC's linux cluster and IBM SP3 cluster, in the context of thousands of simulations for a sensitivity analysis study.

Using the Alaska dataset with 4 slaves and 200 years of transient simulations, the code was rerun on the T3E with links to the vampirtrace library. VAMPIR is used to view the tracefiles, showing a graphic timeline of MPI calls within user code. Overhead for vampirtrace calls is negligible.

A comparable run using vampirtrace was also made on ARSC's training cluster. This is composed of 8 dual processor pentium II 333MHz nodes connected by ethernet and myrinet networks, and running VALinux.


Figure 1 (click on icon for larger view)

Figure 1 shows complete VAMPIR traces for the T3E (top graph) and cluster runs using ethernet (bottom). It's clear from the graphs that the T3E run completed in less time, and that the overall message passing characteristics of the two runs were similar.

The T3E output showed that the total wall-clock time was 37:09. All slaves meet a barrier after finishing their equilibrium portion. This barrier was passed at 15:56, leaving 21:13 for the MPI-intensive transient portion.

A summary chart produced by VAMPIR for the transient portion only showed:

  T3E 5-CPU Run
  -------------
  1:46:38  total cpu time
  1:17:17  total application time
  0:29:20  total MPI time

Using ethernet to pass messages on the linux cluster showed an unexpected result. The MPI-intensive portion of the code ran in the same time as on the T3E. VAMPIR's summary chart for the transient portion only showed:

  Linux Cluster using Ethernet, 5-CPU Run
  ---------------------------------------
  1:46:55  total cpu time
  0:36:29  total application time
  1:10:25  total MPI time

In figure 2, VAMPIR was used to zoom in on the MPI-intensive portions of the T3E and cluster (with ethernet) runs. The graphs are again stacked with the T3E output on top, and show roughly equal time slices.


Figure 2 (click on icon for larger view)

The short compute time of the cluster compared to the T3E is conjectured to be due to faster pointer dereference time for a linked list of objects that save the state of each gridcell after a time step. This conjecture remains to be proved in the future, perhaps with some probe code to access hardware counters. In the graphs, green areas show computation, red shows time in MPI calls, and the lines between nodes show messages. Zooming in even closer (figure 3) makes it obvious that the T3E spends more time processing but less time in MPI calls than the linux cluster:


Figure 3 (click on icon for larger view)

With 4 slave CPUs, there are 275+ pointer dereferences per CPU per time step. The smaller cache structure on the T3E is designed to handle data with a higher degree of locality, whereas the cluster CPU caches are large enough to contain all the data of the entire linked list (~ 275 x 556 bytes = 153Kb). The shorter compute time is offset by the longer time to pass messages (fig. 2 & 3). The equilibrium portion of the code on the T3E, however, ran in half the time of the cluster (15:56 vs. 33:27). Constant driving data is used in this portion of the code (computational, or green, portion of fig.1).

Message passing on the cluster using myrinet, rather than ethernet, was also a surprise. The total wallclock time was only 7.5 minutes longer than the T3E, (44:45 vs. 37:09). The summary chart for the transient portion only showed:

  Linux Cluster using Myrinet, 5-CPU Run
  --------------------------------------
  0:57:09  total cpu time
  0:37:07  total application time
  0:20:02  total MPI time

At first glance, Myrinet apparently outperforms message passing on the T3E for this code (20:02 vs. 29:20, but see fig.4 and discussion). Combined with the conjectured cache performance of the cluster for linked list dereferences, the total wallclock time of the transient portion was about half that of the T3E. This is significant as only the transient portion of the code needs to be run in a sensitivity analysis.

The first step in optimizing the parallel performance of the code is to inspect the VAMPIR output more closely.

As shown in figure 4, the increased MPI time on the T3E is due to the master processor waiting longer at barriers while the slaves compute. The time to actually transfer data (using myrinet--not ethernet--on the cluster) is roughly equal.


Figure 4 (click on icon for larger view)

On both platforms, the barrier wait is the master's dominant activity, as well as a significant fraction of the total MPI time. Furthermore, all but the first slave hit their corresponding barrier before the master determines their next input data, and none of the slaves begin computing before all the slaves have their new data. These are clues that this code might be tuned by 1) overlapping computation on the master and slaves and 2) having the slaves begin computing as soon as they receive new data.

This series of articles expands on the discussion of lessons learned in a collaborative, for credit, course in parallel scientific computing put on by UAF/ARSC and the Universities of Montana and New Mexico over the Access Grid. The collaboration, and this work, was presented at SuperComputing 2001 in the SC Global ( http://www.scglobal.org ) Access Grid forum.

In the next part of this series we look at an improved version of the code as well as VAMPIR output from ARSC's new IBM SP3, Icehawk.

 

ARSC Training: Visualization Case Studies

As announced long ago, the final course in ARSC fall training schedule is on Dec 12.

Wednesday, Dec 12, 2001 2-4pm: Visualization Case Studies
For registration, see:
http://www.arsc.edu/user/Classes.html

 

Quick-Tip Q & A

A:[[ I'm having trouble linking my C program on the T3E. It uses Cray FFT
  [[ libraries which, in turn, use BLACS grid routines.  I've tried
  [[ linking with libsci, but it still can't find BLACS_GRIDINIT.  The
  [[ Fortran compiling system seems to take care of this by magic.  How
  [[ can I get my C program to work?

  The gridinfo routines are considered to be communication routines, and
  are in the library, libcom.a.  The blacs routines are in libsci.a.
  Thus, add -lcom and -lsci to the list of cc options:

    cc -lcom -lsci ...


Q: As I migrate my code between Crays, IBMs, and SGIs, I assume
   I can just stick with the default optimization levels.  Is this a
   good assumption?

[[ Answers, Questions, and Tips Graciously Accepted ]]

 

 


Current Editors:
Thomas J. Baring ARSC Web Specialist ph: 907-450-8619
Donald Bahls ARSC User Consultant ph: 907-450-8674
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Contact:
Send comments and questions to the current editors using this Contact Form.
Email Subscriptions: Archives:

 

Newsletter Index Quick-Tip Index Search Newsletters

 

Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775 | voice: 907-450-8600 | email:

home | search | about | support | news | science | resources