ARSC HPC Users' Newsletter 233, November 30, 2001
Contents
 SV1ex Memory Upgrade at ARSC, Dec. 8
 New ARSC MPP Specialists
 Fast Fourier Transform Applications
 Tuning a C++ MPI Code with VAMPIR
 ARSC Training: Visualization Case Studies
 Quick Tip
SV1ex Memory Upgrade at ARSC, Dec. 8
As announced by Cray:
http://www.cray.com/news/0111/sv1ex_enhmem.html
the first "X" memory upgrade to the SV1e has been shipped, and it's coming to ARSC.Chilkoot will be the first SV1ex with the faster memory.
Our install date is Saturday, Dec. 8th. Chilkoot users should check "news downtime" for the schedule. We encourage you to time your codes now and again after the upgrade, and let us know what improvements you see.
In addition to faster main memory, the memory upgrade includes a 32GB solidstate storage device (SSD). SSD is auxiliary memory hardware used for highperformance temporary storage. It provides file access at about the speed of main memory (rather than at the speed of rotating disks). It will support three functions:
 provide an extremely fast, scratch filesystem (/ssd). If your program performs large amounts of file I/O, you'll be able to store heavily used files on /ssd (as you might on /tmp) during runs of the program.
 provide a secondary data segment (SDS). This is accessible using the assign command, and is a second way to perform very fast file I/O.
 provide a logical device cache (ldcache). This is a file system buffer managed by the OS. It will be used to increase I/O transfer rates on heavily used file systems.
This is an exciting upgrade, and we'll keep you informed via this newsletter. Chilkoot users should (as always!) pay attention to MOTD and news items.
New ARSC MPP Specialists
ARSC welcomes Jim Long and Tom Logan, ARSC's new MPP Specialists. They will help users port, optimize, parallelize, and run codes on the IBM SP, T3E, and future ARSC parallel systems. Jim and Tom bring scientific and programming experience, and are welcome additions to ARSC User Services.
We hope they'll be frequent contributers to this newsletter, and, as the next two articles show, they're off to a good start.
Fast Fourier Transform Applications
[ Thanks to Tom Logan of ARSC for this contribution. ]
It was Jean Baptiste Joseph Fourier (17681830) who first determined that any periodic or any finite function can be represented as a series of sine and cosine terms. The corresponding sum of terms is referred to as the Fourier Series. This series can be calculated using the Fourier transform, which, in essence, decomposes or separates a waveform or function into sinusoids of different frequency which sum to the original waveform.
The Fourier transform has some interesting properties that make it particularly useful. Foremost of these is the Convolution Theorem, which states that convolution in the time domain is equivalent to a multiplication operation in the frequency domain and vice versa. This means that rather than convolving two series in the time domain (O(N^2) operation) we can perform a Fourier transform to the frequency domain, multiply the two series (O(N) operation), then perform an inverse Fourier transform. The problem is that the Fourier transform is itself an O(N^2) operation. Thus enters the Fast Fourier Transform (FFT) which is a clever method of decomposing a discrete Fourier transform (DFT) that reduces the computation to an O(NlogN).
The FFT is the cornerstone of many fields of mathematical research and a basic tool in many branches of engineering. The usefulness of the FFT is based on the transformation of complex functions from the time domain to the frequency domain where some problems become far more tractable. Two operations in particular, convolution and correlation, can be performed in a very simple way using the FFT. The input data are transformed into the frequency domain where the amplitude spectra are multiplied and the phase spectra are added (for convolution) or subtracted (for correlation). The resulting data is inverse transformed back into the time domain to produce the same output that would be produced by a (cyclic) convolution operation in the time domain.
My own background with the FFT has been in SAR signal processing, which requires convolution of a time sampled Radar return with a reference function in two dimensions (both range and azimuth) in order to create a focused image. Also, the FFT proves particularly useful for image coregistration using frequency correlation to determine subpixel offsets. This can be applied to both complex images (for instance to create interferograms) or to simple amplitude images (to create time series or seamless mosaicking).
However, there are many other applications of the FFT.
Here is a list of interesting web sites. They show the breadth of FFT applications, along with the specifics of the Fourier transform, DFTs, and FFT implementations.
I give excerpts from most of the sites, and offer my own comments in parentheses:
http://utcsl.phys.utk.edu/~forrest/papers/fourier/
(Excellent background of theory)
An Introduction to Fourier Theory by Forrest Hoffman
The Fourier transform is used in linear systems analysis, antenna studies, optics, random process modeling, probability theory, quantum physics, and boundaryvalue problems and has been very successfully applied to restoration of astronomical data. The Fourier transform, a pervasive and versatile tool, is used in many fields of science as a mathematical or physical tool to alter a problem into one that can be more easily solved. Some scientists understand Fourier theory as a physical phenomenon, not simply as a mathematical tool. In some branches of science, the Fourier transform of one function may yield another physical function.
http://astron.berkeley.edu/~jrg/ngst/fft/fourier.html
(Good background of theory along with several applications highlighted including communications, astronomy, seismic research, and optics.)
The Fourier transform has many applications, in fact any field of physical science that uses sinusoidal signals, such as engineering, physics, applied mathematics, and chemistry, will make use of Fourier series and Fourier transforms. It would be impossible to give examples of all the areas where the Fourier transform is involved, but here are some examples from physics, engineering, and signal processing.
http://www.spd.eee.strath.ac.uk/~interact/fourier/fourier.html
(Almost identical to last address, with a few more applications identified including sonar, signal filtering, and acoustics.)
http://members.ozemail.com.au/~robgc/visgde.htm
(This is a very interesting web site with many visual examples of the FFT in use including general theory and oil industry application.)
A Visual Guide to Digital Signal Analysis Principles by Robert Crowley
http://www.meyersanalytics.com/epfft.htm
(Fourier transforms aren't just restricted to hard sciences  here's a stock market prediction application based on the FFT.)
The End Point Fast Fourier Transform (EPFFT) System & Indicator for short term & intraday trading. This package contains the advanced mathematical technique called the end point fast fourier transform and noise filter. These mathematical techniques are currently used in today's spaceage missile and speech recognition applications and are applied here to stock and commodity trading.
http://www.ee.bilkent.edu.tr/~haldun/wileybook.html
(Here's an example one of many books that can be had on the subject)
The Fractional Fourier Transform with Applications in Optics and Signal Processing
Haldun M. Ozaktas, Zeev Zalevsky, M. Alper Kutay
So far applications of the transform have been studied mostly in the areas of optics and wave propagation, and signal analysis and processing. They include Fourier optics and optical information processing, beam propagation and spherical mirror resonators (lasers), optical systems and lens design, quantum optics, statistical optics, beam synthesis and shaping, optical and quantum wavefield reconstruction and phase retrieval, perspective projections, flexible and costeffective time or spacevariant filtering with applications in signal and image restoration and reconstruction, signal extraction, signal and system synthesis, correlation, detection, pattern recognition, phase retrieval, radar, tomography, multiplexing, data compression, linear FM detection, and study of timefrequency representations, differential equations, and harmonic motion. Many of these applications are discussed at length in this book. By consolidating knowledge on the transform and illustrating how it has been applied in diverse contexts, the book aims to make possible the discovery of new applications in other areas as well.
Tuning a C++ MPI Code with VAMPIR
[ Part I of III. Thanks to Jim Long of ARSC for this series of articles. ]The Terrestrial Ecosystem Model (TEM), used at the UAF Institute of Arctic Biology's Spatial Ecology Lab ( http://picea.sel.uaf.edu ) simulates a continental to global sized ecosystem with up to 48 output variables.
TEM, written in C++, was ported two years ago to run on a Cray T3E. A global run on a 300 MHz workstation at that time took 3 days on 60,000+ gridcells (0.5deg.lat x 0.5deg.lon). On an Alaska dataset using 1,100+ gridcells, speedups on the T3E run as high as ten using 64 slave processors, while the most costeffective computations occur with 4 slave processors (speedups 3 to 4). The priceperformance ratio is maximal for 8 slave processors (speedups 4 to 6). Amount of speedup is proportional to the number of outputs requested for a given number of slaves.
In TEM there are two distinct portions. In the equilibrium portion, MPI calls are first used to initialize the model and bring it to equilibrium. Then in the transient portion, MPI is used to implement a masterslave work sharing algorithm where the master reads input files to obtain climate data for each yearly time step. Future plans include coupling TEM synchronously with a climate model, rather than having the future climate prescribed.
The problem is naturally parallel, and it was felt that the scalability of the implementation should be better than observed, even considering the high bandwidth requirements when many outputs are requested. It was also desired to investigate the performance of the code on other parallel platforms, in particular, ARSC's linux cluster and IBM SP3 cluster, in the context of thousands of simulations for a sensitivity analysis study.
Using the Alaska dataset with 4 slaves and 200 years of transient simulations, the code was rerun on the T3E with links to the vampirtrace library. VAMPIR is used to view the tracefiles, showing a graphic timeline of MPI calls within user code. Overhead for vampirtrace calls is negligible.
A comparable run using vampirtrace was also made on ARSC's training cluster. This is composed of 8 dual processor pentium II 333MHz nodes connected by ethernet and myrinet networks, and running VALinux.
Figure 1 (click on icon for larger view)
Figure 1 shows complete VAMPIR traces for the T3E (top graph) and cluster runs using ethernet (bottom). It's clear from the graphs that the T3E run completed in less time, and that the overall message passing characteristics of the two runs were similar.
The T3E output showed that the total wallclock time was 37:09. All slaves meet a barrier after finishing their equilibrium portion. This barrier was passed at 15:56, leaving 21:13 for the MPIintensive transient portion.
A summary chart produced by VAMPIR for the transient portion only showed:
T3E 5CPU Run  1:46:38 total cpu time 1:17:17 total application time 0:29:20 total MPI time
Using ethernet to pass messages on the linux cluster showed an unexpected result. The MPIintensive portion of the code ran in the same time as on the T3E. VAMPIR's summary chart for the transient portion only showed:
Linux Cluster using Ethernet, 5CPU Run  1:46:55 total cpu time 0:36:29 total application time 1:10:25 total MPI time
In figure 2, VAMPIR was used to zoom in on the MPIintensive portions of the T3E and cluster (with ethernet) runs. The graphs are again stacked with the T3E output on top, and show roughly equal time slices.
Figure 2 (click on icon for larger view)
The short compute time of the cluster compared to the T3E is conjectured to be due to faster pointer dereference time for a linked list of objects that save the state of each gridcell after a time step. This conjecture remains to be proved in the future, perhaps with some probe code to access hardware counters. In the graphs, green areas show computation, red shows time in MPI calls, and the lines between nodes show messages. Zooming in even closer (figure 3) makes it obvious that the T3E spends more time processing but less time in MPI calls than the linux cluster:
Figure 3 (click on icon for larger view)
With 4 slave CPUs, there are 275+ pointer dereferences per CPU per time step. The smaller cache structure on the T3E is designed to handle data with a higher degree of locality, whereas the cluster CPU caches are large enough to contain all the data of the entire linked list (~ 275 x 556 bytes = 153Kb). The shorter compute time is offset by the longer time to pass messages (fig. 2 & 3). The equilibrium portion of the code on the T3E, however, ran in half the time of the cluster (15:56 vs. 33:27). Constant driving data is used in this portion of the code (computational, or green, portion of fig.1).
Message passing on the cluster using myrinet, rather than ethernet, was also a surprise. The total wallclock time was only 7.5 minutes longer than the T3E, (44:45 vs. 37:09). The summary chart for the transient portion only showed:
Linux Cluster using Myrinet, 5CPU Run  0:57:09 total cpu time 0:37:07 total application time 0:20:02 total MPI time
At first glance, Myrinet apparently outperforms message passing on the T3E for this code (20:02 vs. 29:20, but see fig.4 and discussion). Combined with the conjectured cache performance of the cluster for linked list dereferences, the total wallclock time of the transient portion was about half that of the T3E. This is significant as only the transient portion of the code needs to be run in a sensitivity analysis.
The first step in optimizing the parallel performance of the code is to inspect the VAMPIR output more closely.
As shown in figure 4, the increased MPI time on the T3E is due to the master processor waiting longer at barriers while the slaves compute. The time to actually transfer data (using myrinetnot etherneton the cluster) is roughly equal.
Figure 4 (click on icon for larger view)
On both platforms, the barrier wait is the master's dominant activity, as well as a significant fraction of the total MPI time. Furthermore, all but the first slave hit their corresponding barrier before the master determines their next input data, and none of the slaves begin computing before all the slaves have their new data. These are clues that this code might be tuned by 1) overlapping computation on the master and slaves and 2) having the slaves begin computing as soon as they receive new data.
This series of articles expands on the discussion of lessons learned in a collaborative, for credit, course in parallel scientific computing put on by UAF/ARSC and the Universities of Montana and New Mexico over the Access Grid. The collaboration, and this work, was presented at SuperComputing 2001 in the SC Global ( http://www.scglobal.org ) Access Grid forum.
In the next part of this series we look at an improved version of the code as well as VAMPIR output from ARSC's new IBM SP3, Icehawk.
ARSC Training: Visualization Case Studies
As announced long ago, the final course in ARSC fall training schedule is on Dec 12.
Wednesday, Dec 12, 2001 24pm: Visualization Case Studies
For registration, see:http://www.arsc.edu/user/Classes.html
QuickTip Q & A
A:[[ I'm having trouble linking my C program on the T3E. It uses Cray FFT [[ libraries which, in turn, use BLACS grid routines. I've tried [[ linking with libsci, but it still can't find BLACS_GRIDINIT. The [[ Fortran compiling system seems to take care of this by magic. How [[ can I get my C program to work? The gridinfo routines are considered to be communication routines, and are in the library, libcom.a. The blacs routines are in libsci.a. Thus, add lcom and lsci to the list of cc options: cc lcom lsci ... Q: As I migrate my code between Crays, IBMs, and SGIs, I assume I can just stick with the default optimization levels. Is this a good assumption?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Email Subscriptions:
Ed Kornkven ARSC HPC Specialist ph: 9074508669 Kate Hedstrom ARSC Oceanographic Specialist ph: 9074508678 Arctic Region Supercomputing Center University of Alaska Fairbanks PO Box 756020 Fairbanks AK 997756020

Subscribe to (or unsubscribe from) the email edition of the
ARSC HPC Users' Newsletter.

Back issues of the ASCII email edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.