ARSC HPC Users' Newsletter 390, July 11, 2008
Pingo Code Optimization
[ By: Lee Higbie ]
Life is full of rules of thumb that may not be mathematical or physical laws, but seem close to the mark. One that I learned decades ago is: for the largest performance improvement, look at the algorithm; for intermediate speedup gains, buy more or faster hardware; and for incremental performance improvements, optimize your code. With today's HPC machines and software, the order of the last two should be reversed.
The first part of the rule of thumb is well illustrated by the Fourier transform. On n data points, Fourier transform times are typically:
    DFT algorithm time: constant * n * n
    FFT algorithm time: 4 * constant * n * lg2(n)
where lg2 is the base 2 logarithm, DFT is the discrete Fourier transform, and FFT is the fast Fourier transform.
For n = 16, the two times are equal. For n = 1K, the FFT time is about 4% of the DFT time. For n = 1M, the FFT takes less than 0.01% of the DFT time. You may be able to apply 10K processors to a million-point Fourier transformation to achieve comparable speedup, but using the right algorithm is certainly the proper optimization/parallelization approach. (See Sidebar 1 for other examples)
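The percentages above follow directly from the two cost formulas: the constant cancels, leaving FFT/DFT = 4 * lg2(n) / n. A minimal shell sketch of that arithmetic follows; the function name fft_vs_dft_percent is made up for illustration, and the cost model is only the rough one given in the text.

```shell
#!/bin/sh
# Print the FFT time as a percentage of the DFT time for n points,
# using the cost model from the text: DFT ~ c*n*n, FFT ~ 4*c*n*lg2(n).
# The constant c cancels, leaving 4*lg2(n)/n.
fft_vs_dft_percent() {
    awk -v n="$1" 'BEGIN {
        lg2 = log(n) / log(2)              # base 2 logarithm
        printf "%.3f\n", 100 * 4 * lg2 / n
    }'
}

for n in 16 1024 1048576; do
    printf 'n = %7d: FFT is %s%% of DFT time\n' "$n" "$(fft_vs_dft_percent "$n")"
done
```

For n = 16 the ratio is 100%, for n = 1K it is a little under 4%, and for n = 1M (2^20) it is below 0.01%, in line with the figures quoted in the text.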
So, after you've done all the good algorithmic stuff, what next? Most HPC users have little control over the hardware beyond throwing more cores at a job and presumably they have selected good algorithms, so this article focuses on optimization. For machines like all those we are buying today, the performance can differ by up to about a factor of 100 if operations are performed directly from memory instead of from cache. (Even 100:1 is small compared to the ratio for large DFT:FFT, but it's a bunch, and much larger than most speedups from changing platforms.)
Let me set the stage by briefly describing the majority of scientific codes. They solve (usually partial, sometimes ordinary) differential equations. The domain of interest is divided into a grid of "cells." The grid may be 1D (violin string), 2D (drum membrane), 2 1/2 D (most weather codes), or 3D (cloud modeling). [2 1/2 D described in Sidebar 2 below]
No matter what algorithm you use, locality of data access is a performance key. Specifically, you should do what you can to keep your program accessing data that will fit in about half of the L1 cache, then half the L2 cache, etc. Notice that this introduces a machine dependency, but one that is often parameterizable.
For Pingo, there will be another consideration. In addition to adequate locality, you should avoid data with addresses that differ by a multiple of 16 bytes. This means avoiding arrays with initial dimensions that are multiples of 100 (= 400 bytes = 25 * 16 bytes). Changing dimensions or allocations so that each value is odd will sometimes dramatically increase performance.
Declaring
    real, dimension(104, ...) ::
or, sometimes better for multi-dimensional arrays,
    real, dimension(105, ...) ::
means that successive final subscripts are more likely to be in separate cache sets, which means that data with consecutive final subscripts are more likely to be in cache when needed.
Summarizing, on Pingo, if your code is like most:
    real, dimension(iTopMax, jTopMax, ...) :: x
    ...
    do i = 1, iTop
      do j = 1, jTop
        x(i, j, ....
        ...
      enddo
    enddo
try to make iTopMax an odd number. If there are many subscripts, try to make all dimensions except the last, odd numbers. If you use OpenMP, try to chunk your data coarsely on the OpenMP threading subscript.
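To see why the leading dimension can matter, here is a rough sketch that counts how many distinct cache sets the first elements of successive columns fall into. The cache geometry (64-byte lines, 512 sets) is an assumption chosen purely for illustration and is not Pingo's actual hardware, and the function name distinct_sets is made up. It illustrates the general set-mapping idea, not the specific 16-byte rule given above.

```shell
#!/bin/sh
# Count how many distinct cache sets are touched by the first element of
# each of the first ncols columns of a Fortran array of 4-byte reals.
# ASSUMPTION: a hypothetical cache with 64-byte lines and 512 sets;
# these numbers are illustrative and are not taken from Pingo's hardware.
distinct_sets() {
    awk -v ld="$1" -v ncols="$2" 'BEGIN {
        line = 64; sets = 512; bytes = 4
        for (j = 0; j < ncols; j++) {
            offset = j * ld * bytes          # byte offset of x(1, j+1)
            s = int(offset / line) % sets    # set that cache line maps to
            if (!(s in seen)) { seen[s] = 1; count++ }
        }
        print count + 0
    }'
}

for ld in 1024 1025 1040; do
    echo "leading dimension $ld: $(distinct_sets "$ld" 64) distinct sets for 64 columns"
done
```

Under these assumed numbers, a power-of-two leading dimension crowds the 64 columns into only a handful of sets, while a padded dimension spreads them over many more. The right padding for a real machine depends on its actual cache parameters.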
Sidebar 1: Sorting is another common operation where the timing for the obvious algorithm is proportional to n * n but the good algorithms are proportional to n * log(n). Many less common problems have solution time proportional to exp(n) or n!, but very good approximations can be found in time proportional to n * n or n * n * n. Many scheduling problems fall in this category.
Sidebar 2: For 2 1/2 D weather grids, the domain of interest is divided into cells that are usually several kilometers on a side. Typical grids have hundreds to thousands of grid cells in longitude and latitude and only a few dozen vertical layers.
For 2 1/2 D codes, some physics and chemistry are modeled within a column above each terrain grid cell. This modeling is mostly independent of its surroundings (smog chemistry, e.g.). (If a program only has this type of physics and chemistry, it is called embarrassingly parallel, because it is easy to apply a large number of processors to the model.)
Sidebar 3: Another recently discovered hazard of array syntax in Fortran: email often turns arrays into smileys (colon-paren becomes a smiley :).
SSH Allowed Version Changes
The following two releases of the kerberized ssh packages will be disallowed after July 28, 2008: OpenSSH_4.7p1b and OpenSSH_5.0p1a. These versions will no longer be allowed to connect to ARSC systems after this date.
You can run "ssh -V" to determine the version of ssh on your system. E.g.:
    iceberg % ssh -V
    OpenSSH_5.0p1b, OpenSSL 0.9.8h 28 May 2008
Version OpenSSH_5.0p1b of the ssh kit is available here:
https://www.hpcmo.hpc.mil/security/kerberos/
Watch for news items on ARSC systems for details and additional information.
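As a convenience, the version check could be scripted along the following lines. This is only a sketch: the function name is_disallowed_ssh is made up, and it knows about nothing beyond the two releases named in this notice.

```shell
#!/bin/sh
# Succeed (exit status 0) if the given version string names one of the two
# releases that the notice above says will be disallowed.
is_disallowed_ssh() {
    case "$1" in
        OpenSSH_4.7p1b*|OpenSSH_5.0p1a*) return 0 ;;
        *) return 1 ;;
    esac
}

# "ssh -V" writes its banner to standard error, so redirect it to capture it.
if command -v ssh >/dev/null 2>&1; then
    version=$(ssh -V 2>&1)
    if is_disallowed_ssh "$version"; then
        echo "This ssh kit will be refused after July 28, 2008: $version"
    else
        echo "Not on the disallowed list: $version"
    fi
fi
```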
Quick-Tip Q & A
A:[[ Is there a way to tell which flags the MPI compiler is passing to
[[ the system compiler? Specifically I would like to see which
[[ include files and libraries are being passed to the system compiler.
#
# Ashwin Ramasubramaniam and Jed Brown both suggested the following:
#
mpixx -show
#
# From the editor...
#
On the IBM, use one of the following:
mpcc -v
mpxlf90 -v
A: BONUS ANSWERS to last week's question:
#
# Ken Irving
#
I realize the question was from the previous newsletter, but I was
somewhat dismayed to see poor old AWK not represented in any of the
several answers to a recent "Quick-Tip Q & A" question. Most answers
used various forms of head and/or grep, but AWK was developed for just
this sort of thing. I guess it's the Rodney Dangerfield of scripting
languages and tools, but, despite being extended, enhanced, and
replaced by great tools like Perl, it's still there and still useful.
Whew, that little diatribe out of the way, here's a solution using AWK:
ken@hayes:~/ 0$ ps -ef | awk '/^ken/;NR==1'
UID PID PPID C STIME TTY TIME CMD
ken 13043 2969 0 Jun03 tty1 00:00:00 -bash
ken 21079 13043 0 Jun05 tty1 00:00:00 ssh werc
...
ken 23793 23076 0 13:47 pts/2 00:00:00 ps -ef
ken 23794 23076 0 13:47 pts/2 00:00:00 awk /^ken/;NR==1
AWK checks each line against each pattern, so the header line (NR==1)
doesn't even need to be specified first, but can follow the more
particular regular expression pattern or other matching operators.
A great advantage here is that the offending command does not need
to be duplicated, a feature of the other solutions that is doomed
to eventually cause someone confusion.
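The same idea can be wrapped in a small function so the pattern becomes a parameter. This is a sketch only: the name with_header and the sample listing are made up. It joins the two tests with || rather than writing them as separate rules, because with separate rules a line that satisfies both is printed twice.

```shell
#!/bin/sh
# Print the first line of standard input (the header) plus every later line
# that matches the extended regular expression given as the first argument.
# Joining the two tests with || prints each line at most once, even if the
# header itself happens to match the pattern.
with_header() {
    awk -v pat="$1" 'NR == 1 || $0 ~ pat'
}

# Made-up sample listing standing in for "ps -ef" output.
sample='UID PID PPID CMD
root 1 0 init
ken 13043 2969 -bash
ken 21079 13043 ssh werc'

printf '%s\n' "$sample" | with_header '^ken'
```

In practice the last line would be something like: ps -ef | with_header '^ken'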
#
# Scott Kajihara
#
Surprised that no one suggested going back to the origins of grep:
g/RE/p in ed
Not as pretty as grep(1), but the solution might be
<command> | sed -n -e '1p' -e '/RE/p'
which would print out the header line (assumed line 1) and the
associated matches. Granted, sed(1) has a more primitive RE selection
than grep(1), but the examples given are not that sophisticated even:
% ps -ax | head -1 ; ps -ax | grep `whoami`
becomes
% ps -ax | sed -n -e '1p' -e '/'`whoami`'/p'
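A slightly more defensive variant of the sed approach might look like the sketch below. The function name sed_with_header and the sample input are made up. It adds a "1d" so that the header is never tested against the pattern, and it assumes the pattern contains no "/" characters.

```shell
#!/bin/sh
# Print line 1 of standard input, then every later line matching the basic
# regular expression in $1. The "1d" stops line 1 from being tested against
# the pattern, so a matching header is not printed twice.
# ASSUMPTION: $1 contains no "/" characters, since "/" delimits the pattern.
sed_with_header() {
    sed -n -e '1p' -e '1d' -e "/$1/p"
}

# Made-up sample listing standing in for real "ps" output.
printf 'USER PID CMD\nroot 1 init\nken 2 -bash\n' | sed_with_header 'ken'
```

With a real process listing, the pattern could be supplied as "$(whoami)", which is the nestable equivalent of the backquoted form used above.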
#
# Ed Anderson
#
You don't want to use grep in this case, you want to use awk or gawk
and tell it to print the first line of the output plus any lines
where the first field matches the string "mortimer". Here is an
example using gawk:
ps -ef | gawk 'FNR==1 || $1=="mortimer" {print}'
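The point of comparing the first field, rather than grepping the whole line, shows up when the user name also occurs elsewhere in the listing. The sketch below uses a made-up function name (owned_by) and made-up sample data; it is written for plain awk, and gawk accepts the same program.

```shell
#!/bin/sh
# Print the header line plus lines whose first field is exactly the given
# user name. Comparing $1 with == avoids the false hits a substring or
# regular-expression match can produce.
owned_by() {
    awk -v user="$1" 'FNR == 1 || $1 == user'
}

# Made-up listing: "mortimer" also appears inside another user's name and
# inside a command line, and neither of those lines should be selected.
sample='UID PID CMD
mortimer 101 -bash
mortimer2 102 -bash
root 103 su mortimer'

printf '%s\n' "$sample" | owned_by mortimer
```

Only the header and the line for PID 101 are selected; the lines for mortimer2 and for root's "su mortimer" are skipped.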
Q: When I do an "ls -lt" to see what directories are available,
I *frequently* want to "cd" into the directory which sorts to the
top of the listing.
Is there a shortcut, so I can do this without typing the name of
the directory, or even worse, using a GUI interface?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
