ARSC HPC Users' Newsletter 390, July 11, 2008

Pingo Code Optimization

[ By: Lee Higbie ]

Life is full of rules of thumb that may not be mathematical or physical laws, but seem close to the mark. One that I learned decades ago is: for the largest performance improvement, look at the algorithm; for intermediate speedup gains, buy more or faster hardware; and for incremental performance improvements, optimize your code. With today's HPC machines and software, the order of the last two should be reversed.

The first part of the rule of thumb is well illustrated by the Fourier transform. On n data points, Fourier transform times are typically:

  DFT algorithm time: constant * n * n
  FFT algorithm time: 4 * constant * n * lg2(n)

where lg2 is the base 2 logarithm, DFT is the discrete Fourier transform, and FFT is the fast Fourier transform.

For n = 16, the two times are equal. For n = 1K, the FFT time is about 4% of the DFT time. For n = 1M, the FFT takes less than 0.01% of the DFT time. You may be able to apply 10K processors to a million-point Fourier transformation to achieve comparable speedup, but using the right algorithm is certainly the proper optimization/parallelization approach. (See Sidebar 1 for other examples.)
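These percentages follow directly from the two cost formulas, because the constant cancels in the ratio. A minimal sketch using awk (the function name fft_vs_dft is ours, purely for illustration):

```shell
# Ratio of FFT time to DFT time for n points, as a percentage.
# From the formulas above: (4 * c * n * lg2(n)) / (c * n * n) = 4 * lg2(n) / n.
# fft_vs_dft is a hypothetical helper name, not part of any tool.
fft_vs_dft() {
  awk -v n="$1" 'BEGIN {
    lg2 = log(n) / log(2)                       # base 2 logarithm
    printf "n = %d: FFT/DFT = %.3g%%\n", n, 100 * 4 * lg2 / n
  }'
}

fft_vs_dft 16        # n = 16: FFT/DFT = 100%
fft_vs_dft 1024      # n = 1024: FFT/DFT = 3.91%
fft_vs_dft 1048576   # n = 1048576: FFT/DFT = 0.00763%
```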

So, after you've done all the good algorithmic stuff, what next? Most HPC users have little control over the hardware beyond throwing more cores at a job and presumably they have selected good algorithms, so this article focuses on optimization. For machines like all those we are buying today, the performance can differ by up to about a factor of 100 if operations are performed directly from memory instead of from cache. (Even 100:1 is small compared to the ratio for large DFT:FFT, but it's a bunch, and much larger than most speedups from changing platforms.)

Let me set the stage by briefly describing the majority of scientific codes. They solve (usually partial, sometimes ordinary) differential equations. The domain of interest is divided into a grid of "cells." The grid may be 1D (violin string), 2D (drum membrane), 2 1/2 D (most weather codes), or 3D (cloud modeling). [2 1/2 D described in Sidebar 2 below]

No matter what algorithm you use, locality of data access is a performance key. Specifically, you should do what you can to keep your program accessing data that will fit in about half of the L1 cache, then half the L2 cache, etc. Notice that this introduces a machine dependency, but one that is often parameterizable.

For Pingo, there will be another consideration. In addition to adequate locality, you should avoid data with addresses that differ by a multiple of 16. This means avoiding arrays with initial dimensions that are multiples of 100 (= 400 bytes = 25 * 16 bytes). Changing dimensions or allocations so that each value is odd will sometimes dramatically increase performance.

For example, declaring

  real, dimension(104, ...) ::

or, sometimes better for multi-dimensional arrays,

  real, dimension(105, ...) ::

means that successive final subscripts are more likely to be in separate cache sets, which means that data with consecutive final subscripts are more likely to be in cache when needed.

Summarizing, on Pingo, if your code is like most:

  real, dimension(iTopMax, jTopMax, ...) :: x
  do i = 1, iTop
    do j = 1, jTop
      x(i, j, ....

try to make iTopMax an odd number. If there are many subscripts, try to make all dimensions except the last, odd numbers. If you use OpenMP, try to chunk your data coarsely on the OpenMP threading subscript.
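Why an odd leading dimension can help is easiest to see with a toy model of a set-associative cache. The sketch below is not a description of Pingo's hardware: the 64-byte cache line and 512 sets are assumed values chosen only for illustration, and count_sets is a hypothetical helper. It counts how many distinct cache sets are touched by x(1, j) for 512 consecutive values of j when the first dimension holds ld 4-byte reals.

```shell
# Toy model, not Pingo's real cache: count the distinct cache sets touched
# by x(1, j) over 512 consecutive j, for a 4-byte real array whose
# leading dimension is ld.
# ASSUMPTIONS (illustrative only): 64-byte cache lines, 512 sets.
count_sets() {
  awk -v ld="$1" 'BEGIN {
    line = 64; nsets = 512; count = 0
    for (j = 0; j < 512; j++) {
      addr = j * ld * 4                  # byte offset of x(1, j+1)
      set = int(addr / line) % nsets     # cache set that offset maps to
      if (!(set in seen)) { seen[set] = 1; count++ }
    }
    print count
  }'
}

count_sets 128   # leading dimension a power of two
count_sets 129   # padded to an odd value
```

In this toy model the power-of-two dimension keeps landing on the same 64 of the 512 sets, so successive columns compete for the same cache space, while the odd dimension spreads the accesses over more sets. Which strides collide on a real machine depends on its cache geometry, so treat the numbers as illustrative.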

Sidebar 1: Sorting is another common operation where the timing for the obvious algorithm is proportional to n * n but the good algorithms are proportional to n * log(n). Many less common problems have solution time proportional to exp(n) or n!, but very good approximations can be found in time proportional to n * n or n * n * n. Many scheduling problems fall in this category.

Sidebar 2: For 2 1/2 D weather grids, the domain of interest is divided into cells that are usually several kilometers on a side. Typical grids have hundreds to thousands of grid cells in longitude and latitude and only a few dozen vertical layers.

For 2 1/2 D codes, some physics and chemistry is modeled within a column above each terrain grid cell. This modeling is mostly independent of its surroundings (smog chemistry, for example). (If a program has only this type of physics and chemistry, it is called embarrassingly parallel, because it is easy to apply a large number of processors to the model.)

Sidebar 3: Another recently discovered hazard of array syntax in Fortran: email often turns arrays into smileys (colon-paren becomes a smiley :).

SSH Allowed Version Changes

The following two releases of the kerberized ssh packages will be disallowed after July 28, 2008:

  OpenSSH_4.7p1b and
  OpenSSH_5.0p1a

These versions will no longer be allowed to connect to ARSC systems after this date.

You can run "ssh -V" to determine the version of ssh on your system. E.g.:

  iceberg % ssh -V
  OpenSSH_5.0p1b, OpenSSL 0.9.8h 28 May 2008

Version OpenSSH_5.0p1b of the ssh kit is available here:

Watch for news items on ARSC systems for details and additional information.
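A quick way to check a client against the two disallowed releases is sketched below. The function name is_disallowed is ours; the version strings are the two listed above. Note that ssh -V writes its version to standard error, hence the 2>&1 when capturing it.

```shell
# Return success (0) if the given "ssh -V" output names one of the two
# releases listed above as disallowed. is_disallowed is a hypothetical helper.
is_disallowed() {
  case "$1" in
    OpenSSH_4.7p1b*|OpenSSH_5.0p1a*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: only runs if an ssh client is installed.
if command -v ssh >/dev/null 2>&1; then
  version=$(ssh -V 2>&1)          # version goes to standard error
  if is_disallowed "$version"; then
    echo "upgrade needed: $version"
  else
    echo "not on the disallowed list: $version"
  fi
fi
```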

Quick-Tip Q & A

A:[[ Is there a way to tell which flags the MPI compiler is passing to
  [[ the system compiler?  Specifically I would like to see which
  [[ include files and libraries are being passed to the system compiler.

  # Ashwin Ramasubramaniam and Jed Brown both suggested the following:

  mpixx -show

  # From the editor...

  On the IBM, use one of the following:

  mpcc -v
  mpxlf90 -v
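Which flag works depends on the MPI implementation. As a hedged sketch: MPICH-style wrappers accept -show, and Open MPI wrappers document -showme. The helper name show_wrapper_flags and the wrapper name mpicc in the usage line are our assumptions, not taken from the answers above.

```shell
# Print the underlying compiler command line for an MPI compiler wrapper.
# Tries MPICH-style "-show" first, then Open MPI-style "-showme".
# show_wrapper_flags is a hypothetical helper name.
show_wrapper_flags() {
  wrapper=$1
  if ! command -v "$wrapper" >/dev/null 2>&1; then
    echo "$wrapper: not found" >&2
    return 1
  fi
  "$wrapper" -show 2>/dev/null || "$wrapper" -showme 2>/dev/null
}

# Usage (assumes a wrapper named mpicc is on your PATH); this lists only
# the include-path, library-path, and library flags, one per line:
#   show_wrapper_flags mpicc | tr ' ' '\n' | grep -E '^-(I|L|l)'
```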

A: BONUS ANSWERS to last week's question:

  # Ken Irving
  I realize the question was from the previous newsletter, but I was
  somewhat dismayed to see poor old AWK not represented in any of the
  several answers to a recent "Quick-Tip Q & A" question.  Most answers
  used various forms of head and/or grep, but AWK was developed for just
  this sort of thing.  I guess it's the Rodney Dangerfield of scripting
  languages and tools, but, despite being extended, enhanced, and
  replaced by great tools like Perl, it's still there and still useful.

  Whew, that little diatribe out of the way, here's a solution using AWK:

      ken@hayes:~/ 0$ ps -ef | awk '/^ken/;NR==1'
      UID        PID  PPID  C STIME TTY          TIME CMD
      ken      13043  2969  0 Jun03 tty1     00:00:00 -bash
      ken      21079 13043  0 Jun05 tty1     00:00:00 ssh werc
      ken      23793 23076  0 13:47 pts/2    00:00:00 ps -ef
      ken      23794 23076  0 13:47 pts/2    00:00:00 awk /^ken/;NR==1

  AWK checks each line against each pattern, so the header line (NR==1)
  doesn't even need to be specified first, but can follow the more
  particular regular expression pattern or other matching operators.

  A great advantage here is that the offending command does not need
  to be duplicated, a feature of the other solutions that is doomed
  to eventually cause someone confusion.
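The point about pattern order can be checked on canned input, without depending on what happens to be running on your machine. In this sketch the listing is made-up sample data, and the pattern is the anchored form /^ken/ that appears in the process listing above.

```shell
# Made-up sample of "ps -ef"-style output (illustrative data only).
sample() {
  printf '%s\n' \
    'UID        PID  PPID  C STIME TTY          TIME CMD' \
    'root         1     0  0 Jun01 ?        00:00:02 init' \
    'ken      13043  2969  0 Jun03 tty1     00:00:00 -bash' \
    'root      2969     1  0 Jun01 ?        00:00:00 login -- ken'
}

# A pattern with no action prints the matching line. The header rule
# NR==1 is listed second, yet the header still comes out first, because
# awk applies every rule to each line in input order.
sample | awk '/^ken/;NR==1'
```

This prints the header and the one line owned by ken. With the unanchored /ken/, the last line would print as well, since its command mentions ken.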

  # Scott Kajihara
  Surprised that no one suggested going back to the origins of grep:

    g/RE/p in ed

  Not as pretty as grep(1), but the solution might be

    sed -n -e '1p' -e '/RE/p'

  which would print out the header line (assumed line 1) and the
  associated matches. Granted, sed(1) has a more primitive RE selection
  than grep(1), but the examples given are not that sophisticated even:

    % ps -ax | head -1 ; ps -ax | grep `whoami`

    % ps -ax | sed -n -e '1p' -e '/'`whoami`'/p'

  # Ed Anderson
  You don't want to use grep in this case, you want to use awk or gawk
  and tell it to print the first line of the output plus any lines
  where the first field matches the string "mortimer".  Here is an
  example using gawk:

    ps -ef | gawk 'FNR==1 || $1=="mortimer" {print}'
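The difference between matching a field and matching a substring is easy to see on canned input. The listing below is made-up sample data (the user mortimer2 is hypothetical), and plain awk stands in for gawk since nothing gawk-specific is needed.

```shell
# Made-up "ps -ef"-style listing (illustrative data only).
listing() {
  printf '%s\n' \
    'UID        PID  PPID  C STIME TTY          TIME CMD' \
    'mortimer  4021  3990  0 09:12 pts/0    00:00:00 -bash' \
    'mortimer2 4100  4050  0 09:30 pts/1    00:00:00 vi notes' \
    'root      4200     1  0 09:31 ?        00:00:00 su - mortimer'
}

# Substring match: counts 3 lines, including mortimer2 and root's "su - mortimer".
listing | grep -c mortimer

# Field match: the header plus only the line whose first field is exactly "mortimer".
listing | awk 'FNR==1 || $1=="mortimer" {print}'
```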

Q: When I do an "ls -lt" to see what directories are available,
   I *frequently* want to "cd" into the directory which sorts to the
   top of the listing.

   Is there a shortcut, so I can do this without typing the name of
   the directory, or even worse, using a GUI interface?

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.