ARSC HPC Users' Newsletter 280, October 24, 2003

A Tale of Porting to the Cray X1

[ Thanks to Kate Hedstrom of ARSC for this article. ]

My primary application is an ocean model called ROMS (the Regional Ocean Modeling System). Before I can even try to compile it, I need access to the NetCDF library. I have compiled NetCDF several times over the years, but I am by no means an expert on it. The C part is usually very easy to compile, with the f77, f90 and C++ parts being trickier. On the X1, I had some trouble with the C part because of one of those "#ifdef cray: do something special, #else: do the default" constructs. On the X1, you want the default POSIX code. Beyond that, I accepted the Cray-compiled NetCDF library.

Back to ROMS: the first thing to do is to find out what arguments to give selected_real_kind for real*4, real*8, and real*16. (See the article devoted to "kind" issues in newsletter #263.) Our code is written to use the r8 kind for 64-bit reals in most of the computations, including literal constants. Since we manage our own kinds, I didn't feel that Cray ftn's "-s default64" option would be necessary.

With the "kind" issue settled, I went ahead and compiled and ran a serial job. The only change I had to make to the code was to work around a missing getpid function (informational, not really necessary). The job ran fine. With the serial code running, I went ahead and tried to run an MPI job. It died pretty quickly inside the MPI message passing. It was another #ifdef CRAY problem regarding the size of default reals. It cropped up in the size of messages to pass, and also in the writing of NetCDF files.

The moral of the story is to search out those special Cray cases, assuming your code currently runs on both Cray and other flavors of Unix.
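
One way to hunt for such cases is a recursive grep over the source tree. The sketch below is only an illustration: the function name find_cray_ifdefs is made up for this example, and it assumes a grep that supports the -r option (GNU and BSD grep both do). It lists preprocessor conditionals that mention "cray" in any letter case, so it should also catch variants such as _CRAY.

```shell
# List every preprocessor conditional (#if, #ifdef, #ifndef, #elif) that
# mentions "cray" in any letter case, with file name and line number.
# The directory to search is $1; it defaults to the current directory.
# Usage: find_cray_ifdefs path/to/source
find_cray_ifdefs() {
    grep -rniE '^[[:space:]]*#[[:space:]]*(el)?if.*cray' "${1:-.}"
}
```

Each hit still has to be read by hand. The pattern only finds the conditionals; it cannot tell whether the Cray branch or the default branch is the right one for the X1.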

Performance Issues

Once the code runs and results have been verified for correctness, the question of performance comes up. We have a standard ROMS case needing no input files but including most of the complicated physics of our more realistic domains. A number of obvious interest is the run time on the IBM Power4 system (iceflyer).

ROMS has its own built-in timing to find the relative costs of the major model components. The IBM times vary depending on the other jobs in the system, the best achieved being 1020 seconds, broken out as follows:

  Model 2D kernel ..........................     237.580  (23.5765 %)
  KPP vertical mixing parameterization .....     176.150  (17.4804 %)
  3D equations predictor step ..............     113.330  (11.2464 %)
  Atmospheric boundary layer coupler .......     103.560  (10.2769 %)

plus smaller stuff. The Mflips rating on this is:

  Flip rate (flips / WCT)                   :         521.414 Mflip/sec
  Flips / user time                         :         530.482 Mflip/sec

which is almost exactly 10% of peak, peak being 5.2 Gflip/sec on this 1.3 GHz Power4 processor.
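
The percent-of-peak figure is just the achieved rate divided by the peak rate. A minimal sketch of that arithmetic follows; the helper name pct_of_peak is invented for this example, and both arguments must be in the same unit (Mflip/sec here).

```shell
# Percentage of peak = 100 * achieved rate / peak rate.
# $1 = achieved rate, $2 = peak rate, both in the same unit.
pct_of_peak() {
    awk -v achieved="$1" -v peak="$2" \
        'BEGIN { printf "%.1f\n", 100 * achieved / peak }'
}

# Wall-clock flip rate from the report above against the 5.2 Gflip/sec
# (5200 Mflip/sec) peak quoted in the text.
pct_of_peak 521.414 5200    # prints 10.0
```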

The Cray timing for the original ROMS code is 2510 seconds, broken out as follows:

  Model 2D kernel ..........................      38.333  ( 1.5286 %)
  KPP vertical mixing parameterization .....     855.806  (34.1271 %)
  3D equations predictor step ..............      15.775  ( 0.6291 %)
  Atmospheric boundary layer coupler .......    1531.398  (61.0678 %)

Obviously, some parts of the code are faster on the X1, but some parts are vastly slower. About 95% of the time was spent in two vertical mixing routines which didn't vectorize particularly well. The first of these routines was sped up by inlining one of its constituent functions, by adding "-O inlinefrom=lmd_wscale.f90" to the compile line for the two routines which use it. For some loops, inlining will enable vectorization which would otherwise be inhibited by procedure and function calls.

This extra option cannot be added to the compile for all the files because the function we are inlining uses f90 modules which have to be compiled before the function can be inlined.

With inlining, the X1 time dropped to 1652 seconds:

  Model 2D kernel ..........................      38.531  ( 2.3333 %)
  KPP vertical mixing parameterization .....      27.370  ( 1.6575 %)
  3D equations predictor step ..............      15.783  ( 0.9558 %)
  Atmospheric boundary layer coupler .......    1502.950  (91.0149 %)

It is pretty clear that this bulk_flux atmospheric boundary layer code needed some attention. A specialist from Cray and I rewrote the code to further improve vectorization and clean up some unnecessarily weird logic in it. The final X1 timing was 171 seconds:

  Model 2D kernel ..........................      38.004  (22.3496 %)
  KPP vertical mixing parameterization .....      27.273  (16.0387 %)
  3D equations predictor step ..............      15.648  ( 9.2024 %)
  Atmospheric boundary layer coupler .......      23.059  (13.5604 %)

The answer was even the same! I never imagined it would go so fast. The flop rating was:

  Total  FP ops           2679.144M/sec  452281560816 ops

which is about 21% of peak for one Cray X1 MSP (which has a peak theoretical performance of 12.8 Gflops).
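
To put the run times in this section side by side, the speedup between any two of them is simply the older time divided by the newer one. The sketch below does that division for the timings quoted above; the helper name speedup is invented for this example.

```shell
# Speedup = earlier wall-clock time / later wall-clock time.
# $1 = time before the change, $2 = time after, both in seconds.
speedup() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", before / after }'
}

speedup 2510 1652   # original X1 code vs. X1 with inlining: prints 1.5
speedup 1652 171    # inlining vs. rewritten bulk_flux:      prints 9.7
speedup 2510 171    # original X1 code vs. final X1 code:    prints 14.7
speedup 1020 171    # best IBM Power4 time vs. final X1:     prints 6.0
```

These ratios compare single runs of one test case, so they are a rough guide rather than a benchmark result.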

MSP vs. SSP Experiment

All timings shown above for the X1 are for one MSP, or one multi-streaming processor. Each MSP is actually composed of four SSPs, or single-streaming processors, which share cache and have other hardware and software support (multi-streaming) to make them behave as one processor. SSPs can, however, be accessed individually.

Early on, I ran some timings to assess SSP vs. MSP performance and found that four SSPs were about twice as fast as one MSP. However, 16 MSPs were vastly faster than 64 SSPs. In fact, I couldn't get the 64 SSP run to finish; it kept timing out. With the X1-optimized version of ROMS, I sincerely doubt that I could get the four SSP run to be faster than one MSP, now that the compiler is able to vectorize and multistream both vertical mixing routines.

Butrovich Police Blotter

Residents of the UAF Butrovich building (which includes ARSC) received this actual email about a week ago. I've only changed the license numbers...

  > It has been brought to my attention that there was a small accident
  > in the parking lot this morning.
  > A silver Subaru sedan, license plate #nn-nnn apparently started
  > rolling backwards in the parking lot, gained momentum and slammed into
  > a parked white GMC truck, license plate #nn-nnn.  Neither vehicle
  > was occupied at the time.
  > These vehicles need to be moved immediately.
  > Thanks!

[ The vehicles seem to have been removed... and, no, the police blotter is not expected to become a regular Newsletter feature.]

Quick-Tip Q & A

A:[[ Special characters for egrep include "$" (match end of line) and
  [[ "^" (match beginning of line).  I had a file containing dollar
  [[ signs ("$") at the beginning of many lines, and of course, these were
  [[ the lines I wanted to extract with egrep.
  [[ By trial and error, I discovered I had to DOUBLE escape the dollar
  [[ signs.  My curiosity aroused and my day already shot, I then
  [[ discovered that to extract lines beginning with carets ("^"), I would
  [[ only have to escape the caret once.  Like this:
  [[   mywkstn> cat test.txt
  [[   wwwww
  [[   $ xxx
  [[   yyy $ 
  [[   ^ zzz
  [[   000 ^
  [[   11111
  [[   mywkstn> egrep "^\\$" test.txt
  [[   $ xxx
  [[   mywkstn> egrep "^\^" test.txt
  [[   ^ zzz
  [[   mywkstn> 
  [[ Am I cursed?  Or is this rational behavior which someone can explain?

  # Thanks to Martin Luthi:

  This is one of the fine distinctions between the different types of
  quotes. For example, in the Bourne shell:

  'xxx'     disable all special characters in xxx
  "xxx"     disable all special characters in xxx except $, `, and \
  \x        disable the special meaning of character x

  Using single quotes,

  mywkstn> egrep '^\$' test.txt
  mywkstn> egrep '^\^' test.txt

  gives you the behaviour that you want.
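
For anyone who wants to try the single-quoted patterns, here is a small sketch that rebuilds the test file from the question and runs both searches. It uses "grep -E" in place of egrep, which is an assumption about your system: the two accept the same patterns, but some recent grep releases print a warning when invoked as egrep.

```shell
# Recreate the six-line test file from the question.  Quoting the
# here-document delimiter ('EOF') stops the shell from expanding $ in it.
cat > test.txt <<'EOF'
wwwww
$ xxx
yyy $
^ zzz
000 ^
11111
EOF

# Single quotes hand the pattern to grep untouched, so one backslash is
# enough to turn $ or ^ into a literal character.
grep -E '^\$' test.txt    # prints: $ xxx
grep -E '^\^' test.txt    # prints: ^ zzz
```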

  # Thanks to Rich Griswold:

  A few examples using echo should explain what is going on:
    ~> echo "$"
    ~> echo "\$"
    ~> echo "\\$"
    ~> echo "^"
    ~> echo "\^"
    ~> echo "\\^"
  Shells perform expansion of special characters inside double quotes, so \$
  gets turned into $, but \\$ gets turned into \$.  This is because a dollar
  sign is a special character (used for shell variables), so escaping it
  with a backslash gets a plain dollar sign.  Since the backslash is also a
  special character (used to escape other special characters), you need two
  of them to get a single backslash.
  Since a caret is not a special character, escaping it with a backslash
  results in \^.  However, if you choose to escape the backslash with another
  backslash, that will turn into a single backslash.
  Confused yet?  In summary:
    "$"   -> $
    "\$"  -> $
    "\\$" -> \$
    "^"   -> ^
    "\^"  -> \^
    "\\^" -> \^
  These examples were done using bash, but the results should be similar for
  sh and ksh.  To save some confusion, you can use single quotes so that the
  shell does not do any expansion of special characters.  What you see is
  what you get:
    '$'   -> $
    '\$'  -> \$  
    '\\$' -> \\$ 
    '^'   -> ^ 
    '\^'  -> \^
    '\\^' -> \\^
  In closing, if you are wondering what the difference between "$" and "\$"  
  (and "\\$", and "\\\$", ...) is, this should clear it up:
    ~> echo "$HOME"
    ~> echo "\$HOME"
    ~> echo "\\$HOME"
    ~> echo "\\\$HOME"
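
The four cases can also be checked without depending on your own $HOME. In the sketch below, the variable name demo and its value /u1/demo are made up for illustration. It uses printf instead of echo because some echo implementations interpret backslashes themselves, which would blur what the shell's double quotes are doing.

```shell
# A stand-in for $HOME so the output is the same on every system.
# The name and the path are hypothetical.
demo=/u1/demo

# printf '%s\n' prints each argument exactly as the shell delivers it.
printf '%s\n' "$demo"       # prints: /u1/demo   (variable expanded)
printf '%s\n' "\$demo"      # prints: $demo      (\$ is a literal dollar sign)
printf '%s\n' "\\$demo"     # prints: \/u1/demo  (\\ is a backslash, then expansion)
printf '%s\n' "\\\$demo"    # prints: \$demo     (a backslash, then a literal $)
```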

  # And thanks to Greg Newby:

  You are probably not cursed.  Find a good tutorial on Regular
  Expressions on the Internet or in a shell script book for a good (though
  not necessarily illuminating) treatment of this.
  The short answer is that the dollar sign is how your Unix shell
  identifies a variable, but it's also a regular expression for the end of
  line.  Some shells interpret the caret as a substitution symbol for
  command-line editing (tcsh, zsh and others).  The caret is also a
  regular expression symbol for the start of line.  The trick you
  discovered with escapes (which could also work with quotation marks)
  is a way of telling the shell whether to interpret these special
  characters before passing them on to the grep command as arguments.
  Then, if they're not already interpreted, grep gets to decide whether to
  interpret them as regular expressions or as simple strings.
  That is the short answer.  Regular expressions are very powerful, but
  pretty complex, too.  There's even a special version of grep, "egrep",
  to make more powerful use of them than standard "grep."

Q: Here are the first four lines of a file that just goes on and on:

USER     DATE           CMDS         REAL          SYS         USER
jimbob   ALL          2267.0     674011.6       2113.5     258037.0
bobbob   ALL           109.0       1335.0        570.8         98.2
amysue   ALL            58.0    1223863.9       3003.7    1186547.6
  The columns remain consistent for the entire file.  
  I want to sort this on fields like "REAL" and "USER," and thought I
  could just use Unix "sort"... but the data is delimited by varying
  numbers of spaces and I can't figure it out.  Doing it by hand
  is taking forever!  Can anyone help?

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.