ARSC HPC Users' Newsletter 389, June 27, 2008



On the Hazards of Array Syntax (Or How I Achieved Six Orders of Magnitude Speedup)

[ By: Lee Higbie ]

For a few computational problems the phenomenon being modeled is completely local. What happens at one point in the model does not depend on what goes on around it. To a first order, a forest fire's burning depends on the weather and the fuel, but not on the fire's behavior a mile away, in another part of the forest. These programs, sometimes called embarrassingly parallel, are the easiest to parallelize for supercomputers because the model can easily run on many cores.

Most programs require information from neighboring points, which means that the cores running the program must communicate, and that communication is the major source of speed and programming bottlenecks for many parallelized programs. For example, the weather here depends not only on the local temperature, pressure, and wind speed, but also on conditions in the areas nearby. Wind and barometric pressure changes propagate rapidly compared to fires.

I received a program from an Outside source ("Outside" is Alaskan for "Lower 48"). It is a first order production forest fire model. We wanted to adapt it to use for the fires we have in Interior Alaska, ones that often account for more than half the acreage burned in the entire US.

In the middle of this Fortran code, inside the main triply nested loop, is this innocuous-looking code block:

MC1 = 1.03 * EMC
where ( PCP > 2. )
   MC1 = 35.
end where

where ( MC1 < 1. )
   MC1 = 1.
end where

After changing the geometry to the larger region needed for Alaska, updating the grid to use 1 km cells, and changing to a map projection more suitable for our latitude, I had a grid with about 2.7M cells instead of the original 10K. In the course of modifying the program, we also went from 24 time steps to 49.

I ran the program. It bombed when it ran out of time, so I started a performance analysis. With gprof I quickly found that the main routine accounted for nearly all the execution time. After inserting a few instances of:

call cpu_time(time)
print *, [ cumulative elapsed time in code block]

I isolated the time to those simple lines above. Twice I had ignored them--no loop, nothing much going on. Right? Wrong!

I changed them slightly to:

real, dimension(1800, 1500, 49) :: PCP, mc1, EMC
where ( PCP > 2. )
   mc1 = 35.
elsewhere
   mc1 = max(1., 1.03 * EMC)
end where

I figured this would stop thrashing the cache and reduce the stores to the array mc1 to one per element. This change sped up the loop nest by about a factor of three, pretty good for an hour's work. But it was still on the order of a second per innermost loop iteration.

But I kept thinking about the physics. The inner pair of loops was working over the mesh's 2.7M grid cells; why should any cell require more than a second's worth of computation? And if you do the arithmetic, that works out to several core-weeks of CPU time, still far from feasible for a production code that should run daily or twice daily during the fire season. Even using all of Midnight's 2312 cores, it would take over a thousand seconds. Embarrassing.
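The arithmetic is easy to check in the shell, assuming one second per cell as a lower bound (the 2312-core count is Midnight's, as above):

```shell
# Inner loop pair covers the full 1800 x 1500 grid, at (at least)
# one second per cell:
cells=$((1800 * 1500))
echo "$cells cells"                                # 2700000
echo "$((cells / 86400)) core-days of CPU time"    # 31
echo "$((cells / 2312)) seconds on 2312 cores"     # 1167
```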

I reread the paper describing the fire model, the one the original program was based on, and decided that nearly all the work implied by the array statements was redundant. Each variable in both of the statement blocks above is triply subscripted for longitude, latitude and time. The innermost loop pair is also over longitude and latitude. Only the value for the current cell is being used in any iteration.

Conveniently, i and j are used for longitude and latitude. With the redundant calculations removed, the statement block becomes:

where ( PCP(i, j, :) > 2. )
   mc1(i, j, :) = 35.
elsewhere
   mc1(i, j, :) = max(1., 1.03 * EMC(i, j, :))
end where

The innermost loop pair includes more than 100 lines of code, but this change eliminates some 396M loads and stores from each iteration of the loop pair. (The third subscript, the colon above, runs over the time steps.) Those extra loads and stores took so much time that the hundred-line loop pair now runs about a million times faster than the original. Embarrassment gone.
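That figure follows directly from the array dimensions declared earlier, with each full-array statement doing two loads (PCP and EMC) and one store (mc1) per element:

```shell
# Elements per array, from real, dimension(1800, 1500, 49):
elements=$((1800 * 1500 * 49))
# Two loads (PCP, EMC) plus one store (mc1) per element:
memops=$((3 * elements))
echo "$elements elements, $memops loads and stores"
# 132300000 elements, 396900000 loads and stores
```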

[Ed. Note: On the entire program, which performs extensive I/O, the changes would save more than 100,000,000,000,000,000 stores and reduce the run time by a factor of about 5000.]

Handling Little Endian Files with XLF

[ By: Don Bahls ]

By default, IBM systems such as iceberg generate big endian binary files. This can cause problems when you attempt to use binary files that were generated on a little endian system, such as Linux systems using AMD or Intel processors. Version 10 of the XLF compiler has a runtime option which allows Fortran I/O to use unformatted little endian files.

Here's an example using the reader and writer codes from newsletter 344 (see /arsc/support/news/hpcnews/hpcnews344/index.xml#article1 ).

  1. Compile the code as you normally would. E.g.
    iceberg1 46% xlf90 reader.f90 -o reader
    ** reader   === End of Compilation 1 ===
    1501-510  Compilation successful for file reader.f90.
    iceberg1 47% xlf90 writer.f90 -o writer
    ** writer   === End of Compilation 1 ===
    1501-510  Compilation successful for file writer.f90.

    If the writer is run, we get a standard unformatted big-endian file. The command "od -x" lets us verify that the output.dat file is in the expected format.

    iceberg1 51% od -x output.dat 
    0000000  0000 0018 cede d0da ce0f aded dead beef
    0000020  cafe feed deaf bead edef face 0000 0018
  2. When the XLFRTEOPTS environment variable is set, little-endian unformatted I/O will be performed for each specified unit. In these two examples, unit 11 is used for the I/O.

       export XLFRTEOPTS=ufmt_littleendian=11
       iceberg1 53% ./writer   
       iceberg1 54% od -x output.dat 
       0000000  1800 0000 edad 0fce dad0 dece efbe adde
       0000020  edfe feca cefa efed adbe afde 1800 0000

The reader application confirms the output.dat file is formatted properly.


iceberg1 61% ./reader 
 Integer*4 matches.
 Real*4 matches.
 Integer*8 matches.
 Real*8 matches.
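The byte reversal is visible in the two od dumps above: the 32-bit record marker 0x00000018 (24, the record length in bytes) comes out with its bytes in the opposite order. The same idea can be demonstrated independently of XLF, with nothing but printf and od:

```shell
# The 32-bit record marker 0x00000018 (24 bytes, the record length),
# written out in each byte order:
printf '\x00\x00\x00\x18' > big.bin       # most significant byte first
printf '\x18\x00\x00\x00' > little.bin    # least significant byte first
od -An -tx1 big.bin                       # 00 00 00 18
od -An -tx1 little.bin                    # 18 00 00 00
```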

If you happen to need to do little-endian I/O on a range of units you can specify a range with the XLFRTEOPTS environment variable:


# specifies that little-endian unformatted I/O be done on units 10 
# to 100
export XLFRTEOPTS=ufmt_littleendian=10-100  

For additional details on this feature, see the XLF documentation available on the IBM website.

Iceberg Farewell

Back in 2004, iceberg became the first allocated IBM system at ARSC. It was one of the first systems that IBM deployed with the then-new Federation 2 switch. Over the last 4 years iceberg has been a workhorse for HPCMP and academic users alike. In a little under a month, iceberg will be retired to make way for our latest allocated system.

Accounts on iceberg will be inactivated at 1 PM AKDT on July 18th.


Quick-Tip Q & A

A:[[ Some Unix commands, like ps and finger, spit out a line of column
  [[ headers followed by a table of information.  Being self-absorbed,
  [[ I often want to grep for myself in these tables, but I need to see
  [[ the column headers too, because they change depending on the command
  [[ options.  Check out the different "ps" column headers!
  [[   mg56:~ % ps -e | head -1
  [[     PID TTY          TIME CMD
  [[   mg56:~ % ps -l | head -1
  [[   mg56:~ % ps -ef | head -1
  [[   mg56:~ % ps -elf | head -1
  [[ Apparently the developers did NOT consider my needs when they wrote
  [[ these tools.  To see the header and information rows, am I really
  [[ stuck doing this?!
  [[   mg56:~ % ps -ef | head -1
  [[   UID        PID  PPID  C STIME TTY          TIME CMD
  [[   mg56:~ % ps -ef | grep mortimer
  [[   mortimer    9996  9994  0 10:07 ?        00:00:00 sshd: mortimer [priv]
  [[   mortimer    9998  9997  0 10:07 pts/8    00:00:00 -ksh
  [[   mortimer   18706  9998  0 10:27 pts/8    00:00:00 ps -ef

  # Thanks to Ryan Czerwiec

  The short answer is: yes, you're stuck, because the behavior comes
  from grep and head, not the original command.  It's not really a
  developer issue, since they all behave as intended.  There are
  workarounds for some particular commands, but the sheer number of
  header-producing commands effectively precludes any generic solution,
  or the ability of developers to predict the need and implement one.

  It wouldn't be a big deal to replace each of these commands with
  an alias or script on a case-by-case basis.  There happens to be a
  workaround for ps: you could change your pattern so it also catches
  the header, but likely nothing else.  You'd have to know ahead of
  time what the header looks like and which strings in it won't be
  found elsewhere, for instance PID for ps, as in:

    ps -elf | grep "[Pm][Io][Dr]"

  in standard grep (definitely not foolproof), or:

    ps -elf | grep "PID\|mortimer"

  in GNU grep.

  # Greg Newby

  With apologies for BSD vs. SysV Unix options for the PS command,
  I'll show a few examples from my Mac, where there are plenty of
  processes owned by my username.

  There are several scripts and aliases I can imagine writing to meet
  your needs, but this doesn't seem that hard to do by just separating
  the two commands by a semicolon.  First command gets the header,
  second gets your processes.  It's simple, and doesn't limit you to
  particular ps options.

  % ps -ax | head -1 ; ps -ax | grep `whoami`

   2587  ??  Ss    87:40.94
    391  p1  Ss     0:00.02 login -pf newby
   5085  p1  S      0:00.10 /Users/newby/.local/bin/mutt -F
    889  p2  Ss     0:00.01 login -pf newby
   3105  p2  S+     0:04.26 ssh gbnewby@
    933  p3  Ss     0:00.01 login -pf newby

  The semicolon simply tells the shell to run both commands, without
  giving a prompt in between.  This will work on all shells I know of.

  "ps -ax" is about the same as "ps -ef" on midnight.  "ps -aux" is
  about the same as "ps -elf" on midnight.  Here's another, sorted on
  the PID field:

  % ps -aux | head -1 ; ps -aux | grep `whoami` | sort +1

  newby     102   0.0 -0.4    82904   9024  ??  Ss   Fri08PM 0:06.56 /System/Li
  newby     103   0.0 -0.4   359984   8536  ??  Ss   Fri08PM 0:16.66 /System/Li
  newby     252   0.0 -0.1   349660   2692  ??  S    Fri08PM 0:00.29 /System/Li
  newby     344   0.0 -0.1    55992   2256  ??  Ss   Fri08PM 0:00.40 /System/Li
  newby     347   0.0 -0.0    27316    364  ??  S    Fri08PM 0:00.01 /sbin/laun
  newby     353   0.0 -0.3   878884   5504  ??  S    Fri08PM 2:20.76 /System/Li
  newby     354   0.0 -0.1    27936   1580  ??  S    Fri08PM   0:00.49 aped

  If any of these suit you, then take the ones you use most commonly
  and add them as aliases or functions or shell scripts.  Adding a
  pipe to "sort" opens many neat opportunities [you can even pipe one
  sort to another].

  # Brad Havel

  Using egrep, which allows more complex regular expressions to be
  grepped, the commands can be combined into a single pipe (assuming
  there is at least something on each line which can be matched, such
  as "PID" and the userid for the ps -[elf] example above).

  [=> ps -ef | egrep "(PID|havel)" | grep -v "grep"
  UID        PID  PPID  C STIME TTY          TIME CMD
  root     10064 31279  0 07:21 ?        00:00:00 sshd: havel [priv]
  havel    10066 10064  0 07:21 ?        00:00:00 sshd: havel [priv]
  havel    10067 10064  0 07:21 ?        00:00:00 sshd: havel@pts/0
  havel    10068 10067  0 07:21 pts/0    00:00:00 -bash
  havel    10102 10068  0 07:21 pts/0    00:00:00 ps -ef

  [=> ps -el | egrep "(PID|3889)" | grep -v "grep"
  1 S  3889 10066 10064  0  77   0 -  4406 schedu ?        00:00:00 sshd
  5 S  3889 10067 10064  0  75   0 -  4375 schedu ?        00:00:00 sshd
  0 S  3889 10068 10067  0  75   0 -  2728 wait4  pts/0    00:00:00 bash
  0 R  3889 10196 10068  0  77   0 -   867 -      pts/0    00:00:00 ps
  While probably not useful for everything, it is a bit cleaner and a
  little more graceful than the "head -1 ; ps -ef" method.

  # Derek Bastille

  One dead-simple way to do this is to use an OR with grep like this:

  [mira-e0]:/bastille >ps -aux | grep -E "bastille|USER"
  bastille   843  90.4 15.1   593456 197696  ??  R    Thu10AM 151:22.68 /Applications/ -psn_0_4194305
  bastille   793   1.0  1.2   228884  15704  ??  S    Thu09AM   94:18.28 /System/Library/CoreServices/
  bastille  2785   0.9  3.1   562604  41228  ??  S    Fri05PM   55:54.16 /Applications/MeetingMaker 8.6.2/Meeting Maker
  bastille   872   0.4  4.1   292420  53500  ??  R    Thu10AM   47:24.02 /Applications/Utilities/ -psn_0_4718
  bastille   801   0.1  0.5   199112   6000  ??  S    Thu09AM    5:50.56 /Applications/Microsoft

  (I did this with OS X, so the ps is a bit different).

  Basically, the -E is for including a real expression.  So I'm telling
  grep I want lines with 'bastille' or 'USER'.  For your ps header
  example, you'd want to use:  ps -ef | grep -E "UID|mortimer"

  # An Editor

  Derek forgot about this option he showed me a while back...

  % ps -aux | grep -e mortimer -e USER
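  A generic variant along the same lines, not from the original
  answers (a sketch using awk, with the question's mortimer as the
  username): awk can keep the header and do the filtering in one
  process, since NR == 1 is true only for the first input line.

```shell
# Print the header (line 1) and any line matching the username,
# all in one process:
ps -ef | awk 'NR == 1 || /mortimer/'
```

  One caveat shared with the grep solutions above: the awk process
  itself may show up in the output, since its command line contains
  the pattern.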

Q: Is there a way to tell which flags the MPI compiler is passing to
   the system compiler?  Specifically, I would like to see which
   include files and libraries are being passed to the system compiler.

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.