ARSC HPC Users' Newsletter 402, March 31, 2009

ARSC is a DSRC

ARSC is now one of six centers designated by the High Performance Computing Modernization Program (HPCMP) as a DoD Supercomputing Resource Center (DSRC). The six DSRCs are spread across the country: two in Mississippi and one each in Maryland, Ohio, Alaska and Hawaii. ARSC provides open research systems and is the only DSRC in the program that is not affiliated with a branch of the military.

Mount Redoubt Has Awakened!

Following a week of increasing seismic activity and volcanic water vapor emissions, large-scale eruptions of Alaska's Mount Redoubt began on March 22, 2009. As of this writing, Redoubt has erupted 18 times in the past nine days, producing an ash cloud topping 65,000 feet. Mount Redoubt, situated on Alaska's southern coast, began its last phase of large-scale eruptions on December 14, 1989, exploding 23 times by April 1990.

The Arctic Region Supercomputing Center DSRC, located in interior Alaska hundreds of miles from Redoubt, remains well insulated from the commotion, apart from flight cancellations. Anyone expecting to fly in or out of Alaska in the near future should not be caught off guard by abrupt flight changes and may wish to plan accordingly.

For more information and updates, please visit the Alaska Volcano Observatory's Mount Redoubt web page at:

    http://www.avo.alaska.edu/activity/Redoubt.php

A Parallel I/O Benchmarking Story

[ By Kate Hedstrom ]

When timing any model, there is a processor count at which the model stops getting faster as you add more processors. This is known as the limit of scalability. For the model I use, the horizontal domain decomposition works great for modest processor counts, but all of the I/O happens on the 'head' processor, creating a bottleneck. When we were talking to vendors about the acquisition of Midnight, two of them made a big deal about the need for parallel I/O to get better scalability. I've wanted to try parallel I/O ever since (but it's not supported on Midnight).

Enter Pingo with its support for MPI I/O. Also, the advent of NetCDF4, which is built on HDF5 and can use MPI I/O, makes it possible for us to do parallel I/O without an outrageous amount of coding. Note that you need versions of the HDF5 and NetCDF4 libraries that were compiled with --enable-parallel. On Pingo, such libraries are in


   /u2/wes/PET_HOME/pkgs/hdf5-1.8.1-parallel/lib
and

   /u2/wes/PET_HOME/pkgs/netcdf-4.0-parallel/lib
respectively.
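
To give a feel for what "not an outrageous amount of coding" means, here is a minimal sketch (an illustration only, not the model's code) of a C/MPI program that creates a NetCDF4/HDF5 file with parallel I/O. The file name, dimension size and variable are invented for the example, error checking is omitted, and it must be built with the MPI compiler wrapper and linked against the parallel-enabled libraries listed above.

   /*
    * Illustration only (not the model's code): a tiny C/MPI program
    * that creates a NetCDF4/HDF5 file with parallel (MPI) I/O.  The
    * file name, dimension size, and variable are invented, error
    * checking is omitted, and nprocs is assumed to divide NX evenly.
    */
   #include <stdlib.h>
   #include <mpi.h>
   #include <netcdf.h>   /* newer netCDF releases put the *_par
                            prototypes in netcdf_par.h instead */

   #define NX 1024       /* global extent of the example variable */

   int main(int argc, char **argv)
   {
       int rank, nprocs, ncid, dimid, varid;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

       /* All ranks collectively create one NetCDF4 (HDF5) file. */
       nc_create_par("example.nc", NC_NETCDF4 | NC_MPIIO,
                     MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

       nc_def_dim(ncid, "x", NX, &dimid);
       nc_def_var(ncid, "field", NC_DOUBLE, 1, &dimid, &varid);
       nc_enddef(ncid);

       /* Each rank writes its own contiguous slab of the variable. */
       size_t count = NX / nprocs;
       size_t start = rank * count;
       double *slab = malloc(count * sizeof(double));
       for (size_t i = 0; i < count; i++)
           slab[i] = (double)(start + i);
       nc_put_vara_double(ncid, varid, &start, &count, slab);

       nc_close(ncid);
       free(slab);
       MPI_Finalize();
       return 0;
   }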

First, we need to set a baseline. The model comes with a benchmark case in three sizes. I'm going to focus on the largest, starting with a processor count of 32 in a 4x8 tiling.

Some Midnight walltime numbers:

   Without I/O:


        4-way nodes: 387 seconds
       16-way nodes: 1030 seconds

   With I/O:


        4-way nodes: 462 seconds
       16-way nodes: 1136 seconds

The big message here is that I should stick to the 4-way nodes.

Now for Pingo (where all nodes are equal):

   32 cores:


       no I/O:                           428 seconds
       creating classic NetCDF3 files:   489 seconds
       creating serial HDF5 files:       477 seconds
       creating parallel HDF5 files:    1481 seconds

   256 cores:


       no I/O:                            90 seconds
       creating classic NetCDF3 files:   138 seconds
       creating serial HDF5 files:       124 seconds
       creating parallel HDF5 files:     817 seconds

Clearly, we'd be saving time by not producing any output at all. Also, the HDF5 writing is slightly faster than the NetCDF3 writing. The parallel I/O is so slow I thought it was hanging at first! It has to run in independent mode, since collective mode (the other of the two MPI I/O options) blows up with errors from the HDF5 layer. I don't know why I'm not getting the great speedups out of parallel I/O that I'd hoped for, but I'm running back to serial mode for now.
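
For anyone who wants to experiment with the two MPI I/O modes, the NetCDF4 C interface exposes the choice per variable through nc_var_par_access() (the Fortran interface has an analogous routine). This is a hedged sketch under the same assumptions as the earlier example, not the model's code:

   /* Sketch only: the MPI I/O mode is chosen per variable.
    * NC_COLLECTIVE means every MPI rank takes part in each read or
    * write of the variable; NC_INDEPENDENT lets each rank do its I/O
    * on its own.  Depending on the netCDF release, these constants
    * may live in netcdf.h or netcdf_par.h. */
   #include <netcdf.h>

   void set_par_access(int ncid, int varid, int collective)
   {
       nc_var_par_access(ncid, varid,
                         collective ? NC_COLLECTIVE : NC_INDEPENDENT);
   }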

[ Editor's note: We are aware of another Pingo MPI I/O test that saw a performance improvement of 10-20% although it wasn't using HDF5. ]

Row/Column Major Array Disposition

[ By Craig Stephenson ]

In my "Valgrind's Cachegrind Profiler Tool" article from Issue 401, I had reported some interesting results arising from accessing multi- dimensional arrays in the least efficient order for both Fortran 90 and C:

    /arsc/support/news/hpcnews/hpcnews401/index.xml#article3

I had declared the array with the largest dimension first. In Fortran 90, the array declaration looked like this:


  INTEGER, DIMENSION(1000,100,10) :: A

Accessing array elements by row in Fortran 90, a column-major language, produced 92,258 D1 cache write misses.

In an attempt to create an analogous example in C, I had declared an array as follows:


  int A[1000][100][10];

As with the Fortran 90 example, I deliberately stepped through this array in the least efficient order. Since C is a row-major language, I stepped through the array in column order. This produced an alarming 623,193 D1 cache write misses, compared with Fortran 90's 92,258. This was noted as a curiosity, but not explained.

Newsletter reader Jed Brown was quick to point out that it is not fair to C to treat the two array declarations above as equivalents. The differences between row-major and column-major languages affect the data structures themselves, not just how they are accessed: Fortran stores the first dimension contiguously while C stores the last, so matching memory layouts requires reversing the dimension order. From a performance perspective, a more fitting C "equivalent" of my Fortran 90 array orders the dimensions from smallest to largest:


  int A[10][100][1000];

Indeed, when I changed the array declaration in the C example to this (and adjusted the array indexes accordingly in the "for" loops), stepping through the array in the least efficient order produced only 92,978 D1 cache write misses, comparable to Fortran 90's 92,258.

The code examples in Issue 401 produced a situation where Fortran 90's least efficient order was many times faster than C's least efficient order. Why? Because the declaration itself, with the largest dimension first, is biased toward column-major access. When C steps through int A[1000][100][10] with the first index varying fastest, consecutive writes are 100*10*4 = 4000 bytes apart, so essentially every write lands on a different cache line, while Fortran's least efficient order over the equivalent declaration still gets some cache-line reuse from its contiguous leading dimension. The result is far more cache misses for the row-major language. We see again that Valgrind is very handy for gaining insight into the behavior of our code.
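
For readers who want to try this at home, a small self-contained C sketch along these lines (not the exact code from Issue 401) reproduces the worst-case traversal of the reordered array; a cachegrind run on a machine with a similar cache should show roughly comparable D1 write-miss behavior.

   /* Sketch along the lines of the Issue 401 experiment (not the exact
    * code): walk a 3-D array in C's least efficient order, i.e. with
    * the leftmost index varying fastest.  With the dimensions ordered
    * smallest to largest, the memory layout matches the Fortran 90
    * declaration DIMENSION(1000,100,10).  Run it under
    *     valgrind --tool=cachegrind ./a.out
    * and swap the three #defines back to largest-first to see the much
    * larger D1 write-miss count.
    */
   #include <stdio.h>

   #define NI 10
   #define NJ 100
   #define NK 1000

   static int A[NI][NJ][NK];    /* smallest dimension first */

   int main(void)
   {
       /* Worst case for a row-major language: the leftmost index
        * (the one with the largest stride in memory) changes fastest. */
       for (int k = 0; k < NK; k++)
           for (int j = 0; j < NJ; j++)
               for (int i = 0; i < NI; i++)
                   A[i][j][k] = i + j + k;

       printf("A[%d][%d][%d] = %d\n",
              NI-1, NJ-1, NK-1, A[NI-1][NJ-1][NK-1]);
       return 0;
   }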

Many thanks to Jed Brown for bringing this to our attention!

Quick-Tip Q & A



A:[[ I saved an important file in the local /scratch directory on one of
 [[ 20+ Linux workstations, but I don't remember which one.  The file
 [[ name is "coastline.inp", and it may or may not be in a subdirectory.
 [[ Since the /scratch directory is not shared between the workstations,
 [[ I need to find the specific machine that has this file.  With so
 [[ many workstations available, what's the most efficient way to
 [[ determine which workstation has the file I need?

#
# Jed Brown, Greg Newby and Rahul Nabar submitted similar solutions
# using a shell "for" loop and ssh.  Greg explains the benefits of
# passing a single command through ssh:
#

You can use ssh to execute a single command on the workstations,
rather than needing to log in to each one separately.

For example, if I want to look for a file on aphrodite (one of the
workstations), I could do this:

 ssh aphrodite ls /scratch/newby/data50v80.mat

Much faster than logging in and getting a shell prompt.

With quotes, I can use a wildcard to look for matching files:
 ssh aphrodite ls /scratch/newby/"data50v80*"

If I know a list of hostnames, I can put this in a loop.  Using
sh/ksh/bash, it would look something like this:

 for i in dionysus demeter hades hera hestia ; do
  ssh $i ls /scratch/newby/"data50v80*" 
 done

Or, if I have a file with system names:

 for i in `cat sysnames.txt` ; do
  ssh $i ls /scratch/newby/"data50v80*" 
 done

This assumes you know the system names you are interested in.  Get
them here (among other places):
 http://www.arsc.edu/support/status/systemstatus/

The only downside of this approach on the ARSC systems is that you get
a long pre-login banner.  To suppress it, try this:

 for i in dionysus demeter hades hera hestia ; do
  ssh $i ls /scratch/newby/"data50v80*" 2>&1 | sed '/^#/d'
 done

(The 2>&1 is needed to combine standard output with standard error,
because the warning banner comes on standard error.  Try it without
that part to see what I mean.)

It would be easy to make this into a script that takes an argument for
a filename and destination to search through.

#
# Ryan Czerwiec provided an excellent solution once again:
#

I don't know too much about the networking side of things, so there may
be a command that can tell you any machine names that are on the local
domain.  I'll start by assuming the user can create a text file
containing the names of all the candidate machines they want to try.
Something simple like this will work in csh:

------------------------------
#!/bin/csh -f
# $1 = file listing the candidate machine names, $2 = file name to find
set a = `cat $1`
set index = 1
while ( $index <= $#a )
 # search /scratch on each machine in turn
 ssh $a[$index] find /scratch -name "$2"
 @ index++
end
exit
------------------------------

Unfortunately, the security needs of the modern world will often
confound this.  Ideally, you'd be on a system where you have a ticket
such that you can ssh or rsh to other machines without retyping a
password.  This script would prompt you for a password on each machine
in the list if retyping it is necessary (I've never found a way to
automate that), and will fail if ticket forwarding fails or is disabled.
So this might well not work on your particular system, but if you get
lucky, use it as:

./scriptname.csh <file listing machines to check> <file name to search for>

#
# Focusing on efficiency, Tom Baring's solution adds a twist to the
# "find" solution by backgrounding each ssh "find" process:
#

You can use "find" to search for the file. For efficiency, limit the
search to directories and files owned by you (-user ${USER}) and stop
searching as soon as the file is located (-quit).  To shorten the
wallclock time, let each system perform its own find and let them run in
parallel by starting all the find commands with ssh and backgrounding
the ssh commands, so they can all be issued without waiting.  Have
"find" echo the name of the system where the file is found using -exec:

------------------------------
#!/bin/ksh

SYSTEMS="hera mallard aphrodite zeus delta beta "
FILE="coastline.inp"

for SYS in $SYSTEMS
do
CMD="find /export/scratch -user ${USER} -name ${FILE} -exec echo -n FOUND ${SYS}:  \; -print -quit"
ssh -qq ${SYS} ${CMD} 2> /dev/null &
echo "Started on ${SYS}"
done
------------------------------

The output's a little ugly because the script ends and the local prompt
is printed before the remote output from the backgrounded process is
returned.  If it mattered, this could be cleaned up:

pedro:/scratch/baring % ./where_it_is.ksh
Started on hera
Started on mallard
Started on aphrodite
Started on zeus
Started on delta
Started on beta
pedro:/scratch/baring % FOUND hera:/export/scratch/coastline.inp

#
# Also, a trick from an editor:
#

Granted, "efficiency" could refer as much to human efficiency as
computational efficiency.  But if you had some extra time on your hands
and hope to perturb as few Linux workstations as possible, you might try
the following trick...

If you use ssh regularly, presumably every workstation you have ever
connected to from your own local workstation has an entry in your ssh
known_hosts file.  Assuming you've only connected to a handful of these
workstations in the past, you may be able to leverage the ssh
known_hosts file to substantially reduce the number of workstations
searched.

First, create a text file listing the 20+ workstations.  E.g.,

beta
delta
aphrodite
apollo
...

Once this file has been created, it can be compared against your
known_hosts file:

cut -d ',' -f 1 ~/.ssh/known_hosts |
 cut -d ' ' -f 1 |
 grep -f workstations.txt

This admittedly inelegant command will compare all of the hosts in your
known_hosts file with the list of workstations in workstations.txt.
Hence, the command will print only those Linux workstations that you
have connected to in the past via ssh, which (making several
assumptions) are the only workstations that could possibly have the file
you seek.  This command can be combined with the other answers provided
to this quick tip question.  Modifying Tom Baring's script, for example:

------------------------------
#!/bin/ksh

SYSTEMS=`cut -d ',' -f 1 ~/.ssh/known_hosts |
 cut -d ' ' -f 1 |
 grep -f workstations.txt`
FILE="coastline.inp"

for SYS in $SYSTEMS
do
CMD="find /export/scratch -user ${USER} -name ${FILE} -exec echo -n FOUND ${SYS}:  \; -print -quit"
ssh -qq ${SYS} ${CMD} 2> /dev/null &
echo "Started on ${SYS}"
done
------------------------------


Q: My code just crashed and generated over a hundred core files.  The
  log files don't have anything meaningful, so I have no idea which
  task(s) were having problems.  How can I get a stack trace from each
  of these core files in an automated fashion?  I'm getting really tired
  of running gdb on one core file at a time!
 

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.