ARSC HPC Users' Newsletter 413, July 1, 2010

ARSC Summer Tours

Learn how scientists at UAF and around the world use supercomputers to solve some of the world's most pressing problems. ARSC's high-performance computing resources include petabyte-scale data storage facilities and supercomputers that can perform trillions of arithmetic calculations per second.

You and your guests are invited to attend the ARSC summer tours hosted in the Butrovich machine room viewing area, Wednesdays at 1:00 PM from June 2 through August 25. Meet at 009 Butrovich Building.

Attendees will also be given the chance to test-drive remote-controlled bots developed by ARSC's undergraduate Research Projects Assistants.

Call 907-450-8600 for more information.

Introducing Chugach

ARSC has continued the tradition of naming its supercomputers with Alaska-themed monikers by dubbing the newest member of its HPC family Chugach (pronounced CHOO-gatch) after the Chugach Mountains of Southcentral Alaska. The Chugach Mountains extend about 250 miles, from Anchorage, located on Cook Inlet, in the west to the Bering Glacier in the east. Early Russian explorers to the region recorded the Eskimo tribal name of the "Chugachmiut" as "Chugatz" and "Tchougatskoi." Today, Chugachmiut is the consortium created to promote self-determination for the seven Alaska Native communities of the Chugach Region.

Chugach isn't the first set of mountains ARSC has memorialized in name. Denali, the Alaska Athabascan Indian name for Mount McKinley, was the name of ARSC's first supercomputer, a 4-CPU Cray Y-MP that came online in 1993.

Using PuTTY SCP

[ By Don Bahls ]

A few months ago I had to move several large files with similar name patterns from long-term storage on seawolf.arsc.edu to a Windows system. After attempting to use FileZilla for a few files, I was reminded that PuTTY includes a command-line scp client, pscp.exe. So I abandoned FileZilla and moved to the command line.

The file transfer process is straightforward.

  1. Get a ticket using the krb5.exe command as you normally do on Windows.
  2. Start a Windows command interpreter. Click "Start" then "Run" and enter "cmd".
  3. In the command interpreter switch to the directory where you'd like to transfer the files. For example:
    
       % cd Desktop
    
  4. Issue the "pscp.exe" command as you would issue scp on a Linux/Unix/Mac system:
    
       % "C:\Program Files\HPCMP Kerberos\pscp.exe" "user@seawolf.arsc.edu:/archive/u1/uaf/user/directory.*" .
    
       directory.use.9           |  1 kB |   1.3 kB/s | ETA: 00:00:00 | 100%
       directory.use.1           |  0 kB |   0.7 kB/s | ETA: 00:00:00 | 100%
       directory.use.2           |  0 kB |   0.7 kB/s | ETA: 00:00:00 | 100%
       directory.use.3           |  0 kB |   0.7 kB/s | ETA: 00:00:00 | 100%
    

The one drawback is that pscp.exe will encrypt the file being transferred, so that can slow transfer rates a bit. But the ability to use wildcards such as "*" and "?" more than made up for it in my case, because I could easily skip the files I didn't need.
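For instance, to grab only a numbered subset of the archived pieces, a pattern like the following could be used (this variant is illustrative; adjust the pattern to your own file names):

       % "C:\Program Files\HPCMP Kerberos\pscp.exe" "user@seawolf.arsc.edu:/archive/u1/uaf/user/directory.use.?" .

Here the "?" matches a single character, so only the pieces with a one-character suffix after "directory.use." are transferred, and the rest of the directory contents stay on seawolf.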

Running Multiple MPI Jobs on a Single Node

[ By Ed Kornkven ]

Many of our users use our supercomputers not to run single jobs on many processors, but rather to run many single-processor jobs. Since nodes are not shared on either Midnight or Pingo, and the smallest allocatable unit of each machine is a node, some users of those machines may be able to get better throughput and more bang for their allocation buck by running multiple jobs on a single node. We can expect this issue to become more important as node core counts continue to increase.

Back in issue 378 of this newsletter, Don Bahls demonstrated how to run multiple serial jobs on a single node using Matlab as an example. Users running serial codes are encouraged to take a look at that article here:

     /arsc/support/news/hpcnews/hpcnews378/index.xml#article3

However, what if your job is not serial, but an MPI job using small-scale parallelism? This article demonstrates how you can extend Don's technique to Midnight MPI jobs. This time we will use a multi-threaded HPL job running on the 16-core nodes of Midnight as an example.

HPL is the portable parallel Linpack code best known as the benchmark used to rank computers on the Top500 list of the world's fastest machines. It uses MPI to solve a dense linear system of equations on parallel machines. The version shown here also uses two threads per MPI task in the core BLAS routines via the GotoBLAS library. See the following website for more information on GotoBLAS:

     http://www.tacc.utexas.edu/tacc-projects/

We will run this on Midnight, ARSC's Sun cluster, using Midnight's X4600 nodes, which have eight dual-core sockets (16 cores each). We will run two simultaneous HPL jobs on a node, four sockets per job. Getting these jobs set up requires some careful cooperation between PBS and the launch of the executable. The PBS script executes mpirun for each job in a "for" loop, and each mpirun in turn executes a script that sets up processor affinity so the MPI tasks and threads are placed on the intended cores.

Here is the PBS script. We launch 2 jobs, each running 4 MPI tasks, not 8, because two BLAS threads are created for each MPI task, one per core, for a total of 8 threads per job and 16 threads for the two jobs. For the same reason, we request ncpus=8, not 16, for the two jobs. We launch the HPL executable from the socket_hpl.sh script which is passed the job number (either 0 or 1 in this example) and the number of tasks per job (4).


#!/bin/ksh 
#PBS -N mult_hpl_jobs
## We're using 2 GotoBLAS threads per task so ncpus is half the core count
#PBS -l select=1:ncpus=8:node_type=16way
#PBS -l walltime=0:30:00
#PBS -q standard
#PBS -j oe

#
# Set up
SOCKETS=8
JOBS_PER_NODE=2
let TASKS_PER_JOB=$SOCKETS/$JOBS_PER_NODE
OUTPUT_PREFIX=output_${PBS_JOBID}
[ ! -z $PBS_O_WORKDIR ] && cd $PBS_O_WORKDIR

#
#  Launch HPL jobs.  The socket_hpl.sh script computes core
#  numbers for a job and execs the executable on those processors.
let lastjob=$JOBS_PER_NODE-1
for job in `seq 0 $lastjob`; do
    RUN_CMD="mpirun -np $TASKS_PER_JOB -noaffin ./socket_hpl.sh $job $TASKS_PER_JOB"
    echo "Running: $RUN_CMD"
    $RUN_CMD > ${OUTPUT_PREFIX}.${job} 2>&1 &
done

#
# Wait for each mpirun to finish
wait

#
# Peek at results
grep -h WR13R2R4 ${OUTPUT_PREFIX}.*

exit 0

######################################################################

The script socket_hpl.sh is next. Each MPI rank runs this script when mpirun is executed from the PBS script. Given the job index, the number of MPI tasks per job, and the index of the invoking MPI task (passed in via the automatically set variable $MPIRUN_RANK), the script calculates which cores will run the two threads for that rank and launches the xhpl executable under the taskset utility to place the threads on those cores.
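As a concrete check of the arithmetic, consider rank 2 of job 1 in this example (4 tasks per job, 2 threads per task): starting_core = 1*4*2 + 2*2 = 12 and ending_core = 12 + 2 - 1 = 13, so that rank's two threads are bound to cores 12-13.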


#!/bin/ksh 
#
# Parameters:
#   $1 - Job number (of all jobs per node)
#   $2 - MPI tasks per job
#
# GOTO_NUM_THREADS tells the GotoBLAS library how many threads per task.
# Note that the PBS script has got to help out a little by requesting only 
# half the cpus per node (ncpus).  On Midnight, the mpirun must also be
# told to let us handle processor affinity via the -noaffin flag
# because we're going to set it ourselves with taskset.
export GOTO_NUM_THREADS=2

#
# Given the job number and number of tasks per job,
# compute the core set for each thread.  
if [[ $# != 2 ]]; then
    print "Usage: $0 <job> <MPI tasks per job>"
    exit 2
else
    job=$1
    tasks=$2
    let starting_core=$job*$tasks*$GOTO_NUM_THREADS+$MPIRUN_RANK*$GOTO_NUM_THREADS
    let ending_core=$starting_core+$GOTO_NUM_THREADS-1
    cpulist="$starting_core-$ending_core"
fi

#
# Launch the executable using taskset to specify the cores of this job.
print "Binding cores $cpulist on `uname -n` for job $job"
exec /usr/bin/taskset -c $cpulist ./xhpl

######################################################################

Here is the HPL.dat file, for readers who are familiar with HPL. The key parameters are N, the problem size, and P and Q, the dimensions of the process grid. Again, P*Q is 4, not 8, because two BLAS threads are created for each MPI task, one per core, for a total of 8 threads per job.


HPLinpack benchmark input file 
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
20000        # N
1            # of NBs
232          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
3            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
232           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

######################################################################

Here is the output from a run of the PBS job:


Currently Loaded Modulefiles:
  1) voltairempi-S-1.pathcc   3) PrgEnv.path-3.2
  2) pathscale-3.2
0 news items.
Running: mpirun -np 4 -noaffin ./socket_hpl.sh 0 4
Running: mpirun -np 4 -noaffin ./socket_hpl.sh 1 4
WR13R2R4       20000   232     2     2             159.35          3.347e+01
WR13R2R4       20000   232     2     2             161.29          3.307e+01

######################################################################

Hopefully, users who run modestly scaling codes that use MPI, OpenMP, or a hybrid of the two can envision how this example might be adapted to their situation and enable them to double up a node. As Don's article points out, the Midnight nodes are equipped with 4 GB of memory per core, so running multiple jobs on a node is certainly feasible for many applications. The PBS, socket_hpl.sh and HPL.dat files can easily be modified to run more jobs on an X4600 node. I ran eight 2-core HPL jobs (also using N=20000) which, interestingly, gave slightly better total throughput than the two 8-core jobs in this example. Modifying the PBS script and the HPL.dat file for that case is left as an exercise; a sketch of the key parameter changes follows.
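For readers who want a head start on that exercise, here is one plausible sketch of the changes (not a verified run script; it simply follows the same core arithmetic as above). In the PBS script, only the job count changes, and the tasks-per-job value falls out of the existing division:

#
# Eight 2-core jobs on a 16-core node: 8 sockets / 8 jobs = 1 MPI task
# per job, and each task still spawns 2 GotoBLAS threads, so all 16
# cores are used.  The total MPI task count is still 8, so ncpus=8
# remains correct.
SOCKETS=8
JOBS_PER_NODE=8
let TASKS_PER_JOB=$SOCKETS/$JOBS_PER_NODE    # comes out to 1

In HPL.dat, the process grid must match the single MPI task per job, so Ps and Qs both become 1 (P*Q = 1).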

On a final note, I haven't figured out how to make this work on the Cray XT5. If the jobs did not need to be launched by aprun, the Cray counterpart to mpirun, the setup would be straightforward, as in Don's article. However, MPI jobs must be launched by aprun and, examples on the Internet notwithstanding, I can't make it work on Pingo. Any reader who can send in a solution for Pingo will have their newsletter subscription extended for a year, absolutely free!

Quick-Tip Q & A


A:[[ Every so often I find myself needing to verify that a file is the same 
  [[ on two different hosts.  Usually what I end up doing is performing an 
  [[ "md5sum" command on the file on each of the two hosts, then visually 
  [[ skimming the md5 checksums to make sure they look similar.
  [[ 
  [[ This seems sloppy to me though.  It's possible that the two checksums 
  [[ could look similar enough that a hasty glance would lead me to believe 
  [[ they are identical when they are not.  Is there a way to perform some 
  [[ kind of remote "diff" command between the two hosts instead?

#
# Dan Stahlke provided the following diff tips:
#

Just run the results of the md5sum through diff like so:

diff --brief <(ssh host1.edu "md5sum < file1.txt") <(ssh host2.edu "md5sum < file2.txt")

If they differ, diff will give a failure return code and will print a 
message like this:

Files /proc/self/fd/63 and /proc/self/fd/62 differ

Of course, the md5sum is optional and is only there to save network 
bandwidth.  You could also just do:

diff --brief <(ssh host1.edu cat file1.txt) <(ssh host2.edu cat file2.txt)

I learned about the "<(cmd)" trick from the excellent commandlinefu site:
 
http://www.commandlinefu.com/commands/browse/sort-by-votes

#
# In matters of Linux there is never just one solution, as Jed Brown makes 
# clear with the following examples:
#

A straightforward remote diff is

 ssh remote-host cat path/to/file | diff local/path/to/file -

If you just want to check whether the files are the same, I recommend
using rsync, something like

 rsync --dry-run --itemize-changes --checksum local/path/to/file remote-host:path/to/file

See the man page to interpret the summary of differences.  It easily
extends to diffing trees, perhaps subject to various filters.
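
For instance, a sketch of the tree case (paths are placeholders; -r adds recursion to the same checksum-based dry run):

 rsync -r --dry-run --itemize-changes --checksum local/path/to/dir/ remote-host:path/to/dir/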

If the file names are the same, the simple

 ssh remote '(cd path/to; sha1sum file)' | (cd local/path/to; sha1sum -c)

will work; it gets messier if the file names are different (bash syntax):

 trim(){ cut -d\  -f1;}
 diff <(ssh remote-host sha1sum path/to/file | trim) <(sha1sum local/path/to/file | trim)

#
# Bill Homer submitted a similar rsync command, using two verbose (-v) 
# flags instead of Jed's --itemize-changes to report whether a file is up 
# to date:
#

rsync -v -v -n -c $this_file $that_file

#
# Rahul Nabar pointed out that ssh's output can be piped to the diff 
# command, similar to Jed's example, but you can also use vimdiff to 
# streamline the process:
#

vimdiff scp://remotemachine/remotefile localfile


Q: I'm trying to create a tar file of a directory on my system.  
Unfortunately, the disk is so full that I can't actually create the tar 
file without filling the disk.

Ultimately I would like to get a gzipped tar file on my other system.  I 
would rather not have to copy over the directory tree in order to get 
enough space.

Is there a way I can create the tar file on the remote system without 
copying the directory tree to the remote system first?
 

 [[ Answers, Questions, and Tips Graciously Accepted ]]



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Editors:
--------
  Ed Kornkven, ARSC HPC Specialist, kornkven@arsc.edu, 907-450-8669
  Craig Stephenson, ARSC User Consultant, cstephen@arsc.edu, 907-450-8653

Subscription Information:
-------------------------
  Subscribing and unsubscribing:
    http://www.arsc.edu/support/news/newslettersubscribe.html

  Quick tip answers and other correspondence:
    HPCNewsletter@arsc.edu

Back Issues are Available:
--------------------------
  Web edition:   
    http://www.arsc.edu/support/news/HPCnews.shtml

  E-mail edition archive:
    ftp://ftp.arsc.edu/pub/publications/newsletters/

-----------------------------------------------------------------------
Arctic Region Supercomputing Center          ARSC HPC Users' Newsletter
-----------------------------------------------------------------------
