ARSC HPC Users' Newsletter 305, December 3, 2004

ARSC Director Named CASC Chair

Arctic Region Supercomputing Center (ARSC) Director Frank Williams has been elected chair of the Coalition for Academic Scientific Computation (CASC). CASC is a nonprofit organization of supercomputing centers, research universities and federal laboratories that offer leading edge hardware, software, and expertise in high performance computing resources and advanced visualization environments. Founded in 1989, CASC has grown into a national association representing 41 centers and programs in 28 states.

As chair, Williams will lead the organization in promoting scientific computing in academic centers across the country, and providing a forum for discussion of and possible action on opportunities and challenges the members share. Additionally, ARSC will host the group's annual meeting this spring, bringing approximately 60 visitors to Fairbanks and UAF.

IBM: Running Jobs Interactively with LoadLeveler

IBM's Parallel Operating Environment (poe) allows jobs to be run interactively using resources administered by LoadLeveler. In Issue 303 of the HPC Users' Newsletter we showed how this functionality could be used in conjunction with Totalview for debugging purposes. In this issue we take a more general look.

First, create a LoadLeveler script.

iceberg2 1% cat llinter-mpi
# @ job_type         = parallel
# @ node             = 2
# @ tasks_per_node   = 8
# @ network.MPI      = sn_single,shared,us
# @ node_usage       = not_shared
# @ class            = standard
# @ wall_clock_limit = 1800
# @ queue

There are a few things to keep in mind when writing your script:

  1. The 'output' and 'error' keywords are not valid with interactive work. If you want to save stdout and/or stderr from your interactive work, you will need to do it manually.

  2. Using a short 'wall_clock_limit' will increase the likelihood that your job will be able to take advantage of LoadLeveler's backfill scheduling algorithm and therefore run.

    NOTE: Interactive work will not wait in the queue for resources to become available. If insufficient resources are available to start the job you may get an error message like the following:

        iceberg2 76% ./hello_mpi -llfile ./llinter
        ERROR: 0031-365  LoadLeveler unable to run job, reason:
        LoadL_negotiator: 2544-870 Step b1n2.20740.0 was not considered to
        be run in this scheduling cycle due to its relatively low priority
        or because there are not enough free resources.

  3. Requesting fewer nodes will also improve the likelihood that your job will be scheduled promptly.

  4. Environment variables cannot be set in an interactive LoadLeveler script.

Next we run the application.

If you built the executable with the MPI compilers (e.g., mpcc_r, mpxlf_r, etc.), you can launch it using the "-llfile" option. E.g.,

  iceberg2 2% ./hello_mpi -llfile ./llinter-mpi
  Hello from proc 8 of 16. Name=b5n6
  Hello from proc 0 of 16. Name=b7n9
  Hello from proc 9 of 16. Name=b5n6
  Hello from proc 1 of 16. Name=b7n9
  Hello from proc 10 of 16. Name=b5n6
  Hello from proc 2 of 16. Name=b7n9
  Hello from proc 11 of 16. Name=b5n6
  Hello from proc 3 of 16. Name=b7n9
  Hello from proc 12 of 16. Name=b5n6
  Hello from proc 13 of 16. Name=b5n6
  Hello from proc 4 of 16. Name=b7n9
  Hello from proc 14 of 16. Name=b5n6
  Hello from proc 15 of 16. Name=b5n6
  Hello from proc 5 of 16. Name=b7n9
  Hello from proc 6 of 16. Name=b7n9
  Hello from proc 7 of 16. Name=b7n9
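
The "of 16" in the output above follows from the script's node and tasks_per_node keywords (2 x 8). As a minimal sketch, the shell functions below read those two keywords from a LoadLeveler script and multiply them. The function names ll_keyword and ll_total_tasks are made up for this illustration and are not LoadLeveler commands, and the parsing assumes the simple "# @ keyword = value" layout shown earlier.

```shell
#!/bin/sh
# ll_keyword FILE KEY
# Print the value of the first "# @ KEY = value" line in FILE.
# (Hypothetical helper; assumes one keyword per line, as in the script above.)
ll_keyword() {
    awk -F= -v key="$2" '
        /^[ \t]*#[ \t]*@/ {
            name = $1
            sub(/^[ \t]*#[ \t]*@[ \t]*/, "", name)   # strip the "# @" prefix
            sub(/[ \t]*$/, "", name)                  # strip trailing blanks
            if (name == key) {
                value = $2
                gsub(/^[ \t]+|[ \t]+$/, "", value)    # trim the value
                print value
                exit
            }
        }' "$1"
}

# ll_total_tasks FILE
# Print node * tasks_per_node for the given LoadLeveler script.
ll_total_tasks() {
    nodes=$(ll_keyword "$1" node)
    tpn=$(ll_keyword "$1" tasks_per_node)
    echo $(( nodes * tpn ))
}

# Demonstration on a copy of the llinter-mpi script from this article.
demo=$(mktemp)
cat > "$demo" <<'EOF'
# @ job_type         = parallel
# @ node             = 2
# @ tasks_per_node   = 8
# @ network.MPI      = sn_single,shared,us
# @ node_usage       = not_shared
# @ class            = standard
# @ wall_clock_limit = 1800
# @ queue
EOF
echo "total tasks: $(ll_total_tasks "$demo")"
rm -f "$demo"
```

Note that the keyword match is exact, so "node" does not pick up the "node_usage" line.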

You might see some warnings as LoadLeveler is starting the application. These can be disregarded. E.g.:

  ATTENTION: 0031-408  16 tasks allocated by LoadLeveler, continuing...
  ATTENTION: 0031-722  can't set priority to 0

If you think you will be repeatedly running your application you could alternatively use the MP_LLFILE environment variable to set the name of the LoadLeveler script. E.g.:

  iceberg2 3% setenv MP_LLFILE /wrkdir/username/llinter-mpi
  iceberg2 4% ./hello_mpi
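
Because the 'output' and 'error' keywords are unavailable for interactive work, stdout and stderr have to be captured by hand. The following is a minimal Bourne-shell sketch of one way to do that with tee. The function name run_logged and the log file names are invented for this example, and we have not verified how poe behaves when its output is piped, so treat it as a starting point. (The prompts in this article are csh, where "command |& tee file" has the same effect.)

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Run COMMAND, showing stdout and stderr on the terminal while also
# saving both streams to LOGFILE. (Hypothetical helper for illustration.)
run_logged() {
    logfile=$1
    shift
    "$@" 2>&1 | tee "$logfile"
}

# Stand-in command so the sketch runs anywhere:
run_logged "${TMPDIR:-/tmp}/run_logged_demo.log" echo "Hello from proc 0 of 16."

# With the executable from this article, the call might look like:
#   run_logged hello_mpi.log ./hello_mpi -llfile ./llinter-mpi
```

The exit status seen by the caller is that of tee, not of the application, so check the log if you need to know whether the run succeeded.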

Jobs not compiled with the MPI compilers (for instance, serial or OpenMP jobs) must be launched with the poe command. E.g.,

  iceberg2 5% setenv MP_LLFILE /wrkdir/username/llinter-omp
  iceberg2 6% poe env OMP_NUM_THREADS=8 ./omp-hello
   HELLO FROM  0  of  8  threads
   HELLO FROM  7  of  8  threads
   HELLO FROM  1  of  8  threads
   HELLO FROM  5  of  8  threads
   HELLO FROM  6  of  8  threads
   HELLO FROM  3  of  8  threads
   HELLO FROM  2  of  8  threads
   HELLO FROM  4  of  8  threads

Here the env call is used to set the environment variable OMP_NUM_THREADS for the executable omp-hello. Below is the LoadLeveler script we used with omp-hello.

  iceberg2 7% cat llinter-omp
  # @ job_type         = parallel
  # @ node             = 1
  # @ tasks_per_node   = 1
  # @ node_usage       = not_shared
  # @ class            = standard
  # @ wall_clock_limit = 1800
  # @ queue
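
Since environment variables cannot be set in an interactive LoadLeveler script, the env command does the work on the command line. The short sketch below shows the general behavior of env, independent of poe: the variable is set for the launched command only, and the invoking shell is left unchanged. A child "sh -c" stands in for omp-hello here.

```shell
#!/bin/sh
# Start from a clean slate so the effect of env is visible.
unset OMP_NUM_THREADS

# env sets the variable only in the environment of the command it launches.
child_view=$(env OMP_NUM_THREADS=8 sh -c 'echo "$OMP_NUM_THREADS"')

# The invoking shell itself is unaffected.
parent_view=${OMP_NUM_THREADS:-unset}

echo "child sees:  $child_view"
echo "parent sees: $parent_view"
```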

Since interactive jobs are invoked from a front-end node the poe process will actually reside on the front-end. This allows one to attach a debugger (such as pdbx) to the poe process. (Discussion of this will be left for a future newsletter article.) It also allows the user to send a signal to the job. These capabilities can be incredibly useful.

Here's an example of how signals can be used to cause a core dump of an interactive LoadLeveler executable. This might be used if the code has deadlocked, for example. First we find the process id for the poe process and then send the signal:

  iceberg2 7% ps -u $USER | grep poe
   1234  925804  pts/5  0:00 poe
  iceberg2 8% kill -QUIT 925804

In the working directory there should now be a "coredir" directory for each process that was running. The core files within these directories can be inspected with your favorite debugger.
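
If you do this often, the process id can be picked out of the ps listing automatically. The sketch below is one possible approach. It assumes the column layout of the sample ps line above (process id in the second column, command name last), and the function name poe_pids is invented for this example.

```shell
#!/bin/sh
# poe_pids
# Read ps-style lines on stdin and print the second column (the PID in
# the sample above) of every line whose last column is exactly "poe".
poe_pids() {
    awk '$NF == "poe" { print $2 }'
}

# Demonstration on the sample line from this article:
printf '   1234  925804  pts/5  0:00 poe\n' | poe_pids

# Live use might look like this (column layout is an assumption):
#   ps -u "$USER" | poe_pids
#   kill -QUIT <pid printed above>
```

Matching the command column exactly avoids picking up the grep process itself or other commands whose names merely contain "poe".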

X1 Batch Jobs using pat_hwpc Can't be Checkpointed

X1 users: please don't use pat_hwpc in long-running jobs. The system will be unable to checkpoint these jobs in preparation for scheduled downtimes, and the jobs will be lost. Gathering performance statistics on short production runs and test runs is a great idea... and you have our strong encouragement to do so. If you need help interpreting the output from pat_hwpc, and/or assistance optimizing your X1 codes, please contact

ScaLAPACK Intro: Part II of IV

In this article, we'll show how a program can configure processes into a BLACS 2-dimensional process grid, in preparation for calling solvers and other functions from the ScaLAPACK library.

The concept of a virtual processor grid was described in Part I of the series, which appeared in the last issue. There are multiple ways to perform the task, and you can readily find other sample programs and tutorials on the web. The ScaLAPACK User Guide, available on-line and in print, is the ultimate authority.

In the example given here, the following three steps are used to create the processor grid:

  1. Determine the total number of processors with a call to BLACS_PINFO.

    Remember that BLACS is a library of lower-level communication routines, which, like MPI, provides basic functions which can be used by high-level libraries, like ScaLAPACK. The BLACS routines, like MPI routines, are also directly available to the user program and, in fact, you have no choice but to use some of them.

    In the sample code, each process makes the following call to discover which process it is in the startup 1D process grid, and how many processes there are in total:

          call blacs_pinfo( iam, nprocs )

  2. Next, manually factor the number of processors into a rectangle as close to square as possible. I've borrowed a function written by Carlo Cavazzoni of CINECA to do this (it's not done by BLACS).

          call gridsetup (nprocs,nprow,npcol)

    This computes the number of rows and columns we'll have in the process grid, such that NPROW x NPCOL == NPROCS.

  3. Inform BLACS of this grid extent:

          call blacs_get( -1, 0, context )
          call blacs_gridinit( context, 'r', nprow, npcol )

    MPI has communicators, BLACS has contexts. For this example, the default context, which includes all processors, is all we need. The BLACS_GET call simply returns a handle to the default context for use in subsequent BLACS calls. BLACS_GRIDINIT informs BLACS of the grid extent which we computed in step 2. Behind the scenes, BLACS then assigns processors to points in this virtual grid.

The grid is now defined.
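
The factorization in step 2 can be tried out without launching a parallel job. The following shell sketch mirrors the logic of the gridsetup routine in the Fortran listing further down: it keeps the largest divisor of the process count that does not exceed floor(sqrt(nprocs)) + 1. The function name grid_setup is ours, and this is a port for experimentation, not part of BLACS or ScaLAPACK.

```shell
#!/bin/sh
# grid_setup NPROCS
# Print "NPROW NPCOL" such that NPROW * NPCOL == NPROCS, following the
# same rule as the Fortran gridsetup routine in this article.
grid_setup() {
    nproc=$1

    # s = floor(sqrt(nproc)), found with integer arithmetic only.
    s=1
    while [ $(( (s + 1) * (s + 1) )) -le "$nproc" ]; do
        s=$(( s + 1 ))
    done
    limit=$(( s + 1 ))

    # Keep the largest divisor of nproc that is <= limit.
    nprow=1
    i=1
    while [ "$i" -le "$limit" ]; do
        if [ $(( nproc % i )) -eq 0 ]; then
            nprow=$i
        fi
        i=$(( i + 1 ))
    done

    echo "$nprow $(( nproc / nprow ))"
}

for n in 1 3 4 12; do
    echo "$n processes -> $(grid_setup "$n")"
done
```

A prime process count such as 3 can only be arranged as a single row, which is why runs on prime counts produce a 1 x N grid.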

Each process in the example program now turns around and asks BLACS_GRIDINFO to repeat back what we've told it, and to give us one new piece of information: the process's own coordinates (row and column) in the grid. In this call, all variables are output except "context":

   call blacs_gridinfo( context, nprow, npcol, myrow, mycol )

When a program is done with a BLACS context, it should release it with BLACS_GRIDEXIT. When it's done with BLACS altogether, it should call BLACS_EXIT. (These calls are similar to the MPI functions, MPI_Comm_free and MPI_Finalize.) The sample program makes these calls, and terminates. Here's our parallel solver thus far:

      program slv
! Test program, sets up and solves a simple system of equations
! using ScaLAPACK routines.
      implicit none
      integer   ::       n,istat,info,i,j
      real,dimension(:,:),allocatable  :: a
      real,dimension(:,:),allocatable  :: c
      integer,dimension(:),allocatable :: ipiv

      parameter (n = 13)

      integer   ::       context, iam, pnum
      integer   ::       mycol, myrow, nb
      integer   ::       npcol, nprocs, nprow
      integer   ::       l_nrowsa,l_ncolsa
      integer   ::       l_nrowsc,l_ncolsc

      integer,parameter :: descriptor_len=9
      integer   ::       desca( descriptor_len )
      integer   ::       descc( descriptor_len )

      integer   ::       numroc

      interface
        subroutine say_hello ( context,                          &
                     iam,nprocs,myrow,nprow,mycol,npcol)
        implicit none
        integer :: context
        integer :: iam,nprocs,myrow,nprow,mycol,npcol
        integer :: i,j,pnum
        end subroutine say_hello
      end interface 
! -----    Initialize the blacs.  Note: processors are counted starting at 0.
      call blacs_pinfo( iam, nprocs )
! -----    Set the dimension of the 2d processors grid.
      call gridsetup(nprocs,nprow,npcol)
! -----    Initialize a single blacs context.  Determine which processor I 
!          am in the 2D processor grid.
      call blacs_get( -1, 0, context )
      call blacs_gridinit( context, 'r', nprow, npcol )
      call blacs_gridinfo( context, nprow, npcol, myrow, mycol )

      call say_hello (context,                     &
                     iam,nprocs,myrow,nprow,mycol,npcol)
! -----    Exit BLACS cleanly -----
      call blacs_gridexit( context )
      call blacs_exit( 0 )

      end program slv
      subroutine gridsetup(nproc,nprow,npcol)
! This subroutine factorizes the number of processors (nproc)
! into nprow and npcol,  that are the sizes of the 2d processors mesh.
! Written by Carlo Cavazzoni
      implicit none
      integer nproc,nprow,npcol
      integer sqrtnp,i

      sqrtnp = int( sqrt( dble(nproc) ) + 1 )
      do i=1,sqrtnp
        if(mod(nproc,i).eq.0) nprow = i
      end do
      npcol = nproc/nprow

      end subroutine gridsetup

      subroutine say_hello ( context,                          &
                     iam,nprocs,myrow,nprow,mycol,npcol)
! Each processor identifies itself and its place in the processor grid
        implicit none
        integer :: context
        integer :: iam,nprocs,myrow,nprow,mycol,npcol
        integer :: i,j,pnum

        do i=0,nprocs-1
          call blacs_barrier (context, 'a')

          if (iam.eq.i) then
            write(6,100) iam,nprocs,myrow,nprow,mycol,npcol

100         format(" PE=",i2,":",i2," PROW=",i2,":",i2," PCOL=",i2,":",i2)
          end if
        end do

        call flush(6_4)
        call blacs_barrier (context, 'a')
      end subroutine say_hello

Here's some sample output. After the grid creation procedure, each process calls the "say_hello" routine which causes it to dump information about itself, like this:

   PE= 7:12 PROW= 2: 4 PCOL= 1: 3

Interpret as follows: process 7 of 12 wrote the above line, and in the processor grid, it has been defined as row 2 of 4 and column 1 of 3. Here's full output from runs on 1, 3, 4, and 12 processors:

  KLONDIKE$ aprun -n 1 ./slv_part2
   PE= 0: 1 PROW= 0: 1 PCOL= 0: 1

  KLONDIKE$ aprun -n 3 ./slv_part2
   PE= 0: 3 PROW= 0: 1 PCOL= 0: 3
   PE= 1: 3 PROW= 0: 1 PCOL= 1: 3
   PE= 2: 3 PROW= 0: 1 PCOL= 2: 3

  KLONDIKE$ aprun -n 4 ./slv_part2
   PE= 0: 4 PROW= 0: 2 PCOL= 0: 2
   PE= 1: 4 PROW= 0: 2 PCOL= 1: 2
   PE= 2: 4 PROW= 1: 2 PCOL= 0: 2
   PE= 3: 4 PROW= 1: 2 PCOL= 1: 2

  KLONDIKE$ aprun -n 12 ./slv_part2
   PE= 0:12 PROW= 0: 4 PCOL= 0: 3
   PE= 1:12 PROW= 0: 4 PCOL= 1: 3
   PE= 2:12 PROW= 0: 4 PCOL= 2: 3
   PE= 3:12 PROW= 1: 4 PCOL= 0: 3
   PE= 4:12 PROW= 1: 4 PCOL= 1: 3
   PE= 5:12 PROW= 1: 4 PCOL= 2: 3
   PE= 6:12 PROW= 2: 4 PCOL= 0: 3
   PE= 7:12 PROW= 2: 4 PCOL= 1: 3
   PE= 8:12 PROW= 2: 4 PCOL= 2: 3
   PE= 9:12 PROW= 3: 4 PCOL= 0: 3
   PE=10:12 PROW= 3: 4 PCOL= 1: 3
   PE=11:12 PROW= 3: 4 PCOL= 2: 3
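
The listings above are consistent with the row-major ('r') ordering requested in the BLACS_GRIDINIT call: the row is the process number divided by NPCOL, and the column is the remainder. The shell sketch below works that arithmetic for the 12-process case. It is our reading of the output, not a BLACS routine, and the function name grid_coords is invented for the example.

```shell
#!/bin/sh
# grid_coords PE NPCOL
# Print "MYROW MYCOL" for process number PE in a row-major grid that
# has NPCOL columns.
grid_coords() {
    echo "$(( $1 / $2 )) $(( $1 % $2 ))"
}

# Work through the 12-process run (4 rows x 3 columns), using the same
# layout as the say_hello output.
pe=0
while [ "$pe" -lt 12 ]; do
    set -- $(grid_coords "$pe" 3)
    printf ' PE=%2d:%2d PROW=%2d:%2d PCOL=%2d:%2d\n' "$pe" 12 "$1" 4 "$2" 3
    pe=$(( pe + 1 ))
done
```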

Here's one way to compile the sample program:

IBM Cluster:

   %  mpxlf90_r -q64  -qextname -qsuffix=cpp=f90 -qipa -c slv.f90
   %  mpxlf90_r -q64  -qipa -qextname  -o slv slv.o     \
                  -lblacssmp -lesslsmp -lpesslsmp

Cray X1:

   %  ftn -O ssp -o slv slv.f90

Part III of this series tackles the hard part: distributing the data arrays in block-cyclic fashion.

Preview of Cray PAT Training

An important goal of next week's training is to simplify the use of the X1 performance analysis tool, CrayPAT. The training will center on 4 or 5 simple scripts which help perform one analysis task each.

For instance, here's a script to produce a basic calling tree of your code. Script Name: "patcalltreefunc."


#!/bin/ksh
#
# This script generates a PAT report.
# Tom Baring, ARSC, Dec 2004
#
# NOTE: the "#!" line, the PAT_RT_EXPERIMENT value, and the EXEFILE,
# XFFILE and RPTFILE assignments are inferred from the sample session
# that follows this listing.

PAT_RT_EXPERIMENT=samp_cs_time

SCRPT=$(basename ${0})
SYNTX="Syntax: $SCRPT [-h] <instrumented_exec_name> <name_of_xf_file>"
case $1 in
  "-h" )
    echo "$SYNTX"
    echo "Preparation:"
    echo  " 1. Instrument the executable: pat_build <executable_name> <instrumented_exec_name>"
    echo  " 2. In the PBS script (or interactive environment):"
    echo  "    [ksh]  export PAT_RT_EXPERIMENT=$PAT_RT_EXPERIMENT"
    echo  "    [csh]  setenv PAT_RT_EXPERIMENT $PAT_RT_EXPERIMENT"
    echo  " 3. Run the instrumented executable. E.g., mpirun -np 4 ./<instrumented_exec_name>"
    echo  " 4. This produces the .xf file needed by this script "
    exit 0 ;;
esac

EXEFILE=$1
XFFILE=$2
RPTFILE=${XFFILE}.${SCRPT}.rpt

D="-d samples%"
B="-b ssp,calltree,function"
S="-s percent=relative"

echo "Generating report:   ${RPTFILE}"
pat_report $D $B $S -i $EXEFILE -o $RPTFILE $XFFILE

The script's "-h" option gives information to help you get started:

  KLONDIKE$ ./patcalltreefunc -h
  Syntax: patcalltreefunc [-h] <instrumented_exec_name> <name_of_xf_file>
  Preparation:
   1. Instrument the executable: pat_build <executable_name> <instrumented_exec_name>
   2. In the PBS script (or interactive environment):
      [ksh]  export PAT_RT_EXPERIMENT=samp_cs_time
      [csh]  setenv PAT_RT_EXPERIMENT samp_cs_time
   3. Run the instrumented executable. E.g., mpirun -np 4 ./<instrumented_exec_name>
   4. This produces the .xf file needed by this script 

Once you've accomplished step 3, you run the script against the .xf file generated. This creates the performance report, which is given a descriptive, if obnoxiously long, name:

  KLONDIKE$ ./patcalltreefunc slv.inst slv.inst+103720sdt.xf
  Generating report:   slv.inst+103720sdt.xf.patcalltreefunc.rpt
  KLONDIKE$ more slv.inst+103720sdt.xf.patcalltreefunc.rpt
  Table 1:  -d samples%
            -b ssp,calltree,function

A nice thing about CrayPAT is that by tweaking the parameters to pat_report, you can generate multiple reports from the same .xf file, to expose different features of the run.

Fall Training: Dec 7, Cray PAT; Dec 14, AVS/Express

The conclusion of ARSC's Fall Training calendar:

Title:      Cray Performance Analysis using CrayPAT
Instructor: Tom Baring
Date:       Dec 7 at 1pm
Location:   WRRB 009

Title:      AVS/Express Visualization Quickstart
Instructor: Roger Edberg
Date:       Dec 14 at 1pm
Location:   WRRB 009

Complete training schedule:

Quick-Tip Q & A

A:[[ I was recently on a walk and discovered a phone number, that I 
  [[ wanted to remember, posted on a tree.  I had time to cogitate, but
  [[ nothing on which to write.
  [[  Do you have any tips, methods, or systems for memorizing numbers?
  [[ If you want, you could demonstrate your system using one or more of
  [[ the following numbers (these are the ARSC Consult line, our latitude,
  [[ and the ISBN for everyone's favorite book, "Parallel Programming in
  [[ OpenMP").
  [[  907-450-8602
  [[  64 deg. 51.590 min.
  [[  1-56592-054-6

# Ed Anderson
I find it easier to remember 2 or 3 numbers than 4 or more.  So, for the
ARSC Consulting line (a fascinating number, nearly an anagram of 0-9,
with no number repeated except 0) I would try to remember "450", and
then 907 is just 2*450 + 7 (a very common number, 7, if you play board
games with 2 dice on those cold Arctic nights) and 8602 = 450*19 + 52
(where of course 52 is the number of cards in a deck).

By the way, 1-56592-054-6 is the ISBN of "Learning the Korn Shell", by
Bill Rosenblatt.  The ISBN for Chandra et al. is 1-55860-671-8.
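
Ed's decomposition of the Consult line is easy to check with shell arithmetic. The variable names below are ours.

```shell
#!/bin/sh
# Rebuild 907-450-8602 from the single remembered number, 450.
base=450
area=$(( 2 * base + 7 ))      # two dice: 7
line=$(( base * 19 + 52 ))    # a deck of cards: 52

echo "$area-$base-$line"
```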

# Guy Robinson
Joking And Helping, Each Faultlessly Answers Intelligently, Graciously,
And Completely.

jah efa igac, where a=0 etc.

907 450 8602, or y0r gl0 tm0c as the buttons might mean.
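
Guy's letter code (a=0, b=1, ... j=9) can be undone mechanically. A small sketch follows, with function names invented for the example. The first function pulls the initial letters out of the mnemonic sentence, and the second turns letters back into digits.

```shell
#!/bin/sh
# initials "SENTENCE"
# Print the first letter of each word, lower-cased (letters A-J only,
# which is all this code needs).
initials() {
    for word in $1; do
        printf '%.1s' "$word"
    done | tr 'ABCDEFGHIJ' 'abcdefghij'
}

# decode_letters "TEXT"
# Map a=0, b=1, ... j=9 and drop every other character.
decode_letters() {
    printf '%s' "$1" | tr 'abcdefghij' '0123456789' | tr -cd '0123456789'
}

sentence="Joking And Helping, Each Faultlessly Answers Intelligently, Graciously, And Completely."
letters=$(initials "$sentence")
echo "letters: $letters"
echo "digits:  $(decode_letters "$letters")"
```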

# The Editor
Lacking any particular associations, like, hey, that number just happens
to be my birthday, the approach that works for me is rote memorization
of different, overlapping patterns and subsequences.  It's easy to
mistakenly swap digits or invert patterns, but overlapping of rules
helps prevent this.

Forced to remember "1-56592-054-6,"  I'd work on memorizing:
  -the fact that digit number 1 is the actual number 1
  -the fact that there are 10 digits, total
  -the three pairs of numbers which start with 5_:
  -the two pairs of numbers which end with _5:
  -the three coherent subsequences:
  -the rhythm of the entire sequence 

# Ed Kornkven
I learned a device for memorizing long numbers back in, well, I don't
recall the year at the moment.  But it was from a book called "The
Memory Book," by Harry Lorayne and Jerry Lucas.

Their system assigns specific consonant sounds to the digits.  From
those sounds, words are made, and word pictures from those words.  The
more unusual the word pictures, the more memorable.  From the word
pictures, we can get the words back, then the list of consonant sounds,
and then the digits.

Their complete mapping is as follows: 
  1: T, or D
  2: N
  3: M
  4: R
  5: L
  6: J, SH, CH, or soft G
  7: K, hard C, or hard G
  8: F, V, or PH
  9: P, or B
  0: Z, S, or soft C

Thus, the ISBN of 1-56592-054-6 would be translated to
        {T,D} L {J,SH,CH} L {P,B} N {Z,S} L R {J,SH,CH}

Let's see about some words...
        doll shawl pins large

Or how about
        tall chalupa nozzle rash

You can probably come up with an outrageous picture for remembering
these words.  No?  Well, check out the book for a fuller explanation of
the system and the foundation behind it.
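
For the mechanically minded, the digit-to-sound step can be scripted, leaving only the creative part (finding words) to the human. This is a minimal sketch using the abbreviated sound sets from the translation above. The function names are ours, and it makes no attempt to go from words back to digits, since that depends on pronunciation.

```shell
#!/bin/sh
# digit_sounds DIGIT
# Print the consonant sounds for one digit, abbreviated as in the
# translation of the ISBN above.
digit_sounds() {
    case $1 in
        1) echo "{T,D}" ;;
        2) echo "N" ;;
        3) echo "M" ;;
        4) echo "R" ;;
        5) echo "L" ;;
        6) echo "{J,SH,CH}" ;;
        7) echo "{K,C,G}" ;;      # hard C and hard G
        8) echo "{F,V,PH}" ;;
        9) echo "{P,B}" ;;
        0) echo "{Z,S}" ;;
    esac
}

# number_sounds "NUMBER"
# Translate every digit of NUMBER; hyphens and other characters are skipped.
number_sounds() {
    digits=$(printf '%s' "$1" | tr -cd '0123456789')
    out=""
    while [ -n "$digits" ]; do
        rest=${digits#?}             # everything after the first digit
        d=${digits%"$rest"}          # the first digit itself
        out="$out${out:+ }$(digit_sounds "$d")"
        digits=$rest
    done
    echo "$out"
}

number_sounds "1-56592-054-6"
```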

Q: Holiday fun... Share a tip on ANY SUBJECT!  Anything!  Teach us
   something.  What do you do well?

     Cook?  Fix things?  Commute?  Choose good movies to rent?
     Volunteer?  Teach kids?  Wash dishes?  Invest?  Keep warm?
     Meditate?  Photograph?  Punctuate?  Fish?  Brew?  Travel?

  (Don't feel limited by these categories.) All tips will appear
  anonymously... try for 35 words or less per tip.  

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.