ARSC HPC Users' Newsletter 269, May 23, 2003

Cray X1 Installed at ARSC

Yesterday afternoon, ARSC's two-cabinet Cray X1 was delivered. The X1 is the latest offering from Cray, Inc. The system should be available for allocated usage by Oct. 1. Here's an image of it in our machine room:

The system is named "klondike". It took 70 hours in a semi-tractor trailer rig to cover the 3400 miles from Chippewa Falls, Wisconsin to Fairbanks, Alaska, and 6 hours to be unloaded, set up, and powered up in our machine room.

Processors are being installed in two stages. Currently, it has 64 multi-streaming processors rated at 12.8 GFLOPS each, and the second set of 64 will be installed later this summer.

Here's a 10-minute movie, courtesy of Leone Thierman of ARSC, of unloading and installation:

Stay tuned! We'll have a lot to say about the X1 in this newsletter.

SX-6 Memory Requirements for Parallel Jobs

[ Thanks to Ed Kornkven of ARSC. ]

SX-6 users may be familiar with the steps necessary to run a program under autotasking. Namely, the program must first be compiled with the -Pauto compiler option. Second, the NQS script for an autotasked batch job must specify the number of processors for NQS to allocate to the job (the -c option) as well as the number of microtasks to use in parallelizing the program (via the F_RSVTASK environment variable). These two numbers will typically be equal. Here is a simple NQS script for running "a.out" with two processors:

# # # # # # # # # # #
# Per-request CPU time
#@$-lT 20:00
# Per-request Memory
#@$-lM 64MB
# Number of CPUs to use
#@$-c 2
# Combine stderr and stdout into one file
# Name of queue
#@$-q batch
# Shell to use
#@$-s /usr/bin/csh

# Change to work directory

# Detailed hardware stats
# Number of microtasks
setenv  F_RSVTASK  2

# Execute the program
./a.out
# # # # # # # # # # #

It isn't our purpose to give an in-depth NQS tutorial here, but rather to illustrate a potential pitfall and its obscure manifestation. When this script was executed recently, it seemed to only run on one processor as evidenced by the hardware statistics that were printed. A portion of those are reproduced here:

     ******  Program Information  ******
  Real Time (sec)       :         12.620525
  User Time (sec)       :         11.534289
  Sys  Time (sec)       :          0.032625
  Vector Time (sec)     :         11.463276
  Inst. Count           :         917380607.
  V. Inst. Count        :         874561395.
  V. Element Count      :      104537180864.
  FLOP Count            :       82143494463.
  MOPS                  :       9066.878771
  MFLOPS                :       7121.678195
  MOPS   (concurrent)   :       9066.903926
  MFLOPS (concurrent)   :       7121.697953
  VLEN                  :        119.530980
  V. Op. Ratio (%)      :         99.959056
  Memory Size (MB)      :         48.000000
  Max Concurrent Proc.  :                 1.
   Conc. Time(>= 1)(sec):         11.534257

Note the last two lines. Only one processor is reporting. We were expecting to see something like:

     ******  Program Information  ******
  Real Time (sec)       :          6.091186
  User Time (sec)       :         12.125577
  Sys  Time (sec)       :          0.028495
  Vector Time (sec)     :         11.468935
  Inst. Count           :         937278529.
  V. Inst. Count        :         874561403.
  V. Element Count      :      104537182776.
  FLOP Count            :       82143574467.
  MOPS                  :       8626.385359
  MFLOPS                :       6774.405413
  MOPS   (concurrent)   :      17179.574337
  MFLOPS (concurrent)   :      13491.328818
  VLEN                  :        119.530981
  V. Op. Ratio (%)      :         99.940041
  Memory Size (MB)      :         80.000000
  Max Concurrent Proc.  :                 2.
   Conc. Time(>= 1)(sec):          6.088620
   Conc. Time(>= 2)(sec):          6.037041

In the second report we see clearly that two processors are at work and in fact giving a very nice speedup.
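
A check like this is easy to script. The following is a minimal sketch, assuming the program information has been saved to a file; the helper names max_concurrent and check_concurrency are hypothetical and are not part of any SX-6 tool:

```shell
#!/bin/sh
# Print the "Max Concurrent Proc." value found in a saved program
# information report (report format as shown above).
max_concurrent() {
    awk -F: '/Max Concurrent Proc\./ { gsub(/[ .]/, "", $2); print $2 }' "$1"
}

# Warn when a job used fewer processors than it reserved.
#   $1 = report file, $2 = number of processors requested
check_concurrency() {
    report=$1
    wanted=$2
    got=$(max_concurrent "$report")
    if [ "${got:-0}" -lt "$wanted" ]; then
        echo "WARNING: wanted $wanted processors, used ${got:-0}"
        return 1
    fi
    echo "OK: used $got processors"
}
```

With the first report saved in a file (say, the hypothetical name prog.info), "check_concurrency prog.info 2" would print the warning; with the second report it would print the OK line.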

What happened in the first run? The hint is the "Memory Size (MB)" line in each of the hardware statistics displays. The first program ran with insufficient memory for autotasking, and the only sign of this is the odd hardware statistics output. When plenty of memory was added, the program worked fine. It should be noted that 64MB is sufficient when autotasking is turned off.

How much is the "plenty" of memory that we need to use for autotasking? Well, the machine tells us: it used exactly 80MB. So we adjust our script to request 80MB instead of 64MB and the program... crashes! It also crashes with 81MB. 82MB works fine though. So "plenty" seems to mean "what the program reports plus a little more" and it is going to be more than what was required without autotasking.
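
That rule of thumb can be written down as a small helper. This is only a sketch: the function name suggest_memory is hypothetical, and the default 2 MB margin is an assumption drawn from the single experiment above, not a documented SX-6 rule.

```shell
#!/bin/sh
# Suggest a -lM value from the "Memory Size (MB)" line of a saved
# program information report: round up to a whole MB, then add a margin.
#   $1 = report file
#   $2 = margin in MB (assumed default: 2, from the 80MB -> 82MB case)
suggest_memory() {
    report=$1
    margin=${2:-2}
    awk -F: -v m="$margin" '/Memory Size \(MB\)/ {
        mb = $2 + 0                  # numeric value after the colon
        whole = int(mb)
        if (whole < mb) whole++      # round up fractional sizes
        printf "%dMB\n", whole + m
    }' "$report"
}
```

Run against a report from a job that had plenty of memory, this turns a reported size of 80.000000 into a request of 82MB. A different program may need a larger margin, so treat the result as a starting point.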

Watch your OpenMP Environment

On the Cray SV1ex the default OpenMP schedule type for parallel DO loops is DYNAMIC. From CrayDoc, the chunk size is set as follows:

  DYNAMIC is the default schedule type, depending on the type of loop, as
  follows:

    * For non-innermost loops, DYNAMIC is the default SCHEDULE type. The
      chunk size is set to 1.

    * For innermost loops without a REDUCTION clause, DYNAMIC is the
      default SCHEDULE type. The chunk size is set to the target machine's
      maximum vector length.

    * For innermost loops with a REDUCTION clause, GUIDED is the default
      SCHEDULE type. This scheduling mechanism is described in the
      following paragraphs.

If your loops, inner or outer, are large, you may get better performance by overriding the default chunk sizes with larger values.

Here's a dramatic example using the SMP version of the STREAM benchmark (see: ). STREAM is designed to be memory-bandwidth limited and vector heavy, so it is especially susceptible to having the flow of work chopped up into small pieces and is not necessarily indicative of how your real application will respond. The runs were made on a busy system.

  • The scheduling type and chunk sizes are as noted.
  • The "Triad" value is the memory bandwidth reported by STREAM's "Triad" test.
  • The STREAM array size used was 2000000 array elements.
  • These were run with 4 threads.

  OMP_SCHEDULE:  <<default>>
    Triad:           57.0 MB/sec
    CPU seconds   :  140.87375      
    Triad:         5779.9 MB/sec
    CPU seconds   :    1.28558      
    Triad:         8624.3 MB/sec
    CPU seconds   :    1.01457      
    Triad:        10399.1 MB/sec
    CPU seconds   :    0.87952      
    Triad:         9634.7 MB/sec
    CPU seconds   :    0.92272      

It's pretty clear that for this code, the default scheduling type isn't the best. You might play with some of these settings as you optimize your own codes.
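
One way to try several settings is a small wrapper like the sketch below. The function name sweep_schedules and the schedule values are illustrative assumptions; they are not the settings behind the numbers above. In standard OpenMP, OMP_SCHEDULE only affects loops declared with SCHEDULE(RUNTIME), so your code may need that clause before the variable has any effect.

```shell
#!/bin/sh
# Run one command under several OMP_SCHEDULE settings, labelling each run.
# The schedule list is illustrative only; pick values that suit your loops.
sweep_schedules() {
    for sched in "dynamic,64" "dynamic,1024" "guided" "static"; do
        echo "OMP_SCHEDULE: $sched"
        # The assignments apply to this one command only.
        OMP_SCHEDULE=$sched OMP_NUM_THREADS=4 "$@"
    done
}

# Example (./stream is a hypothetical executable name):
#   sweep_schedules ./stream
```

Timings from a busy system are noisy, so it is worth repeating each setting a few times before drawing conclusions.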


>         WOMPAT 2003: Workshop on OpenMP Applications and Tools
>            June 26 - 27, 2003 in Toronto, Ontario Canada

> <2003-May-26> Early registration deadline. 
> <2003-Jun-26> Opening of the WOMPAT 2003 Workshop.
> Registration information and the preliminary program are now available:

> The OpenMP API is a widely accepted standard for high-level
> shared-memory parallel programming.  Since its introduction in 1997,
> OpenMP has gained support from the majority of high-performance compiler
> and hardware vendors.
> WOMPAT 2003 is the latest in a series of OpenMP-related workshops, which
> have included the annual offerings of WOMPAT, EWOMP and WOMPEI.
> WOMPAT 2003 will be held at the Hilton Toronto in Toronto, Ontario,
> Canada.  A number of events and festivals will be held during June
> 2003; you can find more information about events occurring around the
> time of WOMPAT at the web site.
> On 20-May-2003, the Center for Disease Control in the United States
> removed its travel alert for Toronto, Canada.  This was done because
> more than 30 days (or 3 times the SARS incubation period) had elapsed
> since the onset of the last case.  You can find the complete
> announcement at the link below:

> On 14-May-2003, the World Health Organization removed Toronto from the
> list of areas with recent local transmission of SARS. This step was
> taken after 20 days (twice the incubation period) passed since the last
> locally acquired case of SARS had been isolated.  According to the WHO,
> "the chain of transmission is considered broken."  For the complete text
> of the WHO Update, refer to the website below:


Quick-Tip Q & A

 A:[[ I run a series of batch jobs (NQS, LoadLeveler, PBS, whatever). Each
  [[ run must create its own directory for output. My current method is to
  [[ manually edit the batch script for each run, typing the name for the
  [[ output directory, like this:
  [[   OUTDIR="results.028"
  [[ this variable is used later in the script, e.g.,:
  [[   mkdir $OUTDIR 
  [[   cd    $OUTDIR
  [[ I don't care much what names are used for the directories.  Can you
  [[ recommend a way, if there is one, to come up with these names
  [[ automatically?

  # Thanks to Richard Griswold:
  One method is to append the PID to the directory name:
    OUTDIR="results.$$"
    mkdir $OUTDIR
    cd    $OUTDIR
  A safer way is to use the mktemp command.  If your system doesn't have
  mktemp, you can get it from

    OUTDIR=`mktemp -dq results.XXXXXX` || exit 1
    cd $OUTDIR
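
  The two ideas above can be combined so that one script works with or
  without mktemp. The following is a minimal sketch and not part of the
  contributed answer; the function name make_outdir is hypothetical.

```shell
#!/bin/sh
# Create a fresh output directory in the current directory and print
# its name.  Prefers mktemp; falls back to a PID-based name otherwise.
#   $1 = name prefix (default: results)
make_outdir() {
    prefix=${1:-results}
    if command -v mktemp >/dev/null 2>&1; then
        mktemp -d "$prefix.XXXXXX"
    else
        dir="$prefix.$$"
        # Plain mkdir (no -p) fails if the name is already taken.
        mkdir "$dir" && echo "$dir"
    fi
}

# Usage in a batch script:
#   OUTDIR=$(make_outdir results) || exit 1
#   cd "$OUTDIR"
```

  The fallback fails rather than reuse an existing directory, which
  matters because PIDs are eventually recycled.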

  # Thanks to Brad Chamberlain:
  I use the following technique in NQS.  Suggestions for generalizing it
  for other queuing systems are mentioned at the end.

  Each NQS job has a unique identifier associated with it stored in an
  environment variable called QSUB_REQID.  This number corresponds to
  the number you'll see when submitting jobs or checking on their
  status.  As an example, if I submit a job as follows:

        yukon% qsub mg.8W
        nqs-181 qsub: INFO 
          Request <20858.yukon>: Submitted to queue <mpp>

  ...for this submission, QSUB_REQID is 20858.
  I use this variable to make qsub output filenames unique using the
  following lines in my qsub script.

        ### towards the top with other QSUB options, I insert:
        #QSUB -o output/mg.8.out
        # (this specifies that output should go in my output subdirectory
        # and should be named mg.8.out)

        ### at the top of my actual set of commands, I insert:
        cd ~/qsub/output
        # (cd to the same output subdirectory named above)

        mv mg.8.out mg.8.$QSUB_REQID.out
        # (rename the previous output file created by this script to a
        # new unique name, created using this job's QSUB_REQID)

  This technique works because the output file generated by the QSUB -o
  directive doesn't appear in this directory until the script completes
  running.  Thus, the mv command executes before the new mg.8.out file
  is ever created.

  Note that this technique will not store the output of job 20858 in
  file mg.8.20858.out as one might like.  Rather, it will store the
  output of job 20858 in mg.8.out (for now), and the output of the
  previous job in mg.8.20858.out.  I find I don't care much about the
  actual job number...  keeping my files unique is sufficient, so this
  trick works.  I then get summary information across a number of runs
  using commands like:

        grep Time mg.8.*out

  While it's tempting to put the $QSUB_REQID directly in the -o
  directive, it seems that variable names are not expanded there, so you
  will literally get a file called mg.8.$QSUB_REQID.out, which isn't
  terribly useful.

  Other queueing systems typically have similar built-in variables that
  are unique to each submission, but I don't know them offhand, so
  you'll need to read some man pages to find out how to do that.

  Another approach would be to use the built-in $$ variable provided by
  csh-like scripts to refer to a script's process number.  This could be
  used instead of $QSUB_REQID above, for example (I prefer QSUB_REQID
  because it corresponds to a number that I have a better grasp of, even
  though it has the imperfect "off-by-a-submission" issue mentioned
  above).
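
  One way to generalize this across queueing systems is sketched below;
  it is not part of the contributed answer. QSUB_REQID is the NQS
  variable described above. PBS_JOBID (PBS) and LOADL_STEP_ID
  (LoadLeveler) are assumptions to check against your local man pages,
  and the function name job_tag is hypothetical.

```shell
#!/bin/sh
# Print a per-job tag taken from whichever batch system variable is
# set, falling back to the shell's PID when none is.
# PBS_JOBID and LOADL_STEP_ID are assumed names; verify them locally.
job_tag() {
    tag=${QSUB_REQID:-${PBS_JOBID:-${LOADL_STEP_ID:-$$}}}
    # Replace characters that are awkward in file names with "_".
    printf '%s\n' "$tag" | tr -c 'A-Za-z0-9._\n-' '_'
}

# Usage, following the renaming trick above:
#   mv mg.8.out mg.8.$(job_tag).out
```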

  # Editor's method:
  This NQS script creates a directory with the name
  "outdir.YYYYMMDD.HHMM" where YYYY is the year, MM is the month,
  DD is the day, HH is the hour, and MM is the minute:

    #QSUB -q batch
    #QSUB -lM 100MW
    #QSUB -lT 8:00:00
    #QSUB -s /bin/ksh


    OUTDIR=outdir.$(date "+%Y%m%d.%H%M")
    mkdir $OUTDIR
    cd $OUTDIR

  And the result of a test: 

    CHILKOOT$ qsub t.qsub
    nqs-181 qsub: INFO 
      Request <11293.chilkoot>: Submitted to queue <batch>
    CHILKOOT$ ls -l -d  outdir.*
    drwx------   2 staff       4096 May 23 15:24 outdir.20030523.1524
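
  Two jobs that start within the same minute would compute the same
  name. If that can happen, a variation like the following sketch adds
  a numeric suffix when the name is taken. The function name
  make_stamped_dir and the limit of 100 attempts are arbitrary
  assumptions.

```shell
#!/bin/sh
# Create outdir.YYYYMMDD.HHMM, or outdir.YYYYMMDD.HHMM.N if that name
# already exists, and print the name that was actually created.
make_stamped_dir() {
    base="outdir.$(date "+%Y%m%d.%H%M")"
    dir=$base
    n=1
    # mkdir fails if the name exists, so try suffixes .1, .2, ...
    while ! mkdir "$dir" 2>/dev/null; do
        dir="$base.$n"
        n=$((n + 1))
        # Give up after 100 attempts (for example, on permission errors).
        [ "$n" -gt 100 ] && return 1
    done
    echo "$dir"
}

# Usage:
#   OUTDIR=$(make_stamped_dir) || exit 1
#   cd "$OUTDIR"
```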

Q: OUCH!!!!!!!!!

   I had, yes, note past tense... a couple files to save, several to
   delete, and some of those to delete had permission 400.  They looked
   something like this:

   $ ll
    total 144
    -rw-------   1 saduser  sadgroup    4280 May 23 15:36 d
    -rw-------   1 saduser  sadgroup     535 May 23 15:36 e
    -rw-------   1 saduser  sadgroup   17120 May 23 15:35 f
    -r--------   1 saduser  sadgroup    8560 May 23 15:35 a
    -r--------   1 saduser  sadgroup    2140 May 23 15:35 c
    -r--------   1 saduser  sadgroup    1070 May 23 15:35 b

   To simplify my life, I did this, 

     rm -i -f ?

   I expected "rm" to ask about each file before deleting it, and to
   take care of the "400" files automatically. Oh well..  it blasted
   them all, and didn't even ask.

   If there's a question in all this, maybe you could answer it.  I'm
   too upset to think.

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions: Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.