
Introduction to Using the IBM p655+/p690+ Complex

 

Introduction

The Arctic Region Supercomputing Center (ARSC) operates an IBM p655+/p690+ supercomputer (iceberg). This system is available to allocated ARSC users only and access requires proof of citizenship for all users regardless of country.

The two p690+ nodes are ideal for large memory OpenMP programs. The bulk of the computing capability consists of p655+ servers. These servers can be used as small 8-way shared memory systems, but can also be used for large message passing jobs using multiple servers per job.

 

"Iceberg"

The ARSC IBM p655+/p690+ or p6x has the following attributes:

 

Operating System / Shells

The operating system on iceberg is IBM's Unix-variant, AIX. The following shells are available on iceberg:

If you would like to have your login shell changed, please contact User Support.

 

System News & Status

System news is available via the news command when logged on to iceberg. System status and news items are available on the web.

 

Storage

On all ARSC systems, several storage related environment variables are set by default. Please use these environment variables:

Name               Purpose                                   Quota [1]      Purge Policy  Back Up Policy
$HOME              "dot" and other small files               100 MB         not purged    backed up
$WORKDIR [2]       compilation, program execution,           20 GB initial  purged        not backed up
$WRKDIR [3]        storage for active program data files
$ARCHIVE_HOME [2]  long-term storage                         none           not purged    backed up
$ARCHIVE [3]

NOTES:

  1. Home directories are intended primarily for basic account info (e.g. dot files). Please use $WORKDIR (your /workdir/$USER directory) for compiles, input files, output files, etc.
  2. Long-term backed up storage is only available in your $ARCHIVE_HOME directory. $ARCHIVE_HOME is an NFS-mounted filesystem from ARSC's Sun Storage Server (seawolf) and is available from iceberg1 and iceberg2. Compute nodes do not have access to $ARCHIVE_HOME, with the exception of the data class. I/O performance in this directory will be slower than $WORKDIR. Compilations in $ARCHIVE_HOME are not recommended.
  3. $WRKDIR and $ARCHIVE are deprecated versions of the environment variables and may be phased out in the near future.
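
Because the deprecated names may disappear, a script can resolve the work directory once and prefer $WORKDIR. The following is a minimal sketch, not an ARSC-provided tool: the function name resolve_workdir is hypothetical, and the /workdir/$USER fallback is taken from note 1 above.

```shell
#!/bin/sh
# Hypothetical helper (not an ARSC tool): print the work directory,
# preferring $WORKDIR and falling back to the deprecated $WRKDIR.
resolve_workdir() {
    if [ -n "${WORKDIR:-}" ]; then
        printf '%s\n' "$WORKDIR"
    elif [ -n "${WRKDIR:-}" ]; then
        printf '%s\n' "$WRKDIR"
    else
        # Last resort: the path named in note 1 above.
        printf '/workdir/%s\n' "${USER:-unknown}"
    fi
}

# Example: report where compiles and run-time files should go.
echo "Work directory: $(resolve_workdir)"
```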

Requests for quota increases should be sent to User Support.

See http://www.arsc.edu/support/howtos/storage.html for more information on storage policies at ARSC.

 

Sample Code Repository ($SAMPLES_HOME)

The $SAMPLES_HOME directory on iceberg contains a number of examples including, but not limited to:

A description of the items in the Sample Code Repository is available on the web; however, you must log in to iceberg to access the examples.

If you would like to see additional samples or would like to contribute an example, please contact User Support.

 

Parallel Programming Models

The following programming models are available through compiler directives and/or libraries.

Hardware Level             Model     Description
Shared-memory node         OpenMP    This is a form of explicit parallel programming in which the programmer inserts directives into the program to spawn multiple shared-memory threads, typically at the loop level. It is common, portable, and relatively easy.
Shared-memory node         pthreads  The system supports POSIX threads.
Distributed memory system  MPI       This is the most common and portable method for parallelizing codes for scalable distributed memory systems. MPI is a library of subroutines for message passing, collective operations, and other forms of interprocessor communication. The programmer is responsible for implementing data distribution, synchronization, and reassembly of results using explicit MPI calls. Using MPI, the programmer can largely ignore the physical organization of processors into nodes and simply treat the system as a collection of independent processors.
Distributed memory system  SHMEM     This implements one-sided data passing and other communication between tasks. Like MPI, SHMEM is a library of subroutines, and the programmer must find and code all parallel work. SHMEM is supported on the IBM through the TurboMP library.
Distributed memory system  Hybrid    Many of the models/methods described above can be combined in a single application. In particular:
  • MPI and threads are explicitly combined by a growing number of programmers. For example, one MPI task could be launched per p655+ node, each of which would then spawn 8 threads to use all 8 processors on the node. Thread models available on iceberg include pthreads and OpenMP.

 

Programming Environment

The IBM p6x has many different compiler variations available. The table below lists the most common compilers. See man xlf, man xlc or man xlC for additional options.

Compilers            Standard  Thread-Safe  MPI      Thread-Safe MPI
Fortran 77 compiler  xlf       xlf_r        mpxlf    mpxlf_r
Fortran 90 compiler  xlf90     xlf90_r      mpxlf90  mpxlf90_r
Fortran 95 compiler  xlf95     xlf95_r      mpxlf95  mpxlf95_r
C compiler           xlc       xlc_r        mpcc     mpcc_r
C++ compiler         xlC       xlC_r        mpCC     mpCC_r

NOTE: Thread-Safe compilers are recommended for all parallel code whether threads are explicitly used or not.
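
The naming pattern in the compiler table can be captured in a small lookup, which may be handy in build scripts. This is only a sketch: the function name pick_compiler and its argument keywords (f77, f90, f95, c, cxx, mpi, nompi) are hypothetical, and it always returns the thread-safe (_r) variant, in line with the note above.

```shell
#!/bin/sh
# Hypothetical helper: map a language and an MPI flag to the thread-safe
# compiler command listed in the table above.
#   $1 = f77 | f90 | f95 | c | cxx
#   $2 = mpi | nompi
pick_compiler() {
    case "$1" in
        f77) base=xlf;   mpi=mpxlf   ;;
        f90) base=xlf90; mpi=mpxlf90 ;;
        f95) base=xlf95; mpi=mpxlf95 ;;
        c)   base=xlc;   mpi=mpcc    ;;
        cxx) base=xlC;   mpi=mpCC    ;;
        *)   echo "unknown language: $1" >&2; return 1 ;;
    esac
    if [ "$2" = "mpi" ]; then
        printf '%s_r\n' "$mpi"
    else
        printf '%s_r\n' "$base"
    fi
}

pick_compiler f90 mpi     # prints mpxlf90_r
pick_compiler c nompi     # prints xlc_r
```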

Other Programming Tools  Executable
Debuggers                totalview, totalviewcli
Performance analysis     prof, gprof, tprof, xprofiler
Batch queuing system     LoadLeveler

Modules

Iceberg has the modules package installed. This tool allows a user to quickly switch between different versions of a package (e.g. compilers).

Before the modules package can be used on iceberg, the init file must first be sourced.

To do this using tcsh or csh, type:

   source /usr/local/pkg/modules/init/<shell>

To do this using bash, ksh, or sh, type:

   . /usr/local/pkg/modules/init/<shell>

For either case, replace <shell> with the shell you are using. If your shell is bash, for example:

   . /usr/local/pkg/modules/init/bash
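
For sh-family shells (bash, ksh, sh), the init file name can be derived from $SHELL rather than typed by hand. The snippet below is a sketch under that assumption; the function name modules_init_file is hypothetical, and the directory is the one quoted above. csh and tcsh users would use source instead of the dot command.

```shell
#!/bin/sh
# Sketch: build the modules init path for a given shell name.
# The directory is the one quoted above; the function name is hypothetical.
modules_init_file() {
    printf '/usr/local/pkg/modules/init/%s\n' "$1"
}

# Derive the shell name from $SHELL (for example /bin/ksh -> ksh).
shell_name=$(basename "${SHELL:-/bin/sh}")
init_file=$(modules_init_file "$shell_name")

# Source the file only if it exists, so the snippet is harmless elsewhere.
if [ -r "$init_file" ]; then
    . "$init_file"
else
    echo "modules init file not found: $init_file" >&2
fi
```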

Once the modules init file has been sourced, the following commands become available:


Command                Example Use                      Purpose
module avail           module avail                     Lists all available modules for the system.
module load pkg        module load PrgEnv               Loads a module file into the environment.
module unload pkg      module unload PrgEnv             Unloads a module file from the environment.
module list            module list                      Displays the modules that are currently loaded.
module switch old new  module switch PrgEnv PrgEnv.gcc  Replaces the module old with the module new in the environment.
module purge           module purge                     Unloads all modules, restoring the environment to its state before any modules were loaded.

Compiling and Linking Fortran programs

The Fortran 90 compilers are:

xlf90 (standard Fortran Compiler)
mpxlf90 (Fortran MPI Compiler)
mpxlf90_r (Thread Safe Fortran MPI Compiler)

See table above or man xlf for additional options.

Here are sample compiler commands showing several common options:

xlf90_r program.f -o program (Compilation of a serial FORTRAN program)
xlf90_r program.f90 -o program (Compilation of a serial FORTRAN program with .f90 suffix)
xlf90_r openmp_prog.f -qsmp=omp -o openmp_prog (Compilation of a simple OpenMP program)
mpxlf90_r program.f -o program (Compilation of a simple MPI program)

See "man xlf" for more information and additional compiler options not listed here.

Common Fortran Compiler Directives:
xlf option Description
-q64 Enables 64 bit compilation.
-qwarn64 Produces warnings for possible 32 bit data size issues. Useful when porting 32 bit code to 64 bit environment.
-O3 High level of optimization. Note additional levels of optimization are available. See man xlf for more details.
-qhot Attempts to perform high order transformations.
-qautodbl=dbl Promotes REALs to 64 bit double precision REALs. This can be useful when porting codes from systems where REALs are 64 bit by default. See man xlf for additional options.
-qstrict Ensures that the semantics of a program are not altered when using -O3
-qmaxmem=<num> Specifies the memory limit in kilobytes used by space intensive optimizations. The special value -1 is used to indicate there is no limit to memory used by such optimizations.
-bmaxdata:0x70000000 Specifies the number of bytes to reserve for the heap (dynamic memory allocation). In this case the data segment is 0x70000000 bytes or 1.75GB. Recommended values are 0x20000000 (0.50 GB) to 0x70000000 (1.75 GB). Programs requiring greater than 1.75 GB for the data segment should be compiled in 64 bit mode (-q64).
-bmaxstack:0x70000000 Specifies the number of bytes to reserve for the stack. In this case the stack is 0x70000000 bytes or 1.75 GB. Recommended values are 0x20000000 (0.50 GB) to 0x70000000 (1.75 GB). Programs requiring greater than 1.75 GB of stack space should be compiled in 64 bit mode (-q64).
-g Produces debugging information.
-p Adds profiling support code.
-pg Adds profiling support code with BSD profiling support.
-qfullpath Full path to source and include files is included in output files for debugging or profiling purposes.
prog.f Name of the input Fortran source file. (File name extensions have an effect on compilation; the .F extension causes the cpp pre-processor to be invoked while .f does not.)

Many other compiler options are available. Check the man page for the compiler for more details. If you can't find something you'd expect, contact User Support for additional help.

Compiling and Linking C/C++ programs

The C compiler is accessible by the commands:

xlc (Standard C Compiler)
mpcc (C MPI Compiler)
mpcc_r (Thread Safe C MPI Compiler)

C++ uses the commands:

xlC (Standard C++ Compiler)
mpCC (MPI C++ Compiler)
mpCC_r (Thread-Safe MPI C++ Compiler)

Sample Compilation:

xlc_r test.c -o test (Compilation of a serial C program)
xlc_r -qsmp=omp openmp_test.c -o openmp_test (Compilation of an OpenMP C program)
mpcc_r mpi_prog.c -o mpi_prog (Compilation of an MPI C program)

See "man xlc" for more information and additional compiler options.

Common C/C++ Compiler Directives:
xlc/xlC option Description
-q64 Enables 64 bit compilation.
-qwarn64 Produces warnings for possible 32 bit data size issues. Useful when porting 32 bit code to 64 bit environment.
-O3 High level of optimization. Note additional levels of optimization are available. See man xlc for more details.
-qstrict Ensures that the semantics of a program are not altered when using -O3
-qmaxmem=<num> Specifies the memory limit in kilobytes used by space intensive optimizations. The special value -1 is used to indicate there is no limit to memory used by such optimizations.
-bmaxdata:0x70000000 Specifies the number of bytes to reserve for the heap (dynamic memory allocation). In this case the data segment is 0x70000000 bytes or 1.75GB. Recommended values are 0x20000000 (0.50 GB) to 0x70000000 (1.75 GB). Programs requiring greater than 1.75 GB for the data segment should be compiled in 64 bit mode (-q64).
-bmaxstack:0x70000000 Specifies the number of bytes to reserve for the stack. In this case the stack is 0x70000000 bytes or 1.75 GB. Recommended values are 0x20000000 (0.50 GB) to 0x70000000 (1.75 GB). Programs requiring greater than 1.75 GB of stack space should be compiled in 64 bit mode (-q64).
-g Produces debugging information.
-p Adds profiling support code.
-pg Adds profiling support code with BSD profiling support. (use with gprof and xprofiler)
-qfullpath Full path to source and include files is included in output files for debugging or profiling purposes.

Many other compiler options are available. Check the man page for the compiler for more details. If you can't find something you'd expect, contact User Support for additional help.

Performance analysis

IBM has several different tools such as prof, gprof and xprofiler for run-time profiling and tracing of codes. prof and gprof offer a command line interface, while xprofiler has a GUI and produces a graphical call tree.
  1. Build your program with the flags -g -pg -qfullpath. This adds code to support profiling and debugging. Be sure to include the -pg flag on the link line for your code.
    xlf90 program.f90 -pg -g -qfullpath -qsuffix=f=f90 -o profiled-program
  2. Run the executable as normal. This will produce a file (gmon.out), containing output statistics for the run.
    ./profiled-program
  3. Generate a human-readable report from the gmon.out file using a profiling tool such as gprof. Name the executable explicitly, because gprof looks for a.out by default. Since gprof sends the results to stdout, you may want to redirect the output to a file as shown below.
    gprof ./profiled-program gmon.out > profiled-program.gprof_report
  4. View the report:
    more profiled-program.gprof_report

See ARSC HPC Users' Newsletter 275 for an overview of xprofiler.

Running Interactive Jobs

You are encouraged to use the LoadLeveler batch system, but you may run interactive jobs that require less than 15 minutes of CPU time and run on a single node. An interactive command is simply typed at the prompt in a terminal window. Standard error and standard output may be displayed on the terminal, redirected to a file, or piped to another command using appropriate Unix shell syntax.

Here are some examples of running serial and parallel jobs interactively.

Running Batch Jobs

All production work on iceberg is run through the LoadLeveler batch scheduler. Many users find that it is convenient to run short test jobs this way because memory, run-time, and CPU limits are larger, stdout/stderr are saved to file(s) for each run, and the test will continue even if the user logs off.

A batch job is a shell script (which could be as simple as two commands: a cd to the working directory and a run command like ./a.out) prefaced by a statement of resource requirements and instructions which LoadLeveler will use to manage the job.

LoadLeveler scripts are submitted for processing with the command llsubmit.

As outlined below, five steps are common in the most basic batch processing:

  1. Create a batch script:

    In a batch script conforming to LoadLeveler syntax, all LoadLeveler options must precede all shell commands. Each line containing a LoadLeveler option must commence with the character string, "# @" (spaces and tabs do not matter) followed by an option. In addition the last LoadLeveler keyword must be "# @ queue".

    This demonstrates the most common LoadLeveler options for running a parallel job. See the Using LoadLeveler page for a more in-depth description of these and other LoadLeveler keywords.

    #!/bin/csh
    #
    # @ environment = MP_SHARED_MEMORY=yes; COPY_ALL
    # @ error = $(executable).$(jobid).err
    # @ output = $(executable).$(jobid).out
    # @ notification = error
    # @ job_type = parallel
    # @ node = 1
    # @ tasks_per_node = 8
    # @ network.MPI = sn_single,shared,us
    # @ class = standard

    # @ queue

    ./a.out

  2. Submit a batch script to LoadLeveler, using llsubmit:

    The script file can be given any name. If the above sample were named myprog.7.cmd, it would be submitted for processing with the command

    llsubmit myprog.7.cmd
  3. Monitor the job

    To check the status of the submitted LoadLeveler job, execute this command:

    llq
  4. Delete the job

    Given its LoadLeveler job identification number (shown when you use "llq"), you can delete the job from the batch system with this command:

    llcancel <LoadLeveler Job-ID>
  5. Examine Output

    When the job terminates, LoadLeveler saves its stdout and stderr, in the directory from which the job was submitted, to the files named on the "# @ output" and "# @ error" lines of your LoadLeveler script. For example, the stdout for a run of myprog using the example LoadLeveler script might be,

    myprog.7102.out
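
The batch script from step 1 can also be produced from a template when only a few values change between runs. The sketch below is illustrative and not part of LoadLeveler: the function name make_ll_script and its argument order are hypothetical, while the keywords are copied from the sample script above.

```shell
#!/bin/sh
# Hypothetical generator (not part of LoadLeveler): write a command file
# like the one in step 1 to stdout.
#   $1 = nodes, $2 = tasks per node, $3 = class, $4 = command to run
make_ll_script() {
    cat <<EOF
#!/bin/csh
#
# @ environment = MP_SHARED_MEMORY=yes; COPY_ALL
# @ error = \$(executable).\$(jobid).err
# @ output = \$(executable).\$(jobid).out
# @ notification = error
# @ job_type = parallel
# @ node = $1
# @ tasks_per_node = $2
# @ network.MPI = sn_single,shared,us
# @ class = $3
# @ queue

$4
EOF
}

# Example: two p655+ nodes, 8 tasks each (16 MPI tasks in total).
make_ll_script 2 8 standard ./a.out
```

Redirecting the output to a file (for example myprog.7.cmd) gives a script that can be passed to llsubmit as in step 2.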

LoadLeveler Classes

List all available classes (also commonly called queues) with the command llclass. List details on any class, for instance "standard", with the command llclass -l standard. You may also read news queues for information on all queues, but note that the most current information is always available using the llclass commands.

See Introduction to Using LoadLeveler for more example LoadLeveler scripts for iceberg.

Memory Settings

There are three different memory settings which need to be set appropriately to make best use of system memory on iceberg.

-bmaxdata / -bmaxstack settings

AIX can have memory settings included within the executable. The -bmaxdata and -bmaxstack options can be used during linking to set the maximum space that the executable is allowed to use for the heap and stack. If these flags are not used, the heap and stack are limited to 256MB for 32 bit executables. Generally, the heap setting (i.e. -bmaxdata) causes problems more frequently than the stack setting (i.e. -bmaxstack).

Example Use:

   # set maxdata to 1.75 GB
   xlf90_r -bmaxdata:0x70000000 test.o -o test
  

The -bmaxdata and -bmaxstack values for an existing executable can be altered without recompiling using the ldedit command. (See also: ARSC HPC Users Newsletter issue 293 )

Example Use:
   # change the maxdata setting to 1.75 GB for an executable.
   ldedit -bmaxdata:0x70000000 test
  
NOTE: The -bmaxdata and -bmaxstack settings should not be used for programs compiled with -q64.
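
The hexadecimal values used with -bmaxdata and -bmaxstack are plain byte counts, so shell arithmetic is enough to convert them. The helper names below (bytes_to_mb, mb_to_hex) are hypothetical; the snippet is only a convenience sketch for checking a value before linking.

```shell
#!/bin/sh
# Convert a -bmaxdata/-bmaxstack style byte count (hex or decimal) to MB.
bytes_to_mb() {
    echo $(( $1 / 1048576 ))
}

# Convert a size in MB to the hexadecimal byte count the options expect.
mb_to_hex() {
    printf '0x%x\n' $(( $1 * 1048576 ))
}

bytes_to_mb 0x70000000   # prints 1792 (1792 MB = 1.75 GB)
bytes_to_mb 0x20000000   # prints 512  (0.50 GB)
mb_to_hex 1024           # prints 0x40000000 (1 GB)
```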

Shell Limits

Even though the limit (csh/tcsh) and ulimit (sh/ksh/bash) commands can be used to alter the soft and hard memory limits for a process, they should not be used for jobs run through LoadLeveler. The shell memory limits within LoadLeveler are set to allow all available memory on a node to be used by a single process, so there should be no need to include limit or ulimit in a LoadLeveler script. Memory limits for batch jobs are enforced via Workload Manager.

Workload Manager (WLM) Consumable Memory

WLM is enabled on iceberg to prevent jobs from using excessive amounts of memory. When a LoadLeveler script is submitted, a ConsumableMemory value is automatically added to the script. For more information, see: news memory

Instructions for Users in Multiple Projects

Users in more than one project can charge use to an alternate project with the account_no LoadLeveler keyword. If the account_no keyword is not specified, the account number defaults to your primary group (i.e. project). Users in a single project do not need to specify an account_no.

For Example:

# @ account_no = proja

This statement should be added to your LoadLeveler script prior to the queue keyword.

Each project has a corresponding UNIX group, therefore the groups command will show all projects and other groups of which you are a member.

For Example:

iceberg2 1% groups
proja projb


In this case, use would be charged to proja by default, but could be charged to projb by setting account_no = projb in the LoadLeveler script.
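
A script that writes LoadLeveler files for several projects could check group membership before emitting the keyword. This is a sketch only; the function name account_line is hypothetical, and by default it reads the group list from the groups command described above.

```shell
#!/bin/sh
# Hypothetical helper: print the account_no line for a project, but only
# if the project appears in a space-separated group list.
#   $1 = project to charge, $2 = group list (defaults to `groups` output)
account_line() {
    project=$1
    group_list=${2:-$(groups)}
    for g in $group_list; do
        if [ "$g" = "$project" ]; then
            printf '# @ account_no = %s\n' "$project"
            return 0
        fi
    done
    echo "not a member of project: $project" >&2
    return 1
}

# Example with the group list shown above.
account_line projb "proja projb"   # prints: # @ account_no = projb
```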

Utilization Information

All projects on iceberg have an allocation of CPU time on the system. CPU allocation and utilization information can be viewed using the show_usage command. Please contact User Support with any questions about project allocation.

For more information on the show_usage command see the Resource Accounting page.

Debuggers

There are several debuggers available on iceberg including totalview, totalviewcli, gdb, dbx, and pdbx.

From news totalview on iceberg:

Totalview
=========
   Totalview 6.4 is currently available on iceberg.  This version will work with
   MPI and openMP programs.

   Totalview can be used in two ways.  
   * Using the front-end nodes on iceberg.  
   * Via Loadleveler using the 'debug' class.
   
I. Using the front-end nodes.

   To use totalview on the front-end nodes, a host file is necessary: 

   1) Create a host file.  This contains a list of hosts where totalview
      will start your jobs (1 entry per processor).  In this case, processes
      will start on 'b1n2' so this file should contain one line of 'b1n2' for
      each debug process you would like to start.  Keep in mind that there are
      8 processors on the front-end nodes and potentially multiple users 
      on the machine.  Should you need to use more than 4 tasks (or threads)
      you might consider using Loadleveler.

      An example host file, for running no more than 4 debug processes:

      iceberg1 1% cat Hostfile
      b1n2
      b1n2
      b1n2
      b1n2

   2) Set the environment variable MP_HOSTFILE with the full path to your
      host file.  Below are usage examples for both csh/tcsh and ksh.

      csh/tcsh:
      setenv MP_HOSTFILE ~/Hostfile

      ksh:
      export MP_HOSTFILE=~/Hostfile
    
      You may want to put this setting in your .profile or .cshrc file.

   3) Example four-processor run:

      totalview poe -a ./a.out -procs 4

      Note: the totalview flag -a tells totalview to pass all following
            arguments to the user executable (in this case poe). 

      single processor (non-MPI) code can be started without poe:

      totalview ./a.out    


   ==========================================================================


II. Interactive using the debug class with Loadleveler.

    1) To use Loadleveler you must provide a loadleveler command 
       file specifying the resources that will be used.  

       Note: The 'debug' class must be used when using totalview.
       ARSC's current totalview license allows for up to 32 tasks.

       iceberg1 2% cat lltotalview 
       #!/bin/csh
       #
       # @ job_type         = parallel
       # @ node             = 2   
       # @ tasks_per_node   = 8
       # @ network.MPI      = sn_single,shared,us
       # @ class            = debug
       # @ wall_clock_limit = 0:30:00
       # @ queue
       
       ##############################################
       # Note: no executable needs to be specified
       #       in this loadleveler script.
       # 
       ##############################################

    2) Invoke totalview:

       You will need to specify the Loadleveler command when starting poe.  
       The command file can be specified on the command line or using
       the environment variable MP_LLFILE:

       a) Command line version:
          totalview poe -a ./my_mpi_exe -llfile ./lltotalview
        
          The flag -llfile specifies the name of the Loadleveler command 
          file from step 1.   

       b) Using the environment variable MP_LLFILE 
    
          ksh:
          export MP_LLFILE=./lltotalview 

          csh/tcsh:
          setenv MP_LLFILE ./lltotalview 

          Once the environment variable is set you can start totalview 
          without using the -llfile flag.    

          totalview poe -a ./my_mpi_exe 

       NOTE: If there are not enough nodes available to start your job, 
             totalview will launch, but an error will be echoed to the 
             terminal.        

       For example:                            
       ERROR: 0031-365  LoadLeveler unable to run job, reason:
              LoadL_negotiator: 2544-870 Step b1n1.12345.0 was not considered
              to be run in this scheduling cycle due to its relatively low
              priority or because there are not enough free resources.

    3) When you are done using totalview close the application.  If loadleveler
       doesn't release the nodes, you may need to cancel your job.
    
       For example:
       iceberg1 3% llq
       Id                       Owner      Submitted   ST PRI Class        Running On 
       ------------------------ ---------- ----------- -- --- ------------ -----------
       b1n1.12345.0             username   11/01 08:00 R  50  debug        b7n1       

       llcancel b1n1.12345.0


Additional hints:

    1) code should be compiled with -g.  This makes it possible for 
       totalview to refer back to the source code.  Code compiled without
       -g will appear as assembler and you will not have meaningful access
       to variable values.

    2) if you compile your code in one location and run the executable in 
       another location you should also add the flag -qfullpath.

    3) when starting jobs through poe, totalview will initially come up
       with a screenful of assembler.  Do not despair -- this is just the
       poe executable.  Click on 'go' to start the job.  poe will start the
       main run.  A dialog box will appear asking whether you wish to stop the job
       or continue.  Stopping in this case means letting you set breakpoints, etc.
       in your main code.  Not stopping will run the code on the number of
       processors you've specified until it either completes or one of the
       processors halts at an error.

     4) totalview works best with the _r compilers (i.e. mpxlf90_r)

     5) source, object, and executable files must be in the same directory for
        totalview to refer back to source lines by default. Otherwise:
        a) add the directory/directories to the PATH environment variable
        b) add paths with the File > Search Path command within totalview
  
     6) you can view core files with totalview by passing the executable and 
        core file to totalview.

        totalview ./a.out core -a -hostfile hostfile  
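
The host file described in section I, step 1 of the news item (one line per debug process) can be written with a short loop rather than by hand. This is a sketch; the function name make_hostfile is hypothetical, and b1n2 is the front-end host named in the news item.

```shell
#!/bin/sh
# Sketch: print one host-file line per debug process.
#   $1 = host name, $2 = number of processes
make_hostfile() {
    i=0
    while [ "$i" -lt "$2" ]; do
        printf '%s\n' "$1"
        i=$((i + 1))
    done
}

# Example: four entries for the front-end host named in the news item.
make_hostfile b1n2 4
```

Redirect the output to the file that MP_HOSTFILE points to, for example ~/Hostfile.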

For additional Documentation, see:

Totalview

Online Manuals and Documentation

Active users may type the command:

iceberg1 25% news documents
Topics include:

Documentation is also available directly from IBM. Below are some links which may be of interest to iceberg users:

C / C++

Fortran

ESSL & PESSL

Loadleveler

Parallel Operating Environment (poe)

More Information

General information on ARSC and its other resources is available in a number of forms.

Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775 | voice: 907-450-8600 | email:
