ARSC HPC Users' Newsletter 251, August 2, 2002

Performance Monitoring : hpmcount on Icehawk

[Contributed by Tom Logan, ARSC]

ARSC has installed a performance monitoring utility on Icehawk called hpmcount. Hpmcount, which was originally designed by Luiz DeRose of the Advanced Computing Technology Center (ACTC) at IBM Research, tracks the number of operations that a user's code performs and gives a summary at the end of the run. One can use hpmcount to report the GFlop rate of an application, as well as other metrics including cache and memory usage.

The beauty of hpmcount is that, like "hpm" on Cray PVP systems, you do not need to recompile or in any way modify your existing code in order to use the tool; you simply insert it on the command line in your LoadLeveler script. For parallel jobs, this would look like:

poe hpmcount -o timing <user_command>

Where "timing" is the base name of the hpmcount files to be created, and <user_command> is the normal command that you use to run your code. This will work even if you originally did not use "poe" in your loadleveler script.

Hpmcount creates one output file for each process in your parallel job. Using the command above, these files would be named timing_<PE>.<PID>, where PE is the rank of the process and PID is the process ID. The files thus created contain a plethora of information about what your program did during execution. From the obvious, like the Mflip/s rating*, to the detailed, like the average number of loads per TLB miss, one can gain useful insight into a code's operation and, hopefully, some hints on where an application could benefit from optimization.
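
If you just want the headline number from every per-process file at once, a quick grep should do it (this assumes the rating appears on a line containing the string "flip", which may vary between hpmcount versions):

  grep -i flip timing_*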

ARSC is collecting performance information for user codes on Icehawk using the hpmcount utility. This information will allow us to better serve our users in the future by providing long-term tracking and, thus, guidelines for typical "good" and "bad" performance numbers. Currently we have only a small database to compare against; however, the average Mflip/s rates seem to be in the 80-100 range (5-7% of peak) per processor for typical user codes.

For further information about hpmcount and interpretation of the output, please refer to the wonderful web page designed by NERSC, at:

http://hpcf.nersc.gov/software/ibm/hpmcount/

Additional information can also be found at the IBM Alphaworks website, at:

http://www.alphaworks.ibm.com/tech/hpmtoolkit.

(*) hpmcount reports Mflip/s (million floating point instructions per second) instead of Mflop/s in order to distinguish the floating point multiply-add, which is a single instruction. For most intents and purposes, the two numbers can be treated as the same.
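
As a small illustration of the counting (a sketch only, not a statement of hpmcount's exact accounting): a fused multiply-add is one floating point instruction that performs two floating point operations, so a multiply-add dominated loop like the hypothetical one below is counted differently under the two definitions:

  /* Each iteration can map to one fused multiply-add instruction,
   * i.e. 1 "flip" that performs 2 "flops".  Over n iterations taking
   * t seconds:
   *     Mflip/s =     n / (1.0e6 * t)
   *     Mflop/s = 2 * n / (1.0e6 * t)
   */
  void axpy(int n, double alpha, const double *x, double *y)
  {
      int i;

      for (i = 0; i < n; i++)
          y[i] = alpha * x[i] + y[i];
  }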

Cray Bioinformatics User Group

SANBI, the South African National Bioinformatics Institute, has started a Cray Bioinformatics User Group. To join the list, go to:

http://www.sanbi.ac.za/mailman/listinfo/cbug

BLUIsculpt at SIGGRAPH for Those Who Missed it

Some highlights of BLUI's presence in the SIGGRAPH "Studio" are available online. Go to:

http://www.blui.org/

BLUIsculpt is a nifty 3D VR sculpting application. Teamed up with UAF's 3D printer, it has produced a growing collection of interesting little wax sculptures around here these days. It's been a big hit on ARSC's summer tours (Wednesdays at 1pm through August), and has been invited back to SIGGRAPH 2003.

Faculty Camp 2002

As announced earlier,

http://www.arsc.edu/pubs/bulletins/FacultyCamp2002.shtml

ARSC's annual "Faculty Camp" starts on Monday, August 15th. Most sessions are also open to general ARSC users and UAF affiliates, not just camp registrants. Guy Robinson is scheduling the event, and requests that you just let him know before dropping by, so he can worry about available seating, etc.

The final schedule of presentations and activities will be posted on our welcome page as a "Hot Topic" next week. Check back:

http://www.arsc.edu

HDF Cray Support

HDF Newsletter 68, July 12, 2002, available at:

http://hdf.ncsa.uiuc.edu/newsletters/current.html

contained this item:

"In our last newsletter we had mentioned that we were considering dropping support for the Cray computers. We received several responses regarding this, as well as help obtaining new Cray accounts, and will be able to continue supporting the Crays."

---

This reminds me of a remark I heard this week that, in addition to reporting problems, you should tell people (and vendors) when they're doing something you value. Otherwise, they might never know, and stop doing it.

Vector intrinsics on the T3E

The last issue discussed vectorized versions of common math intrinsics on the IBM SP. Vectorized intrinsics are also available to T3E users.

As far as I can tell, Cray doesn't advise you to code up explicit calls to the vector routines and doesn't even provide the interfaces, as IBM does with "libmassv."

Using them is similar to vectorizing code on the PVP systems or using "xlf -qhot -O3 ..." on the IBM: you write vectorizable loops, and the compiler does the translating.

Here's the relevant documentation for Fortran, taken from CrayDoc:

http://www.arsc.edu:40/


   Cray T3E Fortran Optimization Guide - 004-2518-002
   ==================================================

4.6. Vectorization

The CRAY T3E compiler offers a method to vectorize select math
operations inside loops. This is not the same kind of vectorization
available on Cray PVP systems.  On a CRAY T3E system, the compiler
restructures loops containing scalar operations and generates calls to
specially coded vector versions of the underlying math routines. The
vector versions are between two and four times faster than the scalar
versions. 

The compiler uses the following process:

  1. Stripmine the loop. (For more information on stripmining, see
     Example 4-6.)

  2. Split vectorizable operations into separate loops, if necessary.
     (For more information on loop splitting, see Example 4-6.)

  3. Replace loops containing vectorizable operations with calls to
     vectorized intrinsics.

Vectorizing reduces execution time in the following ways:

    By reducing aggregate call overhead, including the subroutine
    linkage and the latency to bring scalar values into registers needed
    by the intrinsic routine.

    By improving functional unit utilization. It provides better
    instruction scheduling by processing a vector of operands rather
    than a single operand.

    By producing loops that can be pipelined by the software. (For more
    information on pipelining, see Section 4.2.)

The programming environment also offers the libmfastv library of faster,
but less accurate, vector versions of the libm routines. These routines
deliver results that are usually one-to-two bits less accurate than the
results given by the libm routines. Less accurate scalar versions of the
library routines are also used to provide identical results between
vector and non-vector invocations within the same program. The libmfastv
routines reduce execution time spent in math intrinsics by 50 to 70%.

Because the vector routines may not provide a performance improvement in
all cases (due to necessary loop splitting), vectorization is not turned
on by default. It is enabled through the compiler command-line option -O
vector3. The default is vector2, which currently does not do intrinsic
vectorization. The vector1 and vector0 options have their own meanings
on Cray PVP systems, but on the CRAY T3E system, they are the same as
vector2: they turn vectorization off.

If you have selected -O vector3, you can further control vectorization
by using the following:

    Vector directives NEXTSCALAR and [NO]VECTOR. They allow you to turn
    vectorization on and off for selected parts of your program.

    In the case of ambiguous data dependences within the loop, you can
    express a loop's vectorization potential by including the IVDEP
    directive (see Section 4.6.1). IVDEP tells the compiler to proceed
    with vectorization and ignore vector dependencies in the loop that
    follows.

    Access to the libmfastv routines can be controlled with the -l
    compiler option. For example, the following command line links in
    the faster, but less accurate math routines rather than the slower,
    default routines in libm:

    % f90 -Ovector3 -lmfastv test.f

Transformation from scalar to vector is implemented by splitting loops.
This may cause extra memory traffic due to the expansion of scalars into
arrays and reduce the opportunity for other scalar optimizations. This
could negatively impact the profitability of the vectorization.

The less accurate version offered through libmfastv varies from default
libm results generally within 2 ulps, although some results could
differ by larger amounts. Exceptions may also differ from the libm
versions, where some calls to libmfastv may generate only a NaN for a
particular operand rather than an exception, causing exceptions later in
the program.

Vectorization is only performed on loops that the compiler judges to be
vectorizable. This determination is based on perceived data dependencies
and the regularity of the loop control. These loops will likely be a
significant subset of those seen as vectorizable by the Cray PVP
compiler.  Vectorization of conditionally executed operators is
deferred. Vectorization of loops that contain potentially early exits
from the loop is also deferred.

Vectorization will be performed on the following intrinsics and
operators. The first set supports both 32-bit and 64-bit floating-point
data:

  SQRT(3)
  
  1/SQRT (replaced by a call to SQRTINV(3))
  
  LOG(3)
  
  EXP(3)
  
  SIN(3)
  
  COS(3)
  
  COSS(3) (replaced by a combined call to SIN(3) and COS(3))
  
  The following support 64-bit floating-point only:
  
  RANF(3)
  
  X**Y
  
  POPCNT(3)

    Note: Early versions of the Programming Environment 3.0 release may
    not vectorize loops with multiple RANF(3) calls. The IVDEP directive
    enables RANF vectorization, but the values returned may be different
    than those returned without vectorization.

The vector intrinsic routines are designed to read an arbitrary number
of operands from memory and write their results to memory. They can also
handle operands and results that do not have a stride of one.

The compiler stripmines and splits (if necessary) any loop for which
intrinsic vectorization is indicated by the programmer. The stripmine
factor is currently 256. The loop is stripmined to limit the size of
scalar expansion arrays and to decrease the likelihood of cache
conflict.

The following example illustrates the kind of loop to which the
vectorization optimization can be applied:

Example 4-18. Transforming a loop for vectorization


       DO I = 1, N
           A(I) = B(I) * SQRT( C(I) + D(I) )
       ENDDO

The loop in this example will be transformed into the following loop.
The vector version of the square root function is called after the first
ENDDO statement.

       DO II = 1, N, 256
           NN = MIN( II+255, N )
           DO I = II, NN
               T(I-II+1) = C(I) + D(I)
           ENDDO
           SQRT_V( NN - II + 1, T, T2, 1, 1 )
           DO I = II, NN
               A(I) = B(I) * T2(I-II+1)
           ENDDO
       ENDDO

The C/C++ documentation (also available from CrayDoc) is nearly identical, but you use "-h" to specify vector3:

% cc -hvector3 test.c
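
For C programmers, a loop of the same shape as the Fortran example above should be a candidate for the same transformation. This is only a sketch (the function and variable names are made up, not taken from the Cray documentation):

  /* A math intrinsic (sqrt) inside a simple counted loop: the pattern
   * the compiler looks for when substituting vectorized intrinsics.
   * Compile with "cc -hvector3".  If the compiler cannot rule out
   * aliasing among the pointers, it may also need an ivdep hint.
   */
  #include <math.h>

  void scale_sqrt(int n, double *a, double *b, double *c, double *d)
  {
      int i;

      for (i = 0; i < n; i++)
          a[i] = b[i] * sqrt(c[i] + d[i]);
  }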

Quick-Tip Q & A


A:[[ I received a "not enough memory" error when trying to compile a large 
  [[ subroutine (part of a big code) on the T3E with -O3,unroll2 . Is 
  [[ there any way to increase the memory allocation to f90 or am I stuck 
  [[ compiling that subroutine with -O2 ? Thanks!


As far as we know, the compilers can't use any memory beyond that
available on a single PE. (Any chance the particular subroutine could be
split?)

One thing that didn't help, but might be interesting to someone, is this
little C program to let you compile on an application, or APP, PE.  APP
PEs aren't shared, so the hope was that the compiler would scare up a
bit more memory:


/*--------------------------------------------------------------------
* MPImake.c
* Compile as:  cc -o MPImake MPImake.c
--------------------------------------------------------------------*/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/*-------------------------------------------------------------------*/
int main (int argc, char **argv) {
  int mype;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &mype);

  /* Only rank 0 runs make; the other PEs simply wait in MPI_Finalize. */
  if (mype == 0) {
     system ("make");
  }

  MPI_Finalize();
  return 0;
}

Run this as: 

  mpprun -n2 ./MPImake



Q: I'm using "vi" and want to "yank" some text in one document, and
   "put" it into another.  I tried this:
  
     $ vi file1     # start editing file1
       22j          # move to desired text
       y10y         # yank next 10 lines
       :n file2     # open file2
       p            # attempt to put the 10 lines
    
   It just beeps at me. This is such an obvious operation, there
   must be a way... Ideas?

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.