ARSC T3E Users' Newsletter 129, November 7, 1997

San Jose Bound

SC97 begins in 10 days! Everyone is invited to stop by ARSC's booth, number R211.

Come by and meet some of the staff: Guy Robinson, our MPP Specialist, who hails from England; Sergei Maurits, our Visualization Specialist in the Scientific Services area, from Russia; and Roger Edberg, our newest Visualization Specialist, who moved up from Australia. ARSC truly is a melting pot of fantastic resources, human and hardware alike.

We're changing, and we'd love to tell you about our upgraded facility. Improvements include a new kiosk for our visitor display area, a complete video production lab with 2 SGI supercomputers, and an upgrade to our latest CRAY, the T3E named Yukon. Yukon now has ninety 450-MHz processors and 23 GB of RAM available to user applications.

We are also working with a new 3D projector. We hope to debut a demo tape of 3D visualizations from our Scientific Services area. Sergei Maurits, with the help of Art professor Bill Brody and Computer Science professor Chris Hartman, is showcasing some of the scientific research and engineering of our users.

VAMPIR

VAMPIR has recently been installed on the ARSC T3E and the SGI workstation network. VAMPIR was developed by a number of partners and is sold by PALLAS; more information on the background of the tool can be found at

http://www.pallas.de/pages/products.htm

It provides an interactive graphical interface for investigating post-mortem traces generated by MPI programs.

Users compile their MPI programs with VAMPIR's tracing options enabled and link with routines to process this information. The tracefile generated at runtime is viewed later on a workstation, where the user can interactively investigate the activity of processors and the messages being exchanged.

VAMPIR provides a highly detailed view of the message passing efficiency of an MPI code. It points out exactly those events on which the processor is waiting, and can, for instance, help you determine exactly why your program spends 40% of its time parked at an MPI_BARRIER or MPI_RECV.
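
For instance, a load imbalance like the one in the following toy program (hypothetical code, not taken from any ARSC user) shows up immediately in VAMPIR's timeline: the lightly loaded processors appear as long red bars inside MPI_BARRIER while they wait for the busiest one.


c     Toy example of a pattern VAMPIR makes obvious: the work is
c     unevenly distributed, so most PEs sit waiting in MPI_BARRIER.
      program imbal
      implicit none
      include 'mpif.h'
      integer ierr, rank, i, n
      real s

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

c     PE 0 is given ten times more work than the other PEs
      n = 1000000
      if (rank .eq. 0) n = 10 * n
      s = 0.0
      do i = 1, n
         s = s + sqrt(real(i))
      end do

c     the other PEs arrive here early and wait (red in VAMPIR)
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      if (rank .eq. 0) print *, 'sum on PE 0 = ', s
      call MPI_FINALIZE(ierr)
      end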

For comparison, Apprentice and PAT are more oriented to single-processor performance. They give detailed summaries of the activity of individual routines and, for the MPI programmer, can help determine whether or not excessive time is being spent in MPI_RECV or MPI_BARRIER. They don't, however, provide VAMPIR's level of detail.

Such detail can help users isolate poorly performing parts of a program or compare the behaviour of different algorithms. Several ARSC users have already investigated their codes this way and made improvements which have increased performance on the T3E.

Several steps are required to use VAMPIR. First, the user must set up his or her environment:

For VAMPIRtrace generation from MPI programs on the ARSC T3E, set the following environment variables (e.g., with setenv under csh or export under ksh):


  T3E:
     PAL_LICENSEFILE=/usr/local/pkg/VAMPIRtrace/etc/license.dat
     PAL_ROOT=/usr/local/pkg/VAMPIRtrace

To view tracefiles on an ARSC SGI workstation, set:


  SGI:
     PAL_LICENSEFILE=/usr/local/vampir/etc/license.dat
     PAL_ROOT=/usr/local/vampir

     Add to your $PATH:   /usr/local/vampir/bin/R4K-OS5

Next, compile programs on the T3E so that the instrumented MPI library is used. This is achieved as follows:


  Compile with: "-I/usr/local/pkg/VAMPIRtrace/include", for instance: 

  yukon% f90 -c -I/usr/local/pkg/VAMPIRtrace/include prog.f

Then, link with the necessary libraries, " -L/usr/local/pkg/VAMPIRtrace/lib -lVT -lpmpi -lmpi ", for instance:


  yukon% f90 -L/usr/local/pkg/VAMPIRtrace/lib -lVT -lpmpi -lmpi -o prog prog.o

Note: I have found it useful to set the following aliases (csh syntax):


  alias vamf90c 'f90 -c -I/usr/local/pkg/VAMPIRtrace/include'
  alias vamf90l 'f90 -L/usr/local/pkg/VAMPIRtrace/lib -lVT -lpmpi -lmpi'
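
To check that the instrumentation is linked in correctly, any small MPI program will do. The following hypothetical test program (not part of the VAMPIR distribution) simply passes a token around a ring of PEs so that the resulting trace contains a few messages to look at:


c     Minimal MPI test program for generating a first tracefile.
c     (Run on at least 2 PEs.)
      program tracetest
      implicit none
      include 'mpif.h'
      integer ierr, rank, npes, msg
      integer status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)

c     pass a token around the ring of PEs
      msg = rank
      if (rank .eq. 0) then
         call MPI_SEND(msg, 1, MPI_INTEGER, 1, 0,
     &                 MPI_COMM_WORLD, ierr)
         call MPI_RECV(msg, 1, MPI_INTEGER, npes-1, 0,
     &                 MPI_COMM_WORLD, status, ierr)
      else
         call MPI_RECV(msg, 1, MPI_INTEGER, rank-1, 0,
     &                 MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(msg, 1, MPI_INTEGER, mod(rank+1,npes), 0,
     &                 MPI_COMM_WORLD, ierr)
      end if

      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      end

Compiling this with the vamf90c and vamf90l aliases and running it under mpprun, as described below, should produce a small tracefile to experiment with.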

The tracefile-generating MPI library and the final writing of the tracefile do not add considerably to the program's runtime or change its behaviour, but, as with all tools, it is recommended that users not run production jobs with these options enabled. (Also, for a long run, tracefiles can be large.)

Next, run as usual using mpprun:


  yukon% mpprun -n 4 ./prog

This results in a file containing the trace information, with the extension .bpv (prog.bpv in this case). This file must now be transferred to the SGI system, where it can be viewed with VAMPIR:


  sgi% vampir prog.bpv

This brings up VAMPIR's GUI and puts the user in a position to start examining the performance of the program. VAMPIR first appears as a small command window; the user can select "complete history" from the file menu to complete processing of the file and view the entire program's behaviour. The user can then display the following information about the program from the "Global-displays" menu:


  - "global node" view shows how much time each node spends in different
    parts of the program.

  - "global chart" view displays the same information but in pie chart
    format.

  - "global timeline" display a timeline showing each processor's activity
    as a variously coloured horizontal bar.

As an example, we offer a simple overseer/worker Monte Carlo code developed as a first solution to a complex problem in ocean acoustics. The early version showed anomalous performance at large numbers of processors.

Using VAMPIR, it was quickly seen that the workload on each worker processor was reduced as the number of workers was increased, and that the time taken to load data onto the overseer became dominant. The VAMPIR global timeline screen for this example is shown here:

Before

This timeline pointed out the problem. In this graph, user code activity is coloured green (green is good) and time spent waiting in MPI routines is coloured red (red is bad!). The topmost line shows the overseer processor and the lower lines show the workers' activity. The graph leads us to conclude that the workers are idle while the overseer is reading in the next data, and that the overseer is idle while the workers are processing data.

Armed with this information, we improved the code's performance by making a simple ordering change in the overseer processor, so that the next data packet is read in while the workers are computing results from the previous packet. Using VAMPIR, we can visualize this improvement, as shown in this graph:

After
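
In code, the reordering amounts to reading the next packet between sending work out and collecting results. The following is a minimal, self-contained sketch of the idea; the routine and variable names are invented for illustration and the "read" is simulated, so this is not the actual ocean acoustics code:


c     Sketch of the overseer/worker reordering: the overseer reads
c     the NEXT data packet while the workers compute on the current
c     one, instead of everyone sitting idle in turn.
      program pipeline
      implicit none
      include 'mpif.h'
      integer npack, nsize
      parameter (npack = 8, nsize = 1000)
      real packet(nsize), nextpk(nsize), res
      integer ierr, rank, npes, n, iw, i
      integer status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)

      if (rank .eq. 0) then
c        overseer: read the first packet before entering the loop
         call readpk(1, packet, nsize)
         do n = 1, npack
c           hand the current packet to every worker
            do iw = 1, npes-1
               call MPI_SEND(packet, nsize, MPI_REAL, iw, n,
     &                       MPI_COMM_WORLD, ierr)
            end do
c           read the NEXT packet while the workers are busy
            if (n .lt. npack) call readpk(n+1, nextpk, nsize)
c           collect one result from each worker
            do iw = 1, npes-1
               call MPI_RECV(res, 1, MPI_REAL, iw, n,
     &                       MPI_COMM_WORLD, status, ierr)
            end do
            do i = 1, nsize
               packet(i) = nextpk(i)
            end do
         end do
      else
c        worker: receive a packet, compute, return one number
         do n = 1, npack
            call MPI_RECV(packet, nsize, MPI_REAL, 0, n,
     &                    MPI_COMM_WORLD, status, ierr)
            res = 0.0
            do i = 1, nsize
               res = res + packet(i)
            end do
            call MPI_SEND(res, 1, MPI_REAL, 0, n,
     &                    MPI_COMM_WORLD, ierr)
         end do
      end if

      call MPI_FINALIZE(ierr)
      end

c     stand-in for the real (and slow) data read
      subroutine readpk(n, buf, len)
      integer n, len, i
      real buf(len)
      do i = 1, len
         buf(i) = real(n + i)
      end do
      end

With blocking sends and receives, simply moving the read between the send and the receive is enough to overlap the overseer's I/O with the workers' computation.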

This example shows how easy it can be to investigate and determine the reasons why programs are not performing as expected. VAMPIR is particularly useful in cases where there are tuning issues such as relative work-packet sizes and control structures.

This article is intended only as a basic introduction to this powerful and useful tool. More information about the VAMPIRtrace library can be found on yukon in


  /usr/local/pkg/VAMPIRtrace/doc/VT-userguide.ps

and about the VAMPIR GUI on the SGI network in


  /usr/local/vampir/doc/VAMPIR-instguide.ps.gz

In addition, ARSC is offering VAMPIR training on December 10th (next article).

T3E Tools Class to be Broadcast via the MBone, Dec. 10

Every MPP programmer should become familiar with VAMPIR, Apprentice, TotalView, and PAT. By using these tools effectively, we can craft superior algorithms and efficient programs, and possibly get more work done in less time.

In light of the importance of these tools, ARSC will broadcast its T3E Tools course, via the MBone, for the benefit of our many remote users and the wider community. For details on the course, please see:

http://www.arsc.edu/user/classes/ClassT3ETools.html

(If you intend to "tune in," please let us know with a quick note to: arsc_mpp@arsc.edu . Thanks!)

For those unfamiliar with it, the MBone, or IP Multicast Backbone, enables real-time video and audio to be broadcast over the Internet. If your site is MBone-capable, you may observe, or join, these broadcasts at your workstation. If not, you may want to prod your sysadmin. A wide variety of broadcasts have been available recently, for instance, PSC's 3-day T3E course last May, NASA's Mars Pathfinder press conferences, and an ARSC/UA lecture on teraflops computing last August.

For more information on the MBone, see:

http://www.best.com/~prince/techinfo/mbone.html

Case Study: LES, C++, and Pseudo-Spectral Methods

[ Many thanks to Steve de Bruyn Kops of the Mechanical Engineering Department at the University of Washington for this article. It's great to hear how MPP programmers are using C++, not to mention how they approach various problems and benefit from hardware upgrades. ]

Our project is large eddy simulation (LES) of a turbulent flame in order to develop subgrid-scale chemistry models. The experimental data, however, is none too detailed, and it proved difficult to sort out modeling errors versus simulation initialization errors. With the installation of the 512 PE T3E at PSC, it became possible to do a direct (no modeling) numerical simulation (DNS) of the flow, and thus, a three-way comparison between the lab data, the DNS, and the LES. The DNS also provides data at all spatial locations and all times, which allows for detailed validation of the LES. The problem became one of developing initial fields for the DNS which match the laboratory data.

At about the same time as the T3E at PSC came on line, ARSC changed the YMP-T3D charging ratio from 40:1 to 100:1, which was a help to us, and users started migrating to the ARSC T3E, yukon. Also, we knew that, while the reacting scalar simulation would require a larger domain to resolve the scalar fields, the velocity field could be studied in a 384^3 domain, which just barely fits on the ARSC T3D, denali. Thus, denali became the best machine on which to develop the velocity field.

Lab data is available for the flow at three widely spaced downstream locations in the wind tunnel. To advance a simulation to the third data point requires about 4000 time steps. So just developing the initial fields entails major simulations. After 4 attempts, a field was found which evolved almost exactly the same as the laboratory data.

The code uses pseudo-spectral methods so we can use Cray's PSCFFT3D routine for most of the work. From "man PSCFFT3D" on yukon:


       PSCFFT3D, PCSFFT3D 
           - Applies a three-dimensional (3D) real-to-complex or
           complex-to-real Fast Fourier Transform (FFT) to a matrix
           distributed across a set of processors.

Consequently, we don't have to worry about coding much of the inter-PE communication, which takes place within PSCFFT3D (about 75% of the runtime is spent in the highly optimized library routines). We use shmem for what little communication we do explicitly.

Most of the code is written in C++. We developed a Field class which 'knows' how to manipulate a 3D field of data, i.e., read, write, FFT, take derivatives, sort, compute its statistics, ... . Daughter classes add appropriate functions for velocity fields and scalar fields. The main simulation class binds 3 velocity fields, some number of scalar fields, and the appropriate chemistry functions, and tells them to do their thing. The code is great at a few things:

  • post-processing routines are often less than 10 lines since the base classes already know how to do most things.
  • a code revision automatically propagates to all simulation and post-processing codes upon recompilation. No time wasted fixing the same code twice.
  • memory management is efficient. The main arrays in the big simulations are 1 GB, and various routines need work arrays. In our Fortran codes, we have a pool of arrays like tmp1, tmp2, ... . Subroutine A needs 1 tmp and calls B, which needs 2. But subroutine C uses 2 tmps and calls B. Efficient use of work arrays in code shared between simulations with different sub-models, plus post-processing code, is tough. In C++, work arrays are dynamically allocated the first time they are used and then returned to a pool, which is managed by the constructor and destructor for the Field class. Work arrays are completely transparent, and the minimum number needed is automatically allocated by all simulation and post-processing routines. Also, the work Fields always have meaningful names, since they are declared locally.

One problem with the code:

  • it is easy to call several class functions (such as one to multiply and one to add) and thereby cycle data through the CPU twice. The optimization that comes with the comparable F90 array operators is not available. However, most of the run time is consumed by a small fraction of the code, which is easily coded explicitly. The classes allow us to quickly put together a new algorithm which will be bug free. Once it is tested, we optimize as the profiler dictates.

Below are some timing data for:

  • T3E/450MHz/256MB w/ streams
  • T3E/450MHz/256MB w/o streams
  • T3E/300MHz/128MB w/o streams
  • T3D

Some explanatory notes:

  1. Results for the smallest number of PE's possible (2 on the T3E and 8 on the T3D) are shown first, followed by runs on larger numbers of PE's.
  2. In a pseudo-spectral code, to compute A x B = C, where A, B, and C are fields in spectral space and x denotes a point-wise multiply, A and B must be transformed to real space to get A' and B', which are multiplied to get C'. Then C' must be transformed back to spectral space to get C. Finally, A and B must be recovered. If memory is tight, another set of transforms is done, i.e., A' -> A and B' -> B. Otherwise, A and B can be cached at the start of the operation and then restored from the saved values. Our code dynamically determines which method to use. This explains the speedup between 2 and 4 PE's on the T3E; the effect is less noticeable on the T3D. (A sketch of the two approaches appears after these notes.)
  3. We have no idea how many flops are executed by PSCFFT3D/PCSFFT3D, so we can't translate the timing data to MFLOPS.
  4. Speed-up reported below uses the T3D 8-PE run as its basis.
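
Here is the sketch promised in note 2. It is not the authors' code: the transform routines below are empty stand-ins for the real distributed FFTs (PSCFFT3D/PCSFFT3D on the T3E), the field is shown as a simple 1D array, and all names are invented. The control flow, however, shows why having enough memory to cache A and B saves two transforms per product:


c     Sketch of C = A x B for fields held in spectral space.
c     When "cached" is true there is room to save copies of A and
c     B, so only three transforms are needed; otherwise two extra
c     transforms restore A and B to spectral space.
      subroutine specmul(a, b, c, n, cached)
      implicit none
      integer n, i
      logical cached
      real a(n), b(n), c(n)
      real, allocatable :: asave(:), bsave(:)

      if (cached) then
         allocate(asave(n), bsave(n))
         do i = 1, n
            asave(i) = a(i)
            bsave(i) = b(i)
         end do
      end if

c     spectral -> real, point-wise multiply, real -> spectral
      call spec2real(a, n)
      call spec2real(b, n)
      do i = 1, n
         c(i) = a(i) * b(i)
      end do
      call real2spec(c, n)

      if (cached) then
c        restore A and B from the saved copies (no transforms)
         do i = 1, n
            a(i) = asave(i)
            b(i) = bsave(i)
         end do
         deallocate(asave, bsave)
      else
c        memory is tight: transform A and B back instead
         call real2spec(a, n)
         call real2spec(b, n)
      end if
      end

c     empty stand-ins for the real distributed transforms
      subroutine spec2real(x, n)
      integer n
      real x(n)
      end

      subroutine real2spec(x, n)
      integer n
      real x(n)
      end

Because the cached path skips two transforms per product, per-PE memory matters: this is the effect note 2 describes in the jump from 2 to 4 PE's on the T3E.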

------------------------------------------------------------------
T3E/450MHz/256MB, set_d_stream(3), set_i_stream(3)               
                                                                 
 # PE's  Elapsed    PE-secs   Sys CPU    Rmt CPU    I/O Wait  Speed-
         Seconds              Seconds    Seconds    Seconds     Up
  ==== =========== ========== ========== ========== ========= ======
  2      1126.43    2216.43      23.51       0.37      11.25   0.8 
  4       552.86    2174.75      22.45       0.45      11.64   1.7
  8       294.91    2318.40      24.24       0.72      12.61   3.3
  16      170.26    2670.20      26.92       1.24      15.89   5.6


------------------------------------------------------------------      
T3E/450MHz/256MB, set_d_stream(0), set_i_stream(0)                     
                                                                       
 # PE's  Elapsed    PE-secs   Sys CPU    Rmt CPU    I/O Wait  Speed-     
         Seconds              Seconds    Seconds    Seconds     Up 
  ==== =========== ========== ========== ========== ========= =======     
  2      1434.42    2820.67      33.71       0.32      12.24   0.7
  4       707.33    2782.13      32.79       0.46      11.70   1.3
  8       373.72    2937.32      34.93       0.75      12.17   2.6
  16      209.60    3289.65      38.33       1.27      14.33   4.3
      

--------------------------------------------------------------------      
T3E/300MHz/128MB, set_d_stream(0), set_i_stream(0)                     

 #PE's  Elapsed    PE-secs    Sys CPU    I/O Wait  I/O Wait   Speed-
        Seconds               Seconds    Sec Lck   Sec Unlck    Up  
  ==== =========== ========== ========== ======== ========== =======
    4     781.22    3082.03      29.03       0.38       5.34   1.2
    8     412.06    3238.79      31.94       0.62       7.69   2.3
   16     228.58    3576.43      34.93       1.08      11.27   4.2


--------------------------------------------------------------------
T3D                                                                    
                                                                       
 #PE's  Elapsed    PE-secs    Sys CPU    I/O Wait  I/O Wait   Speed-        
        Seconds               Seconds    Sec Lck   Sec Unlck    Up  
  ==== =========== ========== ========== ======== ========== =======
   8    959.7239     7677.79     3.8669  14.8242     8.1991   1.0
  16    525.5505     8408.81     7.0130  10.2716     4.7618   1.8
  32    314.6483    10068.75    29.2128  10.2543    11.5604   3.0
  64    221.4970    14175.81    58.3187   5.7765    12.4596   4.3

Quick-Tip Q & A


A: {{ Fortran 90 programmers: what does this do, and why?
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
       program junk
       integer j(10), i

       do 100 i = 1, 10
         j(i) = i
 100   continue

       do 200 i = 1. 10
         j(i) = j(i) + i * 100
 200   continue

       print*, j
       end
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc }}


# Well, it compiles and runs:
#   $ f90 junk.f    
#   $ ./a.out
#     1,  2,  3,  4,  5,  6,  7,  8,  9,  10
# 
# but the result probably isn't what was intended.  
# 
# The typing error didn't produce a syntax error, which is surprising
# unless you know that Fortran (yes, even Fortran 90),
#   1) ignores spaces, and
#   2) unless told otherwise, uses implicit type declaration.
# 
# Trying to compile with IMPLICIT NONE shows us what happened:
#   $ f90 -eI  junk.f
#   
#          do 200 i = 1. 10
#          ^                
#   
#     cf90-1171 f90: ERROR JUNK, File = junk.f, Line = 9, Column = 8 
#       An explicit type must be specified for object "DO200I", because -eI, 
#       the IMPLICIT NONE command line option is specified.
#     
#     cf90: Cray CF90 Version 3.0.0.1 (f18p45m332a39) Mon Nov  3, 1997  11:32:12
#     cf90: COMPILE TIME 0.119398 SECONDS
#     cf90: MAXIMUM FIELD LENGTH 1288864 DECIMAL WORDS
#     cf90: 15 SOURCE LINES
#     cf90: 1 ERRORS, 0 WARNINGS, 0 OTHER MESSAGES, 0 ANSI
#     cf90: CODE: 0 WORDS, DATA: 0 WORDS
# 
# 
# Without IMPLICIT NONE, the code declares a real variable, "do200i",
# defines it, 
#   do200i = 1.10
# and the compiler optimizes it out of existence.
# 
# But what about the next assignment:
#          j(i) = j(i) + i * 100
# Recompiling with array bounds checking gives a clue:
# 
#   $ f90 -Rb junk.f 
#   $ ./a.out
#   
#   lib-1961 ./a.out: WARNING 
#     Subscript 11 is out of range for dimension 1 for array
#     'J' at line 10 in file 'junk.f' with bounds 1:10.
#    1,  2,  3,  4,  5,  6,  7,  8,  9,  10
# 
# Unfortunately, this array bounds error does not crash the program, and
# the array j is not changed.  The incorrect results are returned, et
# voila...
# 
# The suggestion is clear: always use IMPLICIT NONE and/or -eI.
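# 
# For the record, a guess at what was presumably intended: with the
# typo fixed,
# 
#        do 200 i = 1, 10
#          j(i) = j(i) + i * 100
#  200   continue
# 
# the program prints 101, 202, 303, ..., 1010.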
# 


Q: In this age of high-school hackers and down-loadable password 
   cracking programs, we are rightly asked to invent difficult,
   uncrackable passwords.  They should contain strange characters.
   They should NOT be the names of our ex-girlfriends, ex-boyfriends,
   or cars.  We should not recycle them, and we should NEVER write them
   down!  How do you invent and remember good passwords!?


  ** Does this Sound Familiar? **

"In the office where I work, there are two categories of technology
cluttering up the place. Category one: very expensive, ultra-high-
performance computers.  Category two: bicycles.  The frighteningly
clever people who work there have made clever and rational decisions
about appropriate technology, and that's what they've come up with:
computers and bicycles."

[ Taken from a news article looking at technology... ]

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.