ARSC HPC Users' Newsletter 308, January 28, 2005

ARSC Fellows Program

ARSC has openings for several postdoctoral fellowships for the academic year starting September 2005. The goal of the fellows program is to support talented young scientists in the early stage of their careers and to provide an intellectual environment in which they can pursue a research agenda of their own choosing.

Each fellow will receive a three-year term appointment with the possibility of renewal for another two years. Candidates for these fellowships must be nominated by a university faculty member or a staff member of a national laboratory. Nominations must be received by the 15th of February 2005.

For more information on the fellows program see:    http://www.arsc.edu/misc/jobs/fellowsprogram.html

ScaLAPACK Intro: Part IV of V

We ended Part III of this series ( /arsc/support/news/hpcnews/hpcnews306/index.xml ) by allocating, on each processor, precisely the amount of memory required to hold the block-cyclically distributed arrays. Now we must assign data values to that memory.
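
As a quick reminder, those local sizes came from the ScaLAPACK Tools function "numroc", which returns the number of rows or columns of a global array that land on the calling process. Here's a minimal sketch of that allocation step (the variable names are illustrative, and the grid variables are assumed to be set as in the demo):


      integer, external :: numroc
      integer           :: m_loc, n_loc
      real, allocatable :: a(:,:)

!     Local rows of A: n global rows, row block size blksz, my process
!     row, first row held by process row 0, nprow process rows.
      m_loc = numroc( n, blksz, myrow, 0, nprow )
!     Local columns of A, analogously, over the process columns.
      n_loc = numroc( n, blksz, mycol, 0, npcol )
      allocate( a(m_loc, n_loc) )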

We'll perform this data distribution task using the routine "pdelset," which is part of the ScaLAPACK Tools library. (A shortcoming of the ScaLAPACK Users' Guide is the lack of guidance it provides at this step; "pdelset," for instance, is not mentioned anywhere in the Guide.)

The advantage of "pdelset" is that it hides the gory details of the block-cyclic distribution, which makes it easy to use. The disadvantage is that it is inefficient. Here, from the Fortran source downloaded from netlib, are the interface and stated purpose of the routine:


      SUBROUTINE PDELSET( A, IA, JA, DESCA, ALPHA )
*
*  -- ScaLAPACK tools routine (version 1.7) --
*     University of Tennessee, Knoxville, Oak Ridge National Laboratory,
*     and University of California, Berkeley.
*     May 1, 1997
*
*     .. Scalar Arguments ..
      INTEGER            IA, JA
      DOUBLE PRECISION   ALPHA
*     ..
*     .. Array arguments ..
      INTEGER            DESCA( * )
      DOUBLE PRECISION   A( * )
*     ..
*
*  Purpose
*  =======
*
*  PDELSET sets the distributed matrix entry A( IA, JA ) to ALPHA.
*

Each process must call "pdelset" for each element of the globally distributed array, passing it the global indices of the element and the value to which the element should be set. "pdelset" determines if the particular global element is stored on the local processor, and if so, it computes its local indices and stores the value. If the particular global element is stored elsewhere, nothing happens.
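
To see why this works, here is, in essence, what "pdelset" does for each global element it is handed (a sketch paraphrased from the netlib source; the variable names are ours):


      subroutine sketch_pdelset( a, ia, ja, desca, alpha )
      integer          :: ia, ja, desca( * )
      double precision :: a( * ), alpha
      integer          :: nprow, npcol, myrow, mycol
      integer          :: iia, jja, iarow, iacol

!     Recover the process grid from the array descriptor.
!     (desca(2) holds the BLACS context.)
      call blacs_gridinfo( desca(2), nprow, npcol, myrow, mycol )

!     Map the global indices (ia,ja) to local indices (iia,jja) and to
!     the grid coordinates (iarow,iacol) of the process that owns them.
      call infog2l( ia, ja, desca, nprow, npcol, myrow, mycol,          &
                    iia, jja, iarow, iacol )

!     Only the owning process stores the value; all others simply
!     return.  (desca(9) holds the local leading dimension.)
      if ( myrow == iarow .and. mycol == iacol )                        &
         a( iia + (jja-1)*desca(9) ) = alpha

      return
      end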

The inefficiency is that each process must compute, or otherwise know, all values of the global array and must call "pdelset" for every global element, even those stored elsewhere. In our demo, the global array values are computed. Here's the initialization loop from the serial version of the code (from Part I of the series):


      !!! Global array dimension:  a(n,n)
      do j = 1, n
        do i = 1, n
           if (i.eq.j) then
               a(i,j) = 20.
           else
               a(i,j) = 0.
           endif
        enddo
      enddo

Here's our ScaLAPACK version, using the "pdelset" black box:


      do j = 1, n
        do i = 1, n
           if (i.eq.j) then
               val = 20.
           else
               val = 0.
           endif
           call pdelset(a,i,j,desca,val)
        enddo  
      enddo    

The code additions for this week are minor. Following array allocation, we call two initialization routines:


! -----    Initialize LHS and RHS
! 
      call init_my_rhs    (n,c,nprow,npcol,myrow,mycol,descc)
      call init_my_matrix (n,a,nprow,npcol,myrow,mycol,desca)
!

Here are these two routines:


!
!-----------------------------------------------------------------------
!
      subroutine init_my_matrix (n,a,nprow,npcol,myrow,mycol,desca)
      implicit none
      integer :: n,nprow,npcol,myrow,mycol
      integer :: desca(:)
      real    :: a(:,:)

      integer :: i, j
      real    :: val

!     Compute values for all elements of the global array, but using
!     pdelset, only set those elements which occur in local portion.
      do j = 1, n
        do i = 1, n
           if (i.eq.j) then
               val = 20.
           else
               val = 0.
           endif
           call pdelset(a,i,j,desca,val)
        enddo  
      enddo    

      return
      end
!
!-----------------------------------------------------------------------
!
      subroutine init_my_rhs (n,c,nprow,npcol,myrow,mycol,descc)
      implicit none
      integer :: n,nprow,npcol,myrow,mycol
      integer :: descc(:)
      real :: c(:,:)

      integer :: i
      real :: val

!     RHS values depend only on the global row index; as above, pdelset
!     stores only those elements that reside on the local process.
      do i = 1, n
        val = i*100.0/n
        call pdelset(c,i,1,descc,val)
      enddo

      return
      end

We also add a little code to the printing routine to show the actual data, in addition to the array dimensions and processor grid information. For instance, here's the output from processor 0. As shown in Part III, proc 0 is assigned processor grid coordinates (0,0), and its local portion of array A contains 5 rows and 8 columns:


  proc:  0 grid position:  0,  0 blksz:  4 numroc:  5:  8

The new output this week is the data. This is processor 0's 5x8 local portion of array A:


  20.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
   0.0  20.0   0.0   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0  20.0   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0  20.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
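
For reference, the printing addition is little more than a loop over the local array. Here's a minimal sketch (the routine name, arguments, and format are illustrative, not necessarily what our demo uses):


      subroutine print_local (m_loc, n_loc, a)
      implicit none
      integer :: m_loc, n_loc
      real    :: a(m_loc, n_loc)
      integer :: i

!     Each process prints only the rows and columns it actually owns.
!     A process whose local portion has zero columns prints m_loc
!     blank lines, as seen for array C below.
      do i = 1, m_loc
        write (*, '(100(f6.1,1x))') a(i, 1:n_loc)
      enddo

      return
      end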

Below is the complete output from a run of the code. If this output doesn't make sense, you might review the description of the block-cyclic distribution that appeared in Part III of this series and/or the online ScaLAPACK Users' Guide.


%    aprun -n 6 ./slv_part4
 PE= 0: 6 PROW= 0: 3 PCOL= 0: 2
 PE= 1: 6 PROW= 0: 3 PCOL= 1: 2
 PE= 2: 6 PROW= 1: 3 PCOL= 0: 2
 PE= 3: 6 PROW= 1: 3 PCOL= 1: 2
 PE= 4: 6 PROW= 2: 3 PCOL= 0: 2
 PE= 5: 6 PROW= 2: 3 PCOL= 1: 2
 DISTRIBUTION OF ARRAY: A Global dimension: 13 : 13
proc:  0 grid position:  0,  0 blksz:  4 numroc:  5:  8
  20.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
   0.0  20.0   0.0   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0  20.0   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0  20.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
proc:  1 grid position:  0,  1 blksz:  4 numroc:  5:  5
   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0  20.0 
proc:  2 grid position:  1,  0 blksz:  4 numroc:  4:  8
   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
proc:  3 grid position:  1,  1 blksz:  4 numroc:  4:  5
  20.0   0.0   0.0   0.0   0.0 
   0.0  20.0   0.0   0.0   0.0 
   0.0   0.0  20.0   0.0   0.0 
   0.0   0.0   0.0  20.0   0.0 
proc:  4 grid position:  2,  0 blksz:  4 numroc:  4:  8
   0.0   0.0   0.0   0.0  20.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0  20.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0   0.0  20.0   0.0 
   0.0   0.0   0.0   0.0   0.0   0.0   0.0  20.0 
proc:  5 grid position:  2,  1 blksz:  4 numroc:  4:  5
   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0 
   0.0   0.0   0.0   0.0   0.0 

 DISTRIBUTION OF ARRAY: C Global dimension: 13 : 1
proc:  0 grid position:  0,  0 blksz:  4 numroc:  5:  1
   7.7 
  15.4 
  23.1 
  30.8 
 100.0 
proc:  1 grid position:  0,  1 blksz:  4 numroc:  5:  0





proc:  2 grid position:  1,  0 blksz:  4 numroc:  4:  1
  38.5 
  46.2 
  53.8 
  61.5 
proc:  3 grid position:  1,  1 blksz:  4 numroc:  4:  0




proc:  4 grid position:  2,  0 blksz:  4 numroc:  4:  1
  69.2 
  76.9 
  84.6 
  92.3 
proc:  5 grid position:  2,  1 blksz:  4 numroc:  4:  0




%

Note that the blank data lines output for procs 1, 3, and 5 in the distribution of the matrix C are correct. Proc 5, for instance, occupies processor grid position (2,1), which is the 3rd processor row and 2nd processor column (indexed from 0). Since there's only one column of data in C, it is completely contained in the 1st processor column, requiring no storage in the 2nd processor column.
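
You can verify this with the same "numroc" function used for the allocation: C has one global column, distributed in blocks of 4 over the 2 processor columns, so any process in processor column 1 owns zero columns of it. A quick check (a sketch):


      integer, external :: numroc
      integer           :: ncols

!     Columns of C owned by processor column 1:  1 global column,
!     column block size 4, my processor column 1, first column held by
!     processor column 0, npcol = 2  -->  numroc returns 0.
      ncols = numroc( 1, 4, 1, 0, 2 )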

In the next part of this series we FINALLY get to call the ScaLAPACK solvers and see if they actually work!

Challenges Newsletter

There are three tsunami movies in the 2004 edition of ARSC's Challenges Newsletter, which is now available online. The Challenges Newsletter highlights current scientific research and events at ARSC. This edition discusses a wide range of topics, including bioinformatics, ocean modeling, tsunami research, space weather, and more.

Challenges can be found online at:    http://www.arsc.edu/challenges/index.html

The Cray X1 at ICM

[ Robert Osinski spent a couple of weeks at ARSC a couple of years ago. We thought we'd ask him to introduce his center, and let us know what they're doing with their X1. Thanks to Robert and Lukasz Bolikowski for the following article. ]

The Interdisciplinary Center for Modeling (ICM) is an autonomous part of Warsaw University in Poland. It is both a research center and an HPC site. During its 10-year history it has hosted a number of Cray, NEC, SGI, and Sun systems. Current machines include a Cray X1 (32 MSPs), a Cray SV1ex (32 CPUs), and a 200-CPU AMD Opteron 246 cluster. The Cray X1 is ICM's 8th Cray system (after an EL, a YMP, a T3E, a J90, and two SV1s), and it is now being upgraded to a Cray X1e.

Our Cray X1 is used primarily for quantum chemistry (Gaussian, GAMESS), molecular modeling (GROMOS, CHARMM, AMBER), and ocean and atmospheric modeling (COAMPS, Unified Model, POP, CICE).

One of the projects currently under way at both ICM and ARSC is a coupled ice-ocean model. The model consists of the POP and CICE models developed at Los Alamos National Laboratory. The Arctic Ocean is modeled at ARSC, whereas at ICM we are modeling the Baltic Sea. The Baltic is a marginal sea, so the number of PEs and the amount of memory needed are not as large as for the Arctic Ocean.

We are also developing a bioinformatics software package for local sequence alignment and biological database search. The software uses Cray's BMM (Bit Matrix Multiply) hardware to accelerate certain bit operations in the alignment and database-filtering algorithms.

For further information regarding our site, please visit our website:    http://www.icm.edu.pl/

Percussion and Art

ARSC's popular Discovery Tuesday series will continue on February 1st with a live performance by Scott Deal of the UAF Music Department. Scott will perform "For Marimba and Tape" by Australian composer Martin Wesley-Smith. The performance will be accompanied by graphics from Miho Aoki, who holds a joint appointment with ARSC and the UAF Art Department.

You can catch "Art and Music in Virtual Reality" on Tuesday, February 1, at 1 p.m. in the Discovery Lab (375 C Rasmuson).

Quick-Tip Q & A



A:[[  I am writing some code in C++ which uses a library that was 
  [[  written in C and compiled with a C compiler.   During the linking 
  [[  stage I get a bunch of undefined symbols.
  [[
  [[  E.g.
  [[  ld: 0711-317 ERROR: Undefined symbol: 
  [[                     .cb_compress(long*,long*,long,long)
  [[  ld: 0711-317 ERROR: Undefined symbol: 
  [[                     .cb_revcompl(long*,long*,long,long)
  [[  ld: 0711-317 ERROR: Undefined symbol: .cb_version(char*)
  [[  ld: 0711-317 ERROR: Undefined symbol: .cb_free(void*)
  [[  
  [[  It works fine when I recompile the library with a C++ compiler, 
  [[  but I really don't want to have two versions of the same library.  
  [[  What's going on here?  There must be a way to use a C library with 
  [[  C++ without recompiling the library.


#
# Thanks to Jesse Niles:
#

Because the library links fine when it is recompiled using the C++
compiler, I suspect that this is a name-mangling issue.  Name mangling
in C++ lets two functions or variables use the same name, allowing for
things such as function overloading.  In order for the linker to tell
the overloaded functions apart, the parameter types, parameter count,
etc. are encoded into the symbol name in a compiler-dependent way.

When a C++ compiler compiles against a C header file with function
declarations, the names are mangled, and the mangled symbols are placed
in the object files.  At link time, however, the linker looks for those
mangled names inside the library (.a) or object (.o) files produced by
the C compiler and is unable to find the definitions of the symbols
declared in the header file.  To keep the C++ compiler from mangling
the names, the following can be used (the braces are optional if the
function prototype is provided on the same line):

extern "C" { }

This can be illustrated by the following code segment:

    #include <cstddef>   // for NULL

    void cb_compress(long*, long*, long, long) { }

    extern "C" void cb_compress2(long*, long*, long, long) { }

    int main()
    {
        cb_compress(NULL, NULL, 0, 0);
        cb_compress2(NULL, NULL, 0, 0);
        return 0;
    }


As you can see, the two functions are identical except for the '2'
added to the extern'd function name.  Once compiled, 'nm' can be used
to show the resulting symbols (on AIX, the '-C' flag prevents the
demangling of C++ symbols):

 
% nm -C a.out | grep cb_compress
.cb_compress2        T   268443240
.cb_compress__FPlT1lT3 T   268443168

The former is the extern'd symbol and the latter is the C++ mangled
symbol.  This was the result on our IBM machine using xlC.  The
mangling will be different on other machines, and possibly even with
different compilers on the same machine.  Compiling with g++ on the
same machine produced the following mess:


% nm -C a.out | grep cb_compress
._GLOBAL__D__Z11cb_compressPlS_ll T   268437664
._GLOBAL__I__Z11cb_compressPlS_ll T   268437556
._Z11cb_compressPlS_ll T   268437076
.cb_compress2        T   268437160
_GLOBAL__D__Z11cb_compressPlS_ll D   536979500
_GLOBAL__D__Z11cb_compressPlS_ll d   536979500          12
_GLOBAL__F__Z11cb_compressPlS_ll D   536873120
_GLOBAL__I__Z11cb_compressPlS_ll D   536979452
_GLOBAL__I__Z11cb_compressPlS_ll d   536979452          12

Note the still intact .cb_compress2 symbol.  The best solution for
fixing the problem is to add the following to the beginning and end of
the header file:

#ifdef __cplusplus
extern "C" {
#endif

...header file contents...

#ifdef __cplusplus
}
#endif


As it is common to include C header files such as dirent.h and unistd.h
in C++ applications, you will see the above lines in most of these
files.  I have encountered header files without these lines; in those
cases, wrapping the #include directive in 'extern "C" { }' in my own
source file works, although it is presumably more proper for the header
file to provide this.


Q: The loopmark listings that the Cray compilers can output were
   really helpful when I was optimizing my code for the X1.  Do the
   IBM compilers have an option which will produce a human readable
   output of optimized code?  I tried the compiler flag '-qlist' and
   it produced a listing, but all it had was Power4 assembly code.

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.