
 

ARSC HPC Users' Newsletter 306, December 17, 2004



 

tcsh "autologout," Especially on the ARSC IBMs

UNIX shells always seem to find a way to keep you on your toes. If you use tcsh and regularly, but mysteriously, find yourself logged out of terminal sessions, the problem might be the autologout functionality. When the variable "autologout" is set, tcsh will *automatically* disconnect the session after the specified number of minutes of idle time. E.g.:

  # first set the autologout variable to 60 minutes.
  iceberg1 3% set autologout=60

  # verify that it is set using the built-in shell command, 'set'.
  iceberg1 4% set | grep auto
  autologout      60

In this case the shell will disconnect after 60 minutes of idle time. When that time has elapsed, we would see the following:

  iceberg1 4% auto-logout
  Connection to iceberg closed.

In some versions of tcsh, including those installed on iceberg and iceflyer, the autologout variable may be set by default (!!!) to 60 minutes. Sifting through the tcsh man page, one finds that autologout is set to 60 minutes by default in login and superuser shells, but not if the shell "thinks" it is running under a window system. It checks whether the environment variable DISPLAY is set to decide whether the session is running under a window system.

Thus, the autologout variable could be set for you automatically:

  1. If you use a terminal program which does not support X11.
  2. If you use a terminal program which does support X11, but you are using a connection method which does not set the DISPLAY environment variable for you (e.g., krlogin, ktelnet).

Note: ssh sets DISPLAY when X11 forwarding is enabled, as long as the terminal supports X11.

Fortunately, if, for some crazy reason, you don't want to be logged out automatically, you can get around this by unsetting the autologout variable. E.g.:

  unset autologout

This command can be added to your .cshrc if you would like to permanently disable autologout.

Remember, this applies only to tcsh.

 

ScaLAPACK Intro: Part III of V

In this issue, we finally begin distributing the data arrays across all processors. This is where the dream of performing linear algebra on large arrays, say, 300000 x 300000 64-bit REALs, starts to become reality. For the record, a couple of arrays this size would indeed fit on ARSC's IBM cluster, with room to spare.

As described in Part I, ScaLAPACK requires that the data arrays be block-cyclically distributed across the 2D processor grid (which was created in Part II). For each array, we need to take these basic steps:

  1. settle on a block size
  2. call DESCINIT to create a standard ScaLAPACK array descriptor
  3. allocate local memory for each process' portion of the array
  4. distribute the actual data values into the allocated memory

(I've decided to save step 4 for the next issue--which is why this series has just expanded from IV to V parts). So, step 1:

  1. settle on a block size

    Yes, the programmer, not ScaLAPACK, determines the block size. "Block size" refers to the dimensions of the subdivisions into which the global array is decomposed. These blocks are then dealt out to processors like playing cards being dealt in a game of canasta. As an example, here's a 4x5 array:

      00    01    02    03    04
      10    11    12    13    14
      20    21    22    23    24
      30    31    32    33    34
    
    Given a block size of 2x2, this array would decompose into the following six blocks:
      00    01
      10    11
    
      02    03
      12    13
    
      04
      14
    
      20    21
      30    31
    
      22    23
      32    33
    
      24
      34
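
    The decomposition above can be reproduced with a short sketch. It is written in Python rather than Fortran only so that it runs without a ScaLAPACK installation, and the helper name split_into_blocks is hypothetical. It slices the global array at multiples of the block size, so blocks on the right and bottom edges come out smaller.

```python
def split_into_blocks(array, mb, nb):
    """Cut a 2D list into mb x nb blocks, one row of blocks at a time.

    Blocks on the right and bottom edges may be smaller than mb x nb.
    """
    nrows, ncols = len(array), len(array[0])
    blocks = []
    for r0 in range(0, nrows, mb):
        for c0 in range(0, ncols, nb):
            blocks.append([row[c0:c0 + nb] for row in array[r0:r0 + mb]])
    return blocks


# The 4x5 example array, with each element labelled by its (row, column).
example = [[f"{i}{j}" for j in range(5)] for i in range(4)]
blocks = split_into_blocks(example, 2, 2)

for block in blocks:
    for row in block:
        print("    ".join(row))
    print()
```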
    

    The ScaLAPACK User Guide recommends a block size of 64x64 for large arrays, and in my rough experimentation on both the X1 and IBM cluster, 64x64 was indeed best.

    To compute the block size in the sample code, I've borrowed another subroutine from Carlo Cavazzoni of CINECA. "Blockset" chooses a good block size based on the size of the global array and the number of rows and columns in the processor grid. It also honors a maximum block size value, which, following the User Guide recommendation, can simply be set at 64.
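
    To see what "blockset" returns for particular inputs, here is a Python transliteration of the Fortran routine from this issue's listing. It is only a sketch for experimenting with values, and it returns the block size instead of passing it back through an argument.

```python
def blockset(nbuser, n, nprow, npcol):
    """Python transliteration of the Fortran blockset routine."""
    nb = min(n // nprow, n // npcol)  # Fortran integer division truncates
    if nbuser > 0:
        nb = min(nb, nbuser)          # honor the user-supplied maximum
    return max(nb, 1)                 # never return a block size below 1


# The 13x13 test array on a 3x2 processor grid.
nb_small = blockset(64, 13, 3, 2)

# A large array on the same grid: the maximum of 64 takes over.
nb_large = blockset(64, 300000, 3, 2)

print(nb_small, nb_large)  # 4 64
```

    For the small test array the grid shape limits the block size to 4, while for large arrays the result is simply the maximum passed in.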

  2. call DESCINIT to create a standard ScaLAPACK array descriptor

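    At this point, we have this information: the global dimensions of the array, the block size, and the BLACS context that identifies the processor grid.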

    We only need two additional bits of information to complete the description. The first is easy, the processor that will own the first element of the array, which we arbitrarily set to the processor with grid coordinates (0,0). The second is more interesting. From the "man" page for the ScaLAPACK Tools routine, DESCINIT, we need this:

         lld       Integer. (input) The leading dimension of the local array
                   that stores the local blocks of the distributed matrix.
    

    Each process can get the "lld" of the array using another Tools routine, NUMROC ("NUM Rows Or Columns"). This is described in the IBM PESSL manual:

    "This function (NUMROC) computes the local number of rows or columns of a block-cyclically distributed matrix contained in a process row or process column, respectively, indicated by the calling sequence argument iproc."

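    Since each process only needs the local leading dimension for DESCINIT (i.e., the local number of ROWs), forget about columns, and call NUMROC with input arguments for the number of rows in the global array, the block size, the calling process's row coordinate in the processor grid, the row coordinate of the process that owns the first row of the array, and the number of rows in the processor grid.

    NUMROC will then magically return the number of rows of the array that are stored locally on the calling process.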

    Whew! We can now call DESCINIT, which returns the required array descriptor for all subsequent ScaLAPACK calls involving the array.
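
    NUMROC is less magical than it sounds. The following Python sketch mirrors the arithmetic of the reference routine as I understand it; it is not the library source, and the argument names simply follow the Fortran call numroc(n, nb, iproc, isrcproc, nprocs) used in the test code.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows (or columns) of an n-long dimension, split into
    blocks of nb and dealt out cyclically, that land on process iproc."""
    # Distance of this process from the one that owns the first block.
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                   # number of complete blocks
    count = (nblocks // nprocs) * nb    # every process gets at least this
    extrablks = nblocks % nprocs        # complete blocks left over
    if mydist < extrablks:
        count += nb                     # one extra complete block
    elif mydist == extrablks:
        count += n % nb                 # the final, partial block
    return count


# Local row counts for the 13-row test array, block size 4, 3 process rows.
local_rows = [numroc(13, 4, myrow, 0, 3) for myrow in range(3)]
print(local_rows)  # [5, 4, 4]
```

    These values agree with the first "numroc" figure reported by each process row in the program output later in this article.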

  3. allocate local memory for each process' portion of the array

    In my trivial "block-size" example, above, the 2x2 block size did not divide the 4x5 data array into equally sized blocks. One outcome of this fact is that if we distributed the 6 blocks onto, for instance, 6 processors, the local storage required for this array would vary between the processors. In general, the local storage will indeed vary between processors.

    Luckily, we're only missing one bit of information to determine each processor's exact local array dimensions.

    Having already called NUMROC to determine the local number of rows, we simply call it again (with basic column information as input) to obtain the local number of columns.

    Given the local dimensions, each process uses Fortran ALLOCATE to allocate its portion of the array in local heap memory. The sum of all the memory used in the local arrays will add up to the size of the global array.
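
    The claim that the local arrays add up to the global array can be checked by brute force. This Python sketch, with the hypothetical helper local_counts, deals global indices out block by block starting at process 0 and counts what each process receives. It assumes the 13x13 array, block size 4, and 3x2 grid of the test run.

```python
from collections import Counter


def local_counts(n, nb, nprocs):
    """Count how many of n global rows (or columns) land on each process
    when blocks of size nb are dealt out cyclically from process 0."""
    counts = Counter((g // nb) % nprocs for g in range(n))
    return [counts[p] for p in range(nprocs)]


rows = local_counts(13, 4, 3)   # rows held by each process row
cols = local_counts(13, 4, 2)   # columns held by each process column

# Local storage on process (r, c) is rows[r] * cols[c] elements.
total = sum(r * c for r in rows for c in cols)
print(rows, cols, total)  # [5, 4, 4] [8, 5] 169
```

    The total of 169 elements is exactly 13 x 13, even though the individual local arrays range from 4x5 to 5x8.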

---

The test code, below, now does everything through array allocation. I've added a subroutine "printlocals" to display information about the distributed arrays. Here's program output, with some explanations interspersed:

  KLONDIKE:baring$ aprun -n 6 ./slv_part3
   PE= 0: 6 PROW= 0: 3 PCOL= 0: 2
   PE= 1: 6 PROW= 0: 3 PCOL= 1: 2
   PE= 2: 6 PROW= 1: 3 PCOL= 0: 2
   PE= 3: 6 PROW= 1: 3 PCOL= 1: 2
   PE= 4: 6 PROW= 2: 3 PCOL= 0: 2
   PE= 5: 6 PROW= 2: 3 PCOL= 1: 2

The above output is unchanged from last week. In this run we have 6 processors, arranged in a 3x2 virtual processor grid. Each process is reporting its position in the processor grid.
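
To see why 6 processors end up in a 3x2 grid, here is a Python transliteration of the "gridsetup" routine from the listing in this issue, followed by the row-major placement of processes requested by the 'r' argument to blacs_gridinit. This is a sketch for checking values by hand, not a replacement for the BLACS calls.

```python
import math


def gridsetup(nproc):
    """Python transliteration of the Fortran gridsetup routine: nprow is
    the largest divisor of nproc not exceeding int(sqrt(nproc) + 1)."""
    sqrtnp = int(math.sqrt(nproc) + 1)
    nprow = 1
    for i in range(1, sqrtnp + 1):
        if nproc % i == 0:
            nprow = i
    return nprow, nproc // nprow


nprow, npcol = gridsetup(6)
print(nprow, npcol)  # 3 2

# Row-major ordering: process numbers fill each grid row before the next.
positions = [divmod(pe, npcol) for pe in range(6)]
print(positions)  # [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```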

   DISTRIBUTION OF ARRAY: A Global dimension: 13 : 13
  proc:  0 grid position:  0,  0 blksz:  4 numroc:  5:  8
  proc:  1 grid position:  0,  1 blksz:  4 numroc:  5:  5
  proc:  2 grid position:  1,  0 blksz:  4 numroc:  4:  8
  proc:  3 grid position:  1,  1 blksz:  4 numroc:  4:  5
  proc:  4 grid position:  2,  0 blksz:  4 numroc:  4:  8
  proc:  5 grid position:  2,  1 blksz:  4 numroc:  4:  5

The next block of output, above, shows that the global dimensions of array A are (13 rows : 13 columns). Each process also tells us that the block size is 4 and the total number of rows and columns in its local portion of the array. For instance, proc (0,0) has 5 rows and 8 columns of "A."

   DISTRIBUTION OF ARRAY: C Global dimension: 13 : 1
  proc:  0 grid position:  0,  0 blksz:  4 numroc:  5:  1
  proc:  1 grid position:  0,  1 blksz:  4 numroc:  5:  0
  proc:  2 grid position:  1,  0 blksz:  4 numroc:  4:  1
  proc:  3 grid position:  1,  1 blksz:  4 numroc:  4:  0
  proc:  4 grid position:  2,  0 blksz:  4 numroc:  4:  1
  proc:  5 grid position:  2,  1 blksz:  4 numroc:  4:  0

The final block of output shows the distribution of the 13x1, right hand side, array. The 3 processors in the processor grid's second column will each store 0 elements of this array.

A little more ASCII art might help explain this distribution. First, here's a 13x13 array:

00 01 02 03 04 05 06 07 08 09 0A 0B 0C
10 11 12 13 14 15 16 17 18 19 1A 1B 1C
20 21 22 23 24 25 26 27 28 29 2A 2B 2C
30 31 32 33 34 35 36 37 38 39 3A 3B 3C
40 41 42 43 44 45 46 47 48 49 4A 4B 4C
50 51 52 53 54 55 56 57 58 59 5A 5B 5C
60 61 62 63 64 65 66 67 68 69 6A 6B 6C
70 71 72 73 74 75 76 77 78 79 7A 7B 7C
80 81 82 83 84 85 86 87 88 89 8A 8B 8C
90 91 92 93 94 95 96 97 98 99 9A 9B 9C
A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC 
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 CA CB CC 

Given the block size of 4x4, here is the first row of blocks into which this array would decompose:

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

04 05 06 07
14 15 16 17
24 25 26 27
34 35 36 37

08 09 0A 0B
18 19 1A 1B
28 29 2A 2B
38 39 3A 3B

0C
1C
2C
3C

In the block-cyclic distribution, this first row of blocks is dealt out entirely to the first row of processors in the processor grid. Thus, given our 3x2 grid of processors, these four blocks would be assigned as follows:

--proc (0,0):
00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

--proc (0,1):
04 05 06 07
14 15 16 17
24 25 26 27
34 35 36 37

--proc (0,0):
08 09 0A 0B
18 19 1A 1B
28 29 2A 2B
38 39 3A 3B

--proc (0,1):
0C
1C
2C
3C

This distribution of the first row of blocks can be redrawn as:

--proc (0,0):
00 01 02 03 08 09 0A 0B
10 11 12 13 18 19 1A 1B
20 21 22 23 28 29 2A 2B
30 31 32 33 38 39 3A 3B

--proc (0,1):
04 05 06 07 0C
14 15 16 17 1C
24 25 26 27 2C
34 35 36 37 3C

We see that proc (0,0) gets 8 of the 13 total columns of the array, while proc (0,1) gets 5. Re-examining the information dumped by the test code for these two processors, we see this:

  proc:  0 grid position:  0,  0 blksz:  4 numroc:  5:  8
  proc:  1 grid position:  0,  1 blksz:  4 numroc:  5:  5

The number of columns is the last value on these two lines, and, yes, processor (0,0) indeed reports 8 columns and processor (0,1) reports 5. We could perform a similar analysis on the first column of blocks to see how it's dealt out to the first column of processors. We could do the same thing for the entire array.
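
The dealing-out can also be written as a small index calculation. The Python sketch below (the helper name owner_and_local is hypothetical) maps a 0-based global column index to the process column that owns it and to its position in that process's local array, assuming the first block goes to process column 0 as in the test code.

```python
def owner_and_local(g, nb, nprocs):
    """Map a 0-based global index to (owning process, 0-based local index)
    for a block-cyclic distribution that starts at process 0."""
    block = g // nb                          # which block the index is in
    proc = block % nprocs                    # blocks are dealt out cyclically
    local = (block // nprocs) * nb + g % nb  # earlier local blocks + offset
    return proc, local


NB, NPCOL = 4, 2
labels = "0123456789ABC"   # column labels used in the 13x13 picture

cols_by_proc = {p: [] for p in range(NPCOL)}
for g in range(13):
    proc, local = owner_and_local(g, NB, NPCOL)
    cols_by_proc[proc].append(labels[g])

print(cols_by_proc[0])  # ['0', '1', '2', '3', '8', '9', 'A', 'B']
print(cols_by_proc[1])  # ['4', '5', '6', '7', 'C']
```

The same function applied to row indices with 3 process rows gives the row distribution, so applying it to both indices places any element of the global array.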

If this explanation leaves you confused, be sure to check out the ScaLAPACK User Guide, the CINECA tutorials, or other resources. Here's the updated code:

      program slv
!
! Test program, sets up and solves a simple system of equations
! using LAPACK routines.
!
      implicit none
      integer   ::       n,istat,info,i,j
      real,dimension(:,:),allocatable  :: a
      real,dimension(:,:),allocatable  :: c
      integer,dimension(:),allocatable :: ipiv

      parameter (n = 13)

      integer   ::       context, iam, pnum
      integer   ::       mycol, myrow, nb
      integer   ::       npcol, nprocs, nprow
      integer   ::       l_nrowsa,l_ncolsa
      integer   ::       l_nrowsc,l_ncolsc

      integer,parameter :: descriptor_len=9
      integer   ::       desca( descriptor_len )
      integer   ::       descc( descriptor_len )

      integer   ::       numroc

      interface 
        subroutine say_hello ( context,                          &
        iam,nprocs,myrow,nprow,mycol,npcol)
        implicit none
        integer :: context
        integer :: iam,nprocs,myrow,nprow,mycol,npcol
        integer :: i,j,pnum
        end subroutine say_hello

        subroutine printlocals( context,                          &
        a,iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols)
        implicit none
        real    :: a(:,:)
        integer :: context
        integer :: iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols
        integer :: i,j,pnum
        end subroutine printlocals
      end interface 
!
! -----    Initialize the blacs.  Note: processors are counted starting at 0.
!
      call blacs_pinfo( iam, nprocs )
!
! -----    Set the dimension of the 2d processors grid.
!
      call gridsetup(nprocs,nprow,npcol)
!
! -----    Initialize a single blacs context.  Determine which processor I 
!          am in the 2D processor grid.
!
      call blacs_get( -1, 0, context )
      call blacs_gridinit( context, 'r', nprow, npcol )
      call blacs_gridinfo( context, nprow, npcol, myrow, mycol )
      call say_hello (context,                     &
        iam,nprocs,myrow,nprow,mycol,npcol)
!
! -----    Calculate the blocking factor for the matrix. 
!
      call blockset( nb, 64, n, nprow, npcol)
!
! -----    Distributed matrices: get num. local rows/cols. Create description. 
!
      l_nrowsa = numroc(n,nb,myrow,0,nprow)
      l_ncolsa = numroc(n,nb,mycol,0,npcol)
      call descinit( desca, n, n, nb, nb, 0, 0, context, l_nrowsa, info )

      l_nrowsc = numroc(n,nb,myrow,0,nprow)
      l_ncolsc = numroc(1,nb,mycol,0,npcol)
      call descinit( descc, n, 1, nb, nb, 0, 0, context, l_nrowsc, info )
!
! -----   Allocate LHS, RHS, pivot -----
!
      allocate (a(l_nrowsa, l_ncolsa ), stat=istat)
      if (istat/=0) stop "ERR:ALLOCATE FAILS"

      allocate (c(l_nrowsc,l_ncolsc), stat=istat)
      if (istat/=0) stop "ERR:ALLOCATE FAILS"

      allocate (ipiv (n), stat=istat)
      if (istat/=0) stop "ERR:ALLOCATE FAILS"
!
! -----    Show how arrays distributed
! 
      if (iam.eq.0) write(6,*) "DISTRIBUTION OF ARRAY: A",           &
        " Global dimension:",n,":",n
      call printlocals ( context,                                    &
        a,iam,nb,nprocs,myrow,mycol,l_nrowsa,l_ncolsa)

      if (iam.eq.0) write(6,*) "DISTRIBUTION OF ARRAY: C",           &
        " Global dimension:",n,":",1
      call printlocals ( context,                                    &
        c,iam,nb,nprocs,myrow,mycol,l_nrowsc,l_ncolsc)
!
! -----    Cleanup arrays -----
!
      deallocate (a,c,ipiv)
!
! -----    Exit BLACS cleanly -----
!
      call blacs_gridexit( context )
      call blacs_exit( 0 )

      end program slv
!
!-----------------------------------------------------------------------
!
      subroutine gridsetup(nproc,nprow,npcol)
!
! This subroutine factorizes the number of processors (nproc)
! into nprow and npcol,  that are the sizes of the 2d processors mesh.
!
! Written by Carlo Cavazzoni
!
      implicit none
      integer nproc,nprow,npcol
      integer sqrtnp,i

      sqrtnp = int( sqrt( dble(nproc) ) + 1 )
      do i=1,sqrtnp
        if(mod(nproc,i).eq.0) nprow = i
      end do
      npcol = nproc/nprow

      return
      end
!
!-----------------------------------------------------------------------
!
      subroutine blockset( nb, nbuser, n, nprow, npcol)
!
!     This subroutine tries to choose an optimal block size
!     for the distributed matrix.
!
!     Written by Carlo Cavazzoni, CINECA
!
      implicit none
      integer nb, n, nprow, npcol, nbuser

      nb = min ( n/nprow, n/npcol )
      if(nbuser.gt.0) then
        nb = min ( nb, nbuser )
      endif
      nb = max(nb,1)

      return
      end subroutine blockset
!
!-----------------------------------------------------------------------
!
      subroutine say_hello ( context,                          &
        iam,nprocs,myrow,nprow,mycol,npcol)
!
! Each processor identifies itself and its place in the processor grid
!
        implicit none
        integer :: context
        integer :: iam,nprocs,myrow,nprow,mycol,npcol
        integer :: i,j,pnum

        do i=0,nprocs-1
          call blacs_barrier (context, 'a')

          if (iam.eq.i) then
            write(6,100) iam,nprocs,myrow,nprow,mycol,npcol
100         format(" PE=",i2,":",i2," PROW=",i2,":",i2," PCOL=",i2,":",i2)
            call flush(6_4)
          endif
        enddo

        call blacs_barrier (context, 'a')
      end subroutine say_hello
!
!-----------------------------------------------------------------------
!
      subroutine printlocals( context,                          &
        a,iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols)
!
! Prints the local array "a", processor by processor. Only use this
! for very small arrays or sub-arrays...
!
      implicit none
      real    :: a(:,:)
      integer :: context
      integer :: iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols
      integer :: i,j,pnum

      call blacs_barrier (context, 'a')

      do pnum=0,nprocs-1
        if (iam .eq. pnum) then
          write (6,100) iam,myrow,mycol,nb,l_nrows,l_ncols
100       format ("proc:",i3," grid position:",i3,",",i3,        &
                  " blksz:",i3," numroc:",i3,":",i3)
!!!          do i=1,l_nrows
!!!            write (6,200) (a(i,j),j=1,l_ncols)
!!!200         format (20(" ",f5.1))
!!!            call flush(6_4)
!!!          enddo
        endif
        call blacs_barrier (context, 'a')
      enddo

      call flush(6_4)
      end subroutine printlocals

In Part IV of this series, we'll actually populate the allocated local arrays with data. Can you feel the excitement starting to build?

 

NCL Upgrade Announced

The NCL developers have issued the following announcement. This isn't installed at ARSC yet. NCL users should let us know if they're interested.

> We are pleased to announce that a new version of NCL (4.2.0.a032)
> is now available for download:
>
> http://ngwww.ucar.edu/ncl/download.html
>
> This version has the long-awaited feature of being able to generate
> contours on non-uniform grids. This is probably the most significant
> functionality added to NCL since its release in 1995.
>
> For more information on the types of grids that NCL can now contour
> and to view some examples, see:
>
> http://ngwww.ucar.edu/ncl/grids/
>
> Since this new functionality has only been tested on a handful of
> grids, we are asking the user community to help us test this new
> capability. We will be happy to help you get started.
>
> This version also has several new processing and graphics functions
> and capability enhancements. To list a few:
>
> - a much faster version of the EOF routines
> - functions for writing Vis5D files
> - an enhanced GRIB reader
> - function for assigning randomly spaced data onto the nearest
> locations of a grid with two-dimensional latitude and longitude
> arrays
>
> For the full list of what's new, see:
> http://ngwww.ucar.edu/ncl/whatsnew.html

 

OpenMP 2.5 Draft Specification Released for Public Comments

Here's an announcement from OpenMP.org:

> The OpenMP ARB is pleased to announce the release of a draft of the 2.5
> specifications for public comment. The goal of the 2.5 effort was to
> combine the Fortran and C/C++ specifications into a single one
> and to fix inconsistencies. The language committee has done a phenomenal
> job for the last 18 months in combining the old specifications -- every
> word has been carefully reviewed and almost the entire specification has
> been rewritten.
>
> The ARB warmly welcomes any comments, corrections and suggestions you
> have for Version 2.5. Please send email to feedback@openmp.org. It is
> most helpful if you can refer to the page number and line number where
> appropriate.
>
> The public comment period will close on 31 January 2005.

 

Holiday Greetings

Happy holidays, everyone. Here's a seasonal image for you from the land of winter... (this image was submitted with the photography "Quick-Tip", below)

 

Quick-Tip Q & A

A:[[  Holiday fun... Share a tip on ANY SUBJECT!  Anything!  Teach us
  [[ something.  What do you do well?

# 
# Thanks to all for the big response!     
# 

 

 

[[ Answers, Questions, and Tips Graciously Accepted ]]

 

 


Current Editors:
Thomas J. Baring ARSC Web Specialist ph: 907-450-8619
Donald Bahls ARSC User Consultant ph: 907-450-8674
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Contact:
Send comments and questions to the current editors using this Contact Form.

 


 

Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775 | voice: 907-474-6935
