ARSC HPC Users' Newsletter 306, December 17, 2004
- tcsh "autologout," Especially on the ARSC IBMs
- ScaLAPACK Intro: Part III of V
- NCL Upgrade Announced
- OpenMP 2.5 Draft Specification Released for Public Comments
- Holiday Greetings
- Quick Tip
tcsh "autologout," Especially on the ARSC IBMs
UNIX shells always seem to find a way to keep you on your toes. If you use tcsh and regularly, but mysteriously, find yourself logged out of terminal sessions, the problem might be the autologout functionality. When the variable "autologout" is set, the tcsh shell will *automatically* disconnect the session after the specified number of minutes. E.g.:
# first set the autologout variable to 60 minutes.
iceberg1 3% set autologout=60
# verify that it is set using the built-in shell command, 'set'.
iceberg1 4% set | grep auto
autologout 60
In this case the shell will disconnect after 60 minutes of idle time, and we would see the following:
iceberg1 4% auto-logout
Connection to iceberg closed.
In some versions of tcsh, including those installed on iceberg and iceflyer, the autologout variable may be set by default (!!!) to 60 minutes. Sifting through the tcsh man page, one finds that autologout is set to 60 minutes by default in login and superuser shells, but not if the shell "thinks" it is running under a window system. It checks whether the environment variable DISPLAY is set to decide whether or not the session is running under a window system.
Thus, the autologout variable could be set for you automatically:
- If you use a terminal program which does not support X11.
- If you use a terminal program which does support X11, but you are using a connection method which does not set the DISPLAY environment variable for you (e.g., krlogin, ktelnet).
Note: ssh sets DISPLAY, as long as the terminal supports X11.
Fortunately, if, for some crazy reason, you don't want to be logged out automatically, you can get around this by unsetting the autologout variable. E.g.:
unset autologout
This command can be added to your .cshrc if you would like to permanently disable autologout.
Remember this only applies to the tcsh shell.
ScaLAPACK Intro: Part III of V
In this issue, we finally begin distributing the data arrays across all processors. This is where the dream of performing linear algebra on large arrays, say, 300000 x 300000 64-bit REALs, starts to become reality. For the record, a couple of arrays this size would indeed fit on ARSC's IBM cluster, with room to spare.
As described in Part I, ScaLAPACK requires that the data arrays be block-cyclically distributed across the 2D processor grid (which was created in Part II). For each array, we need to take these basic steps:
1. settle on a block size
2. call DESCINIT to create a standard ScaLAPACK array descriptor
3. allocate local memory for each process' portion of the array
4. distribute the actual data values into the allocated memory
(I've decided to save step 4 for the next issue--which is why this series has just expanded from IV to V parts). So, step 1:
settle on a block size
Yes, the programmer, not ScaLAPACK, determines the block size. "Block size" refers to the dimensions of the subdivisions into which the global array is decomposed. These blocks are then dealt out to processors like playing cards being dealt in a game of canasta. As an example, here's a 4x5 array:
00 01 02 03 04
10 11 12 13 14
20 21 22 23 24
30 31 32 33 34

Given a block size of 2x2, this array would decompose into the following six blocks:

00 01    02 03    04
10 11    12 13    14

20 21    22 23    24
30 31    32 33    34
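The block count in this example is just a ceiling division in each dimension. Here is a minimal sketch of that arithmetic; the program and variable names are my own for illustration and are not part of ScaLAPACK or the sample code.

```fortran
program count_blocks
  ! Sketch only: counts the blocks produced when an m x n array is cut
  ! into mb x nb pieces. All names here are illustrative.
  implicit none
  integer, parameter :: m = 4, n = 5      ! global array: 4 rows, 5 columns
  integer, parameter :: mb = 2, nb = 2    ! block size: 2 x 2
  integer :: nblkrow, nblkcol, ib, jb, rows, cols

  ! Ceiling division: a partial block at the edge still counts as a block.
  nblkrow = (m + mb - 1) / mb
  nblkcol = (n + nb - 1) / nb
  write(*,'(a,i0,a,i0,a,i0)') 'blocks: ', nblkrow, ' x ', nblkcol, ' = ', &
       nblkrow*nblkcol

  do ib = 0, nblkrow - 1
    do jb = 0, nblkcol - 1
      ! Edge blocks may be smaller than mb x nb.
      rows = min(mb, m - ib*mb)
      cols = min(nb, n - jb*nb)
      write(*,'(a,i0,a,i0,a,i0,a,i0)') 'block (', ib, ',', jb, ') is ', &
           rows, ' x ', cols
    end do
  end do
end program count_blocks
```

For the 4x5 array this reports 2 x 3 = 6 blocks, with the blocks in the last block column only one column wide.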
The ScaLAPACK User Guide recommends a block size of 64x64 for large arrays, and in my rough experimentation on both the X1 and IBM cluster, 64x64 was indeed best.
To compute the block size in the sample code, I've borrowed another subroutine from Carlo Cavazzoni of CINECA. "Blockset" chooses a block size based on the size of the global array and the number of rows and columns in the processor grid. It also honors a maximum block size value, which, following the User Guide recommendation, can simply be set at 64. For the 13x13 test array on the 3x2 processor grid used later in this article, for example, it returns min(13/3, 13/2, 64) = 4 (using integer division).
call DESCINIT to create a standard ScaLAPACK array descriptor
At this point, we have this information:
- dimensions of the global array
- block size
- BLACS context
We only need two additional bits of information to complete the description. The first is easy: the processor that will own the first element of the array, which we arbitrarily set to the processor with grid coordinates (0,0). The second is more interesting. From the "man" page for the ScaLAPACK Tools routine, DESCINIT, we need this:
lld Integer. (input) The leading dimension of the local array that stores the local blocks of the distributed matrix.
Each process can get the "lld" of the array using another Tools routine, NUMROC ("NUM Rows Or Columns"). This is described in the IBM PESSL manual:
"This function (NUMROC) computes the local number of rows or columns of a block-cyclically distributed matrix contained in a process row or process column, respectively, indicated by the calling sequence argument iproc."
Since each process only needs the local leading dimension for DESCINIT (i.e., the local number of ROWs), forget about columns, and call NUMROC with input arguments for:
- the number of ROWs in the global array
- the number of ROWs in a block
- the total number of ROWs in the process grid
- the local processor's ROW number in the process grid
and NUMROC will magically return:
- the total number of ROWs in the local portion of the distributed array. In other words, NUMROC returns the local leading dimension.
Whew! We can now call DESCINIT, which returns the required array descriptor for all subsequent ScaLAPACK calls involving the array.
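To make the NUMROC arithmetic less magical, here is a small stand-alone sketch of the block-cyclic counting rule as I understand it. The function "numroc_sketch" is a hypothetical stand-in written for illustration; it is not the library routine. Its argument order mirrors the call used in the sample code, numroc(n,nb,myrow,0,nprow), where the 0 is the grid coordinate that owns the first block.

```fortran
module numroc_sketch_mod
  implicit none
contains
  ! Illustrative re-derivation of the count NUMROC returns; this is a
  ! sketch for study, not the ScaLAPACK library routine itself.
  pure function numroc_sketch(n, nb, iproc, isrcproc, nprocs) result(nloc)
    integer, intent(in) :: n         ! global number of rows (or columns)
    integer, intent(in) :: nb        ! block size along that dimension
    integer, intent(in) :: iproc     ! this process's row (or column) coordinate
    integer, intent(in) :: isrcproc  ! coordinate that owns the first block
    integer, intent(in) :: nprocs    ! process rows (or columns) in the grid
    integer :: nloc
    integer :: mydist, nblocks, extrablks

    mydist    = mod(nprocs + iproc - isrcproc, nprocs)  ! distance from the source
    nblocks   = n / nb                                  ! number of whole blocks
    nloc      = (nblocks / nprocs) * nb                 ! every process gets these
    extrablks = mod(nblocks, nprocs)                    ! leftover whole blocks
    if (mydist < extrablks) then
      nloc = nloc + nb                                  ! one extra whole block
    else if (mydist == extrablks) then
      nloc = nloc + mod(n, nb)                          ! the trailing partial block
    end if
  end function numroc_sketch
end module numroc_sketch_mod

program show_numroc
  use numroc_sketch_mod
  implicit none
  integer, parameter :: n = 13, nb = 4, nprow = 3, npcol = 2
  integer :: myrow, mycol

  ! Walk the 3x2 grid and report each process's local dimensions.
  do myrow = 0, nprow - 1
    do mycol = 0, npcol - 1
      write(*,'(a,i0,a,i0,a,i0,a,i0)') 'grid (', myrow, ',', mycol, &
           '): local rows = ', numroc_sketch(n, nb, myrow, 0, nprow), &
           ', local cols = ', numroc_sketch(n, nb, mycol, 0, npcol)
    end do
  end do
end program show_numroc
```

With n = 13 and nb = 4 the sketch gives 5, 4 and 4 local rows for the three process rows, and 8 and 5 local columns for the two process columns. In real code, call the library's NUMROC rather than a hand-rolled copy.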
allocate local memory for each process' portion of the array
In my trivial "block-size" example, above, the 2x2 block size did not divide the 4x5 data array into equally sized blocks. One outcome of this fact is that if we distributed the 6 blocks onto, for instance, 6 processors, the local storage required for this array would vary between the processors. In general, the local storage will indeed vary between processors.
Luckily, we're only missing one bit of information to determine each processor's exact local array dimensions.
Having already called NUMROC to determine the local number of rows, we simply call it again (with basic column information as input) to obtain the local number of columns.
Given the local dimensions, each process uses Fortran ALLOCATE to allocate its portion of the array in local heap memory. The sum of all the memory used in the local arrays will add up to the size of the global array.
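The claim that the local pieces add up to the global array can be checked by brute force. The following sketch assumes a 13x13 array, 4x4 blocks, a 3x2 processor grid, and the first block on process (0,0); it assigns every global element to its owner and counts. The program and variable names are mine, chosen for illustration.

```fortran
program check_local_sizes
  ! Sketch only: counts, by brute force, how many global elements each
  ! process of a 3x2 grid would own for a 13x13 array with 4x4 blocks,
  ! assuming the first block lives on process (0,0).
  implicit none
  integer, parameter :: n = 13, nb = 4, nprow = 3, npcol = 2
  integer :: nlocal(0:nprow-1, 0:npcol-1)
  integer :: i, j, prow, pcol

  nlocal = 0
  do j = 0, n - 1
    do i = 0, n - 1
      prow = mod(i / nb, nprow)   ! process row owning global row i
      pcol = mod(j / nb, npcol)   ! process column owning global column j
      nlocal(prow, pcol) = nlocal(prow, pcol) + 1
    end do
  end do

  ! One line per process row, listing the element count in each process column.
  do prow = 0, nprow - 1
    write(*,'(a,i0,a,2(1x,i0))') 'process row ', prow, ':', nlocal(prow, :)
  end do
  write(*,'(a,i0,a,i0)') 'sum of local sizes = ', sum(nlocal), &
       ', global size = ', n*n
end program check_local_sizes
```

The six counts are 40, 25, 32, 20, 32 and 20, which sum to 169 = 13 x 13, so no element is stored twice and none is lost.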
The test code, below, now does everything through array allocation. I've added a subroutine "printlocals" to display information about the distributed arrays. Here's program output, with some explanations interspersed:
KLONDIKE:baring$ aprun -n 6 ./slv_part3
 PE= 0: 6 PROW= 0: 3 PCOL= 0: 2
 PE= 1: 6 PROW= 0: 3 PCOL= 1: 2
 PE= 2: 6 PROW= 1: 3 PCOL= 0: 2
 PE= 3: 6 PROW= 1: 3 PCOL= 1: 2
 PE= 4: 6 PROW= 2: 3 PCOL= 0: 2
 PE= 5: 6 PROW= 2: 3 PCOL= 1: 2
The above output is unchanged from last week. In this run we have 6 processors, arranged in a 3x2 virtual processor grid. Each process is reporting its position in the processor grid.
 DISTRIBUTION OF ARRAY: A Global dimension: 13 : 13
proc: 0 grid position: 0, 0 blksz: 4 numroc: 5: 8
proc: 1 grid position: 0, 1 blksz: 4 numroc: 5: 5
proc: 2 grid position: 1, 0 blksz: 4 numroc: 4: 8
proc: 3 grid position: 1, 1 blksz: 4 numroc: 4: 5
proc: 4 grid position: 2, 0 blksz: 4 numroc: 4: 8
proc: 5 grid position: 2, 1 blksz: 4 numroc: 4: 5
The next block of output, above, shows that the global dimensions of array A are (13 rows : 13 columns). Each process also tells us that the block size is 4 and the total number of rows and columns in its local portion of the array. For instance, proc (0,0) has 5 rows and 8 columns of "A."
 DISTRIBUTION OF ARRAY: C Global dimension: 13 : 1
proc: 0 grid position: 0, 0 blksz: 4 numroc: 5: 1
proc: 1 grid position: 0, 1 blksz: 4 numroc: 5: 0
proc: 2 grid position: 1, 0 blksz: 4 numroc: 4: 1
proc: 3 grid position: 1, 1 blksz: 4 numroc: 4: 0
proc: 4 grid position: 2, 0 blksz: 4 numroc: 4: 1
proc: 5 grid position: 2, 1 blksz: 4 numroc: 4: 0
The final block of output shows the distribution of the 13x1, right hand side, array. The 3 processors in the processor grid's second column will each store 0 elements of this array.
A little more ASCII art might help explain this distribution. First, here's a 13x13 array:
00 01 02 03 04 05 06 07 08 09 0A 0B 0C
10 11 12 13 14 15 16 17 18 19 1A 1B 1C
20 21 22 23 24 25 26 27 28 29 2A 2B 2C
30 31 32 33 34 35 36 37 38 39 3A 3B 3C
40 41 42 43 44 45 46 47 48 49 4A 4B 4C
50 51 52 53 54 55 56 57 58 59 5A 5B 5C
60 61 62 63 64 65 66 67 68 69 6A 6B 6C
70 71 72 73 74 75 76 77 78 79 7A 7B 7C
80 81 82 83 84 85 86 87 88 89 8A 8B 8C
90 91 92 93 94 95 96 97 98 99 9A 9B 9C
A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 CA CB CC
Given the block size of 4x4, here is the first row of blocks into which this array would decompose:
00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

04 05 06 07
14 15 16 17
24 25 26 27
34 35 36 37

08 09 0A 0B
18 19 1A 1B
28 29 2A 2B
38 39 3A 3B

0C
1C
2C
3C
In the block-cyclic distribution, this first row of blocks is dealt out entirely to the first row of processors in the processor grid. Thus, given our 3x2 grid of processors, these four blocks would be assigned as follows:
--proc (0,0):
00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33
--proc (0,1):
04 05 06 07
14 15 16 17
24 25 26 27
34 35 36 37
--proc (0,0):
08 09 0A 0B
18 19 1A 1B
28 29 2A 2B
38 39 3A 3B
--proc (0,1):
0C
1C
2C
3C
This distribution of the first row of blocks can be redrawn as:
--proc (0,0):
00 01 02 03 08 09 0A 0B
10 11 12 13 18 19 1A 1B
20 21 22 23 28 29 2A 2B
30 31 32 33 38 39 3A 3B
--proc (0,1):
04 05 06 07 0C
14 15 16 17 1C
24 25 26 27 2C
34 35 36 37 3C
We see that proc (0,0) gets 8 of the 13 total columns of the array, while proc (0,1) gets 5. Re-examining the information dumped by the test code for these two processors, we see this:
proc: 0 grid position: 0, 0 blksz: 4 numroc: 5: 8
proc: 1 grid position: 0, 1 blksz: 4 numroc: 5: 5
The number of columns is the last value on these two lines, and, yes, processor (0,0) indeed reports 8 columns and processor (0,1) reports 5. We could perform a similar analysis on the first column of blocks to see how it's dealt out to the first column of processors. We could do the same thing for the entire array.
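The dealing of columns can also be written as a pair of index formulas. The sketch below assumes 0-based column numbers (as in the diagrams), a block size of 4, two process columns, and the first block on process column 0. For each global column it prints the owning process column and the column's position in that process's local array. The names are illustrative only.

```fortran
program map_columns
  ! Sketch only: for each global column (0-based, as in the diagrams),
  ! report the owning process column and the 0-based local column index.
  implicit none
  integer, parameter :: n = 13, nb = 4, npcol = 2
  integer :: j, pcol, jloc

  do j = 0, n - 1
    ! Blocks are dealt round-robin across the process columns.
    pcol = mod(j / nb, npcol)
    ! Columns from completed rounds of dealing, plus the offset in the block.
    jloc = (j / (nb*npcol)) * nb + mod(j, nb)
    ! The global column is printed as one hex digit to match the diagrams.
    write(*,'(a,z1,a,i0,a,i0)') 'global column ', j, &
         ' -> process column ', pcol, ', local column ', jloc
  end do
end program map_columns
```

Global columns 0-3 and 8-B land on process column 0 as local columns 0-7, while columns 4-7 and C land on process column 1 as local columns 0-4, matching the 8 and 5 columns reported above.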
If this explanation leaves you confused, be sure to check out the ScaLAPACK User Guide, the CINECA tutorials, or other resources. Here's the updated code:
program slv
!
! Test program, sets up and solves a simple system of equations
! using LAPACK routines.
!
  implicit none
  integer :: n,istat,info,i,j
  real,dimension(:,:),allocatable :: a
  real,dimension(:,:),allocatable :: c
  integer,dimension(:),allocatable :: ipiv
  parameter (n = 13)

  integer :: context, iam, pnum
  integer :: mycol, myrow, nb
  integer :: npcol, nprocs, nprow
  integer :: l_nrowsa,l_ncolsa
  integer :: l_nrowsc,l_ncolsc

  integer,parameter :: descriptor_len=9
  integer :: desca( descriptor_len )
  integer :: descc( descriptor_len )

  integer :: numroc

  interface
    subroutine say_hello ( context, &
        iam,nprocs,myrow,nprow,mycol,npcol)
      implicit none
      integer :: context
      integer :: iam,nprocs,myrow,nprow,mycol,npcol
      integer :: i,j,pnum
    end subroutine say_hello

    subroutine printlocals( context, &
        a,iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols)
      implicit none
      real :: a(:,:)
      integer :: context
      integer :: iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols
      integer :: i,j,pnum
    end subroutine printlocals
  end interface
!
! ----- Initialize the blacs. Note: processors are counted starting at 0.
!
  call blacs_pinfo( iam, nprocs )
!
! ----- Set the dimension of the 2d processors grid.
!
  call gridsetup(nprocs,nprow,npcol)
!
! ----- Initialize a single blacs context. Determine which processor I
!       am in the 2D processor grid.
!
  call blacs_get( -1, 0, context )
  call blacs_gridinit( context, 'r', nprow, npcol )
  call blacs_gridinfo( context, nprow, npcol, myrow, mycol )

  call say_hello (context, &
      iam,nprocs,myrow,nprow,mycol,npcol)
!
! ----- Calculate the blocking factor for the matrix.
!
  call blockset( nb, 64, n, nprow, npcol)
!
! ----- Distributed matrices: get num. local rows/cols. Create description.
!
  l_nrowsa = numroc(n,nb,myrow,0,nprow)
  l_ncolsa = numroc(n,nb,mycol,0,npcol)
  call descinit( desca, n, n, nb, nb, 0, 0, context, l_nrowsa, info )

  l_nrowsc = numroc(n,nb,myrow,0,nprow)
  l_ncolsc = numroc(1,nb,mycol,0,npcol)
  call descinit( descc, n, 1, nb, nb, 0, 0, context, l_nrowsc, info )
!
! ----- Allocate LHS, RHS, pivot -----
!
  allocate (a(l_nrowsa, l_ncolsa ), stat=istat)
  if (istat/=0) stop "ERR:ALLOCATE FAILS"

  allocate (c(l_nrowsc,l_ncolsc), stat=istat)
  if (istat/=0) stop "ERR:ALLOCATE FAILS"

  allocate (ipiv (n), stat=istat)
  if (istat/=0) stop "ERR:ALLOCATE FAILS"
!
! ----- Show how arrays distributed
!
  if (iam.eq.0) write(6,*) "DISTRIBUTION OF ARRAY: A", &
      " Global dimension:",n,":",n
  call printlocals ( context, &
      a,iam,nb,nprocs,myrow,mycol,l_nrowsa,l_ncolsa)

  if (iam.eq.0) write(6,*) "DISTRIBUTION OF ARRAY: C", &
      " Global dimension:",n,":",1
  call printlocals ( context, &
      c,iam,nb,nprocs,myrow,mycol,l_nrowsc,l_ncolsc)
!
! ----- Cleanup arrays -----
!
  deallocate (a,c,ipiv)
!
! ----- Exit BLACS cleanly -----
!
  call blacs_gridexit( context )
  call blacs_exit( 0 )

end program slv
!
!-----------------------------------------------------------------------
!
subroutine gridsetup(nproc,nprow,npcol)
!
! This subroutine factorizes the number of processors (nproc)
! into nprow and npcol, that are the sizes of the 2d processors mesh.
!
! Written by Carlo Cavazzoni
!
  implicit none
  integer nproc,nprow,npcol
  integer sqrtnp,i

  sqrtnp = int( sqrt( dble(nproc) ) + 1 )
  do i=1,sqrtnp
    if(mod(nproc,i).eq.0) nprow = i
  end do
  npcol = nproc/nprow

  return
end
!
!-----------------------------------------------------------------------
!
subroutine blockset( nb, nbuser, n, nprow, npcol)
!
! This subroutine tries to choose an optimal block size
! for the distributed matrix.
!
! Written by Carlo Cavazzoni, CINECA
!
  implicit none
  integer nb, n, nprow, npcol, nbuser

  nb = min ( n/nprow, n/npcol )
  if(nbuser.gt.0) then
    nb = min ( nb, nbuser )
  endif
  nb = max(nb,1)

  return
end subroutine blockset
!
!-----------------------------------------------------------------------
!
subroutine say_hello ( context, &
    iam,nprocs,myrow,nprow,mycol,npcol)
!
! Each processor identifies itself and its place in the processor grid
!
  implicit none
  integer :: context
  integer :: iam,nprocs,myrow,nprow,mycol,npcol
  integer :: i,j,pnum

  do i=0,nprocs-1
    call blacs_barrier (context, 'a')
    if (iam.eq.i) then
      write(6,100) iam,nprocs,myrow,nprow,mycol,npcol
100   format(" PE=",i2,":",i2," PROW=",i2,":",i2," PCOL=",i2,":",i2)
      call flush(6_4)
    endif
  enddo
  call blacs_barrier (context, 'a')

end subroutine say_hello
!
!-----------------------------------------------------------------------
!
subroutine printlocals( context, &
    a,iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols)
!
! Prints the local array "a", processor by processor. Only use this
! for very small arrays or sub-arrays...
!
  implicit none
  real :: a(:,:)
  integer :: context
  integer :: iam,nb,nprocs,myrow,mycol,l_nrows,l_ncols
  integer :: i,j,pnum

  call blacs_barrier (context, 'a')
  do pnum=0,nprocs-1
    if (iam .eq. pnum) then
      write (6,100) iam,myrow,mycol,nb,l_nrows,l_ncols
100   format ("proc:",i3," grid position:",i3,",",i3, &
          " blksz:",i3," numroc:",i3,":",i3)

!!!   do i=1,l_nrows
!!!     write (6,200) (a(i,j),j=1,l_ncols)
!!!200  format (20(" ",f5.1))
!!!     call flush(6_4)
!!!   enddo
    endif
    call blacs_barrier (context, 'a')
  enddo
  call flush(6_4)

end subroutine printlocals
In Part IV of this series, we'll actually populate the allocated local arrays with data. Can you feel the excitement starting to build?
NCL Upgrade Announced
The NCL developers have issued the following announcement. This isn't installed at ARSC yet. NCL users should let us know if they're interested.
> We are pleased to announce that a new version of NCL (4.2.0.a032)
> is now available for download:
>
> http://ngwww.ucar.edu/ncl/download.html
>
> This version has the long-awaited feature of being able to generate
> contours on non-uniform grids. This is probably the most significant
> functionality added to NCL since its release in 1995.
>
> For more information on the types of grids that NCL can now contour
> and to view some examples, see:
>
> http://ngwww.ucar.edu/ncl/grids/
>
> Since this new functionality has only been tested on a handful of
> grids, we are asking the user community to help us test this new
> capability. We will be happy to help you get started.
>
> This version also has several new processing and graphics functions
> and capability enhancements. To list a few:
>
> - a much faster version of the EOF routines
> - functions for writing Vis5D files
> - an enhanced GRIB reader
> - function for assigning randomly spaced data onto the nearest
>   locations of a grid with two-dimensional latitude and longitude
>   arrays
>
> For the full list of what's new, see:
> http://ngwww.ucar.edu/ncl/whatsnew.html
OpenMP 2.5 Draft Specification Released for Public Comments
Here's an announcement from www.OpenMP.org :
> The OpenMP ARB is pleased to announce the release of a draft of the 2.5
> specifications for public comment. The goal of the 2.5 effort was to
> combine the Fortran and C/C++ specifications into a single one
> and to fix inconsistencies. The language committee has done a phenomenal
> job for the last 18 months in combining the old specifications -- every
> word has been carefully reviewed and almost the entire specification has
> been rewritten.
>
> The ARB warmly welcomes any comments, corrections and suggestions you
> have for Version 2.5. Please send email to firstname.lastname@example.org. It is
> most helpful if you can refer to the page number and line number where
> appropriate.
>
> The public comment period will close on 31 January 2005.
Holiday Greetings
Happy holidays, everyone. Here's a seasonal image for you from the land of winter... (this image was submitted with the photography "Quick-Tip", below)
Quick-Tip Q & A
A:[[ Holiday fun... Share a tip on ANY SUBJECT! Anything! Teach us
  [[ something. What do you do well?

  #
  # Thanks to all for the big response!
  #
Talk less, listen more.
Frequent second hand bookshops.
Recycle if you have to, but it would be better not to use in the first place.
Here is a quick tip on photography. For best Christmas lights photos, shoot them during the twilight hours (between 2-3pm in Fairbanks, this time of year).
Review your charitable giving, of both time and money. Are you giving enough? What's important to you? This is the perfect time to reconsider reasons and goals and make a plan for the new year.
Best mittens to knit - Lincoln wool on the outside for toughness, alpaca/merino yarn on the inside. Pattern is from the Norwegian issue of Knitter's magazine: two-end knitting in worsted weight, tall (3-4 rows) duplicate stitch lining, then turn inside out.
If you are a spinner with a drum carder, try blending different fibers. Wool with dog hair, mohair, silk, or alpaca, all are really nice. The blend will have some of the best features of each fiber - soft wool gives elasticity to almost everything.
When baking cakes and cookies, people tell you to measure carefully. What is important to the texture is the moisture/flour ratio. The amount of sugar is less important and can be vastly reduced without changing anything but the sweetness.
If you are out in the cold and you are having trouble touching your pinky to your thumb (on the same hand), then it is necessary to stop and start a fire immediately. If you wait much longer, you may not have enough dexterity to use a lighter or to light a match.
Should you become lost in the wilderness, be sure to eat more than just rabbits (or hares). There is so little fat in these critters that you will actually die within a week or two. This is known as "rabbit starvation" and will start out as diarrhea, followed by a great hunger that cannot be satisfied. This leads to the eating of more and more rabbits in a vain attempt to stop the hunger. Death soon follows.
Wear gloves while cleaning rabbits, avoid rabbits with spots on their livers, and don't eat visibly sluggish rabbits. This will help to prevent the infectious disease tularemia, which many rabbits carry.
On those long sub-zero walks with your dog, put vaseline on the metal collar ring and leash snap, and Fido's tongue won't stick and freeze to the metal.
I have found that when one of my dogs eats something he shouldn't have, necessitating making him take some hydrogen peroxide to induce vomiting, it's absolutely no problem at all to get him to take the hydrogen peroxide if I mix it with some yogurt. The old method of forcing the stuff down his throat just didn't work very well, and this new way seems like a treat to my dog.
Here is a way to get rid of candle wax drips on fabrics. Place a brown paper bag over the cooled wax drippings, iron with a hot iron. The bag absorbs the wax and allows you to lift it away!
WD-40 is probably the best solvent for nasty stuff like tree sap and bubble gum. Just spray it on and rub like mad, then wash the WD-40 off with water and detergent and the crud will come out like magic.
Fleece dog booties make good mittens and "slippers" for babies. The velcro holds them on and they come in packs of four.
How to get warm: Sometimes you can bundle up with lots of clothing, but still get cold.
My tip is to get a move on! If you move around (run, jump, bend over, dance, etc.), you'll find that all those warm clothes do wonders at keeping the heat in. I find that only a minute or two's worth of exertion can get me warmed up and toasty for quite some time.
At the first hint of a sore throat: Reduce your alcohol, coffee, and sugar intake as much as possible, take high concentration zinc lozenges, get rest, and drink lots of water. The most important thing, if applicable: eliminate ALL salty snacks from your diet.
If your car gets stuck in the snow, you may be able to use your floor mats to get traction. Flip them over, and stick them in front of your tires.
Make a list of all the movies you want to see on www.imdb.com and consider joining a mail order movie house like www.netflix.com. Don't waste time in the store; browse with intelligence on the net.
How many countries are there in the world? Check out the UN; add in a few that are missing. The result? Less than 200! Make a plan to visit at least one next year.
To revive hard brown sugar, place it on a baking sheet and sprinkle with water. Set in the oven on the lowest setting for about 10 minutes. This should make it soft enough to mash it with a spoon or a rolling pin. If there are still hard pieces, repeat.
To get an avocado out of the skin easily, cut it in half and then use a spoon to scoop between the skin and the flesh.
To make pots-and-pans cleanup easier when cooking over a campfire, coat the outside (!) of the pan with dish soap. The smoke stains will wash off much more easily.
Peel bananas as apes do. Pinch the end of the banana opposite the stem and it opens easily every time.
Enter contests, you never know when you might win...
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.