ARSC HPC Users' Newsletter 309, February 11, 2005
On-Site Visit by IBM Expert
An IBM software analyst will be visiting ARSC from 2/16 - 2/18 to assist with optimization and any other software-related issues.
This is a great opportunity for local ARSC IBM users to spend time, one-on-one, with an expert. Appointments MUST be scheduled in advance, by contacting Tom Logan of ARSC (450-8624 or firstname.lastname@example.org).
Cray Pat Threads
CrayPat is a set of tools to do performance experiments on the Cray. Recently I ran into the following error when attempting to run an instrumented executable that was produced by pat_build.
  klondike 1% ./a.out.inst
  pat[FATAL]: maximum number of threads 4 exceeded
  pat[FATAL]: maximum number of threads 4 exceeded
  pat[FATAL]: maximum number of threads 4 exceeded
  pat[FATAL]: maximum number of threads 4 exceeded
  pat[FATAL]: maximum number of threads 4 exceeded

By default, the maximum number of POSIX or OpenMP threads that can be created per process is 4 for MSP-based applications and 16 for SSP applications. This value can be altered using the environment variable PAT_RT_THREAD_MAX, which is described in "man pat" along with 40+ other pat-related environment variables. In this case, setting PAT_RT_THREAD_MAX to 9 allowed the application to run.

  klondike 2% setenv PAT_RT_THREAD_MAX 9
  klondike 3% ./a.out.inst
For a brief introduction to pat see:
ScaLAPACK Intro: Part V of V
[
[ NOTE: All five articles in this series, complete test codes, as well
[ as makefiles and batch submission scripts for both the Cray X1 and
[ the IBM, are available for download as:
[
[   http://www.arsc.edu/files/arsc/news/HPCnews/misc/Intro_ScaLAPACK.tar.gz
[
Part I of this series mentioned the two steps required to parallelize a LAPACK code using ScaLAPACK:
- the processors must be arranged in a "virtual processor grid," and
- the data must be physically distributed amongst the processors in block-cyclic layout.
It took three issues of the Newsletter to complete these two steps, but we're now ready to call the ScaLAPACK solver. For reference, here are the original LAPACK calls:
      call sgetrf( n, n, a, n, ipiv, info )
      call sgetrs( 'n', n, 1, a, n, ipiv, c, n, info )
And our final ScaLAPACK equivalents:
      call psgetrf( n, n, a, 1, 1, desca, ipiv, info )
      call psgetrs( 'n', n, 1, a, 1, 1, desca, ipiv, &
                    c, 1, 1, descc, info )
As everyone should have guessed, the solution vector produced in the call to psgetrs is block-cyclically distributed, just like the input matrices. To mimic the behavior of the serial version, we have one last chore: we must reverse this distribution and reassemble the solution vector in one place so we can use it.
Fortunately, the ScaLAPACK Tools Library makes this easy with the routine pselget, the complement to pselset.
"pselGET," as the name implies, "gets" an element of an array from wherever it resides in distributed memory and copies it into local memory. As with pselset, every process must call pselget for the element in question. (Under the hood, pselget performs a BLACS broadcast, using SGEBS2D and SGEBR2D, from the process which owns the array element to all others.)
To reassemble the entire global array, every processor must call pselget for every element of the array. The sample code performs this with one new subroutine:
      subroutine get_solution (n,c,descc,solution)
      implicit none
      integer :: n
      integer :: descc(:)
      real    :: c(:,:),solution(:)
      integer :: i

      do i= 1, n
        call pselget('A',' ',solution(i),c,i,1,descc)
      enddo

      return
      end
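The index arithmetic that lets each element be found again can be sketched without BLACS or MPI. The following is a minimal illustration in Python, not the library's implementation: the helper names (owner_and_local, distribute, reassemble) are hypothetical, indices are 0-based, plain lists stand in for each process's local memory, and the values are made up. The sizes (13 elements, block size 4, 3 process rows) match the sample run in this article.

```python
# Sketch only: models where each element of a block-cyclically distributed
# vector lives, and how a loop over global indices puts it back together.
# All function names here are hypothetical, not ScaLAPACK routines.

def owner_and_local(g, nb, nprocs):
    """Map 0-based global index g to (owning process, 0-based local index)."""
    block = g // nb                          # global block that contains g
    proc = block % nprocs                    # blocks are dealt out round-robin
    local = (block // nprocs) * nb + g % nb  # position inside the owner's storage
    return proc, local


def distribute(vec, nb, nprocs):
    """Split a global vector into per-process local pieces."""
    parts = [[] for _ in range(nprocs)]
    for g, value in enumerate(vec):
        proc, _ = owner_and_local(g, nb, nprocs)
        parts[proc].append(value)            # arrives in local-index order
    return parts


def reassemble(parts, n, nb, nprocs):
    """Fetch every global element from its owner, one element at a time."""
    solution = []
    for g in range(n):
        proc, local = owner_and_local(g, nb, nprocs)
        solution.append(parts[proc][local])
    return solution


n, nb, nprow = 13, 4, 3
c_global = [float(i) for i in range(1, n + 1)]   # illustrative values only
parts = distribute(c_global, nb, nprow)
for p, piece in enumerate(parts):
    print("process row", p, "holds", piece)
print("round trip ok:", reassemble(parts, n, nb, nprow) == c_global)
```

The loop in reassemble plays the role of the do-loop in get_solution: one lookup per global element, with the owner determined purely by the block size and the number of process rows.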
The array "solution" must be allocated large enough to hold the entire global array, not just the local portion of the block-cyclic distribution. After reassembling it, the test code simply prints the solution vector:
!
! ----- Reassemble solution on all processors. Print on 0 -----
!
      call get_solution (n,c,descc,solution)

      if (iam.eq.0) then
        write (6,*) "SOLUTION: "
        do i=1,n
          write (6,'(f5.2)') solution(i)
        enddo
      endif
Here's the complete output of the final ScaLAPACK version, and, what a relief, it agrees with the result of the serial version:
% aprun -n 6 ./slv_part5 PE= 0: 6 PROW= 0: 3 PCOL= 0: 2 PE= 1: 6 PROW= 0: 3 PCOL= 1: 2 PE= 2: 6 PROW= 1: 3 PCOL= 0: 2 PE= 3: 6 PROW= 1: 3 PCOL= 1: 2 PE= 4: 6 PROW= 2: 3 PCOL= 0: 2 PE= 5: 6 PROW= 2: 3 PCOL= 1: 2 DISTRIBUTION OF ARRAY: A Global dimension: 13 : 13 proc: 0 grid position: 0, 0 blksz: 4 numroc: 5: 8 20.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 proc: 1 grid position: 0, 1 blksz: 4 numroc: 5: 5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20.0 proc: 2 grid position: 1, 0 blksz: 4 numroc: 4: 8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 proc: 3 grid position: 1, 1 blksz: 4 numroc: 4: 5 20.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0 proc: 4 grid position: 2, 0 blksz: 4 numroc: 4: 8 0.0 0.0 0.0 0.0 20.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20.0 proc: 5 grid position: 2, 1 blksz: 4 numroc: 4: 5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 DISTRIBUTION OF ARRAY: C Global dimension: 13 : 1 proc: 0 grid position: 0, 0 blksz: 4 numroc: 5: 1 7.7 15.4 23.1 30.8 100.0 proc: 1 grid position: 0, 1 blksz: 4 numroc: 5: 0 proc: 2 grid position: 1, 0 blksz: 4 numroc: 4: 1 38.5 46.2 53.8 61.5 proc: 3 grid position: 1, 1 blksz: 4 numroc: 4: 0 proc: 4 grid position: 2, 0 blksz: 4 numroc: 4: 1 69.2 76.9 84.6 92.3 proc: 5 grid position: 2, 1 blksz: 4 numroc: 4: 0 SOLUTION: 0.38 0.77 1.15 1.54 1.92 2.31 2.69 3.08 3.46 3.85 4.23 4.62 5.00 %
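The grid positions and "numroc" sizes reported for array A above can be reproduced with a few lines of arithmetic. The sketch below is written in Python so that it runs without ScaLAPACK. Its numroc function follows the logic of the ScaLAPACK tool routine of the same name but is a re-implementation for illustration only, and it assumes a row-major process grid (as the PE/PROW/PCOL lines suggest) with the first block on process row 0 and process column 0. The helper local_sizes is hypothetical.

```python
# Sketch only: reproduces the "grid position" and "numroc" figures printed
# for the 13 x 13 array A on a 3 x 2 process grid with block size 4.

def numroc(n, nb, iproc, nprocs, isrcproc=0):
    """Rows (or columns) of an n-long block-cyclic dimension held by iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs  # distance from the source process
    nblocks = n // nb                              # number of complete blocks
    count = (nblocks // nprocs) * nb               # full rounds every process receives
    extra = nblocks % nprocs                       # complete blocks left over
    if mydist < extra:
        count += nb                                # one more complete block
    elif mydist == extra:
        count += n % nb                            # the trailing partial block
    return count


def local_sizes(m, n, nb, nprow, npcol):
    """Return (pe, prow, pcol, local_rows, local_cols) for every process."""
    result = []
    for pe in range(nprow * npcol):
        prow, pcol = divmod(pe, npcol)             # assumed row-major grid ordering
        result.append((pe, prow, pcol,
                       numroc(m, nb, prow, nprow),
                       numroc(n, nb, pcol, npcol)))
    return result


for pe, prow, pcol, rows, cols in local_sizes(13, 13, 4, 3, 2):
    print(f"proc: {pe} grid position: {prow}, {pcol} blksz: 4 numroc: {rows}: {cols}")
```

With 13 rows in blocks of 4 over 3 process rows, process row 0 holds the first block plus the one leftover row (5 rows), and the other two hold 4 rows each. With 2 process columns, column 0 holds two complete blocks (8 columns) and column 1 holds one block plus the leftover column (5 columns).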
If you want to see the final version of the complete code, grab the tar file as described at the top of the article. Untar it, and look for "src/slv_part5.f90". A test with an 80000x80000 matrix (with the call to printlocals commented out, to keep the local array listings out of the standard output file) worked fine.
Platform Specific Issues:
The test code used in this series was based on an actual user code. The current version of that code requires 230GB of RAM and was optimized and run on both large HPC systems at ARSC. What follows are some lessons and observations from this experience.
First, a little more discussion of datatypes. The datatype of the matrices in the user code is double precision complex (e.g., complex(kind=8)), which brings up some possible confusion about the names of LAPACK/ScaLAPACK routines. These libraries pre-date Fortran 90 generic interfaces, so they provide a different subroutine for each datatype, coded as follows:
  s == single precision
  d == double precision
  c == single precision complex
  z == double precision complex
For instance, there are four versions of pselset:
  pselset
  pdelset
  pcelset
  pzelset
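The choice of prefix also determines how much memory a matrix needs. As a rough illustration, the Python sketch below computes the storage for one dense n x n matrix of each datatype, using the 80000x80000 test size mentioned above. The byte counts assume the usual 4-byte single and 8-byte double precision reals. The names BYTES_PER_ELEMENT and matrix_gigabytes are hypothetical, and the figures cover a single matrix only, not the workspace or other arrays a real code allocates.

```python
# Sketch only: storage for one dense n x n matrix, by LAPACK/ScaLAPACK
# datatype prefix. Assumes 4-byte single and 8-byte double precision reals;
# a complex element stores two reals.

BYTES_PER_ELEMENT = {
    "s": 4,    # single precision
    "d": 8,    # double precision
    "c": 8,    # single precision complex (2 x 4 bytes)
    "z": 16,   # double precision complex (2 x 8 bytes)
}


def matrix_gigabytes(n, prefix):
    """Gigabytes (10**9 bytes) needed for an n x n matrix of the given type."""
    return n * n * BYTES_PER_ELEMENT[prefix] / 1e9


for prefix in "sdcz":
    size = matrix_gigabytes(80000, prefix)
    print(f"{prefix}: 80000 x 80000 matrix needs {size:.1f} GB")
```

The total is spread across all processes in a block-cyclic distribution, so the per-process share shrinks as processes are added, while the global requirement stays fixed.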
Cray X1 performance and optimization:
The matrix initialization step (using pzelset) did not vectorize and was thus slow on the X1. However, the source code for pzelset and the routines on which it depends is available at netlib, and it could be rewritten to vectorize. The rewrite was a hassle, but it was a major optimization.
Performance was essentially equivalent whether the code was run in SSP or MSP mode. I.e., a run on N MSPs was as fast as a run on 4*N SSPs.
Cray gets bonus points for implementing the single and double precision versions of the solvers (e.g., psgetrf and pdgetrf) and for including the ScaLAPACK Tools Library in its implementation.
Cray gets more bonus points for Programming Environment 5.3, which includes BLACS and ScaLAPACK for all four combinations of the compiler options:
  -s default32 vs -s default64, and
  -O ssp vs -O msp

This means that if your code defines REALs to be of default REAL type, and it calls psgetrf, you can choose whether to use 32- or 64-bit REALs by compiling with -s default32 or -s default64, respectively.
As mentioned in Part I, IBM doesn't actually support LAPACK/ScaLAPACK, but instead supports ESSL/PESSL. Thus, you might have to make some adjustments to use the IBM:
The subroutine interfaces are not always the same (although for the subroutines used in these articles, they are identical).
None of the ScaLAPACK Tools routines (pselset, pdelset, pselget, etc.) referenced in these articles are supplied in PESSL. They had to be downloaded and compiled from source.
PESSL does not supply the single precision versions of the solvers, only pdgetrf/pdgetrs. (If you download the tar file, you'll see that I've converted everything to double precision, so the same codes will run under both LAPACK/ScaLAPACK and ESSL/PESSL.)
IBM gets bonus points for providing an efficient multithreaded version of ESSL (libesslsmp.a). This can be used to run a serial code (like that given in Part I of this series) on multiple processors of a shared-memory node.
For instance, on ARSC's heterogeneous IBM p690+/p655+ complex, you could use up to 32 threads on 1 p690+ node. This gives a serial ESSL code access to 256 GB of memory and 32 processors.
If there is little cost to the non-ESSL portion of the code, or if it can be parallelized using OpenMP, this can be a simple way, relative to converting it to PESSL, to dramatically reduce runtime.
IBM also gets bonus points for providing an efficient multithreaded version of PESSL (libpesslsmp.a). With PESSL-SMP, you assign work to processors using a combination of distributed memory processes and shared memory threads.
For instance, given a PESSL code, you could launch 2 distributed memory processes per 8-CPU p655+ node and have each of these run 4 shared-memory threads.
For the IBM, the most important optimization of the ScaLAPACK user code was to parallelize the matrix initialization step using trivial OpenMP directives and to use the pesslsmp library to run with 1 process X 8 threads per p655+ node.
This was effective because, as mentioned in Part IV, the data distribution method required that every process call pzelset for every element of the global array. Running with 8 threads per process gave the program access to all the processors and memory of the nodes, but reduced by a factor of 8 the number of processes. This, in turn, reduced the time required for matrix initialization by a factor of 8.
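The trade-off can be sketched with a deliberately simple model. The Python below lists the ways to split one 8-CPU node into processes and threads, and estimates the relative initialization time under two assumptions taken from the description above: every process loops over every element of the global array, and OpenMP divides that loop evenly among the process's threads. The function names are hypothetical, and the model ignores communication, memory bandwidth and threading overhead.

```python
# Sketch only: a toy cost model for the matrix initialization step.
# Assumption: each process visits all n*n global elements, and its OpenMP
# threads share that loop evenly, so time scales as 1 / threads_per_process.

def factorizations(cpus_per_node):
    """All (processes, threads_per_process) pairs that use every CPU."""
    return [(p, cpus_per_node // p)
            for p in range(1, cpus_per_node + 1)
            if cpus_per_node % p == 0]


def relative_init_time(threads_per_process):
    """Initialization time relative to running 1 thread per process."""
    return 1.0 / threads_per_process


for procs, threads in factorizations(8):
    t = relative_init_time(threads)
    print(f"{procs} process(es) x {threads} thread(s): relative init time {t:.3f}")
```

Under this model the 1 process x 8 threads layout initializes the matrix in one eighth of the time of the 8 processes x 1 thread layout, which matches the factor of 8 reported above.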
The PESSL solver routine pzgetrf was equally fast regardless of how the total processor count was factored into processes and threads.
Note that the above performance observations were based on a code calling only pzgetrf/pzgetrs. It's possible that if your code uses a richer set of routines from ScaLAPACK/PESSL, things will be different.
Book Review: Mac Annoyances
"Mac Annoyances," by John Rizzo, O'Reilly, 156 pages, $25
Rizzo describes how to fix about 200 annoying features of the Mac and its major applications. These range from adapting the Mac OS GUI to your liking to fixing a plethora of interface annoyances and bugs in the iLife apps. There are tips for performance enhancements and work-arounds for some bugs.
For example, in the section on Mail, Rizzo has tips on Apple Mail, Entourage, Eudora and AOL. He tells how to avoid long recipient lists and how to fix several of the common problems that occur in a heterogeneous environment.
For Eudora, Rizzo tells how to delete mailboxes and gives a trick for re-ordering them: To delete a mailbox you click on Window > Mailboxes. Select the mailbox you want to delete; tap the delete key and confirm the delete. To put your mail file about Zorn's Lemma at the beginning of the list of mailboxes, rename it with a blank in front of the Z.
Rizzo describes a freeware program for creating an internal thermometer for PowerBooks. This may help reduce fan noise, which increases battery life, and it can be used to help reduce case temperature, an important ergonomic feature for some laptop users.
There are a couple dozen hints and tips in the area of the most annoying of Mac apps, Microsoft Office. For example, Rizzo describes how to stop hyperventilating over links--you must turn off hyperlinking in two places to actually turn it off. He also has several tips on spellchecking and autocorrection and how to reset them when they only seem to want to obey Chairman Bill.
On the other hand, there are irritations that are not discussed, like how to stop the pop-ups from Retrospect and anti-virus programs and how to turn off animation. And, of course, Rizzo never discusses how to do un-Mac things like making window sizes adjustable from any edge or moving the menu bar closer to its application window.
The bottom line: If you use a Mac and are irritated by some "feature" of its interface or by something in its major applications, you should look into Mac Annoyances.
Butrovich Police Blotter
Here's the latest important announcement from the Butrovich building. If this doesn't make any sense, you need to come visit us!
  >
  > Attention,
  >
  > There is a Chevy Truck (license plate # EFD833) that appears to
  > have rolled forward onto its cord un-plugging itself.
  >
  > We wouldn't want your vehicle to freeze up in the cold!
  >
Quick-Tip Q & A
A:[[ The loopmark listings that the Cray compilers can output were
  [[ really helpful when I was optimizing my code for the X1. Do the
  [[ IBM compilers have an option which will produce a human readable
  [[ output of optimized code? I tried the compiler flag '-qlist' and
  [[ it produced a listing, but all it had was Power4 assembly code.

  #
  # Thanks to John Skinner
  #

  To generate all available information via listing options under the
  XL Fortran compilers, one can use the following set of options. In
  this case, the mpxlf wrapper program is used simply to take care of
  MPI-related library paths, includes, etc. Note the MPI source file
  arsc-test.f (program integral) is from a previous ARSC newsletter:

    mpxlf90 -O5 -qsource -qipa="list=b.lst" -qhot -qreport=hotlist \
            -qlist -qattr=full -qxref=full -o arsctest arsc-test.f

  This creates 3 files:

    - executable file "arsctest"
    - listing file "arsc-test.lst" with compiler OPTIONS SECTION,
      SOURCE SECTION, ATTRIBUTE AND CROSS REFERENCE SECTION,
      LOOP TRANSFORMATION SECTION, OBJECT SECTION
    - separate listing file "b.lst" with detailed inter-procedural
      analysis

  Note that using -qipa and -qlist together will cause IPA to overwrite
  the default compiler listing file, so use the "list=filename"
  suboption of -qipa to generate a different listing file.

  Basic details on these options and their sub-options are available
  under the xlf manpage, but really detailed information appears in the
  online XL Fortran and C/C++ Compiler manuals available from IBM at:

    http://publib.boulder.ibm.com/

  #
  # Editor's Note: program integral was found in issue 283
  #

Q: To keep track of versions of files, I often apply the current date
   as a suffix or prefix to the filename, and I always use the format
   YYYYMMDD so the versions will sort in order. In other words, I often
   find myself figuring out what day it is and then typing the YYYYMMDD
   string... like this for example:

     $ cp valentine.jpg valentines.jpg.20050211
     $ cp bad_ideas bad_ideas.20050211

   Is there some Unix alias or something that could do this for me?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven
ARSC HPC Specialist
ph: 907-450-8669

Kate Hedstrom
ARSC Oceanographic Specialist
ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020

Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.