| Newsletter Index | Quick-Tip Index | Search Newsletters |
I hope you can learn from the following porting experience, even if you're not an SX-6 user. And if you have a porting adventure of your own, perhaps you could share it through this newsletter?
The code in question now runs at 1940 MFLOPS on a single SX-6 processor. A small test case completes in about 1/3rd the wall-clock time required for 4 multi-tasked SV1ex processors.
While auto-tasking on the SV1ex is of definite benefit, auto parallelizing on the SX-6 (for this problem size) is actually detrimental.
This is a vectorizable Fortran 90, ab initio quantum physics code with about 100 source files.
Steps in porting to the SX-6:
The "-ew" compiler option makes all numeric storage units 64 bits wide.
The SX-6 offers both 32- and 64-bit modes, and the default is 32. On the Cray, everything is 64 bits, so starting with the same data size made sense. The SX-6 uses IEEE floating point, but in this case that didn't cause different results.
From the online Fortran90/SX Programmers Guide:
-f4
Specifies that the input source program is described in standard
free format.
On the SX-6, "-I" performs like the Cray "-p":
-I
Specifies that INCLUDE files and compiled MODULE information file
are retrieved from the specified directory prior to the current
directory.
My makefile includes the following f90 option, where "STARTDIR" is set to the root directory of the build and the subdirectory "MODS" is where the module files lie:
-I${STARTDIR}/MODS
Codes compiled with "-ew" must be linked with 64-bit libraries. In this case, that meant the math libraries, but it also applies, for instance, to the MPI libraries (e.g., "-lmpiw" where the "w" corresponds to the "w" in "-ew").
On my first attempt to compile, f90/SX issued 6 helpful error messages, like this:
f90: error(746): generics.f90, line 381:
Specific procedure "d_gemv" and "s_gemv" in same generic
interface can not be distinguished.
The "-ew" option had eliminated the differences in storage unit type between the two specific subroutines, d_gemv and s_gemv. These subroutines were accessed through the generic name, gemv. With the differences removed, the compiler was unable to disambiguate the two specific routines, which is a Fortran error.
To solve it, I introduced a new macro, "SX6", and used preprocessor directives to "comment out" the newly-duplicate subroutines, as shown here:
interface gemv
#ifndef SX6
module procedure z_gemv,c_gemv,d_gemv,s_gemv
#else
module procedure c_gemv,s_gemv
#endif
end interface
#ifndef SX6
subroutine d_gemv(tr,m,n,alph,a,lda,x,incx,beta,y,incy)
use kinds
character*1 tr
integer(kind=ikind) m,n,lda,ldb,incx,incy
double precision :: a(:,:),x(:),alph,beta
double precision :: y(:)
integer(kind=ikind) lda_lo
lda_lo=size(a,1)
call dgemv(tr,m,n,alph,a,lda_lo,x,incx,beta,y,incy)
end subroutine d_gemv
#endif
subroutine s_gemv(tr,m,n,alph,a,lda,x,incx,beta,y,incy)
use kinds
character*1 tr
integer(kind=ikind) m,n,lda,ldb,incx,incy
real :: a(:,:),x(:),alph,beta
real :: y(:)
integer(kind=ikind) lda_lo
lda_lo=size(a,1)
call sgemv(tr,m,n,alph,a,lda_lo,x,incx,beta,y,incy)
end subroutine s_gemv
After steps 1-5, above, the compiler produced an executable... which sputtered like an '83 Ford Escort I once owned, ran, and promptly crashed.
I spent time on a wild goose chase. The code uses linked lists for bookkeeping, implemented using Fortran pointers. From the location of the crash, I focussed on possible pointer problems.
Experiments with the f90/SX option:
-O overlap
(default when compilation mode -C vsafe or -C ssafe
is effective) Assumes that pointer references are
overlapped in optimization.
didn't help. I tried some other things. Eventually I compiled the problematic subroutine for debugging:
-C debug -g
This eliminated the problem in that subroutine. But it simply moved the crash to a different subroutine. After repeating this step a couple of times, I smartened up.
With everything compiled at "-C debug", it ran to completion and produced correct results. Nothing to debug! Right?
On the SX-6, basic performance data is produced by simply setting the run-time environment variable "F_PROGINF" to "DETAIL". The code ran at 445 MFLOPS.
Here's the complete F_PROGINF output for the "-C debug" version:
****** Program Information ******
Real Time (sec) : 1505.139990
User Time (sec) : 1500.737250
Sys Time (sec) : 2.480038
Vector Time (sec) : 94.229735
Inst. Count : 528581378908.
V. Inst. Count : 4805111361.
V. Element Count : 918360808316.
FLOP Count : 667538032141.
MOPS : 960.952409
MFLOPS : 444.806732
VLEN : 191.121649
V. Op. Ratio (%) : 63.680549
Memory Size (MB) : 64.031250
MIPS : 352.214473
I-Cache (sec) : 1.709976
O-Cache (sec) : 75.106489
Bank (sec) : 14.313291
Start Time (date) : 2002/10/17 13:45:58
End Time (date) : 2002/10/17 14:11:03
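The derived lines in this report appear to follow from the raw counters. The relations below are inferred from the numbers printed above, not taken from NEC documentation, so treat them as a sketch; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Raw counters copied from an F_PROGINF report (hypothetical struct). */
typedef struct {
    double user_time;     /* User Time (sec)  */
    double inst_count;    /* Inst. Count      */
    double v_inst_count;  /* V. Inst. Count   */
    double v_elem_count;  /* V. Element Count */
    double flop_count;    /* FLOP Count       */
} proginf;

/* Assumed: total operations = scalar instructions + vector elements. */
static double total_ops(const proginf *p)
{
    return p->inst_count - p->v_inst_count + p->v_elem_count;
}

/* Assumed: MFLOPS = FLOP Count / User Time / 1e6. */
static double mflops(const proginf *p)
{
    assert(p->user_time > 0.0);
    return p->flop_count / p->user_time / 1.0e6;
}

/* Assumed: MOPS = total operations / User Time / 1e6. */
static double mops(const proginf *p)
{
    assert(p->user_time > 0.0);
    return total_ops(p) / p->user_time / 1.0e6;
}

/* Assumed: VLEN = average number of elements per vector instruction. */
static double vlen(const proginf *p)
{
    assert(p->v_inst_count > 0.0);
    return p->v_elem_count / p->v_inst_count;
}

/* Assumed: V. Op. Ratio = share of operations done on vector elements. */
static double v_op_ratio(const proginf *p)
{
    return 100.0 * p->v_elem_count / total_ops(p);
}
```

Fed the "-C debug" counters, these functions give values that agree with the MFLOPS, MOPS, VLEN, and V. Op. Ratio lines to the digits shown. The low vector operation ratio is the figure to watch as the optimization level rises.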
Here are all possible "compile mode" or "-C" settings, in order of increasing optimization (or decreasing conservatism):
-C{debug|ssafe|vsafe|sopt|vopt|hopt}
(Note: "vopt" is the default. "s" means scalar, "v" means vector.)
With everything already compiled at "-C debug", I experimented by recompiling a few of the problematic subroutines (from step 6) with "ssafe". This was successful, so I recompiled everything at "ssafe".
Again, correct results. It now ran at 1231 MFLOPS. Here's the F_PROGINF output for the "-C ssafe" version:
****** Program Information ******
Real Time (sec) : 542.073901
User Time (sec) : 540.851285
Sys Time (sec) : 1.021653
Vector Time (sec) : 92.713041
Inst. Count : 161857294035.
V. Inst. Count : 4788058445.
V. Element Count : 918260158881.
FLOP Count : 666211339613.
MOPS : 1988.216400
MFLOPS : 1231.782855
VLEN : 191.781318
V. Op. Ratio (%) : 85.393384
Memory Size (MB) : 64.031250
MIPS : 299.263954
I-Cache (sec) : 0.754988
O-Cache (sec) : 61.974722
Bank (sec) : 7.211348
Start Time (date) : 2002/10/17 11:55:29
End Time (date) : 2002/10/17 12:04:31
From the SX-6 Fortran90 Programmers Guide, I discovered that each "-C" option is simply a short-cut to a set of so-called "detailed" options. (Similar to the "-O" levels on the Cray.) From a careful search of the guide, I generated this list of "detailed options," which appears to be equivalent to the setting, "-C ssafe":
-Wf"-O nochg nodarg nodiv noiodo nomove overlap nounroll" \
-Wf"-pvctl nocollapse nomatmul noouterunroll" \
-Wf"-Nv"
Again using a few of the problematic subroutines, I emulated a binary search through the above list of options, and ultimately discovered that only one of them, "nodarg," was required for correct execution. From the manual:
-O nodarg
(default when compilation mode -C ssafe is effective)
Dummy arguments are not subject to optimization.
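The search itself was done by hand, recompiling and rerunning, but the idea can be sketched in C. This is a minimal sketch assuming exactly one option is required and that a build with any subset containing it runs correctly. The callback type, the function name, and the stand-in "demo_build_ok" are all hypothetical; a real callback would rebuild with the chosen options and check the results.

```c
#include <assert.h>
#include <stddef.h>

/* The detailed options listed above for "-C ssafe", in order. */
static const char *const ssafe_options[] = {
    "nochg", "nodarg", "nodiv", "noiodo", "nomove", "overlap", "nounroll",
    "nocollapse", "nomatmul", "noouterunroll", "-Nv"
};

/* Hypothetical callback: build with options [lo, hi) enabled and all
   others at their defaults, run the test case, and return nonzero if
   the results are correct. */
typedef int (*build_ok_fn)(size_t lo, size_t hi, void *ctx);

/* Bisect the option list. Assumes exactly one option in [0, n) is
   required for correct execution. Returns that option's index. */
static size_t find_required_option(size_t n, build_ok_fn build_ok, void *ctx)
{
    assert(n > 0 && build_ok != NULL);
    size_t lo = 0, hi = n;              /* the required option is in [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (build_ok(lo, mid, ctx))     /* lower half alone is enough */
            hi = mid;
        else                            /* so it must be in the upper half */
            lo = mid;
    }
    return lo;
}

/* Stand-in for a real build-and-run: ctx points at the index of the
   one option that matters. */
static int demo_build_ok(size_t lo, size_t hi, void *ctx)
{
    size_t need = *(const size_t *)ctx;
    return lo <= need && need < hi;
}
```

With 11 options, the search needs at most four rebuild-and-run cycles instead of eleven.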
Again, correct results. This version ran at 1940 MFLOPS. Here's the F_PROGINF output for the "-O nodarg" version:
****** Program Information ******
Real Time (sec) : 328.804629
User Time (sec) : 327.538463
Sys Time (sec) : 1.217602
Vector Time (sec) : 154.557969
Inst. Count : 64776518747.
V. Inst. Count : 7854910424.
V. Element Count : 999827065397.
FLOP Count : 635471363086.
MOPS : 3226.334598
MFLOPS : 1940.142718
VLEN : 127.286883
V. Op. Ratio (%) : 94.613515
Memory Size (MB) : 64.031250
MIPS : 197.767671
I-Cache (sec) : 2.066626
O-Cache (sec) : 25.200404
Bank (sec) : 2.160127
Start Time (date) : 2002/10/17 14:37:42
End Time (date) : 2002/10/17 14:43:11
Here's the complete set of options I'm now using (the FFLAGS_1 macro is used for the free-form files):
CF= f90
DEBUG=
OPT= -Wf"-O nodarg"
LIBS= -llapack_64 -lblas_64
FFLAGS= ${OPT} ${DEBUG} -ew -I${STARTDIR}/PARAMETER\
-I${STARTDIR}/MODS
FFLAGS_1= -DSX6 ${OPT} ${DEBUG} -ew -f4 -c -I${STARTDIR}/PARAMETER\
-I${STARTDIR}/MODS
Recompiling with '-C hopt -Wf"-O nodarg"' crashed the code, so for now, I'm satisfied with the default "-C" level of "vopt".
The program compiles and runs correctly with auto-parallelization ("-P auto") and the parallel BLAS library ("-lparblas_64"), and it burns time on multiple processors, but this produces no speedup. The user of this code will attempt larger data sets, which may show that it simply needs more work per processor (note that the SX-6 vector length is 4x that of the SV1).
Another step would be a "search" through all the source files, finding those that can be safely compiled at "-C hopt".
[ Thanks to Jim Long of ARSC for contributing this 2-part series. ]
This series of articles presents some issues that arise when writing portable code in the context of compressing nucleotide text strings and producing the reverse complement of such strings.
Part II, Shifty Characters
In Part I, we looked at Endian "Enigmas" in the context of bit shifting on different architectures. We did this because we want to be able to compress nucleotide text strings, made up entirely of just the four letters A, C, G, & T, into words containing 2-bit representations of each nucleotide. Thus a 32-bit word will contain 16 nucleotides, and a 64-bit word will contain 32, in both cases a compression factor of 4.
The compression is done simply by taking the 2-bit representation for each 8-bit ascii character and shifting it into its proper position within a word. If the bit patterns are A=00, C=01, G=11, & T=10, then on a big-endian machine the string "acgta" looks like:
uncompressed representation
|<--------------word0-------------->|<--------------word1-------------->|
|byte0---|byte1---|byte2---|byte3---|byte4---|byte5---|byte6---|byte7---|
 01100001 01100011 01100111 01110100 01100001 00000000 <------8-bit ascii
    a        c        g        t        a       null
compressed representation, 4 letters/byte
|<--------------word0-------------->|<--------------word1-------------->|
|byte0---|byte1---|byte2---|byte3---|byte4---|byte5---|byte6---|byte7---|
 00011110 00000000 00000000 00000000 <------ compressed 2-bit string
 a c g t  a null padded zeros
Each 2-bit representation was shifted to the left on a big-endian machine. On a little-endian machine, we have to shift the other way:
uncompressed representation
|<--------------word1-------------->|<--------------word0-------------->|
|byte7---|byte6---|byte5---|byte4---|byte3---|byte2---|byte1---|byte0---|
8-bit ascii -----> 00000000 01100001 01110100 01100111 01100011 01100001
null a t g c a
compressed representation, 4 letters/byte
|<--------------word1-------------->|<--------------word0-------------->|
|byte7---|byte6---|byte5---|byte4---|byte3---|byte2---|byte1---|byte0---|
compressed 2-bit string ----------> 00000000 00000000 00000000 10110100
padded zeros null a t g c a
Note that in both cases, an "a" is 00, composed of the same zero bits used to pad the word. Thus when decompressing, one must know the number of letters that were compressed.
The 2-bit representations could have been found with a lookup table that translates each ascii character into its equivalent 2-bit representation, something like:
unsigned char array[256];
array['A'] = array['a'] = 0x0; /* bit pattern 00 */
array['C'] = array['c'] = 0x1; /* bit pattern 01 */
array['G'] = array['g'] = 0x3; /* bit pattern 11 */
array['T'] = array['t'] = 0x2; /* bit pattern 10 */
Doing it this way is slow, however, because it requires a table lookup for each 8-bit character. An examination of the 8-bit ascii representations for the characters reveals that the desired 2-bit patterns for each letter are already unique within each ascii representation, and are case insensitive:
A 0100 0001
a 0110 0001
C 0100 0011
c 0110 0011
G 0100 0111
g 0110 0111
T 0101 0100
t 0111 0100
^^
||
these two columns contain the desired 2-bit code
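The observation in the table above can be sketched one character at a time: the 2-bit code is bits 2..1 of the ascii byte. This is a minimal, portable sketch rather than the fast word-at-a-time approach. The function names are hypothetical, and the layout (16 letters per 32-bit word, first letter in the most significant bits) is an assumption chosen to match the big-endian diagram.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 2-bit code taken from bits 2..1 of the ascii byte:
   A/a = 00, C/c = 01, G/g = 11, T/t = 10. */
static unsigned nuc_code(unsigned char c)
{
    return (c >> 1) & 0x3u;
}

/* Pack n letters into 32-bit words, 16 letters per word, first letter
   in the two most significant bits. out must hold (n + 15) / 16 words;
   unused low bits of the last word are left as zero padding. */
static void pack_nucleotides(const char *s, size_t n, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    for (size_t w = 0; w < (n + 15) / 16; w++)
        out[w] = 0;
    for (size_t i = 0; i < n; i++)
        out[i / 16] |= (uint32_t)nuc_code((unsigned char)s[i])
                       << (30 - 2 * (i % 16));
}
```

Packing "acgta" this way gives 0x1E000000, the same bits as the big-endian compressed diagram. Because "a" and the padding are both 00, the caller still has to keep the letter count alongside the packed words.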
It is much faster to simply mask out the undesired bits and shift the desired bits to their proper location. If "unc" is an array of nucleotides in 8-bit ascii, then the following code fragment shows how to create one word of compressed data from 4 words of uncompressed on a big-endian machine, doing everything in the registers without additional load/stores:
mask shift logical "or"
========== ===== ============
compressed[0] = ( (0x06000000 & unc[0]) << 5) |
( (0x00060000 & unc[0]) << 11) |
( (0x00000600 & unc[0]) << 17) |
( (0x00000006 & unc[0]) << 23) |
((unsigned long)(0x06000000 & unc[1]) >> 3) |
( (0x00060000 & unc[1]) << 3) |
( (0x00000600 & unc[1]) << 9) |
( (0x00000006 & unc[1]) << 15) |
((unsigned long)(0x06000000 & unc[2]) >> 11) |
((unsigned long)(0x00060000 & unc[2]) >> 5) |
( (0x00000600 & unc[2]) << 1) |
( (0x00000006 & unc[2]) << 7) |
((unsigned long)(0x06000000 & unc[3]) >> 19) |
((unsigned long)(0x00060000 & unc[3]) >> 13) |
((unsigned long)(0x00000600 & unc[3]) >> 7) |
((unsigned long)(0x00000006 & unc[3]) >> 1);
Masking turns bits off, while a logical "or" turns bits on.
The equivalent code on a little-endian machine looks like:
mask shift logical "or"
========== ===== ============
compressed[0] = ((unsigned long)(0x00000006 & unc[0]) >> 1) |
((unsigned long)(0x00000600 & unc[0]) >> 7) |
((unsigned long)(0x00060000 & unc[0]) >> 13) |
((unsigned long)(0x06000000 & unc[0]) >> 19) |
( (0x00000006 & unc[1]) << 7) |
( (0x00000600 & unc[1]) << 1) |
((unsigned long)(0x00060000 & unc[1]) >> 5) |
((unsigned long)(0x06000000 & unc[1]) >> 11) |
( (0x00000006 & unc[2]) << 15) |
( (0x00000600 & unc[2]) << 9) |
( (0x00060000 & unc[2]) << 3) |
((unsigned long)(0x06000000 & unc[2]) >> 3) |
( (0x00000006 & unc[3]) << 23) |
( (0x00000600 & unc[3]) << 17) |
( (0x00060000 & unc[3]) << 11) |
( (0x06000000 & unc[3]) << 5);
Note the mirror symmetry between the two code fragments.
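The sixteen mask-and-shift terms in each fragment follow a regular pattern, which can be written as a loop to check the constants. This is only a sketch for verification, since the unrolled form is the fast one. It assumes 32-bit words, and the function names are hypothetical. For byte j of input word k, the shift works out to 5 - 8k + 6j on a big-endian load and 8k - 6j - 1 on a little-endian load, with negative values meaning a right shift.

```c
#include <assert.h>
#include <stdint.h>

/* Shift left for positive s, right (logically) for negative s. */
static uint32_t shift_by(uint32_t x, int s)
{
    assert(s > -32 && s < 32);
    return s >= 0 ? x << s : x >> -s;
}

/* Big-endian load: byte j of unc[k] occupies bits 31-8j .. 24-8j. */
static uint32_t pack16_be(const uint32_t unc[4])
{
    uint32_t out = 0;
    for (int k = 0; k < 4; k++)
        for (int j = 0; j < 4; j++)
            out |= shift_by(unc[k] & (0x06000000u >> (8 * j)),
                            5 - 8 * k + 6 * j);
    return out;
}

/* Little-endian load: byte j of unc[k] occupies bits 8j+7 .. 8j. */
static uint32_t pack16_le(const uint32_t unc[4])
{
    uint32_t out = 0;
    for (int k = 0; k < 4; k++)
        for (int j = 0; j < 4; j++)
            out |= shift_by(unc[k] & (0x00000006u << (8 * j)),
                            8 * k - 6 * j - 1);
    return out;
}
```

The mirror symmetry shows up in the formulas: the little-endian shift for (k, j) equals the big-endian shift for (3-k, 3-j).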
Decompression is a little trickier because the prefix for a "T" (0101) is different than for the other letters (0100). We can determine T-ness in a register by doing an "xor" (^: exclusive or) of the 2-bit extraction with the 2-bit representation for "T":
A^T = 00^10 = 10
C^T = 01^10 = 11
G^T = 11^10 = 01
T^T = 10^10 = 00
Only T xor T yields a false bool value. Using this fact, the code on a big-endian machine to decode the first 1/4 of a compressed string could look like:
unc[0] = (((0xC0000000 & compressed[0]) ^ 0x80000000)? /* is it a T? */
(((unsigned long)(0xC0000000 & compressed[0]) >> 5) | 0x41000000):
(0x54000000)) /* the letter T */
|
(((0x30000000 & compressed[0]) ^ 0x20000000)? /* is it a T? */
(((unsigned long)(0x30000000 & compressed[0]) >> 11) | 0x00410000):
(0x00540000)) /* the letter T */
|
(((0x0C000000 & compressed[0]) ^ 0x08000000)? /* is it a T? */
(((unsigned long)(0x0C000000 & compressed[0]) >> 17) | 0x00004100):
(0x00005400)) /* the letter T */
|
(((0x03000000 & compressed[0]) ^ 0x02000000)? /* is it a T? */
(((unsigned long)(0x03000000 & compressed[0]) >> 23) | 0x00000041):
(0x00000054)); /* the letter T */
Each successive 2-bit pair is masked out of the compressed input and xor-ed with 10. If that boolean is true, it is not a "T", and we just shift the 2-bits into their proper place and add the remaining common bits. If it is a "T", then we just return a "T" in the proper location. unc[1], unc[2], and unc[3] are similarly computed with the other 3/4 of the compressed word. For a little-endian machine, the same idea prevails with only different masking and shifting.
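The same decision can be sketched one letter at a time, which makes the T test easy to see. This is a minimal sketch assuming the first letter sits in the two most significant bits of the first 32-bit word; the function names are hypothetical, and it produces upper-case letters, as the fragment above does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Turn a 2-bit code back into an upper-case letter. The xor with 10 is
   zero only for T. For A, C, and G, the code goes back into bits 2..1
   of the common pattern 0100 0001. */
static char nuc_letter(unsigned code)
{
    assert(code <= 3u);
    return (code ^ 0x2u) ? (char)((code << 1) | 0x41u) : 'T';
}

/* Decode n letters into out, which must have room for n + 1 chars.
   The caller supplies n because padding and "A" are both 00. */
static void unpack_nucleotides(const uint32_t *in, size_t n, char *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned code = (in[i / 16] >> (30 - 2 * (i % 16))) & 0x3u;
        out[i] = nuc_letter(code);
    }
    out[n] = '\0';
}
```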
The same ideas presented above can also be used to do 4-bit and 5-bit compression, where 5-bit covers the entire alphabet.
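For the 5-bit case, one possible encoding (an assumption, since the article does not say which bits it has in mind) is the low five bits of the ascii byte, which map A/a through Z/z to 1 through 26 regardless of case. A minimal sketch with hypothetical function names, packing six letters per 32-bit word:

```c
#include <assert.h>
#include <stdint.h>

/* Low five bits of the ascii byte: 'A'/'a' -> 1 ... 'Z'/'z' -> 26. */
static unsigned code5(unsigned char c)
{
    return c & 0x1Fu;
}

/* Restore an upper-case letter by putting back the prefix 010. */
static char letter5(unsigned code)
{
    assert(code >= 1u && code <= 26u);
    return (char)(0x40u | code);
}

/* Pack six letters into one word. The first letter lands in bits
   29..25, and the top two bits of the word stay zero. */
static uint32_t pack6(const char s[6])
{
    uint32_t w = 0;
    for (int i = 0; i < 6; i++)
        w = (w << 5) | code5((unsigned char)s[i]);
    return w;
}

/* Extract letter i (0..5) from a word built by pack6. */
static char unpack6_at(uint32_t w, int i)
{
    assert(i >= 0 && i < 6);
    return letter5((w >> (25 - 5 * i)) & 0x1Fu);
}
```

Unlike the 2-bit case, no letter maps to zero here, so zero padding can be told apart from data.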
Finally, note that it is easy to produce the reverse complement of a compressed nucleotide string. Along a double helix of DNA, each nucleotide is paired with its complement, "A" with "T", and "C" with "G". A reverse complement string is the original string read backwards, replacing each letter with its complement. Reading backwards is accomplished by reversing the ordering of the 2-bit units within a word, and reversing the ordering of the words. Producing the complement is accomplished by xor-ing each 2-bit pattern with 10, incidentally the same thing we did above to determine T-ness:
A^10 = 00^10 = 10 = T
C^10 = 01^10 = 11 = G
G^10 = 11^10 = 01 = C
T^10 = 10^10 = 00 = A
Code to produce the reverse complement for a string that exactly fills its final word looks like:
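for(i=0; i<length; i++)
{
  /* reverse and complement; length is the number of words */
  /* swap adjacent 2-bit units, taking the words in reverse order */
  rc[i] = (((unsigned long)(0xCCCCCCCC & compressed[length-1-i])) >> 2) |
          ( (0x33333333 & compressed[length-1-i]) << 2);
  /* swap 4-bit halves of each byte */
  rc[i] = ( (unsigned long)(0xF0F0F0F0 & rc[i])>>4) | ((0x0F0F0F0F & rc[i])<<4);
  /* swap bytes within each 16-bit half */
  rc[i] = ( (unsigned long)(0xFF00FF00 & rc[i])>>8) | ((0x00FF00FF & rc[i])<<8);
  /* swap the 16-bit halves, then complement */
  rc[i] = (((unsigned long)(rc[i]) >> 16) | (rc[i] << 16)) ^ 0xAAAAAAAA;
}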
The final "^ 0xAAAAAAAA" complements the entire word at once after it has been reversed. Note that this code works for both big- and little-endian machines. Additional architecture-dependent shifting must be done for strings that do not exactly fill their final word.
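For a string that does not fill its final word, one way to do the extra shifting is sketched below. It assumes 32-bit words, the first letter in the two most significant bits of the first word (the big-endian diagram's layout), and zero padding in the low bits of the last word. The function names are hypothetical. After reversal the padding ends up at the front, so the whole array is shifted left to squeeze it out; a layout that fills words from the low bits would shift right instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the sixteen 2-bit units of one word, then complement them. */
static uint32_t revcomp_word(uint32_t w)
{
    w = ((w & 0xCCCCCCCCu) >> 2) | ((w & 0x33333333u) << 2);
    w = ((w & 0xF0F0F0F0u) >> 4) | ((w & 0x0F0F0F0Fu) << 4);
    w = ((w & 0xFF00FF00u) >> 8) | ((w & 0x00FF00FFu) << 8);
    w = (w >> 16) | (w << 16);
    return w ^ 0xAAAAAAAAu;
}

/* Reverse complement of nletters letters stored in nwords words.
   in and out must not overlap. */
static void revcomp_msb_first(const uint32_t *in, uint32_t *out,
                              size_t nwords, size_t nletters)
{
    assert(nwords > 0);
    assert(nletters <= 16 * nwords && nletters > 16 * (nwords - 1));
    /* two bits per padding slot: 0, 2, ..., 30 */
    unsigned shift = 2u * (unsigned)(16 * nwords - nletters);
    for (size_t i = 0; i < nwords; i++) {
        uint32_t hi = revcomp_word(in[nwords - 1 - i]);
        uint32_t lo = (i + 1 < nwords) ? revcomp_word(in[nwords - 2 - i]) : 0;
        /* drop the leading padding and pull in bits from the next word;
           the last word is filled with zeros at the low end */
        out[i] = shift ? (hi << shift) | (lo >> (32 - shift)) : hi;
    }
}
```

The complemented padding (which would otherwise read as "T") is shifted out at the front, so the result is zero padded like the input.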
Hope you have fun "coding to the metal"!
Here's the link for the IBM online publication server. It contains a plethora of publications (e.g., 3 pages for C language references alone).
http://www.elink.ibmlink.ibm.com/public/applications/publications/cgibin/pbi.cgi?CTY=US
> The 3rd International Conference on Computational Science
>
> June 2 - 4, 2003
>
> Saint Petersburg, Russian Federation
> Melbourne, Australia
>
> Computational Science is a vital part of many scientific
> investigations, affecting researchers and practitioners in the
> sciences and beyond. Due to the sheer size of many challenges in
> computational science, the use of supercomputing, parallel
> processing, and sophisticated algorithms, is inevitable.
>
> The International Conference on Computational Science 2003 (ICCS
> 2003) aims to bring together researchers and scientists from
> mathematics and computer science as basic computing disciplines,
> researchers from various application areas who are pioneering
> advanced application of computational methods to sciences such as
> physics, chemistry, life sciences, and engineering, arts and
> humanitarian fields, along with software developers and vendors, to
> discuss problems and solutions in the area, to identify new issues,
> and to shape future directions for research, as well as to help
> industrial users apply various advanced computational techniques.
>
> ICCS 2003 is the follow-up of the highly successful ICCS 2002
> conference held in Amsterdam, The Netherlands. ICCS 2003 will be
> unique, in the sense that it is a single event held at two different
> sites almost opposite to each other on the globe, that is, in
> Melbourne, Australia and Saint Petersburg, Russian Federation. The
> conference will run at the same dates at both locations, and there
> will be a single set of ICCS 2003 proceedings. You, as participant,
> as author of a paper, or as organizer of a workshop, decide to which
> location you go. In this way we hope that researchers from all over
> the world will be able to participate in the most important event on
> Computational Science in 2003.
A:[[ Can I see the list of file names the shell _would_ generate from a
  [[ wildcard expansion? Without actually _doing_ something with the
  [[ list? This command,
  [[
  [[   $ ls -1 *2001*
  [[
  [[ for instance, lists all the files and directories selected with the
  [[ wildcard string, "*2001*", but it also descends into any directories
  [[ selected.
  #
  # We don't always plan in advance, but this week we did, and neither of
  # our two respondents gave the answer we'd planned. So, you get three
  # answers. From the editors:
  #
    echo *2001*
  #
  # From Brad Chamberlain:
  #
  On my computer, you can use the command:
    ls -d *2001*
  to get a list of all files and directories that match, without
  descending into directories.
  #
  # And from Richard Griswold:
  #
  If you want to list all of the files and directories in a directory
  tree, the simplest way is to use the 'find' command:
    find . -name "*2001*"
  This finds all files and directories that contain the string "2001"
  anywhere in the name, starting at the current directory. The quotes
  around the search pattern keep the shell from expanding it before
  passing it to the find command. Check out the find manpage for more
  options.
  The zsh shell also has a feature that allows you to expand wildcards
  in an entire subdirectory. If you type
    **/*2001*
  then press tab, the shell will list all matching files and directories
  in the current directory tree. See www.zsh.org for more information.
Q: This was a tech-heavy newsletter, so how about something light for a
   change of pace? What's an example you've experienced or seen of a
   "Catch 22"? In two or three sentences only, please.
   If you'd like to submit an answer, but don't know what a "Catch 22"
   is, please read the next issue. Note that no late answers will be
   accepted. :-)
[[ Answers, Questions, and Tips Graciously Accepted ]]
Contact:
Thomas J. Baring, ARSC Web Specialist, ph: 907-450-8619
Donald Bahls, ARSC User Consultant, ph: 907-450-8674
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Send comments and questions to the current editors using this Contact Form.
Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775 | voice: 907-450-8600 | email: