ARSC HPC Users' Newsletter 256, Oct 18, 2002

Case Study: Porting a Fortran code to the SX-6

I hope you can learn from the following porting experience, even if you're not an SX-6 user. And if you have a porting adventure of your own, perhaps you could share it through this newsletter?

The code in question now runs at 1940 MFLOPS on a single SX-6 processor. A small test case completes in about one third of the wall-clock time required on 4 multi-tasked SV1ex processors.

While auto-tasking on the SV1ex is of definite benefit, auto parallelizing on the SX-6 (for this problem size) is actually detrimental.

This is a vectorizable Fortran 90, ab initio quantum physics code with about 100 source files.

Steps in porting to the SX-6:

  1. Compile everything with "-ew"

    This makes all numeric storage units 64 bits wide.

    The SX-6 offers both 32- and 64-bit modes, and the default is 32. On the Cray, everything is 64 bits, so starting with the same data size made sense. The SX-6 uses IEEE floating point, but in this case that didn't cause different results.

  2. Compile free-form files with -f4

    From the online Fortran90/SX Programmers Guide:

    
        -f4
           Specifies that the input source program is described in standard
           free format.
    
  3. Compile with "-I" to locate MODULE information files.

    On the SX-6, "-I" performs like the Cray "-p <module_site>" option. From the manual:

    
        -I 
           Specifies that INCLUDE files and compiled MODULE information file
           are retrieved from the specified directory prior to the current
           directory.
    

    My makefile includes the following f90 option, where "STARTDIR" is set to the root directory of the build and the subdirectory "MODS" is where the module files lie:

    
      -I${STARTDIR}/MODS
    
  4. Link with "-llapack_64" and "-lblas_64"

    Codes compiled with "-ew" must be linked with 64-bit libraries. In this case, this meant the math libraries, but it also applies, for instance, to the MPI libraries (e.g., "-lmpiw" where the "w" corresponds to the "w" in "-ew").

  5. Fix "-ew" problems in generic subroutines

    On my first attempt to compile, f90/SX issued 6 helpful error messages, like this:

    
        f90: error(746): generics.f90, line 381:
            Specific procedure "d_gemv" and "s_gemv" in same generic
            interface can not be distinguished.
    

    The "-ew" option had eliminated the differences in storage unit type between the two specific subroutines, d_gemv and s_gemv. These subroutines were accessed through the generic name, gemv. With the differences removed, the compiler was unable to disambiguate the two specific routines, which is a Fortran error.

    To solve it, I introduced a new macro, "SX6", and used preprocessor directives to "comment out" the newly-duplicate subroutines, as shown here:

    
          interface gemv 
    #ifndef SX6
             module  procedure z_gemv,c_gemv,d_gemv,s_gemv
    #else
             module  procedure c_gemv,s_gemv
    #endif
          end interface
    
    
    
    #ifndef SX6
             subroutine d_gemv(tr,m,n,alph,a,lda,x,incx,beta,y,incy)
             use kinds
             character*1 tr
             integer(kind=ikind) m,n,lda,ldb,incx,incy
             double precision :: a(:,:),x(:),alph,beta
             double precision :: y(:)
    
             integer(kind=ikind) lda_lo
    
    
             lda_lo=size(a,1)
             call dgemv(tr,m,n,alph,a,lda_lo,x,incx,beta,y,incy)
             end subroutine d_gemv
    #endif
    
             subroutine s_gemv(tr,m,n,alph,a,lda,x,incx,beta,y,incy)
             use kinds
             character*1 tr
             integer(kind=ikind) m,n,lda,ldb,incx,incy
             real :: a(:,:),x(:),alph,beta
             real :: y(:)
    
             integer(kind=ikind) lda_lo
    
    
             lda_lo=size(a,1)
             call sgemv(tr,m,n,alph,a,lda_lo,x,incx,beta,y,incy)
             end subroutine s_gemv
    
  6. First run of the program! Followed by Wild Goose Chase.

    After steps 1-5, above, the compiler produced an executable... which sputtered like an '83 Ford Escort I once owned, ran, and promptly crashed.

    I spent time on a wild goose chase. The code uses linked lists for bookkeeping, implemented using Fortran pointers. From the location of the crash, I focussed on possible pointer problems.

    Experiments with the f90/SX option:

    
       -O overlap
          (default when compilation mode -C vsafe or -C ssafe
          is effective) Assumes that pointer references are
          overlapped in optimization.
    

    didn't help. I tried some other things. Eventually I compiled the problematic subroutine for debugging:

    
       -C debug -g
    

    This eliminated the problem in that subroutine. But it simply moved the crash to a different subroutine. After repeating this step a couple of times, I smartened up.

  7. Compile entire program with "-C debug".

    Now it ran to completion, and produced correct results. Nothing to debug! Right?

    On the SX-6, basic performance data is produced by simply setting the run-time environment variable "F_PROGINF" to "DETAIL". The code ran at 445 MFLOPS.

    Here's the complete F_PROGINF output for the "-C debug" version:

    
         ******  Program Information  ******
      Real Time (sec)       :       1505.139990
      User Time (sec)       :       1500.737250
      Sys  Time (sec)       :          2.480038
      Vector Time (sec)     :         94.229735
      Inst. Count           :      528581378908.
      V. Inst. Count        :        4805111361.
      V. Element Count      :      918360808316.
      FLOP Count            :      667538032141.
      MOPS                  :        960.952409
      MFLOPS                :        444.806732
      VLEN                  :        191.121649
      V. Op. Ratio (%)      :         63.680549
      Memory Size (MB)      :         64.031250
      MIPS                  :        352.214473
      I-Cache (sec)         :          1.709976
      O-Cache (sec)         :         75.106489
      Bank (sec)            :         14.313291
    
      Start Time (date)  :  2002/10/17 13:45:58
      End   Time (date)  :  2002/10/17 14:11:03
    
  8. Compile entire program with "-C ssafe".

    Here are all possible "compile mode" or "-C" settings, in order of increasing optimization (or decreasing conservatism):

    
      -C{debug|ssafe|vsafe|sopt|vopt|hopt}
    

    (Note: "vopt" is the default. "s" means scalar, "v" means vector.)

    With everything already compiled at "-C debug", I experimented by recompiling a few of the problematic subroutines (from step 6) with "ssafe". This was successful, so I recompiled everything at "ssafe".

    Again, correct results. It now ran at 1231 MFLOPS. Here's the F_PROGINF output for the "-C ssafe" version:

    
         ******  Program Information  ******
      Real Time (sec)       :        542.073901
      User Time (sec)       :        540.851285
      Sys  Time (sec)       :          1.021653
      Vector Time (sec)     :         92.713041
      Inst. Count           :      161857294035.
      V. Inst. Count        :        4788058445.
      V. Element Count      :      918260158881.
      FLOP Count            :      666211339613.
      MOPS                  :       1988.216400
      MFLOPS                :       1231.782855
      VLEN                  :        191.781318
      V. Op. Ratio (%)      :         85.393384
      Memory Size (MB)      :         64.031250
      MIPS                  :        299.263954
      I-Cache (sec)         :          0.754988
      O-Cache (sec)         :         61.974722
      Bank (sec)            :          7.211348
     
      Start Time (date)  :  2002/10/17 11:55:29
      End   Time (date)  :  2002/10/17 12:04:31
    
  9. Isolate the component(s) of "ssafe" required for correct execution.

    From the SX-6 Fortran90 Programmers Guide, I discovered that each "-C" option is simply a short-cut to a set of so-called "detailed" options. (Similar to the "-O" levels on the Cray.) From a careful search of the guide, I generated this list of "detailed options," which appears to be equivalent to the setting, "-C ssafe":

    
         -Wf"-O nochg nodarg nodiv noiodo nomove overlap nounroll"  \
         -Wf"-pvctl nocollapse nomatmul noouterunroll"  \
         -Wf"-Nv"
    

    Again using a few of the problematic subroutines, I emulated a binary search through the above list of options, and ultimately discovered that only one of them, "nodarg," was required for correct execution. From the manual:

    
       -O nodarg
          (default when compilation mode -C ssafe is effective)
          Dummy arguments are not subject to optimization.
    
  10. Recompile everything with '-Wf"-O nodarg"':

    Again, correct results. This version ran at 1940 MFLOPS. Here's the F_PROGINF output for the "-O nodarg" version:

    
         ******  Program Information  ******
      Real Time (sec)       :        328.804629
      User Time (sec)       :        327.538463
      Sys  Time (sec)       :          1.217602
      Vector Time (sec)     :        154.557969
      Inst. Count           :       64776518747.
      V. Inst. Count        :        7854910424.
      V. Element Count      :      999827065397.
      FLOP Count            :      635471363086.
      MOPS                  :       3226.334598
      MFLOPS                :       1940.142718
      VLEN                  :        127.286883
      V. Op. Ratio (%)      :         94.613515
      Memory Size (MB)      :         64.031250
      MIPS                  :        197.767671
      I-Cache (sec)         :          2.066626
      O-Cache (sec)         :         25.200404
      Bank (sec)            :          2.160127
     
      Start Time (date)  :  2002/10/17 14:37:42
      End   Time (date)  :  2002/10/17 14:43:11
    

    Here's the complete set of options I'm now using (the FFLAGS_1 macro is used for the free-form files):

    
      CF=     f90
      DEBUG=
      OPT=    -Wf"-O nodarg" 
      LIBS=   -llapack_64 -lblas_64 
      
      FFLAGS= ${OPT} ${DEBUG} -ew -I${STARTDIR}/PARAMETER\
                             -I${STARTDIR}/MODS
      
      FFLAGS_1= -DSX6  ${OPT} ${DEBUG} -ew -f4 -c -I${STARTDIR}/PARAMETER\
                             -I${STARTDIR}/MODS
    
  11. Attempt higher levels of optimization and parallelization:

    Recompiling with '-C hopt -Wf"-O nodarg"' crashed the code, so for now, I'm satisfied with the default "-C" level of "vopt".

    The program compiles and runs correctly with auto parallelization "-P auto" and the parallel blas library "-lparblas_64" and burns time on multiple processors. But this produces no speedup. The user of this code will attempt larger data sets, which may prove that it simply needs more work (note that the SX-6 vector length is 4x that of the SV1).

    Another step would be a "search" through all the source files, finding those that can be safely compiled at "-C hopt".
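
The derived figures in the F_PROGINF reports above can be reproduced from the raw counters. The following C sketch is not taken from the NEC documentation: the formulas are assumptions inferred from the three reports shown, with which they agree to the printed precision, and the names "proginf", "mflops", "vlen" and "v_op_ratio" are made up for illustration.

```c
/* Raw counters copied by hand from one F_PROGINF report. */
struct proginf {
    double user_time;   /* User Time (sec)  */
    double inst;        /* Inst. Count      */
    double v_inst;      /* V. Inst. Count   */
    double v_elem;      /* V. Element Count */
    double flop;        /* FLOP Count       */
};

/* MFLOPS: floating-point operations per second of user time, in millions. */
double mflops(const struct proginf *p)
{
    return p->flop / p->user_time / 1.0e6;
}

/* VLEN: average number of elements processed per vector instruction. */
double vlen(const struct proginf *p)
{
    return p->v_elem / p->v_inst;
}

/* V. Op. Ratio (%): vector element operations as a share of all
   operations, where "all" is assumed to mean scalar instructions
   plus vector elements. */
double v_op_ratio(const struct proginf *p)
{
    double scalar = p->inst - p->v_inst;
    return 100.0 * p->v_elem / (scalar + p->v_elem);
}
```

Fed the counters from the "-O nodarg" run, these give about 1940 MFLOPS, a VLEN of about 127.3 and a vector operation ratio of about 94.6%, matching the report. Comparing user times (1500.7 sec for "-C debug" against 327.5 sec for "-O nodarg") puts the overall gain at roughly 4.6x.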

Portable Nucleotide String Compression: Part II, Shifty Characters

[ Thanks to Jim Long of ARSC for contributing this 2-part series. ]

This series of articles presents some issues that arise when writing portable code in the context of compressing nucleotide text strings and producing the reverse complement of such strings.

Part II, Shifty Characters

In Part I, we looked at Endian "Enigmas" in the context of bit shifting on different architectures. We did this because we want to be able to compress nucleotide text strings, made up entirely of just the four letters A, C, G, & T, into words containing 2-bit representations of each nucleotide. Thus a 32-bit word will contain 16 nucleotides, and a 64-bit word will contain 32, in both cases a compression factor of 4.

The compression is done simply by taking the 2-bit representation for each 8-bit ascii character and shifting it into its proper position within a word. If the bit patterns are A=00, C=01, G=11, & T=10, then on a big-endian machine the string "acgta" looks like:

uncompressed representation

<--------------word0--------------> <--------------word1-------------->
byte0--- byte1--- byte2--- byte3--- byte4--- byte5--- byte6--- byte7---
01100001 01100011 01100111 01110100 01100001 00000000 <------8-bit ascii
a        c        g        t        a        null

compressed representation, 4 letters/byte

<--------------word0--------------> <--------------word1-------------->
byte0--- byte1--- byte2--- byte3--- byte4--- byte5--- byte6--- byte7---
00011110 00000000 00000000 00000000   <------ compressed 2-bit string
a c g t  a null     padded zeros

Each 2-bit representation was shifted to the left on a big-endian machine. On a little-endian machine, we have to shift the other way:

uncompressed representation

<--------------word1--------------> <--------------word0-------------->
byte7--- byte6--- byte5--- byte4--- byte3--- byte2--- byte1--- byte0---
8-bit ascii ----> 00000000 01100001 01110100 01100111 01100011 01100001
                  null     a        t        g        c        a

compressed representation, 4 letters/byte

<--------------word1--------------> <--------------word0-------------->
byte7--- byte6--- byte5--- byte4--- byte3--- byte2--- byte1--- byte0---
compressed 2-bit string ----------> 00000000 00000000 00000000 10110100
                                      padded zeros    null  a  t g c a

Note that in both cases, an "a" is 00, composed of the same zero bits used to pad the word. Thus when decompressing, one must know the number of letters that were compressed.

The 2-bit representations could have been found with a lookup table that translates each ascii character into its equivalent 2-bit representation, something like:


unsigned char array[256];       /* one entry per 8-bit character */

array['A'] = array['a'] = 0x0;  /* bit pattern 00 */
array['C'] = array['c'] = 0x1;  /* bit pattern 01 */
array['G'] = array['g'] = 0x3;  /* bit pattern 11 */
array['T'] = array['t'] = 0x2;  /* bit pattern 10 */

Doing it this way is slow, however, because it requires a table lookup for each 8-bit character. An examination of the 8-bit ascii representations for the characters reveals that the desired 2-bit patterns for each letter are already unique within each ascii representation, and are case insensitive:


A 0100 0001
a 0110 0001
C 0100 0011
c 0110 0011
G 0100 0111
g 0110 0111
T 0101 0100
t 0111 0100
        ^^
        these two columns contain the desired 2-bit code
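
As a quick check of the table above, here is a minimal C sketch that pulls the two marked bits out of a single character. The function name "nuc_code" is made up for illustration, and the sketch assumes the input really is one of the eight letters shown: any other character is silently mapped to some 2-bit value as well.

```c
/* Extract the 2-bit nucleotide code from an ASCII letter.
   Bits 2 and 1 (mask 0x06) already hold A=00, C=01, G=11, T=10,
   for upper- and lower-case letters alike. */
unsigned nuc_code(char c)
{
    return ((unsigned char)c & 0x06u) >> 1;
}
```

For example, nuc_code('G') and nuc_code('g') both give 3 (bit pattern 11), with no table and no memory access.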

It is much faster to simply mask out the undesired bits and shift the desired bits to their proper location. If "unc" is an array of nucleotides in 8-bit ascii, then the following code fragment shows how to create one word of compressed data from 4 words of uncompressed on a big-endian machine, doing everything in the registers without additional load/stores:


                                     
                                    mask              shift  logical "or"
                                 ==========           =====  ============
compressed[0] = (               (0x06000000 & unc[0]) <<  5)   |
                (               (0x00060000 & unc[0]) << 11)   |
                (               (0x00000600 & unc[0]) << 17)   |
                (               (0x00000006 & unc[0]) << 23)   |
                ((unsigned long)(0x06000000 & unc[1]) >>  3)   |
                (               (0x00060000 & unc[1]) <<  3)   |
                (               (0x00000600 & unc[1]) <<  9)   |
                (               (0x00000006 & unc[1]) << 15)   |
                ((unsigned long)(0x06000000 & unc[2]) >> 11)   |
                ((unsigned long)(0x00060000 & unc[2]) >>  5)   |
                (               (0x00000600 & unc[2]) <<  1)   |
                (               (0x00000006 & unc[2]) <<  7)   |
                ((unsigned long)(0x06000000 & unc[3]) >> 19)   |
                ((unsigned long)(0x00060000 & unc[3]) >> 13)   |
                ((unsigned long)(0x00000600 & unc[3]) >>  7)   |
                ((unsigned long)(0x00000006 & unc[3]) >>  1);

Masking (&) turns bits off, while a logical "or" (|) turns bits on.

The equivalent code on a little-endian machine looks like:


                                    mask              shift  logical "or"
                                 ==========           =====  ============
compressed[0] = ((unsigned long)(0x00000006 & unc[0]) >>  1)   |
                ((unsigned long)(0x00000600 & unc[0]) >>  7)   |
                ((unsigned long)(0x00060000 & unc[0]) >> 13)   |
                ((unsigned long)(0x06000000 & unc[0]) >> 19)   |
                (               (0x00000006 & unc[1]) <<  7)   |
                (               (0x00000600 & unc[1]) <<  1)   |
                ((unsigned long)(0x00060000 & unc[1]) >>  5)   |
                ((unsigned long)(0x06000000 & unc[1]) >> 11)   |
                (               (0x00000006 & unc[2]) << 15)   |
                (               (0x00000600 & unc[2]) <<  9)   |
                (               (0x00060000 & unc[2]) <<  3)   |
                ((unsigned long)(0x06000000 & unc[2]) >>  3)   |
                (               (0x00000006 & unc[3]) << 23)   |
                (               (0x00000600 & unc[3]) << 17)   |
                (               (0x00060000 & unc[3]) << 11)   |
                (               (0x06000000 & unc[3]) <<  5);

Note the mirror symmetry between the two code fragments.
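
For checking either word-at-a-time fragment, a slower character-at-a-time packer can serve as a reference. The sketch below is an assumption-laden illustration, not the author's code: the name "pack16" is invented, it uses the C99 type uint32_t, and it places the first letter in the two most significant bits of the result, so the returned value corresponds to the big-endian picture shown earlier. Because it reads one char at a time, the value it returns does not depend on the byte order of the machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack up to 16 nucleotides from s into one 32-bit word, with the
   first letter in the two most significant bits.  Unused positions
   are left as 00.  Letters other than A, C, G, T are not detected. */
uint32_t pack16(const char *s, size_t n)
{
    uint32_t w = 0;
    size_t i;

    if (n > 16)
        n = 16;
    for (i = 0; i < n; i++) {
        /* mask out bits 2 and 1, then move them down to bits 1 and 0 */
        uint32_t code = ((unsigned char)s[i] & 0x06u) >> 1;
        w |= code << (30 - 2 * i);
    }
    return w;
}
```

Calling pack16("acgta", 5) yields 0x1E000000, whose top byte is the 00011110 of the compressed big-endian diagram.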

Decompression is a little trickier because the prefix for a "T" (0101) is different than for the other letters (0100). We can determine T-ness in a register by doing an "xor" (^: exclusive or) of the 2-bit extraction with the 2-bit representation for "T":


 
A^T = 00^10 = 10 
C^T = 01^10 = 11 
G^T = 11^10 = 01 
T^T = 10^10 = 00

Only T xor T yields a false bool value. Using this fact, the code on a big-endian machine to decode the first 1/4 of a compressed string could look like:


unc[0] = (((0xC0000000 & compressed[0]) ^ 0x80000000)?      /* is it a T? */
          (((unsigned long)(0xC0000000 & compressed[0]) >>  5) | 0x41000000):
           (0x54000000))  /* the letter T */
         |
         (((0x30000000 & compressed[0]) ^ 0x20000000)?      /* is it a T? */
          (((unsigned long)(0x30000000 & compressed[0]) >> 11) | 0x00410000):
           (0x00540000))  /* the letter T */
         |
         (((0x0C000000 & compressed[0]) ^ 0x08000000)?      /* is it a T? */
          (((unsigned long)(0x0C000000 & compressed[0]) >> 17) | 0x00004100):
           (0x00005400))  /* the letter T */
         |
         (((0x03000000 & compressed[0]) ^ 0x02000000)?      /* is it a T? */
          (((unsigned long)(0x03000000 & compressed[0]) >> 23) | 0x00000041):
           (0x00000054));  /* the letter T */

Each successive 2-bit pair is masked out of the compressed input and xor-ed with 10. If that boolean is true, it is not a "T", and we just shift the 2-bits into their proper place and add the remaining common bits. If it is a "T", then we just return a "T" in the proper location. unc[1], unc[2], and unc[3] are similarly computed with the other 3/4 of the compressed word. For a little-endian machine, the same idea prevails with only different masking and shifting.
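
The same decoding rule can be written one letter at a time, which is slower but easier to verify. This is a sketch under stated assumptions rather than the author's routine: "unpack16" is an invented name, the word is assumed to hold the first letter in its two most significant bits, and the caller must supply the letter count, since trailing 00 padding is indistinguishable from "A".

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack the first n (at most 16) nucleotides of w into out as
   upper-case letters.  out must have room for n + 1 chars. */
void unpack16(uint32_t w, size_t n, char *out)
{
    size_t i;

    if (n > 16)
        n = 16;
    for (i = 0; i < n; i++) {
        unsigned code = (unsigned)(w >> (30 - 2 * i)) & 0x3u;
        /* code ^ 10 is zero only for T; otherwise put the two bits
           back into bits 2 and 1 and add the common bits 0x41 */
        out[i] = (code ^ 0x2u) ? (char)(0x41u | (code << 1)) : 'T';
    }
    out[n] = '\0';
}
```

With w = 0x1E000000 and n = 5 this writes "ACGTA"; the original case of the input letters is not recoverable.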

The same ideas presented above can also be used to do 4-bit and 5-bit compression, where 5-bit covers the entire alphabet.

Finally, note that it is easy to produce the reverse complement of a compressed nucleotide string. Along a double helix of DNA, each nucleotide is paired with its complement, "A" with "T", and "C" with "G". A reverse complement string is the original string read backwards, replacing each letter with its complement. Reading backwards is accomplished by reversing the ordering of the 2-bit units within a word, and reversing the ordering of the words. Producing the complement is accomplished by xor-ing each 2-bit pattern with 10, incidentally the same thing we did above to determine T-ness:


A^10 = 00^10 = 10 = T
C^10 = 01^10 = 11 = G
G^10 = 11^10 = 01 = C
T^10 = 10^10 = 00 = A

Code to produce the reverse complement for a string that exactly fills its final word looks like:


for(i=0; i<length; i++)
{
   /* reverse and complement */
   /* swap adjacent 2-bit units, taking words in reverse order */
   rc[i] = (((unsigned long)(0xCCCCCCCC & compressed[length-1-i])) >> 2) |
           (                (0x33333333 & compressed[length-1-i])  << 2);
   /* swap 4-bit groups, then bytes, then 16-bit halves */
   rc[i] = ( (unsigned long)(0xF0F0F0F0 & rc[i])>>4) | ((0x0F0F0F0F & rc[i])<<4);
   rc[i] = ( (unsigned long)(0xFF00FF00 & rc[i])>>8) | ((0x00FF00FF & rc[i])<<8);
   rc[i] = (((unsigned long)(rc[i]) >> 16) | (rc[i] << 16)) ^ 0xAAAAAAAA;
}

The final ^ 0xAAAAAAAA complements the entire word at once after it has been reversed. Note that this code works for both big- and little-endian machines. Additional architecture-dependent shifting must be done for strings that do not exactly fill their final word.
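
A self-contained variant of the loop above, written with the fixed-width C99 type uint32_t so that the 16-bit rotate cannot spill into the upper half of a 64-bit long, might look like the following. The function names "revcomp_word" and "revcomp" are invented for this sketch, and it carries the same restriction as the original: the string must exactly fill its final word.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the sixteen 2-bit units of one word, then complement them. */
uint32_t revcomp_word(uint32_t w)
{
    w = ((w & 0xCCCCCCCCu) >> 2) | ((w & 0x33333333u) << 2);  /* swap 2-bit units  */
    w = ((w & 0xF0F0F0F0u) >> 4) | ((w & 0x0F0F0F0Fu) << 4);  /* swap 4-bit groups */
    w = ((w & 0xFF00FF00u) >> 8) | ((w & 0x00FF00FFu) << 8);  /* swap bytes        */
    w = (w >> 16) | (w << 16);                                /* swap 16-bit halves */
    return w ^ 0xAAAAAAAAu;                                   /* xor each unit with 10 */
}

/* Reverse complement of nwords full words; in and out must not overlap. */
void revcomp(const uint32_t *in, uint32_t *out, size_t nwords)
{
    size_t i;

    for (i = 0; i < nwords; i++)
        out[i] = revcomp_word(in[nwords - 1 - i]);
}
```

As a spot check, the word 0x1E000000 ("acgt" followed by twelve "a") becomes 0xAAAAAA1E (twelve "t" followed by "acgt"), and applying revcomp_word twice returns the original word.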

Hope you have fun "coding to the metal"!

IBM Online Publication Server

Here's the link for the IBM online publication server. It contains a plethora of publications (e.g., 3 pages for C language references alone).

http://www.elink.ibmlink.ibm.com/public/applications/publications/cgibin/pbi.cgi?CTY=US

The 3rd International Conference on Computational Science

From: http://www.science.uva.nl/events/ICCS2003/cfp.html

  > The 3rd International Conference on Computational Science
  >
  >       June 2 - 4, 2003
  >
  >       Saint Petersburg, Russian Federation
  >       Melbourne, Australia
  >
  > Computational Science is a vital part of many scientific
  > investigations, affecting researchers and practitioners in the
  > sciences and beyond. Due to the sheer size of many challenges in
  > computational science, the use of supercomputing, parallel
  > processing, and sophisticated algorithms, is inevitable.
  >
  > The International Conference on Computational Science 2003 (ICCS
  > 2003) aims to bring together researchers and scientists from
  > mathematics and computer science as basic computing disciplines,
  > researchers from various application areas who are pioneering
  > advanced application of computational methods to sciences such as
  > physics, chemistry, life sciences, and engineering, arts and
  > humanitarian fields, along with software developers and vendors, to
  > discuss problems and solutions in the area, to identify new issues,
  > and to shape future directions for research, as well as to help
  > industrial users apply various advanced computational techniques.
  >
  > ICCS 2003 is the follow-up of the highly successful ICCS 2002
  > conference held in Amsterdam, The Netherlands. ICCS 2003 will be
  > unique, in the sense that it is a single event held at two different
  > sites almost opposite to each other on the globe, that is, in
  > Melbourne, Australia and Saint Petersburg, Russian Federation. The
  > conference will run at the same dates at both locations, and there
  > will be a single set of ICCS 2003 proceedings. You, as participant,
  > as author of a paper, or as organizer of a workshop, decide to which
  > location you go. In this way we hope that researchers from all over
  > the world will be able to participate in the most important event on
  > Computational Science in 2003.

Quick-Tip Q & A



A:[[ Can I see the list of file names the shell _would_ generate from a
  [[ wildcard expansion?  Without actually _doing_ something with the
  [[ list?  This command,
  [[
  [[    $ ls -1 *2001*
  [[
  [[ for instance, lists all the files and directories selected with the
  [[ wildcard string, "*2001*", but it also descends into any directories
  [[ selected.


# 
# We don't always plan in advance, but this week we did, and neither of 
# our two respondents gave the answer we'd planned.  So, you get three
# answers.  From the editors:
#

  echo *2001*


#
# From Brad Chamberlain:
#

On my computer, you can use the command:

  ls -d *2001* 

to get a list of all files and directories that match, without descending
into directories.


#
# And from Richard Griswold:
#

If you want to list all of the files and directories in a directory
tree, the simplest way is to use the 'find' command:

  find . -name "*2001*"

This finds all files and directories that contain the string "2001"
anywhere in the name, starting at the current directory.  The quotes
around the search pattern keep the shell from expanding it before
passing it to the find command.  Check out the find manpage for more
options.

The zsh shell also has a feature that allows you to expand wildcards in
an entire subdirectory.  If you type

  **/*2001*

then press tab, the shell will list all matching files and directories
in the current directory tree.  See www.zsh.org for more information.




Q: This was a tech-heavy newsletter, so how about something light for a
   change of pace?

   What's an example you've experienced or seen of a "Catch 22"?  In two
   or three sentences only, please.

   If you'd like to submit an answer, but don't know what a "Catch 22"
   is, please read the next issue.  Note that no late answers will be
   accepted.  :-)

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.