ARSC HPC Users' Newsletter 255, October 4, 2002

ARSC Accepting Account Applications for the SX-6

Start at our home page:

http://www.arsc.edu/

and follow the link to the page, "Benchmarking the SX-6". You'll find a "Getting Started" guide for new users, more information on the SX-6, and account application requirements and forms.

NOTE: We will be creating accounts on the SX-6 in stages. Please apply when convenient, but realize there may be a delay before we activate your account.

Cray Bioinformatics Library Installed on Chilkoot

Version 1.0 of the Cray bioinformatics library (cbl) has been installed on chilkoot, and is available for testing.

From man: INTRO_LIBCBL(3B)

"LIBCBL routines perform low level bit manipulation and searching operations useful in the analysis of nucleotide and amino acid sequence data."

The bioinformatics library requires programming environment 3.6 (PE 3.6). Please note that although PE 3.6 has been installed and is available as module "PrgEnv.new," internal testing of PE 3.6 is not yet complete.

To use the bioinformatics library, load modules as follows:


  chilkoot$  module switch PrgEnv PrgEnv.new
  chilkoot$  module load biolib

The man pages are available at:

http://www.arsc.edu/support/manuals/LIBCBL/

Please let us know if you use biolib, as we're quite interested in your observations, difficulties, and successes.

COMCOT: Case Study of Port to the Cray SV1ex

[ Thanks to Tom Logan of ARSC for this contribution. ]

I was recently handed COMCOT, the COrnell Multi-grid COupled Tsunami model, to port to the ARSC SV1ex system, chilkoot. This multi-grid tsunami simulation model is based on the shallow water wave equations and is implemented in FORTRAN.

My first task was to create a Makefile for the code. I copied another Chilkoot FORTRAN makefile that I had and used compiler options for aggressive optimization ( -Otask3,aggress ).
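
For reference, the whole compile step boils down to a single f90 command. A minimal sketch (the executable name "comcot" is illustrative):


  chilkoot$  f90 -Otask3,aggress -o comcot comcot_v1_4.f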

Compilation was not difficult, and once I verified that the code was giving reasonable results, I moved on to a performance analysis. Using the hpm tool, I found that my initial run achieved a dismal 42 Mflops:


  Group 0: CPU seconds   :3285.91257   CP executing     : 1642956285740

  Million inst/sec (MIPS) :    148.93   Instructions     :  489377395309
  Avg. clock periods/inst :      3.36
  % CP holding issue      :     50.46   CP holding issue :  828998044929
  Inst.buffer fetches/sec :      2.79M  Inst.buf. fetches:    9164820769
  Floating adds/sec       :     23.95M  F.P. adds        :   78705961705
  Floating multiplies/sec :     16.12M  F.P. multiplies  :   52964667681
  Floating reciprocal/sec :      1.56M  F.P. reciprocals :    5118700959
  Cache hits/sec          :    103.31M  Cache hits       :  339467490952
  CPU mem. references/sec :     20.57M  CPU references   :   67600608759

  Floating ops/CPU second :     41.63M

My next step was to profile the code using perfview. To do this, I added some compiler options ( -Otask3,aggress -ef -l perf ) and re-ran the code.
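
In practice that step looks roughly like the following; the command line is a sketch (executable name illustrative), with perfview then pointed at the perf.data file that the instrumented run produces:


  chilkoot$  f90 -Otask3,aggress -ef -o comcot comcot_v1_4.f -l perf
  chilkoot$  ./comcot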

Using perfview on the resulting perf.data file, I quickly noticed that 96.5% of the time was spent in a single routine named "CONMOME"; the perfview pie chart of the percentage of time spent in each subroutine made this obvious at a glance.

Using the 'Observations' report of perfview, I determined that CONMOME was only achieving 38.4 Mflops. The observations for CONMOME included the following:


        This routine appears to be a partially vectorized code.
        WARNING: this routine does not use efficient execution paths.
        It is jumping too much and too far, which causes a very high
        rate (2.9) of instruction buffer fetches.

        Suggestions:

        * Study this routine to determine whether the large amount
          of jumping can be lessened.  It might be desirable to move
          frequently-called routines/functions into this routine.  This
          process is called 'inlining'.  The Fortran and C compilers are
          capable of performing automatic inlining.

At this point, I knew which routine in the code I needed to work on; however, additional information would be helpful. To this end, I added compiler switches to give full compilation messages, both positive and negative ( -eo -Otask3,msgs,negmsgs,aggress -Ca ), re-ran the code, and used the ftnlist command to get a complete compilation report. Looking at the compiler messages displayed to stdout (which are also listed in the output of ftnlist), I saw:


          DO 100 I = 2, IX-1       
    f90-6518 f90: TASKING CONMOME, File = comcot_v1_4.f, Line = 2815
      A loop starting at line 2815 was not tasked because it contains
      an alternate exit.
    f90-6250 f90: VECTOR CONMOME, File = comcot_v1_4.f, Line = 2815
      A loop starting at line 2815 was not vectorized for an unspecified
      reason.

          DO 100 J = 2, JY         
    f90-6518 f90: TASKING CONMOME, File = comcot_v1_4.f, Line = 2816
      A loop starting at line 2816 was not tasked because it contains
      an alternate exit.
    f90-6270 f90: VECTOR CONMOME, File = comcot_v1_4.f, Line = 2816
      A loop starting at line 2816 was not vectorized because it contains
      conditional code which is more efficient if executed in scalar mode.

          DO 200 J = 2, JY-1        !3
    f90-6294 f90: VECTOR CONMOME, File = comcot_v1_4.f, Line = 2958
      A loop starting at line 2958 was not vectorized because a better
      candidate was found at line 2959.
    f90-6403 f90: TASKING CONMOME, File = comcot_v1_4.f, Line = 2958
      A loop starting at line 2958 was tasked.

          DO 200 I = 2, IX          !3
    f90-6270 f90: VECTOR CONMOME, File = comcot_v1_4.f, Line = 2959
      A loop starting at line 2959 was not vectorized because it contains
      conditional code which is more efficient if executed in scalar mode.
    f90-6419 f90: TASKING CONMOME, File = comcot_v1_4.f, Line = 2959
      A loop starting at line 2959 was tasked as part of the loop starting
      at line 2958.

The compiler messages reinforced the perfview observations: the code was not vectorizing because too much jumping was occurring. They also pinpointed exactly where the loops occurred and hinted that the vectorization was inhibited by conditional code. Looking at the code, I found the loop at line 2815 contained:


2815      DO 100 I = 2, IX-1       
2816      DO 100 J = 2, JY         
2817     
2818 C..GUESS THE RUNUP NEVER REACH AT HZ(I,J)=ELMAX
2819      IF (HZ(I,J) .LE. ELMAX) THEN
2820 C       GOTO 100
2821        WRITE(*,*)'WARNING !!!  Maximum runup height reachs ELMAX at'
2822        write(*,'(2i10)')i,j
2823        stop
2824      ELSE
2825 C..CALCULATE X-DIRECTION LINEAR TERMS
2826         IF (HP(I,J) .LE. ELMAX) THEN
2827            P(I,J,2) = 0.0
2828            GOTO 110
2829         ENDIF
2830
2831 C..MOVING BOUNDARY
2832         IF (DZ(I,J,2) .LE. 0.0) THEN
2833            IF (DZ(I+1,J,2) .GT. 0.0 .AND.
2834     +          HZ(I,J)+Z(I+1,J,2) .GT. 0.0) THEN
2835               DD = HZ(I,J) + Z(I+1,J,2)
2836               DF = DD
2837            ELSE
2838               P(I,J,2) = 0.0
2839               GOTO 110
2840            ENDIF
2841         ELSE
2842            IF (DZ(I+1,J,2) .LE. 0.0) THEN
2843               IF (HZ(I+1,J)+Z(I,J,2) .LE. 0.0) THEN
2844                  P(I,J,2) = 0.0
2845                  GOTO 110
2846               ELSE
2847                  DD = HZ(I+1,J) + Z(I,J,2)
2848                  DF = DD
2849               ENDIF
2850            ELSE
2851               DD = DP(I,J,2)
2852               DF = DP(I,J,1)
2853            ENDIF
2854         ENDIF
.
.
.
2953      ENDIF
2954  110 CONTINUE
2955  100 CONTINUE       !7!

With the perfview observations and compiler messages in mind, it was clear that the stop statement inside the IF block starting at line 2819 provided the alternate exit that inhibited tasking. I also noticed that the IF statement at line 2843 was quadruply nested.

The first step in modifying the code was to remove the alternate exit (stop) statement from the loops. Once I verified that HZ(I,J) is not modified in the loop body, it was simple to create a separate loop to contain the if statement that had started at line 2819.

The next step was to modify the conditionals in an attempt to simplify the code. Looking at the code segment from lines 2832 to 2854, I created the following truth table:


   ------------------------------------------------------------------
   DZ(I,J,2)            <       <       <       >       >       >
   DZ(I+1,J,2)          >       >       <       <       <       >
   HZ(I,J)+Z(I+1,J,2)   >       <
   HZ(I+1,J)+Z(I,J,2)                           <       >
   ------------------------------------------------------------------
   outcome              1       2       2       2       3       4
  
   Where outcome is:
  
   1) DD = HZ(I,J) + Z(I+1,J,2)
      DF = DD
  
   2) P(I,J,2) = 0.0
      GOTO 110
     
   3) DD = HZ(I+1,J) + Z(I,J,2)
      DF = DD
     
   4) DD = DP(I,J,2)
      DF = DP(I,J,1)

Turning the truth table into an IF statement and combining it with the new separate 'stop' loop reduced the complexity of the conditionals from quadruply nested to a single flat IF-ELSEIF statement. The final transformed code segment follows:


     C..GUESS THE RUNUP NEVER REACH AT HZ(I,J)=ELMAX
           DO I=2,IX-1
           DO J=2,JY
             IF (HZ(I,J) .LE. ELMAX) THEN
     C       GOTO 100
               WRITE(*,*)'WARNING !!!  Maximum runup height reachs ELMAX at'
               write(*,'(2i10)')i,j
               stop
             ENDIF
           END DO
           END DO
    
     C!3  MODIFY THE DOMAIN OF COMPUTATION
           DO 100 I = 2, IX-1       
           DO 100 J = 2, JY         
          
     C..CALCULATE X-DIRECTION LINEAR TERMS
              IF (HP(I,J) .LE. ELMAX) THEN
                 P(I,J,2) = 0.0
                 GOTO 110
              ENDIF
    
     C..MOVING BOUNDARY
             IF      (DZ(I,J,2).GT.0.0 .AND. DZ(I+1,J,2).GT.0.0) THEN
                   DD = DP(I,J,2)
                   DF = DP(I,J,1)
             ELSE IF (DZ(I,J,2).GT.0.0 .AND. DZ(I+1,J,2).LE.0.0 .AND.
          &           HZ(I+1,J)+Z(I,J,2) .GT. 0) THEN
                   DD = HZ(I+1,J) + Z(I,J,2)
                   DF = DD
             ELSE IF (DZ(I,J,2).LE.0.0 .AND. DZ(I+1,J,2).GT.0.0 .AND.
          &           HZ(I,J)+Z(I+1,J,2) .GT. 0.0) THEN
                  DD = HZ(I,J) + Z(I+1,J,2)
                   DF = DD
             ELSE
                  P(I,J,2) = 0.0
                   GOTO 110
             ENDIF
     .
     .
     .
       110 CONTINUE
       100 CONTINUE       !7!
      

Without going into the details, the same form of nested conditionals occurred in the loops starting at 2958 of the original code. Thus, a nearly identical transformation was applied. The only difference was that this second set of loops did not contain an alternate exit, so no separate loop was required.

Running the modified code was a gratifying experience.


  Group 0: CPU seconds   : 429.58201    CP executing     :  214791004970

  Million inst/sec (MIPS) :     64.45    Instructions     :   27685033782
  Avg. clock periods/inst :      7.76
  % CP holding issue      :     81.35    CP holding issue :  174741062314
  Inst.buffer fetches/sec :      0.78M   Inst.buf. fetches:     334511361
  Floating adds/sec       :    192.86M   F.P. adds        :   82850712958
  Floating multiplies/sec :    126.88M   F.P. multiplies  :   54504611332
  Floating reciprocal/sec :     13.73M   F.P. reciprocals :    5898268294
  Cache hits/sec          :    174.08M   Cache hits       :   74783227950
  CPU mem. references/sec :    273.61M   CPU references   :  117539868588
    
  Floating ops/CPU second :    333.47M

The perfview pie chart for the modified version shows that it now spends only 75.3 percent of its time in CONMOME.

Also, from the 'Observations' report of perfview, I determined that CONMOME itself was now achieving 381 Mflops. The observations for CONMOME included the following:


        This routine appears to be a highly efficient vectorized code.

So the code was now performing at roughly eight times its initial rate. Additional improvement was immediately gained by using chilkoot's autotasking ability. By compiling the code with the -Otask3,aggress switch, I had already specified that autotasking would be used; all that remained was to set the environment variable NCPUS to the number of processors desired (a sample csh invocation follows the table). Here are the results for 1 - 8 processors:


  #PEs    CPU-Time    Wall Clock   Mflops/PE   Total Mflops
  -----   --------    ----------   ---------   ------------
  1       429         432          333         333
  2       479         254          299         598
  4       523         147          274         1095
  8       570         91           251.5       2012
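
For example, under csh an autotasked four-processor run requires nothing more than the following (the executable name is illustrative):


  chilkoot$  setenv NCPUS 4
  chilkoot$  ./comcot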

In summary, using the tools hpm, perfview, and ftnlist, along with compiler options for displaying positive and negative compilation messages, I was able to readily analyze COMCOT's performance, pinpoint the code's bottleneck, diagnose the problem, and receive valuable hints about what changes were required.

By slight code modification and the use of compiler options for aggressive optimization and autotasking, I was able to increase the performance of this code by a factor of 8 on a single CPU and by a factor of 48 using 8 CPUs. What started out as a 40 Mflops code now runs at just over 2 Gflops.

Hail Hail CRAY's tools and compilers!

Portable Nucleotide String Compression: Part I, Endian Enigmas

[ Thanks to Jim Long of ARSC for contributing this 2-part series. ]

This series of articles presents some issues that arise when writing portable code in the context of compressing nucleotide text strings and producing the reverse complement of such strings.

Portability across both big- and little-endian architectures is a key issue in bioinformatics. Researchers in bioinformatics are strong proponents of open source and of commodity Linux boxes (often running little-endian Intel processors), but they also perform significant work on higher-end Unix boxes (typically big-endian).

Part I -- ENDIAN ENIGMAS

Before we can discuss compression schemes, let's first look at a little code that exposes the "which-endian" issue:


#include <stdio.h>

int main(void)
{
   char c[]="acgta";
   unsigned int a[2], i;

   /* print the address of each byte of the string */
   for(i=0; i<5; i++) printf("address of c[%d] = %X\n", i, &c[i]);

   /* view the string as two 4-byte words */
   for(i=0; i<2; i++) printf("%X\n",((unsigned int *) c)[i]);

   /* shift the first word right by 8 bits and look at it again */
   a[0] = ((unsigned int *) c)[0] >> 8;

   printf("%X\n",a[0]);
   printf("%s\n",(char *) a);

   return 0;
}

The code prints the address of each byte (%X = hex) of the text string "acgta". It then casts the address of the "c" array to an unsigned int pointer and displays how the string is stored in two 4-byte words. Next we take the first 4-byte word, shift it to the right by 8 bits, store it in a[0], and look at it again. Finally we cast the address of the "a" unsigned int array containing that shifted word to a character pointer and see what string we get. On a big-endian machine we get


address of c[0] = 7FFF2F00
address of c[1] = 7FFF2F01
address of c[2] = 7FFF2F02
address of c[3] = 7FFF2F03
address of c[4] = 7FFF2F04
61636774
61002F4C
616367

As one might expect, each successive letter of the string occupies the next higher byte of memory. When we examine the string as a word of memory, we see 61636774 as the first word, "acgt", where a=0x61, c=0x63, g=0x67, and t=0x74. The second word is 61002F4C, which is an "a", the last letter of the string, followed by the string's null terminator (0x00) and whatever junk happened to be in the rest of the word (0x2F4C). The last entry, 0x00616367, is the shifted word with 0's filling in from the left. That leading zero byte looks like a null terminator, so when we ask to print the string that "(char *) a" points to, we get only a newline from the printf statement.

Now let's look at output from the same code on a little-endian machine:


address of c[0] = BFFFFAD0
address of c[1] = BFFFFAD1
address of c[2] = BFFFFAD2
address of c[3] = BFFFFAD3
address of c[4] = BFFFFAD4
74676361
40000061
746763
cgt

Again, each successive letter of the string occupies the next higher byte of memory. When we examine the string as a word of memory, however, we see the letters reversed. This is sometimes explained by saying that the little-endian scheme stores the least significant byte of an integer at the lowest address of a word and the most significant byte at the highest address, while big-endian schemes do just the opposite.

Since my computer prints the contents of a memory word (a number) on the screen in English order (from left to right), the most significant byte will always be on the left, followed by the less significant bytes to the right. For the programmer, it is therefore conceptually easier to think of big-endian machines as starting their first word of memory on the left and continuing to the right (like English), while little-endian machines start their first word of memory on the right and continue to the left (like Hebrew). On a 32-bit machine this looks like:

Big-endian:



<--------word0--------> <--------word1--------> <--etc-->

 byte0 byte1 byte2 byte3 byte4 byte5 byte6 byte7
   a     c     g     t     a    null

Little-endian:



<--etc--> <--------word1--------> <--------word0-------->

           byte7 byte6 byte5 byte4 byte3 byte2 byte1 byte0
                        null   a     t     g     c     a

With this scheme in mind, we can now see why word0 prints out the way it does, and we can interpret the rest of the code output on a little-endian machine. Shifting the first word to the right results in byte0 falling off the word, instead of byte3 falling off as in a big-endian machine. The result is what we see, 0x00746763. Now when we ask for the string pointed to by "(char *) a", we get "cgt" because the byte0 "a" fell off the word, and the 0's that filled in from the left became the null terminator.

The above example looked at something that was already in memory, placed there byte-by-byte from a text string. What happens when we want to put some value into a word ourselves? Look at the next code:


#include <stdio.h>

int main(void)
{
   unsigned int i = 0x00006100;

   printf("%X\n", i);              /* the word as assigned */
   i = i >> 8;
   printf("%X\n", i);              /* the word shifted right by 8 bits */
   printf("%c\n", *((char *)&i));  /* byte0 of the word, as a character */

   return 0;
}

The output on a big endian machine is:


6100
61

The output on a little endian machine is:


6100
61
a

What happened? Let's look at our mental picture when "i" is declared and assigned:

Big-endian:



<--------word0--------> <--etc-->

 byte0 byte1 byte2 byte3
  00    00    61    00

Little-endian:



<--etc--> <--------word0-------->

           byte3 byte2 byte1 byte0
            00    00    61    00

Both architectures print out the same thing for the integer as initially stored, and again when the integer is shifted to the right. The difference appears when the address of "i" is cast to a (char *), dereferenced, and printed as a character. The big-endian machine prints out its byte0, which is null (we get a newline from the printf statement), while the little-endian machine prints out its byte0, which is the letter "a". Code similar to this can be used in portability scenarios when it is important to determine the endianness of the machine an application is running on.
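
For example, a minimal run-time test in this spirit (just a sketch, not part of the compression code to come) stores a known value in an unsigned int and inspects its first byte through a char pointer:


#include <stdio.h>

/* Sketch: detect endianness at run time from the first byte of a
   known integer value. */
int main(void)
{
   unsigned int one = 1;

   if (*((char *) &one) == 1)
      printf("little-endian\n");   /* least significant byte is byte0 */
   else
      printf("big-endian\n");      /* most significant byte is byte0 */

   return 0;
}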

Note that I've used unsigned ints in these examples. If an int is signed and its leftmost bit is a 1, then 1's will be shifted in from the left during a right shift, and I want 0's shifted in no matter what bit pattern is in the word. 0's fill in from the right during a left shift whether the variable is signed or unsigned.
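
To see why this matters, here is a small sketch comparing the two; the exact result of the signed shift is compiler-dependent, but most machines shift in 1's as described above:


#include <stdio.h>

int main(void)
{
   unsigned int u = 0xF0000000u;        /* leftmost bit is 1 */
   int          s = (int) 0xF0000000u;  /* same bit pattern, but signed */

   printf("%X\n", u >> 8);   /* F00000: 0's shifted in from the left */
   printf("%X\n", s >> 8);   /* typically FFF00000: 1's shifted in   */

   return 0;
}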

As mentioned at the start of this article, these issues are being discussed in the context of encoding nucleotide text strings on different architectures. The nucleotide alphabet consists of only 4 characters, A, C, G, & T. When manipulating strings of these letters, it is desirable to compress the text strings such that each letter occupies only 2 bits in a word instead of 8. Next time we'll look at some compression code for 2-bit encoding of nucleotide text strings on each architecture, and how to produce the reverse complement of a nucleotide string.
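
To make the goal concrete, here is a naive sketch of 2-bit packing that ignores the endian issues entirely; the letter-to-code mapping and function name are just illustrations, and the portable treatment (plus reverse complements) is the topic of Part II:


#include <stdio.h>
#include <string.h>

/* Naive sketch only: pack up to 16 nucleotides into one 32-bit word,
   2 bits per letter (here a=00, c=01, g=10, t=11). */
unsigned int pack(const char *s)
{
   unsigned int word = 0;
   unsigned int i, n = (unsigned int) strlen(s);

   for (i = 0; i < n && i < 16; i++) {
      unsigned int code = 0;
      switch (s[i]) {
         case 'a': code = 0; break;
         case 'c': code = 1; break;
         case 'g': code = 2; break;
         case 't': code = 3; break;
      }
      word = (word << 2) | code;
   }
   return word;
}

int main(void)
{
   printf("%X\n", pack("acgt"));   /* 00 01 10 11 in binary, i.e. 1B */
   return 0;
}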

Draft Fortran 2000 Standard Available for Comment

This announcement was submitted to the Fortran 90 List, <COMP-FORTRAN-90@JISCMAIL.AC.UK> by John Reid. ARSC users are encouraged to look over the draft standard, and submit comments if interested:


> I am pleased to tell you that the draft Fortran 2000 standard is now out
> for comment. The official version is available from
> 
>   http://www.dkuug.dk/jtc1/sc22/open/n3501.pdf
> 
> The J3 (USA Fortran committee) version, which is identical except for
> the title page and the headers and footers, is available in ps, pdf,
> text, or source (latex) from
> 
>   ftp://ftp.j3-fortran.org/j3/doc/standing/007/
> 
> This is a very significant milestone for Fortran 2000. It is a major
> extension of Fortran 95 that has required a significant amount of
> development work by the J3. The main features were decided at a meeting
> of the ISO Fortran committee WG5 in 1997.

[... cut ...]

> I have written an informal description of the new features, which will
> be published in the next issue of Fortran Forum (about to appear).  It
> is also available from
> 
>   ftp://ftp.nag.co.uk/sc22wg5/N1451-N1500/N1495.pdf
> 
> It is an unofficial document written by me and has not been formally
> approved by either WG5 or J3.  If you base a comment on what I say,
> check with the draft standard in case I got it wrong.

Quick-Tip Q & A


A:[[ When I run my MPI program, some tasks start spitting error messages,
  [[ which get all mixed up together, and then it stops.
  [[
  [[ I'd like to know which message comes from which task, and, sure, I
  [[ could fix the code so every message is prefaced with the task number,
  [[ but there might be an easier way.  Is there?



# Thanks to Kevin Kennedy for this solution for AIX platforms:

  Simple, when using csh, use:
    setenv MP_STDOUTMODE ordered


# On other platforms you can use the "mpirun ... -prefix ..."
# option.  From (IRIX) "man mpirun":

    -p[refix] prefix_string
     Specifies a string to prepend to each line of output from stderr and
     stdout for each MPI process. To delimit lines of text that come from
     different hosts, output to stdout must be terminated with a new line
     character.

     Some strings have special meaning and are translated as follows:

        *   %g translates into the global rank of the
            process producing the output. (This is
            equivalent to the rank of the process in
            MPI_COMM_WORLD.)

        *   %G translates into the number of processes
            in MPI_COMM_WORLD.

        [... cut ...]

     Example:

       % mpirun -prefix "<process %g out of %G> " 4 a.out
       <process 1 out of 4> Hello world
       <process 0 out of 4> Hello world
       <process 3 out of 4> Hello world
       <process 2 out of 4> Hello world





Q: Can I see the list of file names the shell _would_ generate from a
   wildcard expansion?  Without actually _doing_ something with the
   list?  This command,

     $ ls -1 *2001*

   for instance, lists all the files and directories selected by the
   wildcard string, "*2001*", but it also descends into any directories
   selected.

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.