ARSC HPC Users' Newsletter 287, February 27, 2004

ARSC Events

Engineering Open House:
When: Saturday, February 28 from 9am to 5pm
Where: Duckering Building, ARSC Access Lab

ARSC's Duckering access lab will be open to the public with demonstrations of ARSC research as well as virtual reality demonstrations on the ImmersaDesk.

Discovery Tuesday:
When: March 2, 1pm
Where: ARSC Discovery Lab, 375C Rasmuson Library

The subject of ARSC's monthly presentation in the Discovery Lab is "Visualizing Chemistry: Tools for the Discovery Lab". It will be presented by Roger Edberg, ARSC Visualization Specialist, and Tom Marr, President's Research Professor in Bioinformatics.

Vectorizing a Recurrence, Part II

As discussed in the last issue, the vectorization of a loop can be inhibited by a variety of factors, including recurrence. A recurrence occurs when a calculation in one iteration depends on a result computed in an earlier iteration.

Here's a reformulation of the example from the last issue, which produces a table of factorials:

      RD = 1
      F(0:RD-1) = 1
      do i = RD,N
        F(i) = F(i-RD) * i
      end do

As noted, the SX-6 has a special instruction which allows this unvectorizable loop to "vectorize."
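
To see why a recurrence normally inhibits vectorization, here is a small Python sketch. It is an illustration only and not part of the original example. It compares the scalar loop for RD==1 with a naive "all at once" evaluation, in which every F(i-1) is read before any F(i) is stored. That read-everything-then-store-everything behavior is an assumed, simplified model of a plain vector instruction.

```python
N = 6

# Scalar loop: each iteration sees the value stored by the previous one.
scalar = [1] * (N + 1)
for i in range(1, N + 1):
    scalar[i] = scalar[i - 1] * i

# Naive "all at once" evaluation (assumed model of a vector operation):
# every input F(i-1) is read before any result F(i) is stored.
naive = [1] * (N + 1)
results = [naive[i - 1] * i for i in range(1, N + 1)]
naive[1:] = results

print(scalar)  # [1, 1, 2, 6, 24, 120, 720]  (the factorials)
print(naive)   # [1, 1, 2, 3, 4, 5, 6]       (stale inputs, wrong answer)
```

The naive version only ever sees the initial value 1 in F(i-1), so it produces i instead of i factorial.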

Forgetting about the intent of the loop, what happens if "RD" (what I'm calling the "recurrence distance") is allowed to increase? With RD==2, neither the SX-6 nor X1 can vectorize it. With RD>=3, however, the X1 vectorizes the loop. This is not a big surprise if you're familiar with the concept of "safe vector length." The surprise (to me, anyway) is that, except when RD==1, the SX-6 doesn't vectorize the loop.

What is "safe vector length" and how can it be used to vectorize a recurrence?

Returning to the example, the computation of any block of "RD" elements, F(i) .. F(i+RD-1), depends only on the results already computed for the previous block. This is enumerated below for RD==5, where F'(i) indicates that element "i" has been updated by the loop, and F(i) indicates that the element has not been updated yet and retains its original value:

  F(25) = F'(20) * 25
  F(26) = F'(21) * 26
  F(27) = F'(22) * 27
  F(28) = F'(23) * 28
  F(29) = F'(24) * 29

Since F(20)..F(24) have already been computed, it appears we could process the entire block, F(25) through F(29) simultaneously and still get the correct result.

A single vector instruction on the X1 produces a block of 64 "simultaneous" results, so if we made RD==64 (or more), the loop should be able to vectorize. This is exactly what happens, but it gets even better.

The Cray compilers are able to process blocks of fewer than 64 elements, so even the loop with RD==5 vectorizes. The term "safe vector length" means the size of the block, for a given loop, that is safe to process simultaneously. The compiler tries to determine the safe vector length for a recurrence like the one above automatically. If "RD" were defined in the program as a parameter, the safe vector length for the above loop would be known at compile time.
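
The idea of a safe vector length can be modeled in a few lines of Python. This is only a sketch. The function names, the "read all inputs, then store all outputs" model of a vector instruction, and the small problem size are assumptions made for illustration. It says nothing about the code the Cray compiler actually generates. The one number taken from the article is the X1 vector register length of 64.

```python
MAX_VL = 64  # X1 vector register length, as stated in the article


def safe_vector_length(rd, max_vl=MAX_VL):
    """Largest block that is safe to process at once for distance rd."""
    return min(rd, max_vl)


def run_recurrence(rd, n, vl):
    """Evaluate F(i) = F(i-rd) * i for i = rd..n in blocks of vl elements.

    Each block reads all of its inputs before storing any result (the
    assumed model of a vector instruction). Returns (F, block_count).
    """
    f = [1] * (n + 1)
    blocks = 0
    for start in range(rd, n + 1, vl):
        idx = range(start, min(start + vl, n + 1))
        results = [f[i - rd] * i for i in idx]   # all loads and multiplies
        for i, value in zip(idx, results):       # then all stores
            f[i] = value
        blocks += 1
    return f, blocks


rd, n = 5, 40
reference, _ = run_recurrence(rd, n, vl=1)  # vl=1 is the scalar loop
safe, safe_blocks = run_recurrence(rd, n, vl=safe_vector_length(rd))
unsafe, _ = run_recurrence(rd, n, vl=rd + 1)  # one element too many

print(safe == reference)    # blocks of RD elements match the scalar loop
print(unsafe == reference)  # a longer block reads a stale F(i-RD)
print(safe_blocks)          # number of "vector instructions" issued
```

With a block one element longer than RD, the last element of the first block reads an F(i-RD) that the same block has not stored yet, so the result no longer matches the scalar loop.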

If RD were computed, or read in from a file, then it wouldn't be known until run-time. In this case, the compiler inserts instructions to compute the safe vector length at run-time. Or, alternatively, if the programmer knows in advance what the safe vector length should be, he or she can insert an "IVDEP" directive, like the following, into the code:


(I've searched the SX-6 manuals and found no concept similar to "safe vector length." I've also tested various recurrence loops that vectorize on the X1, and they don't vectorize on the SX-6. If I've missed it, let me know!)

Here's a program to see the effect of vectorizing recurrences using safe vector length:

       program veclen
       implicit none
       integer(kind=8),parameter :: RD=RECURDIST
       integer(kind=8),parameter :: N=100000000
       real(kind=8), dimension(0:N + RD) :: F
       integer(kind=8) :: i, j

       call random_number (F(0:RD))  ! F(0)..F(RD) are read-only inputs

       do j = 0, 10
         do i = RD + 1, N + RD
           F(i) = (F(i-RD) + F(i-RD-1)) / 2.0
         end do
       end do

       do i = 0,16
         write (*,"(I12,'  ', E15.7)") 2**i, F(2**i)
       end do
       end program veclen

Here's an X1 command to compile it, and set RD=3:

    ftn -Omsgs -Onegmsgs -o veclen -eZ -F -D RECURDIST=3 veclen.f 

In this command, "-Omsgs" asks the compiler to report how it optimized the loops, and "-Onegmsgs" asks it to report why it couldn't perform various optimizations.

When RECURDIST=1 or 2, the ftn compiler gives us this "negmsg" (the inner loop "do i = RD + 1, N + RD" is line 11):

      A loop starting at line 11 was not vectorized because a recurrence
      was found on "F" at line 12.

When RECURDIST=3, the compiler gives us this "msg" (it gives a similar "msg" for any RECURDIST up to 63):

      A loop starting at line 11 was vectorized with a vector
      length of 3.

When RECURDIST=64 or greater, the compiler gives us this "msg":

      A loop starting at line 11 was vectorized.

This last message tells us that the loop was unconditionally vectorized, using the maximum X1 vector length (the size of the vector registers) of 64.

Here's a script to automate the process of recompiling and running the code. (The X1 command, "pat_hwpc," dumps performance information for each run, like setting F_PROGINF=DETAIL on the SX-6 or running "hpm" on the SV1ex).


  for RD in 1 2 3 4 5 6 7 8 10 20 30 40 50 100 200 300 400 500 1000
  do
    echo "========================="
    echo "Recompiling with RD=${RD}"
    ftn -o veclen -eZ -F -D RECURDIST=${RD} veclen.f
    pat_hwpc ./veclen
  done

And, having run the script, here's a table of results. In this table, each line is one run of the program, where:

    RD     : Value of RECURDIST used
    CPUT   : CPU time (from pat_hwpc)
    MFLOPS : MFLOPS, total for the run (from pat_hwpc)
    VLEN   : Average vector length used by all vector instructions (from pat_hwpc)
    VINST  : Total number of vector instructions in the run (from pat_hwpc)

    RD     CPUT  MFLOPS    VLEN         VINST

     1   311.11     7.1   61.19         33470 
     2   263.63     8.3   61.58         33470 
     3    76.73    28.7    3.00    1833366827 
     4    58.79    37.4    4.00    1375033453 
     5    47.17    46.6    5.00    1100033457 
     6    39.90    55.1    6.00     916700142 
     7    34.31    64.1    7.00     785747782 
     8    30.11    73.1    8.00     687533453 
    10    24.24    90.8   10.00     550033457 
    20    12.77   172.3   20.00     275033457 
    30     8.72   252.3   30.01     183366827 
    40     7.04   312.6   40.01     137533457 
    50     5.86   375.3   50.00     110033457 
   100     4.29   513.1   64.00      85970964 
   200     2.12  1037.2   64.00      85970978 
   300     1.81  1212.6   64.00      85970985 
   400     1.52  1444.7   64.00      85970999 
   500     1.49  1476.3   64.00      85971006 
  1000     1.49  1476.9   64.00      85971062 

The number of operations done in each run is nearly identical, so looking at the CPUT column, the benefit of longer vector length is clear. Since each VINST performs VLEN actual operations, the inverse relationship between VLEN and VINST is as expected.
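
The inverse relationship is easy to check from the table itself. The Python sketch below is an illustration only. The numbers are copied from the X1 table above for the runs that vectorized (RD>=3), and the variable names are my own. It multiplies VLEN by VINST for each run and reports how much the products differ.

```python
# (RD, VLEN, VINST) for the X1 runs that vectorized, copied from the table
runs = [
    (3, 3.00, 1833366827),
    (4, 4.00, 1375033453),
    (5, 5.00, 1100033457),
    (6, 6.00, 916700142),
    (7, 7.00, 785747782),
    (8, 8.00, 687533453),
    (10, 10.00, 550033457),
    (20, 20.00, 275033457),
    (30, 30.01, 183366827),
    (40, 40.01, 137533457),
    (50, 50.00, 110033457),
    (100, 64.00, 85970964),
    (200, 64.00, 85970978),
    (300, 64.00, 85970985),
    (400, 64.00, 85970999),
    (500, 64.00, 85971006),
    (1000, 64.00, 85971062),
]

# Element operations per run = average vector length * vector instructions
products = [vlen * vinst for _, vlen, vinst in runs]
for (rd, _, _), product in zip(runs, products):
    print(f"RD={rd:5d}  VLEN*VINST = {product:.4e}")

# Relative difference between the largest and smallest product
spread = (max(products) - min(products)) / min(products)
print(f"relative spread: {spread:.3%}")
```

Every product comes out close to 5.5 billion element operations, which is consistent with the same amount of work simply being packaged into fewer, longer vector instructions as RD grows.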

One might ask why MFLOPS tops out at about 1500 on this 12.8 GFLOPS multi-streaming processor. Part of the answer is clear from another of the "negmsgs" returned by the compiler:

     A loop starting at line 11 was not multi-streamed because a
     recurrence was found on "F" at line 12.

While the compiler can vectorize the recurrence, it can't multi-stream it. Thus, the loop is confined to only one of the four 3.2 GFLOPS single-streaming processors that together make up one (multi-streaming) processor.

For comparison, here are SX-6 results for the same program and a similar script. The SX-6 compiler command is:

  f90 -Ep -D RECURDIST=${RD} -Wf"-pvctl infomsg" -o veclen veclen.f

Performance numbers are taken from output of F_PROGINF=DETAIL.

    RD     CPUT  MFLOPS    VLEN         VINST

     1    38.68    56.9  185.88          186 
     2    28.97    76.1  185.92          186 
     3    28.51    77.5  186.60          184 
     4    23.13    95.1  186.02          186 
     5    23.14    95.1  186.70          184 
     6    24.28    90.6  186.74          184 
     7    23.39    94.1  186.79          184 
     8    23.29    95.3  186.22          186 
    10    23.73    92.7  186.94          184 
    20    23.42    93.9  187.43          184 
    30    24.04    91.5  187.92          184 
    40    21.38   102.9  188.41          184 
    50    21.38   102.9  188.90          184 
   100    21.05   104.6  191.34          184 
   200    22.20    99.1  186.47          195 
   300    21.21   103.8  182.11          206 
   400    21.09   104.3  178.20          217 
   500    21.26   103.5  182.35          217 
  1000    20.95   105.0  172.74          261 

Note that, unlike the factorial example, the SX-6 doesn't vectorize the recurrence in this test program, even when RD==1. The table bears this out: VINST stays in the low hundreds for every run, and increasing RD brings no dramatic improvement in CPU time.

Programming Environment Upgrade on Klondike

On 2/25/2004, we upgraded the default programming environment on the X1 from PE 5.0 to PE 5.1. At this point:

PrgEnv.old            : points to the former PrgEnv (5.0)
PrgEnv [the default]  : points to PE 5.1
                      : points to the current PrgEnv (5.1),
                        but can be updated with little
                        notice and no internal review
                        as Cray releases new versions of
                        compilers and other PE components.

Anyone needing to conduct tests using the old PE can switch back with the command:

  module switch PrgEnv PrgEnv.old

For more on programming environments and "module" commands read "news prgenv", "man module", or contact

PEvers Utility Available on Klondike

Thanks to John Metzner of Cray Inc. for porting the "PEvers" tool to the X1. This shows you all available versions of the programming environment products, and most importantly, shows which is the default.

The default PE is what you get with the following command, which is included in every user's .profile or .login shell startup file:

    %  module load PrgEnv

Here's a portion of the output of PEvers, giving information on the ftn compiler:

  KLONDIKE:baring$ PEvers
  The following Programming Environment Packages are installed:
The current default version is


PEvers is available on the X1, T3E, and SV1ex at ARSC.

Quick-Tip Q & A

A:[[ A word processor I use on another operating system is always
  [[ changing what I type.  For instance "..." becomes one character which
  [[ looks like three dots spaced a little differently.  Text export of a
  [[ file containing these things doesn't fix them.  How do I get rid of
  [[ these when I ftp the file to my Unix box, where my "..." now looks
  [[ like "\311" ?

# Thanks to Andrew Markiel:

1) Somewhere in the menus is an "AutoCorrect" menu. Turn off all of the
auto-correction features.

2) Use OpenOffice, which is a free open-source cross-platform
alternative to Office. It'll still AutoCorrect your text
(unless you turn it off), but you don't have to convert the file in
order to read it on your Unix box.

3) Use JEdit, which is a free open-source cross-platform
Java-based source-code editor (which also works OK for text). It does a
better job of avoiding AI (artificial ignorance).

# Thanks to Greg Newby:

Changing '...' to an ellipsis character is a form of auto-correction,
just like your word processor probably changes 'hte' to 'the'.  You
should be able to turn this off for any substitutions that you would
rather not have.  Other common and annoying substitutions include
replacing '(c)' with a circle-C, and automatically superscripting 'tm'.
Turning these substitutions off will save you trouble later.

If you can't prevent the strange characters, try the "recode" command
(pre-installed on most Linux systems, and otherwise available for
download). This takes a file and changes it from
one character set to another.

Something along these lines would work (but check first with a backup
file; add "-sqf" to force one-way transformations):
        recode cp1252..latin1 filename.txt
or      recode cp1252..ascii filename.txt
or      recode cp1252..dos filename.txt

(there are different input and output character sets; use "recode -l"
for a listing).

# From the editor

If you've got a file containing unwanted codes, here's another way to
work them out:

1) look at the file (or part of it) using "od -c".  This shows the 
   octal codes for the non-printing characters: 
  %    od -c file.txt
  0000000  \r   t   h   i   s       i   s       a       l   i   t   t   l
  0000020   e       d   e   m   o       o   f       t   h   e     311    
  0000040   f   e   a   t   u   r   e   s       o   f  \r   t   h   e    
  0000060   a   u   t   o   c   o   r   e   c   t     311       f   u   n
  0000100   c   t   i   o   n       o   f       t   h   i   s       w   o
  0000120   r   d       p   r   o   c   e   s   s   o   r     252   ,    
  0000140   h   e   r   e   .      \r  \r  \n

2) Decide on suitable replacements for the non-printing codes.  For
instance, change octal 311 to "...", 252 to "(tm)", and carriage 
returns to newlines.

3) The following perl command will do it, printing the results to
stdout, like this:

  %    perl -p -e 's/\r/\n/g; s/\252/\(tm\)/g; s/\311/.../g;'  file.txt

  this is a little demo of the ... features of
  the autocorect ... function of this word processor (tm), here.
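
If you'd rather decode the file than patch individual bytes, here is a Python sketch of the same cleanup. It rests on an assumption that is not stated above: that the file came from a classic Mac OS word processor and is in the Mac Roman character set. The carriage-return line endings and the octal codes 311 (ellipsis) and 252 (trademark sign) are consistent with that, but check with "od -c" before trusting it. The function name and the replacement table are my own, for illustration only.

```python
# ASCII stand-ins for the "smart" characters seen in the od dump
REPLACEMENTS = {
    "\u2026": "...",   # horizontal ellipsis, byte 0o311 in Mac Roman
    "\u2122": "(tm)",  # trade mark sign, byte 0o252 in Mac Roman
}


def to_plain_ascii(raw: bytes) -> str:
    """Decode Mac Roman bytes, fix line endings, undo auto-corrections."""
    text = raw.decode("mac_roman")  # assumed source character set
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    return text


# A shortened version of the demo file from the od dump
sample = b"\rthe \311 features of\rthis word processor \252, here.\r\r\n"
print(to_plain_ascii(sample))
```

Unlike the perl one-liner, this treats a "\r\n" pair as a single line ending, so the output has one fewer trailing blank line.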

Q: Is there a way to invalidate my Kerberos ticket before I trot off 
   to lunch?  It seems a little risky to leave valid tickets sitting on
   my workstation when I'm not around.

[[ Answers, Questions, and Tips Graciously Accepted ]]

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.