ARSC HPC Users' Newsletter 255, October 4, 2002
Contents
ARSC Accepting Account Applications for the SX-6
Start at our home page:
and descend into the page, "Benchmarking the SX-6". You'll find a "Getting Started" guide for new users, more information on the SX-6, and account application requirements and forms.
NOTE: We will be staging the creation of accounts on the SX-6. Please apply when convenient, but realize there may be a delay before we activate your account.
Cray Bioinformatics Library Installed on Chilkoot
Version 1.0 of the Cray bioinformatics library (cbl) has been installed on chilkoot, and is available for testing.
From man: INTRO_LIBCBL(3B)
"LIBCBL routines perform low level bit manipulation and searching operations useful in the analysis of nucleotide and amino acid sequence data."
The bioinformatics library requires programming environment 3.6 (PE 3.6). Please note that although PE 3.6 has been installed and is available as module "PrgEnv.new," internal testing of PE 3.6 is not yet complete.
To use the bioinformatics library, load modules, as follows:
chilkoot$ module switch PrgEnv PrgEnv.new
chilkoot$ module load biolib
The man pages are available at:
http://www.arsc.edu/support/manuals/LIBCBL/
Please let us know if you use biolib, as we're quite interested in your observations, difficulties, and successes.
COMCOT: Case Study of Port to the Cray SV1ex
[ Thanks to Tom Logan of ARSC for this contribution. ]
I recently was passed COMCOT, the COrnell Multi-grid COupled Tsunami model, to port to the ARSC SV1ex system. This multi-grid tsunami simulation model is based on shallow water wave equations that have been implemented in FORTRAN.
My first task was creation of a Makefile for the code. I copied another Chilkoot FORTRAN makefile that I had, and used compiler options for aggressive optimization ( -Otask3,aggress ).
Compilation was not a difficult task, and, once I verified that the code was giving reasonable results, I moved on to a performance analysis. Using the hpm tool, I found that my initial run only achieved a dismal 42 Mflops:
Group 0: CPU seconds :3285.91257 CP executing : 1642956285740
Million inst/sec (MIPS) : 148.93 Instructions : 489377395309
Avg. clock periods/inst : 3.36
% CP holding issue : 50.46 CP holding issue : 828998044929
Inst.buffer fetches/sec : 2.79M Inst.buf. fetches: 9164820769
Floating adds/sec : 23.95M F.P. adds : 78705961705
Floating multiplies/sec : 16.12M F.P. multiplies : 52964667681
Floating reciprocal/sec : 1.56M F.P. reciprocals : 5118700959
Cache hits/sec : 103.31M Cache hits : 339467490952
CPU mem. references/sec : 20.57M CPU references : 67600608759
Floating ops/CPU second : 41.63M
My next step was to profile the code using perfview. To do this, I added some compiler options ( -Otask3,aggress -ef -l perf ) and re-ran the code.
Using perfview on the resulting perf.data file, I quickly noticed that 96.5% of the time was spent in a single routine named "CONMOME." This appears in the following perfview pie chart giving the percentage of time spent in each subroutine:
Using the 'Observations' report of perfview, I determined that CONMOME was only achieving 38.4 Mflops. The observations for CONMOME included the following:
This routine appears to be a partially vectorized code.
WARNING: this routine does not use efficient execution paths.
It is jumping too much and too far, which causes a very high
rate (2.9) of instruction buffer fetches.
Suggestions:
* Study this routine to determine whether the large amount
of jumping can be lessened. It might be desirable to move
frequently-called routines/functions into this routine. This
process is called 'inlining'. The Fortran and C compilers are
capable of performing automatic inlining.
At this point, I knew which routine in the code I needed to work on, but additional information would be helpful. To this end, I added compiler switches to give full compilation messages, both positive and negative ( -eo -Otask3,msgs,negmsgs,aggress -Ca ), re-ran the code, and used the ftnlist command to get a complete compilation report. Looking at the compiler messages displayed to stdout (which are also listed in the output of ftnlist), I saw:
DO 100 I = 2, IX-1
f90-6518 f90: TASKING CONMOME, File = comcot_v1_4.f, Line = 2815
A loop starting at line 2815 was not tasked because it contains
an alternate exit.
f90-6250 f90: VECTOR CONMOME, File = comcot_v1_4.f, Line = 2815
A loop starting at line 2815 was not vectorized for an unspecified
reason.
DO 100 J = 2, JY
f90-6518 f90: TASKING CONMOME, File = comcot_v1_4.f, Line = 2816
A loop starting at line 2816 was not tasked because it contains
an alternate exit.
f90-6270 f90: VECTOR CONMOME, File = comcot_v1_4.f, Line = 2816
A loop starting at line 2816 was not vectorized because it contains
conditional code which is more efficient if executed in scalar mode.
DO 200 J = 2, JY-1 !3
f90-6294 f90: VECTOR CONMOME, File = comcot_v1_4.f, Line = 2958
A loop starting at line 2958 was not vectorized because a better
candidate was found at line 2959.
f90-6403 f90: TASKING CONMOME, File = comcot_v1_4.f, Line = 2958
A loop starting at line 2958 was tasked.
DO 200 I = 2, IX !3
f90-6270 f90: VECTOR CONMOME, File = comcot_v1_4.f, Line = 2959
A loop starting at line 2959 was not vectorized because it contains
conditional code which is more efficient if executed in scalar mode.
f90-6419 f90: TASKING CONMOME, File = comcot_v1_4.f, Line = 2959
A loop starting at line 2959 was tasked as part of the loop starting
at line 2958.
The compiler messages reinforced the perfview observations: the code was not vectorizing because too much jumping was occurring. They also showed exactly where the loops occurred and hinted that vectorization was inhibited by conditional code. Looking at the code, I found the loop at line 2815 contained:
2815       DO 100 I = 2, IX-1
2816          DO 100 J = 2, JY
2817
2818 C..GUESS THE RUNUP NEVER REACH AT HZ(I,J)=ELMAX
2819             IF (HZ(I,J) .LE. ELMAX) THEN
2820 C              GOTO 100
2821                WRITE(*,*)'WARNING !!! Maximum runup height reachs ELMAX at'
2822                write(*,'(2i10)')i,j
2823                stop
2824             ELSE
2825 C..CALCULATE X-DIRECTION LINEAR TERMS
2826                IF (HP(I,J) .LE. ELMAX) THEN
2827                   P(I,J,2) = 0.0
2828                   GOTO 110
2829                ENDIF
2830
2831 C..MOVING BOUNDARY
2832                IF (DZ(I,J,2) .LE. 0.0) THEN
2833                   IF (DZ(I+1,J,2) .GT. 0.0 .AND.
2834      +                HZ(I,J)+Z(I+1,J,2) .GT. 0.0) THEN
2835                      DD = HZ(I,J) + Z(I+1,J,2)
2836                      DF = DD
2837                   ELSE
2838                      P(I,J,2) = 0.0
2839                      GOTO 110
2840                   ENDIF
2841                ELSE
2842                   IF (DZ(I+1,J,2) .LE. 0.0) THEN
2843                      IF (HZ(I+1,J)+Z(I,J,2) .LE. 0.0) THEN
2844                         P(I,J,2) = 0.0
2845                         GOTO 110
2846                      ELSE
2847                         DD = HZ(I+1,J) + Z(I,J,2)
2848                         DF = DD
2849                      ENDIF
2850                   ELSE
2851                      DD = DP(I,J,2)
2852                      DF = DP(I,J,1)
2853                   ENDIF
2854                ENDIF
       .
       .
       .
2953             ENDIF
2954  110       CONTINUE
2955  100 CONTINUE !7!
With the perfview observations and compiler messages in mind, it was clear that the IF statement starting at line 2819 provided the alternate exit that inhibited tasking. It also occurred to me that the IF statement at line 2843 was quadruply nested.
The first step in modifying the code was to remove the alternate exit (stop) statement from the loops. Once I verified that HZ(I,J) is not modified in the loop body, it was simple to create a separate loop to contain the if statement that had started at line 2819.
The next step was to modify the conditionals in an attempt to simplify the code. Looking at the code segment from lines 2832 to 2854, I created the following truth table:
------------------------------------------------------------------
DZ(I,J,2)            <    <    <    >    >    >
DZ(I+1,J,2)          >    >    <    <    <    >
HZ(I,J)+Z(I+1,J,2)   >    <
HZ(I+1,J)+Z(I,J,2)                  <    >
------------------------------------------------------------------
outcome              1    2    2    2    3    4

(Here "<" means .LE. 0.0 and ">" means .GT. 0.0; a blank entry means the value is not tested.)

Where outcome is:
  1) DD = HZ(I,J) + Z(I+1,J,2)
     DF = DD
  2) P(I,J,2) = 0.0
     GOTO 110
  3) DD = HZ(I+1,J) + Z(I,J,2)
     DF = DD
  4) DD = DP(I,J,2)
     DF = DP(I,J,1)
Turning the truth table into an IF statement and combining it with the new separate 'stop' loop reduced the complexity of the conditionals from quadruply nested to a single flat IF-ELSEIF statement. The final transformed code segment follows:
C..GUESS THE RUNUP NEVER REACH AT HZ(I,J)=ELMAX
      DO I=2,IX-1
        DO J=2,JY
          IF (HZ(I,J) .LE. ELMAX) THEN
C           GOTO 100
            WRITE(*,*)'WARNING !!! Maximum runup height reachs ELMAX at'
            write(*,'(2i10)')i,j
            stop
          ENDIF
        END DO
      END DO
C!3 MODIFY THE DOMAIN OF COMPUTATION
      DO 100 I = 2, IX-1
        DO 100 J = 2, JY
C..CALCULATE X-DIRECTION LINEAR TERMS
          IF (HP(I,J) .LE. ELMAX) THEN
            P(I,J,2) = 0.0
            GOTO 110
          ENDIF
C..MOVING BOUNDARY
          IF (DZ(I,J,2).GT.0.0 .AND. DZ(I+1,J,2).GT.0.0) THEN
            DD = DP(I,J,2)
            DF = DP(I,J,1)
          ELSE IF (DZ(I,J,2).GT.0.0 .AND. DZ(I+1,J,2).LE.0.0 .AND.
     &             HZ(I+1,J)+Z(I,J,2) .GT. 0) THEN
            DD = HZ(I+1,J) + Z(I,J,2)
            DF = DD
          ELSE IF (DZ(I,J,2).LE.0.0 .AND. DZ(I+1,J,2).GT.0.0 .AND.
     &             HZ(I,J)+Z(I+1,J,2) .GT. 0.0) THEN
            DD = HZ(I,J) + Z(I+1,J,2)
            DF = DD
          ELSE
            P(I,J,2) = 0.0
            GOTO 110
          ENDIF
          .
          .
          .
  110   CONTINUE
  100 CONTINUE !7!
Without going into the details, the same form of nested conditionals occurred in the loops starting at 2958 of the original code. Thus, a nearly identical transformation was applied. The only difference was that this second set of loops did not contain an alternate exit, so no separate loop was required.
Running the modified code was a gratifying experience.
Group 0: CPU seconds : 429.58201 CP executing : 214791004970
Million inst/sec (MIPS) : 64.45 Instructions : 27685033782
Avg. clock periods/inst : 7.76
% CP holding issue : 81.35 CP holding issue : 174741062314
Inst.buffer fetches/sec : 0.78M Inst.buf. fetches: 334511361
Floating adds/sec : 192.86M F.P. adds : 82850712958
Floating multiplies/sec : 126.88M F.P. multiplies : 54504611332
Floating reciprocal/sec : 13.73M F.P. reciprocals : 5898268294
Cache hits/sec : 174.08M Cache hits : 74783227950
CPU mem. references/sec : 273.61M CPU references : 117539868588
Floating ops/CPU second : 333.47M
The perfview pie chart shows that the modified version only spends 75.3 percent of its time in CONMOME:
Also, from the 'Observations' report of perfview, I determined that CONMOME itself was now achieving 381 Mflops. The observations for CONMOME included the following:
This routine appears to be a highly efficient vectorized code.
So, the code was now performing at roughly eight times its initial rate. However, additional improvement was immediately gained by utilizing chilkoot's autotasking ability. In compiling the code with the -Otask3,aggress switch, I had already specified that autotasking would be utilized. Using autotasking only required setting the environment variable NCPUS to the number of processors desired. Here are the results for 1 - 8 processors:
#PEs   CPU-Time   Wall Clock   Mflops/PE   Total Mflops
-----  --------   ----------   ---------   ------------
  1      429         432         333           333
  2      479         254         299           598
  4      523         147         274          1095
  8      570          91         251.5        2012
In summary, using the tools HPM, perfview, and ftnlist, along with compiler options for displaying positive and negative compilation messages, I was able to readily analyze COMCOT's performance, diagnose the problem, pinpoint the code's bottleneck, and receive valuable hints on what changes were required.
By slight code modification and the use of compiler options for aggressive optimization and autotasking, I was able to increase the performance of this code by a factor of 8 on a single CPU and by a factor of 48 using 8 CPUs. What started out as a 40 Mflops code now runs at just over 2 Gflops.
Hail Hail CRAY's tools and compilers!
Portable Nucleotide String Compression: Part I, Endian Enigmas
[ Thanks to Jim Long of ARSC for contributing this 2-part series. ]
This series of articles presents some issues that arise when writing portable code in the context of compressing nucleotide text strings and producing the reverse complement of such strings.
Portability across both big- and little-endian architectures is a key issue in bioinformatics. Researchers in bioinformatics are strong proponents of open-source and commodity Linux boxes (often running little-endian, Intel processors), but also perform significant work on higher-end Unix boxes (typically big-endian).
Part I -- ENDIAN ENIGMAS
Before we can discuss compression schemes, let's first look at a little code that exposes the issue of "which"-endian:
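#include <stdio.h>

int main(void)
{
    char c[] = "acgta";
    unsigned int a[2], i;

    /* print the address of each byte of the string, in hex */
    for (i = 0; i < 5; i++)
        printf("address of c[%u] = %lX\n", i, (unsigned long) &c[i]);
    /* view the string as two 4-byte words; the second word reads past
       the 6-byte array, so its trailing bytes are junk */
    for (i = 0; i < 2; i++) printf("%X\n", ((unsigned int *) c)[i]);
    a[0] = ((unsigned int *) c)[0] >> 8;
    printf("%X\n", a[0]);
    printf("%s\n", (char *) a);
    return 0;
}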
The code first prints the address of each byte (in hex) of the text string "acgta". It then casts the "c" array to an int pointer and displays how the string is stored in two 4-byte words. Next, it takes the first 4-byte word, shifts it to the right by 8 bits, stores it in a[0], and prints it again. Finally, it casts the "a" unsigned int array containing that shifted word to a character pointer and prints the string found there. On a big-endian machine we get
address of c[0] = 7FFF2F00
address of c[1] = 7FFF2F01
address of c[2] = 7FFF2F02
address of c[3] = 7FFF2F03
address of c[4] = 7FFF2F04
61636774
61002F4C
616367
As one might expect, each successive letter of the string occupies the next higher byte of memory. When we examine the string as a word of memory, we see 61636774 as the first word, "acgt" where a=0x61, c=0x63, g=0x67, and t=0x74. The second word is 61002F4C, which is an "a", the last letter of the string, followed by the string null terminator (0x00) and whatever junk happened to be in the rest of the word (0x2F4C). The last entry, 0x00616367, is the shifted word with 0's filling in from the left. The 0's look like a null terminator so that when we ask to print out the string that "(char *) a" points to, we get only a newline from the printf statement.
Now let's look at output from the same code on a little-endian machine:
address of c[0] = BFFFFAD0
address of c[1] = BFFFFAD1
address of c[2] = BFFFFAD2
address of c[3] = BFFFFAD3
address of c[4] = BFFFFAD4
74676361
40000061
746763
cgt
Again, each successive letter of the string occupies the next higher byte of memory. When we examine the string as a word of memory, however, we see the letters reversed. This is sometimes explained by saying that the little-endian scheme stores the least significant byte of an integer in the lowest address of a word, and the most significant byte of an integer in the highest address of a word. Big-endian schemes do just the opposite. Since my computer prints the contents of a memory word (a number) on the screen in English (from left to right), the most significant byte will always be on the left, followed by the less significant bytes to the right. For the programmer, it is conceptually easier to think of big-endian machines as starting their first word of memory on the left and continuing to the right (like English), while little-endian machines start their first word of memory on the right and continue to the left (like Hebrew). On a 32-bit machine this looks like:
Big-endian:
<--------word0--------> <--------word1--------> <--etc-->
byte0 byte1 byte2 byte3 byte4 byte5 byte6 byte7
  a     c     g     t     a   null
Little-endian:
<--etc--> <--------word1--------> <--------word0-------->
          byte7 byte6 byte5 byte4 byte3 byte2 byte1 byte0
                      null    a     t     g     c     a
With this scheme in mind, we can now see why word0 prints out the way it does, and we can interpret the rest of the code output on a little-endian machine. Shifting the first word to the right results in byte0 falling off the word, instead of byte3 falling off as in a big-endian machine. The result is what we see, 0x00746763. Now when we ask for the string pointed to by "(char *) a", we get "cgt" because the byte0 "a" fell off the word, and the 0's that filled in from the left became the null terminator.
The above example looked at something that was already in memory, placed there byte-by-byte from a text string. What happens when we want to put some value into a word ourselves? Look at the next code:
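#include <stdio.h>

int main(void)
{
    unsigned int i = 0x00006100;

    printf("%X\n", i);
    i = i >> 8;
    printf("%X\n", i);
    /* print the byte at the lowest address of i as a character */
    printf("%c\n", *((char *) &i));
    return 0;
}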
The output on a big endian machine is:
6100
61
The output on a little endian machine is:
6100
61
a
What happened? Let's look at our mental picture when "i" is declared and assigned:
Big-endian:
<--------word0--------> <--etc-->
byte0 byte1 byte2 byte3
 00    00    61    00
Little-endian:
<--etc--> <--------word0-------->
          byte3 byte2 byte1 byte0
           00    00    61    00
Both architectures print out the same thing for the integer as initially stored, and when the integer is shifted to the right. The difference happens when the address of "i" is cast to a (char *), dereferenced, and printed as a character. The big-endian machine prints out its byte0, which is null (we get a newline from the printf statement), while the little-endian machine prints out its byte0, which is the letter "a". Code similar to this can be used in portability scenarios if it is important to determine which type of endianness some application is running on.
Note that I've used unsigned ints in these examples. If an int is signed, then on most compilers 1's will be shifted in from the left if the leftmost bit is a 1, and I want 0's shifted in no matter what bit pattern is in the word. 0's fill in from the right during a left shift whether the variable is signed or unsigned.
As mentioned at the start of this article, these issues are being discussed in the context of encoding nucleotide text strings on different architectures. The nucleotide alphabet consists of only 4 characters, A, C, G, & T. When manipulating strings of these letters, it is desirable to compress the text strings such that each letter occupies only 2 bits in a word instead of 8. Next time we'll look at some compression code for 2-bit encoding of nucleotide text strings on each architecture, and how to produce the reverse complement of a nucleotide string.
Draft Fortran 2000 Standard Available for Comment
This announcement was submitted to the Fortran 90 List, <COMP-FORTRAN-90@JISCMAIL.AC.UK> by John Reid. ARSC users are encouraged to look over the draft standard, and submit comments if interested:
> I am pleased to tell you that the draft Fortran 2000 standard is now out
> for comment. The official version is available from
>
> http://www.dkuug.dk/jtc1/sc22/open/n3501.pdf
>
> The J3 (USA Fortran committee) version, which is identical except for
> the title page and the headers and footers, is available in ps, pdf,
> text, or source (latex) from
>
> ftp://ftp.j3-fortran.org/j3/doc/standing/007/
>
> This is a very significant milestone for Fortran 2000. It is a major
> extension of Fortran 95 that has required a significant amount of
> development work by the J3. The main features were decided at a meeting
> of the ISO Fortran committee WG5 in 1997.
[... cut ...]
> I have written an informal description of the new features, which will
> be published in the next issue of Fortran Forum (about to appear). It
> is also available from
>
> ftp://ftp.nag.co.uk/sc22wg5/N1451-N1500/N1495.pdf
>
> It is an unofficial document written by me and has not been formally
> approved by either WG5 or J3. If you base a comment on what I say,
> check with the draft standard in case I got it wrong.
Quick-Tip Q & A
A:[[ When I run my MPI program, some tasks start spitting error messages,
[[ which get all mixed up together, and then it stops.
[[
[[ I'd like to know which message comes from which task, and, sure, I
[[ could fix the code so every message is prefaced with the task number,
[[ but there might be an easier way. Is there?
# Thanks to Kevin Kennedy for this solution for AIX platforms:
Simple, when using csh, use:
setenv MP_STDOUTMODE ordered
# On other platforms you can use the "mpirun ... -prefix ..."
# option. From (IRIX) "man mpirun":
-p[refix] prefix_string
Specifies a string to prepend to each line of output from stderr and
stdout for each MPI process. To delimit lines of text that come from
different hosts, output to stdout must be terminated with a new line
character.
Some strings have special meaning and are translated as follows:
* %g translates into the global rank of the
process producing the output. (This is
equivalent to the rank of the process in
MPI_COMM_WORLD.)
* %G translates into the number of processes
in MPI_COMM_WORLD.
[... cut ...]
Example:
% mpirun -prefix "<process %g out of %G> " 4 a.out
<process 1 out of 4> Hello world
<process 0 out of 4> Hello world
<process 3 out of 4> Hello world
<process 2 out of 4> Hello world
Q: Can I see the list of file names the shell _would_ generate from a
wildcard expansion? Without actually _doing_ something with the
list? This command,
$ ls -1 *2001*
for instance, lists all the files and directories selected with the
wildcard string, "*2001*", but it also descends into any directories
selected.
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
