ARSC HPC Users' Newsletter 237, January 18, 2002
- Jan 24: SV1ex Celebration and Reception
- Jan 25: High Performance Bioinformatics Lecture
- Tuning a C++ MPI Code with VAMPIR (part III of III)
- CUG SUMMIT 2002 Call For Papers
- Quick Tip
Jan 24: SV1ex Celebration and Reception
ARSC is proud to have installed the first SV1ex memory upgrade (last month) and the first SV1e processor upgrade (last April).
You are invited to attend a reception and multimedia presentation with ARSC staff, celebrating the first Cray SV1ex.
Visitors will include Cray representatives, among them CEO Michael Haydock, as well as researchers from the Institute for Systems Biology and the National Cancer Institute. Local Alaskan researchers and UA leadership will also be present.
Thursday, Jan 24, 4:30 pm. Butrovich 109, Board of Regents Conference Room, UAF.
Jan 25: High Performance Bioinformatics Lecture
SPEAKER  : Jack R. Collins, Ph.D.
           National Cancer Institute
           Advanced Biomedical Computing Center
TITLE    : High Performance Bioinformatics and Proteomics: Mining the Genome and Proteome
DATE     : Friday Jan. 25, 2002
TIME     : 1:30 - 2:30pm
LOCATION : Butrovich 109, Board of Regents Conference Room
ABSTRACT :
The public availability of the human genome along with protein, SNP, EST, and cancer databases has led to an effort to determine the underlying information content contained within the cornucopia of data. With the vast amounts of data and unending number of questions that can be asked, a high performance computational approach is the only practical approach to converting data to usable information. To this end, we are developing tools to analyze this data for both DNA and protein sequences. Specific examples of comparative genomics and analysis, protein family classification, and correlation with etiological implications will be presented, using data from the Cancer Genome Anatomy Project (CGAP) and the Online Mendelian Inheritance in Man (OMIM).
[Sponsored by ARSC]
Tuning a C++ MPI Code with VAMPIR (part III of III)
[ Thanks to Jim Long of ARSC for this series of articles. ]
In part one of this series (issue #233) we looked at a C++ MPI tracefile of the Terrestrial Ecosystem Model (TEM) with VAMPIR in order to identify areas of the MPI code that might be tuned to improve scalability in a sensitivity analysis scenario. In part two (issue #235) we implemented changes to the transient portion of the TEM MPI code and compared old vs. new, finding a platform-dependent 10% to 40% reduction in the transient computation time for a typical run with 8 output variables written to file. In this last part, we examine the impact of the tuned MPI code on scalability for a full suite of runs on the Cray T3E and IBM SP.
The suite consists of runs that write 1, 2, 4, 8, 16, 24, 32, or 48 output variables to file, repeated for each number of slave processors chosen. TEM always computes these variables, but requesting that more of them be written increases the load on the system.
On the T3E, the number of slave processors chosen was 1, 2, 4, 8, 16, 32, 64, and 128. On the SP, because there are 4 CPUs per node and resources are allocated by node, the total number of CPUs (slaves + 1 master) had to be either 1, 2, 3, or 4 CPUs/node times the number of nodes chosen. Thus the number of slaves used was 1, 2, 3, 7, 15, 31, 63, and 127.
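The SP allocation rule can be checked with a short sketch. The helper name `sp_allocation` and its search over node counts are illustrative assumptions, not part of the TEM code or the SP scheduler; only the figure of 4 CPUs per node and the list of slave counts come from the text above.

```python
CPUS_PER_NODE = 4  # from the text: each SP node has 4 CPUs


def sp_allocation(slaves, cpus_per_node=CPUS_PER_NODE):
    """Return (nodes, cpus_used_per_node) for `slaves` slaves plus 1 master.

    Hypothetical helper: finds the smallest node count that holds all
    processes with the same number of CPUs (1..cpus_per_node) on each node.
    """
    total = slaves + 1  # one master in addition to the slaves
    min_nodes = -(-total // cpus_per_node)  # ceiling division
    for nodes in range(min_nodes, total + 1):
        if total % nodes == 0:
            # nodes >= min_nodes guarantees total // nodes <= cpus_per_node
            return nodes, total // nodes


sp_slaves = [1, 2, 3, 7, 15, 31, 63, 127]
allocations = {s: sp_allocation(s) for s in sp_slaves}
for s, (nodes, per_node) in allocations.items():
    print(f"{s:3d} slaves + 1 master = {nodes:2d} node(s) x {per_node} CPU(s)/node")
```

Under the same assumed rule, `sp_allocation(4)` returns `(5, 1)`: 4 slaves plus a master only divide evenly as 5 nodes with 1 CPU each, which suggests why the T3E slave counts could not be reused directly on the SP.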
In either case, we are interested in how well the code scales from the 1 slave case on each platform for each of the output variable scenarios. The traditional view of scalability looks at how closely the speedup curves for a series of runs follow the ideal linear speedup case. Figure 1 shows this view for the T3E (yukon) and Figure 2 for the SP (Icehawk). Both figures show the original code contrasted with the tuned code.
Speedup for a particular N output variable case is defined as [runtime for 1 slave case] / [runtime for number of slaves used]. Ideal speedup occurs when doubling the number of slaves halves the runtime. The speedup curves for the tuned code lie closer to the ideal linear speedup curve than the curves for the original code, which indicates increased scalability. Both the T3E and the SP show improved speedups for every number of output variables.
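As a minimal sketch of the speedup definition above, the following computes speedup relative to the 1 slave case. The function name and the runtimes are hypothetical and chosen only for illustration; they are not measurements from yukon or Icehawk.

```python
def speedup(runtimes):
    """Map each slave count to runtime(1 slave) / runtime(that many slaves)."""
    base = runtimes[1]
    return {n: base / t for n, t in runtimes.items()}


# Hypothetical runtimes in seconds, keyed by number of slaves (not measured data).
example_runtimes = {1: 800.0, 2: 400.0, 4: 250.0, 8: 200.0}

example_speedup = speedup(example_runtimes)
for n in sorted(example_speedup):
    # Ideal (linear) speedup equals the number of slaves.
    print(f"{n} slaves: speedup {example_speedup[n]:.2f} (ideal {n})")
```

In this made-up data set, going from 1 to 2 slaves is ideal (the runtime halves), while the 4 and 8 slave runs fall below the ideal line.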
Another way to view the results is to look at the cost for doing a computation. Cost in our case is defined as
Cost = [total runtime] x [number of CPUs (master+slaves) used]
In code that does not need a master, the minimum cost is usually attained on just one CPU, unless superlinear speedup occurs due to optimal cache use with more CPUs. For code that requires a master that does no computation, like TEM, the total cost is usually minimized using more than one slave, amortizing the cost of the master. In a sensitivity scenario, cost might be a factor since the model may be run thousands of times. Increased scalability might mean that cost will be a minimum for a greater number of processors than before. Figures 3 & 4 contrast the cost curves for both the T3E (yukon) and the SP (Icehawk).
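The cost definition, and the way a non-computing master is amortized, can be illustrated with a small sketch. The function name and the runtimes below are hypothetical and are not taken from the TEM runs.

```python
def cost(runtime, slaves, masters=1):
    """Cost = total runtime x number of CPUs (master + slaves) used."""
    return runtime * (slaves + masters)


# Hypothetical runtimes in seconds, keyed by number of slaves (not measured data).
example_runtimes = {1: 800.0, 2: 400.0, 4: 250.0, 8: 200.0}

example_costs = {n: cost(t, n) for n, t in example_runtimes.items()}
cheapest = min(example_costs, key=example_costs.get)

for n in sorted(example_costs):
    print(f"{n} slaves: cost {example_costs[n]:.0f} CPU-seconds")
print(f"minimum cost at {cheapest} slaves")
```

With these made-up numbers, the 1 slave run pays for two CPUs while only one computes, so its cost is higher than the 2 slave run. Beyond that point the runtime no longer drops fast enough to offset the extra CPUs, and cost rises again.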
The "sweet spot" for minimum cost changes little for the T3E, although overall costs have dropped. Minimum cost has moved out to 8 slaves for 1 or 2 output variables, but remains at 4 slaves for other numbers of output variables. It should be noted, however, that numbers of slaves between 4 and 8 were not tested, and the flattening of the curves for the tuned code might indicate that the sweet spot could move out a CPU or two, to 5 or 6 slaves. On the SP, the cost minimum moves out to more CPUs when more outputs are requested, in contrast to the T3E, where it moved out only for the smallest numbers of output variables. A definite shift of the minimum to the right is more evident on the SP.
For those considering a cluster, the performance/price ratio, or "bang per buck", should be considered. How many nodes should you buy for your application? Figure 5 shows "bang per buck" for the IBM SP, nicely illustrating the idea.
If performance is defined as [runs per day], and we divide that by the cost, we get our performance/price ratio. With the tuned code, the ratio increased for all intermediate numbers of slaves, almost doubling for the case of 48 outputs, where the peak value occurs at 16 CPUs. It is interesting that the maximum "bang per buck" occurred at 16 CPUs for both the original and the tuned code; the increased scalability only raised the value.
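A minimal sketch of the ratio follows, assuming that "cost" here is the same [total runtime] x [number of CPUs] quantity defined earlier and that runs per day is simply the number of seconds in a day divided by the runtime. The function names and runtimes are hypothetical, not data from Figure 5.

```python
SECONDS_PER_DAY = 86400.0


def runs_per_day(runtime_s):
    """Performance: how many back-to-back runs fit in one day."""
    return SECONDS_PER_DAY / runtime_s


def bang_per_buck(runtime_s, cpus):
    """Performance/price: runs per day divided by cost (runtime x CPUs)."""
    return runs_per_day(runtime_s) / (runtime_s * cpus)


# Hypothetical runtimes in seconds, keyed by total CPUs (master + slaves).
example_runtimes = {2: 800.0, 4: 300.0, 8: 160.0, 16: 100.0, 32: 80.0}

ratios = {p: bang_per_buck(t, p) for p, t in example_runtimes.items()}
best_cpus = max(ratios, key=ratios.get)

for p in sorted(ratios):
    print(f"{p:2d} CPUs: {ratios[p]:.3f} runs/day per CPU-second")
print(f"best performance/price at {best_cpus} CPUs")
```

Because runtime appears in both the performance and the cost terms, the ratio rewards configurations where runtime is still dropping quickly and penalizes adding CPUs once the curve flattens.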
One final consideration: the analyses in this series of articles considered only the MPI implementation, not TEM itself. Other opportunities for improvement exist, for instance in I/O schemes and file formats.
CUG SUMMIT 2002 Call For Papers
Jan 25th is the deadline for submitting abstracts to the Manchester CUG. Details:
Quick-Tip Q & A
A:[[ When building cvs on AIX the 'make check' fails.
  [[
  [[ Tracking it down comes down to a difference in the expr command
  [[
  [[   aix$ expr 'Name: $.' : "Name: \$\." ; echo $?
  [[   0
  [[   1
  [[
  [[   irix$ expr 'Name: $.' : "Name: \$\." ; echo $?
  [[   8
  [[   0
  [[
  [[ It works "correctly" on Irix, Unicos, Tru64 and Linux. It fails on AIX
  [[ and Solaris. The above works on both Solaris and AIX if one adds a
  [[ space between the '$' and the '.' (both sides of the ':').
  [[
  [[ Anybody else ever run into this? Any suggestions?

  Suggestion:

  Expect variability between Unices. The purpose of the above command
  within the "make check" is to try to detect and manage differences on
  different systems.

  Explanation from cvs support regarding "make check":

    The tests use a number of tools (awk, expr, id, tr, etc.) that are
    not required for running CVS itself. In most cases, the standard
    vendor-supplied versions of these tools work just fine, but there
    are some exceptions. [...] The test script tries to verify that the
    tools exist and are usable; if not, it tries to find the GNU
    versions and use them instead. If it can't find the GNU versions
    either, it will print an error message and, depending on the
    severity of the deficiency, it may exit.

Q: The quick tip needs a short vacation... If you have any ideas,
   please send them in.
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.