ARSC HPC Users' Newsletter 254, September 13, 2002
Review of Bioinformatics Codes Available at ARSC
[ Thanks to Jim Long for this contribution. ]
ARSC is building a Bioinformatics infrastructure to support research in this important new area. To date, the following software has been installed on the SGIs, and is planned for the IBM platforms:
BLAST (Basic Local Alignment Search Tool)
A heuristic algorithm to query a protein or DNA sequence against available databases. Documentation for the NCBI web version: http://www.ncbi.nlm.nih.gov/BLAST/
CLUSTALW and CLUSTALX
A heuristic algorithm for multiple sequence alignment, based on phylogenetic analysis. clustalx is an X Window interface to clustalw that supports PostScript output and other features.
FASTA & SSEARCH
FASTA is another heuristic algorithm similar to BLAST, while SSEARCH uses the Smith-Waterman algorithm. Documentation: http://www22.ncifcrf.gov/app/html/SeqAnaly/fasta/fasta_doc.html
HMMER
Protein sequence analysis using hidden Markov models. Documentation: http://hmmer.wustl.edu/
PHYLIP
A collection of software for inferring phylogenies. Documentation: http://evolution.genetics.washington.edu/phylip.html
SEQALN
A collection of software based on a library of functions to align nucleotide and protein sequences using Smith-Waterman. Documentation: http://hto-13.usc.edu/software/seqaln/
Currently, about 10 GB of nucleotide and protein sequence data is maintained on ARSC systems and updated on a monthly basis. All data is additionally formatted for use by BLAST. The current base files are:
- ecoli.aa: E. coli genomic CDS translations (peptides).
- ecoli.nt: E. coli genomic nucleotide sequences.
- human.nr: All sequences from nr that have [Homo sapiens] in the comment field.
- mouse.nr: All sequences from nr that have [Mus musculus] in the comment field.
- nr: Non-redundant GenBank CDS translations+PDB+SwissProt+PIR (peptides).
- nt (dna): Non-redundant GenBank+EMBL+DDBJ+PDB nucleotide sequences (no EST, STS, GSS, or HTGS).
- yeast.aa: Yeast (Saccharomyces cerevisiae) protein sequences.
- yeast.nt: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences.
LoadLeveler, Who Am I?
MPI, PVM, SHMEM, and OpenMP all have mechanisms that allow a given task (or thread) to determine:
- how many tasks there are in total and,
- its own identity.
The same information can be obtained for processes started on different nodes of an IBM SP by poe ("parallel operating environment"), in LoadLeveler.
This can be used to parallelize some naturally parallel jobs by simply having poe launch multiple copies of a serial or threaded program, and having each such process read and crunch a different input file.
How can a process determine how many nodes there are, and which one it is?
One way is to read the LoadLeveler environment variable LOADL_PROCESSOR_LIST to get the complete list of nodes (by name), and to run the Unix "hostname" command on each node to get that node's own name. The position of the node's name in the list is its node number.
What follows is a perl subroutine, "getNodeInfo", which implements this idea, contained in a test perl script.
"getNodeInfo" sets four file-scoped variables, which the test script prints under the following names:
- LL_NUM_NODES
- LL_MY_NODE_NUM
- LL_MY_PROCESSOR_NAME
- LL_PROCESSOR_LIST
There's one complication (which may not apply on your cluster): LOADL_PROCESSOR_LIST contains switch names rather than node names. On ARSC's SP, a simple "tr" converts these names, as shown in the script:
#!/usr/local/bin/perl -w
my ($myHost, $myNode, $hostList, $nNodes);

&getNodeInfo;

# Print the results as shell-style assignments.
print "LL_NUM_NODES=$nNodes;\n";
print "LL_MY_NODE_NUM=$myNode;\n";
print "LL_MY_PROCESSOR_NAME=$myHost;\n";
print "LL_PROCESSOR_LIST=$hostList;\n";
# print "export LL_NUM_NODES LL_MY_NODE_NUM LL_MY_PROCESSOR_NAME;\n";

######################################################################
sub getNodeInfo () {
  my ($n, @procs);

  # For testing: --------------------------------------------
  # $hostList = "i1s1 i1s11 i1s16 i1s2 i1s7 i1s8 i2s18 i2s19 i2s21 i2s30";
  # $myHost = "i1n1";

  # For real: --------------------------------------------
  # Fall back to an empty list if the job is not running under LoadLeveler.
  $hostList = defined $ENV{LOADL_PROCESSOR_LIST} ? $ENV{LOADL_PROCESSOR_LIST} : "";
  $myHost = `hostname`;
  chomp $myHost;

  # split ' ' ignores leading whitespace and treats any run of
  # whitespace as a single separator, so no empty names are counted.
  @procs = split (' ', $hostList);

  $n = 0;
  $myNode = -1;
  foreach (@procs) {
    if ($myHost eq $_) {
      $myNode = $n;
    }
    else {
      #
      # Needed because "node" names in "LOADL_PROCESSOR_LIST"
      # appear as their corresponding "switch" names, which
      # are identical, except with the "n" translated into an "s".
      # Note that tr changes every "s" in the name, which is fine
      # for names like "i1s5" but not for arbitrary host names.
      #
      tr/s/n/;
      if ($myHost eq $_) {
        $myNode = $n;
      }
    }
    $n++ ;
  }
  $nNodes = $n;

  if ($myNode == -1) {
    print STDERR "ERR: $0 hostname $myHost not found in list $hostList\n";
  }
}
Here's a LoadLeveler script to test this on five nodes of icehawk:
#!/bin/ksh
#
# @ output = $(Executable).$(Cluster).$(Process).out
# @ error = $(Executable).$(Cluster).$(Process).err
# @ notification = never
# @ wall_clock_limit=60
# @ job_type = parallel
# @ node = 5
# @ tasks_per_node = 1
# @ network.mpi = css0,not_shared,US
# @ class = large
# @ node_usage = not_shared
# # @ node_usage = shared
# @ queue

export POE=/usr/bin/poe
cd /u1/uaf/baring/LoadLeveler

echo "hostnames: "
$POE hostname

echo "llNodeNum.prl: "
$POE ./llNodeNum.prl

And here's output from a run of this LoadLeveler script:
ICEHAWK2$ cat test_llNodeNum.ll.4628.0.out
hostnames: 
i1n5
i2n20
i2n23
i3n40
i3n44
llNodeNum.prl: 
LL_NUM_NODES=5;
LL_MY_NODE_NUM=2;
LL_MY_PROCESSOR_NAME=i2n23;
LL_PROCESSOR_LIST=i1s5 i3s44 i2s23 i3s40 i2s20 ;
LL_NUM_NODES=5;
LL_MY_NODE_NUM=3;
LL_MY_PROCESSOR_NAME=i3n40;
LL_PROCESSOR_LIST=i1s5 i3s44 i2s23 i3s40 i2s20 ;
LL_NUM_NODES=5;
LL_MY_NODE_NUM=1;
LL_MY_PROCESSOR_NAME=i3n44;
LL_PROCESSOR_LIST=i1s5 i3s44 i2s23 i3s40 i2s20 ;
LL_NUM_NODES=5;
LL_MY_NODE_NUM=0;
LL_MY_PROCESSOR_NAME=i1n5;
LL_PROCESSOR_LIST=i1s5 i3s44 i2s23 i3s40 i2s20 ;
LL_NUM_NODES=5;
LL_MY_NODE_NUM=4;
LL_MY_PROCESSOR_NAME=i2n20;
LL_PROCESSOR_LIST=i1s5 i3s44 i2s23 i3s40 i2s20 ;
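To connect the node number back to the original goal (each copy of a serial program crunching a different input file), here is a minimal Python sketch of the same lookup, followed by one possible way to pick an input file from it. The function names and the "input.NNN.dat" naming pattern are made up for this illustration, and the "s"-to-"n" translation is the same ARSC-specific assumption used in the perl script; your cluster may differ.

```python
import os
import socket


def node_index(my_host, processor_list):
    """Return (my node number, total nodes); node number is -1 if not found."""
    names = processor_list.split()  # tolerates extra or trailing blanks
    for i, name in enumerate(names):
        # Assumption from the article: a switch name is the node name
        # with "n" replaced by "s", so translate it back before comparing.
        if my_host in (name, name.replace("s", "n")):
            return i, len(names)
    return -1, len(names)


def input_file_for(index, pattern="input.{:03d}.dat"):
    """Hypothetical naming scheme: one input file per node number."""
    return pattern.format(index)


if __name__ == "__main__":
    hosts = os.environ.get("LOADL_PROCESSOR_LIST", "")
    me, total = node_index(socket.gethostname(), hosts)
    if me < 0:
        print("host not found in LOADL_PROCESSOR_LIST; nothing to do")
    else:
        print("node %d of %d would read %s" % (me, total, input_file_for(me)))
```

With the list from the run above, node_index("i2n23", "i1s5 i3s44 i2s23 i3s40 i2s20 ") gives node number 2 of 5, matching the LL_MY_NODE_NUM=2 line printed for i2n23.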
Let us know if you make use of this, and we'll have another article for the newsletter!
UAF / ARSC Courses
The following for-credit UAF courses are taught with reliance on ARSC hardware and sponsorship from ARSC.
ART 472 Visualization and Animation Bill Brody
An introduction to visualization and animation with applications in fine and commercial art and science. Students will produce a series of three dimensional animation projects which will introduce them to the tools and concepts used by animation and visualization professionals.
PHYS 693 Concepts in Parallel Scientific Computation Guy Robinson
This course will introduce the concepts of parallel scientific computation. Primarily in support of graduate students in the physical sciences, this course is designed for students with research interests requiring the application of parallel computation techniques for specific science applications. Topics will include the basics of problem decomposition, and how to identify the necessary communication, with particular attention to scalability and portability of the algorithm. Techniques to assess the reliability, stability and validity of large-scale scientific computations will also be covered. After successful completion of the course, students will be able to solve scientific problems on the parallel computers commonly found in the modern research environment.
BIOL F693/ CHEM F693 Special Topics in Bioinformatics Nat Goodman
This 2-credit course, brought to you by the Institute of Arctic Biology, the UAF Chemistry department, the Arctic Region Supercomputing Center, and the UAF Biology department, will introduce the concepts of bioinformatics through lectures, presentations, and student projects.
Topics:
- Sequence comparison and database search: Smith-Waterman, FASTA, BLAST and similar methods.
- Advanced sequence analysis methods: Hidden Markov Models, profile search, etc.
- Microarray analysis: statistical issues, clustering, and other numeric methods.
- Systems biology: pathway modeling and simulation, inference of regulatory networks.
ARSC Training, Fall 2002
All classes will be held at UAF in Butrovich Bldg rm. 007 starting at 2pm.
September 25th:
User's Introduction to ARSC Supercomputers.
- Introduction to ARSC's supercomputers
  - Architectures and capabilities of the Cray SV1ex, Cray SX-6, Cray T3E, IBM SP cluster, IBM Regatta, and Linux cluster.
  - Programming models
- Programming Environments:
  - Compilers
  - Debuggers
  - Performance analysis tools
- Running jobs
  - Interactive and batch
  - Submitting batch jobs
  - Checking job status
Introduction to using the ARSC IBM SP and P690 (Regatta)
This course is an introduction to the ARSC IBM systems, icehawk and iceflyer.
Topics to be covered include:
- Architecture and storage overview
- Compilers and options
- LoadLeveler
- Performance monitoring/profiling
- Mixed-mode programming (MPI and OpenMP)
- A few words on writing portable makefiles and code
The class is intended for those who already have some computing experience and are interested in running codes on the ARSC IBM systems. After attending this class, users will be able to determine which of the two ARSC IBM systems best suits their needs, and will know how to develop, optimize, and debug codes in the ARSC IBM environment.
More details and registration will be available presently. Watch the "Hot-Topics" on #250
AAAS Arctic Division Meeting, Next Week at UAF
The American Association for the Advancement of Science (AAAS) Arctic Division 2002 Meeting starts next Wednesday at UAF. For details, see:
http://arctic.aaas.org/meetings/2002/
Several ARSC users and affiliates are presenting research or otherwise involved.
Quick-Tip Q & A
A:[[ What's an "ulp"--a typo? a word? an acronym? I noticed it in issue
[[ #250. Should I care?
From the glossary of CrayDoc manual: "Cray T3E(TM) Fortran Optimization
Guide":
ulp
Unit of least precision. It is used to discuss the accuracy of
floating-point data. It represents the minimum step between
representable values near a desired number: the true result to
infinite precision, assuming the argument is exact. For instance,
the ulp of 1.977436582E+22 is 1.0E+13, since the least significant
digit of the mantissa is in the 10^13 place. Within 0.5 ulp is the
best approximation representable.
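As a rough illustration of the glossary entry, the Python sketch below computes a decimal ulp for the example quoted above, and a binary ulp for an IEEE 64-bit double. The function names are invented for this illustration, and the binary version assumes a finite, nonzero, normalized value.

```python
import math
from decimal import Decimal


def decimal_ulp(literal):
    """Ulp of a decimal literal: one unit in its last written digit."""
    exponent = Decimal(literal).as_tuple().exponent
    return Decimal(1).scaleb(exponent)


def binary_ulp(x):
    """Spacing of IEEE 64-bit doubles at x (finite, nonzero, normal x)."""
    _, exponent = math.frexp(x)            # x = m * 2**exponent, 0.5 <= |m| < 1
    return math.ldexp(1.0, exponent - 53)  # doubles carry 53 significand bits


print(decimal_ulp("1.977436582E+22"))  # prints 1E+13
print(binary_ulp(1.0))                 # prints 2.220446049250313e-16
```

The first line reproduces the glossary's figure: the last written digit of 1.977436582E+22 sits in the 10^13 place. The second shows the same idea in binary, where the step between doubles near 1.0 is 2^-52.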
Thanks to those who pointed out that "ulp" didn't actually appear in
issue #250. IBM's vector intrinsics documentation, referenced in that
issue, has it. Sorry for the confusion... Here's the relevant section
-- "ulps" appears after the table:
Mathematical Acceleration SubSystem (MASS)
MASS Version 2.7
[...]
The following table provides sample accuracy data for the libx,
libmass, libmassv, and libmassvp3 libraries. The numbers are based on
the results for 10,000 random arguments chosen in the specified
ranges. Real*16 functions were used to compute the errors. There may
be portions of the valid input argument range for which accuracy is
not as good as illustrated in the table. Also, the user may
experience accuracy which varies from the table when argument values
are used which are not represented in the table.
The Percent Correctly Rounded (PCR) column elements are obtained by
counting the number of correctly rounded results out of the 10,000
random argument cases. A result is correctly rounded if the function
returns the IEEE 64 bit value which is closest to the
infinite-precision exact result.
                   Math Library Accuracy

                      libm           libmass         libmassv
function  range    PCR    MaxE     PCR    MaxE     PCR    MaxE
exp       D       99.95    .50    96.55    .63    96.58    .63
sexp      E      100.00    .50   100.00    .50    98.87    .52
sin       B       81.31    .91    96.88    .80    97.28    .72
sin       D       86.03    .94    83.88   1.36    83.85   1.27
tan       D       99.58    .53    64.51   2.35    50.48   3.19
[ ... 25 functions cut, for brevity ... ]

* indicates hardware instruction was used

Range Key              PCR  = Percentage correctly rounded
A =    0,   1          MaxE = Maximum observed error in ulps
B =   -1,   1
C =    0, 100
D = -100, 100
E =  -10,  10
[...]
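For readers who want a feel for how PCR and MaxE figures like these come about, here is a small Python sketch that measures both for Python's own math.exp over range D. It is only an illustration under stated assumptions, not IBM's test procedure: the decimal module at 60 digits stands in for the Real*16 reference, the sample size and seed are arbitrary, and the function names are made up. A reference rounded to 60 digits could in principle misjudge a result that lies extremely close to a rounding boundary.

```python
import math
import random
from decimal import Decimal, localcontext


def ulp_of(x):
    """Spacing of IEEE 64-bit doubles at x (finite, nonzero, normal x)."""
    _, exponent = math.frexp(x)
    return math.ldexp(1.0, exponent - 53)


def error_in_ulps(computed, exact):
    """|computed - exact|, measured in ulps at the exact value."""
    with localcontext() as ctx:
        ctx.prec = 60
        return abs(Decimal(computed) - exact) / Decimal(ulp_of(float(exact)))


def accuracy(func, exact_func, lo, hi, samples=1000, seed=0):
    """Return (percent correctly rounded, maximum observed error in ulps)."""
    rng = random.Random(seed)
    correct, max_err = 0, Decimal(0)
    with localcontext() as ctx:
        ctx.prec = 60                         # high-precision reference
        for _ in range(samples):
            x = rng.uniform(lo, hi)
            exact = exact_func(Decimal(x))    # Decimal(x) converts x exactly
            computed = func(x)
            if computed == float(exact):      # nearest double to the reference
                correct += 1
            max_err = max(max_err, error_in_ulps(computed, exact))
    return 100.0 * correct / samples, float(max_err)


pcr, max_e = accuracy(math.exp, lambda d: d.exp(), -100.0, 100.0)
print("exp  D  PCR=%.2f  MaxE=%.2f" % (pcr, max_e))
```

The numbers printed depend on the math library behind your Python build, so they should not be expected to match any column of the table.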
Q: When I run my MPI program, some tasks start spitting error messages,
which get all mixed up together, and then it stops.
I'd like to know which message comes from which task, and, sure, I
could fix the code so every message is prefaced with the task number,
but I'd like an easier way. Do you know one?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
