ARSC HPC Users' Newsletter 404, June 12, 2009

Introduction to Git

[ By Kate Hedstrom ]

A colleague and I used to argue the merits of RCS vs. SCCS for maintaining source code. We then migrated to CVS, then to SVN, and currently have an SVN site at Rutgers. Each transition was a step for the better. But what about the future? Does SVN have any shortcomings that have been bugging you?

I'll tell you about one that's been bugging me. My colleague and I both have write permission on our SVN server, he on the trunk, I on a branch. However, most of the people downloading our code see it as a read-only site. This is great for software (such as svn itself) in which the average user is not expected to change anything or contribute patches. However, we are dealing with a complex ocean model in which the average user will at least be changing a few files to set cpp choices. The above-average user might be changing quite a few files, adding new capabilities, and so on. If these people want to save their own revision history, they can set up a private svn repository, but any given sandbox directory can have only one parent repository - either their own or the Rutgers one. If they aren't saving their own changes, they are subject to all kinds of unpleasant surprises whenever they do "svn update" from us. Even I keep two directories going, one pointing to the trunk and the other pointing to my branch.

So how to solve this problem? Linus Torvalds got so fed-up with his options for developing the Linux kernel that he wrote a new distributed version control system, called Git. It is available from http://git-scm.com/, including a documentation page pointing to all sorts of available resources. There's even a YouTube video of Linus himself ranting about these issues. The various tutorials listed are perhaps more useful, plus I am enjoying "Pragmatic Version Control Using Git" by Travis Swicegood. For the O'Reilly fans out there, a new bat book is in the works, called "Version Control with Git" by Jon Loeliger.

How does Git solve your problems, you ask? Both CVS and SVN have centralized repositories from which one or many people can check out the code. Git has a new-to-me concept of distributed repositories, something I'm sure I still don't entirely comprehend. The repository itself is a binary database in which the files are stored compressed. Each time you do a checkout (git clone) from someone, you obtain a copy of the entire repository, giving you the entire history right there at your fingertips. Git was designed to be fast, since the Linux kernel is quite large. The Linux kernel also has many, many people working on it in a cooperative manner; Git was designed to help them rather than hinder them. SVN advertises that it makes branching easy - Git advertises that it makes both branching and merging easy.
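
For example, here is a quick sketch of what a clone looks like in practice (the server URL is made up):

   # Copy the entire repository, history and all, from a (hypothetical) server:
   git clone git://example.org/ocean_model.git
   cd ocean_model

   # These all work offline, against your local copy of the repository:
   git log              # full revision history
   git log --stat       # history with per-file change summaries
   git branch -a        # branches known to this clone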

For those of us with one foot still in the past, Git provides interface tools for both SVN and CVS repositories. In other words, my local Git sandbox can be pointed to an SVN server and do uploads and downloads using the SVN protocol. It can then be pointed to a different SVN server - from the same sandbox! Magic!
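
For instance, here is a minimal sketch of the SVN bridge using the git-svn commands (the repository URL and directory name are hypothetical):

   # Make a local Git repository that tracks an existing SVN repository:
   git svn clone https://svn.example.edu/repos/ocean/trunk ocean

   cd ocean
   # ... edit and "git commit" locally as often as you like ...

   git svn rebase     # fetch new SVN revisions and replay local commits on top
   git svn dcommit    # send local Git commits back to the SVN server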

How can two people use Git to work on the same project? Each person would have their own Git sandbox with the entire repository in it. If person A makes a change, person B could do a "git fetch" or "git pull" to get the changes. Person B then adjusts those changes and makes some new ones. Person A then does a "git pull" from B's repository. These could be simply files on the same machine on which they both have accounts, or there are ways to set up public read-only servers via http or the git protocol. Changes that don't work out can be kept in a branch, but never "pushed", or they can be made to disappear completely as long as you haven't yet shared them with anyone.
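
Here is a hedged sketch of that workflow, assuming both people have accounts on the same machine (the paths are made up):

   # Person B clones directly from person A's working directory:
   git clone /home/userA/ocean_model ocean_model
   cd ocean_model

   # Later, pick up and merge A's new commits into the current branch:
   git pull

   # ... or fetch without merging, look the changes over, then merge by hand:
   git fetch origin
   git log origin/master
   git merge origin/master

   # Person A can pull from B in turn by adding B's repository as a remote:
   git remote add userB /home/userB/ocean_model
   git pull userB master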

Finally, Git frees us from the "revision number" concept. It instead identifies each snapshot of the repository with a cryptographically sound SHA-1 hash, computed over the contents, author, timestamp, and history of that snapshot. An example SHA-1 is 152aee0b44a104b07d40f4401e5f1ea1ea2fe1b0, effectively guaranteed to be unique among all Git repositories.
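
For example, a few commands that show these identifiers (the hashes in your repository will of course differ):

   git log --pretty=oneline -5    # SHA-1 and subject of the last five commits
   git rev-parse HEAD             # the full 40-character SHA-1 of the current commit
   git show --stat HEAD           # what that SHA-1 names: author, date, files changed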

I'd like to thank Brian Powell for showing me the way of the Git.

Fortran Optimization on Pingo

[ By Lee Higbie ]

The following article is excerpted from a paper presented at CUG09, available here:

   Paper:  http://www.arsc.edu/files/arsc/news/HPCnews/misc/misc404/CompilerComparePaperCray2.pdf
   Slides: http://www.arsc.edu/files/arsc/news/HPCnews/misc/misc404/CUGSlides.pdf

The usual approach to comparing Fortran compilers has been to evaluate the execution speed of programs, which is usually the measure of interest. However, measurements of entire programs provide little insight for a person trying to optimize his or her code (and little useful information for compiler writers who want to improve their optimizers). Furthermore, the effects of compiler optimization options are sometimes surprising. To demonstrate and explore these effects, I created a program with 708 small code snippets that are individually timed. The graphs on the web site above show comparative performance for the four Pingo/Ognip compilers, both against themselves (*IntraComp.jpg) and against the PGI compiler (*InterComp.jpg).

Each graph is a semi-log plot of the execution times of a group of loops. In the intra-compiler graphs, negative values (points below the main part of the graph) show faster operation with high optimization, and positive values indicate that "optimization" produced slower-running code. For the inter-compiler comparisons, negative points indicate that the PGI-compiled code ran slower than the code from the other compiler; positive values indicate that PGI's executable was faster.

Looking at the intra-compiler graphs, it is apparent that most compilers did not produce much better code for these loops when asked to optimize the compilation (here "high optimization" means: for Cray, -O3 -O vector3 -O scalar3; for Gnu, -O5; for PathScale, -O3 -Ofast -ffast-math; for PGI, -O3 -fast). The one exception is the PathScale compiler; its yellow line is often well below the x axis.

It is also clear that the "high" optimization settings produced slower-running code much of the time for all four compilers. For the person trying to produce the fastest-running executable possible, this means that experimentation with optimization is necessary, at least for the hot-spot routines. Some of the specific loops that produced the largest timing swings are shown in CUGSlides.pdf at the web site mentioned above. All the loops are shown in the sourceCode directory.
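
Here is a minimal sketch of that kind of experiment on Pingo, assuming a hot-spot routine has been isolated into a small test program (the file names are hypothetical; the high-optimization flags are the PGI ones listed above):

   # Build the same hot-spot test with default and with high optimization,
   # then time both from within a PBS job (ftn is the Cray compiler wrapper):
   ftn           -o hotspot_default  hotspot_driver.f90 hotspot.f90
   ftn -O3 -fast -o hotspot_high     hotspot_driver.f90 hotspot.f90

   aprun -n 1 ./hotspot_default
   aprun -n 1 ./hotspot_high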

Because the PGI compiler is the default on Pingo, the other compilers were compared to it, using both default optimization and high optimization. On the graphs, the blue, red, and yellow lines are for default optimization and the green, rust, and light blue lines are for high optimization. To me, there is little basis for saying one compiler is better than another. On any specific loop, one compiler may produce faster-running code, but the overall performance looks like a toss-up. As before, if you want to produce the shortest possible run time, you should experiment with the different compilers.
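
To try the same loops under another compiler on Pingo, swap programming environments and rebuild; here is a sketch, assuming the default PGI environment is loaded and reusing the hypothetical files from the previous example:

   # Rebuild and time the same test under the PathScale compiler:
   module swap PrgEnv-pgi PrgEnv-pathscale
   ftn -O3 -Ofast -ffast-math -o hotspot_pathscale hotspot_driver.f90 hotspot.f90
   aprun -n 1 ./hotspot_pathscale

   # Switch back to the default environment when done:
   module swap PrgEnv-pathscale PrgEnv-pgi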

The CUGSlides.pdf file shows a few of the loops that produced the largest inter-compiler time swings and the sourceCode directory has all the code for each loop. The CUG paper, CompilerComparePaperCray2.pdf at the website, gives more details of the procedure.

New Software

Our software specialists have been hard at work installing the latest and greatest software packages across ARSC's machines. New software includes:

Pingo:


Package Name         Version
-----------------    --------
Silo                 4.6.2
MPSCP                1.3a
NetCDF               3.6.3
NAMD                 2.7b1
NWChem               5.1.1
HDF5                 1.8.1
Gaussian 03          E.01
MPT                  3.1.1
Cobalt               4.2.1
GEOS                 2.2.3
Python               2.6.2

Midnight:


Package Name         Version
-----------------    --------
MPSCP                1.3a
libdap               3.8.2
libnc-dap            3.7.3
cURL                 7.19.4
Boost                1.38.0
NCL                  5.1.0
GEOS                 2.2.3
Python               2.6.2
NetCDF               4.0.1
NWChem               5.1.1
PGI                  8.0.6

Linux Workstations:


Package Name         Version
-----------------    --------
Matlab               7.7.0
Comsol               3.5a
VisIt                1.11.2
ParaView             3.4.0
NCL                  5.1.0
Avizo                6.0
Blender              2.48a

For more information, type "news software" on the machine of interest.

Quick-Tip Q & A


A:[[ Is there a way to get the processor count for a PBS job on the Cray
 [[ XT5? On midnight I use something like this to get the processor count
 [[ for my PBS job:
 [[ 
 [[    NP=$( cat $PBS_NODEFILE | wc -l )
 [[    mpirun -np $NP ./a.out
 [[ 
 [[ When I tried this on pingo, $NP is always set to 1.  Why is that?  Is
 [[ there any way to get the value of "mppwidth" from my PBS script so I can
 [[ use that value with aprun?
 [[ 
 [[ e.g.
 [[    NP=???
 [[    aprun -n $NP ./a.out
 [[ 
 [[ Currently I'm creating a different script for each processor count (i.e.
 [[ mppwidth value) and it's driving me crazy!

#
# Don Bahls submitted the following solution, using qstat to gather the
# compute node count missing from $PBS_NODEFILE:
#

You can do this by running qstat on the job from within the script.  PBS
sets the $PBS_JOBID variable to the job id for the job.

e.g.
# bash syntax
NP=$( qstat -f $PBS_JOBID | grep mppwidth | awk -F '= ' '{print $2}' )

Here's a complete script:

#!/bin/bash
#PBS -l mppwidth=16
#PBS -l walltime=1:00
#PBS -j oe
#PBS -q standard

cd $PBS_O_WORKDIR

# run qstat on this job to read the mppwidth value.
#
NP=$( qstat -f $PBS_JOBID | grep mppwidth | awk -F '= ' '{print $2}' )

aprun -n $NP ./a.out


You can do something similar for mppdepth, etc., if you happen to use
those Cray options.
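
For example, here is a sketch for mppdepth (this assumes your job actually requests it with "-l mppdepth=..."):

# bash syntax
D=$( qstat -f $PBS_JOBID | grep mppdepth | awk -F '= ' '{print $2}' )
aprun -n $NP -d $D ./a.out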


Q: I just exceeded my $HOME directory quota.  I would like to move some
files to my $ARCHIVE_HOME directory to make room, but I'm not sure where
to start.  Is there some command I can use to help me free up space by
moving files I haven't used in a while?
 

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.