ARSC HPC Users' Newsletter 358, March 23, 2007

Porting and Migration--Part I: Data

[ By: Lee Higbie ]

Porting codes as technologies are modernized is part of the life of a computational scientist. This is the first of a few articles describing some of the things you should do or can do to facilitate porting. In this article I will concentrate on data portability issues.

First some background on data representations.

Machines now use broadly similar floating point data representations. Nearly all use the IEEE format, which specifies the number of bits in the exponent (range) and mantissa (accuracy) of the various floating point data types. The IEEE standard does not, however, specify the data layout in memory. Fortunately there are only two common byte orderings (big-endian and little-endian), but two is one too many: if you write binary data on one machine and read it on another, you are asking for trouble. That is the raison d'etre of this article.

In addition to the arrangement of bits in memory, different compilers use different default sizes for variables. Even when the bit-lengths are specified explicitly, portability issues can remain.
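In Fortran, for example, you can pin down precision and range portably by selecting kinds explicitly rather than relying on compiler defaults. A minimal sketch (the kind names dp and i8 are our own):

      ! Ask for at least 15 decimal digits and a decimal exponent
      ! range of 307: IEEE double precision on most machines.
      integer, parameter :: dp = selected_real_kind(15, 307)
      ! Ask for an integer kind holding at least 18 decimal digits.
      integer, parameter :: i8 = selected_int_kind(18)

      real(kind=dp)    :: bigReal
      integer(kind=i8) :: bigInt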

Fortran Data-Portability Suggestions

Mixing data types in COMMON

COMMON data are stored and accessed by location. If the storage length of any variable changes, every variable after it will be accessed incorrectly.

Solutions:
  1. Convert COMMONs to MODULEs, which are accessed by name (MODULEs are the Fortran-90 replacement for COMMONs; see the sketch after this list).

  2. If you can't convert to Fortran 90, separate data in COMMON blocks so that each block contains only data of a single, uniform type. For example, suppose you have the following two COMMON statements in two different functions or subroutines (and assume the variables are typed as their names imply).

    
      In subr1: COMMON /first/ int1, string, realArray
      In fn2:   COMMON /first/ int1, int2, realArray
    

    What mistakes can we find here? First, the COMMONs mix data types. Second, the data type of the second variable varies from subr1 to fn2. Third, the COMMON statements are not syntactically identical.

    If the subr1 COMMON uses a short string (to describe realArray, say) that the fn2 COMMON doesn't use, or the first COMMON is used twice but compiled separately with some obscure (are there any other kind?) compiler option, then references to realArray may be offset (just as they are typographically above).

    If you must use COMMONs, separate types:

    
      COMMON /firstInts/ int1, int2
      COMMON /firstReals/ realArray
    
    and use identical statements in all routines.
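
As a sketch of the first solution, the same shared data could live in a module (the module name and array size here are ours, for illustration):

      MODULE first_data
        IMPLICIT NONE
        INTEGER :: int1, int2
        REAL    :: realArray(100)
      END MODULE first_data

Each routine then says USE first_data instead of declaring a COMMON. Because module variables are matched by name, not by storage offset, a change in one routine's declarations cannot silently shift another's.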
Implicit data typing

The second statement of every program, function, or subprogram should be IMPLICIT NONE, and to be certain of typing, you should tell the compiler to require it. There are many famous stories of programs that failed in spectacular ways when implicit typing hid a typo. My favorite is the Venus-bound spacecraft that was lost because DO 10 i=1,10 was typed with a period instead of a comma. Don't let your career be blasted into space because of mistyped code.
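
To see how implicit typing hides this class of typo, consider the contrived sketch below (not the actual flight code), which compiles cleanly without IMPLICIT NONE:

      program implicit_trap
      ! Blanks are insignificant in fixed-form Fortran, so the next
      ! line is NOT a loop: it parses as the assignment DO10i = 1.10,
      ! implicitly creating a new REAL variable named DO10i.
      DO 10 i = 1.10
   10 continue
      end

With IMPLICIT NONE in place, the compiler rejects DO10i as undeclared and the typo is caught at compile time.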

To enforce typing: for the IBM xlf compiler, use -u; for the PGI compiler on Nelchina, use -Mdclchk. I haven't found a comparable flag for the PathScale compiler on Midnight.

Data-portability Suggestions For Any Language

Binary reads and writes of permanent files

The probability of being able to read a binary data file on a new machine is low. Even if the internal data representations are identical, you have no guarantee that the external representations (in files) are. It is reasonable and proper to use binary I/O for temporary and emergency restart files, to take advantage of its lower computational cost. That's all.

Binary data files

You have almost no way of knowing how data are represented externally. We recommend netCDF. Kate Hedstrom has contributed several Newsletter articles on netCDF and friends, and there are many other sources of information.

Basically, netCDF adds an ASCII (plain text) header to a binary file so it can assure portability, at least for a decade or two. This means you only pay the speed penalty of non-binary I/O for the small header and not for the large dataset.

If the amount of data is small, formatted I/O is a practical alternative. The formatting changes the data into ASCII or plain text, which also makes it readable with tail, more, an editor, etc.
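
As a hedged sketch, writing a small array as formatted text in Fortran might look like this (the unit number, file name, and format are arbitrary choices of ours):

      real :: x(5)
      x = (/ 1.0, 2.0, 3.0, 4.0, 5.0 /)
      open(unit=11, file='small_data.txt', form='formatted')
      ! es16.8 keeps 9 significant digits, enough to round-trip
      ! IEEE single precision through the text file.
      write(11, '(5es16.8)') x
      close(11)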

There are several reasons to use netCDF for external data files (a minimal write example follows this list):

  • Its headers provide some of the metadata that current ISO and DoD standards require.
  • It's portable.
  • It provides fast I/O.
  • There are many utilities that allow you to look at large datasets. Quick & dirty visualization tools, for example, allow you to easily display plots, maps, etc of the data.
  • It provides easy access to subsets of arrays.
  • It is the external format in use or being adopted for weather data of all types.
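
To make the list concrete, here is a minimal sketch of writing a 1-D array with the netCDF Fortran-77 interface (the file, dimension, and variable names are ours; most error checking is omitted):

      program write_nc
      implicit none
      include 'netcdf.inc'
      integer, parameter :: n = 100
      real :: temp(n)
      integer :: ncid, dimid, varid, status

      temp = 0.0                    ! fill with real data in practice

      status = nf_create('temp.nc', NF_CLOBBER, ncid)
      if (status .ne. NF_NOERR) stop 'nf_create failed'
      status = nf_def_dim(ncid, 'x', n, dimid)
      status = nf_def_var(ncid, 'temperature', NF_REAL, 1, dimid, varid)
      status = nf_enddef(ncid)
      status = nf_put_var_real(ncid, varid, temp)
      status = nf_close(ncid)
      end program write_nc

The ncdump utility will then display both the header and the data as plain text.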

One last note about permanent files. Files that have been archived will be saved forever, but that doesn't guarantee that you can read them next year. If they are binary files, you will have to provide information about the file format for anyone to have a chance of helping you recover the data. It is easy for me to save a Linear A or Sanskrit document, and for any of us, either is easier to read than an unknown-format binary file. In practice, both are info-free works of art.

Great Alaska Weather Modeling Symposium, Recap

[ By: Greg Newby ]

A symposium sponsored by ARSC and others was held on the UAF campus March 12-14. The focus of the symposium was methods and outcomes for computational weather modeling for Alaska and surrounding areas. Symposium participants came together from across Alaska and the rest of the US, for nearly three days of presentations and discussion.

One of the most exciting outcomes was information sharing among the three distinct groups of participants: weather practitioners, modelers, and researchers.

Despite many interests in common, these three groups have limited opportunity for in-depth discussion. This is because weather practitioners (with representatives including the Fairbanks and Anchorage National Weather Service forecasting offices) are very focused on daily forecasts and other high-visibility work. Modelers, and those who develop models (with representatives including UAF academic members, as well as the major developers of the Weather Research and Forecasting model, WRF), face often-competing demands from different user groups, on top of the computational challenges of getting their models functional, validated, and accepted. Meanwhile, researchers (with representatives from UAF and elsewhere) can often spot deficiencies in the approaches of models and practitioners, but might not share their need for time-sensitive model output and forecasts.

Crowding all three groups in the room together, along with several other users of weather model output, created wonderful interaction and cross-fertilization. Attendees learned about a variety of topics, including:

  • The status and future of the WRF model, and allied efforts including WRF/Chem (a model for chemical processes, often used to forecast air quality)
  • Application areas utilizing model output, including fire modeling and snow depth estimation (the classic Alaska "fire and ice" duo!)
  • Phenomena related to weather, including sea ice, ocean currents, and permafrost
  • The long-term role of weather modeling in climate change

The event included daily detailed weather briefings from National Weather Service staff members. These offered an opportunity for lively questions and answers about how Alaska weather patterns behave, and the challenges these patterns produce for models and forecasters.

The International Polar Year has just commenced, which made the timing of this symposium particularly appropriate. The role of weather and weather models is changing, along with global climate patterns. Symposium participants recognized that today's models might need to be adjusted for the future, and were also aware of the special challenges that Alaska presents (such as dramatic topography and vast areas, but sparse weather monitoring stations), even without the prospect of climate change.

The conference presentations will be archived at http://weather.arsc.edu/ (which is also the home of ARSC's twice-daily WRF output covering the Alaska domain). ARSC's support for computational science for investigation of weather activity extends to its intern and post-doctoral fellows programs, to related disciplines such as fire and climate modeling, and will be ongoing.

CVS Setup for Projects at ARSC

[ By: Anton Kulchitsky and Sergei Maurits ]

Extended groups of developers of a large project benefit tremendously from version control systems. Version control systems are used for two primary purposes: record keeping and collaboration. They are capable of tracking all modifications in the codes made by different contributors with thorough documentation of all changes.

They allow different branches of the project to develop in parallel and support merging of different versions. Even a project with a single developer can benefit from such a tool, as it can assist with documenting and tracking all modifications and with working on several versions of the software concurrently. See the HPC Users' Newsletter issue 278 Quick-Tip for more encouraging information.

There are many version control systems available now. Some of them, like CVS and Subversion, use a central repository where the project is saved together with information about all its modifications; these are known as server-client systems. Others, like darcs, do not require a centralized repository; they are known as distributed systems. Comparisons of different version control software are available at Wikipedia.

For the server-client systems, the server facility must have sufficient capacity, be accessible to all contributors, be regularly backed up, and be connected to a fast network. At ARSC, two storage systems, nanook and seawolf, fit these requirements.

Below are instructions on how to start using CVS at ARSC. Subversion setup should be similar.

  1. By request of the project PI, ARSC consultants can create a project directory in the nanook partition, /projects, e.g., for the project "MY_PROJECT," the directory would be /projects/MY_PROJECT. Its owner will be the project PI and its group, the project group.

    Only members of the project group will be able to work with the repository.

  2. The access permissions on the repository directory should be set to "drwxrws---" (note the "s" for "setgid bit"). This ensures that when project members add files or directories, they will be assigned the correct group. Failure to do this will cause problems if the primary group of any project member is anything other than the project group.

    Assuming you have a project directory, "/projects/MY_PROJECT," and a repository directory, "/projects/MY_PROJECT/repo," the following command will set the setgid bit on the directory:

    
        chmod 2770 /projects/MY_PROJECT/repo
    

    ** WARNING: It's against ARSC's security policy to set the SGID
    ** bit on any file. You can set it on a directory as described
    ** here, but don't set it on anything else.

  3. If you have an existing repository and are transferring it to a new project directory, then ownership and permissions must be set properly. To do this, you should:

    1. copy your previous repository to the project directory
    2. change ownership of the whole "repo" directory tree:

      chown -R project_leader:project_group /projects/MY_PROJECT/repo

    3. set the proper permissions on all *directories* in the tree (you do not have to change permissions on files):

      find /projects/MY_PROJECT/repo -type d -exec chmod 2770 {} \;

  4. To be able to use the repository, you must set the CVSROOT environment variable to point to it. On the ARSC network of Linux boxes, the nanook projects directories are mounted on the local machines. Thus, continuing the example, you would make this setting:

    
      [ksh/bash users:]
        export CVSROOT=/projects/MY_PROJECT/repo
    
      [csh users:]
        setenv CVSROOT /projects/MY_PROJECT/repo
    

    This command can be placed in the appropriate shell initialization file.

  5. On the remote machine you will need to use kerberized ssh:

    
      [ksh/bash users:]
        export CVS_RSH=/usr/local/bin/ssh
        export CVSROOT=:ext:nanook.arsc.edu:/projects/MY_PROJECT/repo
    
      [csh users:]
        setenv CVS_RSH /usr/local/bin/ssh
        setenv CVSROOT :ext:nanook.arsc.edu:/projects/MY_PROJECT/repo
    

The above steps should complete the ARSC-specific CVS (or Subversion) configuration. To learn how to use CVS, please check the appropriate documentation. You may also stay tuned to this Newsletter for follow-up articles.
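
As a quick, hedged example of everyday use (the directory names and log messages here are ours), importing an existing source tree and then working from a checkout looks like this:

    # Put an existing source tree under CVS control
    cd my_project_source
    cvs import -m "Initial import" MY_PROJECT vendor start

    # Check out a private working copy, edit, and commit
    cd ..
    cvs checkout MY_PROJECT
    cd MY_PROJECT
    cvs commit -m "Describe your change here"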

Quick-Tip Q & A


A:[[ A utility writes some text, most of which is junk.  There's a
  [[ pattern repeated here and there within the text.  I want a simple
  [[ filter which prints every instance of the pattern.  E.g., from the
  [[ following output of "prog," it should only print the 5-digit numbers:
  [[ 
  [[  $ /usr/local/bin/prog
  [[  Attachments: 31613:  (text/plain / 110b), 31614:  (text/plain / 1k),
  [[               31615:  (text/plain / 2b), 31616:  (text/plain / 0b),
  [[               31619:  (multipart/mixed / 0b),
  [[               31620:  (text/plain / 25b),
  [[               31621:  (multipart/mixed / 702b), 31622:  (text/plain / 0b)
  [[  
  [[ In my attempt to solve the problem I use "tr" to remove all linefeeds.
  [[ Then with perl, I do a global, non-greedy match (".*?") of everything
  [[ prior to the interesting pattern and another non-greedy match of
  [[ everything after the pattern, and then replace everything matched with
  [[ just the interesting part.
  [[ 
  [[ This approach works perfectly... except... as you can see, it doesn't
  [[ delete the final unwanted text:
  [[ 
  [[   $ /usr/local/bin/prog | tr -d '\n' |
  [[       perl -pe "s/.*?(\d+):.*?/\1 /g"
  [[   31613 31614 31615 31616 31619 31620 31621 31622   (text/plain / 0b)
  [[ 
  [[ Yes, I know, I could pipe this through another filter to delete the
  [[ remaining unwanted text.  But this problem is so darn simple, I'm
  [[ frustrated I can't solve it with one regular expression.  Any ideas?



  #
  # From one editor... 
  # 

  With perl's "zero-width positive look-ahead assertion" you can
  force the final non-greedy match to grab everything up to the end
  of the string.

     perl -pe "s/.*?(\d+):.*?(?=(\d+:
$))/\1 /g


  Here's an easy example of look-ahead: 

    $ echo "the time is come the walrus said" 
 perl -pe "s/the/MY/g"           
      MY time is come MY walrus said

    $ echo "the time is come the walrus said" 
 perl -pe "s/the(?= walrus)/MY/g"
      the time is come MY walrus said



  #
  # Dale Clark gives a different perl solution:
  #

    helios> /usr/local/bin/prog | \
      perl -ne 'print map "$_ ",/\d{5}/g; END { print "\n" }'
    31613 31614 31615 31616 31619 31620 31621 31622

  The '-ne' flags direct Perl to execute the following code for each line
  of input, but with [n]o automatic print (in contrast to '-pe'). The code
  itself just greps for all 5-digit sequences, then appends a space to
  each via the map function. The END block supplies code to run once the
  input is exhausted.


  #
  # The other editor has another idea...
  #

  You can use a list and a regular expression to save all of the numbers
  in one operation.

    /usr/local/bin/prog | tr -d '\n' | \
      perl -e 'while(<>){@w=($_ =~/(\d+):/g);print "@w \n";}'

  Here's another version which doesn't require the tr -d '\n' operation.
  
    /usr/local/bin/prog | \
       perl -e 'while(<>){@w=($_ =~/(\d+):/g);print "@w ";} print "\n";'
    31613 31614 31615 31616 31619 31620 31621 31622


  #
  # Martin Luthi wins the easiest answer prize for "grep -o" ...
  #

  Within the shell you could use the repetition syntax \{N\} of grep:

    /usr/local/bin/prog | grep -o "[0-9]\{5\}"


  Of course I would do it in Python, because I am able to understand it
  even after years, and it is trivial to extend and adapt:

    for line in file('/home/tinu/del'):
        for x in line.split():
            if x[:5].isdigit():
                print x[:5]

  There are, of course, terser variants, but with the cost of being less
  readable.




Q: Sometimes when I "cp" a file, it changes the group ownership.  What's 
   up with that?  

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions and Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.