ARSC HPC Users' Newsletter 204, September 15, 2000

Exciting Times at ARSC: SV1 Up and Running

ARSC's new 32-processor Cray SV1 is undergoing internal configuration and testing. For a photo of this rare albino SV1, see:

http://www.arsc.edu/pubs/bulletins/SV1bulletin.html

ARSC USERS:

The downtime scheduled on yukon, chilkoot, /allsys, and /viztmp due to the SV1 install has changed. For schedule updates, type "news downtime" on any ARSC host, or click the "Next scheduled downtime..." link for your system of concern on the ARSC web site.

Introduction to ERDC and the Origin3000 for Local Folks

In HPC Newsletter 201 we announced the new partnership between ARSC, ERDC, and SGI to bring the first 512-processor single system image Origin3000 on-line. Brad Comes, Director of the ERDC DSRC, and Steve Jones, Chief Technologist there, will be visiting ARSC next week. At 1:30pm on Monday, Sept. 18, in Butrovich 109, they will give a joint presentation on ERDC and the O3000. ARSC users and other interested parties are welcome to attend.

Notes on porting SHMEM codes from the Cray T3E to the SGI Origin

[ Many thanks to Alan Wallcraft of NRL for this article. ]

  1. SGI does not currently support Co-Array Fortran (although it would be very easy for them to do so), but they do provide the SHMEM library for Origin systems. It is part of the MPT (Message Passing Toolkit) module, which you may need to explicitly load at login.

  2. The intro_shmem man page says that SHMEM compilation should use:

       cc  -64 c_program.c -lsma
       CC  -64 cplusplus_program.c -lsma
       f90 -64 -LANG:recursive=on fortran_program.f -lsma
       f77 -64 -LANG:recursive=on fortran_program.f -lsma

     If libsma isn't found, you probably need to load the MPT module.

  3. On the Origin, there must be enough free disk space in /var/tmp, or $TMPDIR, to memory map all static memory (on all PEs) when starting a SHMEM job.

  4. All SHMEM programs must start with CALL START_PES(npes), where npes can be a positive integer or 0. If npes is zero, the number of PEs is taken from the environment variable NPES. START_PES is also allowed on the T3E (although it has no effect there), so a portable program should include it. This and the next few points are illustrated in the first sketch following this list.

  5. On the T3E, SHMEM_PUT and SHMEM_GET always transfer 64-bit objects (i.e. default REAL or INTEGER objects). On the Origin, SHMEM_PUT and SHMEM_GET are not available, presumably because both INTEGER and REAL are 32-bit by default. A portable program should replace these with type-specific calls, such as SHMEM_REAL_GET and SHMEM_INTEGER_GET. On the Origin there is no need for the target and source array to be aligned in memory on a 64-bit boundary (as there is for efficient 32-bit get on the T3E).

  6. If your program relies on REAL or INTEGER being 64-bit, there are compiler switches on the Origin that will make this the default. Note however, that -i8 (default 64-bit INTEGER) is dangerous because most system calls (including SHMEM) expect 32-bit INTEGER arguments. The -r8 and either -d8 or -d16 switches are safer and may simplify porting of T3E codes requiring 64-bit REAL. I prefer using explicit REAL*4 or REAL*8, which is as portable as the more general F90 KIND approach. In any case, the 4-byte and 8-byte SHMEM routines, such as SHMEM_GET4 and SHMEM_GET8, should then be used as appropriate to transfer REALs. Note that the Fortran standard requires REAL and INTEGER be the same size, so 64-bit REAL and 32-bit INTEGER is not standard Fortran. This is typically only an issue if REAL and INTEGER variables are equivalenced, either explicitly or implicitly via (say) a redefined common block.

  7. Symmetric objects (accessible via SHMEM) on both T3E and Origin systems include those in common blocks or with the SAVE attribute in Fortran and non-stack C and C++ variables. Fortran arrays allocated with shpalloc, and C and C++ data allocated by shmalloc are also symmetric on both systems. However, #pragma symmetric and !DIR$ SYMMETRIC directives are only available on the T3E.

  8. The intro_shmem man page indicates that "asymmetric accessible" data objects (e.g. stack variables) are not available to SHMEM on the Origin. These are probably rarely used even on the T3E.

  9. The Origin is a non-uniform memory access (NUMA) system, but SHMEM start-up processing automatically ensures that the process associated with a SHMEM PE executes on a processor near the memory associated with that PE. The intro_shmem man page includes a discussion of environment variables that can be used to tune memory placement, but they are rarely necessary.

  10. A SHMEM program is SPMD (Single Program, Multiple Data), and performance is typically controlled by the slowest process. Unlike on the T3E, processes do not automatically get dedicated access to a processor on the Origin, and there is no distinction between application and command processors. It is very important that the total number of active processes be no larger than the number of physical processors on an Origin. This is largely a system management issue (e.g., correct batch queue setup). A few processors should also be "reserved" for the operating system; this can be significant on small Origins. For example, a 28-PE SHMEM job will probably run faster than an equivalent 32-PE job on a 32-processor Origin. Most large Origins will not "oversubscribe" processors; in fact, this is a DoD HPCMP requirement for all its large machines of any kind. However, some Origins are configured primarily for single-processor jobs and may have more active processes than processors. A SHMEM job might run 10x slower in such an environment than on an under-subscribed system.

  11. Some T3E SHMEM programs work only because operation timings on the T3E are highly repeatable. For example, a BARRIER might be skipped because the timing is such that the data it is protecting is always available when needed. This is poor programming practice, even on the T3E, and such programs will typically fail on the Origin. Often there is no intent to rely on timing, but if the T3E program always works there is no indication that a barrier is "missing" (until the program fails on the Origin).

  12. On the other hand, global barriers are so fast on the T3E that there is a tendency to add more barriers than necessary. Barriers are significantly more expensive on the Origin, so all unnecessary barriers should be removed. (This is also true of MPI programs, which almost never need MPI_BARRIER for correctness but may include extra MPI_BARRIER calls if targeted for the T3E only.)

  13. In many cases, optimal Origin SHMEM performance can only be obtained by replacing global barriers with local (PE to PE) synchronization. One good way to do this is to use the Co-Array Fortran SYNC_ALL(WAIT) routine, because on the Origin this can use local synchronization while on the T3E it can still use a hardware global barrier. It also has built-in deadlock detection. An alternative is to write your own synchronization operators using SHMEM_PUT and SHMEM_WAIT_UNTIL or SHMEM_WAIT (see the second sketch following this list).

  14. The wall clock time required for a given SHMEM program is more variable on an Origin (i.e. more dependent on system load) than on a T3E. A variation of about 5% between runs would be typical.

  15. Despite the long list of differences above, it is typically fairly easy to convert an existing T3E-only SHMEM program so that it runs efficiently on both the T3E and the Origin. Compaq and IBM also have SHMEM libraries for some of their machines, so SHMEM is becoming more viable for writing portable SPMD programs. An alternative might be MPI-2 put/get, which is partially implemented on the SGI Origin and Sun E10000. Today, one-sided communication (e.g., SHMEM or MPI-2 put/get) is typically faster, i.e. has lower overall latency, than MPI-1 message passing on machines with a hardware global memory system. However, this may be due to sub-optimal MPI-1 implementations from some vendors: Sun's MPI message passing on the E10000 is very fast and about equivalent in efficiency to MPI-2 put/get.
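
To tie several of the points above together, here is a minimal sketch of a portable SHMEM program illustrating points 4 through 7. The program and array names and the transfer length of 8 words are illustrative assumptions, not part of any interface:

      PROGRAM PORTGET
C     A minimal portable SHMEM sketch (points 4-7 above).
      INCLUDE 'mpp/shmem.fh'
      INTEGER MY_PE
      INTEGER ME, I
C     SRC and DST are in a common block, so they are symmetric
C     (remotely accessible) on both the T3E and the Origin.
      REAL*8 SRC(8), DST(8)
      COMMON /XFER/ SRC, DST
C
C     START_PES is required on the Origin and harmless on the
C     T3E; an argument of 0 takes the number of PEs from the
C     NPES environment variable.
      CALL START_PES(0)
      ME = MY_PE()
      DO 10 I = 1, 8
        SRC(I) = DBLE(ME)
   10 CONTINUE
C     Make sure every PE has initialized SRC before any gets.
      CALL SHMEM_BARRIER_ALL()
C     Use the size-specific SHMEM_GET8, not the T3E-only
C     SHMEM_GET, to fetch 8 64-bit words from PE 0.
      CALL SHMEM_GET8(DST, SRC, 8, 0)
      PRINT *, 'PE', ME, ' received', DST(1), ' from PE 0'
      END

Per point 2, on the Origin this might be built with something like: f90 -64 -LANG:recursive=on portget.f -lsma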
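
For point 13, here is a sketch of one possible hand-rolled pairwise synchronization built from SHMEM_INTEGER_PUT and SHMEM_WAIT_UNTIL. The name PAIR_SYNC and the counting scheme are illustrative assumptions; the two partner PEs must each call it with the other's PE number:

      SUBROUTINE PAIR_SYNC(PE)
C     Pairwise synchronization between the calling PE and PE.
C     Both partners must call PAIR_SYNC, each passing the
C     other's PE number.
      INCLUDE 'mpp/shmem.fh'
      INTEGER PE
C     FLAG has the SAVE attribute, so it is symmetric (point 7).
      INTEGER FLAG, COUNT
      SAVE FLAG, COUNT
      DATA FLAG, COUNT / 0, 0 /
C
      COUNT = COUNT + 1
C     Order any earlier puts to PE ahead of the flag update,
C     so the flag cannot signal data that has not yet arrived.
      CALL SHMEM_FENCE()
C     Signal arrival by writing the new count into the remote
C     copy of FLAG...
      CALL SHMEM_INTEGER_PUT(FLAG, COUNT, 1, PE)
C     ...then wait until the partner has done the same to ours.
      CALL SHMEM_WAIT_UNTIL(FLAG, SHMEM_CMP_GE, COUNT)
      END

The "greater than or equal" test lets a fast partner race ahead to a later synchronization point without causing a missed wake-up here. As written this assumes each PE synchronizes with a single partner; a per-partner flag array would generalize it.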

Understanding Global Change in the Arctic

"Understanding Global Change in the Arctic" is a full-color brochure outlining the accomplishments and future objectives of the NSF Arctic System Science (ARCSS) Program for a broad general audience. This brochure provides the arctic research community with an informative and useful aid in outreach and education efforts. It was published by the Arctic Research Consortium of the United States (ARCUS) for the NSF ARCSS Program.

Copies are available on request from ARCUS (phone: 907/474-1600; fax: 907/474-1604; email: arcus@arcus.org ) or you may download an electronic copy from the ARCUS web site at:

http://www.arcus.org/ARCSS/brochure/index.html

Mannheim SC2000 Tutorials on "Clusters & Grids"

[ This is from a notice we received. ]

All presentations of the "Clusters & Grids" Tutorial (June 13-14, 2000 in Mannheim) are freely available now and may be downloaded from our Webserver:

http://www.supercomp.de/

Day 1:

http://www.supercomp.de/programm/tutorium/13_06_00.htm

Day 2:

http://www.supercomp.de/programm/tutorium/14_06_00.htm

The proceedings of the Mannheim SC2000 Conference (CD-ROM edition with all lectures) are still available and may be ordered. Go to: http://www.supercomp.de/ and follow the link at the bottom right of the homepage.

Quick-Tip Q & A



A:[[ How can I find out what the current UNICOS/mk version number is?  
  [[ I use three different T3Es and they're always upgrading the OS
  [[ whenever they darn well feel like it.  If I don't start logging this
  [[ stuff I think I'm gonna wind up in deep sauerkraut!

From Barbara Herron, LLNL:


  I like "uname -a". The response will be of the format 

    hostname nodename release version hardware 

  where "release version" is the version and system that is running.
  "uname -r" will print just the operating system release number, and
  "uname -s" will print just the system name. Run "man uname" for more
  details.

  
Brad Chamberlain, U. Washington, had the same answer, and adds:

  [...]

  A general tip on uname:  While it's fairly well-supported from one
  operating system to the next, the options and what they do tend to
  vary just enough to drive you mad.  The most general solution is to
  use uname -a which gives (A)ll the information about that system.  The
  trick is then picking out what each element means (not always as easy
  as you might think).

    yukon% uname -a
    sn6327 yukon 2.0.5.38 unicosmk CRAY T3E




Q: Should I stop using Fortran 90's "WHERE" construct?  It makes my code
   more intelligible, but my sister-in-law's boyfriend told her it's
   very slow.

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.