Ed Anderson of the benchmarking group at CRI sent in the following technical publication:
Title:    The Benchmarker's Guide to Single-processor Optimization
          for CRAY T3E Systems
By:       Ed Anderson, Jeff Brooks, and Tom Hewitt; Benchmarking Group,
          Cray Research.
Abstract: "How to use features of the CRAY T3E series processors
          from a high-level language to improve the performance of
          application codes."
Chapters: - Introduction
          - Hardware Overview
          - Functional unit optimization
          - Cache optimizations
          - Stream buffer optimizations
          - E-register optimizations
          - Compiling for performance
          - Case study: the NAS benchmark kernels
          - Performance tools
Every T3E programmer is encouraged to retrieve a copy. It is available in PostScript via anonymous ftp from ftp.arsc.edu, in the directory pub/mpp/docs, under the name bmguide.ps.Z.
[ Many thanks to Dr. Alan Wallcraft, a scientist with the Naval Research Laboratory, for sending this in and sharing his experience and results. ]
The SHMEM library is now available on the SGI Origin 2000. Using it is very similar to using it on the T3E, except that REAL and INTEGER are 32-bit by default. If you use SAVEd local variables in SHMEM calls, you must compile with -LANG:recursive=on to prevent optimization around BARRIERs:
f77 -64 -LANG:recursive=on fortran_program.f -lsma
I have found that the 64-bit ABI (f77 -64) is essential for practical cases. The only other differences I have encountered are START_PES to initialize SHMEM (which can optionally be used on the T3E as well), and SHMEM_BARRIER_ALL in place of BARRIER. I have not been able to get T3E-style parallel I/O to work on the Origin, and have therefore modified my code to optionally do all I/O (writes) from PE 0. The other gotcha is that startup involves memory mapping from /tmp. If /tmp is too small, point TMPDIR to a large enough scratch disk.
On my standard ocean model test case, the Origin is almost identical in speed to a T3E-600. However, unlike the T3E, the Origin does not dedicate processors to a single job. I have seen very little variation in SHMEM performance on "undersubscribed" systems (with fewer active processes than processors). However, I have heard of up to 10x degradation in MPI performance on over-subscribed Origins, and would expect similar problems with SHMEM.
Most existing Origin systems have 64 or fewer CPUs. Larger machines have been sold, but are currently running in "pieces". The Origin is showing less scalability to a large number of nodes than the T3E-600, and falls over completely on 48 (and 64) nodes. A larger problem might exhibit better scalability above 32 nodes on the Origin, but the T3Es are still showing good scalability beyond 32 nodes even on this relatively small region.
PERFORMANCE OF NRL LAYERED OCEAN MODEL ON EXISTING HPC PLATFORMS
----------------------------------------------------------------

MACHINE        PARALLEL   NUM.    TIME        MFLOPS   SPEEDUP
               METHOD     NODES

Cray C90       VECTOR        1    1.36 hrs      355    (REAL*8)
Cray T3D       SHMEM         4    5.15 hrs       94
Cray T3D       SHMEM         8    2.70 hrs      179    1.91x  4 nodes
Cray T3D       SHMEM        16    1.41 hrs      342    1.91x  8 nodes
Cray T3D       SHMEM        32    0.81 hrs      596    1.74x 16 nodes
Cray T3E-600   SHMEM         4    1.71 hrs      282    3.00x T3D
Cray T3E-600   SHMEM         8    0.91 hrs      532    1.89x  4 nodes
Cray T3E-600   SHMEM        16    0.49 hrs      993    1.87x  8 nodes
Cray T3E-600   SHMEM        32    0.28 hrs     1752    1.76x 16 nodes
Cray T3E-600   SHMEM        48    0.20 hrs     2365    2.38x 16 nodes
Cray T3E-600   SHMEM        88    0.16 hrs     3115    1.32x 48 nodes
SGI O2000      SHMEM         4    1.70 hrs      284    0.99x T3E-600
SGI O2000      SHMEM         8    0.84 hrs      575    2.02x  4 nodes
SGI O2000      SHMEM        16    0.43 hrs     1123    1.95x  8 nodes
SGI O2000      SHMEM        32    0.28 hrs     1724    1.54x 16 nodes
SGI O2000      SHMEM        48    0.27 hrs     1788    1.59x 16 nodes
Cray T3E-900   SHMEM         4    1.44 hrs      335    1.19x T3E-600
Cray T3E-900   SHMEM         8    0.78 hrs      623    1.86x  4 nodes
Cray T3E-900   SHMEM        16    0.40 hrs     1203    1.93x  8 nodes
Cray T3E-900   SHMEM        32    0.23 hrs     2109    1.75x 16 nodes

 o TIMES ARE FOR A ONE YEAR 1/2 DEGREE GLOBAL OCEAN MODEL RUN
 o NOT A BENCHMARK: RUN INCLUDES TYPICAL I/O AND DATA SAMPLING
 o THIS IS A SMALL PROBLEM (GRID SIZE: 512x288x6), LARGER GRIDS
   SCALE TO MANY MORE NODES
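
For readers who want to compare scalability directly, here is a minimal sketch of how speedup and parallel efficiency can be derived from the table above. It assumes that the SPEEDUP column is (approximately) the ratio of MFLOPS figures, which is what most rows appear to show. The MFLOPS values are copied from the table; the function names speedup and efficiency are illustrative only and are not part of the ocean model or of any library.

```python
# MFLOPS values copied from the table above, keyed by node count.
MFLOPS = {
    "Cray T3E-600": {4: 282, 8: 532, 16: 993, 32: 1752, 48: 2365, 88: 3115},
    "SGI O2000": {4: 284, 8: 575, 16: 1123, 32: 1724, 48: 1788},
}


def speedup(machine, base_nodes, nodes):
    """Ratio of sustained MFLOPS at `nodes` to MFLOPS at `base_nodes`."""
    rates = MFLOPS[machine]
    return rates[nodes] / rates[base_nodes]


def efficiency(machine, base_nodes, nodes):
    """Measured speedup divided by the ideal speedup (nodes / base_nodes)."""
    return speedup(machine, base_nodes, nodes) / (nodes / base_nodes)


if __name__ == "__main__":
    # Report every configuration relative to the 4-node run on the same machine.
    for machine in MFLOPS:
        for nodes in sorted(MFLOPS[machine]):
            s = speedup(machine, 4, nodes)
            e = efficiency(machine, 4, nodes)
            print(f"{machine:13s} {nodes:3d} nodes: {s:5.2f}x vs 4 nodes, "
                  f"{e:4.0%} efficiency")
```

Measured against the 4-node runs, the efficiency at 48 nodes works out to roughly 70% on the T3E-600 and a little over 50% on the Origin, which is one way to quantify the scalability difference described in the text.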
Here is the beginning of a reading list. If you have any favorite texts or other sources, please send them in. We will expand this list and reprint it every few months.
MPI information sources:
  MPI: The Complete Reference. Snir, Otto, Huss-Lederman, Walker and Dongarra. MIT Press. ISBN 0 262 69184 1
  Using MPI. Gropp, Lusk, Skjellum. MIT Press. ISBN 0 262 57104 8
Parallel Programming Skills/Examples:
  Practical Parallel Programming. Gregory V. Wilson. MIT Press. ISBN 0 262 23186 7
  Designing and Building Parallel Programs. Ian Foster. Addison Wesley. ISBN 0 201 57594 9
    http://www.mcs.anl.gov/dbpp/
Fortran:
  Fortran90/95 Explained. Metcalf and Reid. Oxford Science Publications. ISBN 0 19 851888 9
  Fortran90 Programming. Ellis, Philips, and Lahey. Addison-Wesley. ISBN 0-201-54446-6
C and C++:
  Parallel Programming using C++. G.V.Wilson and P Lu. MIT Press. ISBN 0 262 73118 5
Background information:
  Hal's Legacy: 2001's Computer as a Dream and a Reality. ISBN 0 262 19378 7
  High Performance Compilers for Parallel Computing. Michael Wolfe, Addison-Wesley. ISBN 0-8053-2730-4
  Supermen. C.J Murray. Wiley. ISBN 0 471 04885 2
Yukon's queue structure has been revised as follows:
m_64pe_8h@yukon has been removed.
m_64pe_4h@yukon has been added.
The queue priorities are as follows:
From 06:00 - 18:00, Alaska time:
m_32pe_8h@yukon has priority over m_64pe_4h@yukon.
From 18:00 - 06:00, Alaska time:
m_64pe_4h@yukon has priority over m_32pe_8h@yukon.
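
As a rough illustration of the schedule above, the following sketch returns the queue that has priority at a given Alaska local time. The function name priority_queue is hypothetical, and the handling of the exact boundaries is an assumption: the announcement does not say which rule applies at precisely 06:00 or 18:00, so this sketch treats 06:00 as the start of the daytime window and 18:00 as the start of the night window.

```python
from datetime import time

DAY_START = time(6, 0)    # 06:00 Alaska time
DAY_END = time(18, 0)     # 18:00 Alaska time


def priority_queue(now):
    """Return the queue that has priority at Alaska local time `now`.

    Assumption: the daytime window is 06:00 <= now < 18:00; the
    announcement does not specify the behavior at the exact boundaries.
    """
    if DAY_START <= now < DAY_END:
        return "m_32pe_8h@yukon"
    return "m_64pe_4h@yukon"


if __name__ == "__main__":
    for t in (time(5, 59), time(12, 0), time(23, 30)):
        print(t.strftime("%H:%M"), priority_queue(t))
```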
Based on the 1.5.0 release overview, here is the impact on some user features. This release also has improved stability due to resolution of earlier problems.
Removed Features:
- The C_JOBPROCS resource was removed as a resource category for the getlim system call.
- The C_JOBPROCS resource was removed as a resource category for the setlim system call.
- jobprocs category was removed from the nlimit -c option
Added Features:
- grmview(1) changes:
    - -m option displays PE map with compression
    - -M option displays PE map without compression
    - ApId (application ID) and command fields added to queue displays
- ja -S option produces a summary report in table format.
- Resource limits
- nlimit(1) has a new option mppt (MPP time limits) under the resource option.
- jstat(1) options added:
    - -m indicates that multi-PE application memory and times are requested.
    - -M indicates that both command & multi-PE app memory & times are requested.
    - -t indicates extended time info requested.
- limit(1) options added:
    - -F option to limit file size a process or job can create
    - -M option to limit MPP memory size
    - Added limit values to command output
- limit(2) has 2 new options:
    - L_MPPM added to resource option
    - C_APTEAM category added under category option
- nlimit(3c) added C_APTEAM as a resource category
- Limited checkpoint and restart support
- Processes using open files cannot be checkpointed or restarted.
- System V IPC message queues and semaphores available
A: {{ On the T3E, how can you limit the size of core files? }}
# Try the "limit" command. From the man page:
#
# -d dlim Indicates a limit on core file sizes. The
# dlim argument refers to memory words and is
# rounded up to the nearest click boundary.
# There are 512 decimal words per click on
# Cray Research systems. A dlim of 0
# indicates the maximum core file size
# allowed. This option is supported only for
# processes.
#
# This command seems less flexible than advertised, however.
# In multiple intended crashes of a 1 PE test program with a
# variety of dlim settings, I found that any dlim value other
# than 0 resulted in a core file of 160 bytes. A dlim value of
# 0 resulted in core files of 2168992 bytes. The limit command
# acts more like an on/off switch.
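
To make the rounding rule in the man page excerpt concrete, here is a small sketch of the arithmetic as documented: dlim is given in words and rounded up to a multiple of 512 words (one click). The conversion to bytes assumes 8-byte (64-bit) words on the T3E, and the function names are illustrative only. This reflects the documented rule, not the on/off behavior actually observed in the tests described above.

```python
WORDS_PER_CLICK = 512   # from the man page excerpt above
BYTES_PER_WORD = 8      # assumption: 64-bit words on the T3E


def round_up_to_click(dlim_words):
    """Round a word count up to the next multiple of one click (512 words)."""
    clicks = -(-dlim_words // WORDS_PER_CLICK)   # ceiling division
    return clicks * WORDS_PER_CLICK


def words_to_bytes(words):
    """Convert a word count to bytes under the 8-bytes-per-word assumption."""
    return words * BYTES_PER_WORD


if __name__ == "__main__":
    # A dlim of 0 is special (maximum allowed core size), so start at 1.
    for dlim in (1, 512, 513, 1000):
        words = round_up_to_click(dlim)
        print(f"dlim={dlim:5d} words -> {words:5d} words "
              f"({words_to_bytes(words)} bytes)")
```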
Q: Here's a frustrating situation. You are using the "copy" and
   "paste" features of your workstation's desktop environment to insert
   text from one window into a file you're editing with "vi" in a
   second window. You "copy" the text, then, in the vi window, hit "i"
   for "insert," then you execute the "paste," but the result comes
   out with an
     additional
       indentation
         on each
           subsequent
             line
               (like
                 this).
   What is (probably) wrong?
[ Answers, questions, and tips graciously accepted. ]
Contact:
Thomas J. Baring, ARSC Web Specialist, ph: 907-450-8619
Donald Bahls, ARSC User Consultant, ph: 907-450-8674
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775 | voice: 907-450-8600