ARSC HPC Users' Newsletter 271, June 27, 2003
The SX-6 was installed a year ago at ARSC. It will remain here for one more year, until June 2004.
The SX-6 is available, by application only, to the U.S. HPC community for testing and benchmarking of codes. It's intended for research, not production. In the United States, this remains a unique opportunity.
For practical purposes, we recommend you not delay too long. A rush of users at the end of its tenure could make scheduling of dedicated runs more difficult and otherwise stress the system.
- ARSC's primary SX-6 page, with links to application forms, statement of purpose, etc:
- For current and potential SX-6 users, here's info on compilers, profilers, performance analysis, optimization, and just getting started:
- For porting and performance case studies, as well as user experiences of the SX-6, download the PDF file: "SX-6 Comparisons and Contrasts" from:
More questions? Contact: firstname.lastname@example.org
Technical Papers and Reports
We've updated our web site with the following recent research papers by ARSC staff:
Portable Cray Bioinformatics Library, J. Long, ARSC, Proceedings of the Cray User Group, May 2003
The ARSC Storage Solution, G. McGill, ARSC, Proceedings of the Cray User Group, May 2003
SX-6 Comparisons and Contrasts, T. Baring, ARSC, Proceedings of the Cray User Group, May 2003
Performance of ROMS on Regatta and SX-6
[ ARSC hosted two Air Force Academy Cadets as interns earlier this summer. Our thanks to Cadet Ryan Roper for this report. ]
I spent my brief time at ARSC studying the relative performance of the SX-6 and the Regatta, and it has been interesting. My task was this: to find the best way to run the MPI build of the Regional Ocean Modeling System (ROMS) for 200 timesteps, determine the ARSC HPC system on which it runs best, and compare that system to the runners-up. ROMS is an ocean model managed at ARSC by oceanographer Kate Hedstrom, and for this benchmark, it simulates an idealized square portion of a southern ocean. Before coming to ARSC, I had absolutely no exposure to supercomputing and limited UNIX experience (which served me well once I finally remembered it), so this project was an excellent introduction.
From my tests, the ARSC host best suited to this ROMS benchmark is the IBM Regatta (known here as Iceflyer). I had the fewest compilation problems and the fastest absolute run times on that machine. When all was said and done, I had made 13 runs, varying the number of processors and tilings, to test the hypothesis that tilings closer to squares (2x2, 3x3, etc.) would run faster. A "tiling" is the way that the data is distributed across the processors. While I did find that large numbers of processors vastly improved the running time, the results were somewhat erratic and the square tiling did not always produce the best results.
My guess is that a square tiling may be ideal, but one would need the machine to oneself to test this. (The machine was shared with other batch users during these runs.) My recommendation, however, is that if you need to run ROMS on the Regatta, use as many processors as you can get your hands on (I found that 20 at 5x4 was best) and tile them so that the two dimensions are reasonably close together. Unfortunately, f1n1, the Regatta main batch node, has only 24 processors; otherwise a 5x5 tiling might have been ideal.
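As a rough illustration of choosing a tiling whose dimensions are close together, the sketch below lists every way to factor a processor count into two dimensions and orders the results from most to least square. It is not part of ROMS; the function name `candidate_tilings` is made up for this example, and the ordering reflects only the squareness heuristic, not measured run times.

```python
def candidate_tilings(nprocs):
    """Return all (a, b) pairs with a * b == nprocs, most square first."""
    pairs = [(a, nprocs // a) for a in range(1, nprocs + 1) if nprocs % a == 0]
    # A smaller difference between the two dimensions means a squarer tiling.
    return sorted(pairs, key=lambda p: (abs(p[0] - p[1]), p))


if __name__ == "__main__":
    for n in (16, 20, 24):
        print(n, candidate_tilings(n))
```

The heuristic is only a starting point for choosing which tilings to time. In the runs reported here, the squarest tiling did not always give the best time.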
Num Procs   Tiling   Running Time
---------   ------   ------------
    1        1x1       1458.32
    2        2x1        689.83
    4        1x4        311.17
    6        3x2        254.08
    9        3x3        140.46
   12        3x4        107.82
   16        8x2         84.43
   20        5x4         71.44

Table 1. Best case total running time on the Regatta
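To put the Regatta timings in terms of scaling, the following sketch computes speedup and parallel efficiency relative to the single-processor run. The times are copied from the Regatta table; the helper name `scaling` is an assumption made for this example.

```python
# Running times copied from the Regatta table: processors -> time.
regatta = {1: 1458.32, 2: 689.83, 4: 311.17, 6: 254.08,
           9: 140.46, 12: 107.82, 16: 84.43, 20: 71.44}


def scaling(times):
    """Return {procs: (speedup, efficiency)} relative to the 1-processor run."""
    base = times[1]
    return {p: (base / t, base / t / p) for p, t in sorted(times.items())}


if __name__ == "__main__":
    for p, (s, e) in scaling(regatta).items():
        print(f"{p:3d} procs  speedup {s:6.2f}  efficiency {e:5.2f}")
```

With these numbers the efficiency comes out above 1 at several processor counts. The report does not explain this, and the runs were made on a shared machine, so the figures are best read as rough indicators.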
The SX-6 was no less busy than the Regatta, but there was a clear pattern in the most efficient tilings. From my tests, the best way to run a ROMS job on the SX-6 is to tile 1xN, where N is the number of processors. This is because the long, skinny tilings make the most efficient use of the SX-6's long, 256-element vector registers. The SX-6 at ARSC has 8 processors in a single shared-memory cabinet.
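The effect of tiling on vector length can be sketched with a little arithmetic. The example below assumes that the first number in a tiling splits the inner (vectorized) grid dimension, so a 1xN tiling leaves that dimension whole. The grid size of 256 points is hypothetical, since the report does not give the benchmark's grid dimensions, and the function names are made up for illustration.

```python
import math

VECTOR_LENGTH = 256  # SX-6 vector register length, from the text


def inner_loop_length(grid_i, tiles_i):
    """Longest inner-loop length per tile when grid_i points are split tiles_i ways."""
    return math.ceil(grid_i / tiles_i)


def register_fill(loop_length, vector_length=VECTOR_LENGTH):
    """Average fraction of a vector register used over a loop of this length."""
    passes = math.ceil(loop_length / vector_length)
    return loop_length / (passes * vector_length)


if __name__ == "__main__":
    GRID_I = 256  # hypothetical grid size, not taken from the report
    for tiles_i, tiles_j in [(1, 4), (2, 2), (4, 1)]:
        n = inner_loop_length(GRID_I, tiles_i)
        print(f"{tiles_i}x{tiles_j}: inner loop {n}, register fill {register_fill(n):.2f}")
```

Under these assumptions, every split of the inner dimension shortens the vectors each processor works on, while splitting only the outer dimension keeps them at full length.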
What was also interesting is that the single and dual processor runs on the SX-6 were faster than those on the Regatta. As the number of Regatta processors increased, its performance more closely matched the SX-6 until, by sheer availability of processors, the Regatta pulled away. Note that the SX-6 time spiked at 6 processors. I would like to investigate that in more depth, but from a rough re-run of the simulation, my guess is that it was due to heavier usage of the machine; the re-run was made in a less crowded batch queue and produced a time more in line with expected results. It would be interesting to see how the SX-6 compared to the Regatta if it had more processors available. It certainly is a capable machine.
Num Procs   Tiling   Running Time
---------   ------   ------------
    1        1x1       1159.859
    2        1x2        596.588
    3        1x3        416.795
    4        1x4        321.181
    5        1x5        261.259
    6        1x6        272.898
    7        1x7        199.577

Table 2. Best case total running time on the SX-6
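The two observations above, the spike at 6 processors and the crossover between the machines, can be checked directly against the tabulated times. This is a small sketch using values copied from the two tables in this report; the helper names `slowdowns` and `ratio_at_common_counts` are made up for the example.

```python
# Running times copied from the two tables in this report: processors -> time.
# Only the Regatta rows for up to 6 processors are included here.
sx6 = {1: 1159.859, 2: 596.588, 3: 416.795, 4: 321.181,
       5: 261.259, 6: 272.898, 7: 199.577}
regatta = {1: 1458.32, 2: 689.83, 4: 311.17, 6: 254.08}


def slowdowns(times):
    """Return processor counts whose time is worse than the next smaller count."""
    procs = sorted(times)
    return [p for prev, p in zip(procs, procs[1:]) if times[p] > times[prev]]


def ratio_at_common_counts(a, b):
    """Return {procs: a_time / b_time} for counts present in both tables."""
    return {p: a[p] / b[p] for p in sorted(set(a) & set(b))}


if __name__ == "__main__":
    print("SX-6 slowdowns at:", slowdowns(sx6))
    for p, r in ratio_at_common_counts(regatta, sx6).items():
        print(f"{p} procs: Regatta time / SX-6 time = {r:.2f}")
```

A ratio above 1 means the SX-6 run was faster at that processor count; a ratio below 1 means the Regatta run was faster.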
The following graph summarizes the timings. It shows how similarly the code performs on the Regatta and SX-6 on 1-7 processors, and the advantage of additional processors on the Regatta:
Figure 1. Regatta and SX-6 ROMS scaling
In conclusion, it is best at ARSC to run ROMS on the Regatta. I did additional tests with the Cray SV1ex and the Cray T3E, but I had more than my share of difficulties getting the MPI and OpenMP versions to even compile, let alone run. Once they were running, they performed nowhere near as well as the Regatta had, even with large numbers of processors. I was impressed by the SX-6's performance, and I'm sure it would be well suited to running ROMS if it weren't for its limited number of CPUs.
Ryan L. Roper
USAF Academy Class of 2004
Quick-Tip Q & A
A:[[ Here's one person's definition of "code-blindness", grabbed off the
  [[ web:
  [[
  [[   "... the inability to actually work out what on earth your code is
  [[   doing, even though you were wholly responsible for it..."
  [[
  [[ Programmers: do you have a technique for snapping yourself out of
  [[ code-blindness, or avoiding it in the first place?

Here are a few ideas from the editors:

- Explain the problem to someone else. It's amazing how often you'll see your own problem in a new light when you attempt to explain it.
- Re-document, and even over-document, the code. This can force you to rethink it and understand it from the algorithms down to the names of variables.
- Revise it (or start) with better variable names. Mix case.

[ Feel free to respond late to this question. We'd like to expand the above list with your experience. ]

Q: Is there an easy way to extract a column from a regular text file? For instance, a column of data. Or am I back to writing a perl script?
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven
ARSC HPC Specialist
ph: 907-450-8669

Kate Hedstrom
ARSC Oceanographic Specialist
ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.