WOMPAT 2002 Announcement
========================
Dates:
August 5-7, 2002.
Location:
Arctic Region Supercomputing Center,
University of Alaska Fairbanks,
Fairbanks, ALASKA.
The Workshop on OpenMP Applications and Tools (WOMPAT 2002) will serve as a forum for users and developers of OpenMP to meet, share ideas and experiences, and to discuss the latest developments in OpenMP and applications.
WOMPAT2002 follows a series of workshops on OpenMP, such as WOMPAT2001, EWOMP2001, and WOMPEI2002. It is part of the cOMPunity initiative whose main objective is the dissemination and exchange of information about OpenMP. (See http://www.compunity.org/ for more details of this activity and contents of past meetings.)
WOMPAT2002 is co-sponsored by the OpenMP Architecture Review Board.
Contributions are welcome and a one page extended abstract, either ASCII text or pdf, should be sent to:
The deadline for submission of abstracts is April 19th. More details of the meeting will be available shortly at http://www.compunity.org/. See http://www.arsc.edu/ and http://www.uaf.edu/ for details on the host institutions, ARSC and the University of Alaska Fairbanks.
Workshop Program Committee:
If you're not familiar with OpenMP, it is emerging as an industry standard interface for shared memory programming of parallel computer applications and it offers a way to write applications portable to a wide range of parallel computers. For instance, it is available at ARSC on the Cray SV1ex, all multi-processor SGI systems, and within the 4-processor shared memory nodes of the IBM SP.
In addition, a number of research groups are actively developing future enhancements to the language, debugging and performance monitoring tools, optimizing compilers, and run-time environments. To learn a lot more... consider attending this workshop!
The Arctic Region Supercomputing Center (ARSC) is pleased to invite interested members of the UA community and their associates as well as users from other research institutions to attend the 2002 ARSC Faculty Camp. Building on the success of past Faculty Camp events held in 2000 and 2001, the 2002 Faculty Camp will be held August 12th-23rd and will bring together a diverse group of researchers to learn about high performance computing and share research experiences. Past Faculty Camps have introduced many researchers to ARSC resources and built strong links between attendees and ARSC staff.
The Camp will combine a series of seminars presented by ARSC staff, UAF/ARSC Joint Faculty, and current users with independent/self-guided study and access to ARSC specialists. The exact seminar topics will depend on the needs of the selected attendees but will cover the basics of programming high performance computers, visualization software and skills, and using collaborative environments among others.
Individuals or groups wishing to attend are invited to register their interest by May 1st, and to submit by May 31st a short (approximately 250-word) description of the skills they would like to develop and how they intend to apply them. Please submit text in ASCII or pdf format. This description is important, as it is the basis on which ARSC will organize events and speakers for the Faculty Camp to match the attendees' needs.
Successful applicants will be notified by June 7th. Those accepted for the camp are expected to participate full-time for the 2 week period. UA researchers will be compensated at regular salary.
More details about ARSC can be found at http://www.arsc.edu/. Applications and questions regarding Faculty Camp should be sent to Guy Robinson. Please feel free to circulate this announcement around your departments and to others who might find it of interest.
| 1st May: | Expression of interest. |
| 31st May: | Submission of proposals. |
| 7th June: | Confirmation of Acceptance. |
| 12th-23rd August: | Faculty Camp. |
| Fall term: | Seminar on Faculty Camp work and followup. |
Last Wednesday, we made Cray's message passing toolkit (MPT) 1.4.0.4 the new default on yukon (see "news MPT.1.4.0.4").
Some MPI codes that work fine under MPT.1.3.0.0 deadlock or otherwise fail under the new default. (See the next article for an example...)
Our belief is that Cray's new MPT is at least as correct as the old one, if not more so, according to the MPI standard, and thus ARSC will keep this upgrade. However, if your code requires the old MPT, simply execute:
module switch mpt mpt.1.3.0.0
prior to recompiling and running your code. You might add it to your .cshrc or .profile file. We won't delete mpt.1.3.0.0! (This is another example of Cray's wisdom in issuing modular upgrades. We wish other vendors would use "modules" as well, and remind ARSC users that the local "PEvers" script is also available to show you all available versions and defaults.)
Perhaps Plato is responsible for the pesky ideal of permanence. Since computing is so closely related to math (Plato loved those Pythagoreans), the hope that programs won't need to be rewritten might just be an artifact of the history of ideas. The reality of working with codes is more like Heraclitus's "all is flux," especially in the high-performance computing world, which is quite a lot like his river, constantly changing beneath your feet. Fortunately, though permanence exists only as an ideal, there are ways to reduce the impermanence the industry tends towards in the relentless pursuit of progress.
Eratosthenes wrote down his prime-number sieve algorithm over 2 millennia ago, and it still works. Contrast: I wrote some MPI extensions to a code back in 1998 and, you guessed it, their days are numbered. On Feb 6, 2002 the message passing toolkit on the T3E was upgraded from 1.3.0.0 to 1.4.0.4, updating the "feature" on which my code depended. This is not an uncommon occurrence, as hardware and system software tend to change much faster than large software projects. Just because you use a high-level language and a "portable" message-passing paradigm doesn't mean your code will work on different systems -- or even on the same system as it changes.
Admittedly, though, I could have done a better job of checking that the code meant the same thing on more than one system (with its potentially unique configuration which could change at any time).
The code never actually worked on anything but the T3E. At the time I wrote it, this was the only system I was worried about. This was probably my biggest error. As Wittgenstein might say -- having tried to create one language in which to express all of philosophy which looked oddly like computer code -- any expression only makes sense within the rules of the game. So one must make sure that all parties involved are actually playing the same game... Assuming this will stay constant, even on the same piece of hardware, is folly. Wittgenstein, by the way, spent the second half of his life recanting the work of his first as oversimplified.
The following is a test code which demonstrates, on a very simple level, what happens in the problematic MPI exchanges in my bigger program:
program MPIblock
  implicit none
  include 'mpif.h'
  integer i,ierr,totpes,mype,frompe,n
  real,dimension(:),allocatable::A
  integer status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, totpes, ierr)

  n=2
  ! Double the message size until allocation fails or the code deadlocks.
  do
    allocate(A(n),stat=ierr)
    if (ierr/=0) then
      print *,"error allocating array. exiting"
      ! Abort all PEs; a plain "stop" could leave the others hanging.
      call MPI_ABORT(MPI_COMM_WORLD,1,ierr)
    end if

    ! UNSAFE: every PE issues all of its blocking sends before any receive.
    do i=0,totpes-1
      A=mype
      print *,mype," sending msg size ",N," to ",i
      ! A is real, so the matching datatype is MPI_REAL.
      call mpi_send(A,N,MPI_REAL,i,i,MPI_COMM_WORLD,ierr)
    end do

    do i=0,totpes-1
      call mpi_recv(A,N,MPI_REAL,MPI_ANY_SOURCE,MPI_ANY_TAG,&
                    MPI_COMM_WORLD,status,ierr)
      frompe=status(MPI_SOURCE)
      print *,mype," recv'd msg size ",N," from ",fromPE
      ! Each sender fills A with its own PE number.
      if (sum(A)/=fromPE*n) print *,mype,":error from ",fromPE
    end do

    deallocate(A)
    n=n*2
  end do

  ! Never reached: the loop above ends only by abort or deadlock.
  call mpi_finalize(ierr)
end program MPIblock
This test code loops with increasing array sizes until an allocation surpasses available memory or deadlock occurs. Communication is done in two separate loops: one for sends, the other for receives. In this case I've simplified things so that every processor essentially broadcasts to all of the processors, though I use this basic approach to implement a 3D transpose in the real program. The way it works under MPT 1.3.0.0 is pretty slick. The portable replacement is more ominous, in terms of complexity and required understanding of how MPI works.
If I had tested the code on more systems during development (it was my first real MPI code... I was naive...), I would have better understood the meaning of blocking. All of the sends and receives in this code follow the same unsafe pattern: all of the blocking sends first, then the receives. Whether it's the interpretation of the MPI standard used in the current version, or some "sweet spot" the test code might hit between current system hardware and software, it's surprisingly easy to use MPI unsafely -- and get away with it long enough to believe the code actually works.
Until the recent MPT upgrade, a strategy like the program above could handle vast numbers of very large messages on yukon. (Even this test code will happily pass lots of 16MW messages under MPT.1.3.0.0. Under MPT.1.4.0.4 it deadlocks past four. Words, that is, not millions of them.)
The new MPT just brings the T3E more in line with other systems' behavior. Testing on more systems would have revealed how lucky I was to escape deadlock, which the standard really tells you to expect. A blocking send that the system does not buffer will wait until a matching receive has been posted. Deadlock is the state where each processor is stuck in such a send, with more blocking sends still to come before the code can post a receive. Since the sends "ahead" of the one waiting will never happen, the receive will never happen, and the code will hang until it times out or is killed.
So why did this once work on the T3E? One part of the picture, at least, is buffering. When a blocking send is buffered, the system copies the message aside, which preserves the integrity of the buffer AND allows the code to proceed. (When using immediate mode, i.e. isend/irecv, it is up to the programmer to guarantee that the variable used isn't overwritten before the operation completes.) The former behavior of blocking sends was something of a best-of-both-worlds compromise between the two approaches. In general, buffer space is pretty limited -- and should not be depended on, ever. I could say I was "spoiled" by the excellent way the old MPT allowed such extensive use of buffering, as this could be found nowhere else.
As you can see from the following VAMPIR screenshot (figure 1) of the real application under the old MPT, the buffering is extensive. Each processor is sending part of the layer it has just computed (using its vertical domain decomposition) to the other processors so they can proceed with another section of the algorithm using horizontal domain decomposition. In my code, it isn't a trivial change to get around the assumption that buffer integrity is guaranteed AND sends don't need to be immediately matched by receives.
Figure 1: VAMPIR screenshot of the real application under the old MPT.
It worked for what I needed it to do. I got the project finished on time. I graduated. But as I learned more about MPI, when I'd go back over those sections of code I'd cringe a little. As noted, the code deadlocks anywhere other than pre-MPT 1.4 yukon. But hey, I'm a busy guy. As Voltaire said, "the best is the enemy of the good." One lifetime is just way too short for perfection. However, it would not have taken all that much more time to develop the MPI on more than one system -- which would have revealed assumptions that change all too quickly. In just one software upgrade major changes have been made. Just look at these VAMPIR plots of the first two iterations of the demo code. (I've zoomed in on the first two iterations, even though under the old MPT the demo runs until memory is filled, because under the new MPT deadlock occurs beyond four-word messages.)
Figure 2: old (MPT.1.3.0.0)
Figure 3: new (MPT.1.4.0.4)
I took these VAMPIR images to our MPP specialists, who couldn't deduce what was happening differently under the hood (though if any Cray engineers happen to be reading this, we'd love to be enlightened). In the end, though, it doesn't matter that some particulars changed. It was foolishness to expect that particulars would stay constant -- though there would be no way to know what was particular or not without actually experimenting in several places where basic assumptions might be different.
The T3E MPI is in no way broken; it just works more like other systems now. This latest change is the goad to stop procrastinating: the fluke-dependent code has to go if I want the program to keep working.
Fortunately, there are many wise and helpful people at ARSC. After listening to their input I've come up with a strategy to attack this re-engineering challenge.
An important first consideration: problematic codes written under previous MPTs may be quite easy to fix. Immediate mode (isend, irecv, and wait) might work with some trivial reordering. Buffered send is another trivial "fix". Theoretically, it should solve the problem of running out of buffers before receives can be posted. With a few simple changes my demo worked under the new MPT on yukon, on chilkoot, and the SGIs.
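As a rough illustration of the immediate-mode reordering, here is a sketch of the demo's exchange with every receive posted before any send starts. It is not the fix used in the real program: the program name, the fixed message size `n`, and the `inbox` array (one column per sending PE) are assumptions made for the example.

```fortran
program MPIexchange
  implicit none
  include 'mpif.h'
  integer :: i, ierr, totpes, mype, n
  real, dimension(:), allocatable :: A
  real, dimension(:,:), allocatable :: inbox      ! hypothetical: one column per sender
  integer, dimension(:), allocatable :: reqs
  integer, dimension(:,:), allocatable :: stats

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, totpes, ierr)

  n = 1024                                        ! assumed message size, in words
  allocate(A(n), inbox(n,0:totpes-1))
  allocate(reqs(2*totpes), stats(MPI_STATUS_SIZE,2*totpes))
  A = mype

  ! Post every receive first, each into its own column, so that no
  ! send ever has to wait on system buffering.
  do i = 0, totpes-1
    call MPI_IRECV(inbox(1,i), n, MPI_REAL, i, mype, &
                   MPI_COMM_WORLD, reqs(i+1), ierr)
  end do

  ! Start the sends. A must not be modified until MPI_WAITALL returns.
  do i = 0, totpes-1
    call MPI_ISEND(A, n, MPI_REAL, i, i, &
                   MPI_COMM_WORLD, reqs(totpes+i+1), ierr)
  end do

  ! Complete all sends and receives together.
  call MPI_WAITALL(2*totpes, reqs, stats, ierr)

  ! Each sender filled its message with its own PE number.
  do i = 0, totpes-1
    if (any(inbox(:,i) /= real(i))) print *, mype, ": error from ", i
  end do

  deallocate(A, inbox, reqs, stats)
  call MPI_FINALIZE(ierr)
end program MPIexchange
```

The trade-off in this sketch is memory: each PE holds a receive buffer for every sender at once, in exchange for not depending on how much the MPI library is willing to buffer.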
It's absolutely critical to test on many different systems. At least as I'd coded it, my bsend "fix" didn't scale well on the IBM or Linux cluster (alas, it worked so beautifully elsewhere). MPI isn't implemented uniformly (as I touched on in my "Portable MPI?" article in issue 217). Out of the huge number of MPI operations available, there is a small portable subset -- and a small number of portable ways to use that subset.
In general, codes get to be more robust as they are ported to different systems. Compilers tend to interpret "correct" code uniformly, and differ on the weird stuff. If a program works the same way on several different systems, it's also more likely to be easily maintainable and scalable. Thus it will also better survive system upgrades and their inevitable retirement.
You may notice there are some familiar elements from software engineering in the above list. Looking at requirements and design as much as possible on paper first (as opposed to just sitting down and coding) has been recognized as good practice for decades. If you can't draw a good picture of how the data needs to move around, you'll probably spend much longer coding -- with less chance of it working correctly in the end -- than if you had done the preliminary work.
Making sure code continues to work correctly while adding a new message passing scheme can be difficult. If a data set is 3D, visualizing in 3D can save a lot of headaches. I've found that visualization, combined with some numerical check (i.e. iterating over all values in an output file, checking that all differences fall within an allowable tolerance range from an output file as produced by a verified run of the original code) works best. The shortcomings of each are compensated by the other.
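The numerical check might look something like the following minimal sketch. The file names, the one-value-per-line text format, and the absolute tolerance are all assumptions for illustration; a real check would read whatever format the model actually writes.

```fortran
program compare_output
  implicit none
  ! Hypothetical file names and tolerance; adjust to the real output.
  character(len=*), parameter :: ref_file = 'reference.dat'
  character(len=*), parameter :: new_file = 'candidate.dat'
  real, parameter :: tol = 1.0e-6
  real :: ref_val, new_val, diff, max_diff
  integer :: ios_ref, ios_new, nvals, nbad

  open(unit=10, file=ref_file, status='old', action='read', iostat=ios_ref)
  open(unit=11, file=new_file, status='old', action='read', iostat=ios_new)
  if (ios_ref /= 0 .or. ios_new /= 0) then
    print *, 'could not open input files'
    stop
  end if

  nvals = 0
  nbad = 0
  max_diff = 0.0
  do
    ! Read one value from each file; stop when either file runs out.
    read(10, *, iostat=ios_ref) ref_val
    read(11, *, iostat=ios_new) new_val
    if (ios_ref /= 0 .or. ios_new /= 0) exit
    nvals = nvals + 1
    diff = abs(new_val - ref_val)
    max_diff = max(max_diff, diff)
    if (diff > tol) nbad = nbad + 1
  end do

  ! If one read still succeeded, the other file ended early.
  if (ios_ref == 0 .or. ios_new == 0) print *, 'warning: files differ in length'

  print *, 'values compared:     ', nvals
  print *, 'values out of range: ', nbad
  print *, 'largest difference:  ', max_diff

  close(10)
  close(11)
end program compare_output
```

Reporting the largest difference as well as the count of out-of-range values makes it easier to tell a rounding-level change from a genuinely misplaced block of data, which is where the visualization takes over.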
So far I've only started on the test code. The original is far too cumbersome for repeated trials. The savings in compile time alone far exceed the time required to create the test. It's also easier to fit how the test code works into my little brain, focusing on one problem at a time.
Out of all of the absurdities of life, this one is pretty small. If Nietzsche could conclude "that which does not kill us makes us stronger" after pondering many of the bigger strangenesses of how hard-won things so easily shift to irrelevancy in the modern world, I can rewrite a little program. I figure I'll just have to be philosophical about it... Fortunately, with all of the different systems on which to test the code (MPI is available on ARSC's IBM, Crays, SGIs, and Linux cluster) and all of the helpful people at ARSC it is quite possible to write a program that will be robust and portable -- thus far better able to withstand the constant change on the systems where it must run.
The quick tip is still in hibernation... If you have any tips to share, we'd love to see them.
Many thanks to those who have already responded to this plea!
[[ Answers, Questions, and Tips Graciously Accepted ]]
Contact:
Ed Kornkven
ARSC HPC Specialist
ph: 907-450-8669

Craig Stephenson
ARSC User Consultant
ph: 907-450-8653

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020