ARSC T3E Users' Newsletter 165, April 05, 1999

VAMPIR Images of Parallel IO

VAMPIR is a performance analysis tool for parallel MPI programs. It does more than just display the message passing taking place: user-selected parts of the code's activity can also be inspected.

In this article, we use VAMPIR and last week's example code for MPI_GATHERV to illustrate the potential benefit of doing IO in parallel.

Given last week's test code, the IO activity can be plotted by adding calls to three routines from the VAMPIR API. The first call defines a new symbol for IO activity, the second and third mark the start and end of IO processing. Here's the modified IO section of code:

      call VTSYMDEF(10,"IO_ACTIVITY","IO_ACTIVITY",ierr)
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      call VTBEGIN(10,ierr)

!! write data out on host processor.
      if(myid.eq.Uroot) call var_dump(iuchan,ipglb,uglb)
      if(myid.eq.Vroot) call var_dump(ivchan,ipglb,vglb)
      if(myid.eq.Wroot) call var_dump(iwchan,ipglb,wglb)
      if(myid.eq.Troot) call var_dump(itchan,ipglb,tglb)

      call VTEND(10,ierr)
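For context, here is a minimal, hedged sketch of the gather step that precedes these writes (last week's MPI_GATHERV code is not reproduced in this issue, so the local array and count names below are illustrative assumptions; only Uroot, Vroot, uglb, and vglb come from the code above). The point is that each global array is gathered onto a different root, so the four writes can then proceed on four processors at once:

!! Hedged sketch: gather each variable onto its own root.
!! uloc, vloc, nloc, counts, and displs are assumed names.
      call MPI_GATHERV(uloc, nloc, MPI_INTEGER,                 &
                       uglb, counts, displs, MPI_INTEGER,       &
                       Uroot, MPI_COMM_WORLD, ierr)
      call MPI_GATHERV(vloc, nloc, MPI_INTEGER,                 &
                       vglb, counts, displs, MPI_INTEGER,       &
                       Vroot, MPI_COMM_WORLD, ierr)
!! ...and similarly for wglb onto Wroot and tglb onto Troot.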

Comparing the single-processor and parallel IO versions is now easy. For the following examples, a problem size of 5,000,000 integers was used.

In the single-processor IO version, all the data is gathered onto one root processor and written out there. The IO requires over three seconds. Here's the VAMPIR "global timeline" plot of this method:

[Figure: VAMPIR global timeline, single-processor IO]

In the parallel IO version, the data is gathered and written out on four different roots. The IO now takes just over a second, as this plot shows:

[Figure: VAMPIR global timeline, parallel IO on four roots]

Two caveats are in order. First, since the files reside on the same filesystem, there is a point of diminishing returns as additional processors are enlisted to perform IO. As shown in newsletter #157, this effect sets in at about the eighth processor.

Second, this approach of gathering single variables together for IO activities limits the parallelism to the number of variables.

Despite these caveats, there is a clear reward for implementing this simple form of parallel IO, and it is greatest when a code must write many sets of data during a run. Such simple multi-processor IO is used to great effect in some codes currently running at ARSC.

(For details, examples, and help on VAMPIR, see ARSC's tutorial at:

http://www.arsc.edu/support/howtos/usingvampir.html)

cache_bypass Accelerates Co-Array Fortran Programs

The CACHE_BYPASS directive specifies that local memory references in a loop should be passed through E-registers. This can speed up co-array data transfers.

An example. Given these declarations:

       real :: c(1000000)[*], a(1000000)[*]

Here are two ways to move n elements from array "c" on image 1 to array "a" on image 2. They use explicit DO loops:

CAF "get" loop version:

       if (THIS_IMAGE () .eq. 2) then
         do  i=1,n
           a(i) = c(i)[1]
         enddo
       endif

CAF "put" loop version:

       if (THIS_IMAGE () .eq. 1) then
         do  i=1,n
           a(i)[2] = c(i)
         enddo
       endif

We found through testing that the above CAF methods using explicit DO loops benefit tremendously from the addition of the "cache_bypass" directive:

       if (THIS_IMAGE () .eq. 1) then
!dir$ cache_bypass a,c
         do  i=1,n
           a(i)[2] = c(i)
         enddo
       endif
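Since both loop forms benefit, the "get" loop takes the same directive. For completeness (this is simply the two fragments above combined, not additional code from the original tests):

       if (THIS_IMAGE () .eq. 2) then
!dir$ cache_bypass a,c
         do  i=1,n
           a(i) = c(i)[1]
         enddo
       endif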

Here are two more ways to move n elements from array "c" on image 1 to array "a" on image 2. They use Fortran 90 array syntax:

CAF "get" array syntax:

       if (THIS_IMAGE () .eq. 2) then
           a(1:n) = c(1:n)[1]
       endif

CAF "put" array syntax:

       if (THIS_IMAGE () .eq. 1) then
           a(1:n)[2] = c(1:n)
       endif

Neither array syntax version benefited from cache_bypass. Also, both array syntax versions were slower than their explicit-loop counterparts when cache_bypass was used with the latter. This is similar to the finding, given in newsletter #127, that array syntax is slower for processor-to-local-memory transfers.

Finally, here are the SHMEM equivalents of the same transfer:

SHMEM_GET:

       if (SHMEM_MY_PE () .eq. 1) then
         call shmem_get (a, c, n, 0)
       endif

SHMEM_PUT:

       if (SHMEM_MY_PE () .eq. 0) then
         call shmem_put (a, c, n, 1)
       endif

SHMEM bypasses the cache by definition and, as expected, achieves the fastest transfer rates. However, the CAF get with an explicit DO loop and cache_bypass is nearly as fast as SHMEM.
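For readers who want to reproduce such comparisons, here is a minimal, hedged timing sketch. The array size, the standard SYSTEM_CLOCK timer, and the program structure are our own assumptions for illustration, not the actual benchmark harness used for the results above:

      program time_put
      implicit none
      integer, parameter :: nmax = 1000000
      real    :: c(nmax)[*], a(nmax)[*]
      integer :: i, n, t0, t1, rate

      n = nmax
      c = 1.0
      call sync_all()                ! both images ready before timing
      if (this_image() .eq. 1) then
        call system_clock(t0, rate)
!dir$ cache_bypass a,c
        do i = 1, n
          a(i)[2] = c(i)
        enddo
        call system_clock(t1)        ! remote completion is only
                                     ! guaranteed after the sync below
        print *, 'put seconds:', real(t1 - t0) / real(rate)
      endif
      call sync_all()                ! image 2 may now use a()
      end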

For reference, there's documentation on cache_bypass in the Cray on-line docs. (ARSC users: read "news documents" on any ARSC system to see how to log onto the doc server.) It's in the "CF90(TM) Commands and Directives Reference Manual," section 3.4.3.

Book Reviews: MPI and PVM News

The cover of the first edition of 'MPI, The Complete Reference' was a somewhat calming blue. The new two-volume set, however, a second edition of the first book plus a volume covering many of the MPI2 extensions, comes in alarming orange and yellow, respectively. (Another good text on MPI, 'Using MPI,' was also blue. This often caused confusion when helping users over the telephone: "Have you got an MPI book?" "Yes, which one?" "The blue one.") So, what else has changed, apart from the color?

Volume 1, The Complete Reference.

The MPI2 meetings revised MPI and added some new features to the core of the language. All calls and examples have been updated to comply with MPI2, and a C++ binding has been added alongside C and Fortran. The layout is also improved, with a cleaner distinction between the MPI function argument lists and the text. The conclusions section still covers many important parallel processing issues and should be compulsory reading for all programmers who intend serious work with MPI or any message passing system. As with any frequently referenced book, an updated edition is always welcome. (The first edition is frequently consulted during daily work here. Editor's note: perhaps the bright colors will help us find the books when they have been borrowed.)

Volume 2, MPI Extensions.

The second volume covers the majority of the new features discussed and added in the MPI2 meetings. These include the following topics:

  • MPI and threads, and mixed-language programming, are now covered.
  • One-sided communications have been added; these will be of particular interest to anyone who has worked with shmem on Cray systems (see the first sketch after this list).
  • Process creation and management has been added, building on one of the most popular features of PVM that was not carried across to MPI in the first specification.
  • There is now a set of parallel IO routines that provide a portable way to express the parallelism of data access and should reduce the effort needed to write general code or to tune code for each different filesystem configuration (see the second sketch after this list).
  • An interface to F90 has been added to smooth the combined use of new F90 language features and MPI; this chapter should be read before starting an F90-and-MPI programming project.
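As a taste of the one-sided model, here is a minimal, hedged sketch in Fortran (the variable names, sizes, and fence-based synchronization pattern are our choices for illustration, not an excerpt from the book). Run on at least two processes, rank 0 writes directly into a window exposed by rank 1, with no matching receive, much like shmem_put:

      program rma_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1000
      real*8  :: winbuf(n), local(n)
      integer :: win, ierr, myrank
      integer (kind=MPI_ADDRESS_KIND) :: winsize, disp

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

!     Every process exposes "winbuf" as a window of n 8-byte reals.
      winsize = n * 8
      call MPI_WIN_CREATE(winbuf, winsize, 8, MPI_INFO_NULL,     &
                          MPI_COMM_WORLD, win, ierr)

      call MPI_WIN_FENCE(0, win, ierr)
      if (myrank .eq. 0) then
        local = 42.0d0
        disp  = 0
!       One-sided put: rank 1 posts no matching receive.
        call MPI_PUT(local, n, MPI_DOUBLE_PRECISION, 1, disp,    &
                     n, MPI_DOUBLE_PRECISION, win, ierr)
      endif
      call MPI_WIN_FENCE(0, win, ierr)

      call MPI_WIN_FREE(win, ierr)
      call MPI_FINALIZE(ierr)
      end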
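Similarly, here is a minimal, hedged sketch of the parallel IO interface (again, the file name, sizes, and offsets are illustrative assumptions). Every process opens the same file and writes its own block at its own offset, with no single gathering root:

      program pio_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1000
      real*8  :: buf(n)
      integer :: fh, ierr, myrank
      integer :: status(MPI_STATUS_SIZE)
      integer (kind=MPI_OFFSET_KIND) :: offset

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      buf = dble(myrank)

!     All processes open the shared file together...
      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'data.out',             &
                         MPI_MODE_WRONLY + MPI_MODE_CREATE,      &
                         MPI_INFO_NULL, fh, ierr)

!     ...and each writes n 8-byte reals at its own offset.
      offset = myrank * n * 8
      call MPI_FILE_WRITE_AT(fh, offset, buf, n,                 &
                             MPI_DOUBLE_PRECISION, status, ierr)

      call MPI_FILE_CLOSE(fh, ierr)
      call MPI_FINALIZE(ierr)
      end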

Overall, both volumes provide an essential reference and day-to-day survival guide for any programmer developing MPI programs.

From the basic programmer trying to determine whether there is a function that performs a needed collective operation, to advanced programmers trying to get different programs to work together on heterogeneous networks, these books cover the important issues. One of the most useful features is the "advice to users" and "advice to implementors" notes given for each MPI function. The former help the user understand what the committee had in mind when it decided the function was necessary; the latter are also useful for the programmer, since they give some idea of how the MPI function should work.

The orange one:

MPI - The Complete Reference, Volume 1, The MPI Core, Snir, Otto, Huss-Lederman, Walker, and Dongarra. MIT Press, 0-262-69215-5

The yellow one:

MPI - The Complete Reference, Volume 2, The MPI Extensions, Gropp, Huss-Lederman, Lumsdaine, Lusk, Nitzberg, Saphir, and Snir. MIT Press, 0-262-57123-4

The other blue MPI book is:

Using MPI, Gropp, Lusk and Skjellum, MIT Press, 0-262-57104-8.

PVM

As mentioned above, MPI2 does go some way toward adding the desired features from PVM, but there is still much to recommend PVM as a parallel programming system for certain applications.

A good website to catch up on PVM activity is:

http://www.epm.ornl.gov/pvm/pvm_home.html

Here there are tutorials, news of latest releases, conference announcements, a summary of project activity and many useful code examples and libraries to use in your own programs.

Of particular note are the Harness and CUMULVS projects. Harness is a next-generation PVM aimed at large heterogeneous networks of systems; it is looking at better control and fault tolerance for large scientific applications. CUMULVS eases the addition of visualization and computational steering to PVM and MPI programs.

EPCC Survey on HPC

We received the following announcement:

>
> EPCC is co-ordinating a project for the European Commission to
> determine which of the facilities and services provided by HPC centres
> are most relevant to their users.
> 
> This short questionnaire will take you around 10 minutes to complete.
> 
> By participating, you will ensure that your views and opinions are
> taken into account in determining the future provision of services at
> HPC centres like EPCC. The closing date for responses is 30 April
> 1999.  
> 

To participate, go to: 
http://www.epcc.ed.ac.uk/direct/directq_index.html

Math Trio Named Nation's Best in Modeling Contest

[ This just arrived... Congrats to the UAF Math Department and the
"trio"! ]
April 2, 1999
Fairbanks, Alaska - With long pony-tailed hair, pierced body parts and Teva sandals, University of Alaska Fairbanks seniors Gregg Christopher, Orion Lawlor and Jason Tedor could pass for MTV musicians, not award-winning mathematicians. But don't let their appearance fool you. When it comes to math muscle, these guys are simply the best in the nation. This Alaskan trio of brainiacs just won top honors in the 1999 Mathematical Contest in Modeling, one of the most grueling competitions in the country, and earned bragging rights over teams from powerhouse schools like Harvard and Yale.

Winning the competition is nothing new to Nanook mathematicians. UAF has ranked in the top two percent of all schools participating a record number of six times. "No other school in the universe can match this record," said Clif Lando, UAF mathematical sciences department head.

The MCM, held each February for college undergraduates, is designed to improve problem-solving and writing skills in a team setting. Students have 89 hours to come up with the solution to real-world problems involving natural sciences and mathematics. More than 400 universities from around the world compete.

This year's triumphant triumvirate have all competed on modeling teams in the past few years, so they knew what challenges they faced. Their coach, assistant math professor Chris Hartman, could empathize with the team's anticipation of the event - he was on the winning 1990 modeling team. Hartman, who got his bachelor's degree from UAF in 1991, went on to get a Ph.D. with a focus in graph theory from the University of Illinois. He now holds a joint appointment at UAF's Arctic Region Supercomputing Center and the Department of Mathematical Sciences.

At precisely midnight Thursday, the UAF team tore open an envelope containing this year's problems. Teams across the nation were synchronized to open their envelopes at the same time, with the same 89-hour deadline. UAF decided to tackle the problem of how to demonstrate people evacuating from rooms during an emergency, based on how many occupants were in the room.

The trio modeled two scenarios for the problem. One was a mathematical tree using fractions to show how quickly people move through parts of a room. The second was a simulated room designed on a computer with people represented as red discs. The discs were programmed with many human foibles - they shoved and bounced off one another, navigated around furniture, and were indecisive about where to exit.

In 89 hours, the team researched everything from fire safety codes to psychological profiles of World Trade Center bombing escapees. They measured the dimensions of several campus facilities - Schaible Auditorium, the olympic-sized pool and gym at the Patty Center, the Wood Center Ballroom - to use as parameters for their models. Then, they created a computer program to crunch numbers and visually display the models. As the competition ran down to its last critical hours, the team wrote their paper explaining their techniques - a whopping 90-page mathematical modeling manifesto.

"After the competition last year, I was so tired I couldn't even lift a nacho chip to eat at the Pub," said team member Gregg Christopher, from Anchorage. "At least this year I was able to eat my nachos afterwards." Teammate Orion Lawlor, from Glenallen, didn't worry about food. He slept for 21 consecutive hours after the competition. And Jason Tedor, the third member, slept on and off for two days after their winning paper was postmarked and in the mail.
"We've all been through this competition hell multiple times and ask ourselves, why do it again?" said Tedor, who hails from North Pole. "The answer is simple. We did it for the challenge."

Quick-Tip Q & A


A:{{ Sourdough Sam sells moose for $10 each, reindeer for $3 each, and
    ducks for $0.50 each.  He floats his raft down the Yukon River to
    the Chilkoot Pass Trading Post one spring morning and sells
    exactly 100 animals for exactly $100, selling at least one of each
    species.  How many of each did he sell? }}

  The contest is now officially closed and we'll be sending awesome
  prizes to seven winners next week.  Here's my favorite reply:

  > He sold 0 Moose, 20 Reindeer (Totaling $60), and 80 ducks (Totaling
  > $40).

  I responded that, nope, there had to be at least one of each critter,
  and then received a second reply:

  > Ok,
  > 
  > How about 5 moose ($50), 1 reindeer ($3), and 94 ducks ($47).
  > 
  > 
  >   program farm
  >     do i=0,10
  >       r1=100 - (10 * i)
  >       r2=100 - i
  >     ducks = (3*r2 - r1) / 2.5
  >     deer = r2 - ducks
  >     print *, i, deer, ducks
  >   enddo
  >   end
  > 

  Guy lamented that this wasn't a parallel program.
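  For the record, a parallel version is easy enough. Here is a hedged
  MPI sketch (our own, not part of the reply above) that splits the
  moose loop across processes and, like the original, simply prints the
  candidate rows:

      program pfarm
      implicit none
      include 'mpif.h'
      integer :: ierr, myid, nprocs, i
      real    :: r1, r2, ducks, deer

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

!     Each process tries a different subset of moose counts.
      do i = myid, 10, nprocs
        r1 = 100 - (10 * i)
        r2 = 100 - i
        ducks = (3*r2 - r1) / 2.5
        deer  = r2 - ducks
        print *, i, deer, ducks
      enddo

      call MPI_FINALIZE(ierr)
      end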

Q: What simple change can I make to improve my code's performance on
  the T3E?

[ Answers, questions, and tips graciously accepted. ]
