ARSC T3E Users' Newsletter 166, April 15, 1999

ARSC and UAF to Get OC-12 Connection to Internet2 in September


[ This is taken from a University of Washington news release.  The
  full text is available at:
  http://www.washington.edu/newsroom/news/1999archive/04-99archive/k040299.html ]

The University of Washington and WCI Cable Inc. are pleased to announce that WCI Cable, in support of the cooperative Pacific/Northwest Gigapop and national Internet2 efforts, is providing the UW with a state-of-the-art fiber-optic connection from Seattle to the University of Alaska Statewide System in Fairbanks, Alaska.

This connection will use high-speed "SONET OC-12" technology and will bring the new Internet2 and NGI (Next Generation Internet) capabilities and technologies to Alaska. It will link the University of Alaska to the Pacific/Northwest Gigapop in Seattle, which is the major national (and only high-speed) Internet2 network hub for the region.

This contribution will enable the University of Alaska and other research and education partners to become full and active participants in Internet2/Next-Generation-Internet development and will extend the new generation of Internet2 technologies, capabilities and opportunities to the entire research and education community in Alaska.

ARSC Internet Connection Upgraded Last Week

The internet connection between Seattle and ARSC was upgraded by a factor of four on April 7, expanding from 2 to 8 T1 links. Users who connect to ARSC over the internet from the "lower 48" should notice the improvement.

For the rest of the story, see:

http://www.arsc.edu/pubs/bulletins/QuadAccess.shtml

VAMPIR Images of MPI, SHMEM, and Co-Array Fortran Broadcast

As noted last week, VAMPIR does more than display MPI message passing. Users can select any code activities to inspect, including message or data passing accomplished by packages OTHER than MPI.

A sample code, given below, broadcasts data using MPI_Bcast, SHMEM_BROADCAST, and a hand-coded Co-Array Fortran (CAF) algorithm.

The program times broadcasts of seven different sizes with each method. Here's output from a run on 13 PEs on yukon:


   (Values are time in seconds per million words (MW) broadcast,
    using packets of "N Words" words on "NPES" PEs)

  NPES     N Words        MPI     SHBCST    CAFBCST
    13           1  125.05054   39.57748   43.39218
    13          10    7.93934    3.69549    4.11272
    13         100    0.85235    0.49472    1.77145
    13        1000    0.19300    0.15414    0.93031
    13       10000    0.12649    0.12825    0.75670
    13      100000    0.10838    0.10720    0.33942
    13     1000000    0.10481    0.10537    0.29094
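
   (For scale: assuming the T3E's default 8-byte REAL, the
    1,000,000-word case at roughly 0.105 s per MW works out to about
    8 MB / 0.105 s, or some 76 MB/s, delivered to each receiving PE.)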

So that we can "see" what's happening, the code also includes calls to the vampir_trace API, tracing the SHMEM and CAF methods so that they appear on the VAMPIR global-timeline display along with the MPI method.
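
The full listing appears at the end of this article, but the
instrumentation pattern is easy to lift out.  Here is a minimal
sketch that mirrors the VTSYMDEF / VTBEGIN / VTEND calls in that
listing; the program name "vtdemo", the activity code 21, and the
"WORK_LOOP" label are invented for illustration.

      program vtdemo
      implicit none

      include 'mpif.h'
      include 'VT.inc'

#define WORKLOOP 21

      integer :: ierr, i
      real    :: x

      call MPI_INIT (ierr)

c     Give activity code 21 a name for the VAMPIR legend, just as the
c     full program does for SHMEM_BCAST, CAF_BCAST, and PUT
      call VTSYMDEF (WORKLOOP, "WORK_LOOP", "WORK_LOOP", ierr)

c     Bracket the region of interest; it then shows up as its own
c     activity on the global timeline
      call VTBEGIN (WORKLOOP, ierr)
      x = 0.0
      do i = 1, 1000000
        x = x + 1.0 / real (i)
      enddo
      call VTEND (WORKLOOP, ierr)

      write (6,*) 'partial harmonic sum = ', x
      call MPI_FINALIZE (ierr)
      end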

The following graph is VAMPIR's complete global timeline. The legend includes the new "activities" introduced explicitly in the code: SHMEM_BCAST, CAF_BCAST, and PUT. The colors assigned to these, and all the symbols, are under user control.

Time progresses from left to right, but at this scale, many events are hidden. The final broadcasts of 1,000,000 words per method, however, take the most time, and are visible just before the program terminates.

Figure 1:

The next graph is "zoomed in" on the 100,000 word broadcast. To conserve space, the legend and the processor labels have been switched off.

To understand the hand-coded CAF method better, the code was instrumented to designate each individual "PUT" used within the overall broadcast operation. The PUT operations are revealed in the VAMPIR timeline. The tree structure of the data distribution is apparent as the data flows from processor 0, down the branches, and eventually to all processors.

Figure 2:
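
To make the tree explicit, here is a small stand-alone sketch (not
part of the timing program) that replays the same indexing arithmetic
used in the CAF loop below and prints which image sends to which at
each stage, for 13 images.  The program name "treemap" is invented.

      program treemap
      implicit none

      integer :: npes, nstages, istage, img, partner

      npes = 13

c     Same stage count as the timing code: the ceiling of log base-2
c     of the number of images
      nstages = ceiling (alog (real (npes)) / alog (2.0))

      do istage = nstages, 1, -1
        write (6,*) 'Stage', istage
        do img = 1, npes
c         Images whose zero-based index is a multiple of 2**istage
c         already hold the data and pass a copy along
          if ( mod (img - 1, 2**istage) .eq. 0 ) then
            partner = img + 2**(istage - 1)
            if ( partner .le. npes )
     &        write (6,*) '   image', img, ' -> image', partner
          endif
        enddo
      enddo
      end

For 13 images this prints image 1 -> 9 at stage 4, then 1 -> 5 and
9 -> 13 at stage 3, and so on until every image has the data.  CAF
images are numbered from 1, so image 1 corresponds to processor 0 in
the VAMPIR display, matching the branching visible in Figure 2.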

This example suggests the power of VAMPIR to help you visualize the flow of data in your programs, regardless of the communication method you use.

Here's the test code:


c**********************************************************************
c Timing program to compare MPI_Bcast, SHMEM_BROADCAST, and 
c  hand-coded Co-Array Fortran broadcast.  Uses vampir trace
c  to trace broadcast sections. 
c**********************************************************************
      program main

      implicit none

      include 'mpif.h'
      include 'VT.inc'
      include "mpp/shmem.fh"

! processor identification
      integer :: mype, myimg, penum, imagenum, mypartner, npes
      integer :: mpi_root, caf_root, shmem_root

! distribution tree
      integer :: istage, nstages

! loops
      integer :: ido, nbcast, idata,ndata, i

! timing
      real    :: start, end, tmpi, tshmem, tcafbcst 

! internal 
      real    :: flagvalue
      integer :: ierr, methn

! shmem 
      integer, dimension(SHMEM_BCAST_SYNC_SIZE) :: pSync

! data for processing
      real, allocatable, dimension(:)[:] :: bcArr

! parameters
      parameter  (ndata = 1000000, flagvalue = 2000.0 )
      parameter  (mpi_root = 0, shmem_root = 0, caf_root = 1 )

! setup shmem
      data pSync /SHMEM_BCAST_SYNC_SIZE * SHMEM_SYNC_VALUE/

! setup MPI
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, mype, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, npes, ierr )

! setup CAF
      myimg = this_image ()

! setup VAMPIR definitions and initial setup
#define SHMEMBCAST 11
#define CAFBCAST   12
#define CAFPUT     13
      call VTSYMDEF (SHMEMBCAST, "SHMEM_BCAST", "SHMEM_BCAST", ierr)
      call VTSYMDEF (CAFBCAST, "CAF_BCAST", "CAF_BCAST", ierr)
      call VTSYMDEF (CAFPUT, "PUT", "PUT", ierr)

! allocate data arrays
      allocate (bcArr(ndata)[*])

! Generate array to broadcast
      if ( myimg .eq. caf_root ) then
        do idata=1,ndata
          call random_number(bcArr(idata))
        enddo

      endif

      call sync_images()

      ! copy bcArr to all PEs
      if ( myimg .ne. caf_root ) then
        bcArr(1:ndata) = bcArr(1:ndata)[caf_root] 
      endif


! print headers for output tables
      if ( mype .eq. mpi_root ) then
        write (6, '(A4,A12,3(A1,A10))') 
     &    'NPES',
     &    'N Words',
     &    ' ',
     &    'MPI',
     &    ' ',
     &    'SHBCST' ,
     &    ' ',
     &    'CAFBCST'

      endif

! loop in order to broadcast increasing sized packets.

      do ido = 0, floor (log10 (real (ndata)))
        nbcast = 10**ido

! broadcast data from root using MPI

        !  call MPI_Barrier(MPI_COMM_WORLD,ierr)
        methn=1

        ! all PEs but root reset target array
        if ( mype .ne. mpi_root ) bcArr(1:nbcast) = flagvalue


        call sync_images()
        start=MPI_WTIME()

        call MPI_BCAST(bcArr,nbcast,MPI_REAL,mpi_root,MPI_COMM_WORLD
     & ,ierr)

        call sync_images()
        end=MPI_WTIME()
        tmpi = end-start

! broadcast same data from root using Shmem broadcast
        methn=2

        ! all PEs but root reset target array
        if ( mype .ne. mpi_root ) bcArr(1:nbcast) = flagvalue

        call sync_images()
        start=MPI_WTIME()

        call VTBEGIN (SHMEMBCAST, ierr)
        call shmem_broadcast 
     &    (bcArr, bcArr, nbcast, shmem_root, 0, 0, npes, pSync)

        call sync_images()
        call VTEND (SHMEMBCAST, ierr)

        end=MPI_WTIME()

        tshmem = end-start


! broadcast same data from root using CAF Tree-structured broadcast
        methn=3

        ! all PEs but root reset target array
        if ( mype .ne. mpi_root ) bcArr(1:nbcast) = flagvalue

        call sync_images()
        start=MPI_WTIME()

        ! Total number of stages, or levels, in binary tree is
        ! the ceiling of log base-2 of the number of images.
        nstages = ceiling (alog (real (npes)) / alog (2.0))


        call VTBEGIN (CAFBCAST, ierr)

        do istage = nstages, 1, -1
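          ! At each stage, images whose zero-based index is a
          ! multiple of 2**istage already hold the data; each sends
          ! a copy to a partner 2**(istage-1) images further along.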

          call sync_images ()

          if ( mod (myimg - 1, 2**istage) .eq. 0) then
            mypartner = (myimg) + 2**(istage - 1)

            if (mypartner .le. npes) then

              call VTEND (CAFBCAST, ierr)
              call VTBEGIN (CAFPUT, ierr)
!dir$ cache_bypass bcArr
              do i=1,nbcast
                bcArr(i)[mypartner] = bcArr(i)
              enddo
              call VTEND (CAFPUT, ierr)
              call VTBEGIN (CAFBCAST, ierr)

            endif
          endif
        enddo

        call sync_images ()

        call VTEND (CAFBCAST, ierr)

        end=MPI_WTIME()
        tcafbcst = end-start


! print timing results

        if ( mype .eq. mpi_root ) then
          write (6, '(I4,I12,3(A1,F10.5))') 
     & npes,
     & nbcast,
     & ' ',
     & 1000000*(tmpi)/nbcast,
     & ' ',
     & 1000000*(tshmem)/nbcast,
     & ' ',
     & 1000000*(tcafbcst)/nbcast

        endif
      enddo

      call MPI_FINALIZE(ierr)
      end
                       

NACSE Task Force Reports on Software and Tools

The final report of the Northwest Alliance for Computational Science and Engineering (NACSE) task force on requirements for HPC software and tools is now available.

Over the past six months, various groups have been meeting to discuss software and tools on HPC systems. Experience from application development, user and system support, and tools and software research has been combined to define a baseline environment that should meet most users' needs for code development and run-time support. This was not simply a wish list for an ideal world: it was a serious effort to consider what is needed both now and in the near future, to help vendors and researchers prioritize their efforts, and to help centers acquire productive systems for their users.

Contributions came from national labs, universities, vendors, and other institutions, and the systems currently in use ranged from small workgroup servers to teraflop systems. Vendors and users discussed what was needed and what could be delivered in the near future at a realistic cost. The need for Cray compatibility, and the influence of increasingly powerful desktop systems on the software development process, were also strongly debated.

Some key features recommended as being essential to all HPC systems were:

  • Required OS features, including shells, utilities, and threads.
  • Fortran 95, C, and C++ compilers, each with OpenMP support.
  • Libraries such as optimized BLAS and MPI-1, plus interoperability between MPI-1 and OpenMP.
  • Standard tools, debuggers, and basic timers.
  • Documentation and examples.

The final report contains details on each of the above topics. Each feature is described as it might be in a typical RFP by a center considering an HPC system or expansion of an existing resource. More details of the series of meetings, the final full report, and some typical examples of software needs for systems to perform specific activities can be found at:

http://www.nacse.org/projects/HPCreqts

Upcoming Conferences in 1999

CUG '99

Cray User's Group
Minneapolis, Minnesota; May 24 - 28
Conference registration and accommodations now available on-line.
http://www.cug.org/

HUG '99

High-Performance Fortran User Group
Redondo Beach, California; Aug 1 - 2  [ Note: new location and date ]
Submission Deadline: May 7  [ Note: new deadline ]
http://www.icase.edu/hug99/

EuroPar '99

Euro-Par is a European conference on parallel computing.
Toulouse, France; Aug 31 - Sept 3
http://www.enseeiht.fr/europar99/

SC99

Supercomputing
Portland, Oregon; Nov 13 - 19
http://www.sc99.org/

Quick-Tip Q & A


A:{{ What simple change can I make to improve my code's performance on
     the T3E?  }}


  Whether or not these qualify as "simple" or "changes" is obviously
  subjective:

    o Don't compile with "-g".

    o Compile with more optimization, like "-O3,aggress" (but compare
      results and performance).

    o Use a faster file system (at ARSC this means /tmp).

    o Use dmget in advance to retrieve potentially migrated input files.

    o Replace hand-coded algorithms with Cray libraries, if possible
      (a small example follows this list).

    o Test the code on different numbers of PEs.  Run on the number
      that optimizes "performance," however you define it.

    o Use tools (e.g., PAT, apprentice, VAMPIR) to locate inefficient
      code segments, and then, if possible, improve them.
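
  As a small, hedged illustration of the "Cray libraries" tip: the
  sketch below replaces a hand-coded matrix multiply with the BLAS
  routine SGEMM from libsci.  It is not taken from any ARSC code; the
  array sizes and the program name are invented for the example.

      program blasdemo
      implicit none

      integer, parameter :: m = 200, n = 200, k = 200
      real :: a(m,k), b(k,n), c(m,n)

      call random_number (a)
      call random_number (b)

c     SGEMM computes C := alpha*A*B + beta*C; with alpha = 1.0 and
c     beta = 0.0 this replaces the usual hand-coded triple loop
      call sgemm ('N', 'N', m, n, k, 1.0, a, m, b, k, 0.0, c, m)

      write (6,*) 'c(1,1) = ', c(1,1)
      end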

Q: The following is legal input to a Unix command:
      [la1+dsa*pla10>y]sy
      0sa1
      lyx

   What's the command, and what's the result from this input?  (Hint:
   the command is an easily mistyped anagram of another, extremely
   popular, command.)

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.