ARSC T3E Users' Newsletter 189, February 17, 2000

Follow-up: Interactive Yukon Jobs and qstat -m

In issue #187 (and in "news interactive_jobs") we announced a new policy regarding interactive work on yukon.

To guarantee PEs for interactive work, we occasionally reduce the number available for batch jobs. To determine how many PEs are in the batch pool, use "qstat -m":


  YUKON$ qstat -m
  ----------------------------------
  NQS 3.3.0.5 BATCH QUEUE MPP LIMITS
  ----------------------------------
  QUEUE NAME           RUN     QUEUE-PE'S    R-PE'S  R-TIME  P-TIME
                     LIM/CNT     LIM/CNT     LIMIT   LIMIT   LIMIT
  ------------------ --- ---  ------ ------  ------  ------  ------
    [ ...snip... ]
  ------------------ --- ---  ------ ------  ------  ------  ------
  yukon              100/6       260/254
  ------------------ --- ---  ------ ------  ------  ------  ------
The last line shows that, while 260 PEs are available to batch jobs, 254 are currently in use by batch jobs. The 260-PE batch limit is normal.

When we switch to interactive mode, the batch PE limit will go as low as 252. For instance:


  ------------------ --- ---  ------ ------  ------  ------  ------
  yukon              100/6       252/248
  ------------------ --- ---  ------ ------  ------  ------  ------
In this second example, "qstat -m" shows that, of 252 PEs permitted for batch work, 248 are being used by batch work.

However, this display doesn't tell us whether the remaining 4 "batch" PEs have been claimed by interactive work. Use "grmap" or "grmview" to determine how many PEs are actually unused. If at least 4 are free, a 4-PE batch job could acquire them.

Comparison of Languages for Multi-Grid Methods


[ This is the second in a two-part series contributed by Brad
  Chamberlain of the University of Washington. ]

REVIEW

Last week we presented an article summarizing results from an experiment to compare NAS MG as written in a number of parallel programming languages: F90+MPI, CAF, HPF, and ZPL. This week we follow up on that article with excerpts from the implementations in each language to give a sense of some of the differences in syntax and expressiveness.

INTRODUCTION TO NAS MG & DENSE MULTIGRIDS IN GENERAL

The core of the NAS MG benchmark can be summarized as a series of four 27-point stencils that are applied in a dense multigrid setting: two that work at a single level (resid and psinv) and two that span adjacent levels (rprj3 and interp). A 27-point stencil is merely a weighted sum of a point and its immediate neighbors in 3D space. Typically, values that are the same distance from the point are given the same weight. This is shown in ASCII-view here:


                         column         column        column
                      j-1  j  j+1    j-1  j  j+1    j-1  j  j+1
  
        i-1:          w3  w2  w3     w2  w1  w2     w3  w2  w3
    row i  :   res =  w2  w1  w2  +  w1  w0  w1  +  w2  w1  w2
        i+1:          w3  w2  w3     w2  w1  w2     w3  w2  w3

    plane:      k         k-1             k             k+1
The result (res) is assigned the sum of w0 times the element aligned with it, w1 times those that differ by 1 in one dimension, w2 times those that differ by 1 in two dimensions, etc. In a C-like language, this would be written in the following manner:


  x[i,j,k] = w0 * y[i, j, k] +
             w1 * (y[i-1, j, k] + y[i, j-1, k] + y[i, j, k-1] +
                   y[i+1, j, k] + y[i, j+1, k] + y[i, j, k+1]) +
             w2 * (y[i-1, j-1, k] + y[i-1, j, k-1] + y[i, j-1, k-1] +
                   y[i-1, j+1, k] + y[i-1, j, k+1] + y[i, j-1, k+1] +
                   y[i+1, j-1, k] + y[i+1, j, k-1] + y[i, j+1, k-1] +
                   y[i+1, j+1, k] + y[i+1, j, k+1] + y[i, j+1, k+1]) +
             w3 * (y[i-1, j-1, k-1] + y[i+1, j-1, k-1] +
                   y[i-1, j-1, k+1] + y[i+1, j-1, k+1] +
                   y[i-1, j+1, k-1] + y[i+1, j+1, k-1] +
                   y[i-1, j+1, k+1] + y[i+1, j+1, k+1]);
Multigrid computations are those that solve a coarse approximation of a problem in order to arrive at a solution to the original problem more quickly. Typically this involves a hierarchical array divided into "levels", each of which has half as many elements per dimension as the previous level. Thus, if the original problem were 8x8x8, the coarser levels might be 4x4x4, 2x2x2, and 1x1x1. In a dense multigrid problem, every point at each level is used. Sparse multigrid problems, which refine only certain interesting regions of the problem space, are not discussed here.

For NAS MG, 27-point stencils within a level might look like the above, whereas those that move between levels might have to scale the indexing expressions in order to refer to elements that are in the corresponding locations in the finer or coarser grid.

Challenges to efficiently parallelizing multigrid computations can be grouped into two broad categories: load balancing and managing the gory details. In order for the problem to be load balanced, it is important that every level of the hierarchy be divided as evenly as possible between the processors. While this is conceptually simple, it leads to some interesting questions about how to index the arrays at each level. For example, do you declare your arrays to have an upper bound that's half as big for each level, or so that each level is strided by twice as much?



       [1:8, 1:8, 1:8], [1:4, 1:4, 1:4], [1:2, 1:2, 1:2], etc.

                                   OR

       [1:8, 1:8, 1:8], [1:8:2, 1:8:2, 1:8:2], [1:8:4, 1:8:4, 1:8:4], etc.
This choice probably depends on the characteristics of the language you're using and, to a lesser degree, on personal preference. For example, ZPL's performance model specifies that the latter will result in better performance, though the former is still an option. In contrast, F90+MPI favors the former because it is based on a local view.
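
As a concrete (if simplified) illustration of the two conventions, here is a small C sketch of our own (the extents, level count, and printing are purely illustrative, not taken from any of the benchmark codes) that prints the index range each scheme would use for an 8-element dimension:

   #include <stdio.h>

   #define N       8    /* finest-level extent per dimension */
   #define NLEVELS 4    /* levels of extent 8, 4, 2, 1       */

   int main(void) {
       /* Option 1: a separate, smaller index range per level. */
       for (int lvl = 0; lvl < NLEVELS; lvl++)
           printf("level %d: [1:%d]\n", lvl, N >> lvl);

       /* Option 2: one finest-level range, strided by 2^level. */
       for (int lvl = 0; lvl < NLEVELS; lvl++)
           printf("level %d: [1:%d:%d]\n", lvl, N, 1 << lvl);

       return 0;
   }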

"Managing the details" includes exchanging boundary values with neighboring processors so that the stencil computations can run completely in parallel. Keep in mind that as the problem gets coarser and coarser, a desired value may be located on a processor other than those that are adjacent to you in the virtual processor grid. In addition, the details of controlling loop bounds for each processor at each level of the hierarchy (especially if the processors don't evenly divide the problem size) can become an issue.

With local-view approaches like F90+MPI and CAF, you must manage these details explicitly. In global-view languages like HPF and ZPL, the compiler will manage them for you, but you will probably want to pay attention to what it's doing to ensure that it's not incurring unnecessary overheads on your behalf. A language like ZPL eases this task by providing a syntax-level performance model for the user, whereas HPF tends to require post-execution profiling tools.

NAS MG EXCERPT OVERVIEW

The following sections will give a sense of the NAS MG code in each of the four languages. Since there's nothing more boring than poring over code that you aren't personally invested in, we'll do our best to keep it pretty minimal and restricted to some characteristic excerpts. The roadmap will be as follows:
  1. resid in each of the four languages as an example of a simple single-level 27-point stencil
  2. some helper code required by resid in each language
  3. characterization of how an inter-level stencil like rprj3 differs from a single-level stencil

RESID

The resid subroutine uses a 27-point stencil to calculate the residual r = v - Au, where r, v, and u are arrays at a single level of the computation and A is the 4-element weight vector for the stencil. One of the weights (w1 in our examples above) is 0 and is hand-optimized out in all versions.

We'll start with ZPL since it is the most succinct:



  ZPL:


   procedure resid(var R,V,U: [,,] double);
   begin
     R := V - a[0] *  U 
            - a[2]*(U@dir110{} + U@dir1N0{} + U@dirN10{} + U@dirNN0{} +
                    U@dir101{} + U@dir10N{} + U@dirN01{} + U@dirN0N{} +
                    U@dir011{} + U@dir01N{} + U@dir0N1{} + U@dir0NN{})
            - a[3]*(U@dir111{} + U@dir11N{} + U@dir1N1{} + U@dir1NN{} +
                    U@dirN11{} + U@dirN1N{} + U@dirNN1{} + U@dirNNN{});
     wrap_boundary(R);
   end;
This procedure takes the three argument arrays R, V, and U, declared as 3D but of unspecified size. The first thing to note is that there is no loop or explicit indexing associated with this statement. In ZPL, the indices over which an array statement executes are specified in a scoped manner using a "region specifier". In this case, no region specifier appears within resid, so it is dynamically inherited from the callsite. This encourages code reuse and makes resid independent of the multigrid level.

The array statement expresses the stencil using the "@" operator which modifies an array reference by an offset vector called a "direction". These directions are declared globally by the user. For example:


   direction dirN00{0..num_levels} = [-1, 0, 0] scaledby 2^{};
This declaration specifies a group of num_levels+1 directions, each of which is scaled by twice the amount of the previous:

   [-1, 0, 0], [-2, 0, 0], [-4, 0, 0], etc.  
The result is an offset per level in the hierarchy that can be used to refer to an element in the previous row but the same column and plane. In an array reference like:

   U@dirN00{}
the "{}" inherits the direction's scale from U (it can also be specified explicitly or relative to U's scale), and thus refers to the element whose row index is just less than that referred to by R, V, and U.

The ZPL compiler automatically generates vectorized communication for each @ reference, combining communications for vectors that overlap, such as dir100, dir110 and dir1N0, all of which require a plane of data from the "south". In addition, these @ operators provide a visual cue for ZPL users that point-to-point communication will most likely be required to implement the statement as specified by its performance model.

The procedure ends with a call to "wrap_boundary()" which uses ZPL's wrap statement to update the global boundary conditions. wrap_boundary is given in the next section.


F90+MPI / CAF:

   
         subroutine resid( u,v,r,n1,n2,n3,a,k )

   c -- some declarations omitted for brevity

         integer n1,n2,n3,k
         double precision u(n1,n2,n3),v(n1,n2,n3),r(n1,n2,n3),a(0:3)
         double precision u1(m), u2(m)

         do i3=2,n3-1
            do i2=2,n2-1
               do i1=1,n1
                  u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
        >                + u(i1,i2,i3-1) + u(i1,i2,i3+1)
                  u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
        >                + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
               enddo
               do i1=2,n1-1
                  r(i1,i2,i3) = v(i1,i2,i3)
        >                     - a(0) * u(i1,i2,i3)
        >                     - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
        >                     - a(3) * ( u2(i1-1) + u2(i1+1) )
               enddo
            enddo
         enddo

         call comm3(r,n1,n2,n3,k)

         return
         end
The F90 and CAF versions of the benchmark are identical (save for one CAF line omitted here) because all of the interprocessor communication is abstracted into a subroutine called comm3 (included below). This subroutine performs the 27-point stencil using an optimization in which partial sums are calculated and stored in the vectors u1 and u2 to avoid redundant FLOPs. As mentioned in last week's article, this is an important optimization and greatly benefits these implementations, at the cost of obscuring the code's intent. Ideally, Fortran compilers would recognize this optimization opportunity automatically, allowing the code to be written in a more intuitive form (one very similar to the stencil code given in the introduction).
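
For readers who want to see the partial-sum trick in isolation, here is a rough C rendering of one (i2,i3) "pencil" of the loops above. The array layout, extents, and function name are ours, the weights are left unspecified, and all of the distribution and boundary handling is omitted, so treat it as a sketch rather than the benchmark code:

   #define N1 8
   #define N2 8
   #define N3 8

   static double u[N3][N2][N1], v[N3][N2][N1], r[N3][N2][N1];
   static double a[4];   /* stencil weights; a[1] is 0 in NAS MG */

   /* Expects interior indices: 1 <= i2 <= N2-2, 1 <= i3 <= N3-2. */
   void resid_pencil(int i2, int i3)
   {
       double u1[N1], u2[N1];

       /* u1[i1]: the four neighbors of (i1,i2,i3) that differ in exactly
        * one of the i2/i3 directions; u2[i1]: the four that differ in
        * both.  Each partial sum is computed once and reused below. */
       for (int i1 = 0; i1 < N1; i1++) {
           u1[i1] = u[i3][i2-1][i1] + u[i3][i2+1][i1]
                  + u[i3-1][i2][i1] + u[i3+1][i2][i1];
           u2[i1] = u[i3-1][i2-1][i1] + u[i3-1][i2+1][i1]
                  + u[i3+1][i2-1][i1] + u[i3+1][i2+1][i1];
       }

       /* u2[i1] + u1[i1-1] + u1[i1+1] rebuilds the 12 "edge" neighbors,
        * and u2[i1-1] + u2[i1+1] the 8 "corner" neighbors, without
        * redoing any additions. */
       for (int i1 = 1; i1 < N1-1; i1++) {
           r[i3][i2][i1] = v[i3][i2][i1]
                         - a[0] * u[i3][i2][i1]
                         - a[2] * (u2[i1] + u1[i1-1] + u1[i1+1])
                         - a[3] * (u2[i1-1] + u2[i1+1]);
       }
   }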

Other than this optimization, there are no real surprises here. n1, n2, and n3 are the bounds of each processor's local block of data. The catch to this code is that comm3() is hiding all of the gory details, including the global and local boundary value updates (in this implementation the local communication is done after each computation, as opposed to the demand-driven communication used by ZPL and HPF).


HPF:

   
         extrinsic (HPF) subroutine resid( u,v,r,n1,n2,n3,a,k )

   c -- some declarations omitted for brevity

   !hpf$    distribute(*,block) :: grid
         double precision, intent (in) ::  u(:,:,:),v(:,:,:),a(0:3)
         double precision, intent (out) :: r(:,:,:)
   !hpf$ align(*,:,:) with grid :: u,v,r
         double precision u1(size(u,1)), u2(size(u,1))
 
   !hpf$ independent, new(u1,u2), onhome(u(i1,i2,i3)) 
         do i3=2,n3-1
            do i2=2,n2-1
               do i1=1,n1
                  u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
        >                + u(i1,i2,i3-1) + u(i1,i2,i3+1)
                  u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
        >                + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
               enddo
               do i1=2,n1-1
                  r(i1,i2,i3) = v(i1,i2,i3)
        >                     - a(0) * u(i1,i2,i3)
        >                     - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
        >                     - a(3) * ( u2(i1-1) + u2(i1+1) )
               enddo
            enddo
         enddo

          r(n1,:,:) = r(2,:,:)
          r(1,:,:) = r(n1-1,:,:)
          r(:,n2,:) = r(:,2,:)
          r(:,1,:) = r(:,n2-1,:)      
          r(:,:,n3) = r(:,:,2)
          r(:,:,1) = r(:,:,n3-1)

         return
         end
On the surface, the HPF code is very similar to the F90+MPI code. The biggest conceptual difference is that n1, n2, and n3 no longer refer to a processor's local bounds, but rather to the global bounds of the current level. HPF directives are specified to ensure that the arrays are distributed and aligned as necessary to minimize communication, though HPF makes no guarantees about how these directives will be implemented, or even that they will be followed at all. The advantage over F90+MPI and CAF is that no communication code is required. The disadvantage compared to ZPL is that HPF has no performance model, and thus no communication style or quantity is guaranteed, forcing programmers to tune to their particular compiler.

The last six F90 statements update the global boundary conditions as in ZPL's call to wrap_boundary().

RESID HELPER CODE


ZPL

   
   procedure wrap_boundary(var X:[,,] double);
   begin
     [dir100{} of "] wrap X;
     [dirN00{} of "] wrap X;

     -- similar statements here for the other 23 directions, omitted for
     -- brevity

     [dirNNN{} of "] wrap X;
   end;
The ZPL implementation of resid() uses a call to wrap_boundary (as do all of the other stencil operations) in order to update the global boundary conditions. This routine opens a region specifier for each statement using ZPL's "of" region operator. This operator uses a direction vector to create a new region adjacent to the base region in the direction specified and is useful for specifying a problem's boundary conditions. In this case, the base region is `"', indicating it should be dynamically inherited from the callsite. The wrap statement assigns values to the array X within the region such that they are periodic with respect to the base region.

One of ZPL's primary goals is to reduce tedious, error-prone programming. While the @ operator and wrap statement have had this benefit in 2D problems, the use of 27 directions in NAS MG demonstrates that there is still room to make the programmer's job even easier, even though ZPL remains more concise than sequential Fortran or C (and significantly more so than HPF, CAF, or F90+MPI).


F90+MPI
The update of each processor's local boundary values in the F90+MPI resid is implemented using 4 main routines: comm3, ready, give3, and take3. comm3 is the top-level routine which calls the others; ready posts non-blocking MPI receives; give3 marshals outgoing data and posts MPI sends; and take3 waits for the receives to complete and unmarshals the data. These routines involve 250+ lines of code, so they are liberally condensed here.

   c -- COMM3
         subroutine comm3(u,n1,n2,n3,kk)

   c -- declarations omitted

         if( .not. dead(kk) )then
            do  axis = 1, 3
               if( nprocs .ne. 1) then
   
                  call ready( axis, -1 )
                  call ready( axis, +1 )
   
                  call give3( axis, +1, u, n1, n2, n3, kk )
                  call give3( axis, -1, u, n1, n2, n3, kk )
   
                  call take3( axis, -1, u, n1, n2, n3 )
                  call take3( axis, +1, u, n1, n2, n3 )
   
               else
                  call comm1p( axis, u, n1, n2, n3, kk )
               endif
            enddo
         else
            call zero3(u,n1,n2,n3)
         endif
         return
         end


   c -- READY
         subroutine ready( axis, dir )

   c -- declarations omitted

         buff_id = 3 + dir
         buff_len = nm2

         do  i=1,nm2
            buff(i,buff_id) = 0.0D0
         enddo

         msg_id(axis,dir,1) = msg_type(axis,dir) +1000*me

         call mpi_irecv( buff(1,buff_id), buff_len,
        >     dp_type, mpi_any_source, msg_type(axis,dir), 
        >     mpi_comm_world,msg_id(axis,dir,1),ierr)
         return
         end


   c -- GIVE3
         subroutine give3( axis, dir, u, n1, n2, n3, k )

   c -- declarations omitted

         buff_id = 2 + dir 
         buff_len = 0

   c -- THE FOLLOWING MOTIF REPEATS 3 TIMES FOR THE 3 DIMENSIONS
         if( axis .eq.  1 )then
            if( dir .eq. -1 )then

               do  i3=2,n3-1
                  do  i2=2,n2-1
                     buff_len = buff_len + 1
                     buff(buff_len,buff_id ) = u( 2,  i2,i3)
                  enddo
               enddo

               call mpi_send( 
        >           buff(1, buff_id ), buff_len,dp_type,
        >           nbr( axis, dir, k ), msg_type(axis,dir), 
        >           mpi_comm_world, ierr)

            else if( dir .eq. +1 ) then

               do  i3=2,n3-1
                  do  i2=2,n2-1
                     buff_len = buff_len + 1
                     buff(buff_len, buff_id ) = u( n1-1, i2,i3)
                  enddo
               enddo

               call mpi_send( 
        >           buff(1, buff_id ), buff_len,dp_type,
        >           nbr( axis, dir, k ), msg_type(axis,dir), 
        >           mpi_comm_world, ierr)

            endif
         endif

         return
         end


   c -- TAKE3
         subroutine take3( axis, dir, u, n1, n2, n3 )

   c -- declarations omitted

         call mpi_wait( msg_id( axis, dir, 1 ),status,ierr)
         buff_id = 3 + dir
         indx = 0

   c -- THE FOLLOWING MOTIF REPEATS 3 TIMES FOR THE 3 DIMENSIONS
         if( axis .eq.  1 )then
            if( dir .eq. -1 )then

               do  i3=2,n3-1
                  do  i2=2,n2-1
                     indx = indx + 1
                     u(n1,i2,i3) = buff(indx, buff_id )
                  enddo
               enddo

            else if( dir .eq. +1 ) then

               do  i3=2,n3-1
                  do  i2=2,n2-1
                     indx = indx + 1
                     u(1,i2,i3) = buff(indx, buff_id )
                  enddo
               enddo

            endif
         endif

         return
         end


   
CAF
The CAF implementation of communication is very similar to that of F90+MPI, with two major exceptions: (1) there is no ready() subroutine, since all CAF communication is one-sided, and (2) CAF's synchronization primitives are used to ensure that communication has completed before proceeding.

         subroutine comm3(u,n1,n2,n3,kk)

   c -- declarations omitted

         if( .not. dead(kk) )then
            do  axis = 1, 3
               if( nprocs .ne. 1) then
                  call sync_all()
                  call give3( axis, +1, u, n1, n2, n3, kk )
                  call give3( axis, -1, u, n1, n2, n3, kk )
                  call sync_all()
                  call take3( axis, -1, u, n1, n2, n3 )
                  call take3( axis, +1, u, n1, n2, n3 )
               else
                  call comm1p( axis, u, n1, n2, n3, kk )
               endif
            enddo
         else
            do  axis = 1, 3
               call sync_all()
               call sync_all()
            enddo
            call zero3(u,n1,n2,n3)
         endif
         return
         end



         subroutine give3( axis, dir, u, n1, n2, n3, k )

   c -- declarations omitted

         buff_id = 2 + dir 
         buff_len = 0

   c -- THE FOLLOWING MOTIF REPEATS 3 TIMES FOR THE 3 DIMENSIONS

         if( axis .eq.  1 )then
            if( dir .eq. -1 )then

               do  i3=2,n3-1
                  do  i2=2,n2-1
                     buff_len = buff_len + 1
                     buff(buff_len,buff_id ) = u( 2,  i2,i3)
                  enddo
               enddo

               buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
        >      buff(1:buff_len,buff_id)

            else if( dir .eq. +1 ) then

               do  i3=2,n3-1
                  do  i2=2,n2-1
                     buff_len = buff_len + 1
                     buff(buff_len, buff_id ) = u( n1-1, i2,i3)
                  enddo
               enddo

               buff(1:buff_len,buff_id+1)[nbr(axis,dir,k)] =
        >      buff(1:buff_len,buff_id)

            endif
         endif
         return
         end

take3 in CAF:

      subroutine take3( axis, dir, u, n1, n2, n3 )

   c -- declarations omitted

         buff_id = 3 + dir
         indx = 0

   c -- THE FOLLOWING MOTIF REPEATS 3 TIMES FOR THE 3 DIMENSIONS

         if( axis .eq.  1 )then
            if( dir .eq. -1 )then

               do  i3=2,n3-1
                  do  i2=2,n2-1
                     indx = indx + 1
                     u(n1,i2,i3) = buff(indx, buff_id )
                  enddo
               enddo

            else if( dir .eq. +1 ) then

               do  i3=2,n3-1
                  do  i2=2,n2-1
                     indx = indx + 1
                     u(1,i2,i3) = buff(indx, buff_id )
                  enddo
               enddo

            endif
         endif

         return
         end

RPRJ3

All of the above code handles only a single-level stencil (resid, and similarly psinv). A reasonable question is how an inter-level stencil like rprj3 compares. rprj3 projects 27 data points in a fine grid to a single point at the next coarser level. In ZPL, the code is virtually identical, except that all 4 weights are now non-zero:

ZPL:

   
   procedure rprj3(var S,R: [,,] double);
   begin
     S := 0.5000 * R +
          0.2500 * (R@dir100{} + R@dir010{} + R@dir001{} + 
                    R@dirN00{} + R@dir0N0{} + R@dir00N{}) +
          0.1250 * (R@dir110{} + R@dir1N0{} + R@dirN10{} + R@dirNN0{} +
                    R@dir101{} + R@dir10N{} + R@dirN01{} + R@dirN0N{} +
                    R@dir011{} + R@dir01N{} + R@dir0N1{} + R@dir0NN{})+
          0.0625 * (R@dir111{} + R@dir11N{} + R@dir1N1{} + R@dir1NN{} +
                    R@dirN11{} + R@dirN1N{} + R@dirNN1{} + R@dirNNN{});
     wrap_boundary(S);
   end;
In F90+MPI/CAF, the code is almost identical as well, except that the indexing expressions on r are now multiplied by 2 in order to achieve the difference in scale.
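
To make the scaled indexing concrete, here is a rough, unoptimized C sketch of our own of an rprj3-style projection (the array names and extents are illustrative, the weights are taken from the ZPL code above, and the real implementations use the partial-sum optimization and handle boundaries and data distribution). Each coarse point gathers the 27 fine-grid points centered at twice its indices, weighted by how many dimensions the fine point is offset in:

   #define CN 4                      /* coarse extent per dimension */
   #define FN (2*CN)                 /* fine extent per dimension   */

   static double fine[FN+1][FN+1][FN+1];
   static double coarse[CN+1][CN+1][CN+1];

   /* Weight by the number of dimensions in which the fine point is
    * offset from the center point: 0, 1, 2, or 3. */
   static const double w[4] = {0.5, 0.25, 0.125, 0.0625};

   void rprj3_sketch(void)
   {
       for (int j3 = 1; j3 < CN; j3++)
           for (int j2 = 1; j2 < CN; j2++)
               for (int j1 = 1; j1 < CN; j1++) {
                   double sum = 0.0;
                   /* 27 fine points centered at (2*j1, 2*j2, 2*j3) */
                   for (int d3 = -1; d3 <= 1; d3++)
                       for (int d2 = -1; d2 <= 1; d2++)
                           for (int d1 = -1; d1 <= 1; d1++) {
                               int ndiff = (d1 != 0) + (d2 != 0) + (d3 != 0);
                               sum += w[ndiff]
                                    * fine[2*j3 + d3][2*j2 + d2][2*j1 + d1];
                           }
                   coarse[j3][j2][j1] = sum;
               }
   }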

In HPF, however, the code becomes significantly more complex due to the effort required to properly align the different levels such that the load remains balanced and communication is minimized, requiring ~100 additional lines of code.

CONCLUSIONS

To conclude, let us revisit the line count summary from last week's article:

    language   lines   decls       comp        comm
    --------   -----   ---------   ---------   ------
    F90+MPI     992    168 (16%)   237 (23%)   587 (59%)
    CAF        1150    243 (21%)   238 (20%)   669 (58%)
    HPF         433    129 (29%)   304 (70%)     0 ( 0%)
    ZPL         192     90 (46%)   102 (53%)     0 ( 0%)
Having spent some time looking at the code, we can now see the difference in expressiveness between the local view and the global view. In particular, the communication code required by F90+MPI and CAF (only part of which was shown here, and even that condensed by 1/3) is not only long, but intricate. Anyone with experience debugging parallel programs knows that getting a set of processors working together correctly in a single-level code, let alone a hierarchical multigrid code, can be frustrating and time-consuming. This motivates the design of higher-level languages, like HPF and ZPL, that take care of those details for you. The question is whether the language provides the desired expressiveness and whether the compiler generates adequate parallel performance. In HPF, you may need to spend time profiling and gathering compiler feedback, and even then may not have code that runs efficiently with a different platform or compiler. In ZPL, the goal is to provide portable performance by supplying a syntax-based performance model with which users can understand the parallel implementation of their code.

For more information on this work, contact: brad@cs.washington.edu

For more on the languages, see:

MPI : http://www-unix.mcs.anl.gov/mpi/index.html
CAF : http://www.co-array.org/
HPF : http://www.crpc.rice.edu/HPFF/home.html
ZPL : http://www.cs.washington.edu/research/zpl/

The HPF code excerpted within was developed at NASA Ames and is described in their IPPS '99 paper:

"Implementation of NAS Parallel Benchmarks in High Performance Fortran" Michael Frumkin, Haoqiang Jin, and Jerry Yan IPPS `99

Thanks go to NASA Ames for allowing the code to be excerpted in this article.

An Analysis of Fortran Utilisation

In issue #175 we announced a research project into Fortran utilisation and its accompanying programmer survey.

The project was supported by the ACM's Fortran Forum, and conducted by Niki Reid of The Queen's University of Belfast. The survey is now closed--results and analysis have been published by the Fortran Forum and are available on-line, at:

http://www.cs.qub.ac.uk/~N.Reid

Here are excerpts:

An analysis of Fortran utilisation

N. Reid and J.P. Wray
School of Computer Science
The Queen's University of Belfast
Belfast BT7 1NN
email: niki.reid@acm.org or jp.wray@qub.ac.uk

[ ... ]

2. Languages in Use

A majority of respondents (61%) are still actively coding in Fortran 77, although only a minority (15%) are using it as their primary programming language. The vast majority of users (92%) have upgraded to the more recent dialects of Fortran 90 and Fortran 95 (80% and 51% respectively, with 42% of those who have upgraded using both dialects). More interesting was the fact that a considerable number of respondents (61%) who are coding in Fortran 77 appear to be using the compilers for these later dialects to compile their Fortran 77 code. This is certainly what the designers of the language intended when they retained Fortran 77 as a strict subset.

[ ... ]  

Given the general perception of Fortran users refusing to use any language other than Fortran, it came as a surprise to discover that over three-quarters of respondents were, in fact, using Fortran in conjunction with other languages. Principal among these were C(47%), C++(26%) and Visual Basic(19%) along with a conglomerate of other languages. Comments provided indicate that Fortran is being used to provide the 'number crunching' facilities behind programs written in these other languages.  

[ ... ]  

The Fortran standards committee has indicated its intention to remove features (listed in the Fortran Standard [ISO/IEC 1997] as 'Obsolescent Features') for which alternate methods of implementation have been provided. The survey has, however, shown that users' requirement of conformance with Fortran 77, and their desire to use the compilers of post-Fortran 77 dialects to compile Fortran 77 code, will present an obstacle to such a 'cleaning' exercise.

[ ... ]  

3.1 Parallelism

Just under a quarter of the respondents (23%) were using parallel architecture machines. The breakdown of parallel machine users by machine memory model is as follows:


      Shared     Distributed     Both Shared & Distributed
       59%           41%                  24%
Of the supercomputing machines in use, Cray had attracted the most significant slice of the market (94%). Of particular interest among these groups was that, while the vast majority of distributed-memory users were using MPI (92%), only 17% of parallel users were using HPF. Since the vast majority of HPF users were utilising both parallel architectural models, no further characterisation of HPF can be carried out here.

[ ... ]  

Quick-Tip Q & A


A: {{ I tried to authenticate using Kerberos/SecurID and got this message:
   {{
   {{     kerberos skew too great
   {{
   {{ What does this mean?


    Thanks go to Kevin Kennedy for the answer:
    
    The local clock differs by more than the allowed number of seconds
    from the Kerberos server's clock.  If you reset the clock on your
    local machine, this problem will go away.



Q: I love the T3E, but sometimes I relax with me olde Cray PVP.
   How can I estimate my job's memory utilization so I can make an
   accurate NQS request (and get my job to start sooner)?

[ Answers, questions, and tips graciously accepted. ]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.