ARSC T3D Users' Newsletter 95, July 12, 1996

Imagine The Perfect MPP... (ARSC T3D User Survey)

As we discuss different configurations for our expected T3E, we are gathering user input. Our goal is to learn more about your MPP needs, both current and projected, to help guide our decisions.

Consider this: We face an inevitable tradeoff between number of PEs and amount of memory per PE. How do you come down on this?

Below (if you want to really get into it) are some "survey" questions. We would appreciate your answers to these, along with your opinions on the "perfect" T3E configuration.

Contact Tom Baring at baring@arsc.edu or 907-474-1899.

ARSC T3D User Survey

  1. How much memory does your code require, overall? Is the current 64 MB per PE sufficient? Too much? How much memory could you effectively use per PE? Do you now feel "forced" to use extra PEs in order to meet memory demands?
  2. Is your code compute-intensive? "Embarrassingly parallel?" How would you benefit from more PEs or be hurt by fewer?
  3. Is your code communication-intensive? What percentage of its time is spent synchronizing and/or sending data between PEs?
  4. Is your code file-intensive? What percentage of its time is spent reading/writing to disk?
  5. Does your code run efficiently on a large number of PEs (>8)? How well does it "scale?"
  6. How often do you need access to more than 4 interactive PEs? What work do you do interactively?
  7. Do you like ARSC's MPP batch queue structure? Does it meet your needs?
  8. What are your projected storage needs? Do you need long term file storage at ARSC? How useful do you find CRL?
  9. Which programming language(s) and communications package(s) do you use?
  10. Why have you chosen the T3D as your platform?
  11. What additional services would you like ARSC to provide? What practices would you like us to discontinue?
  12. Do you expect your future usage to mimic your past usage?

m_64pe_24h Queue Priority Adjusted

We increased the priority of the m_64pe_24h queue this week, causing it to be searched in advance of the 8, 16, and 32 PE queues. This was in response to a persistent situation on Tuesday, in which half of the T3D's PEs remained idle, although 64pe jobs were queued. This vexing situation is shown in the following outputs (captured on Tuesday) from mppview and qstat:

  denali$ mppview -L
   _________________________________________________________________________
   \ .        .        .        .        .        .        .        .       \
    \ .        .        .        .        .        .        .        .       \
     \________________________________________________________________________\
   _________________________________________________________________________
   \ .        .        .        .        .        .        .        .       \
    \ .        .        .        .        .        .        .        .       \
     \________________________________________________________________________\
   _________________________________________________________________________
   \ AAA      AAA      BBB      BBB      CCC      CCC      CCC      CCC     \
    \ AAA      AAA      BBB      BBB      CCC      CCC      CCC      CCC     \
     \________________________________________________________________________\
   _________________________________________________________________________
   \ AAA      AAA      BBB      BBB      CCC      CCC      CCC      CCC     \
    \ AAA      AAA      BBB      BBB      CCC      CCC      CCC      CCC     \
     \________________________________________________________________________\

  Part    User    PID    Program  State  Flags    Shape- YZX   (base)   Elapsed
  ----  -------- ------ -------- ------ ------ ----------------------- ---------
    26  AAA       61728 hib4A    Active B        16= 2x 2x 4   (0x200)   1:05:25
    37  CCC       63803 hib6B    Active B        32= 2x 2x 8   (0x208)   0:27:39
    48  BBB       55270 hib1C2   Active B        16= 2x 2x 4   (0x204)   3:12:26


  denali$ qstat -a | grep pe_

  -----------------------------
  NQS 1.1 BATCH REQUEST SUMMARY                
  -----------------------------                   
  IDENTIFIER    NAME    USER  QUEUE                 JID  PRTY REQMEM REQTIM ST
  ------------- ------- ----- --------------------- ---- ---- ------ ------ ---
  99044.denali  R011t  AAA    m_16pe_24h@denali     94875   20   3315  86361 R10
  99033.denali  R060t  BBB    m_16pe_24h@denali     94515   20   7405  86371 R10
  98989.denali  R128t  CCC    m_32pe_24h@denali     93635   20   9495  85565 R14
  99048.denali  runcp  EEE    m_32pe_24h@denali            499  16384  86400 Qqr
  99003.denali  runcp  EEE    m_64pe_24h@denali            286  16384  86400 Qce
  98995.denali  R013t  AAA    m_64pe_24h@denali            377  12288  86400 Qce
  99037.denali  run12  DDD    m_128pe_8h@denali            ---   2048    500 Qqs

There are a few reasons why this situation occurred and then perpetuated itself:

  1. ARSC currently limits the batch queues to 120 PEs total. This guarantees that at least 8 PEs will be available to (or in use by) interactive jobs at all times. However, as shown above, it also blocks the 64pe queue when two 16pe jobs and a 32pe job are running: those jobs hold 64 PEs, leaving only 56 of the 120 batch PEs free.
  2. Prior to the adjustment, the queues were searched in order of size, from smallest to largest. Whenever a 16pe or 32pe job terminated, another could "sneak in" before the 64pe job got to run, effectively replacing one block with another. In the case above, if CCC's job terminated, EEE's 32pe job would replace it immediately. The 64pe jobs would then still be blocked by the 120 PE complex limit.
  3. The T3D use pattern seems to have shifted slightly. On Tuesday, 6 jobs were submitted to the 32pe queue (about twice "normal" for the last month) and 8 jobs were submitted to the 16pe queue (also about twice "normal"). This created a demand backlog, ready to block the 64pe jobs.

By changing the queue priorities and letting 64pe jobs go first, we will, of course, at times block 16pe and 32pe jobs (which must also abide by the 120 PE complex limit). However: better 1/4 or 1/8 idle than 1/2. (Unless it's my 16pe job... But you can't please everybody!)
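
The effect of the search order can be sketched in a few lines of C. This is only an illustration of the reasoning above, not NQS code: the function name and the macro are made up for the example, and it ignores per-queue run limits. The only figure taken from this newsletter is the 120 PE batch limit.

```c
#include <stddef.h>

/* Complex limit on batch PEs, as described in the text. */
#define BATCH_PE_LIMIT 120

/*
 * Hypothetical model of one scheduling pass: scan the waiting job
 * sizes in the order given (the queue search order) and return the
 * size of the first job that fits under the batch limit, or 0 if
 * none of them fits.
 */
int first_job_to_start(const int *waiting, size_t n, int pes_in_use)
{
    for (size_t i = 0; i < n; i++) {
        if (pes_in_use + waiting[i] <= BATCH_PE_LIMIT)
            return waiting[i];
    }
    return 0;
}
```

With 32 PEs still in use after a 32pe job ends, scanning the waiting sizes as {32, 64, 64} starts another 32pe job, after which neither 64pe job fits. Scanning them as {64, 64, 32} starts a 64pe job instead, and the 32pe job then has to wait.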

As an aside, the "ST" column, above, is the job's status code. Here is an explanation of the codes in that column (see "man qstat" for other codes).

If the first status character is "R" then the job is running. If the "R" is followed by a number, then that is the number of currently active processes started by that request.

If the first character is "Q" then: "the request is in a queue and is eligible for routing or running." The two characters which follow indicate:


  ce   (Cray MPP systems only.) The complex Cray MPP 
       processing element (PE) limit was reached.
  qr   The queue run limit was reached.
  qs   The queue in which the request resides was stopped.

(The m_128pe_8h queue is "stopped" except from Fridays at 6:00 PM to Sundays at 4:00 AM.) See Newsletter #74 or execute the command "qstat -m" for more information on the current T3D queue structure.
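
For anyone post-processing qstat output, the codes explained above could be decoded along these lines. This is a minimal sketch: describe_status is a hypothetical helper, not part of NQS, and it covers only the codes discussed here.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of NQS): write a short description
 * of an ST code from "qstat -a" into buf and return buf.
 * Only the codes explained in this newsletter are handled.
 */
const char *describe_status(const char *st, char *buf, size_t buflen)
{
    if (st[0] == 'R' && isdigit((unsigned char)st[1]))
        snprintf(buf, buflen, "running, active processes: %s", st + 1);
    else if (st[0] == 'R')
        snprintf(buf, buflen, "running");
    else if (strcmp(st, "Qce") == 0)
        snprintf(buf, buflen, "queued: complex MPP PE limit reached");
    else if (strcmp(st, "Qqr") == 0)
        snprintf(buf, buflen, "queued: queue run limit reached");
    else if (strcmp(st, "Qqs") == 0)
        snprintf(buf, buflen, "queued: queue is stopped");
    else
        snprintf(buf, buflen, "not covered here; see \"man qstat\"");
    return buf;
}
```

The caller supplies the buffer, so the helper needs no allocation. A buffer of 64 characters is enough for every message it produces.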

cf77 and cc Invoke Different MPP Loaders

The "cf77" command uses "mppldr." The "cc" command uses "mppld." (Similarly, when compiling for the Y-MP, "cf77" uses "segldr" while "cc" uses "ld.")

These two loader interfaces differ in command line inputs and default settings. One difference is in their treatment of unsatisfied external references: if a program references externals that cannot be found, mppldr still produces an executable, while mppld does not mark its output as executable. This explains the "glitch" I mentioned last week, encountered while linking MPICH v1.0.13 (which has two unsatisfied externals).

The "Cray MPP Loader User's Guide" (SG-2514 1.1) states:


  The mppldr command provides a simple invocation method in which the
  loader handles many of the requirements of loading your program.
  The mppld command provides a traditional UNIX interface in which
  you must provide more information to the loader to load your
  program correctly.

And:

  2.6 Differences between mppldr and mppld

  In addition to differences in command-line invocation formats,
  mppldr and mppld vary in other ways.  Table 3 summarizes these
  differences.

                    Table 3.  mppldr and mppld differences

  Feature                       mppldr                  mppld

  Default directives file  /lib/segdirs/mpp_def_seg  /lib/segdirs/mpp_def_ld

  Environment variable          MPP_SEGDIR              MPP_LDDIR
  processing

  Object file processing        All object file         All .o files
                                names are included      (sequential object
                                as bin files.           files) are included
                                                        as bin files.  All
                                                        .a files (library
                                                        object files) are
                                                        included as lib
                                                        files.

  Default setting of            DUPENTRY=CAUTION        DUPENTRY=CAUTION:
  DUPENTRY directive            :CAUTION:NOTE           NOTE:NOTE.
                                                        Because of the
                                                        different dupentry
                                                        setting, and the
                                                        practice of
                                                        including library
                                                        object files as lib
                                                        files, mppld issues
                                                        fewer diagnostic
                                                        messages about
                                                        duplicated entry
                                                        point names than
                                                        mppldr.

  Default setting of            DUPORDER=OFF.  The      DUPORDER=ON.  An
  DUPORDER directive            first definition of     ordered search
                                an entry point is       algorithm is used.
                                chosen, regardless      The entry point that
                                of the definition's     mppld chooses depends
                                location.               on the order of
                                                        definitions and
                                                        references.  See
                                                        "DUPORDER
                                                        directive," for more
                                                        information.

  Default system libraries      A list of default       No default libraries
                                libraries is            are included.  You
                                included.  Most         must specify all
                                common system           libraries required
                                routines are            by your program.
                                included in these
                                libraries.

  Default setting for           USX=CAUTION.  A         USX=WARNING.  A
  USX directive                 program that            program that
                                contains unsatisfied    contains unsatisfied
                                external references     external references
                                is still executable     is not executable
                                and mppldr exits        and mppld exits 
                                normally.  Calls to     with a nonzero error
                                unsatisfied             status.
                                references are
                                intercepted when the
                                program is run.

  Default setting for           FORCE=OFF.  Modules     FORCE=ON.  All
  FORCE directive               in bin files are        modules encountered
                                included in the         in bin files are
                                executable program      included in the
                                only if they are        executable program,
                                referenced, contain     whether or not the
                                a main program, or      modules are
                                initialize global       referenced.
                                data.

It's probably not too smart to ignore compiler/linker warnings. You know EXACTLY when your program will take a different execution path and hit that unsatisfied external. (During a demo, of course.)

But anyway, here's a little example which uses unsatisfiable externals. Note that a similar FORTRAN program, compiled with cf77 (thus loaded by mppldr), is immediately executable.


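  /* goodbye.c: references an external that is never defined */
  #include <stdio.h>

  int main (int argc, char **argv) {
    extern int Unsatisfied_External();
    int i;

    /* The unsatisfied external is called only if an argument is given. */
    if (argc > 1)
        i=Unsatisfied_External();

    printf ("goodbye \n");
    return 0;
  }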

  denali$ TARGET=cray-t3d; MPP_NPES=2

  denali$ cc goodbye.c
    [ Messages about unsatisfied externals and the statement:
     mppldr-112 cc: WARNING
         Because of previous errors, file 'a.out' is not executable.  ]

  denali$ a.out
    ksh: a.out: cannot execute

  denali$ chmod 700 a.out ; a.out
    [ Messages about unsatisfied externals ... and then a correct run]

  denali$ cc -c goodbye.c   

  denali$ mppldr goodbye.o
    [ Messages about unsatisfied externals ...  ]

  denali$ a.out
    [ Messages about unsatisfied externals ... and then a correct run]
    
  denali$ a.out XXX   
    [ Messages about unsatisfied externals ... then a crash, because 
      it tries to execute "i=Unsatisfied_External()".  ]
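
One conventional way to avoid running with an unsatisfied external is to link in a stand-in definition until the real routine is available. The sketch below is general C practice, not something taken from the Cray loader guide. The stub's message and its return value of -1 are assumptions made for illustration.

```c
#include <stdio.h>

/*
 * Stand-in for the routine that goodbye.c references but never
 * defines. It reports that it was called and returns -1, an
 * arbitrary value chosen for this sketch.
 */
int Unsatisfied_External(void)
{
    fprintf(stderr, "Unsatisfied_External: stub called, no real routine linked\n");
    return -1;
}
```

Compiled into its own object file and named on the link line along with goodbye.o, a stub like this should leave no unsatisfied external for either loader to report, and "a.out XXX" should print the stub's message rather than crash.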

Announcement: 11th International Parallel Processing Symposium

I found this information on the WWW. See the URL:

http://cuiwww.unige.ch/~ipps97/
  
    11th International Parallel Processing Symposium 
                    1-5 April 1997 
            University of Geneva, Switzerland 
  
 Important Dates:
     20 September 1996  ..... Manuscripts Due 
     13 December 1996   ..... Review Decisions Mailed                
     20 January 1997    ..... Print Ready Paper Due
     30 August 1996     ..... Workshop Proposals Due
     31 October 1996    ..... Tutorial Proposals Due
     31 October 1996    ..... Commercial Exhibit Registration
     
     
 Call for Participation:
  
     Sponsored by the Technical Committee on Parallel Processing,
     the symposium is the committee's primary forum for engineers
     and scientists from around the world to present their latest
     research findings in the field. In addition to technical
     sessions of submitted paper presentations, IPPS '97 will offer
     workshops, tutorials, an industrial track, and commercial
     exhibits.
 
     Our University of Geneva hosts are making arrangements for
     on-campus housing as well as specially priced nearby hotel
     accommodations. Also through the University, IPPS will be able
     to provide daily luncheon in addition to the usual breaks and
     refreshments. Full details will be available in the Advance
     Program.
  
     Also in 1997, PARCON, the one-day Symposium on New Directions
     in Parallel and Concurrent Computing, will co-locate with
     IPPS.  To accommodate their inclusion, workshops & tutorials
     will be held the first and last days, papers will be presented
     in technical sessions on the second and third days, and PARCON
     will be presented on the fourth day.

Correction: Table Headers on MPI Timings

In both Newsletter #92 and #94, I put incorrect information into the headers of the tables reporting timing results for EPCC/MPI and MPICH.

My table headers were incorrect in two ways. The values are times (not "rates"), in microseconds, to transfer one entire buffer of the given size (not "value" of the given type). A sample correct header would look like this (changes are single-quoted):

Table 1

Transfer 'TIMES' (PE to PE, no ACK) in Microseconds Per REAL*4 'BUFFER' Obtained from /mpp/bin/f90 "ring.simple.f"

Thanks go to the two readers who reported this.
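
For readers who do want a rate, a per-buffer time converts directly. The helper below is a sketch under two assumptions: the buffer holds REAL*4 values of 4 bytes each, and "MB" means 10^6 bytes. The function name is hypothetical and does not come from the benchmark codes.

```c
/*
 * Hypothetical helper: convert a per-buffer transfer time into a
 * rate. n_reals is the number of REAL*4 values in the buffer and
 * time_us is the transfer time in microseconds (must be > 0).
 * Bytes per microsecond equals 10^6 bytes per second, so the
 * quotient is already in MB/s (with MB = 10^6 bytes).
 */
double buffer_rate_mb_per_s(long n_reals, double time_us)
{
    double bytes = 4.0 * (double)n_reals;
    return bytes / time_us;
}
```

As a made-up illustration (not a measured result), a buffer of 1000 REAL*4 values transferred in 100 microseconds works out to 4000 / 100 = 40 MB/s.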

ARSC T3D Users' Group Met on July 11th

About 30 local users showed up for Frank Chism's presentation on the T3E. We got some good insights on an exciting machine. Stay tuned. And in the meantime, I would encourage everyone to check out CRI's T3E WWW page:

  http://www.cray.com/PUBLIC/product-info/T3E/CRAY_T3E.html

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.