ARSC T3D Users' Newsletter 15, December 16, 1994

Upgrade on ARSC T3D Software

ARSC upgraded the T3D software to CrayLib_M 1.1.1.2, MAX 1.1.0.4 and SCC_M 4.0.2.11 on December 11th. There have been no problems detected by ARSC testing or reported by users.

PE Limits

Yesterday, notices were sent out to users having more than the default configuration of
  • 8 PEs for a maximum of 1 hour in interactive mode
  • 32 PEs for a maximum of 24 hours in batch mode
On Monday afternoon, December 19th, we will reset those users' allocations to the default values. Users should contact Mike Ess if they have any questions about this.

Linda on the T3D

From CRI, I received a promotional announcement about the programming language Linda. I can e-mail this on to anyone interested. Is there anyone out there interested in Linda on ARSC's T3D?

New SHMEM Paper

From a user, I received a copy of "SHMEM User's Guide for C" by Ray Barriuso and Allan Knies, Revision 2.2. It seems to be a replacement for the "SHMEM Users' Guide" by the same authors, Revision 2.0. I can e-mail this to anyone who is interested.

Phase II I/O on the T3D

ARSC is evaluating the effort of moving from the current Phase I I/O to Phase II I/O on the T3D. In future newsletters I can summarize the differences, but for now I would like to ask whether any ARSC users are interested in this upgrade or would like to take part in the evaluation.

In C, the Timing Routine rtclock()

By accident I found a new timing routine for the T3D, rtclock(). I hesitate to add a new line to the table produced in Newsletter #12, but for the C programmer I think this function adds real functionality. There is a man page on denali for rtclock, but briefly, it is callable from all CRI platforms and returns the value of the real-time clock (RTC). In this way it is similar to the Fortran routines RTC and IRTC. Because there is no multiprogramming on the T3D PE we have:

  CPU time = Wallclock time
and so we can use rtclock to accurately measure CPU time on the T3D. The Fortran wrapper previously needed to access RTC or IRTC from C is no longer necessary, and its overhead is gone too. rtclock can be used as:

  long t1, t2, rtclock();
  double cputime;

  t1 = rtclock();

  /* event to measure */

  t2 = rtclock();
  cputime = (t2 - t1) / 150000000.0;  /* time = clock ticks / clock rate (150 MHz) */
The updated table (with corrected granularities for RTC and IRTC) is now:

  Table of timers available on the T3D and Y-MP (um = microseconds)

  timer         Wallclock     Fortran  T3D or     Granularity         Resolution
                or CPU timer  or C     Y-MP      T3D      Y-MP      T3D      Y-MP

  irtc          wallclock     Fortran  both    ~.187um  ~.133um
  rtc           wallclock     Fortran  both    ~.867um  ~.133um
  tsecnd        CPU           Fortran  both                       10000um   3um
  gettimeofday  wallclock     C        both    ~2500um    ~30um
  second        CPU           Fortran  Y-MP                           1um   5um
  rtclock()     wallclock     C        both      ~1um     ~.2um
                CPU (on T3D)

Communication Between the T3D and the Y-MP

In newsletter #7, I described the reason that communication between the T3D and the Y-MP incurred a large system overhead and therefore should be avoided. One of the reasons for avoiding communication with the Y-MP was that it was slow, and the timings from the example below show this. Once we understand the basic problems, we can go on to the more exotic solutions in future newsletters.

As part of the general distribution of PVM from Oak Ridge National Labs there is a collection of example programs. One of these examples does basic timings of PVM sends and receives from one master processor to one slave processor. I have modified that source to time PVM calls between Denali and the T3D.

There is one C program, timing.c, that runs on Denali initiating the sends. A second program, timing_slave.c, runs on a single PE of the T3D, receiving each send and passing an acknowledgment back to the program running on Denali.

A makefile that makes the two programs and runs them is shown below. (All the source for this example is in /usr/local/examples/mpp/timers on denali.)


  ARCH = CRAY
  CCY-MP=cc -Tcray-ymp
  LDY-MP=segldr
  CCT3D=cc -X 1 -Tcray-t3d
  LDT3D=/mpp/bin/mppldr
  CFLAGS=-O -c
  PVMDIR=/u1/uaf/ess/pvm3
  #NNN = user's uid
  NNN = 

  all:    timing timing_slave run

  timing: timing.c
        $(CCY-MP) $(CFLAGS) -I/usr/include/mpp timing.c
        $(LDY-MP) -o timing timing.o -L/usr/lib  -lpvm3

  timing_slave:   timing_slave.c
        $(CCT3D) $(CFLAGS) -I/usr/include/mpp timing_slave.c
        $(LDT3D) -o timing_slave timing_slave.o -lpvm3
        cp timing_slave $(PVMDIR)/bin/$(ARCH)

  run:
        -rm /tmp/pvmd.$(NNN) /tmp/pvml.$(NNN)
        pvmd3 &
        sleep 1
        /bin/time timing > results
        echo halt | pvm

  clean:
        -rm -f *.o timing timing_slave core
When run with the environment variable TARGET set to cray-ymp, the makefile will:
  • create the programs
    • make the Y-MP executable timing
    • make the T3D executable timing_slave
    • move timing_slave to the directory from which timing will spawn it
  • run the programs
    • remove the pvm log files from previous runs (users must change the NNN to their own uid number)
    • initiate the pvm daemon in the background
    • sleep for 1 second to allow the pvm daemon to establish
    • execute the master program with results being saved to a file
    • finally, initiate the pvm console and kill the pvm daemon with a halt command

Results

The timing programs measure two quantities: the time for a minimal message to make the round trip from the Y-MP to the T3D and back to the Y-MP, and the times for a series of sends and receives of messages of increasing size. From the latter series we can derive a speed measurement in megabytes per second. For comparison, the following table also includes the times for two other PVM configurations:

                  Y-MP to T3D  T3D to T3D  Indy to Indy
                              (PE0 to PE1) (Ethernet)

  time for round trip
    (in microseconds)  13918        2289        2486
  speed for message
          size (MB/s)
      100 bytes         .014        .044        .071
     1000 bytes         .125        .451        .558
    10000 bytes         .735       4.444       1.641
   100000 bytes        1.192      33.829       1.910
  1000000 bytes        2.000      89.783       2.000
The timings between PE0 and PE1 are special because PE(N) and PE(N+1), for N even, reside on the same node and share much of the same hardware. In the next newsletter we'll measure more of these PVM timings between PEs.

Reminders

List of Differences Between T3D and Y-MP:

The current list of differences between the T3D and the Y-MP is:

  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.