ARSC T3D Users' Newsletter 20, January 30, 1995

ARSC T3D Upgrades

The next month or two will be a busy time for the ARSC T3D. We will be upgrading the following:

  1. 2MW to 8MW per PE, tentatively set for February 7th and 8th
  2. The T3D OS, MAX 1.1 to MAX 1.2, set for January 31st
  3. The T3D Programming Environment (libraries, tools, and compilers), P.E. 1.1 to P.E. 1.2, sometime in the next two months.
Users will be notified of the exact dates in mailings to the ARSC T3D Users' Group (i.e., those who receive this newsletter).

Next T3D Class at ARSC

Introduction to Programming the CRAY T3D

Dates: February 8 - 10, 1995
Time: 9:00 AM - noon, 1:00 - 5:00 PM
Location: University of Alaska Fairbanks main campus, room TBA
Instructor: Mike Ess, Parallel Applications Consultant

Course Description: To satisfy increasing computational demands, computers of the future must have multiple processors executing the same program. The Cray T3D is a step in this direction. The Cray T3D, an MPP or Massively Parallel Processor, consists of 128 processors attached to the Cray Y-MP.

This class will cover the characterization and history of MPPs. With this background, students will see how the T3D approaches the problem of executing a program in parallel. The class will cover the three programming paradigms for extracting parallelism:

  1. Data-sharing, as with Fortran 90
  2. Work-sharing, as with Craft Fortran, and
  3. Message-passing, as implemented with PVM or shmem
The primary goal is to provide practical experience in getting codes up and running efficiently on the T3D. Examples used in the class can be used as models for application programs on the T3D. Also covered will be:
  1. Performance measurement and tools
  2. Debugging techniques and tools
This class will have directed lab sessions, and users will have an opportunity to have their applications examined with the instructor.

Intended Audience: Researchers who will be developing programs to run on the T3D, and current users of the T3D who want a comprehensive, up-to-date survey of programming on the T3D.

Prerequisites: Applicants should have a denali userid or be in the process of applying for one. Applicants should be familiar with programming in Fortran or C on a UNIX system.

Application Procedure

There is no charge for attendance, but enrollment will be limited to 15. In the event of greater demand, applicants will be selected by ARSC staff based on qualifications, need, and order and completeness of application. The class may be cancelled if there are fewer than 5 applicants.

Send e-mail to with the following information:

  • course name
  • your name
  • UA status (e.g., undergrad, grad, Asst. Prof.)
  • institution/dept.
  • phone
  • advisor (if you are a student)
  • denali userid
  • preferred e-mail address
  • describe programming experience
  • describe need for this class

I/O on the T3D and Y-MP

In investigating the prospect of implementing Phase II I/O on the T3D, we at ARSC have begun measuring I/O speeds on both the Y-MP and the T3D. Measuring I/O is complicated because:
  1. It is always implemented with shared resources:
    1. shared physical disks
    2. shared I/O devices
    3. shared system buffers
  2. It depends on the operating system to service user requests and the availability of the OS depends on system load.
  3. Its environment is not uniform across Y-MP systems:
    1. what physical devices?
    2. SSD or BMR or LDcache in memory?
    3. T3D or Y-MP?
  4. A rich set of user options is available:
    1. formatted or unformatted?
    2. sequential or direct?
    3. record size large or small?
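
To make the formatted-versus-unformatted distinction concrete, here is a small sketch -- in Python on a present-day system, not on the Y-MP, and with made-up file names -- of the same array of words written both ways:

```python
import os
import struct
import tempfile

# Hypothetical example: 8192 "words" (8-byte values), written two ways.
data = list(range(8192))

with tempfile.TemporaryDirectory() as d:
    unf_path = os.path.join(d, "unformatted.bin")
    fmt_path = os.path.join(d, "formatted.txt")

    # "Unformatted": the raw 8-byte words, one buffered write, no conversion.
    with open(unf_path, "wb") as f:
        f.write(struct.pack("%dq" % len(data), *data))

    # "Formatted": every word converted to decimal text, one value per line.
    with open(fmt_path, "w") as f:
        for x in data:
            f.write("%d\n" % x)

    unf_size = os.path.getsize(unf_path)
    fmt_size = os.path.getsize(fmt_path)

print("unformatted: %d bytes, formatted: %d bytes" % (unf_size, fmt_size))
```

The unformatted file is exactly 8192 * 8 = 65536 bytes and moves with no conversion; the formatted file pays a binary-to-decimal conversion for every word, which is why unformatted I/O is the faster choice whenever the file never needs to be human-readable.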
There are so many options that it is sometimes just easier to ignore the whole situation, except that eventually everyone has to do I/O, and most likely the larger the problem, the more I/O becomes a bottleneck. So to get started I wanted to present some speeds contrasting I/O on the T3D and the Y-MP. Below is a table of the speeds (in MW per second) of reads and writes on the Y-MP to a file on the /u1 file system (a typical home directory) and on the /tmp file system (a larger, faster file system where users are encouraged to work). In Table 1 we have Y-MP speeds of unformatted reads and writes for arrays of increasing size:

Table 1

  Y-MP speeds (MW/sec)
  for unformatted I/O on two different file systems

  array size    /u1 file system    /tmp file system
  (in words)    reads    writes    reads    writes
  ----------   ------    ------   ------    ------
        1024   13.272    13.118   13.289    13.136
        2048   22.613    22.456   21.866    22.428
        4096   33.306    33.764   33.561    33.761
        8192   45.969    46.082   45.968    45.948
       16384   55.130    55.903   55.163    54.907
       32768    0.365    12.266   25.676    25.878
       65536    1.103     1.319   25.725    23.909
      131072    1.014     0.874   22.260    22.037
      262144    0.912     0.967   21.106    21.978
Of course speed increases with the size of the transfer, but only while the size of the buffer is larger than the size of the transfer. The high speed on the /tmp file system is due (among other things) to LDcache, which is like a RAM disk used as a buffer, carved out of some of the 1GW of memory on ARSC's Y-MP. (So higher I/O speed is another reason why users should work out of /tmp rather than their home directories, which are not LDcached.)
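
The shape of these measurements is easy to reproduce today. The following is not the Cray harness, just a present-day Python analogue of the Fortran timing loop used for these tables; the path is made up, and os.fsync stands in for whatever pushes the data past the system buffers:

```python
import os
import struct
import tempfile
import time

def write_speed_mw(path, nwords):
    """Time one unformatted write of nwords 8-byte words; return MW/sec."""
    buf = struct.pack("%dd" % nwords, *([1.0] * nwords))
    with open(path, "wb") as f:
        t0 = time.perf_counter()
        f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # force the data past the OS buffers, as a real disk write must
        elapsed = time.perf_counter() - t0
    return nwords / elapsed / 1.0e6

with tempfile.TemporaryDirectory() as d:
    for n in (1024, 4096, 16384, 65536, 262144):
        mw = write_speed_mw(os.path.join(d, "fort.12"), n)
        print("%8d words  %8.2f MW/sec" % (n, mw))
```

As in Table 1, the interesting part is where the curve turns over: once a transfer no longer fits in whatever buffer sits between the program and the disk, the speed of the underlying device shows through.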

Next, we compare the Y-MP uniprocessor speeds to those of a single T3D PE running the same program:

Table 2

  Y-MP and T3D speeds (MW/sec)
  for unformatted I/O on the /tmp file system

  array size         Y-MP                T3D
  (in words)    reads    writes    reads    writes
  ----------   ------    ------   ------    ------
        1024   13.289    13.136    3.806     5.144
        2048   21.866    22.428    6.046     6.171
        4096   33.561    33.761    7.220     7.301
        8192   45.968    45.948    7.974     8.025
       16384   55.163    54.907    8.400     8.426
       32768   25.676    25.878    3.181     3.292
       65536   25.725    23.909    3.309     3.208
      131072   22.260    22.037    2.833     2.740
      262144   21.106    21.978    2.853     2.819
The big difference between I/O on the Y-MP and the T3D is that I/O on the T3D is done by the mppexec agent, which is just another Y-MP job competing with all the other Y-MP jobs in the mix (ARSC's Y-MP consistently runs at more than 95% utilization). The degradation beyond 16K-word transfers on the T3D must be due to some buffer other than LDcache, because both sets of writes go to files on the /tmp file system.

In all the above timings I am running something like:

      parameter( ia1size = 1024 )
      real a1( ia1size )
c     unit 12 matches the assigned file fort.12
      iun = 12
      call asnunit( iun, '-a /tmp/ess/fort.12', ier )
      open( iun, form = 'unformatted' )
      t1 = rtc()
        write(iun) a1
      t1 = rtc() - t1
c     convert rtc() clock periods to seconds (6.666 nsec per period)
      time = t1 * 6.666e-9
c     words per second -> MW per second
      speed = ia1size / time / 1.0e6
and in this case the compiler knows the length of the write from the declaration of the array a1. But if I rewrite the write statement in a more flexible way:

  write( iun ) ( a1( i ), i = 1, ia1size )
now the performance on the T3D goes to hell, because the T3D compiler doesn't treat the implied DO loop in the I/O statement as a special case the way the Y-MP compiler does.

Table 3

  T3D and Y-MP speeds (MW/sec)
  for unformatted I/O on /tmp with the implied DO read/write construct

  array size          T3D                Y-MP
  (in words)    reads    writes    reads    writes
  ----------   ------    ------   ------    ------
        1024    0.066     0.067   13.361    13.480
        2048    0.067     0.067   22.439    22.442
        4096    0.067     0.067   33.953    34.211
        8192    0.067     0.067   46.392    46.397
       16384    0.067     0.067   55.360    55.217
       32768    0.066     0.066   25.136    25.456
       65536    0.066     0.066   24.863    25.043
      131072    0.066     0.066   21.975    21.816
      262144    0.066     0.066   21.987    22.012
I'm sure this is only a temporary difference between the Y-MP and T3D compilers and that in the future the T3D compilers will be as smart as the Y-MP compiler for such an implied do loop on the I/O construct. I/O is a complicated situation and if you find some insight or technique, I'm sure we'd all like to hear about it.
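
The cost of taking the implied DO element by element is also easy to reproduce today. Here is a sketch -- Python on a present-day system, not Cray Fortran, with a made-up size -- contrasting one whole-array transfer with the one-transfer-per-element pattern that, per the discussion above, the T3D run is in effect paying for:

```python
import os
import struct
import tempfile
import time

nwords = 262144
a1 = [1.0] * nwords

with tempfile.TemporaryDirectory() as d:
    # Whole-array form: pack once, one library call -- like "write(iun) a1".
    t0 = time.perf_counter()
    with open(os.path.join(d, "whole.bin"), "wb") as f:
        f.write(struct.pack("%dd" % nwords, *a1))
    t_whole = time.perf_counter() - t0

    # Element-wise form: one tiny transfer per word -- like an un-special-cased
    # "write( iun ) ( a1( i ), i = 1, ia1size )".
    t0 = time.perf_counter()
    with open(os.path.join(d, "elems.bin"), "wb") as f:
        for x in a1:
            f.write(struct.pack("d", x))
    t_elem = time.perf_counter() - t0

print("element-wise took %.1fx as long as the whole-array write" % (t_elem / t_whole))
```

Both files end up with identical bytes; only the per-element call overhead differs, and that overhead is what separates the two halves of Table 3.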


List of Differences Between T3D and Y-MP

The current list of differences between the T3D and the Y-MP is:
  1. Data type sizes are not the same (Newsletter #5)
  2. Uninitialized variables are different (Newsletter #6)
  3. The effect of the -a static compiler switch (Newsletter #7)
  4. There is no GETENV on the T3D (Newsletter #8)
  5. Missing routine SMACH on T3D (Newsletter #9)
  6. Different Arithmetics (Newsletter #9)
  7. Different clock granularities for gettimeofday (Newsletter #11)
  8. Restrictions on record length for direct I/O files (Newsletter #19)
  9. Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
I encourage users to e-mail in differences that they have found, so we all can benefit from each other's experience.

In Newsletter #18 there is a list of CRI T3D optimization articles available from ARSC.

In Newsletter #19 there is a list of CUG articles on the T3D available from ARSC.

Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.