ARSC T3D Users' Newsletter 24, February 24, 1995
ARSC T3D Upgrades
In the next month we will be upgrading the T3D Programming Environment (libraries, tools and compilers) from P.E. 1.1 to P.E. 1.2.
We are also planning to install CF90 and C++ for the T3D in the next few months. A description of CF90 was given in newsletter #23 and a very complete description of C++ for the T3D is given on the CRI World-Wide Web page:
http://www.cray.com/PUBLIC/product-info/sw/C++/C++.html
I am interested in hearing from users who want to use the CF90 and C++ products as soon as they are available.
Upgrade to the T3D Memory
On February 7th, ARSC upgraded the memory on each PE from 2MWs to 8MWs. If any users have questions about this, please contact Mike Ess. We have run into problems with the user limit for mppcore size, which is now set too low for the 8MW nodes. We will be changing to a larger default size in the future, but if you run into the message:
mppexec: user UDB core limit reached, mppcore dump terminated
then call us and we will increase your mppcore limits. The T3D can create tremendously large mppcore files, and the storage for these files is charged against your service unit allocation. Delete mppcore files as soon as you no longer need them. These files appear in the directory of the executable that aborted, with the name "mppcore". Y-MP jobs that abort produce a core file called "core".
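One way to keep track of leftover dump files is to scan a directory tree for them before deciding what to delete. The following is a minimal sketch in Python, not an ARSC-provided tool; the function name find_core_files and its arguments are hypothetical, and the only facts taken from the text are the file names "mppcore" and "core".

```python
import os


def find_core_files(root, names=("mppcore", "core")):
    """Return (path, size_in_bytes) pairs for dump files under root, largest first.

    `names` holds the file names to look for; the defaults are the two
    names mentioned in the text. Nothing is deleted here.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in names:
                path = os.path.join(dirpath, name)
                found.append((path, os.path.getsize(path)))
    # Largest files first, since they cost the most storage.
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Calling find_core_files(".") lists the dump files under the current directory; removing them is left to the user.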
New Shmem Manuals from CRI
I finally got in copies of the new SHMEM manuals:
SN-2516 SHMEM Technical guide for Fortran Users
SN-2517 SHMEM Technical guide for C Users
For those of you who are ARSC users (i.e., you have a userid on denali), I will send you a hardcopy if you e-mail me your U.S. Mail address.
FMlib Available on ARSC's T3D
From at least four places I received the following notice:
> FM - Fast Messaging on the Cray T3D
> -----------------------------------
>
> The FM library contains fast messaging primitives which exploit
> special features of the Cray T3D hardware to provide very low latency
> for short messages. FM provides an order of magnitude lower latency
> than Cray's PVM and achieves performance comparable to SHMEM get while
> providing a message-passing interface.
>
> The FM library provides two distinct sets of primitives which make
> use of the T3D fetch-and-increment and atomic swap hardware
> respectively. The fetch-and-increment primitives are optimized for
> the lowest possible latency and are suitable for situations with light
> communication traffic. The atomic swap primitives eliminate output
> contention at the cost of slightly higher latency, but by doing so can
> deliver robust performance even for heavy and unbalanced traffic loads.
>
>
> Release 1.0 of the library is now available from our WWW server:
>
> http://www-csag.cs.uiuc.edu/projects/communication/t3d-fm.html
>
> The library can also be accessed on the T3D at Pittsburgh
> Supercomputing Center (mario.psc.edu) from the directory:
>
> /usr/users/9/karamche/FM-1.0
>
> The release contains the source files (C and Assembly), the library
> (libFM.a), and an include file which provides the function prototypes.
> The release directory also contains the usage manual and a copy of a
> paper analyzing the performance of the two sets of FM primitives. The
> latter is a preliminary version of the paper which will appear in the
> Proceedings of the 22nd International Symposium on Computer
> Architecture (ISCA'95).
>
>
> Please contact me if you have any questions, comments or problems.
>
> Vijay Karamcheti
> vijayk@cs.uiuc.edu
> (217) 244-7116
>
> Concurrent Systems Architecture Group
> Department of Computer Science
> University of Illinois at Urbana-Champaign
> 1304 W. Springfield Avenue
> Urbana, IL 61801
>
I downloaded the files from the WWW server and installed the newest version (1.1) on denali. The include file "fm.h", which is needed, is in the directory:
/usr/local/examples/mpp/include
The needed library is:
/usr/local/examples/mpp/lib/libFM.a
I have gotten some FM test cases running and I will describe some of the library routines next week. At the WWW server:
http://www-csag.cs.uiuc.edu
there are several interesting papers:
FM: Fast Messaging on the Cray T3D
Vijay Karamcheti and Andrew Chien
This paper has a nice comparison figure of latency times for PVM and SHMEM as well as the FM library routines.
A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D
Vijay Karamcheti and Andrew Chien
This paper has a good description of T3D hardware support for message passing.
Libsci Routines and Improved Speed on the T3D
At ARSC we use Linpack in various forms as part of our regression tests: we solve the Linpack problem for various sizes in C and Fortran, on a single PE and on multiple PEs. These tests take a while to run, especially for the large matrices that might fit in an 8MW node, so I've looked into speeding up the public domain versions of Linpack with calls to the libsci versions of the BLAS1 routines. All of the results below are for the single PE case.
The standard Linpack source as distributed from netlib comes with all the Linpack subroutines, BLAS1 subroutines and auxiliary subroutines needed to execute the benchmark. This means that the unmodified benchmark will call the Fortran versions of the BLAS1 routines rather than the optimized versions in the library /mpp/lib/libsci.a. Once the BLAS1 routines are deleted from the Fortran source, the benchmark's calls to BLAS1 routines are satisfied by the libsci versions instead. The BLAS1 routines called by the Linpack benchmark are:
SDOT   - computes the dot product of two vectors
ISAMAX - finds the position of the maximum absolute value
         of a vector
SAXPY  - computes the vector sum of a vector and a scalar
         multiple of another vector
SSCAL  - scales a vector by a scalar
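To make the four operations concrete, here is a minimal sketch of their semantics in plain Python. It is not the netlib Fortran source or the libsci implementation: it assumes unit stride and vectors of equal length, and it drops the length and increment arguments that the real BLAS1 routines take.

```python
def sdot(x, y):
    """Return the dot product sum(x[i] * y[i])."""
    return sum(a * b for a, b in zip(x, y))


def isamax(x):
    """Return the 1-based position of the first element of largest magnitude.

    Returns 0 for an empty vector. Positions start at 1 because the
    Fortran BLAS counts array elements from 1.
    """
    if not x:
        return 0
    best = 0
    for i in range(1, len(x)):
        if abs(x[i]) > abs(x[best]):
            best = i
    return best + 1


def saxpy(alpha, x, y):
    """Update y in place so that y[i] = y[i] + alpha * x[i]; return y."""
    for i in range(len(x)):
        y[i] += alpha * x[i]
    return y


def sscal(alpha, x):
    """Scale x in place so that x[i] = alpha * x[i]; return x."""
    for i in range(len(x)):
        x[i] *= alpha
    return x
```

SAXPY and SSCAL modify their last argument in place, mirroring the way the Fortran routines overwrite the array they are given.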
Using the same technique as in Newsletter #22, we can time these routines both from the provided Fortran source and from calls to the libsci routines. Below are the results of these timings. In all cases the libsci routine is eventually faster than the straight Fortran source, and for each routine the libsci version is eventually more than twice as fast as the Fortran version. The crossover points, where the libsci routine becomes faster, are marked with asterisks in the table below.
       Fortran   libsci  Fortran   libsci  Fortran   libsci  Fortran   libsci
    0     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
    1     0.17     0.10     0.28     0.20     0.23     0.13     0.23     0.10
    2     0.43     0.27     0.45     0.36     0.50     0.26     0.44     0.17
    3     0.51     0.39     0.64     0.53     0.75     0.36     0.60     0.25
    4     0.73     0.50     0.86     0.70     0.87     0.49     0.83     0.35
    5     0.80     0.63     1.02     0.34     0.70     0.59     1.04     0.47
   10     1.59     1.22     1.69     0.82     1.29     1.12     1.88     0.87
   20     2.51     1.94     2.59     1.55     2.07     1.80     3.17     1.53
   40   *3.63*   *3.63*     3.59     3.01   *3.03*   *3.08*     4.72     2.70
   50     4.04     4.58     3.86     3.50     3.34     3.94     4.93     3.34
   60     4.17     5.30   *4.07*   *4.15*     3.57     4.24     5.35     3.71
   70     4.68     3.43     3.21     4.90     3.83     3.88     5.02     4.28
   80     4.66     3.70     3.32     5.49     3.92     4.14     5.24     4.65
   90     4.83     4.25     3.41     5.95     4.14     4.92     5.51     5.08
  100     4.90     4.46     3.52     6.42     4.18     4.99     5.81     5.56
  200     5.31     6.82     3.93     9.67     4.74     7.12   *7.17*   *8.98*
  300     5.73     8.51     4.04    11.49     5.01     8.72     7.75    11.36
  400     5.84     9.62     4.16    12.98     5.16     9.56     8.05    12.95
  500     5.91    10.70     4.21    13.98     5.22    10.49     8.31    14.22
  600     5.96    11.22     4.26    14.70     5.29    10.96     8.45    15.30
  700     6.02    11.89     4.28    15.05     5.33    11.46     8.56    16.23
  800     6.05    12.28     4.31    15.58     5.37    11.76     8.66    16.82
  900     6.05    12.74     4.33    15.94     5.38    12.03     8.74    17.29
 1000     6.08    12.82     4.35    16.21     5.40    12.18     8.81    17.88
 1500     6.15    14.06     4.39    17.23     5.46    13.00     8.96    19.43
 2000     6.20    14.59     4.42    17.82     5.50    13.35     9.05    20.44
 2500     6.23    15.06     4.43    18.13     5.52     8.92     9.11    21.12
 2600     6.24    15.09     4.43    18.17     5.53    13.68     9.13    21.21
For large Linpack problems it looks like the substitution with the libsci routines is a good idea for speeding up the regression tests. However we must remember that in Gaussian Elimination, which is what the Linpack benchmark does, the algorithm updates a progressively smaller submatrix and so the vectors become progressively smaller as the algorithm executes. Below are the timings of the Linpack benchmark for increasing problem sizes on one PE.
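The effect of the shrinking submatrix can be estimated with a little arithmetic. The sketch below assumes the column-oriented elimination used by netlib's SGEFA, where the step that leaves vectors of length m makes m SAXPY calls of that length. The function names and the cutoff argument are hypothetical, chosen only for illustration, and the count covers SAXPY work only.

```python
def saxpy_lengths(n):
    """Yield (vector_length, number_of_saxpy_calls) for each elimination step.

    For an n x n matrix, step k (k = 1 .. n-1) updates the n-k columns to
    the right of the pivot, each with one SAXPY of length n-k.
    """
    for k in range(1, n):
        m = n - k
        yield m, m


def short_vector_fraction(n, cutoff):
    """Return the fraction of SAXPY flops done on vectors shorter than cutoff."""
    total = 0
    short = 0
    for m, calls in saxpy_lengths(n):
        flops = 2 * m * calls  # one multiply and one add per element
        total += flops
        if m < cutoff:
            short += flops
    return short / total if total else 0.0
```

For example, short_vector_fraction(1000, 40) gives the share of SAXPY work done on vectors shorter than 40 elements when solving a problem of size 1000.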
Problem         Linpack Mflop rates
size      Unmodified   Version using
          Version      Libsci BLAS1 routines
      1          .19          .13
      2          .45          .30
      3          .90          .30
      4         1.50         1.04
      5         2.21         1.28
     10         5.04         2.92
     20         8.72         6.07
     40        11.04        10.72
     50      *10.93*      *12.29*
     60        11.10        13.49
     70        11.21        13.38
     80        11.43        13.31
     90        11.55        13.46
    100        11.64        13.75
    200        12.08        17.30
    300        12.03        19.66
    400        11.84        21.20
    500        11.60        22.26
    600        11.36        23.03
    700        11.21        23.61
    800        11.14        24.14
    900        10.97        24.40
   1000        10.83        24.69
   1500        10.15        25.62
   2000         9.83        26.18
   2500         9.70        24.61
   2600         9.70        26.49
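Mflop rates like these come from dividing an operation count by the measured time. The sketch below assumes the conventional Linpack benchmark count of 2n^3/3 + 2n^2 floating-point operations for a problem of size n; the text does not state which count was used here, and the function names are hypothetical.

```python
def linpack_flops(n):
    """Return the conventional Linpack operation count for problem size n."""
    return (2.0 * n**3) / 3.0 + 2.0 * n**2


def mflops(n, seconds):
    """Return the Mflop rate for solving a size-n problem in `seconds`."""
    return linpack_flops(n) / (seconds * 1.0e6)


def solve_seconds(n, rate_mflops):
    """Return the time implied by a Mflop rate (the inverse of mflops)."""
    return linpack_flops(n) / (rate_mflops * 1.0e6)
```

With the size-1000 rates from the table, solve_seconds(1000, 10.83) and solve_seconds(1000, 24.69) estimate how long each version spends on that one problem.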
These optimizations are fine, but we are not done yet: in the next newsletter we'll solve the same problems using the new LAPACK routines.
List of Differences Between T3D and Y-MP
The current list of differences between the T3D and the Y-MP is:
- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (Newsletter #7)
- There is no GETENV on the T3D (Newsletter #8)
- Missing routine SMACH on T3D (Newsletter #9)
- Different Arithmetics (Newsletter #9)
- Different clock granularities for gettimeofday (Newsletter #11)
- Restrictions on record length for direct I/O files (Newsletter #19)
- Implied DO loop is not "vectorized" on the T3D (Newsletter #20)
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
