ARSC T3D Users' Newsletter 12, November 11, 1994
What Exact Versions Are We Running Now?
Last week I showed the verbose output from the "what" command to see exactly what is in a T3D library in the directory /mpp/lib. A more succinct report comes from:
  what /mpp/lib/*.a | grep version
Currently on denali, this produces:
  libcomm_version  22.214.171.124  06/08/94 09:41:28
  libf_version     126.96.36.199   06/08/94 09:43:07
  libfi_version    188.8.131.52    06/08/94 10:00:14
  libm_version     184.108.40.206  06/15/94 16:43:37
  libpvm3_version  220.127.116.11  06/08/94 09:04:01
  libsci_version   18.104.22.168   06/08/94 09:09:10
  libsma_version   22.214.171.124  06/08/94 09:39:24
  libu_version     126.96.36.199   06/08/94 10:14:54
The version numbers listed above identify what is in CrayLib_M. The new version of CrayLib_M for December 4th produces:
  libcomm_version  188.8.131.52    10/12/94 08:39:18
  libf_version     184.108.40.206  10/12/94 08:43:15
  libfi_version    220.127.116.11  10/14/94 10:37:51
  libm_version     18.104.22.168   09/08/94 18:20:35
  libpvm3_version  22.214.171.124  09/09/94 10:52:22
  libsci_version   126.96.36.199   10/12/94 07:25:20
  libsma_version   188.8.131.52    10/12/94 08:33:06
  libu_version     184.108.40.206  10/12/94 10:09:48
Timing SHMEM Calls and Compiler Optimization
Another exercise from last week's T3D class showed some interesting effects of compiler optimization choices on shmem_get speeds. There is a very comprehensive man page on denali for the shmem_get and shmem_put calls. (Just try "man shmem".) For the shmem_get call, a user specifies a local target array, the source array on another PE, the number of words to transfer, and the number of that PE.
Below is a simple program for timing the speed of shmem_get calls for messages of increasing size. The program times shmem_get calls of six sizes: 8, 80, 800, 8000, 80000, and 800000 bytes. Dividing each size by its time gives the speed of the shmem_get call as a function of message size. The program does all of its timings on PE0 and times shmem_get calls both when the data is on PE0 and when it is on another PE. (A shmem_get of local memory is better done with a plain copy.)
      program sh
      parameter( NMAX = 100000 )
      real a( NMAX ), b( NMAX ), c( NMAX )
      INTRINSIC MY_PE
      integer DONE
CDIR$ SHARED DONE
      DONE = 0
      do 10 i = 1, NMAX
         a( i ) = i
 10   continue
      me = MY_PE()
      n = 1
      if( me .eq. 0 ) then
         do 100 i = 1, 6
            t1 = second( )
C           arguments: target, source, length in words, source PE
            call shmem_get( b, a, n, 0 )
            t2 = second( )
            call shmem_get( c, a, n, 1 )
            t3 = second( )
C           rmb = megabytes moved (8 bytes per word)
            rmb = 8*n/1000000.0
            write(6,600)i,n,t2-t1,t3-t2,rmb/(t2-t1),rmb/(t3-t2)
            n = n * 10
 100     continue
         DONE = 1
      else
C        all other PEs spin until PE0 sets DONE
 1       continue
         if( DONE .eq. 0 ) goto 1
      endif
 600  format( i4, i7, f10.6, f10.6, f10.2, f10.2 )
      do 200 i = 1, NMAX
         if( i .ne. b( i ) ) then
            write( 6, 601 ) i, a( i ), b( i )
            stop
         endif
         if( i .ne. c( i ) ) then
            write( 6, 602 ) i, a( i ), c( i )
            stop
         endif
 200  continue
 601  format( "b is bad ", i10, f10.1, f10.1 )
 602  format( "c is bad ", i10, f10.1, f10.1 )
      end
      real function second( )
C     irtc returns clock ticks; the divisor assumes a 150 MHz clock
      second = float( irtc( ) ) / 150000000.0
      end
The typical output for running this program is:
  /mpp/bin/cf77 -Wf" " -c shmem.f
  mppldr shmem.o
  a.out
     1      1  0.000016  0.000009      0.50      0.91
     2     10  0.000011  0.000011      7.13      7.49
     3    100  0.000017  0.000029     47.92     27.27
     4   1000  0.000073  0.000211    109.51     37.93
     5  10000  0.000692  0.002137    115.66     37.44
     6 100000  0.006769  0.020570    118.19     38.89
The major problem with this result is that the program hangs. As the program was written, all PEs other than PE0 just spin, waiting for the DONE variable to change from 0 to 1. At the default level of compiler optimization the compiler doesn't recognize that the variable DONE is SHARED and might be changed by some other PE, so the spin loop has no way of being stopped. Compiling with optimization off produces this output:
  /mpp/bin/cf77 -Wf" -o off " -c shmem.f
  mppldr shmem.o
  a.out
     1      1  0.000019  0.000009      0.43      0.88
     2     10  0.000012  0.000011      6.66      7.08
     3    100  0.000020  0.000031     40.40     25.67
     4   1000  0.000092  0.000223     86.58     35.82
     5  10000  0.000860  0.002107     93.00     37.96
     6 100000  0.008549  0.021146     93.57     37.83
This time the program ends properly, but the local PE transfer has taken a nasty performance hit. It has been suggested that with the CDIR$ SUPPRESS directive the compiler will not optimize away correct behavior but will retain speed elsewhere in the code. I changed the class exercise to:
      . . .
      else
 1       continue
CDIR$ SUPPRESS DONE
         if( DONE .eq. 0 ) goto 1
      endif
      . . .
And the program runs correctly (i.e., it doesn't hang and it transfers the data correctly), but the performance is another surprise:
  /mpp/bin/cf77 -Wf" " -c shmemq.f
  mppldr shmemq.o
  a.out
     1      1  0.000020  0.000011      0.39      0.74
     2     10  0.000015  0.000014      5.42      5.81
     3    100  0.000027  0.000046     29.99     17.28
     4   1000  0.000159  0.000375     50.24     21.30
     5  10000  0.001154  0.002900     69.33     27.58
     6 100000  0.011401  0.028644     70.17     27.93
This time even the transfers off PE0 took a performance hit. This is counterintuitive: the CDIR$ SUPPRESS directive worked, but the program performs worse than it did with optimization simply turned off.
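The rate columns in the three runs can be rechecked by hand from the word counts and times. The following is a minimal C sketch of that arithmetic, not part of the class exercise; the function names are made up for illustration, and the times are copied from the 100000-word row of each run.

```c
#include <stdio.h>

/* MB/s for `words` 8-byte words moved in `seconds`, with 1 MB taken as
   1,000,000 bytes to match rmb = 8*n/1000000.0 in the Fortran program. */
double mb_per_sec(long words, double seconds)
{
    return (8.0 * (double)words / 1.0e6) / seconds;
}

/* Print local and remote rates for the 100000-word row of each build,
   using the times reported in the three runs. */
void print_largest_transfer_rates(void)
{
    static const char *build[3] = { "default", "-o off", "SUPPRESS" };
    static const double t_local[3]  = { 0.006769, 0.008549, 0.011401 };
    static const double t_remote[3] = { 0.020570, 0.021146, 0.028644 };
    int i;

    for (i = 0; i < 3; i++)
        printf("%-9s %8.2f %8.2f\n", build[i],
               mb_per_sec(100000L, t_local[i]),
               mb_per_sec(100000L, t_remote[i]));
}
```

Calling print_largest_transfer_rates() from a main program reproduces the last two columns to within rounding of the printed times (the "-o off" local rate comes out as 93.58 rather than 93.57). On these numbers the local rate with SUPPRESS is roughly 59% of the default-optimization rate, against roughly 79% with optimization off.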
In the general case on the T3D, I think it would be safest to develop code with optimization off and to turn optimization on only once the program behaves correctly. If the program then stops working correctly, the sensitive sections might be isolated in separate compilation units and compiled without optimization. In this one case CDIR$ SUPPRESS didn't seem to be very useful.
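The hang in the spin loop is an instance of a general problem: if the compiler does not know that a variable can change outside the code it sees, it is free to test the value once and never look at memory again. As an analogy only (this is C, not T3D Fortran, and the names spin_until_set and max_spins are hypothetical), C expresses "reload this every time" with the volatile qualifier. The sketch bounds the number of polls so that it cannot itself hang.

```c
/* Poll *flag until it becomes nonzero or max_spins polls have seen zero.
   Returns the number of polls that saw zero, or -1 if the limit was hit.
   Because the pointee is volatile, the compiler must load *flag from
   memory on every pass instead of reusing an earlier value. */
long spin_until_set(const volatile int *flag, long max_spins)
{
    long spins = 0;

    while (*flag == 0) {
        if (++spins >= max_spins)
            return -1;
    }
    return spins;
}
```

A waiting process would call spin_until_set(&done, limit) on a flag that another process is expected to set. The qualifier only forces the reload; it says nothing about how or when a write from another processor becomes visible, which on the T3D is the business of the SHARED directive and the shmem routines.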
Timers on the T3D
This subject seems more complicated than it needs to be, and I'll be putting out more detailed information in the future. For now, here's a summary of the timers available on the T3D:
  timer         Wallclock     Fortran  T3D or  Granularity       Resolution
                or CPU timer  or C     Y-MP    T3D      Y-MP     T3D   Y-MP
  irtc          wallclock     Fortran  both    ~.15um   ~.2um
  rtc           wallclock     Fortran  both    ~.30um   ~.3um
  tsecnd        CPU           Fortran  both    10000um  3um
  gettimeofday  wallclock     C        both    ~2500um  ~30um
  second        CPU           Fortran  Y-MP    1um      5um
Notes:
- Anything in Fortran or C is callable from the other language, but with additional overhead
- um = microseconds (.000001 of a second)
- Without multiprogramming on the T3D, wallclock is CPU time
- Resolution is from the Livermore Loops benchmark
- Granularity is measured with the program in Newsletter #11
- There are man pages on each of these functions
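Granularity here means the smallest step a timer is seen to take between successive readings. The following is a minimal C sketch of one way to measure it for gettimeofday; it is not the program from Newsletter #11, and the function names are invented for this example.

```c
#include <stddef.h>
#include <sys/time.h>

/* Current gettimeofday() reading in microseconds. */
static long long now_us(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000LL + (long long)tv.tv_usec;
}

/* Smallest nonzero step between successive readings, in microseconds,
   taken over `samples` observed clock changes. Returns 0 if samples <= 0. */
double gettimeofday_granularity_us(int samples)
{
    long long smallest = 0;
    int i;

    for (i = 0; i < samples; i++) {
        long long start = now_us();
        long long next = now_us();

        while (next <= start)          /* spin until the clock advances */
            next = now_us();
        if (smallest == 0 || next - start < smallest)
            smallest = next - start;
    }
    return (double)smallest;
}
```

A call such as gettimeofday_granularity_us(1000) gives an estimate that includes the cost of the call itself, so on a fast clock the result reflects call overhead as much as the clock's tick.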
Reminders
The current list of differences between the T3D and the Y-MP is:
- Data type sizes are not the same (Newsletter #5)
- Uninitialized variables are different (Newsletter #6)
- The effect of the -a static compiler switch (Newsletter #7)
- There is no GETENV on the T3D (Newsletter #8)
- Missing routine SMACH on T3D (Newsletter #9)
- Different Arithmetics (Newsletter #9)
- Different clock granularities for gettimeofday (Newsletter #11)
Upcoming Upgrade on ARSC T3D Software
ARSC will be upgrading the T3D software to CrayLib_M 220.127.116.11, MAX 18.104.22.168 and SCC_M 22.214.171.124 on December 4th.
We are asking users to submit their codes to be part of our regression test suite. We are looking for relatively small, short-running, self-contained programs in source form that are currently running correctly. The advantage of submitting your code is that it will be one of the first programs run on a new T3D release, and if it doesn't work we won't install the new T3D software. Otherwise your program may fail after the new T3D software has become the default.
PE Limits
At ARSC, most production runs can be handled by the 32 PE NQS queue. The use of large partitions of 64 or 128 PEs seems restricted to research problems of the "what if..." type. Basically, development is done on smaller partitions, and the larger partitions are then used to complete a table of performance or problem sizes.
We have been accommodating this situation by individually changing a user's interactive permissions with the understanding that users would share their results with us. We are in the process of notifying users of the change in this policy. In the future we will reset everyone back to the default settings and then make it easy for users to request changes that will be in effect for a short time period (perhaps a week per request).
Next Week's Newsletter
Next week, I will be at Supercomputing '94 and there will be no newsletter. I'd like to meet, in person, the members of the ARSC T3D users group, so if you stop by the ARSC booth we can talk or arrange a meeting. Also I have poster #23 in the poster session; you can also catch me there.
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.