| Newsletter Index | Quick-Tip Index | Search Newsletters |
During test time on June 10, ARSC will launch a daily cron job to delete core and mppcore files that have not been accessed in seven or more days from user directories on yukon's /u1 and /u2 disks. The seven-day /tmp purger has been running on yukon for a week now and, because it deletes every file on /tmp older than seven days, it removes old core and mppcore files as well. To save important core files on /u1 and /u2 (as if there were a lot of these hanging around!), you can simply rename them.
If we run the following code on 4 processors on the T3D, what do you think will happen? On the T3E?
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
program prog
  implicit none
  integer, parameter :: maxproc = 256
  include "mpp/shmem.fh"
  ! local identity
  integer my_pe
  integer num_pe
  integer shmem_my_pe, shmem_n_pes
  integer mproc
  ! processor to stop
  integer hproc
  ! find out how many processors
  num_pe = shmem_n_pes()
  ! which one am i?
  my_pe = shmem_my_pe()
  write(6,*) ' pe ',my_pe,' of ',num_pe
  hproc=2
  if(my_pe.eq.hproc) then
    write(6,*) ' processor ',hproc,' stopping '
    stop
  endif
  write(6,*) ' pe ',my_pe,' of ',num_pe,' wait on barrier '
  call barrier()
end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
Here is the output from a run on the ARSC T3D:
c-denali<121> ./prog -npes 4
pe 3 of 4
pe 0 of 4
pe 3 of 4 wait on barrier
pe 0 of 4 wait on barrier
pe 2 of 4
pe 1 of 4
processor 2 stopping
pe 1 of 4 wait on barrier
STOP (PE 2) executed at line 39 in Fortran routine 'PROG'
At this point, the job will wait until the user's resources are consumed and then be terminated by the system, occupying the four PEs all the while.
Here is the output from a run on the ARSC T3E:
yukon% mpprun -n 4 ./prog
pe 1 of 4
pe 0 of 4
pe 1 of 4 wait on barrier
pe 3 of 4
pe 2 of 4
pe 0 of 4 wait on barrier
pe 3 of 4 wait on barrier
processor 2 stopping
STOP (PE 2) executed at line 39 in Fortran routine 'PROG'
FATAL ERROR IN BARRIER: User deadlock detected.
Number of PEs that have exited: 1
Number of PEs waiting for barriers: 3
Total PEs: 4
Status of each PE:
PE 0 is waiting for a hardware barrier
PE 1 is waiting for a hardware barrier
PE 2 has exited
PE 3 is waiting for a hardware barrier
SIGNAL: Abort ( from process 7959 )
Beginning of Traceback (PE 3):
Interrupt at address 0x800032130 in routine '_lwp_kill'.
Called from line 30 (address 0x800031930) in routine 'raise'.
Called from line 125 (address 0x800009784) in routine 'abort'.
Called from line 321 (address 0x8000c5c20) in routine
'_sma_register_blocked_pe'.
Called from line 80 (address 0x8000c52e8) in routine '_sma_deadlock_wait'.
Called from line 111 (address 0x8000c63c0) in routine 'barrier'.
Called from line 43 (address 0x80000129c) in routine 'PROG'.
Called from line 433 (address 0x800000b3c) in routine '$START$'.
End of Traceback.
Abort (core dumped)
yukon%
The T3E handles the situation much better than the T3D. Naturally, the programmer should not have an exit or stop statement which doesn't perform a global synchronisation. A more realistic situation is where each processor is reading from a different file and one processor encounters an error and stops, or each is malloc'ing and one runs out of memory and stops, and the programmer has done a poor job of exception handling. (In the case of the file read problem, Fortran provides an easy solution. Stay tuned...)
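As an illustration of the file-read case described above, here is a minimal sketch of one way to keep a failed read from stopping a PE before the others reach the barrier. It is an assumption about what such exception handling could look like, not necessarily the solution the newsletter has in mind. It uses only the standard iostat= specifier; the program name, file name, unit number, and error-flag values are hypothetical, and the barrier call is left as a comment so the sketch runs without the Cray libraries.

```fortran
program safe_read
  implicit none
  integer, parameter :: lun = 10   ! hypothetical unit number
  integer :: ios, ierr
  real :: x

  ierr = 0
  ! Hypothetical file name; on an MPP each PE would open its own file.
  open (unit=lun, file='input_that_may_not_exist.dat', status='old', &
        action='read', iostat=ios)
  if (ios /= 0) then
     ierr = 1                      ! open failed: record it, do not stop
  else
     read (lun, *, iostat=ios) x
     if (ios /= 0) ierr = 2        ! read failed: record it, do not stop
     close (lun)
  end if

  if (ierr /= 0) then
     write (6,*) ' read failed, error flag = ', ierr
  else
     write (6,*) ' read ok, x = ', x
  end if

  ! Every PE arrives here whether or not its read succeeded, so a global
  ! synchronisation placed at this point (call barrier() in the shmem
  ! example above) is not left waiting on a PE that stopped early.
  ! After the barrier, the PEs could agree on whether to exit together.
end program safe_read
```

The point of the design is that the error is recorded in a flag rather than acted on immediately, so the decision to stop can be made after all PEs have synchronised.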
At the CUG, Jeff Brooks mentioned a few features of the v3.0 f90 T3E compiler (see last week's newsletter), one of which is a compiler directive to bypass the cache, thus using the e-registers for memory transfers. For example:
!dir$ cache_bypass c,a
do i=1,n
  c(i) = a(i)
enddo
This directive is not available in v2 f90, but we can test the effect by using shmem_put on the same PE. The following call has the same effect as the above loop:
call shmem_put (c, a, n, mype)
This is Cray-specific code, of course, but if you are using shmem routines anyway, and can't wait for f90 v3.0, it might be worth it. The benefit increases with the volume of data being copied: in a simple test run on the T3E, shmem_put was slower than a loop for a 10-word copy but reached more than four times the loop's rate for copies of 10,000 words or more. The same test suggests that on the T3D you should stick with loops and array operations. Here's the test code followed by sample output:
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
program prog
  implicit none
  integer, parameter::SZ=1000000
  real c(SZ), a(SZ)
  integer i, t1, t2, t3, t4, t5, t6
  integer irtc, ierr, index, iclktck, my_pe, shmem_my_pe
  integer copycnt, copysz
  ! Get machine clock ticks per second
  call pxfconst ('CLK_TCK',index,ierr)
  call pxfsysconf (index, iclktck, ierr)
  ! Get my pe
  my_pe = shmem_my_pe()
  do copycnt=1,6
    copysz=10**copycnt
    c = 1
    ! copy arrays using shmem_put.
    a = 0
    t1 = irtc ()
    call shmem_put (a, c, copysz, my_pe)
    t2 = irtc ()
    call dummy (a, c)
    ! Verify copy completed.
    do i=copysz,1,-1
      if (c(i) .NE. a(i)) stop "copy failed"
    enddo
    ! copy arrays using f90 array operation
    a = 0
    t3 = irtc ()
    a(1:copysz) = c(1:copysz)
    t4 = irtc ()
    call dummy (a, c)
    ! copy arrays using loop
    a = 0
    t5 = irtc ()
    do i=1,copysz
      a(i) = c(i)
    enddo
    t6 = irtc ()
    call dummy (a, c)
    write (6,'("Copying ", i8, " words. Words per sec:")') copysz
    write (6,'("shmem_put ",f12.0)') copysz/((t2-t1)/real(iclktck))
    write (6,'("array op ",f12.0)') copysz/((t4-t3)/real(iclktck))
    write (6,'("loop ",f12.0)') copysz/((t6-t5)/real(iclktck))
    write (6,*)
  enddo
end
subroutine dummy (a, c)
  ! Does nothing; called after each copy, presumably so the compiler
  ! cannot optimise the timed copies away.
  real a(*), c(*)
end
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
Here is output from this code on the T3E:
yukon$ mpprun -n1 ./a.out
Copying 10 words. Words per sec:
shmem_put 1143729.
array op 3064351.
loop 2909796.
Copying 100 words. Words per sec:
shmem_put 8498584.
array op 7767996.
loop 8136697.
Copying 1000 words. Words per sec:
shmem_put 30618494.
array op 9596622.
loop 9015778.
Copying 10000 words. Words per sec:
shmem_put 43273184.
array op 9065170.
loop 9780493.
Copying 100000 words. Words per sec:
shmem_put 43497488.
array op 10582787.
loop 10324527.
Copying 1000000 words. Words per sec:
shmem_put 44401595.
array op 10590687.
loop 10429912.
yukon$
At 100 array elements (or so) the cache_bypass/shmem_put/e-register method surpasses the loop and array operation methods, and never looks back.
The best rate of about 44 Mwords/sec translates to about 355 Mbytes/sec, surpassing the best shmem inter-PE bandwidth of about 330 Mbytes/sec we obtained in earlier bandwidth tests and printed in newsletter 117. These rates fluctuate slightly, on the order of a percent, from run to run.
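As a check on the arithmetic above, the following sketch recomputes the figures from the T3E output. It assumes 8-byte (64-bit) words and 10^6 bytes per Mbyte, which reproduces the quoted 355 Mbytes/sec. The program and variable names are illustrative, and the rates are copied from the sample output rather than measured.

```fortran
program rates
  implicit none
  integer, parameter :: dp = selected_real_kind(12)
  integer, parameter :: n = 6
  ! Words per second, copied from the T3E sample output above
  real(dp), parameter :: putrate(n) = &
       (/ 1143729._dp, 8498584._dp, 30618494._dp, &
          43273184._dp, 43497488._dp, 44401595._dp /)
  real(dp), parameter :: looprate(n) = &
       (/ 2909796._dp, 8136697._dp, 9015778._dp, &
          9780493._dp, 10324527._dp, 10429912._dp /)
  real(dp), parameter :: bytes_per_word = 8._dp   ! assumption: 64-bit words
  integer :: k

  write (6,'(a)') '   words   put Mbytes/s   put/loop'
  do k = 1, n
     ! Column 2: shmem_put rate in Mbytes/sec; column 3: ratio to the loop rate
     write (6,'(i8,f15.1,f11.2)') 10**k, &
          putrate(k)*bytes_per_word/1.0e6_dp, putrate(k)/looprate(k)
  end do
end program rates
```

For the 1,000,000-word copy this gives about 355 Mbytes/sec and a shmem_put rate a little over four times the loop rate.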
Here is output from this code on the T3D:
denali$ mppexec ./a.out -npes 1
Copying 10 words. Words per sec:
shmem_put 1116183.
array op 6173457.
loop 5415704.
Copying 100 words. Words per sec:
shmem_put 5331023.
array op 13182338.
loop 12789003.
Copying 1000 words. Words per sec:
shmem_put 8108481.
array op 10638607.
loop 14611376.
Copying 10000 words. Words per sec:
shmem_put 9269283.
array op 8931485.
loop 8660224.
Copying 100000 words. Words per sec:
shmem_put 9786806.
array op 8633544.
loop 8607606.
Copying 1000000 words. Words per sec:
shmem_put 9770439.
array op 8604423.
loop 8601747.
denali$
> [ One of our readers, David Bader, sent in this announcement. ]
>
> ----------------------------------------------------------------------
> SIMPLE: A Methodology for Programming High Performance Algorithms on
> Clusters of Symmetric Multiprocessors (SMPs)
> ----------------------------------------------------------------------
>
> We have released our technical report entitled ``SIMPLE: A Methodology
> for Programming High Performance Algorithms on Clusters of Symmetric
> Multiprocessors (SMPs),'' by David A. Bader and Joseph Ja'Ja'.
> Technical Report Number: CS-TR-3798 and UMIACS-TR-97-48. Institute for
> Advanced Computer Studies (UMIACS), University of Maryland, College
> Park, May 1997.
>
> This report is available in PostScript format via the WWW:
>
> http://www.umiacs.umd.edu/research/EXPAR
> http://www.umiacs.umd.edu/research/EXPAR/papers/3798.html
>
> or via anonymous ftp from these locations:
>
> ftp://ftp.cs.umd.edu/pub/papers/papers/3798/3798.ps.Z
> ftp://ftp.umiacs.umd.edu/pub/EXPAR/papers/3798.ps.gz
> ftp://ftp.umiacs.umd.edu/pub/EXPAR/papers/3798.ps.Z
>
> If you prefer a hardcopy, please reply to this message and send me
> your mailing address.
>
> ABSTRACT:
>
> We describe a methodology for developing high performance programs
> running on clusters of SMP nodes. Our methodology is based on a small
> kernel (SIMPLE) of collective communication primitives that make
> efficient use of the hybrid shared and message passing environment. We
> illustrate the power of our methodology by presenting experimental
> results for sorting integers, two-dimensional fast Fourier transforms
> (FFT), and constraint-satisfied searching. Our testbed is a cluster of
> DEC AlphaServer 2100 4/275 nodes interconnected by an ATM switch.
>
> ---
>
> David A. Bader, Ph.D.                     Office: 301-405-6755
> Institute for Advanced Computer Studies   FAX: 301-314-9658
> A.V. Williams Building                    Internet: dbader@umiacs.umd.edu
> University of Maryland                    WWW: http://www.umiacs.umd.edu/~dbader
> College Park, MD 20742
>
A: {{ Is diamond a good conductor (of heat? of electricity?). How about
silicon? }}
Diamond has nice properties for hardware engineers. It is a
fantastic conductor of heat but an electric insulator.
A speaker at the CUG described shrinking a 4 processor 1 Gbyte
J90's weight by 83% and volume by 80% for an embedded application.
Among other things, they used a substrate of synthetic diamond on the
boards to aid heat dispersion. They also used a coolant atomizer to
directly spray them. The coolant phase change keeps the boards at
essentially constant temperature without the bulk of chillers. The
vapor is reclaimed into liquid via a simple radiator.
How about silicon? As an aside, ARSC staff has a tradition of buying
interesting little items for each other whenever we are on vacation
or a business trip (and think of it). Dale Clark returned from
Silicon Valley with pieces of silicon for everyone. They look like
shiny metallic chunks of coal, which makes sense, as Si is in the
same valence group as C. Silicon, as we all know, is a
semi-conductor.
Q: On the T3E, how can you limit the size of core files?
[ Answers, questions, and tips graciously accepted. ]
Contact:
Thomas J. Baring, ARSC Web Specialist, ph: 907-450-8619
Donald Bahls, ARSC User Consultant, ph: 907-450-8674
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020