ARSC T3E Users' Newsletter 119, May 16, 1997
Yukon /tmp Purging Begins Next Week
Users helping with the testing of ARSC's T3E should be interested in this.
As noted in the motd on yukon, we plan to begin purging files in the /tmp file system on yukon on Tuesday, May 20. Under the current plan, any files in /tmp (excluding /tmp/window) not accessed or modified in 7 days will be deleted. /tmp/window will be subject to a 21-day purge, as on denali. The script that will do this is already in the yukon root crontab, but for now it only lists the files that would be deleted instead of actually deleting them.
Most people active on yukon are likely to have something purged. Beyond individual users' files, directories like /tmp/dump, /tmp/window/udb, /tmp/window/dot.files and /tmp/window/disk also have lots of files that will go away.
T3E Adaptive Routing and 'shmem_fence'
The previous newsletter quoted some bandwidth figures for PUT and GET on the T3D and T3E. The code given used SHMEM_WAIT to verify that all of the data had arrived.
On the T3D, routing of messages is deterministic: data arrives in the order it was sent. On the T3E, a system configuration option enables adaptive routing. As the name suggests, when this is enabled, data may take different routes through the torus, avoiding any hotspots. Therefore, data need not arrive in order, and testing the last data item is no longer a correct test for arrival of the complete message.
The following code uses shmem_put and shmem_wait to transfer arrays. It was run on the ARSC T3E which does have adaptive routing enabled. It counts the number of data items which arrive out of order, based on the shmem_wait test of the last array element, and prints out these "error" counts.
Example code 1.
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      program band
      include "mpp/shmem.fh"
      parameter (maxsize=1024*1024)
      common /a/ itdata(maxsize),isdata(maxsize)
      integer num_pe,my_pe
      integer shmem_my_pe, shmem_n_pes
      num_pe = shmem_n_pes();my_pe = shmem_my_pe()
! set master processor
      mproc=0;mtarg=1
! how many repeats
      ntdo=10
! count errors
      iperr=0;igerr=0;icerr=0
      do matarg=0,num_pe-1
        mtarg=mod(matarg+my_pe,num_pe)
        mbtarg=mod(matarg-my_pe+num_pe,num_pe)
        msize=maxsize
 99     continue
        itget=0;itrecv=0;itput=0
        do ido=1,ntdo
          call barrier()
          isdata(1:msize)=msize;itdata(1:msize)=-1
          call barrier()
          its=irtc()
          call shmem_put(itdata,isdata,msize,mtarg)
          itf=irtc();itput=itput+itf-its
          call shmem_wait(itdata(msize),-1)
          do ic=1,msize
            if(itdata(ic).ne.msize) iperr=iperr+1
          enddo
          icerr=icerr+msize
          do ic=1,msize-1
            call shmem_wait(itdata(ic),-1)
          enddo
          itf=irtc();itrecv=itrecv+itf-its
          isdata(1:msize)=msize;itdata(1:msize)=-1
          call barrier()
          its=irtc()
          call shmem_get(itdata,isdata,msize,mbtarg)
          do ic=1,msize
            if(itdata(ic).ne.msize) igerr=igerr+1
          enddo
          icerr=icerr+msize
          itf=irtc();itget=itget+itf-its
        enddo
        itget=itget/ntdo;itrecv=itrecv/ntdo;itput=itput/ntdo
        if(mproc.eq.my_pe) then
          write(6,*) ' put/get times ',msize, ' were ',
     &      itput,itrecv,itget
          write(6,*) ' bandwidth ',matarg,
     &      float(msize)/(float(itput)/300.0E6),
     &      float(msize)/(float(itget)/300.0E6)
 2000     format(1x,i10,i10,e13.6,e13.6)
        endif
        if(msize.gt.1) then
          msize=msize/2
          goto 99
        endif
      enddo
      write(6,*) ' number of data errors ',my_pe,' is put/get ',
     &  iperr,igerr,
     &  ' in ',icerr
      stop
      end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
This code passes data from each processor to a processor a fixed stride away. The test on arrival is necessary to know when all puts are finished. Here WAIT is used instead of BARRIER since it allows computation to proceed as soon as the data has arrived rather than waiting for all data exchanges to complete. Note that the WAIT tests the arrival of data being PUT from another processor. The code loops over all strides, which generates a very busy network and makes adaptive routing of messages more likely. An error count is kept, and the sample results below show the counts from a run on 32 processors.
Example error counts.
number of data errors 0 is put/get 1, 0 in 1342176640
number of data errors 16 is put/get 2*0 in 1342176640
number of data errors 5 is put/get 2*0 in 1342176640
number of data errors 1 is put/get 2*0 in 1342176640
number of data errors 15 is put/get 2*0 in 1342176640
number of data errors 28 is put/get 33, 0 in 1342176640
number of data errors 2 is put/get 2*0 in 1342176640
number of data errors 21 is put/get 41, 0 in 1342176640
number of data errors 11 is put/get 2*0 in 1342176640
number of data errors 25 is put/get 1, 0 in 1342176640
number of data errors 9 is put/get 2*0 in 1342176640
number of data errors 23 is put/get 2, 0 in 1342176640
number of data errors 14 is put/get 2*0 in 1342176640
number of data errors 4 is put/get 2*0 in 1342176640
number of data errors 18 is put/get 2*0 in 1342176640
number of data errors 10 is put/get 10, 0 in 1342176640
number of data errors 3 is put/get 6, 0 in 1342176640
number of data errors 22 is put/get 2*0 in 1342176640
number of data errors 6 is put/get 2*0 in 1342176640
number of data errors 26 is put/get 18, 0 in 1342176640
number of data errors 24 is put/get 2*0 in 1342176640
number of data errors 17 is put/get 2*0 in 1342176640
number of data errors 31 is put/get 2*0 in 1342176640
number of data errors 20 is put/get 11, 0 in 1342176640
number of data errors 27 is put/get 2*0 in 1342176640
number of data errors 13 is put/get 2*0 in 1342176640
number of data errors 30 is put/get 1, 0 in 1342176640
number of data errors 29 is put/get 2, 0 in 1342176640
number of data errors 8 is put/get 2*0 in 1342176640
number of data errors 19 is put/get 8, 0 in 1342176640
number of data errors 7 is put/get 1, 0 in 1342176640
number of data errors 12 is put/get 15, 0 in 1342176640
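As can be seen, there are only a small number of data errors, and all of them are on the put side (in this list-directed output, 2*0 stands for two zero values). However, even a few such errors might be important in your program.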
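The "in" total printed on every line can be reproduced from the loop structure of the test program: for each stride the message length is halved from maxsize down to 1, each length is repeated ntdo times, and each repeat checks the array once after the put and once after the get. The following is a small sketch of that arithmetic in C; the function name total_checks is made up for this illustration and does not appear in the Fortran program.

```c
/* Total number of element checks one PE performs in the test program.
   For every stride (num_pe of them) the message length starts at maxsize
   and is halved down to 1; each length is repeated ntdo times, and each
   repeat checks the array twice (once after the put, once after the get). */
long long total_checks(long long maxsize, int ntdo, int num_pe)
{
    long long per_stride = 0;
    long long msize;

    for (msize = maxsize; msize >= 1; msize /= 2)
        per_stride += 2 * msize * ntdo;

    return per_stride * num_pe;
}
```

With the values used in the run (maxsize of 1024*1024, ntdo of 10, 32 processors), total_checks(1024*1024, 10, 32) gives 1342176640, which matches the printed total. The worst processor above saw 41 errors out of that many checks.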
A simple solution is to wait for the entire array to have changed. However, this is a rather expensive option for large messages. The new SHMEM 'fence' routine is provided to ensure that puts are delivered in order. It is used along with the exchange of a token to make sure all data has arrived.
The following code uses shmem_fence to guarantee the arrival of the entire message and then tests it, as in the previous example.
Example code 2.
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      program band
      include "mpp/shmem.fh"
      parameter (maxsize=1024*1024)
      common /a/ itdata(maxsize),isdata(maxsize)
      parameter (mfsize=1)
      common /af/ itfdata(mfsize),isfdata(mfsize)
      integer num_pe,my_pe
      integer shmem_my_pe, shmem_n_pes
      num_pe = shmem_n_pes();my_pe = shmem_my_pe()
! set master processor
      mproc=0;mtarg=1
! how many repeats
      ntdo=10
! count errors
      iperr=0;igerr=0;icerr=0
      do matarg=0,num_pe-1
        mtarg=mod(matarg+my_pe,num_pe)
        mbtarg=mod(matarg-my_pe+num_pe,num_pe)
        msize=maxsize
 99     continue
        itget=0;itrecv=0;itput=0
        do ido=1,ntdo
          call barrier()
          isdata(1:msize)=msize;itdata(1:msize)=-1
          itfdata=-1;isfdata=msize
          call barrier()
          its=irtc()
          call shmem_put(itdata,isdata,msize,mtarg)
          call shmem_fence
          call shmem_put(itfdata,isfdata,mfsize,mtarg)
          call shmem_wait(itfdata(mfsize),-1)
          itf=irtc();itput=itput+itf-its
          do ic=1,msize
            if(itdata(ic).ne.msize) iperr=iperr+1
          enddo
          icerr=icerr+msize
          itf=irtc();itrecv=itrecv+itf-its
          isdata(1:msize)=msize;itdata(1:msize)=-1
          call barrier()
          its=irtc()
          call shmem_get(itdata,isdata,msize,mbtarg)
          do ic=1,msize
            if(itdata(ic).ne.msize) igerr=igerr+1
          enddo
          icerr=icerr+msize
          itf=irtc();itget=itget+itf-its
        enddo
        itget=itget/ntdo;itrecv=itrecv/ntdo;itput=itput/ntdo
        if(mproc.eq.my_pe) then
          write(6,*) ' put/get times ',msize, ' were ',
     &      itput,itrecv,itget
          write(6,*) ' bandwidth ',matarg,
     &      float(msize)/(float(itput)/300.0E6),
     &      float(msize)/(float(itget)/300.0E6)
 2000     format(1x,i10,i10,e13.6,e13.6)
        endif
        if(msize.gt.1) then
          msize=msize/2
          goto 99
        endif
      enddo
      write(6,*) ' number of data errors ',my_pe,' is put/get ',
     &  iperr,igerr,
     &  ' in ',icerr
      stop
      end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
Output from version using shmem_fence
number of data errors 0 is put/get 2*0 in 1342176640
number of data errors 18 is put/get 2*0 in 1342176640
number of data errors 2 is put/get 2*0 in 1342176640
number of data errors 5 is put/get 2*0 in 1342176640
number of data errors 24 is put/get 2*0 in 1342176640
number of data errors 6 is put/get 2*0 in 1342176640
number of data errors 22 is put/get 2*0 in 1342176640
number of data errors 11 is put/get 2*0 in 1342176640
number of data errors 7 is put/get 2*0 in 1342176640
number of data errors 29 is put/get 2*0 in 1342176640
number of data errors 8 is put/get 2*0 in 1342176640
number of data errors 27 is put/get 2*0 in 1342176640
number of data errors 14 is put/get 2*0 in 1342176640
number of data errors 19 is put/get 2*0 in 1342176640
number of data errors 3 is put/get 2*0 in 1342176640
number of data errors 9 is put/get 2*0 in 1342176640
number of data errors 1 is put/get 2*0 in 1342176640
number of data errors 17 is put/get 2*0 in 1342176640
number of data errors 15 is put/get 2*0 in 1342176640
number of data errors 4 is put/get 2*0 in 1342176640
number of data errors 16 is put/get 2*0 in 1342176640
number of data errors 26 is put/get 2*0 in 1342176640
number of data errors 28 is put/get 2*0 in 1342176640
number of data errors 30 is put/get 2*0 in 1342176640
number of data errors 21 is put/get 2*0 in 1342176640
number of data errors 31 is put/get 2*0 in 1342176640
number of data errors 13 is put/get 2*0 in 1342176640
number of data errors 10 is put/get 2*0 in 1342176640
number of data errors 23 is put/get 2*0 in 1342176640
number of data errors 25 is put/get 2*0 in 1342176640
number of data errors 20 is put/get 2*0 in 1342176640
number of data errors 12 is put/get 2*0 in 1342176640
The calls to shmem_fence ensure that data is passed in order. This only refers to the sequence of messages, not the data within a message. In this case we pass two messages, and the call to SHMEM_FENCE ensures that the second does not arrive until the first is complete. The arrival of the second message, a single data item, therefore indicates that the first message has been received. Without the call to shmem_fence, data in the previous "put" could still be in transit.
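To make the ordering rule concrete, here is a toy model in plain C. It is not SHMEM code and does not model the T3E network: it simply treats each put as a pending write tagged with a "fence epoch", and delivers pending writes in the worst possible order (latest issued first) subject to the one rule that nothing from a later epoch is delivered before everything from an earlier epoch. The names (struct put, unset_words_at_token, NDATA) and the worst-case delivery order are assumptions made for this illustration.

```c
#define NDATA 8        /* words in the simulated data message */
#define UNSET (-1)     /* value the target array holds before delivery */

/* One simulated remote write: target slot, value, the fence epoch in
   which it was issued, and whether it has been delivered yet. */
struct put { int slot, value, epoch, done; };

/* Issue NDATA data puts, optionally a fence, then a one-word token put.
   Returns how many data words are still UNSET at the moment the token
   lands in the target's memory. */
int unset_words_at_token(int use_fence)
{
    int mem[NDATA + 1];
    struct put q[NDATA + 1];
    int i, n = NDATA + 1;

    for (i = 0; i < n; i++)
        mem[i] = UNSET;

    for (i = 0; i < NDATA; i++) {          /* the data message */
        q[i].slot = i; q[i].value = NDATA; q[i].epoch = 0; q[i].done = 0;
    }
    q[NDATA].slot = NDATA;                 /* the token, issued last */
    q[NDATA].value = NDATA;
    q[NDATA].epoch = use_fence ? 1 : 0;    /* a fence starts a new epoch */
    q[NDATA].done = 0;

    for (;;) {
        int lowest = -1, pick = -1;

        /* Find the oldest epoch that still has undelivered puts. */
        for (i = 0; i < n; i++)
            if (!q[i].done && (lowest < 0 || q[i].epoch < lowest))
                lowest = q[i].epoch;
        if (lowest < 0)
            break;                         /* everything delivered */

        /* Worst case: within that epoch, deliver the latest put first. */
        for (i = n - 1; i >= 0; i--)
            if (!q[i].done && q[i].epoch == lowest) { pick = i; break; }

        mem[q[pick].slot] = q[pick].value;
        q[pick].done = 1;

        if (q[pick].slot == NDATA) {       /* token arrived: inspect data */
            int missing = 0;
            for (i = 0; i < NDATA; i++)
                if (mem[i] == UNSET)
                    missing++;
            return missing;
        }
    }
    return -1;                             /* not reached: token is always delivered */
}
```

In this model, unset_words_at_token(0) returns 8 (the token overtakes the whole message), while unset_words_at_token(1) returns 0. Real adaptive routing reorders far less aggressively, as the small error counts from the first program show, but the guarantee provided by the fence is the same.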
How might a program misbehave under adaptive routing if it has this problem? First, different runs might give different results. This happens because the data used in the computation might not yet be correct when the computation proceeds past a wait. (Note: users who sync with BARRIER don't have to worry; the processor putting the data will only join the barrier once the data transfer is complete.)
Does adaptive routing improve bandwidth? It does not improve the peak performance of the system but does help avoid problems with torus congestion. This was generally not a problem on the T3D since the torus was used only by user programs and carried a small system load. With the T3E being a stand-alone system, however, with processors performing operating system, user, and application tasks, there is much more traffic on the torus, which increases the potential for hotspots.
Here is a bit of the 'man' page:
SHMEM_FENCE(3) CrayLibs 2.0 SR-2165 2.0
NAME
shmem_fence - Assures ordering of delivery of puts
SYNOPSIS
C:
#include <mpp/shmem.h>
void shmem_fence(void);
Fortran:
CALL SHMEM_FENCE
IMPLEMENTATION
All Cray Research systems
DESCRIPTION
This function ensures ordering of put (remote write) operations. All
put operations issued to a particular processing element (PE) prior to
the call to shmem_fence are guaranteed to be delivered before any
subsequent put operations to the same PE which follow the call to
shmem_fence.
CUG Spring '97 Report
The Silicon Valley CUG, hosted last week in San Jose by NASA Ames and Sterling Software, was pleasantly upbeat, especially after the previous CUG which had an air of anxiety due, perhaps, to uncertainty surrounding both the future of CUG and the future of CRI (which had just been acquired by SGI). At this point, CUG's future seems secure, though its direction is still under debate (there will be one meeting per year rather than two, and this decision at least seems to have settled with everyone). And the future of Cray also looks good.
There has been a milestone reached at Cray. The revenue from sale of Cray MPP systems exceeded, for the first time ever, the revenue from Cray PVP systems (this was during one quarter only... but still...).
The SGI/CRI merger seems to be going smoothly, and SGI seems to treat Cray quite generously. In his address to the CUG, Ed McCracken, CEO of Cray/SGI, stated: "You [CUG attendees] have more impact on us by being here in Silicon Valley than we have on you."
This theme was echoed by others. In general, SGI seemed down-right grateful that it is teamed with responsible, reputable, grey-haired Cray. They have promoted Robert Ewald, former CEO of CRI, to be the Executive Director Of Computer Systems (I didn't get all of these titles exact, but they're close) for all of SGI/Cray, and have relocated him to Mountain View. They have promoted Mick Dungworth, formerly Director of Customer Service (?) for CRI to the same position for all of SGI/Cray. They have moved development of the Origin2000 (top of SGI's line) to Eagan, where Irene Qualters, former CRI VP of System Development (?) under Robert Ewald is now Executive Director of SGI/Cray for Supercomputer Development (?) and responsible for the Eagan facility.
What follows are notes from some presentations relevant to this Newsletter:
----------------------------------------------------------------------
Jeff Brooks of SGI/Cray: T3E Single PE Optimization
-100 mflops on 1 300 MHz PE is a worthy goal.
-latency to memory is a FACT (like death and taxes) to be dealt with.
-C will never beat FORTRAN for compiler optimization.
Streams:
-after 3 cache misses, starts pre-fetching from remote PE's memory
using stream register.
-6 stream buffers (LHS takes 1)
-use MPI to avoid conflicts between e-registers and cache.
f90 v3.0:
-This compiler has new optimization options.
-New Math Library:
use -lmfastv to get optimized "vector" math library.
-The speaker's favorite optimization options:
f90 -O3,unroll2,pipeline2,apad -lmfastv
-Cache Bypass using e-register:
-Using cache will slow down array initialization. Do this to bypass
cache and use e-registers:
!dir$ cache_bypass c,a
do i=1,n
c(i) = a(i)
enddo
Unrolling:
do i=1,n,1
t(i) = t(i) * t(i) ! expect 1 result per 4 clock periods
enddo
do i=1,n,4 ! expect 4 results per 4 clock periods
t(i+0) = t(i+0) * t(i+0)
t(i+1) = t(i+1) * t(i+1)
t(i+2) = t(i+2) * t(i+2)
t(i+3) = t(i+3) * t(i+3)
enddo
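The unrolled loop in these notes is written as if n were a multiple of 4. As a hedged illustration only, here is the same transformation sketched in C with a cleanup loop for any leftover iterations; the function name square_unrolled4 and the cleanup loop are additions for this example and were not part of the talk.

```c
/* Square every element of t in place, four elements per loop trip. */
void square_unrolled4(double *t, int n)
{
    int i;
    int n4 = n - (n % 4);      /* largest multiple of 4 not exceeding n */

    for (i = 0; i < n4; i += 4) {
        t[i + 0] = t[i + 0] * t[i + 0];
        t[i + 1] = t[i + 1] * t[i + 1];
        t[i + 2] = t[i + 2] * t[i + 2];
        t[i + 3] = t[i + 3] * t[i + 3];
    }

    /* Cleanup: at most three remaining elements. */
    for (i = n4; i < n; i++)
        t[i] = t[i] * t[i];
}
```

The four statements in the unrolled body are independent of one another, which is what gives the processor a chance to overlap them.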
----------------------------------------------------------------------
SGI/Cray CEO Ed McCracken's Address:
-Quote: "California attitude: 'If it exists, it's obsolete.'"
-Purpose of SGI/Cray: "unleash human creativity."
-Culture of SGI/Cray:
-innovative: "white corpuscles rise up to kill boring projects."
-respect honesty toward clients.
-Integrity: "honoring commitments."
-Passion to excel.
-Deliver results: it's meaningless w/out delivery.
-Strengths:
-systems: SGI creates entire systems (gets components from specialists).
-"man-machine" interface.
-customer relations.
-Coming Changes:
-10^3 speedup in computers by 2010.
-Drastic speedup in networks.
-Effects of these changes:
-Recentralization of computing; application servers; growth in large
network hubs; resurgence of supercomputing centers.
-Current market:
-33% (!!) of SGI systems in network hubs, now.
-Managers of network hubs want a few huge systems rather than
100's of small systems.
-33% in supercomputing.
-33% in graphics.
-Current competitive areas in the SGI world:
-Unix.
-PC's.
-Network.
-Consumer electronics.
-Relationship with user groups:
-Catering too much to desires of today's users can impede progress, as
users want things to stay the same. Need to look to tomorrow's users.
----------------------------------------------------------------------
Richard Langstrom, SGI/Cray: T3E
-Resource Scheduling: To be in coming U/mk releases:
-Gang scheduler (moves groups of apps into run-status; swaps groups).
-Load Balancer (shifts apps between PEs).
-Quotas: Coming in v 1.5.2 (July?).
----------------------------------------------------------------------
Kent Koeninger, SGI/Cray: Cray product comparison:
T90: 880 Gb/sec -- mem-processor max bandwidth
1.6 Gflop per processor max
J90: Compatible w/ Y-MP
Good price/perf ratio
T3E: Most scalable
Origin2000: independent processor price/perf ratio comparable w/
T90 and T3E. SMP or MPP style of programming.
Each node is PVP w/ shared memory. Nodes connected MPP style
for globally addressable memory.
----------------------------------------------------------------------
Quick-Tip Q & A
A: {{ Can you tell NQS to not restart your job from the beginning after
a failed checkpoint at shutdown or after a crash? You might want
to do this if restarting the job would overwrite work already
completed. }}
# Use the qsub "no rerun" option, -nr. From the qsub man page:
#
# -nr Specifies that the batch request cannot be rerun.
#
Q: Is diamond a good conductor (of heat? of electricity?). How about
silicon?
[ Answers, questions, and tips graciously accepted. ]
Current Editors:
Ed Kornkven, ARSC HPC Specialist, ph: 907-450-8669
Kate Hedstrom, ARSC Oceanographic Specialist, ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
E-mail Subscriptions:
Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
