On the T3E, streams refers to read-ahead of local memory locations. When enabled, this read-ahead attempts to fetch data from memory into cache before it is actually needed. Depending on a code's memory access pattern, it can improve performance significantly.
Whether streams are enabled in a given run of your program is determined by settings in the system configuration files, the user environment, and the program itself. A program's stream settings override the environment's, which in turn override the system default.
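The precedence just described (program over environment over system default) can be modelled in a few lines of C. This is only an illustrative sketch, not the T3E library interface: STREAM_UNSET, parse_stream_level() and effective_stream_level() are hypothetical names invented for this example, and the only thing taken from the text is the ordering of the three layers and the 0..3 range of levels.

```c
#include <stdlib.h>

/* Hypothetical sentinel meaning "this layer did not specify a level". */
#define STREAM_UNSET (-1)

/* Parse an environment-style value such as "0" or "1".
 * Returns STREAM_UNSET if the string is NULL, empty, or not in 0..3. */
int parse_stream_level(const char *text)
{
    char *end;
    long level;

    if (text == NULL || *text == '\0')
        return STREAM_UNSET;
    level = strtol(text, &end, 10);
    if (*end != '\0' || level < 0 || level > 3)
        return STREAM_UNSET;
    return (int)level;
}

/* The program's setting wins over the environment's, which wins over
 * the system default. */
int effective_stream_level(int program_level, const char *env_text,
                           int system_default)
{
    int env_level = parse_stream_level(env_text);

    if (program_level != STREAM_UNSET)
        return program_level;
    if (env_level != STREAM_UNSET)
        return env_level;
    return system_default;
}
```

A caller that has made no explicit setting might use effective_stream_level(STREAM_UNSET, getenv("SCACHE_D_STREAMS"), 1), where the final argument stands in for whatever the system default happens to be.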
In order to set the stream level, the following commands may be used:
System default:
    <Not under user control>
In the user environment:
    Set the environment variables SCACHE_D_STREAMS and/or SCACHE_I_STREAMS
    to the desired stream level. For instance:
      export SCACHE_D_STREAMS=0    [ksh user turns data streams off]
      setenv SCACHE_I_STREAMS 1    [csh user turns instruction streams on]
In programs:
    Call one of the routines SET_D_STREAM() or SET_I_STREAM() (or the C
    equivalents) from within your code. For instance:
      CALL SET_D_STREAM(0)    ! disable data streaming
To determine which settings are currently in use, call the functions GET_D_STREAM() or GET_I_STREAM() (or the C equivalents).
Here's a simple example:
      program prog
      ! d_stream_value is declared integer to match get_d_stream's result
      integer get_d_stream, d_stream_value
      d_stream_value=get_d_stream()
      write(6,*) ' streams were initially ',d_stream_value
      call set_d_stream(1)
      d_stream_value=get_d_stream()
      write(6,*) ' streams have been set to ',d_stream_value
      call set_d_stream(0)
      d_stream_value=get_d_stream()
      write(6,*) ' streams have been set to ',d_stream_value
      end
Streams can be set to any of four levels. You should experiment with data streams on and off, at the minimum, and possibly with all four different levels and with both the instruction and data stream capabilities. The following description of the different levels is taken from man GET_D_STREAM on yukon:
  0 or _MPC_NO_STREAMS            Deactivates read-ahead in the
                                  stream buffers. Previously
                                  detected small-strided reference
                                  streams continue to be active, but
                                  no new small-strided access streams
                                  are detected.
  1 or _MPC_DETECT_STREAMS        Activates read-ahead in the stream
                                  buffers. Stream detection occurs
                                  upon reference to the second of two
                                  successive 8-word secondary cache
                                  lines.
  2 or _MPC_INITIAL_PREFETCH      Sets stream detection to the level
                                  set by _MPC_DETECT_STREAMS, but in
                                  addition any secondary cache and
                                  stream buffer misses result in a
                                  prefetch of the next successive
                                  secondary cache line.
  3 or _MPC_AGGRESSIVE_PREFETCH   Sets stream detection to the level
                                  set by _MPC_INITIAL_PREFETCH, but
                                  additional aggressive read-ahead is
                                  performed.
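
When experimenting with all four levels, it can help to label each run with the level it used. The following C sketch maps a level number to the macro name quoted in the man page excerpt. The enum constants and stream_level_name() are hypothetical names made up for this illustration; only the numeric values and the _MPC_* strings come from the excerpt.

```c
#include <stddef.h>

/* Illustrative names only; the real macros are the _MPC_* names
 * listed in the man page excerpt. */
enum stream_level {
    STREAM_NONE = 0,
    STREAM_DETECT = 1,
    STREAM_INITIAL_PREFETCH = 2,
    STREAM_AGGRESSIVE_PREFETCH = 3
};

/* Return the man-page macro name for a level, or NULL if the level
 * is outside 0..3. */
const char *stream_level_name(int level)
{
    switch (level) {
    case STREAM_NONE:                return "_MPC_NO_STREAMS";
    case STREAM_DETECT:              return "_MPC_DETECT_STREAMS";
    case STREAM_INITIAL_PREFETCH:    return "_MPC_INITIAL_PREFETCH";
    case STREAM_AGGRESSIVE_PREFETCH: return "_MPC_AGGRESSIVE_PREFETCH";
    default:                         return NULL;
    }
}
```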
Additional ARSC Users Note: Although our system default is now to have streams on at level 1, if your code calls any SHMEM routine, the executable will be set to override this at run time and will run with streams off. If you want your program to use streams, set them explicitly to the level you want: either set the SCACHE_D_STREAMS environment variable to 1 in your environment at run time, or code your program to turn streams on internally by calling SET_D_STREAM().
Having installed programming environment 3.0 and upgraded yukon to be streams-safe, we decided to run this follow-up to an article that appeared in issue #120. We test several different means of copying one array to another on a single processor on yukon.
This revised program can test the PE3.0 cache_bypass directive to force copies through the E-register rather than cache. Under PE2.0, shmem_put and shmem_get were clearly the fastest available methods. However, running with streams enabled, a conventional loop with cache_bypass is equally fast.
In the following table, the rates are given in millions of array elements (64-bit words) transferred per second (MW/s), and should not be construed as memory bandwidth. The tests were made on ARSC's 450MHz T3E, yukon. In order to make the runs on application (rather than shell) PEs, they were started on 2 PEs (mpprun -n2), and the results from one of the two PEs were discarded.
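
The MW/s figures are computed from a word count, an elapsed tick count, and the clock's ticks per second, in the same way as the expression (copysz / ((t8-t7)/real(iclktck))) / 1000000 in the test program. A minimal C restatement of that arithmetic follows; the function name and the guard for non-positive inputs are additions made for this sketch.

```c
/* Transfer rate in millions of words per second (MW/s).
 * words:            number of array elements copied
 * ticks:            elapsed clock ticks for the copy
 * ticks_per_second: clock rate used to convert ticks to seconds
 * Returns 0.0 if the inputs cannot yield a meaningful rate. */
double mwords_per_second(long words, long ticks, long ticks_per_second)
{
    double seconds;

    if (ticks <= 0 || ticks_per_second <= 0)
        return 0.0;
    seconds = (double)ticks / (double)ticks_per_second;
    return ((double)words / seconds) / 1.0e6;
}
```

For instance, one million words copied in half a second gives 2 MW/s.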
Here are the timing results followed by the program:
------------------------------------------------------------
                                  STREAMS ON    STREAMS OFF
                                  ==========    ===========
Copying 10 words:
  cache bypass loop construct        1. MW/s        1. MW/s
  cache bypass f90 array op          3. MW/s        3. MW/s
  shmem_get                          2. MW/s        2. MW/s
  shmem_put                          3. MW/s        3. MW/s
  f90 array op                       3. MW/s        2. MW/s
  loop construct                     2. MW/s        2. MW/s
Copying 100 words:
  cache bypass loop construct        8. MW/s        8. MW/s
  cache bypass f90 array op         13. MW/s        8. MW/s
  shmem_get                         15. MW/s       15. MW/s
  shmem_put                          6. MW/s       19. MW/s
  f90 array op                      13. MW/s        8. MW/s
  loop construct                    12. MW/s        8. MW/s
Copying 1000 words:
  cache bypass loop construct       31. MW/s       31. MW/s
  cache bypass f90 array op         19. MW/s        9. MW/s
  shmem_get                         31. MW/s       26. MW/s
  shmem_put                         32. MW/s       32. MW/s
  f90 array op                      19. MW/s       10. MW/s
  loop construct                    19. MW/s       10. MW/s
Copying 10000 words:
  cache bypass loop construct       36. MW/s       36. MW/s
  cache bypass f90 array op         23. MW/s       10. MW/s
  shmem_get                         36. MW/s       36. MW/s
  shmem_put                         36. MW/s       36. MW/s
  f90 array op                      23. MW/s       11. MW/s
  loop construct                    23. MW/s       11. MW/s
Copying 100000 words:
  cache bypass loop construct       36. MW/s       36. MW/s
  cache bypass f90 array op         27. MW/s       11. MW/s
  shmem_get                         35. MW/s       37. MW/s
  shmem_put                         36. MW/s       37. MW/s
  f90 array op                      28. MW/s       11. MW/s
  loop construct                    28. MW/s       11. MW/s
Copying 1000000 words:
  cache bypass loop construct       36. MW/s       36. MW/s
  cache bypass f90 array op         28. MW/s       11. MW/s
  shmem_get                         36. MW/s       36. MW/s
  shmem_put                         36. MW/s       36. MW/s
  f90 array op                      28. MW/s       11. MW/s
  loop construct                    28. MW/s       11. MW/s
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
c T3E program to measure memory bandwidth using different techniques
c for copying arrays.  For best (?) performance, compile with:
c     -Ounroll2
c and run on T3E application PEs rather than shell PEs by requesting
c 2 PEs:
c     mpprun -n2
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
      program prog
      implicit none
      integer, parameter::SZ=1000000
      real c(SZ), a(SZ)
      integer i,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12
      integer irtc, ierr, index, iclktck, my_pe, shmem_my_pe
      integer copycnt, copysz

! Get machine clock ticks per second
      call pxfconst ('CLK_TCK',index,ierr)
      call pxfsysconf (index, iclktck, ierr)

! Get my pe
      my_pe = shmem_my_pe()

      do copycnt=1,6
        copysz=10**copycnt
        c = 1

! copy arrays in loop with cache_bypass directive
        a = 0
        t7 = irtc ()
!dir$ cache_bypass c,a
        do i=1,copysz
          a(i) = c(i)
        enddo
        t8 = irtc ()
        call chkcopy (a, c, copysz, SZ)

! copy arrays f90 syntax with cache_bypass directive
        a = 0
        t9 = irtc ()
!dir$ cache_bypass c,a
        a(1:copysz) = c(1:copysz)
        t10 = irtc ()
        call chkcopy (a, c, copysz, SZ)

! copy arrays using shmem_get.
        a = 0
        t11 = irtc ()
        call shmem_get (a, c, copysz, my_pe)
        t12 = irtc ()
        call chkcopy (a, c, copysz, SZ)

! copy arrays using shmem_put.
        a = 0
        t1 = irtc ()
        call shmem_put (a, c, copysz, my_pe)
        t2 = irtc ()
        call chkcopy (a, c, copysz, SZ)

! copy arrays using f90 array operation
        a = 0
        t3 = irtc ()
        a(1:copysz) = c(1:copysz)
        t4 = irtc ()
        call chkcopy (a, c, copysz, SZ)

! copy arrays using loop
        a = 0
        t5 = irtc ()
        do i=1,copysz
          a(i) = c(i)
        enddo
        t6 = irtc ()
        call chkcopy (a, c, copysz, SZ)

        write (6,'("Copying ", i8, " words:")') copysz
        write (6,*)
        write (6,1000) "cache bypass loop construct",
     &    (copysz / ((t8-t7)/real(iclktck))) / 1000000
        write (6,1000) "cache bypass f90 array op",
     &    (copysz / ((t10-t9)/real(iclktck))) / 1000000
        write (6,1000) "shmem_get",
     &    (copysz / ((t12-t11)/real(iclktck))) / 1000000
        write (6,1000) "shmem_put",
     &    (copysz / ((t2-t1)/real(iclktck))) / 1000000
        write (6,1000) "f90 array op",
     &    (copysz / ((t4-t3)/real(iclktck))) / 1000000
        write (6,1000) "loop construct",
     &    (copysz / ((t6-t5)/real(iclktck))) / 1000000
 1000   format (A,t30,f6.0," MW/s")
        write (6,*)
        write (6,*)
      enddo
      end
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
! Called mainly to ensure that compiler doesn't remove the entire
! copy operation as an optimization.
      subroutine chkcopy (a, c, copysz, SZ)
      integer copysz, SZ
      real c(SZ), a(SZ)
      integer i

! Verify copy completed.
      do i=copysz,1,-1
        if (c(i) .NE. a(i)) stop "copy failed"
      enddo
      end
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
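
For readers who want to try the same copy, time, verify pattern away from the T3E, here is a rough portable C analogue. It is a sketch under stated assumptions, not a port of the program above: it has no counterpart to the cache_bypass directive or the SHMEM routines, memcpy() is only a loose stand-in for the f90 array operation, and the names copy_fn, copy_loop, copy_memcpy, check_copy and timed_copy are all invented here. The standard clock() function may be too coarse to time small copies meaningfully.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(double *dst, const double *src, size_t n);

/* Explicit loop, analogous to the "loop construct" test. */
void copy_loop(double *dst, const double *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Library copy, loosely analogous to the array-operation test. */
void copy_memcpy(double *dst, const double *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Like chkcopy: verify the copy, which also keeps the compiler from
 * discarding it. Returns 1 if the first n words match, else 0. */
int check_copy(const double *dst, const double *src, size_t n)
{
    size_t i;

    for (i = n; i > 0; i--)
        if (dst[i - 1] != src[i - 1])
            return 0;
    return 1;
}

/* Fill the source, clear the destination, time one copy and verify it.
 * Returns elapsed CPU seconds, or -1.0 on allocation or copy failure. */
double timed_copy(copy_fn copy, size_t n)
{
    double *src = malloc(n * sizeof *src);
    double *dst = malloc(n * sizeof *dst);
    double seconds = -1.0;
    clock_t start;
    size_t i;

    if (src != NULL && dst != NULL) {
        for (i = 0; i < n; i++) {
            src[i] = 1.0;
            dst[i] = 0.0;
        }
        start = clock();
        copy(dst, src, n);
        seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (!check_copy(dst, src, n))
            seconds = -1.0;
    }
    free(src);
    free(dst);
    return seconds;
}
```

Dividing the word count by the returned time (when it is greater than zero) and by one million gives a rate in the same MW/s units as the table.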
Users often ask which Fortran compiler options are best for code they are porting to the T3E.
This is a difficult question, since the effectiveness of each option varies greatly with both the nature of the algorithm and the way the programmer expresses it to the compiler. However, one workable procedure is to start with no options and then add the following options incrementally, in this order, noting how performance changes and checking results at each stage. (Some optimisations will change results, because reordering operations changes the rounding errors. These changes should be small.)
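
The remark about reordered operations can be seen directly in floating-point arithmetic, where addition is not associative. The C sketch below evaluates the same three operands in two orders; the function names are invented for the illustration, and the operands suggested afterwards are deliberately extreme so that the effect is visible. In real codes the difference is normally confined to the last few bits.

```c
/* The same three operands, summed in two different orders. */
double sum_left_to_right(double a, double b, double c)
{
    return (a + b) + c;
}

double sum_right_to_left(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e100, b = -1e100 and c = 1.0 in IEEE double precision, the first order gives 1.0, while the second gives 0.0 because c is lost when it is added to the much larger b.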
It is also useful to assess the performance of the different parts of the code before starting this exercise, since some options may speed up one part and slow down another. However, applying the options below should yield an improvement with relatively modest effort from the programmer.
-O3
    this is a basic optimisation set.
-O3,aggress
    this increases the internal tables for the compiler and
    allows further modification of user code for performance
    optimisations.
-O3,aggress -apad
    this pads arrays on which there might be cache conflicts.
-O3,aggress,unroll2 -apad
    this starts to unroll loops to exploit multiple operations
    per load into cache from main memory.
-O3,aggress,unroll2,pipeline2 -apad
    here pipelining is exploited to try and get one result per
    clock cycle.
-O3,aggress,unroll2,pipeline2,split2 -apad
    here loops are split to restrict the number of streams open
    at any one time to that in the hardware.
-O3,aggress,unroll2,pipeline2,split2 -apad -lmfastv
    this option uses a faster but not fully IEEE compliant math
    library for intrinsics.
Hopefully, the seven sets of options above will allow users to make a first pass at optimising code with the help of the compiler. After applying these options, the programmer must then consider explicit code modifications to exploit the features of the underlying architecture. More on these changes in future newsletters.
For more information, these options are described in greater depth, with some examples, in the Cray document "The Benchmarker's Guide to Single-processor Optimisation for Cray T3E Systems." It is available in PostScript via anonymous ftp to ftp.arsc.edu, in the directory pub/mpp/docs, under the name bmguide.ps.Z.
On October 13th and 14th, ARSC is holding a forum for users local to Fairbanks. Monday morning presentations by ARSC staff will give updates on ARSC resources. Following that session, there will be a series of short talks by users of those resources. The schedule for the forum is listed below or at:
http://www.arsc.edu/~horner/UserForum.html
All local users are encouraged to attend. The forum will be held in Butrovich 109, the Regents Conference Room.
========================================================================
                            ARSC User Forum
========================================================================
MONDAY Oct 13, 1997
========================================================================
 9:00 AM  ARSC Presentations
            o Barbara Horner-Miller   Introductions
            o Frank Williams          Welcome
            o Guy Robinson            T3E Upgrades and Transition Plans
            o Tom Baring              Programming Environment 3.0 and
                                      Software Update
10:15 AM  Break
            o Sergei Maurits          ARSC Visualization Update
            o Virginia Bedford        ARSC Mass Storage Plans
11:30 AM  Lunch Break
 1:00 PM  SAR Processing
            o Rick Guritz             Technology Prototyping in SAR
                                      Processing at ASF
            o Thomas Logan            PAISP - The Parallel ASF
                                      Interferometric SAR Processor
          Visualization
            o Chris Hartman           Use of OpenGL in CS381
            o Bill Brody              Music is to time as visual art is to
                                      space & Hulahula: an evolving
                                      landscape
            o Sergei Maurits          The Polar Ionosphere Model and its
                                      Real-Time Applications
 2:30 PM  Break
 2:45 PM  Mechanical Engineering
            o Jonah Lee               Computational Mechanical Engineering
                                      & Collaborative Computing Using CORBA
            o Tinggang Zhang          A Numerical Approach to Fatigue
                                      Analysis
          Miscellaneous
            o Knut Stamnes            Title?
            o Gang Chen (Mining)      Time-Dependent Behavior of Soft Rock
                                      Strata in Underground Mines &
                                      Computer Simulation of Blasting
            o Giray Okten             Applications of Hybrid-Monte Carlo
                                      Methods
TUESDAY Oct 14, 1997
========================================================================
 9:00 AM    o Barbara Horner-Miller   Intro to Day 2
 9:05 AM  Atmospheric Science
            o Bob Andres              Improved visualization of carbon
                                      dioxide emissions from fossil fuel
                                      consumption
            o Jeff Tilly              Regional Modeling in the Western
                                      Arctic
            o Alexander Mahura        Atmospheric Transport Pathways for
                                      Pollutants - Trajectory Model Studies
10:15 AM  Break
10:30 AM  Oceanography
            o David Eslinger          3D Coupled Biological & Physical
                                      Modeling
11:00 AM  ARSC Tour
11:30 AM  Lunch Break
 1:00 PM  Geophysics
            o Elena Troshina          A Time-Dependent Numerical Model of
                                      the Antarctic Icesheet
            o Sukumar Bandopadhyay    Ventilation for Arctic Mines
          Space Science
            o Antonius Otto           Plasma Processes in the Earth's
                                      Magnetosphere
            o Peter Delamere          A Hybrid Code for an F-region
                                      Chemical Release
            o Daniel Swift            Space Plasma Simulations Using Hybrid
                                      Codes in Generalized Curvilinear
                                      Coordinates
 2:30 PM  Break
 2:45 PM  Miscellaneous
            o Chuen-Sen Lin (ME)      Mechanical Design & Motion Analysis
 3:00 PM  T3E Presentation
            o Guy Robinson            Using the CRAY T3E
 3:30 PM  ARSC Wrapup
            o Barbara Horner-Miller   Summary of "Care-Abouts"
A: {{ Is this a good idea? Self-documenting? Is it even valid!? }}
The C snippet offered last week, which used a question mark-colon
construct as an lvalue, was extracted from the source code of W3's
browser, "arena." Here's more of the code:
-------------------------------------------------------------------------
#ifdef __STRICT_ANSI__
if(buffer_cell->prev)
buffer_cell->prev->next = buffer_cell->next;
else
buffer->cell = buffer_cell->next;
#else
((buffer_cell->prev) ? buffer_cell->prev->next : buffer->cell) =
buffer_cell->next;
#endif /* __STRICT_ANSI__ */
------------------------------------------------------------------------
This provides part of our answer. Depending on your compiler, the
construct may be accepted, but it does not conform to the ANSI C standard.
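
A single-statement form that stays within ANSI C is to select between the addresses of the two candidates and dereference the result, so that the conditional yields a pointer rather than being used as an lvalue itself. The sketch below applies that idea to the excerpt. The struct definitions are hypothetical stand-ins, since the excerpt does not show arena's actual declarations, and unlink_forward() is a name invented for this example.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the structures implied by the excerpt. */
struct cell {
    struct cell *prev;
    struct cell *next;
};

struct buffer {
    struct cell *cell;    /* first cell in the list */
};

/* Redirect whichever pointer currently leads to c (its predecessor's
 * next field, or the buffer's head pointer) to c's successor. As in
 * the excerpt, only the forward link is updated. */
void unlink_forward(struct buffer *buffer, struct cell *c)
{
    *(c->prev ? &c->prev->next : &buffer->cell) = c->next;
}
```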
Why would it be a good idea?
If programmer productivity is measured in lines of code per day, then
it might be a "bad" idea. If measured in program effort per line of
code, then it's "good." More likely, it would be "good" if it leads
to a more efficient executable.
To help in testing, here's a trivial program which uses the construct:
#include <stdio.h>
int main(void) {
  int i=0,j=1;
  printf ("i:%d j:%d\n", i, j);
  *(i < j ? &i : &j) = 100;
  printf ("i:%d j:%d\n", i, j);
  if (i < j)
    i = 200;
  else
    j = 200;
  printf ("i:%d j:%d\n", i, j);
  return 0;
}
What follows is part of the output of the command "cc -S" run on the
above program (the "-S" option instructs the compiler to translate
the C-code to assembly). This output was produced on an SGI
workstation:
[ ---------------------------------------- 2 labels, 8 commands ]
# 9 *(i < j ? &i : &j) = 100;
lw $15, 44($sp)
lw $24, 40($sp)
bge $15, $24, $32
addu $16, $sp, 44
b $33
$32:
addu $16, $sp, 40
$33:
li $25, 100
sw $25, 0($16)
.loc 2 11
[ ---------------------------------------- 2 labels, 8 commands ]
# 13 if (i < j)
lw $8, 44($sp)
lw $9, 40($sp)
bge $8, $9, $34
.loc 2 14
# 14 i = 200;
li $10, 200
sw $10, 44($sp)
b $35
$34:
.loc 2 16
# 15 else
# 16 j = 200;
li $11, 200
sw $11, 40($sp)
$35:
.loc 2 18
# 17
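The question mark-colon construct didn't simplify the assembly
output: each version compiles to eight instructions. The first
computes the target address ("addu") and then performs a single store
through it, while the second stores directly to each variable's stack
slot ("sw"), so neither form has an obvious speed advantage here.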
Q: In Unix, how can you list the contents of the current directory,
and the contents of every subdirectory within the current directory,
but not descend any deeper into the depths of the tree of
subdirectories?
[ Answers, questions, and tips graciously accepted. ]
Contact:
  Thomas J. Baring, ARSC Web Specialist, ph: 907-450-8619
  Donald Bahls, ARSC User Consultant, ph: 907-450-8674
  Arctic Region Supercomputing Center
  University of Alaska Fairbanks
  PO Box 756020
  Fairbanks AK 99775-6020