ARSC T3D Users' Newsletter 86, May 10, 1996
Workstation or Supercomputer?
Now that ARSC is moving to become a "cost recovery center", maybe it's time to look at the economics of using workstations versus supercomputers. I have five overhead slides by John Larson, from his time at CSRD (Center for Supercomputing Research and Development at the University of Illinois), entitled "Which machine do I use?" that describe the situation well. The slides describe a procedure for deciding between using a dedicated workstation and sharing a traditional supercomputer. Of course, if one of these machines is free to use, the decision should be easy! I'm going to simplify these slides into the ASCII format of this newsletter.
Let's suppose I have two options for solving my computer problems:
  Option #1   Dedicated Workstation      Relative Performance = 1
  Option #2   Timeshared Supercomputer   Relative Performance = 10

Each of us can imagine the workstation and supercomputer that is most applicable to our own situation. On either machine we are interested in how long we as users have to wait for results. How long we wait for results will determine how many results we produce. For each machine we have:

  Time(turnaround) = Time(computing) + Time(waiting)

and, as a general time breakdown we have:

  <-----------------Time(turnaround)------------->
  0--------------------------------------------------job finish
  <--Time(waiting)--> <--Time(computing)--------->

On the Dedicated Workstation we have:

  Time(turnaround) = Time(computing)
  Time(waiting)    = 0

and, as the Dedicated Workstation time breakdown we have:

  <-----------------Time(turnaround)------------->
  0--------------------------------------------------job finish
  <-----------------Time(computing)-------------->

In this case "dedicated" means it works on our problem alone; we don't share the Dedicated Workstation with anyone. The Expansion Factor is how long the job appears to take with respect to the time actually computing.

  Expansion Factor = Time(turnaround) / Time(computing)
  For a Dedicated Workstation, Expansion Factor = 1.

On the Timeshared Supercomputer we have a more complicated situation:

  Expansion Factor = Time(turnaround) / Time(computing)
  For a Timeshared Supercomputer, Expansion Factor > 1.

  Expansion Factor    = (may be 5 to 10)
  Time(waiting)       = (say, 9) * Time(computing)
  Time(computing)     = f(1/relative performance)
  Time(waiting)       = Time(queued_to_run) + Time(swapped_out)
  Time(queued_to_run) = f(workload)
  Time(swapped_out)   = f(workload, scheduling)
  workload            = f(number_of_jobs, resource_requirements_per_job)

and, as a Timeshared Supercomputer time breakdown we have:

  <-----------------Time(turnaround)------------->
  0--------------------------------------------------job finish
  <---------------Time(waiting)-----------><---->
                                            Time
                                         (computing)
With the Expansion Factor, we try to indicate how many users are sharing the same CPU. Under the assumption above, 5 to 10 users are sharing the same CPU. For any computer we have:
  Service = Work / Time(turnaround)
  Value   = Service / Cost

Just from this specification of the problem, we have these observations:
- All other things being equal, if the turnaround times of two machines are the same, choose the cheaper machine.
- The cost paid for the Dedicated Workstation goes completely toward computing.
- The cost paid for a Timeshared Supercomputer goes partly toward computing and partly to pay to wait.
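The model so far can be worked through with numbers. The short Python sketch below is only an illustration (Python is not used anywhere in the slides). The function names are hypothetical, the amount of work is arbitrary, and Time(computing) is modeled as work divided by relative performance, which is just one simple choice for f(1/relative performance).

```python
def turnaround(work, relative_performance, expansion_factor):
    """Time(turnaround) = Expansion Factor * Time(computing).

    Time(computing) is modeled as work / relative_performance, one simple
    choice for the f(1/relative performance) of the model (an assumption).
    """
    time_computing = work / relative_performance
    return expansion_factor * time_computing


def service(work, relative_performance, expansion_factor):
    """Service = Work / Time(turnaround)."""
    return work / turnaround(work, relative_performance, expansion_factor)


def value(work, relative_performance, expansion_factor, cost):
    """Value = Service / Cost."""
    return service(work, relative_performance, expansion_factor) / cost


work = 100.0  # arbitrary units of work (assumed)

# Option #1: dedicated workstation, Expansion Factor = 1.
workstation = turnaround(work, relative_performance=1.0, expansion_factor=1.0)

# Option #2: timeshared supercomputer with Time(waiting) = 9 * Time(computing),
# so Expansion Factor = 1 + 9 = 10.
supercomputer = turnaround(work, relative_performance=10.0, expansion_factor=10.0)

print(workstation, supercomputer)
```

With these particular numbers the tenfold speed advantage is cancelled by the tenfold expansion factor, so both options deliver the same Service and the cheaper machine gives the better Value.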
How can I get more Value from the Timeshared Supercomputer?

Using the model above we have:

  Time(turnaround)    = Time(computing) + Time(waiting)
  Time(waiting)       = (9) * Time(computing)
  Time(computing)     = f(1/relative performance)
  Time(waiting)       = Time(queued_to_run) + Time(swapped_out)
  Time(queued_to_run) = f(workload)
  Time(swapped_out)   = f(workload, scheduling)
  workload            = f(number_of_jobs, resource_requirements_per_job)
  Service             = Work / Time(turnaround)
  Value               = Service / Cost

This sequence distills our options:

  If my Cost is fixed (and nonzero), I must increase Service.
  If my Work is fixed, I must decrease Time(turnaround).
  If my Time(computing) is fixed, I must decrease Time(waiting).

So to get better value from my timeshared supercomputer the conclusion is:

  To decrease Time(waiting), the workload must be decreased.

But the workload is controlled by the site administration! Usually the site administration is dealing with hundreds of users, with each user having only a small amount of influence. So the user who chooses to use a timeshared supercomputer is almost helpless when it comes to getting his work done. How did this happen?
The core of the problem lies in a difference in expectations:
  The Timeshared Supercomputer salesman said that he sold me Time(computing), when what I wanted to buy was Time(turnaround). The salesman forgot to tell me how much Time(waiting) I was getting for "free".

What do I do now?

There is not much that can be done by the user. If a large portion of the user's time is Time(waiting), then not even optimizing his code has much of an effect (a perverse form of Amdahl's law). One option is to determine the expansion factor for this particular timeshared supercomputer (as approximated by wall-clock time vs. CPU time) and use this term in the reevaluation of the timeshared supercomputer.
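As a rough sketch of that measurement, the Python fragment below divides wall-clock time by CPU time. The function names are hypothetical. Dividing the relative performance by the expansion factor is only one plausible way to "use this term", not something the slides prescribe. Note also that timing a job from inside the process misses any time spent queued before it starts, so on a batch system the wall-clock figure should run from submission to completion.

```python
import time


def expansion_factor(wall_seconds, cpu_seconds):
    """Approximate Expansion Factor = Time(turnaround) / Time(computing)."""
    if cpu_seconds <= 0.0:
        raise ValueError("cpu_seconds must be positive")
    return wall_seconds / cpu_seconds


def effective_performance(relative_performance, factor):
    """Relative performance as seen by a user who also has to wait (assumed)."""
    return relative_performance / factor


def measure(job):
    """Run job() and return (wall_seconds, cpu_seconds) for this process."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    job()
    return time.perf_counter() - wall0, time.process_time() - cpu0


# Hypothetical accounting figures: one hour of turnaround for 400 CPU seconds.
factor = expansion_factor(wall_seconds=3600.0, cpu_seconds=400.0)
print(factor, effective_performance(10.0, factor))
```

In this made-up case the machine with Relative Performance = 10 behaves, from the user's chair, like one only slightly faster than the dedicated workstation.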
What can the Site Administration do?
Most of the options lie with the site administration, and they are not necessarily technical problems but policy and implementation choices:

- Reduce the workload (Time(waiting))
  - Reduce the number of jobs
    - limit eligible users
    - tighten allocation policies
    - restrict runs or hours used per month
    - allocation - use it or lose it
  - Reduce resource requirements of jobs
    - optimize CPU performance
      - training of staff and users
      - use tools - preprocessors, hpm, atexpert
      - identify and help critical users
      - use better algorithms and software packages
    - optimize memory usage
      - recompute rather than store
      - recycle variables and workspace
    - optimize I/O
      - IOS, SSD, memory-resident datasets
      - asynchronous I/O
      - use multitasking
- Increase relative performance (1/Time(computing))
  - Increase utilization
  - Get more powerful machine (speed, bandwidth, memory, I/O)
  - Realize that on a machine with a current expansion factor of 9, if Time(waiting) remains unchanged, Amdahl's Law limits the Time(turnaround) savings to 11% as Time(computing) goes to 0.
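The 11% figure in the last item follows directly from the definition of the Expansion Factor: if Time(waiting) is fixed, removing all of Time(computing) removes only 1/(Expansion Factor) of the turnaround. A small Python check of that arithmetic (the function name is hypothetical):

```python
def max_turnaround_savings(expansion_factor):
    """Largest fractional cut in Time(turnaround) from faster computing alone.

    With Time(waiting) held fixed, driving Time(computing) to zero removes
    only the computing share of the turnaround, which is 1 / expansion_factor.
    """
    if expansion_factor < 1.0:
        raise ValueError("expansion factor cannot be below 1")
    return 1.0 / expansion_factor


for ef in (1.0, 5.0, 9.0, 10.0):
    print(f"expansion factor {ef:4.1f}: at most "
          f"{100.0 * max_turnaround_savings(ef):.1f}% saved")
```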
The Linpack Benchmark on the T3D (with more than one processor)
Because linpack is a "one DO loop benchmark", optimizing it on a multiprocessor can be easy. We just concentrate on the major loop and then, when it is running well, we use the same optimizations on all other loops. From last week's newsletter we know that the DO loop nest of interest is:
      do 30 j = kp1, n
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
c        call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
         do 21 i = 1, n-k
            a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21    continue
   30 continue
where the call to the saxpy BLAS1 routine has been inlined. Also we have chosen the distribution of the two dimensional matrix for the 100x100 and 1000x1000 problems as:
  100x100 problem:                     1000x1000 problem:

      parameter( lda = 128 )               parameter( lda = 1024 )
      parameter( n = 100 )                 parameter( n = 1000 )
      real a( lda, lda )                   real a( lda, lda )
cdir$ shared a( :, :block(1) )       cdir$ shared a( :, :block(1) )
This choice of declarations was made because:
- Shared arrays must follow some power-of-2 restrictions (for Craft Fortran).
- Distribution by column preserves column access that is essential for a cache based processor (the T3D uses the DEC Alpha).
- Cyclic distribution of the columns provides natural load balancing during the factorization (Experience).
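The load-balancing point can be illustrated with a small Python sketch. It is not Craft Fortran and makes two assumptions: PEs are numbered from 0, and under the cyclic distribution column j (counting from 1) lives on PE (j-1) mod N$PES. The blocked distribution is a hypothetical alternative included only for comparison. At step k of the factorization only columns k+1 through n are still updated, so the sketch counts how many of those each PE holds.

```python
from collections import Counter


def cyclic_owner(j, npes):
    """PE owning 1-based column j under a cyclic column distribution."""
    return (j - 1) % npes


def blocked_owner(j, lda, npes):
    """PE owning column j if lda columns were split into npes equal blocks."""
    return (j - 1) // (lda // npes)


def active_columns_per_pe(owner, k, n, npes):
    """Count columns k+1..n (those still updated at step k) held by each PE."""
    counts = Counter(owner(j) for j in range(k + 1, n + 1))
    return [counts.get(pe, 0) for pe in range(npes)]


lda, n, npes, k = 128, 100, 4, 60
cyclic = active_columns_per_pe(lambda j: cyclic_owner(j, npes), k, n, npes)
blocked = active_columns_per_pe(lambda j: blocked_owner(j, lda, npes), k, n, npes)
print("cyclic :", cyclic)
print("blocked:", blocked)
```

For the 100x100 problem on 4 PEs at step 60, the cyclic layout leaves ten active columns on every PE, while the blocked layout leaves PE0 with nothing to do and PE2 with most of the work.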
The overall structure of the program is:

      program main
c     declarations
cdir$ master
      .                 ! start out uniprocessing
      .
      .
cdir$ endmaster
      call sgefa(       ! call routine that is executed on multiple processors
cdir$ master
      call sgesl(       ! return to uniprocessing
      .
      .
      .
cdir$ endmaster
      end

      subroutine sgefa( ! here's the routine to share
c     declarations
      do 60 k = 1, n-1                    ! all processors cycle through major loop
cdir$ master
      .                                   ! PE0 the master does:
      .                                     a. find pivot
      .                                     b. form multipliers ...
cdir$ endmaster
      call barrier                        ! sync
      do 30 j = kp1, n                    ! shared loop nest
         t = a(l,j)                       !
         if (l .eq. k) go to 20           ! exchange pivoted
            a(l,j) = a(k,j)               ! rows
            a(k,j) = t                    !
   20    continue                         !
c        call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)  ! inlined updates
         do 21 i = 1, n-k                 !
            a(k+i,j)=t*a(k+i,k)+a(k+i,j)  !
   21    continue                         !
   30 continue                            !
   60 continue
      .
      .
      .
      end
Below is a sequence of six versions of this DO loop nest for parallelization on the T3D:
- craft - a simple Craft version from the fpp'ed source
- mod1 - use the home intrinsic to distribute the updates
- mod2 - use the home intrinsic to distribute the updates and row exchanges
- mod3 - use the DO loop indices to distribute the work
- mod4 - use temp array to make local copy of multipliers
- mod5 - call a local version of saxpy
craft - simple craft version from the fpp'ed source
Using the transformation from fpp, as shown in last week's newsletter, we break the DO loop nest into the row exchanges done on PE0 and the updates which are done as a DO shared loop. The 'doshared' construct of Craft Fortran distributes the DO 31 work among the processors:
      do 30 j = kp1, n
         t = a(l,j)
         temp( j ) = t
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
   30 continue
cdir$ endmaster
      call barrier()
cdir$ doshared( j ) on a( k+i, j )
      do 31 j = k+1, n
c        call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
         do 21 i = 1, n-k
            a(k+i,j)=temp(j)*a(k+i,k)+a(k+i,j)
   21    continue
   31 continue
   60 continue
mod1 - use the home intrinsic to distribute the updates
The 'home' intrinsic returns the PE number on which the shared array element resides. In the code below, we use this to keep most of the update DO loop 21 computation local to the PE that owns the column being updated. A shared array temp is filled with the multipliers by PE0 and then accessed by the other PEs:
      do 30 j = kp1, n
         t = a(l,j)
         temp( j ) = t
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
   30 continue
cdir$ endmaster
      call barrier()
      do 31 j = k+1, n
         if( home( a( 1, j ) ) .eq. me ) then
c           call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
            do 21 i = 1, n-k
               a(k+i,j)=temp(j)*a(k+i,k)+a(k+i,j)
   21       continue
         endif
   31 continue
   60 continue
mod2 - use the home intrinsic to distribute the updates and row exchanges
Moving the locality test higher in the loop structure distributes more of the work (the row exchanges as well as the updates) and eliminates the need for the shared temp array:
cdir$ endmaster
      call barrier()
      do 30 j = kp1, n
         if( home( a( 1, j ) ) .eq. me ) then
            t = a(l,j)
            if (l .eq. k) go to 20
               a(l,j) = a(k,j)
               a(k,j) = t
   20       continue
c           call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
            do 21 i = 1, n-k
               a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21       continue
         endif
   30 continue
mod3 - use the DO loop indices to distribute the work
The cyclic column distribution of the array means that columns residing on the same processor are exactly N$PES columns apart. (N$PES is the number of processors that the program is currently running on.) We can use this information to fold the test for locality into the DO loop indices. This way the test for locality is done only once per pass through the major loop:
cdir$ endmaster
      call barrier()
      me0 = home( a( 1, k+1 ) )
      if( me .ge. me0 ) then
         istart = k+1 + ( me - me0 )
      else
         istart = k+1 + ( N$PES - ( me0 - me ) )
      endif
      do 30 j = istart, n, N$PES
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
c        call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
         do 21 i = 1, n-k
            a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21    continue
   30 continue
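The istart arithmetic is easy to get wrong, so here is a small Python check of it. This is a sketch, not Craft Fortran: the function home below is only a stand-in for the home intrinsic and assumes that PEs are numbered from 0 and that column j lives on PE (j-1) mod N$PES. Under that assumption it confirms that every PE visits only its own columns and that, taken together, the PEs visit each column from k+1 to n exactly once.

```python
def home(j, npes):
    """Assumed stand-in for home(a(1,j)): cyclic owner of 1-based column j."""
    return (j - 1) % npes


def istart(k, me, npes):
    """First column >= k+1 owned by PE `me`, following the mod3 logic."""
    me0 = home(k + 1, npes)
    if me >= me0:
        return k + 1 + (me - me0)
    return k + 1 + (npes - (me0 - me))


def columns_for(k, n, me, npes):
    """Columns visited by `do 30 j = istart, n, N$PES` on PE `me`."""
    return list(range(istart(k, me, npes), n + 1, npes))


n, npes = 100, 8
for k in range(1, n):
    visited = []
    for me in range(npes):
        cols = columns_for(k, n, me, npes)
        assert all(home(j, npes) == me for j in cols)  # every column is local
        visited.extend(cols)
    assert sorted(visited) == list(range(k + 1, n + 1))  # each column once
print("istart logic covers every column exactly once")
```

Near the end of the factorization some PEs get an istart larger than n and simply execute no iterations, which is the expected behavior.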
mod4 - use temp array to make local copy of multipliers
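The multipliers in column k reside on a single PE, so in mod3 every other PE fetches them remotely inside DO loop 21. Instead, the multipliers can be copied once to a local array on each PE: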
cdir$ endmaster
      call barrier()
      me0 = home( a( 1, k+1 ) )
      istart = k+1
      if( me .gt. me0 ) istart = k+1 + ( me - me0 )
      if( me .lt. me0 ) istart = k+1 + ( N$PES - ( me0 - me ) )
      do 29 j = k+1, n
         temp( j ) = a( j, k )
   29 continue
      do 30 j = istart, n, N$PES
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
         do 21 i = 1, n-k
            a(k+i,j)=t*temp(k+i)+a(k+i,j)
c           a(k+i,j)=t*a(k+i,k)+a(k+i,j)
   21    continue
   30 continue
mod5 - call a local version of saxpy
With all of the operands of DO loop 21 local to a single PE, the DO loop can be replaced with a call to the optimized uniprocessor version of the BLAS1 library routine, saxpy:
cdir$ endmaster
      call barrier()
      me0 = home( a( 1, k+1 ) )
      istart = k+1
      if( me .gt. me0 ) istart = k+1 + ( me - me0 )
      if( me .lt. me0 ) istart = k+1 + ( N$PES - ( me0 - me ) )
      do 29 j = k+1, n
         temp( j ) = a( j, k )
   29 continue
      do 30 j = istart, n, N$PES
         t = a(l,j)
         if (l .eq. k) go to 20
            a(l,j) = a(k,j)
            a(k,j) = t
   20    continue
c        call saxpy(n-k,t,a(k+1,k),1,a(k+1,j),1)
         call saxpy(n-k,t,temp(k+1),1,a(k+1,j),1)
c        do 21 i = 1, n-k
c           a(k+i,j)=t*temp(k+i)+a(k+i,j)
c           a(k+i,j)=t*a(k+i,k)+a(k+i,j)
c  21    continue
   30 continue
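To convince ourselves that splitting the columns among PEs does not change the factorization, the sketch below mimics the mod4/mod5 structure in plain Python. It is an illustration under several assumptions: the matrix is a list of columns with 0-based indices, column j is taken to live on PE j mod npes, the "PEs" are simulated one after another in a loop, and there is no handling of a zero pivot. All function names are hypothetical.

```python
def pivot_and_scale(cols, k):
    """Master's part of step k: find pivot row, swap it up, form multipliers."""
    n = len(cols)
    col = cols[k]
    l = max(range(k, n), key=lambda i: abs(col[i]))  # pivot row (like isamax)
    col[l], col[k] = col[k], col[l]
    t = -1.0 / col[k]                                # assumes a nonzero pivot
    for i in range(k + 1, n):
        col[i] *= t
    return l


def update_column(col, mult, k, l):
    """Row exchange plus saxpy update for one column (the body of DO 30)."""
    t = col[l]
    if l != k:
        col[l] = col[k]
        col[k] = t
    for i in range(k + 1, len(col)):
        col[i] += t * mult[i]


def sgefa_serial(cols):
    """Uniprocessor factorization: every column is updated in one loop."""
    n = len(cols)
    for k in range(n - 1):
        l = pivot_and_scale(cols, k)
        for j in range(k + 1, n):
            update_column(cols[j], cols[k], k, l)
    return cols


def sgefa_cyclic(cols, npes):
    """Same factorization, but each simulated PE updates only its own columns."""
    n = len(cols)
    for k in range(n - 1):
        l = pivot_and_scale(cols, k)               # done by the master PE
        for me in range(npes):                     # PEs simulated sequentially
            temp = list(cols[k])                   # local copy of multipliers
            first = k + 1 + (me - (k + 1)) % npes  # first local column > k
            for j in range(first, n, npes):
                update_column(cols[j], temp, k, l)
    return cols


n = 6
a = [[1.0 / (i + j + 1) + (n if i == j else 0.0) for i in range(n)]
     for j in range(n)]
reference = sgefa_serial([c[:] for c in a])
same = all(sgefa_cyclic([c[:] for c in a], p) == reference for p in (1, 2, 4))
print("distributed result matches serial result:", same)
```

Each column receives exactly the same sequence of operations in both versions, which is why the work can be handed out by column without any further synchronization inside a step.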
Results
Below is a summary of times for this sequence of modifications, covering only the factorization stage of the linpack problem. For this newsletter, we are only interested in timings for the routine sgefa that we are modifying. To give us some perspective on how we are doing, we add two uniprocessor timings:

  asis   - the uniprocessor version with no modifications
  lapack - the best uniprocessor version from last week's newsletter

The two uniprocessor times for the unmodified source and the lapack version show the worst and best uniprocessor times. The rows labeled craft3, craft4, and craft5 appear to correspond to the versions mod3, mod4, and mod5 described above.
Times (seconds) for the factorization phase of the linpack problem
(sgefa only):

        problem size      1PE     2PEs     4PEs     8PEs    16PEs    32PEs
        ------------  -------  -------  -------  -------  -------  -------
asis    100x100          .047
"       1000x1000      60.380
lapack  100x100          .018
"       1000x1000      10.080
craft   100x100          .102     .205     .125     .086     .087     .087
"       1000x1000     259.079  193.042  101.329   61.096   55.418   50.743
mod1    100x100          .100     .280     .159     .101     .090     .094
"       1000x1000     258.962  263.942  136.665   74.586   61.875   54.030
mod2    100x100          .095     .268     .155     .097     .083     .086
"       1000x1000     256.710  153.879  131.945   73.585   60.827   53.443
craft3  100x100          .090     .269     .150     .092     .080     .079
"       1000x1000     242.661  263.045  135.714   73.598   58.906   53.013
craft4  100x100          .070     .210     .122     .081     .064     .066
"       1000x1000      63.377  182.141   91.808   46.891   24.794   14.645
craft5  100x100          .097     .074     .054     .054     .052     .060
"       1000x1000      28.045   15.679    8.545    5.179    3.976    4.236
The lapack results on a single processor are hard to beat, but we are not yet done with sgefa, only with optimizing the major DO loop of the factorization phase. Similarly, for the solving phase of the linpack benchmark (the call to sgesl) we have increased the execution time because of the distribution of the array. The times below give some indication of the cost of using an array distributed among processors as opposed to a local array.
Times (seconds) for both phases (factorization and solving)
on linpack problem:

                                    1PE                 2PEs
                             sgefa     sgesl     sgefa     sgesl
                             -----     -----     -----     -----
asis          100x100         .047     .0015
              1000x1000     60.380     .1784
lapack        100x100         .018     .0006
              1000x1000     10.080     .0556
craft with    100x100         .070     .0017      .459     .0146
distr. array  1000x1000    242.400     .2284   465.900    1.4540
In next week's newsletter we'll see how this parallelization for the major DO loop nest influences the rest of the benchmark. Having done a good job on the parallel part we need to reduce the scalar portion to get good overall speedup (an application of Amdahl's law).
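To put numbers on this, the Python sketch below computes speedups from the 1000x1000 timings in the factorization table (the row labeled craft5 and the lapack uniprocessor time) and then evaluates Amdahl's law for a few scalar fractions. The scalar fractions are hypothetical values chosen only to show the shape of the limit; they are not measurements from this benchmark, and the function names are made up for the example.

```python
def speedups(times, baseline):
    """Speedup of each timing relative to a baseline time."""
    return [baseline / t for t in times]


def amdahl_speedup(scalar_fraction, npes):
    """Amdahl's law: speedup when only the parallel part scales with npes."""
    return 1.0 / (scalar_fraction + (1.0 - scalar_fraction) / npes)


# Seconds for the 1000x1000 sgefa runs, copied from the factorization table
# (the row labeled craft5, and the lapack uniprocessor time).
pes = [1, 2, 4, 8, 16, 32]
craft5 = [28.045, 15.679, 8.545, 5.179, 3.976, 4.236]
lapack_1pe = 10.080

self_relative = speedups(craft5, craft5[0])
vs_lapack = speedups(craft5, lapack_1pe)
for p, s, v in zip(pes, self_relative, vs_lapack):
    print(f"{p:2d} PEs: {s:5.2f}x vs 1 PE, {v:5.2f}x vs lapack on 1 PE")

# Hypothetical scalar fractions, chosen only to show the shape of the limit.
for frac in (0.0, 0.05, 0.10):
    print(f"scalar fraction {frac:.2f}: at most "
          f"{amdahl_speedup(frac, 32):.1f}x on 32 PEs")
```

On these figures the last version first beats the single-PE lapack time at 4 PEs, and going from 16 to 32 PEs no longer helps.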
Bug in STRSM
In the last newsletter, I mentioned that there was a bug in the BLAS3 routine, strsm. Casimir Suchyta of CRI mailed this in to say the bug is being fixed:

> Ed Anderson wanted me to let you know that the problem with an
> Operand Range Error in STRSM is SPR 103030 and a fix is already
> working its way through the integration process.
A Call for Material
If you have discovered a good technique or information on the T3D and you think it might benefit others, then send it to the email address below and it will be passed on through this newsletter.

Current Editors:
Ed Kornkven      ARSC HPC Specialist             ph: 907-450-8669
Kate Hedstrom    ARSC Oceanographic Specialist   ph: 907-450-8678

Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020

E-mail Subscriptions:
- Subscribe to (or unsubscribe from) the e-mail edition of the ARSC HPC Users' Newsletter.
- Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.
