There have been some late nights around here, and they were well spent. Yukon has been accepted, with over 96% uptime during its acceptance period, and is now running in pre-production mode.
We have a handful of non-staff users on yukon. Here's a message from Don Morton of Cameron University (he didn't intend this for the newsletter -- which, to me, makes it more interesting -- but I asked and he said okay).
> Subject: Preliminary T3E times, comments, etc.
>
> What follows are some T3D vs.T3E times on an MPI hydrologic code. I'll
> also add in T3D CRAFT times for code. All times in seconds.
>
> PE's   Initialization Time   Single Timestep   Single Timestep
>          T3D      T3E          T3D     T3E        T3D-CRAFT
>
>   2      26.9     3.64         6.18    2.16         6.23
>   4      26.9     3.76         4.80    1.74         4.22
>   8      26.8     3.64         4.58    1.85         3.31
>  16      26.9     3.64         5.62    2.47         2.86
>
> The "Initialization Time" is time needed to read input files,
> set up data structures, etc. I know the T3E was going to have
> much better I/O performance, and this seems to show it... Nice
> to be rid of the Y-MP/T3D bottleneck!
>
> Looks like I still have a communications bottleneck which I'll
> probably try to resolve with Shmem. The bottleneck is a small
> loop which, many times within a given timestep, exchanges a small
> number of values (approx. 5-15). Seems that the MPI latency is
> pretty high, and CRAFT seems to get around some of this when I use
> shared arrays.
In newsletter #100, we presented a simple shmem program which measures bandwidth, and used it on the T3D to compare EPCC's 2-sided MPI routines with shmem. In this newsletter, we use essentially the same program to look at bandwidth on ARSC's T3E. It gives a T3D maximum of ~126 MB/s and a T3E maximum which, in our tests, ranged from ~250 MB/s to ~333 MB/s. This variance on the T3E is discussed below.
First, here's how the program works. It runs on two PEs, the sender and receiver. The two processes synchronize at a barrier. The sender then initiates an asynchronous send using shmem_put() while the receiver waits at a shmem_wait() for the last item in the buffer which serves as a signal that the entire buffer has arrived. These two functions are timed, the results printed, and the processes loop back and repeat the sequence with a larger buffer.
We hope that a "red flag" went up when you read the statement: "the receiver waits at a shmem_wait() for the last item in the buffer."
On the T3E, if adaptive routing is enabled, this is not a reliable way to test for the arrival of data. However, we have not yet enabled adaptive routing on yukon, and, just to be sure, we ran a version of the program which examines every buffer element to confirm that the entire buffer was received. On the T3D, buffer packets are guaranteed to arrive in order, so testing the last element does test the entire buffer. We will discuss adaptive routing in greater detail in future newsletters.
The following output is from a run on ARSC's T3D. nwords gives the size in words of the buffer exchanged; ticks is the number of clock ticks required for the transfer to complete; and usecs is the number of microseconds required. The mbr field is bandwidth in megabytes per second. Runs on the T3D always give bandwidths within a couple percent of the values shown in this run, and are not influenced by neighboring jobs.
T3D Run
====================
RECEIVER: nwords=1 ticks=1272 usecs=8.479152; mbr=0.943491
RECEIVER: nwords=10 ticks=672 usecs=4.479552; mbr=17.858930
RECEIVER: nwords=100 ticks=1533 usecs=10.218978; mbr=78.285718
RECEIVER: nwords=1000 ticks=10109 usecs=67.386591; mbr=118.717980
RECEIVER: nwords=10000 ticks=95708 usecs=637.989500; mbr=125.393913
RECEIVER: nwords=100000 ticks=957078 usecs=6379.881672; mbr=125.394175
RECEIVER: nwords=1000000 ticks=9479730 usecs=63191.877442; mbr=126.598549
SENDER: nwords=1 ticks=1010 usecs=6.732660
SENDER: nwords=10 ticks=725 usecs=4.832850
SENDER: nwords=100 ticks=1619 usecs=10.792254
SENDER: nwords=1000 ticks=10151 usecs=67.666563
SENDER: nwords=10000 ticks=95819 usecs=638.729426
SENDER: nwords=100000 ticks=956952 usecs=6379.041756
SENDER: nwords=1000000 ticks=9479679 usecs=63191.537476
The next batch of results is from ARSC T3E runs. As they show, the bandwidth the program achieves on the T3E is both better and far more variable than that which it achieves on the T3D, and it is influenced by other jobs.
This contrast between the T3D and T3E variability makes sense considering that this program runs on exactly two PEs and that in the T3D architecture, PEs are paired up, two per node, with a dedicated route between them. Communication between the sender and receiver pair on the T3D is unaffected by traffic on the torus, even when some of that traffic passes through the node on which they are running. On the T3E, however, every PE is its own node, and traffic between any given pair of nodes shares the torus network with traffic between nodes of other applications.
For the T3E runs, I used mppview -s all to capture the configuration at the time of each of three runs (the job appears in these displays as shmem_wait). Another user's 64-node job was running throughout my runs. The best bandwidth was achieved in the third run, below, after I inserted a 16-node spacer job which separated the 2-node job from the 64-node job, giving it a relatively quiet route through the torus. (The spacer program, route, does introduce some traffic on the torus.)
T3E Run 1:
====================
UID PPID APID Run Time PEs Base Command
----- ----- ---------- -------- ----- ----- ----------
162 7838 0x0186dc13 03:29:28 64 0 a.out.64
1235 11631 0x01459c14 00:00:20 2 64 shmem_wait
====================
RECEIVER: nwords=1 ticks=3210 usecs=10.700000; mbr=0.747664
RECEIVER: nwords=10 ticks=1966 usecs=6.553333; mbr=12.207528
RECEIVER: nwords=100 ticks=1834 usecs=6.113333; mbr=130.861505
RECEIVER: nwords=1000 ticks=10640 usecs=35.466667; mbr=225.563910
RECEIVER: nwords=10000 ticks=92944 usecs=309.813333; mbr=258.220003
RECEIVER: nwords=100000 ticks=950161 usecs=3167.203333; mbr=252.588772
RECEIVER: nwords=1000000 ticks=9421812 usecs=31406.040000; mbr=254.728071
SENDER: nwords=1 ticks=2900 usecs=9.666667
SENDER: nwords=10 ticks=1614 usecs=5.380000
SENDER: nwords=100 ticks=950 usecs=3.166667
SENDER: nwords=1000 ticks=9542 usecs=31.806667
SENDER: nwords=10000 ticks=92649 usecs=308.830000
SENDER: nwords=100000 ticks=949593 usecs=3165.310000
SENDER: nwords=1000000 ticks=9421195 usecs=31403.983333
T3E Run 2:
====================
UID PPID APID Run Time PEs Base Command
----- ----- ---------- -------- ----- ----- ----------
162 7838 0x0186dc13 05:20:44 64 0 a.out.64
1235 12203 0x01459c11 00:00:09 8 64 route <--8 node "spacer"
1235 12212 0x0145bc12 00:00:04 2 72 shmem_wait
====================
RECEIVER: nwords=1 ticks=3358 usecs=11.193333; mbr=0.714711
RECEIVER: nwords=10 ticks=2186 usecs=7.286667; mbr=10.978957
RECEIVER: nwords=100 ticks=2054 usecs=6.846667; mbr=116.845180
RECEIVER: nwords=1000 ticks=10552 usecs=35.173333; mbr=227.445034
RECEIVER: nwords=10000 ticks=89032 usecs=296.773333; mbr=269.565999
RECEIVER: nwords=100000 ticks=932565 usecs=3108.550000; mbr=257.354715
RECEIVER: nwords=1000000 ticks=9403328 usecs=31344.426667; mbr=255.228787
SENDER: nwords=1 ticks=3008 usecs=10.026667
SENDER: nwords=10 ticks=1746 usecs=5.820000
SENDER: nwords=100 ticks=817 usecs=2.723333
SENDER: nwords=1000 ticks=9510 usecs=31.700000
SENDER: nwords=10000 ticks=88518 usecs=295.060000
SENDER: nwords=100000 ticks=931773 usecs=3105.910000
SENDER: nwords=1000000 ticks=9402727 usecs=31342.423333
T3E Run 3: (Best Performance)
===============================
UID PPID APID Run Time PEs Base Command
----- ----- ---------- -------- ----- ----- ----------
162 7838 0x0186dc13 05:26:43 64 0 a.out.64
1235 12296 0x0145bc13 00:00:15 16 64 route <--16 node "spacer"
1235 12313 0x0145f414 00:00:08 2 80 shmem_wait
====================
RECEIVER: nwords=1 ticks=3226 usecs=10.753333; mbr=0.743955
RECEIVER: nwords=10 ticks=1958 usecs=6.526667; mbr=12.257406
RECEIVER: nwords=100 ticks=1562 usecs=5.206667; mbr=153.649168
RECEIVER: nwords=1000 ticks=8008 usecs=26.693333; mbr=299.700300
RECEIVER: nwords=10000 ticks=72100 usecs=240.333333; mbr=332.871012
RECEIVER: nwords=100000 ticks=729933 usecs=2433.110000; mbr=328.797301
RECEIVER: nwords=1000000 ticks=7262705 usecs=24209.016667; mbr=330.455388
SENDER: nwords=1 ticks=2928 usecs=9.760000
SENDER: nwords=10 ticks=1778 usecs=5.926667
SENDER: nwords=100 ticks=937 usecs=3.123333
SENDER: nwords=1000 ticks=7585 usecs=25.283333
SENDER: nwords=10000 ticks=71993 usecs=239.976667
SENDER: nwords=100000 ticks=729573 usecs=2431.910000
SENDER: nwords=1000000 ticks=7262383 usecs=24207.943333
====================
Here's the code. As noted in Newsletter #116, the cache coherence call becomes a no-op on the T3E, but should remain for T3D portability. The only change to this code in porting to the T3E was in the include preprocessor commands (for details, see the next article).
/*-------------------------------------------------------------*/
#include <mpp/shmem.h>
#ifdef T3D
#include <mpp/stdio.h>
#include <mpp/limits.h>
#include <mpp/time.h>
#endif
#ifdef T3E
#include <stdio.h>
#include <limits.h>
#include <time.h>
#endif

#define BUFSZ 1000000

/* d=delta clock ticks; n=number of words (8 bytes each) */
#define USECS(d) ((float)(d)*1000000.0/(float)CLK_TCK)
#define MBR(d,n) (((float)(n)*8.0/1000000.0)/(USECS(d)/1000000.0))

main () {
  int n;
  long nwords;
  int mype, otherpe, npes;
  long t1,t2;
  long buf[BUFSZ];
  fortran irtc();

  npes = shmem_n_pes();
  if (npes < 2) {
    printf ("ERROR: Minimum of 2 PEs required.\n");
    exit (1);
  }

  for (n = 0; n < BUFSZ; n++)
    buf[n] = 0;

  mype = shmem_my_pe();
  shmem_set_cache_inv(); /* Reload cache when put received */

  switch (mype) {
    case 0: /* sender */
      otherpe = 1;
      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        buf[nwords-1] = 1; /* Use as flag to shmem_wait() on receiver */
        barrier();
        t1 = irtc ();
        shmem_put (buf, buf, nwords, otherpe);
        t2 = irtc ();
        printf ("SENDER: nwords=%ld ticks=%ld usecs=%f\n",
                nwords, (t2-t1), USECS(t2-t1));
      }
      break;
    case 1: /* receiver */
      for (nwords = 1; nwords <= BUFSZ; nwords*=10) {
        barrier();
        t1 = irtc ();
        shmem_wait( &buf[nwords-1], 0 ); /* wait for last element */
        t2 = irtc ();
        printf ("RECEIVER: nwords=%ld ticks=%ld usecs=%f; mbr=%f\n",
                nwords, t2-t1, USECS(t2-t1), MBR(t2-t1,nwords));
      }
      break;
    default:
      break;
  }
}
/*-------------------------------------------------------------*/
Some header files are in different locations on the T3E and T3D. It can be a little confusing, as this example (taken from the previous article) shows:
#include <mpp/shmem.h>
#ifdef T3D
#include <mpp/stdio.h>
#include <mpp/limits.h>
#include <mpp/time.h>
#endif
#ifdef T3E
#include <stdio.h>
#include <limits.h>
#include <time.h>
#endif
One might ask: why were stdio.h, limits.h and time.h moved out of the mpp directory, while shmem.h remained?
This question suggests, erroneously, that on the Y-MP, these .h files actually are in the same directory, mpp. It turns out that they're not: stdio.h, limits.h, time.h (and about 135 other .h files) are system include files and are found under the /usr/include/ tree, while shmem.h is part of the programming environment, and is under the /opt/ctl/ tree. For example:
-------------------- limits.h
On Y-MP:
denali$ find /usr/include -name limits.h -print
/usr/include/mpp/limits.h
/usr/include/limits.h
On T3E:
yukon$ find /usr/include -name limits.h -print
/usr/include/limits.h
The Y-MP has two "limits.h" files because it is a single host shared
by two architectures. Life is simple again on the T3E: the
subdirectory "mpp" is completely gone, and the contents of the
Y-MP's /usr/include/mpp/ directory have been moved back down into
/usr/include/ (with the necessary exception or two -- see mpi.h,
below).
-------------------- shmem.h
On Y-MP:
denali$ find /opt -name shmem.h -print
/opt/ctl/craylibs_m/2.0.0.0/include/mpp/mpp/shmem.h
/opt/ctl/craylibs_m/2.0.2.0/include/mpp/mpp/shmem.h
On T3E:
yukon$ find /opt -name shmem\*.h -print
/opt/ctl/craylibs/2.0.3.0/include/mpp/shmem.h
/opt/ctl/craylibs/2.0.3.3/include/mpp/shmem.h
The shared memory archive and headers are not considered "system"
files, but, rather, part of the programming environment ("product"
files). It looks like they've moved (again, an "mpp" subdirectory has
fallen out), but this is a transparent change: PE 2.0's environment
sets up the paths to these headers for us.
-------------------- mpi.h
On Y-MP:
denali$ find /usr/include -name mpi\*.h -print
/usr/include/mpp/mpi.h
/usr/include/mpp/mpif.h
On T3E:
yukon$ find /usr/include -name mpi\*.h -print
yukon$ find /opt/ctl -name mpi\*.h -print
/opt/ctl/mpt/1.1.0.0/include/mpi.h
/opt/ctl/mpt/1.1.0.0/include/mpif.h
On the Y-MP, mpi headers were installed in /usr/include/mpp/. On the
T3E, mpi is part of the "message passing toolkit," or MPT, which is
part of the programming environment. Thus, the mpi headers have
moved from /usr/include/mpp/ over to /opt/ctl/.
As with shmem.h, the module command sets up the environment so that
the programmer doesn't need to worry about the location of these
files. Remember to load the MPT though, with this command:
module load mpt
This presentation makes the situation seem more confusing than it really is. Except for the mpp system headers, the changes should be transparent, and the module command makes it trivial to switch from one version to another.
Our standard advice: 1) Don't use absolute paths to anything. 2) Get comfortable with the "module" command -- it is powerful and easy to use.
For now, the "ASCII graphics" dots-and-words display of system activity, which was available through mppview on the Y-MP, is unavailable on the T3E. For users with X displays, "xmppview" is a colorful 3-dimensional replacement that beckons us to get virtual reality goggles and voyage inside the rotating torus.
Graphics aside, however, the ASCII info you really need is still available on the T3E through good old mppview. "mppview -s queue" tells you what's running and waiting; "mppview -s config" gives PE configuration. For example:
yukon$ mppview -s queue
********** MPP Application Queue Stats **********
Applications Running: 3
UID PPID APID Run Time PEs Base Command
----- ----- ---------- -------- ----- ----- ----------
162 7838 0x0186dc13 06:12:57 64 0 a.out.64
1235 12886 0x0145f415 00:03:19 11 64 route
1235 12924 0x01459c13 00:00:10 2 75 shmem_wait
Applications Queued: 1
UID PPID APID Q Time PEs Command Reason
----- ----- ---------- -------- ----- ---------- ----------
1235 12909 0x0145bc13 00:02:07 10 iompp ApLimit
yukon$ mppview -s config
********** PE Configuration **********
Total PEs configured: 96
OS PEs configured: 3
LPE# Name MB MHz
----- ---------------- ---- ----
0x058 ospe_b 128 300
0x059 ospe_c 128 300
0x05f ospe_a 128 300
Command PEs configured: 9
LPE# MB MHz
----- ---- ----
0x054 128 300
0x055 128 300
0x056 128 300
0x057 128 300
0x05a 128 300
0x05b 128 300
0x05c 128 300
0x05d 128 300
0x05e 128 300
Application PEs configured: 84
Application PE types found: 1
MB MHz #PEs
----- ----- -----
1) 128 300 84
Application regions configured: 1
Min PEs Max PEs #In Use
------- ------- -------
1) 2 84 77
This will be a semi-regular feature, following the example of the "T3D/YMP Differences" list which appeared in earlier issues of the T3D Newsletter. As we write up specific differences between the T3D and T3E, we'll append them to this list for quick reference. For a general overview available immediately, please see Newsletter #116. Whenever you find something, send it in so we can learn from each other's experience. The current list:
A: {{ In Cray's Programming Environment 2.0, how can you tell what versions
of libraries, compilers, etc... will be used as the current default? }}
# ARSC users may run the "PEvers" script. Here is some sample output:
yukon$ PEvers
The following Programming Environment Packages are installed:
=============================================================
/opt/ctl/cf90
2.0.3.0
2.0.3.3
The current default version is //opt/ctl/cf90/2.0.3.3.
=============================================================
/opt/ctl/CC
2.0.3.0
2.0.3.3
The current default version is //opt/ctl/CC/2.0.3.3.
=============================================================
/opt/ctl/craytools
2.0.3.0
2.0.3.4
The current default version is //opt/ctl/craytools/2.0.3.4.
=============================================================
etc...
etc...
# The alternative is to take a peek at the PE 2.0 products manually:
ls -l /opt/ctl # Each directory listed is a product name.
ls -l /opt/ctl/$PROD # Where $PROD is a product name
# In this second listing, each subdirectory is an installed version while
# the link identifies the default version. For example:
yukon$ ls -l /opt/ctl
total 96
drwxr-xr-x 4 bin bin 4096 Apr 1 19:15 CC
drwxr-xr-x 3 bin bin 4096 Mar 12 00:41 CCmathlib
drwxr-xr-x 3 bin bin 4096 Mar 12 00:44 CCtoollib
drwxr-xr-x 6 bin bin 4096 Mar 12 00:13 admin
drwxr-xr-x 2 bin bin 4096 Apr 1 19:15 bin
drwxr-xr-x 3 bin bin 4096 Mar 12 00:57 cam
drwxr-xr-x 4 bin bin 4096 Apr 1 19:17 cf90
drwxr-xr-x 4 bin bin 4096 Apr 1 19:12 craylibs
drwxr-xr-x 4 bin bin 4096 Apr 1 19:22 craytools
drwxr-xr-x 3 bin bin 4096 Mar 12 01:23 cvt
drwxr-xr-x 2 bin bin 4096 Apr 1 19:23 doc
drwxr-xr-x 3 bin bin 4096 Mar 12 00:13 mpt
yukon$ ls -l /opt/ctl/cf90
total 24
drwxr-xr-x 6 bin bin 4096 Mar 12 00:29 2.0.3.0
drwxr-xr-x 6 bin bin 4096 Apr 1 19:17 2.0.3.3
lrwxrwxrwx 1 root bin 22 Apr 1 19:17 cf90 -> /opt/ctl/cf90/2.0.3.3
# This shows that two versions of cf90 are installed: 2.0.3.0 and
# 2.0.3.3, and that the default is 2.0.3.3.
Q: How can you capture a "man" page without getting all the formatting
characters? Say you wanted ASCII text for a newsletter.
[ Answers, questions, and tips graciously accepted. ]
Contact:
Thomas J. Baring, ARSC Web Specialist, ph: 907-450-8619
Donald Bahls, ARSC User Consultant, ph: 907-450-8674
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020