ARSC HPC Users' Newsletter 396, October 17, 2008

Pingo Has Arrived

On Monday, October 13th, ARSC's new five-cabinet Cray XT5, Pingo, was delivered to the Butrovich Building on the UAF campus. During the last week, Cray system engineers and ARSC system administrators have been busy configuring the system.

Several capability (CAP) users have been granted exclusive access to Pingo in order to test the scaling of their codes on high processor counts. The system should be available to these capability users by the end of October. Following the CAP period, Pingo will be made available for general production by projects allocated at ARSC.

We anticipate that Pingo will be available for production during the first quarter of 2009.

Stay tuned to the ARSC HPC Users' Newsletter for articles on our experiences with the Cray XT5.

For additional details on Pingo, see this announcement:

    http://www.arsc.edu/news/pingo.html

The C99 restrict Type Qualifier

[By: Don Bahls]

The C99 standard adds the restrict type qualifier for pointers. Qualifying a pointer with restrict designates that the pointer has exclusive access to the memory it references. Using restrict in a function definition lets the compiler make assumptions about a pointer that it would otherwise be unable to make. Consider a simple function which adds two equal-length vectors and saves the result into a third vector:


  void vadd(double * a, double * b, double * c, int v)
  {
      int ii;
      for(ii=0; ii<v; ++ii)
      {
          c[ii]=a[ii]+b[ii];
      }
  }

At first glance it may appear that the compiler could use a vector instruction for the add operation inside of the loop. However, since C allows pointers to point to the same memory region, the arrays behind a, b, and c could overlap, and a store to c[ii] could change a value that a later iteration reads. The compiler therefore has to be conservative with the optimizations it uses.
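To make the overlap concern concrete, here is a minimal sketch. The helper aliased_last() and the DEMO_MAX constant are hypothetical names introduced only for this illustration; vadd() is the loop from the article. The helper passes an output pointer that overlaps the first input, shifted by one element, so each store feeds the next iteration's load.

```c
#include <assert.h>

#define DEMO_MAX 16   /* hypothetical upper bound on the demo vector length */

/* The loop from the article: c[ii] = a[ii] + b[ii], one element at a time. */
void vadd(double *a, double *b, double *c, int v)
{
    int ii;
    for (ii = 0; ii < v; ++ii)
    {
        c[ii] = a[ii] + b[ii];
    }
}

/* Hypothetical demo: call vadd() with c = a + 1, so c overlaps a.
 * Returns the last element written. */
double aliased_last(int n)
{
    double buf[DEMO_MAX + 1];
    double ones[DEMO_MAX];
    int ii;

    assert(n >= 1 && n <= DEMO_MAX);
    for (ii = 0; ii <= n; ++ii) buf[ii] = 1.0;
    for (ii = 0; ii < n; ++ii) ones[ii] = 1.0;

    /* Each store to buf[ii + 1] is read back as a[ii + 1] on the next pass. */
    vadd(buf, ones, buf + 1, n);
    return buf[n];
}
```

Executed in order, each iteration sees the previous iteration's store, so aliased_last(4) returns 5.0. If all four loads were done before any store, as a vector add would do, the last element would be 2.0 instead. The compiler cannot rule out a call like this one, which is why it reports a data dependency.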

When compiled with "-Minfo" and "-Mneginfo" the PGI compiler will indicate which optimizations it uses or doesn't use. For the existing code it uses loop unrolling to improve the performance of the loop:


  ognip % pgcc -c99 vector_add.c -Minfo -Mneginfo -fast -c
    vadd:
         7, Loop not vectorized: data dependency
            Loop unrolled 4 times

While it unrolls the loop, it doesn't attempt to use a vector add operation.

With the restrict qualifier we can tell the compiler that a, b, and c won't overlap. Here is the same function definition with restrict added:


  void vadd(double * restrict a, double * restrict b, double * restrict c, int v)

Recompiling, we now see something like this:


  ognip % pgcc -c99 vector_add.c -Minfo -Mneginfo -fast -o vector_add
  vadd:
       7, Generated 4 alternate loops for the inner loop
          Generated vector sse code for inner loop
          Generated 2 prefetch instructions for this loop
          Generated vector sse code for inner loop
          Generated 2 prefetch instructions for this loop
          Generated vector sse code for inner loop
          Generated 2 prefetch instructions for this loop
          Generated vector sse code for inner loop
          Generated 2 prefetch instructions for this loop
          Generated vector sse code for inner loop
          Generated 2 prefetch instructions for this loop

The compiler is now prefetching data and vectorizing the loop, both of which can help performance. The compiler generally cannot tell whether callers are using the new version of "vadd" safely, so it's up to the programmer to ensure that the arrays passed to the function really don't overlap.
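One way a programmer might guard the restrict version during development is with an explicit overlap check. This is a sketch only: ranges_overlap() and vadd_checked() are hypothetical helpers, not part of the article's code, and comparing addresses as integers assumes a flat address space, which is an assumption about the platform rather than something the C standard guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n-element double arrays starting at p and q share
 * any bytes, 0 otherwise.  Assumes a flat address space. */
int ranges_overlap(const double *p, const double *q, size_t n)
{
    uintptr_t pb = (uintptr_t)p;
    uintptr_t qb = (uintptr_t)q;
    uintptr_t len = (uintptr_t)(n * sizeof(double));

    return pb < qb + len && qb < pb + len;
}

/* The restrict version of the routine from the article. */
void vadd(double * restrict a, double * restrict b, double * restrict c, int v)
{
    int ii;
    for (ii = 0; ii < v; ++ii)
    {
        c[ii] = a[ii] + b[ii];
    }
}

/* Hypothetical wrapper: asserts that the output does not overlap
 * either input before calling vadd().  Compiles away with -DNDEBUG. */
void vadd_checked(double *a, double *b, double *c, int v)
{
    assert(v >= 0);
    assert(!ranges_overlap(a, c, (size_t)v));
    assert(!ranges_overlap(b, c, (size_t)v));
    vadd(a, b, c, v);
}
```

The wrapper only checks the output against each input. As far as I read the C99 rules, two restrict pointers may refer to the same data as long as that data is only read inside the function, so a and b overlapping each other is not a problem here.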

Because the Fortran standard doesn't allow for aliasing of input parameters, the same routine implemented in Fortran does prefetching and vectorization out of the box:


  ognip % pgf90 vector_add.f90 -Minfo -Mneginfo -fast -c
    add_them:
         5, Generated 4 alternate loops for the inner loop
            Generated vector sse code for inner loop
            ...
            Generated 2 prefetch instructions for this loop

A micro-benchmark that called each of the two C versions of the vadd routine repeatedly showed that the restrict version was approximately 3 to 4 times faster than the non-restrict version. The performance of the C restrict micro-benchmark was on par with the same code written in Fortran 90.
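The benchmark source is not reproduced here. The following is only a sketch of how such a timing harness might look, not the code used for the numbers above; the names vadd_plain, vadd_restrict, and time_vadd, and the choice of clock() for timing, are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Both versions fit this pointer type: a top-level restrict qualifier
 * on a parameter does not change the function's type. */
typedef void (*vadd_fn)(double *, double *, double *, int);

void vadd_plain(double *a, double *b, double *c, int v)
{
    int ii;
    for (ii = 0; ii < v; ++ii)
        c[ii] = a[ii] + b[ii];
}

void vadd_restrict(double * restrict a, double * restrict b,
                   double * restrict c, int v)
{
    int ii;
    for (ii = 0; ii < v; ++ii)
        c[ii] = a[ii] + b[ii];
}

/* Returns the CPU seconds spent in reps calls of f on n-element
 * vectors, or -1.0 if memory allocation fails. */
double time_vadd(vadd_fn f, int n, int reps)
{
    double *a, *b, *c, secs;
    clock_t start, stop;
    int ii;

    assert(n > 0 && reps > 0);
    a = malloc((size_t)n * sizeof *a);
    b = malloc((size_t)n * sizeof *b);
    c = malloc((size_t)n * sizeof *c);
    if (!a || !b || !c)
    {
        free(a); free(b); free(c);
        return -1.0;
    }
    for (ii = 0; ii < n; ++ii)
    {
        a[ii] = ii;
        b[ii] = 2.0 * ii;
    }

    start = clock();
    for (ii = 0; ii < reps; ++ii)
        f(a, b, c, n);
    stop = clock();

    /* Spot-check one element so a wrong result does not go unnoticed. */
    assert(c[n - 1] == 3.0 * (n - 1));

    secs = (double)(stop - start) / CLOCKS_PER_SEC;
    free(a); free(b); free(c);
    return secs;
}
```

Comparing time_vadd(vadd_plain, n, reps) with time_vadd(vadd_restrict, n, reps), both built with the same optimization flags, gives a rough ratio. The result will depend on the compiler, flags, vector length, and hardware, so no particular speedup should be expected from this sketch.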

NOTE: You may need to add a compiler flag to make use of C99 features such as the restrict qualifier (for example, "-c99" with pgcc, as in the commands above). None of the C++ compilers I tried (pathCC 3.2, g++ 4.2.0, and pgCC 7.2-3) support the restrict keyword at this time.

Quick-Tip Q & A


A:[[ I am running on a multicore Opteron processor.  I need to know 
  [[ which one of the processor cores is running my program.  How can 
  [[ I print from my program which core is executing it?
  [[ 

#
# Thanks to Rich Griswold for this great lead:
#

The upper 8 bits of the CPUID 1 EBX register hold your APIC ID, which
is your processor ID.  AMD and Intel use similar algorithms to break
the APIC ID down into physical package ID, core ID, and thread ID.
There is a good explanation of this process for Intel CPUs at:

http://software.intel.com/en-us/articles/optimal-performance-on-multithreaded-software-with-intel-tools

Breaking down the APIC ID on AMD is easier since you only have to deal
with cores and not with threads.  First, check bit 28 of the CPUID 1
EDX register to see if you have a multi-core system.  If so, get the
number of cores per package from bits 16 to 23 of the CPUID 1 EBX
register, and compute the number of bits, n, required to hold this
value (1 bit for 2 cores, 2 bits for 4 cores, etc).  The lower n bits
of the APIC ID are the core ID, and the upper 8-n bits of the APIC ID
are the physical package ID.
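The breakdown described above can be sketched in C as follows. The function and struct names (core_id_bits, split_apic_id, apic_parts) are hypothetical. The sketch takes the CPUID 1 EBX and EDX values as plain arguments, so it needs no inline assembly, and it follows the bit positions given in the text rather than any vendor manual.

```c
#include <assert.h>

/* Number of bits needed to hold the core ID for the given number of
 * cores per package: 0 bits for 1 core, 1 bit for 2, 2 bits for 3-4. */
unsigned int core_id_bits(unsigned int cores)
{
    unsigned int n = 0;

    assert(cores >= 1);
    while ((1u << n) < cores)
        ++n;
    return n;
}

struct apic_parts
{
    unsigned int package_id;   /* upper 8-n bits of the APIC ID */
    unsigned int core_id;      /* lower n bits of the APIC ID   */
};

/* Splits the APIC ID found in CPUID 1 EBX into package and core IDs,
 * using EDX bit 28 and EBX bits 16 to 23 as described in the text. */
struct apic_parts split_apic_id(unsigned int ebx, unsigned int edx)
{
    struct apic_parts p;
    unsigned int apic_id = (ebx >> 24) & 0xFFu;   /* upper 8 bits of EBX */
    unsigned int cores = 1;
    unsigned int n;

    if (edx & (1u << 28))                /* bit 28 set: multi-core   */
        cores = (ebx >> 16) & 0xFFu;     /* cores per package        */
    if (cores == 0)                      /* defensive: treat 0 as 1  */
        cores = 1;

    n = core_id_bits(cores);
    p.core_id = apic_id & ((1u << n) - 1u);
    p.package_id = apic_id >> n;
    return p;
}
```

For example, an APIC ID of 5 (binary 101) on a package reporting 4 cores uses n = 2, giving core ID 1 and package ID 1.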


#
# The editors coded up an example that uses CPUID:
#

First, a function that returns the core number of the running process:

//
// Adapted from "CPU Counting Utility for Linux, Version 1.0", Intel Corp.
// See:
//   http://software.intel.com/en-us/articles/optimal-performance-on-multithreaded-software-with-intel-tools
// and:
//   http://software.intel.com/en-us/articles/methods-to-utilize-intels-hyper-threading-technology-with-linux
//
// This macro returns the CPUID data.  Note: Requires an asm() call.
//
#define cpuid( in, a, b, c, d ) \
   asm ( "cpuid" :             \
     "=a" (a), "=b" (b), "=c" (c), "=d" (d) : "a" (in));

#define INITIAL_APIC_ID_BITS  0xFF000000  

unsigned int get_coreid(void)
{
   unsigned int EAX,EBX,ECX,EDX;
   unsigned int coreid;

   // CPUID(1) will return the 8-bit initial APIC ID for the processor 
   // this code is running on in bits 31:24 of register EBX.
   cpuid( 1, EAX, EBX, ECX, EDX );    
   coreid = ( EBX & INITIAL_APIC_ID_BITS ) >> 24;

   return coreid;
}


If we compile this into an object file, we can then link it in any
time we need to get the core number:

mg57> pathcc -g -c coreid.c

For example, this program gets the core number every couple of
seconds, printing the new number whenever it changes:

#include <stdio.h>
#include <unistd.h>   // for sleep()

int main(void)
{
   unsigned int get_coreid(void);
   unsigned int coreid, lastcore;
   lastcore=999;
   coreid=0;

   while ( 1 )
   {
       coreid = get_coreid();
       if ( lastcore != coreid )
       {
           printf("Core # %4u\n", coreid);   // coreid is unsigned
       }
       lastcore=coreid;
       sleep(2);
   }
}


Compile the driver, linking in the coreid function:

mg57> pathcc -o test-coreid driver.c coreid.o

Now execute the driver in one window...

mg57> ./test-coreid

... then in another window, use taskset to move the process to
different cores.  Taskset places a process on a particular core.  If
we're doing things right, our program should report the same core
number that we give to taskset.
(The output of "ps" has been doctored to make it fit on a line.)

mg56 % ps -elf | grep test-coreid
0 S bahls    25980 25164 ... 116127 schedu 08:50 pts/8 ... ./test-coreid
0 S bahls    25991 25001 ...   705 pipe_w 08:50 pts/2  ... grep test-coreid
mg56 % taskset -p -c 1 25980
pid 25980's current affinity list: 0
pid 25980's new affinity list: 1

mg56 % taskset -p -c 2 25980
pid 25980's current affinity list: 1
pid 25980's new affinity list: 2

mg56 % taskset -p -c 3 25980
pid 25980's current affinity list: 2
pid 25980's new affinity list: 3

Sure enough, back in the other window, we can see our test-coreid
process moving to different cores:

mg57> ./test-coreid
Core #    0
Core #    1
Core #    2
Core #    3


Q: I've seen shell gurus moving around the command line with amazing
   ease and grace.  What's your favorite shell short-cut?  What shell
   do you use and do you use vi or emacs bindings?

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.