
ARSC HPC Users' Newsletter 340, May 12, 2006


Book Raffle

[ Thanks to Lee Higbie for an update on the book raffle ]

The response to the Raffle (Issue 339) has been so great that some entrants will not receive a copy of Wicked Cool Java. With the current number of entrants, the probability of winning is still high. Submit your raffle entry to higbie (at) arsc.edu by the deadline on May 31st.

Also, given the number of answers to the problems (see end of review in Issue 339), solutions to either of them appear certain to win a book. Prove me wrong. Submit a solution. If there are no solutions, I'll give both books to raffle entrants.

For more information on the contest see:
    http://www.arsc.edu/support/news/HPCnews/HPCnews339.shtml#article3

 

Optimizing XD1 Codes with SSE Hardware Instructions

The AMD Opteron processor provides the computational power for ARSC's Cray XD1, nelchina. Like Intel-compatible processors back to the Pentium II, the Opteron supports single instruction, multiple data (SIMD) operations (a.k.a. vector instructions). These days the SIMD support is known as Streaming SIMD Extensions (SSE), which Intel first introduced with the Pentium III and later improved with the Pentium 4. Newer Opteron processors support SSE, SSE2, and SSE3. SSE2 offers significant improvements over the older SIMD support (i.e. MMX, 3DNow! and SSE) by making the 128-bit-wide XMM registers (eight of them in 32-bit mode) multi-purpose: they can hold packed integers as well as single- and double-precision floating point values. SSE instructions also have some advantages over the traditional x87 floating point unit, such as directly addressable registers in place of the x87 register stack.

I was interested in how much of a difference SSE instructions would make to the performance of a code, so I borrowed Guy Robinson's GFLOP contest-winning code from issue 213 (see link below). Back in 2001, this code attained 1037.77 MFLOPS on ARSC's old Cray SV1 vector machine. The original goal of the GFLOP contest was to get as close as possible to the peak processor performance on the SV1's vector processors. Because the GFLOP code was designed to run optimally on the SV1, it was unlikely to get as close to the peak theoretical performance on the Opteron as it did on the SV1.

The documentation for the Portland Group compilers, which are available on the XD1, recommends the following flags to get started quickly with optimizations:

pgf90 -fastsse -Mipa=fast gflop.f

Since GFLOP doesn't have any subroutines, I dropped the "-Mipa=fast" flag from the compiler flags. Here are some performance results for various optimization flags.

Optimization Flags    Real Time (s)    MFLOPS     % of Peak Theoretical
-O0                   119.95            774.90    16.14 %
-O1                   111.23            829.48    17.28 %
-O2                    89.18            971.71    20.24 %
-O3                    89.20            971.08    20.23 %
-fastsse               85.91           1006.56    20.97 %
-fast                  84.90           1021.36    21.28 %

NOTES:

  1. MFLOPS numbers are from the PAPI library (see reference "C").
  2. Theoretical Peak is 4.8 GFLOPS for 2.4 GHz AMD Opteron Processors.
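The percentages in the table follow directly from note 2: each MFLOPS figure is divided by the 4800 MFLOPS (4.8 GFLOPS) peak. A minimal sketch of that arithmetic follows; the function name pct_of_peak is made up for illustration and is not part of the original article.

```shell
# Hypothetical helper: percent of theoretical peak, given the measured
# MFLOPS and the peak MFLOPS.
pct_of_peak() {
    awk -v mflops="$1" -v peak="$2" 'BEGIN { printf "%.2f", 100 * mflops / peak }'
}

# Peak from note 2: 4.8 GFLOPS = 4800 MFLOPS.
for m in 774.90 829.48 971.71 971.08 1006.56 1021.36; do
    echo "$m MFLOPS = $(pct_of_peak "$m" 4800) % of peak"
done
```

Dividing by 4800 assumes the peak applies to a single processor, as the note states.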

So, as it turns out for the GFLOP code, the "-fast" optimizations slightly outperform "-fastsse". This made me curious whether the "-fast" optimizations were using any SSE instructions at all. The "-Mkeepasm" compiler option tells pgf90 to keep the intermediate assembly that it creates. When I recompiled with this option added, I found that the "-fast" optimizations do indeed use SSE instructions.

pgf90 -fast -Mkeepasm gflop.f -o gflop.fast

The names of the SSE registers all begin with "xmm", so a grep of the assembly code will show every instruction that references an SSE register.

nelchina 1% grep xmm gflop.s 
        movss   .C1_287(%rip),%xmm1
        movaps  %xmm1,%xmm0
        movss   %xmm0,-484(%rcx)
        cvtsi2ss        %r14d,%xmm2
...
...

At this point I was curious which instructions were being used. Since SSE registers can be used for either integer or floating point operations, the presence of an "xmm" register reference doesn't necessarily mean that the instruction is a floating point instruction.

nelchina 2% grep xmm gflop.s | while read i j ; do echo $i; done | sort -u
addss
cvtsi2ss
divss
movaps
movss
mulss
subss
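The same list can be produced with awk, which also makes it easy to count how often each mnemonic occurs. The sketch below is only an illustration: it runs on the four sample lines shown earlier rather than on the full gflop.s, and the function name count_xmm_mnemonics is made up.

```shell
# Sample input: the four lines of gflop.s shown above (the real file is longer).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
        movss   .C1_287(%rip),%xmm1
        movaps  %xmm1,%xmm0
        movss   %xmm0,-484(%rcx)
        cvtsi2ss        %r14d,%xmm2
EOF

# Print the first field (the mnemonic) of every line that mentions an xmm
# register, then count how often each mnemonic occurs.
count_xmm_mnemonics() {
    awk '/xmm/ { print $1 }' "$1" | sort | uniq -c
}

count_xmm_mnemonics "$sample"
```

On the real file, replacing "$sample" with gflop.s should give the counts for the whole program.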

A web search shows that these are definitely floating point SSE instructions (see reference "D"). Here's what that documentation says about "ADDSS":

    ADDSS		Add Single Scalar
    
    Opcode		Cycles	Instruction
    F3 0F 58	1 (3)	ADDSS xmm reg,xmm reg/mem32
    
    ADDSS op1, op2
    
    op1 contains 4 single precision 32-bit floating point values
    op2 contains 1 single precision 32-bit floating point value
    
    	op1[0] = op1[0] + op2
    	op1[1] = op1[1]
    	op1[2] = op1[2]
    	op1[3] = op1[3]
    

As it turns out, the Portland Group compilers will use SSE instructions even at "-O0" for Opteron processors using 64-bit addressing, so the performance difference between optimization levels is not based strictly on the use of SSE instructions.

See "pgf90 -fast -help" and "pgf90 -fastsse -help" to see which options "-fast" and "-fastsse" imply.

In a future article, we will discuss the PAPI library, which we used to get the performance numbers for this article, so stay tuned.

References

  A. GFLOP code.
    http://www.arsc.edu/support/news/HPCnews/HPCnews213.shtml#article2
  B. AMD Software Optimization Guide for AMD64 Processors; Publication #25112; Revision 3.06; Issue Date: September 2005
    http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
  C. PAPI Website
    http://icl.cs.utk.edu/papi/
  D. SSE Documentation
    http://www.intel80386.com/simd/mmx2-doc.html
  E. Wikipedia: SSE
    http://en.wikipedia.org/wiki/Sse
  F. Wikipedia: SSE2
    http://en.wikipedia.org/wiki/SSE2

 

IBM: OBJECT_MODE Environment Variable

Iceberg and iceflyer both support 32-bit and 64-bit addressing. The default addressing scheme is 32-bit; however, the environment variable OBJECT_MODE allows one to override the default.

The OBJECT_MODE environment variable is understood by a number of IBM commands, including ar, dump, and nm.

The OBJECT_MODE environment variable can be particularly useful when you are trying to build a 64-bit application and libraries. Rather than altering the Makefile, you can simply set the OBJECT_MODE variable and build.

e.g.
export OBJECT_MODE=64
./configure
make  

You may need to alter the library and include path for the 64 bit version.

The values of OBJECT_MODE which are valid for all commands are 32 and 64. A third value 32_64 is only valid for the ar, dump, and nm commands.
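Because OBJECT_MODE is an ordinary environment variable, it can also be set for a single command rather than for the whole session. The following is a minimal sketch of the difference; the show_mode helper is a made-up stand-in for a build tool and only echoes the value it inherits.

```shell
# Stand-in for a build tool: report the OBJECT_MODE value it would see.
# (On AIX the real tools, e.g. ar or nm, read this variable themselves.)
show_mode() {
    sh -c 'echo "OBJECT_MODE=${OBJECT_MODE:-unset}"'
}

unset OBJECT_MODE
show_mode                       # variable not set: the 32-bit default applies

# Set it for one command only, using env so the change does not persist.
env OBJECT_MODE=64 sh -c 'echo "OBJECT_MODE=${OBJECT_MODE:-unset}"'
show_mode                       # still unset afterwards

# Or export it for the rest of the session, as in the example above.
export OBJECT_MODE=64
show_mode
```

The one-command form is handy when a single 64-bit step, such as an nm on a 64-bit object, is needed in an otherwise 32-bit session.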

 

Quick-Tip Q & A


A:[[ How can I run a remote command on a system and use the remote 
  [[ values for $HOME, $WRKDIR, etc. instead of the local values?
  [[
  [[ For example, I'd like this command to work as I obviously intend 
  [[ it to work (even if WRKDIR is defined differently on the local 
  [[ and remote systems):
  [[
  [[   scp -r remotesys:$WRKDIR/mymegamodel/answers .

  #
  # Thanks to Jed Brown:
  #
  You just have to escape shell expansion.  Thus, this

    % scp -r 'remotesys:$WRKDIR/mymegamodel/answers' .

  or even this,

    % scp -r remotesys:\$WRKDIR/mymegamodel/answers .

  works just fine.


  #
  # From Scott Kajihara:
  #
  This is a common question. The command should read

     scp -r remotesys:'$WRKDIR/mymegamodel/answers' .

  Note: it is important that this be single quotes so that the
  environment variable is not expanded. Quoting also prevents wildcard
  expansions on the local machine.


  #
  # And thanks for an in-depth explanation, from: ./Greg-Newby --verbose
  #    :-)
  #
  Success will depend on where such variables are defined, and have some
  shell sensitivity.  Quoting from "man tcsh"

      Non-login shells read only /etc/csh.cshrc and ~/.tcshrc or
      ~/.cshrc on startup.

  Other shells (ksh, bash...) are similar.

  When you do an scp, it really runs an ssh shell, but the shell is not
  a login shell.  To test whether a variable is defined for non-login
  shells, try the "echo" command:

      ssh remotesys echo '$HOME'
      ssh remotesys echo '$WRKDIR'

  Use single quotes, not double quotes.  Double quotes will be evaluated
  by your local shell:

      WRONG:  ssh remotesys echo "$HOME"
      yields: $HOME on your local system

      RIGHT:  ssh remotesys echo '$HOME'
      yields: $HOME on the remote system


  In this example, it's fine to place the quotes in different places, as
  long as the variable itself is quoted.  As for any variable use in a
  shell, you can use curly braces to separate the variable name in case
  it is ambiguous.

  Examples:

      scp -r remotesys:'$WRKDIR'/mymegamodel/answers .
      scp -r remotesys:'$WRKDIR/mymegamodel/answers' .
      scp -r remotesys:'${WRKDIR}'/mymegamodel/answers .
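The difference between local and remote expansion described in these answers can be tried without a remote host. The sketch below is an illustration under stated assumptions: fake_remote is a made-up local stand-in for "ssh remotesys", and both WRKDIR paths are invented.

```shell
# Local stand-in for "ssh remotesys ...": like ssh, it joins its arguments
# with spaces and hands the result to a second shell, here one whose
# WRKDIR differs from the local value.
WRKDIR=/local/wrk
fake_remote() {
    env WRKDIR=/remote/wrk sh -c "$*"
}

fake_remote echo "$WRKDIR"    # double quotes: expanded locally before the call
fake_remote echo '$WRKDIR'    # single quotes: expanded by the second shell
fake_remote echo \$WRKDIR     # backslash escape: same effect as single quotes
```

Only the first call reports the local value; the other two pass the literal text $WRKDIR through, so the second shell substitutes its own value.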

  One way I often use remote variable expansion is for lazy path
  globbing.  Let's say I want to get:
      remotesys:'somelongdirectoryname/someuniquefilename' 
  and don't mind making the remote system work harder to match
  filenames... or, perhaps I want all files from a particular remote
  directory.  The * wildcard works great (you could also use the ?
  wildcard if you want to match single characters):

      scp -r remotesys:some'*'/someunique'*' .
  or, scp -r remotesys:allmatching.'*' .
  or, scp -r remotesys:mysubdir/'*' .
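The effect of quoting a wildcard can be checked locally in the same spirit. In this sketch two scratch directories play the roles of the local and remote systems; fake_remote and all file names are made up for illustration.

```shell
# Two scratch directories stand in for the local and remote systems.
local_dir=$(mktemp -d)
remote_dir=$(mktemp -d)
trap 'rm -rf "$local_dir" "$remote_dir"' EXIT
touch "$local_dir/local.txt" "$remote_dir/answers.1" "$remote_dir/answers.2"

# Stand-in for "ssh remotesys ...": run the joined arguments in a second
# shell whose working directory is the "remote" one.
fake_remote() {
    ( cd "$remote_dir" && sh -c "$*" )
}

cd "$local_dir"
fake_remote echo *        # local shell expands * first: sees local.txt
fake_remote echo '*'      # quoted: the second shell expands *, sees answers.*
```

With the wildcard quoted, the match is done against the files on the far side, which is what the scp examples above rely on.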

Q: What development environments do you use to write
   C/C++/FORTRAN/<other> code, and how do you manage your (possibly
   many) source code files?  What editors and other tools do you 
   use?

[[ Answers, Questions, and Tips Graciously Accepted ]]

 

 


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Craig Stephenson ARSC User Consultant ph: 907-450-8653
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Contact:
Send comments and questions to the current editors using this Contact Form.

 


 

Arctic Region Supercomputing Center
PO Box 756020, Fairbanks, AK 99775 | voice: 907-474-6935