ARSC T3D Users' Newsletter 60, November 10, 1995

Debugging Floating-point Exception Errors on CRAY T3D Systems

The article below appeared in the Cray Research Service Bulletin last month and is a very useful description:


  > Debugging floating-point exception errors on CRAY T3D systems
  > 
  >    Systems:     CRAY T3D
  >    OS:          UNICOS MAX
  >    Product:     CF77
  >    Audience:    Programmers
  >    Date:        October 1995
  > 
  > On CRAY T3D systems, a floating-point exception (FPE) error occurs when 
  > the floating-point hardware functional unit detects an unusable operand 
  > for a floating-point operation or an overflow in the result of a floating-
  > point operation.
  > 
  > An unusable operand always has one of the following forms:
  > 
  >    - NaN, which is an IEEE floating-point number that is not valid 
  >      and that consists of the largest exponent and a nonzero mantissa 
  > 
  >    - Infinity, which is an IEEE floating-point number with an infinity 
  >      value that consists of the largest exponent and a zero mantissa 
  > 
  >    - Denormal, which is an IEEE floating-point number with a denormal 
  >      value that consists of a zero exponent and a nonzero mantissa 
  > 
  >    - A zero divisor 
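
The bit patterns listed above can be checked mechanically. The following sketch is not from the Cray article. It assumes the 64-bit IEEE layout (1 sign bit, 11 exponent bits, 52 mantissa bits) and uses a made-up function name, classify, to label a raw bit pattern. It is written in Fortran 90 style so that it runs on a current compiler, and it has not been tried with CF77.

```fortran
      program opform
      implicit none
      ! assumption: this kind selects a 64-bit integer
      integer, parameter :: i8 = selected_int_kind(18)
      integer(i8) :: expmax

      ! largest exponent: all 11 exponent bits set (bits 52..62)
      expmax = ishft(2047_i8, 52)

      ! largest exponent, zero mantissa
      print *, classify(expmax)
      ! largest exponent, nonzero mantissa
      print *, classify(ior(expmax, 1_i8))
      ! zero exponent, nonzero mantissa
      print *, classify(1_i8)
      ! an ordinary value (assumes 1.0d0 is stored in 64 bits)
      print *, classify(transfer(1.0d0, 1_i8))

      contains

      ! hypothetical helper: name the form of a 64-bit pattern
      function classify(bits) result(form)
      integer(i8), intent(in) :: bits
      character(len=8) :: form
      integer(i8) :: iexp, mant
      iexp = ibits(bits, 52, 11)
      mant = ibits(bits, 0, 52)
      if (iexp == 2047 .and. mant /= 0) then
         form = 'NaN'
      else if (iexp == 2047) then
         form = 'Infinity'
      else if (iexp == 0 .and. mant /= 0) then
         form = 'Denormal'
      else
         form = 'usable'
      end if
      end function classify

      end program opform
```

The four calls should report Infinity, NaN, Denormal, and usable, in that order. A zero divisor is a question of value and position in the expression rather than a special bit pattern, so a plain zero is reported here as usable.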
  > 
  > In an application context, FPE errors usually are caused by floating-point
  > calculations that involve a real variable that does not have a valid 
  > floating-point value (and thus is not a valid floating-point operation), 
  > algorithm problems (such as divide by 0), or overflows. So that you know 
  > what to examine first when debugging the problem, you should determine 
  > which of the preceding causes has triggered the FPE error. For example, 
  > for a floating-point operation that is not valid, after identifying the 
  > source line in which the problem occurred, you should first try to
  > look for uninitialized variables used in that source line, then look 
  > for possible corruption of initialized variables and/or array bounds 
  > problems, and so on.  Similarly, for an algorithm problem such as divide 
  > by 0, you should examine the variables in the divisor part of the 
  > expression first.
  > 
  > On CRAY T3D systems, FPE errors usually cause a system TRACEBK and a
  > register dump to standard error. The contents of the floating-point 
  > control register (FPCR), shown at the bottom of the register dump, can 
  > indicate the FPE problem type.
  > 
  > The FPCR is a register on the DECchip 21064 microprocessor that keeps 
  > track of the type of floating-point exceptions detected by the system. 
  > Certain bits of the FPCR are set when a certain type of exception is 
  > detected. The DECchip 21064 microprocessor offers several traps for 
  > different types of exceptions. Some of these traps are always enabled 
  > and cannot be disabled; they are called default traps. The default 
  > traps include the following:
  > 
  >    - Operation that is not valid 
  >    - Division by 0 
  >    - Overflow 
  > 
  > The remaining traps (underflow, inexact result, and integer overflow) 
  > are called software-controlled traps: you can enable and disable them 
  > by using software such as compiler-generated floating-point 
  > instructions or assembly routines. For more information about the 
  > floating-point exception traps, see section 4.7 of the Alpha 
  > Architecture Handbook - CRAY T3D Differences, publication TPD-0007.
  > 
  > This article discusses only the default traps. The CRAY T3D compilers 
  > and libraries usually do not enable the software-controlled traps, and 
  > programmers usually do not modify the assembly instructions from their 
  > application to enable them.
  > 
  > The following table shows the FPCR content for each of the problems 
  > that the system hardware would detect by default; this is followed by 
  > several examples. For debugging purposes, you can use this table as 
  > is to determine the type of FPE errors that the application is 
  > encountering.
  > 
  > +-----------------------------------------------------------------+
  > | FPCR contents (in hexadecimal) | FPE problem type               |
  > |--------------------------------+--------------------------------|
  > | 0x8910000000000000             | Operation is not valid         |
  > |--------------------------------+--------------------------------|
  > | 0x8920000000000000             | Division by 0                  |
  > |--------------------------------+--------------------------------|
  > | 0x8940000000000000             | Overflow                       |
  > +-----------------------------------------------------------------+
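
The table can be applied directly to the fpcr value printed at the bottom of a register dump. The following sketch is not part of the Cray article. The function name fpetype is made up for illustration, and the code only matches the hexadecimal string against the three values in the table; it does not interpret individual FPCR bits.

```fortran
      program fpcrtab
      implicit none

      ! the three fpcr values from the table
      print *, fpetype('0x8910000000000000')
      print *, fpetype('0x8920000000000000')
      print *, fpetype('0x8940000000000000')
      ! any other value falls through to the default case
      print *, fpetype('0x0000000000000000')

      contains

      ! hypothetical helper: look up an fpcr string in the table
      function fpetype(fpcr) result(ptype)
      character(len=*), intent(in) :: fpcr
      character(len=22) :: ptype
      select case (fpcr)
      case ('0x8910000000000000')
         ptype = 'Operation is not valid'
      case ('0x8920000000000000')
         ptype = 'Division by 0'
      case ('0x8940000000000000')
         ptype = 'Overflow'
      case default
         ptype = 'not in the table'
      end select
      end function fpetype

      end program fpcrtab
```

A value that is not in the table may involve one of the software-controlled traps, which the article does not cover.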
  > 
  > Example 1: Operation is not valid problem
  > 
  > In the example shown in Figure 4, variable a was not initialized properly 
  > before use in subroutine SUB1. Therefore, the code will fail with an FPE 
  > error inside of subroutine SUB1, with an FPCR content of 0x8910000000000000. 
  > 
  > +-------------------------------------------------------------------------+
  >                Figure 4.  Operation is not valid problem
  > +-------------------------------------------------------------------------+
  > => cat ex1.f
  >         program inv
  >         real a
  >         call sub1(a)
  >         print *, a
  >         end
  >         subroutine sub1(a)
  >         real a
  >         a = a + 1.
  >         return
  >         end
  > => cf77 -X1 ex1.f; a.out
  > Floating point exception
  > Beginning of Traceback (PE 0):
  >   Started from address 0x200000027c in routine 'SUB1'.
  >   Called from line 3 (address 0x2000000170) in routine 'INV'.
  >   Called from line 307 (address 0x2000003fa0) in routine '$START$'.
  > End of Traceback.
  > Agent printing core file information:
  > user exiting after receiving signal 8
  > Exit message came from virtual PE 0, logical PE 0x0
  > Register dump
  >   pa0: 0x0000000000000002     pa1: 0x0000000100000000  pa2: 0x0000000000000000
  >   pc: 0x0000002000032938       sp: 0x00000060ffffe7f0   fp: 0x00000060ffffe830
  >   .....
  >   f27:0x0000000000000000   f28:0x0000000000000000   f29:0x0000000000000000
  >   f30:0x0000000000000000                          fpcr:0x8910000000000000
  >                                                   =======================
  > 
  > Agent finished printing core file information.
  > mppexec: user UDB core limit reached, mppcore dump terminated
  > Floating exception
  > +-------------------------------------------------------------------------+
  > 
  > Example 2: Divide by 0 problem
  > 
  > In the example shown in Figure 5, variable a has a value of 0 and is used 
  > as a divisor in subroutine SUB1. Therefore, the code will fail with an FPE 
  > error inside of subroutine SUB1, with an FPCR content of 0x8920000000000000.
  > 
  > +-------------------------------------------------------------------------+
  >                       Figure 5.  Divide by 0 problem
  > +-------------------------------------------------------------------------+
  > => cat ex2.f
  >         program dze
  >         real a
  >         a = 0
  >         call sub1(a)
  >         end
  >         subroutine sub1(a)
  >         real c
  >         c = 100./a
  >         print *, c
  >         return
  >         end
  > => cf77 -X1 ex2.f; a.out
  > Floating point exception
  > Beginning of Traceback (PE 0):
  >   Started from address 0x20000001dc in routine 'SUB1'.
  >   Called from line 4 (address 0x2000000170) in routine 'DZE'.
  >   Called from line 307 (address 0x2000003f80) in routine '$START$'.
  > End of Traceback.
  > Agent printing core file information:
  > user exiting after receiving signal 8
  > Exit message came from virtual PE 0, logical PE 0x0
  > Register dump
  >   pa0: 0x0000000000000004     pa1: 0x0000000100000000  pa2: 0x0000000000000000
  >   pc: 0x0000002000032918       sp: 0x00000060ffffe7d0   fp: 0x00000060ffffe810
  >   .....
  >   f27:0x0000000000000000   f28:0x0000000000000000   f29:0x0000000000000000
  >   f30:0x0000000000000000                          fpcr:0x8920000000000000
  >                                                   =======================
  > Agent finished printing core file information.
  > mppexec: user UDB core limit reached, mppcore dump terminated
  > Floating exception
  > +-------------------------------------------------------------------------+
  > 
  > Example 3: Overflow problem
  > 
  > In the example shown in Figure 6, variable a has a value of 1.79769E+308, 
  > and the multiplication of a by itself in subroutine SUB1 will overflow 
  > its result. Therefore, the code will fail with an FPE error inside of 
  > subroutine SUB1, with an FPCR content of 0x8940000000000000.
  > 
  > +-------------------------------------------------------------------------+
  >                          Figure 6.  Overflow problem
  > +-------------------------------------------------------------------------+
  > => cat ex3.f
  >         program ovf
  >         real a
  >         a = 1.79769E+308
  >         call sub1(a)
  >         end
  >         subroutine sub1(a)
  >         real c
  >         c = a * a
  >         print *, c
  >         return
  >         end
  > => cf77 -X1 ex3.f; a.out
  > Floating point exception
  > Beginning of Traceback (PE 0):
  >   Started from address 0x20000001dc in routine 'SUB1'.
  >   Called from line 4 (address 0x2000000184) in routine 'OVF'.
  >   Called from line 307 (address 0x2000003f80) in routine '$START$'.
  > End of Traceback.
  > Agent printing core file information:
  > user exiting after receiving signal 8
  > Exit message came from virtual PE 0, logical PE 0x0
  > Register dump
  >   pa0: 0x0000000000000008     pa1: 0x0000000100000000  pa2: 0x0000000000000000
  >   pc: 0x0000002000032918       sp: 0x00000060ffffe7d0   fp: 0x00000060ffffe810
  >   .....
  >   f27:0x0000000000000000   f28:0x0000000000000000   f29:0x0000000000000000
  >   f30:0x0000000000000000                          fpcr:0x8940000000000000
  >                                                   =======================
  > Agent finished printing core file information.
  > mppexec: user UDB core limit reached, mppcore dump terminated
  > Floating exception
  > +-------------------------------------------------------------------------+
  > <end of article>
  > 

Parallel I/O on the T3D

The past two newsletters had some examples of I/O for the Y-MP and T3D reading files written on the other machine. In this article, I look at parallel I/O on the T3D. Using the my_pe() Fortran intrinsic, it is possible to get what syntactically looks like parallel I/O on the T3D. For unformatted I/O we might have something like:

Example 1


  character*7 filename
  character*2 ciun
  integer a( 100 )
  intrinsic my_pe
  iun = 10 + my_pe()
  write( ciun, "(i2)" ) iun
  filename = 'fort.'//ciun
  open( iun, file=filename, form = 'unformatted', iostat=ios )
  write( iun ) ( a( i ), i = 1, 100 )
  close( iun )
  end
For this example, we have each PE writing to its own file:

  PE0 writes to fort.10
  PE1 writes to fort.11 
  PE2 writes to fort.12 
      and so on ...
and each of the PEs could be writing at the same time.

Another popular method to get parallel I/O on the T3D is to use direct access files and the same my_pe intrinsic. This might look like:

Example 2


        parameter( LR = 1024  )
        real a( LR )
        intrinsic my_pe
        mype = my_pe()
  c for best results on the T3D, we want the recl to be a multiple of 4096
  c bytes; this implies a minimum record length of 512 64-bit words
        open( 10, access = 'direct', iostat=ios, recl = 8*LR )
        write( 10, rec = mype+1 ) ( a( j ), j = 1, LR )
        end
In this example, each PE writes to a different record in the same direct access file. In both examples, it is the programmer's responsibility to make sure that write operations from different PEs don't access the same file (example #1) or the same record (example #2) at the same time. Conflicting writes might not only produce wrong answers but could also make the job hang.
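
The record arithmetic behind example #2 can be checked without a T3D. The following sketch is only an illustration: it replaces my_pe() with a loop over a made-up PE count, npes, and it assumes that the records of a direct access file are stored back to back with no extra bytes between them, which is implementation dependent.

```fortran
      program reclchk
      implicit none
      ! lr matches example #2; npes is a made-up PE count
      integer, parameter :: lr = 1024
      integer, parameter :: npes = 4
      integer :: recl, mype, ioff

      ! record length in bytes for lr 64-bit words
      recl = 8*lr
      if (mod(recl, 4096) == 0) then
         print *, 'recl =', recl, ' is a multiple of 4096'
      else
         print *, 'recl =', recl, ' is NOT a multiple of 4096'
      end if

      ! PE mype writes record mype+1; under the back-to-back
      ! assumption that record starts at byte mype*recl
      do mype = 0, npes - 1
         ioff = mype*recl
         print *, 'PE', mype, ' rec', mype + 1, ' byte', ioff
      end do
      end program reclchk
```

With LR = 1024 the record length is 8192 bytes, twice the 4096-byte minimum, and each PE's record starts on a 4096-byte boundary.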

Although both examples seem to show that many I/O operations could be happening at the same time, there are several shared resources between the PE and the Y-MP disks that prevent true parallel operations. I can think of three such bottlenecks although there may be more:

  1. The I/O gateway on the T3D. For every 64 PEs there is one special processor that handles I/O requests between the PEs and the Y-MP.
  2. The mppexec running on the Y-MP. Each I/O request on a PE is implemented by a Y-MP job, and that job competes with all other Y-MP jobs for Y-MP CPUs.
  3. The disk on which the file resides. This physical device handles only one I/O transfer at a time and queues up multiple requests to be handled one at a time.
Using the methods from the above examples, we can produce tables of transfer rates for different transfer sizes and different numbers of PEs. Two such tables are shown below:

Table 1

speed of transfers of unformatted sequential files on the T3D and Y-MP

   Amount                 (Speed in MWs/second )
  written
  (64-bit
   words)   1 PE     2 PEs     4 PEs     8 PEs    16 PEs     32 PEs   Y-MP
     1024   6.41                                                     12.26 a
     2048   9.54     12.65                                           18.86 a
     4096  12.49     18.94     25.22                                 27.85 a
     8192  14.83     24.72     38.08     50.40 a                     38.35 a
    16384  16.29     29.46     49.63     75.65 a  100.05             44.53 a
    32768   1.82{-b-}32.45     58.86     99.19 a  151.28     202.19   2.27
    65536   2.54      2.99{-b-}64.86    117.41 a  199.01     301.27  18.57
   131072   2.47      4.10      0.28{-b}117.89 a  235.40     397.50  16.66
   262144   1.31      4.01      1.24      0.50{-b}259.45     471.63  16.42
   524288             2.34      4.77      2.04      0.41{-b-}518.88  16.99
  1048576                       2.74      4.64      1.24       0.64  17.21
  2097152                                 2.72      2.19       2.13  17.09
  4194304                                           3.57       1.24  17.03
  8388608                                                      0.81  15.85

Table 2

speed of transfers with direct access (LR=8192) files on the T3D and Y-MP

   Amount                 (Speed in MWs/second )
  written
  (64-bit
   words)    1 PE   2 PEs   4 PEs   8 PEs  16 PEs  32 PEs   64 PEs     Y-MP
       1024  0.03                                                     10.92 a
       2048  0.07    0.07                                             20.59 a
       4096  0.14    0.14    0.14                                     42.50 a
       8192  0.28    0.27    0.27    0.27                             80.98 a
      16384  0.27    0.54    0.54    0.54    0.54                     85.07 a
      32768  0.27    0.54    1.08    1.08    1.08    1.08             90.11 a
      65536  0.27    0.52    0.90    2.16    2.16    2.16     2.16    84.24 a
     131072  0.27    0.52    0.94    1.56    4.32    4.32     4.32    85.28 a
     262144  0.27    0.49    0.81    1.73    2.47    8.63     8.63    23.54
     524288  0.25    0.44    0.71    1.32    2.91    3.47    17.26    30.11
    1048576  0.25{d-}0.44{d-}0.65{d-}1.24{d-}1.94{d-}3.71{-d-}4.01    30.22
    2097152  0.25    0.41    0.65    1.14    1.33    2.09     3.65    33.00
    4194304          0.44    0.74    1.24    1.46    1.11     1.50    31.82
    8388608                  0.65    1.10    1.44    0.91     0.84    31.21
   16777216                          1.09    1.31    1.17{-c-}0.88    31.84
   33554432                                  1.14    1.11{-c-}0.75    34.60
   67108864                                          1.28{-c-}0.74    32.92
  134217728                                                   0.85    34.14
Conditions for the runs described in the tables above:
  1. All runs were made during normal batch hours and from SAR statistics this means more than 95% of the Y-MP cycles are in use on ARSC's 8 CPU Y-MP during these runs.
  2. For columns with multiple PEs, the I/O task is evenly distributed so that each PE writes the same amount; the first column of each table is the aggregate size of all writes.
  3. For columns with multiple PEs, Table 1 shows writes to different files and Table 2 shows writes to different records to the same direct access file.
Observations from the tables:
  1. The Y-MP is always faster, duh!
  2. When writing a small amount of I/O, it is done only to the library buffers and does not cause a physical write to disk or a request to a mppexec job running on the Y-MP. These "I/O" operations are actually memory to memory copies and so have very high rates as marked in the above tables with the letter a. Once the amount written exceeds the size of these buffers then the I/O rate falls.
  3. Using more PEs increases the aggregate library buffer space and therefore reduces physical disk writes and mppexec actions when writing the same amount of data, as shown, for example, in the pairs of numbers marked with the letter b.
  4. Increasing PEs when the I/O is larger than the library buffers causes contention for the shared resources and decreases I/O rates when the same amount of data is transferred, as shown, for example, in the pairs of numbers marked with the letter c.
  5. Sometimes, when writing a fixed amount, increasing the number of PEs can produce a monotonic increase in the I/O rate, as shown in the pairs marked with the letter d.
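
Observation 3 can be made concrete with a little arithmetic. In Table 1 the rate stays high as long as each PE writes 16384 words or fewer and drops once each PE writes 32768 words. The following sketch uses that 16384-word figure as a threshold; it is read off the table, not a documented library buffer size, and the names maxbuf and share are made up for illustration.

```fortran
      program bufshare
      implicit none
      ! largest per-PE write in Table 1 that kept the high rate;
      ! an observation from the table, not a documented size
      integer, parameter :: maxbuf = 16384
      integer :: total, npes, share

      ! aggregate amount written, as in the first table column
      total = 262144
      npes = 1
      do while (npes <= 32)
         ! each PE writes an equal share of the total
         share = total/npes
         if (share <= maxbuf) then
            print *, npes, ' PEs:', share, ' words each, buffered'
         else
            print *, npes, ' PEs:', share, ' words each, to disk'
         end if
         npes = 2*npes
      end do
      end program bufshare
```

For the 262144-word row this reports writes going to disk for 1 through 8 PEs and staying in the buffers for 16 and 32 PEs, which matches where the rates in Table 1 jump.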
I/O is a jungle of tricks and experience. If you have any tricks or experience that you are willing to share, please send them to me and I'll generate the example and write up the description.

ARSC is Now at MAX 1.2.0.5

During the downtime on Tuesday 10/31/95, ARSC upgraded to UNICOS 8.0.4.1 and MAX 1.2.0.5. If any gremlins got into your T3D code on that Halloween night, please contact Mike Ess. With this version of MAX, we have started using mppview instead of mppmon; mppmon will be removed at the end of the month.
Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.