ARSC HPC Users' Newsletter 272, July 11, 2003

Speeding up Your FTP Transfers: Part I of II

[ Many thanks to Nathan Bills, ARSC Network Specialist, for contributing this two-part series. ]

Transferring very large files is a fact of life in high-performance computing. As computers get bigger and faster, researchers need ever larger files to hold raw data, raw output, images, animations, etc.

Helping you transfer files faster is the goal of this article.

In theory, a workstation with a 10/100Base-T ethernet adapter should be able to transfer data at 100 Mbits/sec, or about 12.5 Mbytes/sec. But if it's transferring data a thousand miles or more across a wide-area network, sometimes the best it can get is 100-200 Kbytes/sec.

A major contributor to this poor performance is the set of system default settings that applications use for TCP bulk data transfer.

During bulk data transfer, the receiver periodically sends acknowledgements back to the sender. Instead of transmitting just one packet, or segment of data, and waiting for its acknowledgement before sending the next one, the sender can transmit a block, or 'window', composed of multiple segments.

Unix implementations vary in what they set for default window sizes but most of the variants I've seen, like IRIX, Solaris, AIX, and Unicos, have default window sizes of 32 Kbytes or 64 Kbytes. (On the systems themselves, the 'window' sizes are referred to as 'socket buffer' sizes.)

Over long distances the delay in the time it takes a packet to travel from source to destination and the reply to come back, or the round-trip time, can be 50-200 milliseconds. With such a long delay, and default buffer sizes so small, there's a lot of 'dead' time where no data is being transmitted and the sender is just waiting for acknowledgements. This is a failure to effectively use the available network bandwidth.

As an example, assuming that your round-trip time is 100 milliseconds, or 0.1 seconds, you'll only be able to transmit:

(1 window / 0.10 seconds) = 10 windows/second

With a 32-Kbyte buffer size, you'll only get a transfer rate of,

32 Kbytes/window * 10 windows/second = 320 Kbytes/sec.

That's fine for small files but if you're transferring a 1-Gbyte file, it would take,

1 Gbyte / (320 Kbytes/sec) = 3276.8 seconds

or almost an hour. Increase the buffer size to 1-Mbyte and the time drops dramatically to,

1 Mbyte/window * 10 windows/sec = 10 Mbytes/sec

1 Gbyte / (10 Mbytes/sec) = 102.4 seconds, or under two minutes.
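The arithmetic above can be repeated for other window sizes and round-trip times. The following is a minimal sketch in Python and is not part of the original article. The function names are made up for illustration, and it assumes, as the example does, that exactly one window is sent per round trip and that 1 Kbyte = 1024 bytes.

```python
KBYTE = 1024
MBYTE = 1024 * KBYTE
GBYTE = 1024 * MBYTE


def window_limited_rate(window_bytes, rtt_seconds):
    """Bytes/sec when one window is sent per round trip (simplified model)."""
    return window_bytes / rtt_seconds


def transfer_time(file_bytes, window_bytes, rtt_seconds):
    """Seconds needed to move file_bytes at the window-limited rate."""
    return file_bytes / window_limited_rate(window_bytes, rtt_seconds)


if __name__ == "__main__":
    rtt = 0.1  # 100 millisecond round-trip time, as in the example above
    for window in (32 * KBYTE, 1 * MBYTE):
        rate = window_limited_rate(window, rtt)
        secs = transfer_time(1 * GBYTE, window, rtt)
        print(f"{window // KBYTE:5d}-Kbyte window: "
              f"{rate / KBYTE:8.0f} Kbytes/sec, 1 Gbyte in {secs:7.1f} seconds")
```

With 1 Gbyte taken as 1024 Mbytes, the 32-Kbyte case comes to 3276.8 seconds and the 1-Mbyte case to about 102 seconds. Real transfers will differ, since this model ignores everything except the window size and the round-trip time.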

So how can one adjust the socket buffer ("window") size? If you're a programmer, you can set the buffer size with the setsockopt() function call in your C program. Otherwise, you're at the mercy of the client and server applications.
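For the programmer's case, the same setsockopt() call is available from most languages' socket libraries. Here is a minimal sketch in Python rather than C, assuming an ordinary TCP socket. The helper name request_buffers is hypothetical. It reads the sizes back with getsockopt() because the operating system may adjust or cap what was requested.

```python
import socket


def request_buffers(sock, size_bytes):
    """Ask for send/receive buffers of size_bytes; return what the OS granted."""
    # Set the sizes before connect() or listen() so that TCP can take the
    # larger window into account when the connection is established.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    granted_send = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_recv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_send, granted_recv


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        snd, rcv = request_buffers(s, 1024 * 1024)
        print("granted send buffer:", snd, "bytes")
        print("granted receive buffer:", rcv, "bytes")
```

The granted values need not equal the request, so it is worth printing them before drawing conclusions from a timing test.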

Variants of the ftp program have different options that can be used to set the socket buffer sizes (for more information on different ftp implementations and how to change the buffer sizes in each, see http://dast.nlanr.net/Projects/FTP.html). The kerberos ftp that is used at ARSC has the 'lbufsize' command for setting the buffer size of your local system, or client, and 'rbufsize' command for setting the buffer size of the remote end. Once you're in ftp, they can be used as follows to set the buffer sizes:

  ftp> lbufsize <size in bytes>
  ftp> rbufsize <size in bytes>

SET BOTH SIDES; otherwise, one side will be left at the system default size. Like ships in a fleet that travel only as fast as the slowest ship, data will only transfer between two systems at the rate allowed by the smaller window, since that's the limiting factor.

The best buffer size to use can be calculated, but I've found that setting the buffer size to 1 Mbyte is a good starting point. However, in addition to the default buffer size, Unix platforms often impose a system limit:

In Solaris, the limit is 1 Mbyte; in Unicos, 1.5 Mbytes; in AIX, 320 Kbytes.
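One common way to do the calculation is the bandwidth-delay product: the buffer should hold all the data that can be in flight during one round trip. The sketch below is not from the original article. The helper name is hypothetical, the limits are simply the ones quoted above, and 100 Mbits/sec is taken as 100,000,000 bits per second.

```python
KBYTE = 1024
MBYTE = 1024 * KBYTE

# System maximums quoted in the article (they may differ on your system).
SYSTEM_LIMITS = {
    "Solaris": 1 * MBYTE,
    "Unicos": 3 * MBYTE // 2,  # 1.5 Mbytes
    "AIX": 320 * KBYTE,
}


def suggested_buffer(bandwidth_bits_per_sec, rtt_seconds, limit_bytes=None):
    """Bandwidth-delay product in bytes, capped at limit_bytes if given."""
    bdp = int(bandwidth_bits_per_sec * rtt_seconds / 8)
    if limit_bytes is not None:
        bdp = min(bdp, limit_bytes)
    return bdp


if __name__ == "__main__":
    # 100 Mbits/sec link with a 100 millisecond round-trip time.
    print("uncapped:", suggested_buffer(100e6, 0.1), "bytes")
    for name, limit in SYSTEM_LIMITS.items():
        print(f"{name}:", suggested_buffer(100e6, 0.1, limit), "bytes")
```

For a 100 Mbits/sec path with a 100 millisecond round trip, this comes to about 1.25 million bytes, which is consistent with 1 Mbyte being a reasonable starting point. On a system with a lower maximum, the cap becomes the effective value.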

These limits can be adjusted, but the system administrator must do it. Compounding the problem, ftp will not always tell you when you've tried to exceed the system maximum, and when you do exceed it, ftp will often just use the default size instead of the maximum available. When you exceed the maximum buffer size in kerberos ftp, for example, it doesn't inform you until you start the data transfer:


  ftp> lbufsize 2097152
  Set local TCP buffer size to 2097152 bytes
  ftp> rbufsize 2097152
  200 TCP buffer size set to 2097152 bytes
  ftp> put y
  local: y remote: y
  ftp: setsockopt(SO_SNDBUF) (ignored): No buffer space available
  ftp: setsockopt(SO_RCVBUF) (ignored): No buffer space available
  200 PORT command successful.
  150 Opening BINARY mode data connection for y.

and it uses the small default buffer size. Each Unix implementation provides a way to determine the system default settings, but the commands vary, and you might be better off just cutting your lbufsize and rbufsize requests in half and trying again.
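If you are writing your own transfer program rather than using ftp, the same "cut it in half and try again" approach can be automated. This is a rough sketch in Python, not something from the original article. The function name and the 32-Kbyte floor are assumptions chosen for illustration.

```python
import socket


def set_buffer_with_fallback(sock, option, requested, floor=32 * 1024):
    """Try 'requested' bytes, halving on failure; return the size set or None."""
    size = requested
    while size >= floor:
        try:
            sock.setsockopt(socket.SOL_SOCKET, option, size)
            return size
        except OSError:
            # For example "No buffer space available": the request was
            # larger than the system maximum, so try half as much.
            size //= 2
    return None


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        got = set_buffer_with_fallback(s, socket.SO_SNDBUF, 2 * 1024 * 1024)
        print("send buffer request accepted at:", got, "bytes")
```

Some systems silently cap an oversized request instead of returning an error, so it can still be worth checking the result with getsockopt() afterwards.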

Once you establish and set larger buffer sizes, you may see significant improvement in long-haul transfers. Those 100-200 Kbyte/sec transfers will seem a thing of the past, and your 10-Gbyte file may transfer in minutes rather than hours or days.

--

In Part II of this series, I will examine some of these issues in greater depth.

Running pre-ISO Std C++ code on the Crays

As described in Shawn Houston's article, "Testing ISO C++ Conformance with Cray PE 3.6," in issue 267, Cray's C++ compilers, starting with Programming Environment 3.6, provide very good ISO compliance and a complete STL.

Cray C++ 3.6.0.1 is installed at ARSC on the T3E and SV1ex. As described in "man CC" for this version, the environment variable CRAYOLDCPPLIB can be set to improve compatibility for codes written to pre-ISO compilers.

The latest Cray C++ compiler is version 5.0, and it's installed on the X1. With C++ version 5.0, Cray makes the C++ Standard Library the default. The function of CRAYOLDCPPLIB has been extended to allow use of the nonstandard header files commonly seen in old code, such as <iostream.h>.

As an example of such old code, the PE 5.0 loader will by default not accept this:


  #include <iostream.h>
  int main(int argc, char* argv[])
  {
    cout<<"this is a test"<<endl;
    return 0;
  }

It wouldn't recognize <iostream.h> and thus would not find "cout". (According to the C++ standard, <iostream.h> is replaced by <iostream>.) Setting CRAYOLDCPPLIB to "1" at compile time allows it to recognize "iostream.h" and the other nonstandard headers.

Under PE3.6 or later, if you're having problems with older C++ code, keep the man pages, CrayDocs, and ARSC consultants in mind.

ARSC SGI Users: July 28th, /allsys Is Going Away

ARSC SGI users, change is coming soon. Please be sure to read "news allsys" on the SGIs. We will unmount /allsys and /viztmp from all SGIs on July 28.

IBM Training Opportunity

ACTC will be visiting ARSC August 11th-15th to present training on optimizing codes for the IBM P4 systems and on programmers' tools.

Course outline:
  • Overview of IBM systems, processors and AIX; differences from and similarities to other MPP systems.
  • Programming shared and distributed memory using OpenMP and MPI, and combining the two. Programming tools to help improve performance.
  • Tuning codes for P4 and P690 systems along with a porting clinic to help users with specific problems.

Details on ACTC:

http://www.research.ibm.com/actc

Significant new IBM resources to be installed at ARSC:

http://www.arsc.edu/news/10thanniversary.html

During class there will be opportunities to discuss your coding issues with the ACTC instructors and ARSC staff and to work on your codes. A more detailed timetable will be available shortly.

If you are interested or have questions, contact Guy Robinson (robinson@arsc.edu). Places are limited, so don't delay.

AAAS 54th Arctic Science Conference

54TH ARCTIC SCIENCE CONFERENCE :: EXTREME EVENTS

"Understanding Perturbations to the Physical and Biological Environment"

22-24 September 2003
Westmark Hotel & Convention Center
Fairbanks, Alaska

http://arctic.aaas.org/meetings/2003/

Call for Abstracts: All persons wishing to present either an oral paper or poster should submit their abstracts by Thursday, 24 July 2003.

Registration: You may register at http://arctic.aaas.org/register (early registration: received by August 9).

Conference Theme: "Understanding Perturbations to the Physical and Biological Environment." Earthquakes, eruptions, nuclear testing, endangered species, global warming and other abrupt departures from the steady state.

On November 3, 2002, a magnitude 7.9 earthquake struck the Denali Fault. This one-in-one-thousand-year event will likely change how we think about tectonics in Alaska, and perhaps more generally about subduction tectonics worldwide.

Large departures from the expected, sometimes called catastrophes, introduce new perspectives to scientific thought. They remind us that what will happen tomorrow is not just an extrapolation of what happened yesterday and today. Large events may be manmade as in the case of oil spills, nuclear blasts, sudden cultural contact, and global warming; or, they may be natural as in the case of great earthquakes, great eruptions, glaciation and deglaciation, meteor impacts and biological extinctions.

This meeting will take a retrospective and prospective look at the unexpected and what it means for the Arctic. The focus will be on understanding the impact of large physical events on the biosphere and human society. Discipline-oriented sessions reporting progress in Arctic science will also be held.

Quick-Tip Q & A



A:[[ Is there an easy way to extract a column from a regular text file?
  [[ For instance, a column of data.  Or am I back to writing a perl 
  [[ script?


  #   
  # Thanks to Martin Luthi:
  #   
  
  There is cut. E.g. 
  
  cut -c 10-20 platte.dbs      gives the columns (characterwise)
  cut -f2-3 "-d " /etc/mtab    gives the fields 2 to 3
                               the " "  are here to quote the space
                               character as delimiter
  
  There is also awk and perl, examples are given in 
    *** Unix Power Tools, O'Reilly ***
  the best Unix book I've ever seen (and the only one I own).
  
  # 
  # From Rich Hickey:
  # 

  ls -l | awk '{print $3,$4}'

     This, as an example, will print user and group out of a column
     list. Quick and easy.

  cat freddy.txt | awk '{print $1, $NF}'

     Prints first and last column in a file.
  
  
  # 
  # From Kurt Carlson:
  # 
  
  With UNIX there're probably a dozen ways to do this.
  
  If white space delimited, say 3rd column:
    nawk '{print $3}' in_file >out_file
  
  If colon delimited, say 5th column:
    nawk -F: '{print $5}' in_file >out_file
  
  If arbitrary columns, say 19 to 22, nawk is a bit more
  esoteric but for C coders still straight forward:
    nawk '{x=substr($0,19,4); printf("%s\n",x);}' in_file >out_file
  or
    nawk '{x=substr($0,19,4); print x}' in_file >out_file
  
  The 'cut' command (man cut) can also be used, but I can never remember
  the syntax.

  I generally recommend folks (especially those who can code some C)
  learn to use nawk|awk|gawk, they're powerful tools either as script
  files, imbedded in scripts, or from the command line.

  If you speak nawk, you rarely need perl!


  # 
  # And from Ed Kornkven:
  # 
  
  This is trivial if your editor provides the ability to select and
  operate on rectangular blocks of text.  With such an editor, one
  simply highlights the block (in this case, the column), and does a
  copy-and-paste.  Done.

  Not all editors support this feature, unfortunately  -- notably, vi
  doesn't.  "NEdit" and "Vim" both do, and I have done this operation in
  Vim under Mac OS X.   I have also performed it on several PC editors
  including the excellent shareware editor "TextPad".

  

Q: There are commands I'd like to issue to ftp, whenever I use
   it. For instance, "idle 7200" and the lbufsize/rbufsize settings.
   Can I do this automatically, without typing them at the ftp prompt
   every single time?

[[ Answers, Questions, and Tips Graciously Accepted ]]


Current Editors:
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669
Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks AK 99775-6020
Archives:
    Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.