ARSC HPC Users' Newsletter 389, June 27, 2008
On the Hazards of Array Syntax, ( Or How I Achieved Six Orders of Magnitude Speedup )
[ By: Lee Higbie ]
For a few computational problems the phenomenon being modeled is completely local. What happens at one point in the model does not depend on what goes on around it. To a first order, a forest fire's burning depends on the weather and the fuel, but not on the fire's behavior a mile away, in another part of the forest. These programs, sometimes called embarrassingly parallel, are the easiest to parallelize for supercomputers because the model can easily run on many cores.
Most programs require information from neighboring points, which means that the cores running the program must communicate- the major source of speed and programming bottlenecks for many parallelized programs. For example, the weather here depends not only on the local temperature, pressure, wind speed, but also on that of the areas nearby. Wind and barometric pressure changes propagate rapidly compared to fires.
I received a program from an Outside source ("Outside" is Alaskan for "Lower 48"). It is a first order production forest fire model. We wanted to adapt it to use for the fires we have in Interior Alaska, ones that often account for more than half the acreage burned in the entire US.
In the middle of this fortran code, inside the main triply nested loop, is the innocuous looking code block:
MC1 = 1.03*EMC where ( PCP > 2. ) MC1 = 35. end where where ( MC1 < 1. ) MC1 = 1. end where
After changing the geometry for the larger region needed for Alaska; updating the grid to use 1 km cells; and changing to a projection that is more suitable for our latitude, I had a program grid with about 2.7M grid cells instead of the original 10K. In the course of modifying the program, we also changed from 24 time cells to 49.
I ran the program. It bombed when it ran out of time, so I started a performance analysis. With gprof I quickly found that the main routine accounted for nearly all the execution time. After inserting a few:
call cpu_time(time) print *, [ cumulative elapsed time in code block]
I isolated the time to those simple lines above. Twice I had ignored them--no loop, nothing much going on. Right? Wrong!
I changed them slightly to:
real, dimension(1800, 1500, 49) :: PCP, mc1, EMC ... where ( PCP > 2. ) mc1 = 35. else where mc1 = max(1, 1.03 * EMC) end where
I figured this would stop trashing the cache and reduce storing the elements of the array mc1, to one store per element. This change sped up the loop nest by about a factor of three, pretty good for an hour's work. But still on the order of a second per innermost loop iteration.
But, I kept thinking about the physics. The inner pair of loops was working on the mesh's 2.7 M grid cells, why would any cell require more than a second's worth of computation? And if you do the arithmetic, that works out to a couple core-weeks of CPU time, still far from feasible for a production code that should run daily or twice daily during the fire season. Even using all of Midnight's 2312 cores, it would take thousands of seconds. Embarrassing.
I reread the paper describing the fire model, the one the original program was based on, and decided that nearly all the work implied by the array statements was redundant. Each variable in both of the statement blocks above is triply subscripted for longitude, latitude and time. The innermost loop pair is also over longitude and latitude. Only the value for the current cell is being used in any iteration.
Conveniently i and j are used for longitude and latitude. With the redundant calculations removed, the statement block becomes:
where ( PCP(i, j, :) > 2. ) mc1(i, j, :) = 35. else where mc1(i, j, :) = max(1, 1.03 * EMC(i, j, :)) end where
The innermost loop pair includes more than 100 lines of code, but now it runs about a million times faster than before. 396M loads and stores have been eliminated from each loop iteration. (The third subscript, the colon above, is over the time steps.)
Those 396 million extra loads and stores took so much time that the one hundred line loop pair runs about a million times faster than the original. Embarrassment gone.
[Ed. Note: On the entire program, which performs extensive I/O, the changes would save more than 100,000,000,000,000,000 stores and reduce the run time by a factor of about 5000.]
Handling Little Endian Files with XLF
[ By: Don Bahls ]
By default IBM systems such as iceberg generate big endian binary files. This can cause problems when you attempt to use binary files that were generated on a little endian system, such Linux systems using AMD or Intel processors. Version 10 of the XLF compiler has a runtime option which allows Fortran I/O to use unformatted little endian files.
Here's an example using the reader and writer codes from newsletter 344 see: /arsc/support/news/hpcnews/hpcnews344/index.xml#article1 .
Compile the code as you normally would.
iceberg1 46% xlf90 reader.f90 -o reader ** reader === End of Compilation 1 === 1501-510 Compilation successful for file reader.f90. iceberg1 47% xlf90 writer.f90 -o writer ** writer === End of Compilation 1 === 1501-510 Compilation successful for file writer.f90.
If the application is run, we get a standard unformatted big-endian file. The command "od -x" lets us verify the output.dat file is in the expected format.E.g.
iceberg1 51% od -x output.dat 0000000 0000 0018 cede d0da ce0f aded dead beef 0000020 cafe feed deaf bead edef face 0000 0018
When the XLFRTEOPTS environment variable is set, little-endian unformatted I/O will be performed for each specified unit. In these two examples, unit 11 is used for the I/O.E.g.
export XLFRTEOPTS=ufmt_littleendian=11 iceberg1 53% ./writer iceberg1 54% od -x output.dat 0000000 1800 0000 edad 0fce dad0 dece efbe adde 0000020 edfe feca cefa efed adbe afde 1800 0000
The reader application confirms the output.dat file is formatted properly.E.g.
iceberg1 61% ./reader Integer*4 matches. Real*4 matches. Integer*8 matches. Real*8 matches.
If you happen to need to do little-endian I/O on a range of units you can specify a range with the XLFRTEOPTS environment variable:E.g.
# specifies that little-endian unformatted I/O be done on units 10 # to 100 export XLFRTEOPTS=ufmt_littleendian=10-100
For additional details on this feature see the following documentation available on the IBM website:http://www-1.ibm.com/support/docview.wss?rs=43&uid=swg21243120
Back in 2004, iceberg became the first allocated IBM system at ARSC. It was one of the first systems that IBM deployed with, the then new, Federation 2 switch. Over the last 4 years iceberg has been a work-horse HPCMP and academic users alike. In a little under a month, iceberg will be retired to make way for our latest allocated system.
Accounts on iceberg will be inactivated at 1 PM AKDT on July 18th.
Quick-Tip Q & A
A:[[ Some Unix commands, like ps and finger, spit out a line of column [[ headers followed by a table of information. Being self-absorbed, [[ I often want to grep for myself in these tables, but I need to see [[ the column headers too, because they change depending on the command [[ options. Check out the different "ps" column headers! [[ [[ mg56:~ % ps -e head -1 [[ PID TTY TIME CMD [[ mg56:~ % ps -l head -1 [[ F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD [[ mg56:~ % ps -ef head -1 [[ UID PID PPID C STIME TTY TIME CMD [[ mg56:~ % ps -elf head -1 [[ F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD [[ [[ Apparently the developers did NOT consider my needs when they wrote [[ these tools. To see the header and information rows, am I really [[ stuck doing this?! [[ [[ mg56:~ % ps -ef head -1 [[ UID PID PPID C STIME TTY TIME CMD [[ mg56:~ % ps -ef grep mortimer [[ mortimer 9996 9994 0 10:07 ? 00:00:00 sshd: mortimer [priv] [[ mortimer 9998 9997 0 10:07 pts/8 00:00:00 -ksh [[ mortimer 18706 9998 0 10:27 pts/8 00:00:00 ps -ef # # Thanks to Ryan Czerwiec # The short answer is: yes, you're stuck because the problem is with the behavior of the grep / head and not the original command. It's not really a developer issue since they all behave as intended. There are solutions for some particular commands, but the broad scope of every header-producing command effectively precludes any generic solution or the ability of developers to predict the need and implement one. It wouldn't be a big deal to replace each of these commands with an alias or script on a case-by-case basis. There happens to be a workaround for ps, but as an example, you could change your pattern so it would also catch the header too, but likely nothing else additional. You'd have to know ahead of time what the header looks like and what strings in it wouldn't be found elsewhere, for instance PID for ps, as in: ps -elf grep "[Pm][Io][Dr]" In standard grep (definitely not foolproof), or: ps -elf grep "PID\ mortimer" in gnu grep. # # Greg Newby # With apologies for BSD vs. SysV Unix options for the PS command, I'll show a few examples from my Mac, where there are plenty of processes owned by my username. There are several scripts and aliases I can imagine writing to meet your needs, but this doesn't seem that hard to do by just separating the two commands by a semicolon. First command gets the header, second gets your processes. It's pretty, and doesn't limit you to particular ps options. % ps -ax head -1 ; ps -ax grep `whoami` PID TT STAT TIME COMMAND 2587 ?? Ss 87:40.94 /Users/newby/Documents/local/bin/Firefox.app/Contents 391 p1 Ss 0:00.02 login -pf newby 5085 p1 S 0:00.10 /Users/newby/.local/bin/mutt -F /Users/newby/.muttrc. 889 p2 Ss 0:00.01 login -pf newby 3105 p2 S+ 0:04.26 ssh email@example.com 933 p3 Ss 0:00.01 login -pf newby The semicolon simply tells the shell to run both commands, without giving a prompt in between. This will work on all shells I know of. "ps -ax" is about the same as "ps -ef" on midnight. "ps -aux" is about the same as "ps -elf" on midnight. Here's another, sorted on the PID field: % ps -aux head -1 ; ps -aux grep `whoami` sort +1 USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND newby 102 0.0 -0.4 82904 9024 ?? Ss Fri08PM 0:06.56 /System/Li newby 103 0.0 -0.4 359984 8536 ?? Ss Fri08PM 0:16.66 /System/Li newby 252 0.0 -0.1 349660 2692 ?? S Fri08PM 0:00.29 /System/Li newby 344 0.0 -0.1 55992 2256 ?? Ss Fri08PM 0:00.40 /System/Li newby 347 0.0 -0.0 27316 364 ?? S Fri08PM 0:00.01 /sbin/laun newby 353 0.0 -0.3 878884 5504 ?? S Fri08PM 2:20.76 /System/Li newby 354 0.0 -0.1 27936 1580 ?? S Fri08PM 0:00.49 aped ... If any of these suit you, then take the ones you use most commonly and add them as aliases or functions or shell scripts. Adding a pipe to "sort" opens many neat opportunities [you can even pipe one sort to another]. # # Brad Havel # Usage of egrep, which allows for more complex regular expressions to be grepped, would allow the command to be combined into a single pipe (assuming there is at least something on each line which could be searched, such as "PID" and userid for the ps -[elf] example above. ----- (mg56) [=> ps -ef egrep "(PID havel)" grep -v "grep" UID PID PPID C STIME TTY TIME CMD root 10064 31279 0 07:21 ? 00:00:00 sshd: havel [priv] havel 10066 10064 0 07:21 ? 00:00:00 sshd: havel [priv] havel 10067 10064 0 07:21 ? 00:00:00 sshd: havel@pts/0 havel 10068 10067 0 07:21 pts/0 00:00:00 -bash havel 10102 10068 0 07:21 pts/0 00:00:00 ps -ef (mg56) [=> ps -el egrep "(PID 3889)" grep -v "grep" F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 1 S 3889 10066 10064 0 77 0 - 4406 schedu ? 00:00:00 sshd 5 S 3889 10067 10064 0 75 0 - 4375 schedu ? 00:00:00 sshd 0 S 3889 10068 10067 0 75 0 - 2728 wait4 pts/0 00:00:00 bash 0 R 3889 10196 10068 0 77 0 - 867 - pts/0 00:00:00 ps ----- While probably not useful for everything it is a bit cleaner and a little more graceful than the "head -1; ps -ef" method. # # Derek Bastille # One dead-simple way to do this is to use an OR with grep like this: [mira-e0]:/bastille >ps -aux grep -E "bastille USER" USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND bastille 843 90.4 15.1 593456 197696 ?? R Thu10AM 151:22.68 /Applications/Safari.app/Contents/MacOS/Safari -psn_0_4194305 bastille 793 1.0 1.2 228884 15704 ?? S Thu09AM 94:18.28 /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIS bastille 2785 0.9 3.1 562604 41228 ?? S Fri05PM 55:54.16 /Applications/MeetingMaker 8.6.2/Meeting Maker Calendar.app/Contents/Mac bastille 872 0.4 4.1 292420 53500 ?? R Thu10AM 47:24.02 /Applications/Utilities/Terminal.app/Contents/MacOS/Terminal -psn_0_4718 bastille 801 0.1 0.5 199112 6000 ?? S Thu09AM 5:50.56 /Applications/Microsoft Mouse.app/Contents/Resources/MicrosoftMouseHelpe ... (I did this with OS X, so the ps is a bit different). Basically, the -E is for including a real expression. So I'm telling grep I want lines with 'bastille' or 'USER'. For your ps header example, you'd want to use: ps -ef grep -E "UID mortimer" # # An Editor # Derek forgot about this option he showed me a while back... % ps -aux grep -e mortimer -e USER Q: Is there a way to tell which flags the MPI compiler is passing to the system compiler? Specificially I would like to see which include files and libraries are being passed to the system compiler.
[[ Answers, Questions, and Tips Graciously Accepted ]]
Ed Kornkven ARSC HPC Specialist ph: 907-450-8669 Kate Hedstrom ARSC Oceanographic Specialist ph: 907-450-8678 Arctic Region Supercomputing Center University of Alaska Fairbanks PO Box 756020 Fairbanks AK 99775-6020
Subscribe to (or unsubscribe from) the e-mail edition of the
ARSC HPC Users' Newsletter.
Back issues of the ASCII e-mail edition of the ARSC T3D/T3E/HPC Users' Newsletter are available by request. Please contact the editors.