Efficiently computing floating-point arithmetic hundreds of thousands of times in Bash - performance

Background
I work for a research institute that studies storm surges computationally, and I am attempting to automate some of the HPC commands using Bash. Currently the process is that we download the data from NOAA and create the command file manually, line by line, entering the location of each file along with a time for the program to read the data from that file and a wind magnification factor. There are hundreds of these data files in each download NOAA produces, and the downloads come out every 6 hours or so when a storm is in progress. This means that much of our time during a storm is spent making these command files.
Problem
I am limited in the tools I can use to automate this process because I simply have a user account and a monthly allotment of time on the supercomputers; I do not have the privilege to install new software on them. Plus, some of them are Crays, some are IBMs, some are HPs, and so forth. There isn't a consistent operating system between them; the only similarity is they are all Unix-based. So I have at my disposal tools like Bash, Perl, awk, and Python, but not necessarily tools like csh, ksh, zsh, bc, et cetera:
$ bc
-bash: bc: command not found
Further, my lead scientist has requested that all of the code I write for him be in Bash because he understands it, with minimal calls to external programs for things Bash cannot do. For example, it cannot do floating point arithmetic, and I need to be able to add floats. I can call Perl from within Bash, but that's slow:
$ time perl -E 'printf("%.2f", 360.00 + 0.25)'
360.25
real 0m0.052s
user 0m0.015s
sys 0m0.015s
1/20th of a second doesn't seem like a long time, but when I have to make this call 100 times in a single file, that equates to about 5 seconds to process one file. That isn't so bad when we are only making one of these every 6 hours. However, if this work is abstracted to a larger assignment, one where we point 1,000 synthetic storms at the Atlantic basin at one time in order to study what could have happened had the storm been stronger or taken a different path, 5 seconds quickly grows to more than an hour just to process text files. When you are billed by the hour, this poses a problem.
Question
What is a good way to speed this up? I currently have this for loop in the script (the one that takes 5 seconds to run):
for FORECAST in $DIRNAME; do
    echo $HOURCOUNT" "$WINDMAG" "${FORECAST##*/} >> $FILENAME;
    HOURCOUNT=$(echo "$HOURCOUNT $INCREMENT" | awk '{printf "%.2f", $1 + $2}');
done
I know a single call to awk or Perl to loop through the data files would be a hundred times faster than calling either once for each file in the directory, and that these languages can easily open a file and write to it, but the problem I am having is getting data back and forth. I have found a lot of resources on these three languages alone (awk, Perl, Python), but haven't been able to find as much on embedding them in a Bash script. The closest I have been able to come is to make this shell of an awk command:
awk -v HOURCOUNT="$HOURCOUNT" -v INCREMENT="$INCREMENT" -v WINDMAG="$WINDMAG" -v DIRNAME="$DIRNAME" -v FILENAME="$FILENAME" 'BEGIN{ for (FORECAST in DIRNAME) do
...
}'
But I am not certain that this is correct syntax, and if it is, if it's the best way to go about this, or if it will even work at all. I have been hitting my head against the wall for a few days now and decided to ask the internet before I plug on.

Bash is very capable as long as you have the tools you need available. For floating point, you basically have two options: either bc (which at least on the box you show isn't installed [which is kind of hard to believe]) or calc (calc-2.12.4.13.tar.bz2).
Both are flexible, very capable floating-point programs that integrate well with bash. Since the powers that be have a preference for bash, I would investigate installing either bc or calc. (Job security is a good thing.)
If your superiors can be convinced to allow either perl or python, then either will do. If you have never programmed in either, both will have a learning curve, python slightly more so than perl. If your superiors can read bash, then perl will be much easier for them to digest than python.
This is a fair outline of the options you have, given your situation as you've explained it. Regardless of your choice, the task should not be that daunting in any of these languages. Just drop a line back when you get stuck.
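If bc (or calc) does turn out to be available, the addition from the question can be handed to it directly. A minimal sketch using the question's variable names; it still costs one external process per call, so it helps correctness more than speed:
# Sketch, assuming bc is installed: bc does the addition and bash's
# builtin printf does the two-decimal formatting.
HOURCOUNT=$(printf '%.2f' "$(echo "$HOURCOUNT + $INCREMENT" | bc -l)")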

Starting awk or another command just to do a single addition is never going to be efficient. Bash can't handle floats, so you need to shift your perspective. You say you only need to add floats, and I gather these floats represent a duration in hours. So use seconds instead.
for FORECAST in $DIRNAME; do
    printf "%d.%02d %s %s\n" \
        $((SECONDCOUNT / 3600)) \
        $(((SECONDCOUNT % 3600) * 100 / 3600)) \
        $WINDMAG \
        ${FORECAST##*/} >> $FILENAME
    SECONDCOUNT=$((SECONDCOUNT + SECONDS_INCREMENT))
done
(printf is standard and much nicer than echo for formatted output)
EDIT: Abstracted as a function and with a bit of demonstration code:
function format_as_hours {
    local seconds=$1
    local hours=$((seconds / 3600))
    local fraction=$(((seconds % 3600) * 100 / 3600))
    printf '%d.%02d' $hours $fraction
}

# loop for 0 to 2 hours in 5 minute steps
for ((i = 0; i <= 7200; i += 300)); do
    format_as_hours $i
    printf "\n"
done
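A sketch of how this function might slot into the loop from the question (variable names follow the question; SECONDS_INCREMENT is assumed to be the time step expressed in whole seconds, e.g. 900 for a quarter of an hour):
SECONDCOUNT=0
for FORECAST in $DIRNAME; do
    # format_as_hours runs inside the shell, so no external program is started
    printf '%s %s %s\n' "$(format_as_hours "$SECONDCOUNT")" "$WINDMAG" "${FORECAST##*/}"
    SECONDCOUNT=$((SECONDCOUNT + SECONDS_INCREMENT))
done >> "$FILENAME"
Redirecting once after done also avoids reopening $FILENAME for every line written.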

If all these computers are Unix systems and they are expected to perform floating-point computations, then each of them must have some floating-point-capable program available. So a compound command along the lines of
bc -l some-comp || dc some-comp || ... || perl some-comp
should fall through to whichever calculator is actually installed on a given host.
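As a hedged sketch of that idea, a small wrapper can probe for whichever calculator exists on the current host (the function name add_floats is made up for illustration):
# Prints the sum of its two arguments with two decimal places, using
# whichever floating-point helper happens to be installed.
add_floats() {
    if command -v bc >/dev/null 2>&1; then
        printf '%.2f\n' "$(printf '%s + %s\n' "$1" "$2" | bc -l)"
    elif command -v awk >/dev/null 2>&1; then
        awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a + b }'
    else
        perl -e 'printf "%.2f\n", $ARGV[0] + $ARGV[1]' "$1" "$2"
    fi
}

add_floats 360.00 0.25    # prints 360.25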

Related

Does AWK Buffer Multiple Print Statements Before Writing Them to Disc?

I have an AWK script that writes tens of thousands of pretty long lines to a couple of files and nearly ten thousand lines to a few more files on a network drive (all needed for different purposes). I would like to make the file I/O as efficient as possible for a few reasons.
Does AWK immediately write to a file with every print(f) statement or does it buffer them? If so, how much buffering goes on?
I am considering writing everything to a buffer (e.g., rec1 "\n" rec2 "\n" rec3...) and then dumping it all with a single print command, but not if it won't have a net benefit.
I am curious, not just for this program, but also to sharpen my "best practices" skills. I program a lot in AWK, but haven't been able to find the answer to this, yet.
Thanks in advance...
Yes, as you can read in the GNU Awk manual under I/O functions. Actually, that is why fflush (accepted for inclusion in POSIX) exists: to flush the buffers. And here is some practical evidence.
As @Quasimodo points out, yes, awk buffers its output by default, and you can bypass that by inserting fflush() statements if you like.
For the other part of your question (buffering everything in a variable and dumping it with a single print): constantly appending to a variable using string concatenation in awk is roughly as slow as the I/O itself, since awk has to repeatedly find a new memory area big enough to hold the result of the concatenation, copy the contents of the old location to the new one, append the new text, and then free the previous area. So there would be no noticeable gain in execution speed from buffering everything and printing it at once versus just printing as you go.
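For illustration, here is roughly what the fflush() variant looks like; the file names are placeholders:
# Default behaviour: output to out.txt is buffered and flushed when the
# buffer fills or when awk exits.
awk '{ print > "out.txt" }' input.txt

# Flush after every record: useful if another process is reading the file
# as it grows, at the cost of one flush per line.
awk '{ print > "out.txt"; fflush("out.txt") }' input.txt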

Batch "unfill" paragraphs with a shell script

I have a bunch of text files, with hard line breaks at 80 characters. I'd like to "unfill" (to use the emacs term) those paragraphs, such that each paragraph is a single line, to make copying and pasting text from those files into other applications easier. Is there a way to do that with a shell script?
For example, I have input text that looks like:
Call me Ishmael. Some years ago- never mind how long precisely- having little or
no money in my purse, and nothing particular to interest me on shore, I thought
I would sail about a little and see the watery part of the world. It is a way I
have of driving off the spleen and regulating the circulation. Whenever I find
myself growing grim about the mouth; whenever it is a damp, drizzly November in
my soul; whenever I find myself involuntarily pausing before coffin warehouses,
and bringing up the rear of every funeral I meet; and especially whenever my
hypos get such an upper hand of me, that it requires a strong moral principle
to prevent me from deliberately stepping into the street, and methodically
knocking people's hats off- then, I account it high time to get to sea as soon
as I can. This is my substitute for pistol and ball. With a philosophical
flourish Cato throws himself upon his sword; I quietly take to the ship.
There is nothing surprising in this. If they but knew it, almost all men in
their degree, some time or other, cherish very nearly the same feelings towards
the ocean with me.
There now is your insular city of the Manhattoes, belted round by wharves as
Indian isles by coral reefs- commerce surrounds it with her surf. Right and
left, the streets take you waterward. Its extreme downtown is the battery,
where that noble mole is washed by waves, and cooled by breezes, which a few
hours previous were out of sight of land. Look at the crowds of water-gazers
there.
I would like the output text to look like:
Call me Ishmael. Some years ago- never mind how long precisely- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off- then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.
There now is your insular city of the Manhattoes, belted round by wharves as Indian isles by coral reefs- commerce surrounds it with her surf. Right and left, the streets take you waterward. Its extreme downtown is the battery, where that noble mole is washed by waves, and cooled by breezes, which a few hours previous were out of sight of land. Look at the crowds of water-gazers there.
Is there a way to do that using a shell script? I feel like sed ought to be able to do this, but I'm not sure what the specific commands are to get it to join paragraphs, rather than split them.
Using (g)awk
awk -vRS= -vORS= '{gsub("\n","")}{print $0 RT}' file
Splits records on paragraphs and removes all newlines from records.
With perl
perl -pe '/^$/?print:chomp' file
Here is a pure bash solution:
#!/bin/bash
while read -r
do
    if [[ -n $REPLY ]]
    then
        echo -n "$REPLY"
    else
        echo -e "\n$REPLY"
    fi
done < "gash.txt"
The newline and trailing spaces are removed by the read. If there is data remaining then we echo without a newline, otherwise we echo with an extra newline. REPLY is the default variable used by read.
Solution using perl
perl -i.bak -pe 's/^$/\n/ ; s/(.+)\n/$1/' *.txt
-i.bak edits the files in place and also creates backup files in case they are needed later or something goes wrong with the command. Use -i alone if no backup is needed.
s/^$/\n/ doubles the empty lines
s/(.+)\n/$1/ removes the newline character from non-empty lines

How long can I expect grep to take on a 10 TB file?

I have a 10 TB file with words from multiple books, and I'm trying to grep for some uncommon strings (no regex). For example:
grep "cappucino" filename
I'm trying to estimate how long this will take. I'm not really looking for whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep.
Please correct me if I'm wrong:
I use a mechanical hard drive with roughly 200 MB/s read speed, so 10 TB is about 10,000,000 MB and it should take roughly 10,000,000 / 200 = 50,000 seconds ≈ 14 hours to finish. Is this an accurate estimate?
The short answer is: no.
The longer answer is: it depends.
The even longer answer is: grep's performance depends on a lot of things:
whether you are running a fixed-string search (-F, fgrep) or not: grep uses the Boyer-Moore algorithm, which by itself cannot match regular expressions, so what grep does (or at least used to do) is first pick a fixed string out of your regexp, search for it in the text using BM, and only then do a regexp match (I'm not sure whether the current implementation uses an NFA or a DFA; probably a hybrid)
how long your pattern is: BM works faster for longer patterns
how many matches you will have: the fewer the matches, the faster it will be
what your CPU and memory are like: the hard drive helps you only while reading, not during computation
what other options you are using with grep
14 hours might not even be your lower bound, because Boyer-Moore is smart enough to compute an offset at which the next possible match might occur, so it doesn't need to examine every byte of the file. This does depend on the implementation, though, and is just my speculation. After re-running the test below with a much longer pattern I was able to get down to 0.23 seconds, and I don't think my disk is that fast, but there might be some caching involved instead.
For instance, I'm running on a 500 MB/s SSD (at least that's what the manufacturer says), and grepping a 200 MB file with a very short pattern (a few chars) gives me:
With 808320 hits
real 0m1.734s
user 0m1.334s
sys 0m0.120s
With 0 hits:
real 0m0.059s
user 0m0.046s
sys 0m0.016s
Edit: in short, read about Boyer-Moore :-)
Edit 2: to check how grep really works you should read the source code; I described only a very general workflow above.
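If you want to reproduce that kind of measurement yourself, something along these lines works; the test file and patterns are arbitrary, and the second run of each command will largely be served from the page cache:
# Build a test file of a few hundred MB from any large text you have handy.
for i in $(seq 1 200); do cat /usr/share/dict/words; done > testfile

# Compare a short pattern with many hits against a longer fixed-string
# pattern with few hits.
time grep -c "zz" testfile
time grep -cF "incomprehensibility" testfile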

Parameter expansion slow for large data sets

If I take the first 1,000 bytes from a file, Bash can replace some characters pretty quickly:
$ cut -b-1000 get_video_info
muted=0&status=ok&length_seconds=24&endscreen_module=http%3A%2F%2Fs.ytimg.com%2F
yts%2Fswfbin%2Fendscreen-vfl4_CAIR.swf&plid=AATWGZfL-Ysy64Mp&sendtmp=1&view_coun
t=3587&author=hye+jeong+Jeong&pltype=contentugc&threed_layout=1&storyboard_spec=
http%3A%2F%2Fi1.ytimg.com%2Fsb%2FLHelEIJVxiE%2Fstoryboard3_L%24L%2F%24N.jpg%7C48
%2327%23100%2310%2310%230%23default%23cTWfBXjxZMDvzL5cyCgHdDJ3s_A%7C80%2345%2324
%2310%2310%231000%23M%24M%23m1lhUvkKk6sTnuyKXnPBojTIqeM%7C160%2390%2324%235%235%
231000%23M%24M%23r-fWFZpjrP1oq2uq_Y_1im4iu2I%7C320%23180%2324%233%233%231000%23M
%24M%23uGg7bth0q6XSYb8odKLRqkNe7ao&approx_threed_layout=1&allow_embed=1&allow_ra
tings=1&url_encoded_fmt_stream_map=fallback_host%3Dtc.v11.cache2.c.youtube.com%2
6quality%3Dhd1080%26sig%3D610EACBDE06623717B1DC2265696B473C47BD28F.98097DEC78411
95A074D6D6EBFF8B277F9C071AE%26url%3Dhttp%253A%252F%252Fr9---sn-q4f7dney.c.youtub
e.com%252Fvideoplayback%253Fms%253Dau%2526ratebypass%253Dyes%2526ipbits%253D8%25
26key%253Dyt1%2526ip%253D99.109.97.214%2
$ read aa < <(cut -b-1000 get_video_info)
$ time set "${aa//%/\x}"
real 0m0.025s
user 0m0.031s
sys 0m0.000s
However if I take 10,000 bytes it slows dramatically
$ read aa < <(cut -b-10000 get_video_info)
$ time set "${aa//%/\x}"
real 0m8.125s
user 0m8.127s
sys 0m0.000s
I read Greg Wooledge’s post but it lacks an explanation as to why Bash parameter expansion is slow.
For the why, you can see the implementation of this code in pat_subst in subst.c in the bash source code.
For each match in the string, the length of the string is counted numerous times (in pat_subst, match_pattern and match_upattern), both as a C string and more expensively as a multibyte string. This makes the function both slower than necessary, and more importantly, quadratic in complexity.
This is why it's slow for larger input.
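To see the effect yourself, you can time the same expansion over growing prefixes of the input; a sketch reusing the commands from the question:
# Sketch: time the same replacement for growing prefixes of get_video_info;
# the runtime grows far faster than the input size.
for n in 1000 2000 4000 8000; do
    read -r aa < <(cut -b-"$n" get_video_info)
    echo "first $n bytes:"
    time set "${aa//%/\x}"
done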
As for workarounds, just use sed. It's more likely to be optimized for string replacement operations (though you should be aware that POSIX only guarantees 8192 bytes per line, even though GNU sed handles arbitrarily large ones).
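For this particular substitution, a sed equivalent of ${aa//%/\x} would be something like:
# sed streams through the data once, so the cost grows linearly with the
# input size rather than quadratically.
aa=$(cut -b-10000 get_video_info | sed 's/%/\\x/g')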
Originally, older shells and other utilities imposed LINE_MAX = 2048 on file input for this kind of reason. Bash has no problem parking huge variables in memory, but substitution requires at least two concurrent copies, and lots of thrashing: as groups of characters are removed, whole strings get rewritten, over and over and over.
There are tools meant for this; sed is the premier choice, and bash a distant second. sed works on streams, bash works on memory blocks.
Another choice: bash is extensible. You can write custom C code (a loadable builtin) to do well what bash was not meant to do.
Chris F. A. Johnson has good articles on how to do that:
Some ready to load builtins:
http://cfajohnson.com/shell/bash/loadables/
DIY builtins explained:
http://cfajohnson.com/shell/articles/dynamically-loadable/
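For what it's worth, once such a builtin has been compiled, loading it from bash is a one-liner; the file name and builtin name below are invented for illustration.
# Hypothetical example: load a compiled loadable builtin named "fsum"
enable -f ./fsum.so fsum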

Do comments affect Perl performance?

I'm optimizing some frequently run Perl code (once per day per file).
Do comments slow Perl scripts down? My experiments lean towards no:
use Benchmark;
timethese(20000000, {
'comments' => '$b=1;
# comment ... (100 times)
', 'nocomments' => '$b=1;'});
Gives pretty much identical values (apart from noise).
Benchmark: timing 10000000 iterations of comments, nocomments...
comments: 1 wallclock secs ( 0.53 usr + 0.00 sys = 0.53 CPU) # 18832391.71/s (n=10000000)
nocomments: 0 wallclock secs ( 0.44 usr + 0.00 sys = 0.44 CPU) # 22935779.82/s (n=10000000)
Benchmark: timing 20000000 iterations of comments, nocomments...
comments: 0 wallclock secs ( 0.86 usr + -0.01 sys = 0.84 CPU) # 23696682.46/s (n=20000000)
nocomments: 1 wallclock secs ( 0.90 usr + 0.00 sys = 0.90 CPU) # 22099447.51/s (n=20000000)
I get similar results if I run the comments and no-comments versions as separate Perl scripts.
It seems counter-intuitive, though; if nothing else, the interpreter needs to read the comments into memory every time.
Runtime performance? No.
Parsing and lexing performance? Yes, of course.
Since Perl tends to parse and lex on the fly, then comments will affect "start up" performance.
Will they affect it noticeably? Unlikely.
Perl is a just-in-time compiled language, so comments and POD have no effect on run-time performance.
Comments and POD have a minuscule effect on compile-time, but they're so easy and fast for Perl to parse it's almost impossible to measure the performance hit. You can see this for yourself by using the -c flag to just compile.
On my Macbook, a Perl program with 2 statements and 1000 lines of 70-character comments takes the same time to compile as one with 1000 lines of empty comments, and the same as one with just the 2 print statements. Be sure to run each benchmark twice to allow your OS to cache the file; otherwise what you're benchmarking is the time to read the file from the disk.
If startup time is a problem for you, it's not because of comments and POD.
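A rough way to reproduce that compile-only comparison; the file names are made up, and yes/head are just a quick way to generate the comment lines:
# Two print statements, no comments.
printf 'print "a\\n";\nprint "b\\n";\n' > plain.pl

# The same two statements preceded by 1000 comment lines of ~70 characters.
yes '# this comment line is padded out to roughly seventy characters xxxxxxxxx' \
    | head -1000 > commented.pl
cat plain.pl >> commented.pl

# Compile only, without executing; run each twice so the OS file cache is warm.
time perl -c plain.pl
time perl -c commented.pl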
Perl compiles a script and then executes it. Comments marginally slow the compile phase, but have zero effect on the run phase.
Perl is not a scripting language in the same sense that shell scripts are. The interpreter does not read the file line by line. The execution of a Perl program is done in two basic stages: compilation and runtime [1]. During the compilation stage the source code is parsed and converted into bytecode. During the runtime stage the bytecode is executed on a virtual machine.
Comments will slow down the parsing stage but the difference is negligible compared to the time required to parse the script itself (which is already very small for most programs). About the only time you're really concerned with parsing time is in a webserver environment where the program could be called many times per second. mod_perl exists to solve this problem.
You're using Benchmark. That's good! You should be looking for ways to improve the algorithm -- not micro-optimizing. Devel::DProf might be helpful to find any hot spots. You absolutely should not strip comments in a misguided attempt to make your program faster. You'll just make it unmaintainable.
[1] This is commonly called "just in time" compilation. Perl actually has several more stages like INIT and END that don't matter here.
The point is: optimize bottlenecks. Reading in a file consists of:
opening the file,
reading in its contents,
closing the file,
parsing the contents.
Of these steps, reading is the fastest part by far (I am not sure about closing; it is a syscall, but you don't have to wait for it to finish). Even if it were 10% of the whole thing (which it is not, I think), reducing it by half would only give a 5% performance improvement, at the cost of losing the comments (which is a very bad thing). For the parser, throwing away a line that begins with # is not a tangible slowdown. And after that, the comments are gone, so there can be no slowdown.
Now, imagine that you could actually improve the "reading in the script" part by 5% by stripping all comments (which is a really optimistic estimate, see above). How big a share of the script's overall running time is "reading in the script"? That depends on how much it does, of course, but since perl scripts usually read at least one more file it is 50% at most, and since they usually do rather more than that, an honest estimate brings it down to somewhere around 1%. So the expected efficiency improvement from stripping all comments is at most (very optimistically) 2.5%, but realistically closer to 0.05%. And the scripts where it actually gains more than 1% are already fast, since they do almost nothing, so you would again be optimizing in the wrong place.
Concluding, optimize bottlenecks.
The Benchmark module is useless in this case. It's only measuring the times to run the code over and over again. Since your code doesn't actually do anything, most of it is optimized away. That's why you're seeing it run 22 million times a second.
I have almost an entire chapter about this in Mastering Perl. The error of measurement in the Benchmark technique is about 7%. Your benchmark numbers are well within that, so there's virtually no difference.
From Paul Tomblin's comment:
Doesn't perl do some sort of on-the-fly compilation? Maybe the comments get discarded early?
Yes Perl does.
It is a programming language somewhere in between compiled and interpreted: the code gets compiled on the fly and then run. The comments usually don't make any difference. The most they could affect is the initial parsing and pre-compilation of the file, where you might see a nanosecond of difference.
I would expect that the one comment would only get parsed once, not multiple times in the loop, so I doubt it is a valid test.
I would expect that comments would slightly slow compilation, but I expect it would be too minor to bother removing them.
Do Perl comments slow a script down? Parsing it, yes. Executing it after parsing? No. How often is a script parsed? Only once. So if you have a comment inside a for loop, the comment is discarded by the parser once, before the script even runs; once it is running, the comment is already gone (Perl does not store the script internally as source text), so no matter how many times the for loop repeats, the comment has no influence. How fast can the parser skip over comments? The way Perl comments work, very fast, so I doubt you will notice. You would notice a higher start-up time if you had 5 lines of code with a million lines of comments between each of them... but how likely is that, and of what use would a comment that large be?
