Parameter expansion slow for large data sets - bash

If I take the first 1,000 bytes from a file, Bash can replace some characters pretty quickly:
$ cut -b-1000 get_video_info
muted=0&status=ok&length_seconds=24&endscreen_module=http%3A%2F%2Fs.ytimg.com%2F
yts%2Fswfbin%2Fendscreen-vfl4_CAIR.swf&plid=AATWGZfL-Ysy64Mp&sendtmp=1&view_coun
t=3587&author=hye+jeong+Jeong&pltype=contentugc&threed_layout=1&storyboard_spec=
http%3A%2F%2Fi1.ytimg.com%2Fsb%2FLHelEIJVxiE%2Fstoryboard3_L%24L%2F%24N.jpg%7C48
%2327%23100%2310%2310%230%23default%23cTWfBXjxZMDvzL5cyCgHdDJ3s_A%7C80%2345%2324
%2310%2310%231000%23M%24M%23m1lhUvkKk6sTnuyKXnPBojTIqeM%7C160%2390%2324%235%235%
231000%23M%24M%23r-fWFZpjrP1oq2uq_Y_1im4iu2I%7C320%23180%2324%233%233%231000%23M
%24M%23uGg7bth0q6XSYb8odKLRqkNe7ao&approx_threed_layout=1&allow_embed=1&allow_ra
tings=1&url_encoded_fmt_stream_map=fallback_host%3Dtc.v11.cache2.c.youtube.com%2
6quality%3Dhd1080%26sig%3D610EACBDE06623717B1DC2265696B473C47BD28F.98097DEC78411
95A074D6D6EBFF8B277F9C071AE%26url%3Dhttp%253A%252F%252Fr9---sn-q4f7dney.c.youtub
e.com%252Fvideoplayback%253Fms%253Dau%2526ratebypass%253Dyes%2526ipbits%253D8%25
26key%253Dyt1%2526ip%253D99.109.97.214%2
$ read aa < <(cut -b-1000 get_video_info)
$ time set "${aa//%/\x}"
real 0m0.025s
user 0m0.031s
sys 0m0.000s
However, if I take 10,000 bytes, it slows down dramatically:
$ read aa < <(cut -b-10000 get_video_info)
$ time set "${aa//%/\x}"
real 0m8.125s
user 0m8.127s
sys 0m0.000s
I read Greg Wooledge’s post but it lacks an explanation as to why Bash parameter expansion is slow.

For the why, you can see the implementation of this code in pat_subst in subst.c in the bash source code.
For each match in the string, the length of the string is counted numerous times (in pat_subst, match_pattern and match_upattern), both as a C string and more expensively as a multibyte string. This makes the function both slower than necessary, and more importantly, quadratic in complexity.
This is why it's slow for larger input. (The original answer included a graph of substitution time against input size, illustrating the quadratic growth.)
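If you want to reproduce that scaling yourself, a rough timing loop along these lines (a sketch, reusing the get_video_info file from the question) shows the time growing much faster than the input size:
# Rough benchmark sketch: time the same substitution for growing prefixes of the file.
# The ":" builtin just forces the expansion and discards the result.
for n in 1000 2000 4000 8000 16000; do
    read -r aa < <(cut -b-"$n" get_video_info)
    printf '%6d bytes: ' "$n"
    { time : "${aa//%/\x}"; } 2>&1 | grep real
done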
As for workarounds, just use sed. It's more likely to be optimized for string replacement operations (though you should be aware that POSIX only guarantees 8192 bytes per line, even though GNU sed handles arbitrarily large ones).
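For example, something roughly equivalent to the "${aa//%/\x}" expansion above (the doubled backslash puts a literal backslash in sed's replacement text):
# Do the %-to-\x replacement in sed instead of a bash expansion;
# sed streams through the data in one pass rather than rescanning the string.
aa=$(cut -b-10000 get_video_info | sed 's/%/\\x/g')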

Originally, older shells and other utilities imposed LINE_MAX = 2048
on file input for this kind of reason. Bash has no problem parking huge
variables in memory, but substitution requires at least two concurrent
copies, and lots of thrashing: as groups of characters are removed, the
whole string gets rewritten. Over and over and over.
There are tools meant for this - sed is the premier choice and bash is a
distant second. sed works on streams, bash works on memory blocks.
Another choice:
bash is extensible - you can write custom C code to do things that bash
was not meant to do well.
CFA Johnson has good articles on how to do that:
Some ready to load builtins:
http://cfajohnson.com/shell/bash/loadables/
DIY builtins explained:
http://cfajohnson.com/shell/articles/dynamically-loadable/
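For example, once such a builtin has been compiled as a shared object (a purely hypothetical urldecode builtin here, just for illustration), it is loaded into the running shell with enable -f and then called like any other builtin:
# Hypothetical example: load a custom loadable builtin into the current shell.
enable -f ./urldecode.so urldecode
urldecode "$aa"    # runs in-process, no fork/exec per call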

Related

Does AWK Buffer Multiple Print Statements Before Writing Them to Disc?

I have an AWK script that writes tens of thousands of pretty long lines to a couple of files and nearly ten thousand lines to a few more files on a network drive (all needed for different purposes). I would like to make the file I/O as efficient as possible for a few reasons.
Does AWK immediately write to a file with every print(f) statement or does it buffer them? If so, how much buffering goes on?
I am considering writing everything to a buffer (e.g., rec1 "\n" rec2 "\n" rec3...) and then dumping it all with a single print command, but not if it won't have a net benefit.
I am curious, not just for this program, but also to sharpen my "best practices" skills. I program a lot in AWK, but haven't been able to find the answer to this, yet.
Thanks in advance...
Yes, as you can read in the GNU Awk manual (I/O functions). That is actually why fflush (accepted for inclusion in POSIX) exists: to flush the buffers. And here is some practical evidence.
As @Quasimodo points out, yes, awk buffers its output by default, and you can bypass that by inserting fflush() statements if you like.
For the other part of your question (I am considering writing everything to a buffer (e.g., rec1 "\n" rec2 "\n" rec3...) and then dumping it all with a single print command, but not if it won't have a net benefit): constantly appending to a variable using string concatenation in awk is roughly as slow as I/O, since awk has to continually find a new memory area big enough to hold the result of the concatenation, move the contents of the old location to the new one, append the new text, and then free up the previous area. So there would be no noticeable difference in execution speed between buffering and printing all at once versus just printing as you go.
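If you ever do need output to be visible immediately (say, another process is tailing the file), you can flush explicitly after each record; otherwise just let awk buffer. A minimal sketch (out.txt and input.txt are placeholder names):
# Default behaviour: awk buffers writes to out.txt and flushes when the
# buffer fills or the program exits.
awk '{ print > "out.txt" }' input.txt

# Force each record out immediately (slower, but readers see it right away).
awk '{ print > "out.txt"; fflush("out.txt") }' input.txt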

Understanding LC_ALL=C and its implications for standard English characters

Forgive me for the clumsy way I'm approaching this question, everything I've learnt so far on the topic of character encoding has been in the last few hours and I'm aware I'm out of my depth. This may be answered elsewhere on the site, such as in my linked questions, but if it has, those answers are too dense for me to understand exactly what's being concluded in them.
I often need to grep through folders of excessively large text files (totalling more than 100GB). I've read about how using LC_ALL=C can speed this up considerably, but I want to be sure that doing so won't compromise the accuracy of my searches.
The files are old and have passed through many different online sources, so are likely to contain a jumble of characters from many different encodings, including UTF-8. (As an aside, is it possible for a single file to contain characters from multiple encodings?)
The bulk of what concerns me is this: if I want to search for a given letter, say b, in my data, can I expect every letter b that's present in the data to be encoded as ASCII, or can the same letter also appear encoded as UTF-8?
Or to put it another way, are ASCII characters always and exclusively ASCII? If even standard English characters can be encoded as UTF-8, and LC_ALL=C grep would disregard all UTF-8 characters, then my searches would miss any occurrences of my search terms that are not in ASCII. That would obviously not be the behaviour I want, and would be a considerable obstacle to adopting LC_ALL=C for grep.
For understanding UTF-8 vs ASCII, the following are very good:
http://kunststube.net/encoding/
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
As for the difference in grep speed for UTF-8 files with a small number of non-ASCII characters: there is basically no difference between using LC_ALL=C or LANG=C and the standard LANG=en_US.UTF-8 or similar.
Test performed on 64-bit Cygwin, repeating the search 1000 times over 20 GB of text:
$ time for i in $(seq 1000) ; do grep -q LAPTOP-82F08ILC wia-*.log ; done
real 0m53.289s
user 0m7.813s
sys 0m31.635s
$ time for i in $(seq 1000) ; do LC_ALL=C grep -q LAPTOP-82F08ILC wia-*.log ; done
real 0m53.027s
user 0m7.497s
sys 0m31.010s
$ ls -sh wia-*
10G wia-1024.log 160M wia-16.log 2.5G wia-256.log 40M wia-4.log 639M wia-64.log
1.3G wia-128.log 20M wia-2.log 320M wia-32.log 5.0G wia-512.log 80M wia-8.log
The difference is within run-to-run variation, which was in the 53-55 second range for both cases.
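As for the "are ASCII characters always ASCII" part of the question: UTF-8 was designed so that the 128 ASCII characters are encoded as exactly the same single bytes, so an ASCII-only search string looks identical whether the surrounding file is ASCII or UTF-8, and an LC_ALL=C grep will still find it. A quick check in a UTF-8 terminal (assuming od is available):
$ printf 'b' | od -An -tx1
 62
$ printf 'résumé b' | od -An -tx1
 72 c3 a9 73 75 6d c3 a9 20 62
The non-ASCII é becomes the two bytes c3 a9, but the b is still the single byte 62 in both cases.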

Reading a file in Python: slurp or filter?

I want to compare the effect of processing a stream as a filter (that is, get a bit, process, rinse), against slurping (that is, get all information, then process).
However, when I run the two codes below, I get comparable results. I was expecting to obtain a much worse result in the slurp version.
Are the code snippets below doing anything different from what I described above? If they are equivalent, how could I adapt one of them to test the filter/slurp difference?
I was testing the scripts with:
jot 100000000 | time python3 dont_slurp.py > /dev/null
jot 100000000 | time python3 slurp.py > /dev/null
jot generates numbers from 1 to x. The code snippets just number the lines.
Filter:
import sys
lineno = 0
for line in sys.stdin:
    lineno += 1
    print("{:>6} {}".format(lineno, line[:-1]))
Slurp:
import sys
f = sys.stdin
lineno = 0
for line in f:
    lineno += 1
    print('{:>6} {}'.format(lineno, line[:-1]))
First of all, your code samples are not doing what you think. All f = sys.stdin does is set f to the same file handle. The lines for line in f: and for line in sys.stdin: are functionally identical.
What you want is this:
import sys
lineno = 0
for line in sys.stdin.readlines():
    lineno += 1
    print('{:>6} {}'.format(lineno, line[:-1]))
readlines() returns a list with one element per line in the file, so the whole input is read into memory up front. The file handle itself acts like a generator, giving you one line at a time.
You should see performance differences with readlines().
However, the answer to "which is better?" is "it depends". When you read line by line, you're making a system call, which in turn causes the OS to read file contents off of the disk in blocks. These blocks are likely larger than the size of the average line, and the block is likely cached. That means sometimes you hit the disk, taking lots of time, other times you hit the cache, taking little time.
When you read all at once, you load every byte from the file into memory at once. If you have enough free memory to hold all file contents, then this takes exactly the same amount of time as the line-by-line version. In both cases, it is basically just the time required to read the whole file sequentially with some little bit of overhead.
The difference is the case where you don't have enough free memory to hold the entire file. In that case, you read the whole file, but parts of it get swapped back out to disk by the virtual memory system. They then have to get pulled in again when you access that particular line.
Exactly how much time is lost depends on how much memory is in use, how much other activity is going on on your system, etc., so it can't be quantified in general.
This is a case where you honestly shouldn't worry about it until there is a problem. Do what is more natural in the code and only worry about performance if your program is too slow.
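If you do want to quantify it on a particular system, GNU time (the external /usr/bin/time, not the shell keyword) reports both wall-clock time and peak memory, which makes the cost of slurping visible when it exists. A sketch reusing the commands from the question:
# "Maximum resident set size" in the report shows the memory cost of each variant.
# (-v is GNU time's verbose flag; on BSD systems use /usr/bin/time -l instead.)
jot 100000000 | /usr/bin/time -v python3 dont_slurp.py > /dev/null
jot 100000000 | /usr/bin/time -v python3 slurp.py > /dev/null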

How long can I expect grep to take on a 10 TB file?

I have a 10 TB file with words from multiple books, and I'm trying to grep for some uncommon strings (no regex). For example:
grep "cappucino" filename
I'm trying to estimate how long this will take. I'm not really looking for whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep.
Please correct me if I'm wrong:
I use a mechanical hard drive with roughly 200 MB/s read speed, so it will take roughly 10 million MB / 200 MB/s = 50,000 seconds ≈ 14 hours to finish. Is this an accurate estimate?
The short answer is: no.
The longer answer is: it depends.
The even longer answer is: grep's performance depends on a lot of things:
whether you are running a fixed-string search (-F, fgrep) or not - grep uses the Boyer-Moore algorithm, which by itself can't match regular expressions, so what grep does (or at least used to do) is first extract a fixed string from your regexp, search for it with BM in the text, and only then do a regexp match (I'm not sure whether the current implementation uses an NFA or a DFA, probably a hybrid)
how long your pattern is - BM works faster with longer patterns
how many matches you will have - the fewer the matches, the faster it will be
what your CPU and memory are - the hard drive only helps during reading, not during computation
what other options you are using with grep
14 hours might not even be your lower bound, because Boyer-Moore is smart enough to compute an offset at which the next possible match might occur, so it doesn't need to read in the whole file. This depends on the implementation, though, and is just my speculation. After re-running the test below with a much longer pattern I was able to go down to 0.23 s, and I don't think my disk is that fast, so there may have been some caching involved.
For instance, I'm running on a 500 MB/s SSD (at least that's what the manufacturer says), and grepping a 200 MB file with a very short pattern (a few chars) gives me:
With 808,320 hits:
real 0m1.734s
user 0m1.334s
sys 0m0.120s
With 0 hits:
real 0m0.059s
user 0m0.046s
sys 0m0.016s
Edit: in short, read about Boyer-Moore :-)
Edit 2: to check how grep actually works, you should read the source code; I only described a very general workflow above.
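If you want to establish a bound for your own 10 TB case, comparing a fixed-string, C-locale search against the raw sequential read speed of the drive is a reasonable first experiment (a sketch using the filename from the question):
# Fixed-string search in the C locale: avoids regex compilation and multibyte handling.
time LC_ALL=C grep -F "cappucino" filename

# Rough lower bound: how long it takes just to read the file sequentially.
time cat filename > /dev/null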

Efficiently computing floating-point arithmetic hundreds of thousands of times in Bash

Background
I work for a research institute that studies storm surges computationally, and am attempting to automate some of the HPC commands using Bash. Currently, the process is we download the data from NOAA and create the command file manually, line-by-line, inputting the location of each file along with a time for the program to read the data from that file and a wind magnification factor. There are hundreds of these data files in each download NOAA produces, which come out every 6 hours or so when a storm is in progress. This means that much of our time during a storm is spent making these command files.
Problem
I am limited in the tools I can use to automate this process because I simply have a user account and a monthly allotment of time on the supercomputers; I do not have the privilege to install new software on them. Plus, some of them are Crays, some are IBMs, some are HPs, and so forth. There isn't a consistent operating system between them; the only similarity is they are all Unix-based. So I have at my disposal tools like Bash, Perl, awk, and Python, but not necessarily tools like csh, ksh, zsh, bc, et cetera:
$ bc
-bash: bc: command not found
Further, my lead scientist has requested that all of the code I write for him be in Bash because he understands it, with minimal calls to external programs for things Bash cannot do. For example, it cannot do floating point arithmetic, and I need to be able to add floats. I can call Perl from within Bash, but that's slow:
$ time perl -E 'printf("%.2f", 360.00 + 0.25)'
360.25
real 0m0.052s
user 0m0.015s
sys 0m0.015s
1/20th of a second doesn't seem like a long time, but when I have to make this call 100 times in a single file, that equates to about 5 seconds to process one file. That isn't so bad when we are only making one of these every 6 hours. However, if this work is abstracted to a larger assignment, one where we point 1,000 synthetic storms at the Atlantic basin at one time in order to study what could have happened had the storm been stronger or taken a different path, 5 seconds quickly grows to more than an hour just to process text files. When you are billed by the hour, this poses a problem.
Question
What is a good way to speed this up? I currently have this for loop in the script (the one that takes 5 seconds to run):
for FORECAST in $DIRNAME; do
    echo $HOURCOUNT" "$WINDMAG" "${FORECAST##*/} >> $FILENAME;
    HOURCOUNT=$(echo "$HOURCOUNT $INCREMENT" | awk '{printf "%.2f", $1 + $2}');
done
I know a single call to awk or Perl to loop through the data files would be a hundred times faster than calling either once for each file in the directory, and that these languages can easily open a file and write to it, but the problem I am having is getting data back and forth. I have found a lot of resources on these three languages alone (awk, Perl, Python), but haven't been able to find as much on embedding them in a Bash script. The closest I have been able to come is to make this shell of an awk command:
awk -v HOURCOUNT="$HOURCOUNT" -v INCREMENT="$INCREMENT" -v WINDMAG="$WINDMAG" -v DIRNAME="$DIRNAME" -v FILENAME="$FILENAME" 'BEGIN{ for (FORECAST in DIRNAME) do
...
}'
But I am not certain that this is correct syntax, and if it is, if it's the best way to go about this, or if it will even work at all. I have been hitting my head against the wall for a few days now and decided to ask the internet before I plug on.
Bash is very capable as long as you have the tools you need. For floating point, you basically have two options: bc (which, at least on the box you show, isn't installed [which is kind of hard to believe]) or calc (e.g. calc-2.12.4.13.tar.bz2).
Both are flexible and very capable floating-point programs that integrate well with bash. Since the powers that be have a preference for bash, I would investigate installing either bc or calc. (Job security is a good thing.)
If your superiors can be convinced to allow either perl or python, then either will do. If you have never programmed in either, both will have a learning curve, python slightly more so than perl. If your superiors can read bash, then perl would be much easier for them to digest than python.
This is a fair outline of the options you have given your situation as you've explained it. Regardless of your choice, the task for you should not be that daunting in any of the languages. Just drop a line back when you get stuck.
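For example, the addition from the question's perl one-liner looks like this with bc (a sketch; note it still costs one fork/exec per call, so the per-call overhead concern from the question remains):
# Same 360.00 + 0.25 addition using bc instead of perl.
printf '%.2f\n' "$(echo '360.00 + 0.25' | bc -l)"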
Starting awk or another command just to do a single addition is never going to be efficient. Bash can't handle floats, so you need to shift your perspective. You say you only need to add floats, and I gather these floats represent a duration in hours. So use seconds instead.
for FORECAST in $DIRNAME; do
printf "%d.%02d %s %s\n" >> $FILENAME \
$((SECONDCOUNT / 3600)) \
$(((SECONDCOUNT % 3600) * 100 / 3600)) \
$WINDMAG \
${FORECAST##*/}
SECONDCOUNT=$((SECONDCOUNT + $SECONDS_INCREMENT))
done
(printf is standard and much nicer than echo for formatted output)
EDIT: Abstracted as a function and with a bit of demonstration code:
function format_as_hours {
    local seconds=$1
    local hours=$((seconds / 3600))
    local fraction=$(((seconds % 3600) * 100 / 3600))
    printf '%d.%02d' $hours $fraction
}
# loop for 0 to 2 hours in 5 minute steps
for ((i = 0; i <= 7200; i += 300)); do
    format_as_hours $i
    printf "\n"
done
If all these computers are unices, and they are expected to perform floating-point computations, then each of them must have some fp-capable app available. So a compound command along the lines of
bc -l some-comp || dc some-comp || ... || perl some comp
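For completeness, here is a rough, untested sketch of the "single awk call" approach the question asks about, using the same HOURCOUNT, INCREMENT, WINDMAG, DIRNAME and FILENAME variables; one awk process writes every line of the command file and carries the running hour count itself:
# One awk invocation instead of one per file; $DIRNAME is expanded exactly as in
# the original for loop.
printf '%s\n' $DIRNAME |
awk -v hours="$HOURCOUNT" -v inc="$INCREMENT" -v windmag="$WINDMAG" '
    {
        sub(/.*\//, "")                               # same effect as ${FORECAST##*/}
        printf "%.2f %s %s\n", hours, windmag, $0
        hours += inc
    }
' >> "$FILENAME"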
