Performance Tuning an AWK? - bash

I've written a simple parser in BASH to take apart csv files and dump to a (temp) SQL-input file. The performance on this is pretty terrible; when running on a modern system I'm barely cracking 100 lines per second. I realize the ultimate answer is to rewrite this in a more performance oriented language, but as a learning opportunity, I'm curious where I can improve my BASH skills.
I suspect there are gains to be made by writing to RAM instead of to a file, then flushing all the text to the file at once, but I'm not clear on where/when BASH gets upset about memory usage (the largest files I've parsed have been under 500MB).
The following code-block seems to eat most of the cycles and, as I understand it, needs to be processed linearly because of the timestamp check (the data has a timestamp but no date stamp, so I was forced to ask the user for the start day and check whether the timestamp has cycled 24:00 -> 0:00), so parallel processing didn't seem like an option.
while read p; do
    linetime=`printf "${p}" | awk '{printf $1}'`
    if [[ "$lastline" > "$linetime" ]]
    then
        experimentdate=$(eval $datecmd)
    fi
    printf "$p" | awk -v varout="$projname" -v experiment_day="$experimentdate " -v singlequote="$cleanquote" '{printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values ("singlequote""varout""singlequote","singlequote""experiment_day $1""singlequote","singlequote""$1""singlequote","$2","$3");\n"}' >> $sql_input_file
done
Ignore the singlequote nonsense, I needed this to run on both OSX & 'nix, so I had to workaround some issues with OSX's awk and singlequotes.
Any suggestions for how I can improve performance?

You do not want to start awk for every line you process in a loop. Replace your loop with awk or replace awk with builtin commands.
Both awk calls are used only for printing. Replace them with additional parameters to the printf command.
I did not understand the codeblock for datecmd (not using $linetime but using the output variable experimentdate), but this one should be optimised: Can you use regular expressions or some other trick?
So you do not have to tune awk; instead, decide either to use awk for the whole job or to get it out of your while loop.
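As a rough sketch of the "builtin commands" route: read can split each line into fields itself and printf can format the statement, so no awk is started per line. The function name emit_inserts and its two arguments are made up for illustration (they stand in for the question's $projname and $experimentdate), the input is assumed to be whitespace-separated as in the question's awk calls, and the day-rollover check is left out to keep the sketch short.

```shell
# Hypothetical helper: reads "time seconds intensity" lines on stdin and
# prints one INSERT per line using only shell builtins.
# $1 = project name, $2 = experiment day (placeholders, not from the question).
emit_inserts() {
    proj=$1
    day=$2
    q="'"   # single quote kept in a variable, as in the question
    while read -r t seconds intensity rest; do
        printf 'insert into tool (project,project_datetime,reported_time,seconds,intensity) values (%s%s%s,%s%s %s%s,%s%s%s,%s,%s);\n' \
            "$q" "$proj" "$q" "$q" "$day" "$t" "$q" "$q" "$t" "$q" "$seconds" "$intensity"
    done
}

# Example: two sample lines fed on stdin
printf '23:59:58 1 10\n00:00:01 2 11\n' | emit_inserts demo 2020-01-01
```

Redirecting the whole loop once (emit_inserts ... > file) also avoids reopening the output file for every line, which the per-line >> in the question does.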

Your performance would improve if you did all the processing with awk. Awk can read your input file directly, express conditionals, and run external commands.
Awk is not the only option either. Perl and Python would also be well suited to this task.
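A minimal sketch of the all-awk route, under some loud assumptions: the contents of $datecmd are not shown in the question, so this version only counts midnight rollovers and emits that count in a made-up day_offset column instead of a real date. The function name csv_to_sql is likewise hypothetical. The timestamp comparison is a plain string comparison, like the [[ ... > ... ]] test in the question, so it assumes zero-padded times.

```shell
# Hypothetical helper: one awk process handles the whole input.
# -v q="'" passes the single quote in, mirroring the question's workaround.
# "day" counts rollovers (0 = start day); it stands in for $datecmd.
csv_to_sql() {
    awk -v proj="$1" -v q="'" '
        NR > 1 && $1 < last { day++ }   # timestamp went backwards: next day
        {
            last = $1
            printf "insert into tool (project,day_offset,reported_time,seconds,intensity) values (%s%s%s,%d,%s%s%s,%s,%s);\n", q, proj, q, day, q, $1, q, $2, $3
        }'
}

printf '23:59:58 1 10\n00:00:01 2 11\n' | csv_to_sql demo
```

Because the state (last timestamp, day counter) lives inside the single awk process, the linear-processing requirement is kept without starting anything per line.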


In the context of the bash shell and command output:
Is there a process/approach to help determine/measure the width of fields that appear to be fixed width?
(apart from the mark one human eyeball and counting on the screen method....)
If the output appears to be fixed width, is it possible/likely that it's actually delimited by some sort of non-printing character(s)?
If so, how would I go about hunting down said character?
I'm mostly after a way to do this in bash shell/script, but I'm not averse to a programming language approach.
Sample Worst Case Data:
Name value 1 empty_col simpleHeader complex multi-header
foo bar -someVal1 1someOtherVal
monty python circus -someVal2 2someOtherVal
exactly the field_widthNextVal -someVal3 3someOtherVal
My current approach:
The best I have come up with is redirecting the output to a file, then using a ruler/index type of feature in the editor to manually work out field widths. I'm hoping there is a smarter/faster way...
What I'm thinking:
With Headers:
Perhaps an approach that measures from the first character 'to the next character that is encountered, after having already encountered multiple spaces'?
Without Headers:
Drawing a bit of a blank on this one....?
This strikes me as the kind of problem that was cracked about 40 years ago though, so I'm guessing there are better solutions than mine to this stuff...
Some Helpful Information:
Column Widths
fieldwidths=$(head -n 1 file | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')
This is proving to be helpful for determining column widths. I don't fully understand how it works yet to provide a complete answer, but it might be helpful to a future someone else. Source:
File Examination
Redirect output to a file:
command >
Use hexdump or xxd against it to look at its raw information. See links for some basics on those tools:
hexdump output vs xxd output
# Determine Column Widths
# Source for this voodoo:
fieldwidths=$(echo "$(appropriate-command)" | head -n 1 | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}' | sed 's/^[ ]*//;s/[ ]*$//')
# Iterate
while IFS= read -r line
do
    # You can put the awk command on a separate line if that is clearer to you
    awkcmd="BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$1}"
    field1="$(echo "$line" | awk "$awkcmd" | sed 's/^[ ]*//;s/[ ]*$//')"
    # Or do it all in one line if you prefer:
    field2="$(echo "$line" | awk "BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$2}" | sed 's/^[ ]*//;s/[ ]*$//')"
    # *** Code Stuff Here ***
done <<< "$(appropriate-command)"
Some explanation of the above - for newbies (like me)
Okay, so I'm a complete newbie, but this is my answer, based on a grand total of about two days of clawing around in the dark. This answer is relevant to those who are also new and trying to process data in the bash shell and bash scripts.
Unlike the *nix wizards and warlocks that have presented many of the solutions you will find to specific problems (some impressively complex), this is just a simple outline to help people understand what it is that they probably don't know; that they don't know. You will have to go and look this stuff up separately, it's way too big to cover it all here.
I would strongly suggest just buying a book/video/course for shell scripting. You do learn a lot doing it the school of hard knocks way as I have for the last couple of days, but it's proving to be painfully slow. The devil is very much in the details with this stuff. A good structured course probably instils good habits from the get go too, rather than potentially developing your own habits/short hand 'that seems to work' but will likely and unwittingly, bite you later on.
Bash references:
Common Bash Mistakes, Traps and Pitfalls:
My take is that there is no 'one right way that works for everything' to achieve this particular task of processing fixed width command output. Notably, the fixed widths are dynamic and might change each time the command is run. It can be done somewhat haphazardly using standard bash tools (it depends on the types of values in each field, particularly if they contain whitespace or unusual/control characters). That said, expect any fringe cases to trip up the 'one bash pipeline to parse them all' approach, unless you have really looked at your data and it's quite well sanitised.
My uninformed, basic approach:
To get much out of all this:
Learn the basics of how IFS= read -r line (and its variants) works; it's one way of processing multiple lines of data, one line at a time. When doing this, you need to be aware of how things are expanded differently by the shell.
Grasp the basics of process substitution and command substitution, understand when data is being manipulated in a sub-shell, otherwise it disappears on you when you think you can recall it later.
It helps to grasp what Regular Expressions (regex) are. Half of the hieroglyphics that you encounter are probably regex in action.
Even further, it helps to understand when/what/why you need to 'escape' certain characters, at certain times, as this is why there is even more \ than you would expect amongst the hieroglyphics.
When doing redirection, be aware of the difference in > (overwrites without prompting) and >> (which appends to any existing data).
Understand differences in comparison operators and conditional tests (such as used with if statements and loop conditions).
if [ cond ] is not necessarily the same as if [[ cond ]]
Look into the basics of arrays, and how to load, iterate over and query their elements.
bash -x is useful for debugging. To debug only specific lines, wrap them in set -x and set +x within the script.
As for the fixed width data:
If it's delimited:
Use the delimiter. Many *nix tools split on whitespace by default (awk splits on runs of spaces and tabs, cut on a single tab), but you can typically also set a specific delimiter (google how to do it for the specific tool).
Optional Step:
If there is no obvious delimiter, you can check to see if there is some secret hidden delimiter to take advantage of. There probably isn't, but you can feel good about yourself for checking. This is done by looking at the hex data in the file. Redirect the output of a command to a file (if you don't have the data in a file already). Do it using command > and then explore using hexdump -Cv (another tool is xxd).
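A small sketch of that check, using od -c instead of hexdump because od is specified by POSIX and labels control characters by name. The sample string is invented for illustration; it hides a tab between the first two columns and uses plain spaces before the third.

```shell
# Made-up sample "output": a tab after foo, three spaces after bar
sample=$(printf 'foo\tbar   baz')

# od -c prints every byte; a tab shows up as \t, spaces as blank cells
printf '%s\n' "$sample" | od -c
```

If anything other than spaces and the final \n shows up between columns (for example \t), that byte is a candidate delimiter.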
If you're stuck with fixed width:
Basically to do something useful, you need to:
Read line by line (i.e. record by record).
Split the lines into their columns (i.e. field by field, this is the fixed-width aspect)
Check that you are really doing what you think you are doing; particularly if expanding or redirecting data. What you see on shell as command output, might not actually be exactly what you are presenting to your script/pipe (most commonly due to differences in how the shell expands args/variables, and tends to automatically manipulate whitespace without telling you...)
Once you know exactly what your processing pipe/script is seeing, you can then tidy up any unwanted whitespace and so forth.
Starting Guidelines:
Feed the pipe/script an entire line at a time, then chop up the fields (unless you really know what you are doing). Doing the field separation inside a loop such as while IFS= read -r line; do stuff; done is less error prone (in terms of the 'what is my pipe actually seeing' problem). When I was doing it outside, it tended to produce more scenarios where the data was being modified before it even reached the pipe/script, without me understanding that it was being altered (let alone why). This obviously meant I got extremely confused as to why a pipe that worked in one setting on the command line fell over when I 'fed the same data' in a script or by some other method (the pipe really wasn't getting the same data). This comes back to preserving whitespace with fixed-width data, particularly during expansion and redirection, process substitution and command substitution. Typically it amounts to liberal use of double quotes when calling a variable, i.e. not $someData but "$someData". Use braces to make clear which var you are talking about, i.e. ${var}bar. The same applies when capturing the entire output of a command.
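The quoting point can be seen in a few lines. This is only an illustration with an invented sample value; the variable names are not from any real script.

```shell
line='foo      bar'                   # six spaces of fixed-width padding

collapsed=$(echo $line)               # unquoted: the shell splits and rejoins the words
preserved=$(printf '%s\n' "$line")    # quoted: the padding survives

printf '[%s]\n[%s]\n' "$collapsed" "$preserved"
```

The first bracketed line has a single space between the words; the second keeps all six, which is what any width-based splitting further down the pipe depends on.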
If there is nothing to leverage as a delimiter, you have some choices. Hack away directly at the fixed width data using tools like:
cut -c n1-n2 this directly cuts things out, starting from character n1 through to n2.
awk '{print $1}' this splits on runs of whitespace by default and prints the first field.
Or, you can try to be a bit more scientific and 'measure twice, cut once'.
You can work out the field widths fairly easily if there are headers. This line is particularly helpful (sourced from an answer I link below):
fieldwidths=$(head -n 1 file | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')
echo $fieldwidths
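How that one-liner works: grep -o prints each match on its own line, the pattern \S+\s* matches one header word plus the spaces that follow it, and awk then prints the length of each match. Since -P is a GNU grep option, here is a sketch of the same heuristic in plain awk. The function name header_widths is made up, and it shares the original's limitation: a header that itself contains a single space (like "value 1") is counted as two columns.

```shell
# Hypothetical helper: print the width of each header column on line 1,
# where a column is a run of non-spaces plus the spaces that follow it.
header_widths() {
    awk 'NR == 1 {
        out = ""
        rest = $0
        while (match(rest, /[^ ]+ */)) {
            out = out (out == "" ? "" : " ") RLENGTH
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
        exit
    }'
}

printf 'Name   value  x\nfoo    bar    y\n' | header_widths
```

Only spaces are treated as padding here; leading spaces before the first header are skipped rather than counted.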
You can also look at all the data to see what length of data you are seeing in each field, and if you are actually getting the number of fields you expect (Thanks to David C. Rankin for this one!):
awk '{ for (i=1; i<=NF; i++) printf "%d\t",length($i) } {print ""}'
With that information, you can then set about chopping fields up with a bit more certainty that you are actually capturing the entire field (and only the entire field).
Tool options are many and varied, but I'm finding GNU awk (gawk) and perl's unpack to be the clearest. As part of a pipe/script consider this (sub in your relevant field widths and whichever field you want out in the {print $fieldnumber} obviously):
awk 'BEGIN {FIELDWIDTHS="10 20 30 10"}{print $1}'
For command output with dynamic field widths, if you feed it into a while IFS= read -r line; do; done loop, you will need to parse the output using the awk above, as each time the field widths might have changed. Since I originally couldn't get the expansion right, I built the awk command on a separate line and stored it in a variable, which I then called in the pipe. Once you have it figured out though, you can just shove it all back into one line if you want:
# Determine Column Widths:
# Source for this voodoo:
fieldwidths=$(echo "$(appropriate-command)" | head -n 1 | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}' | sed 's/^[ ]*//;s/[ ]*$//')
# Iterate
while IFS= read -r line
do
    # Separate the awk command if you want:
    # This uses GNU awk to split the column widths and pipes it to sed to remove leading and trailing spaces.
    awkcmd="BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$1}"
    field1="$(echo "$line" | awk "$awkcmd" | sed 's/^[ ]*//;s/[ ]*$//')"
    # Or do it all in one line, rather than two:
    field2="$(echo "$line" | awk "BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$2}" | sed 's/^[ ]*//;s/[ ]*$//')"
    if [ "${DELETIONS[0]}" == 'all' ] && [ "${#DELETIONS[@]}" -eq 1 ] && [ "$field1" != 'UUID' ]; then
        # *** Code Stuff ***
        :
    fi
    # *** More Code Stuff ***
done <<< "$(appropriate-command)"
Remove excess whitespace using various approaches:
tr -d '[:blank:]' and/or tr -d '[:space:]' (the latter also eliminates newlines and vertical whitespace, not just horizontal whitespace like [:blank:] does; both also remove internal whitespace).
sed 's/^[ ]*//;s/[ ]*$//' this cleans up only leading and trailing spaces.
Now you should basically have clean, separated fields to work with one at a time, having started from multi-field, multi-line command output.
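If starting sed once per field becomes a bottleneck, the same leading/trailing clean-up can be sketched with parameter expansion alone. The function name trim is made up for illustration; the two expansions are a common shell idiom rather than anything from the linked answers.

```shell
# Hypothetical helper: strip leading and trailing whitespace without
# starting an external process. Internal whitespace is left alone.
trim() {
    s=$1
    s=${s#"${s%%[![:space:]]*}"}   # remove the leading whitespace run
    s=${s%"${s##*[![:space:]]}"}   # remove the trailing whitespace run
    printf '%s\n' "$s"
}

field1=$(trim '   some value   ')
printf '[%s]\n' "$field1"
```

The inner expansion isolates the whitespace run at one end, and the outer one deletes exactly that run.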
Once you get what is going on fairly well with the above, you can start to look into other more elegant approaches as presented in these answers:
Finding Dynamic Field Widths:
Using perl's unpack:
Awk and other good answers:
Some stuff just can't be done in a single pass. Like the perl answer above, it basically breaks the problem down into two parts. The first is turning the fixed width data into delimited data (just choose a delimiter that doesn't occur within any of the values in your fields/records!). Once you have it as delimited data, it makes the processing substantially easier from there on out.
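The first of those two parts can be sketched in portable awk, using substr instead of gawk's FIELDWIDTHS so that it also runs on other awks. The function name split_fixed and the choice of | as the delimiter are assumptions for illustration; pick any delimiter that cannot occur in your data.

```shell
# Hypothetical helper: convert fixed-width lines on stdin to |-delimited
# lines. $1 is a space-separated list of column widths, e.g. "7 7 1".
split_fixed() {
    awk -v widths="$1" '
        BEGIN { n = split(widths, w, " ") }
        {
            pos = 1
            out = ""
            for (i = 1; i <= n; i++) {
                f = substr($0, pos, w[i])      # cut column i by position
                pos += w[i]
                gsub(/^ +| +$/, "", f)         # trim the padding
                out = out (i > 1 ? "|" : "") f
            }
            print out
        }'
}

printf 'foo    bar    x\nmonty  python y\n' | split_fixed "7 7 1"
```

This handles every line in one awk process, so the per-line awk and sed calls inside a while read loop are no longer needed for the splitting step.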

Why is it much slower to use cut than awk to intercept two strings from each line in a file?

I have a file of about 1.6 million lines, each of which is about
The file name is order_all.csv.
Now I have two scripts
shell one
while read line
do
    st="set "
    key="$(echo $line | cut -d',' -f1)"
    value="$(echo $line | cut -d',' -f2)"
    echo "$st$key $value" >> output
done < order_all.csv
shell two
cat order_all.csv | awk -F ',' '{print "set " $1,$2}' > output
But I found that the second script was much faster than the first one. What's the reason?
In addition, I also hope that the newline character of every line output by the script is \r\n. What can I do about it?
As @zerkms has called out, the performance difference here is determined much more by the efficiency of the algorithms than by the text-processing command in play.
To understand the differences between the two, you'll want to look at how shell works compared to most other languages. Since shell is basically individual unix programs executed one by one, the performance of each line (command really) is that of a whole program in another language, all else being the same.
What that equates to here, is that by constructing a loop around each line of data, then executing a command, 'cut', you take on the overhead of starting a new program for every line of data (and in this case 2, since you call cut 2 times around).
Behind the scenes of executing a single instance of any unix command are some very expensive operating system calls which take gobs of time, such as fork(), not to mention the process of loading the command into memory and all that's involved there.
In your second version you smartly avoid starting new commands for each line of text by use of a pipe, '|'. This pipe streams the data to 'awk'. Awk only starts up once in this design as it reads from STDIN a line at a time until the end of file from the stream is encountered. 'cut' can work this way (in a stream) too, but processing the text is more limited in 'cut'. So, here the text processing occurs in a single process with the awk program loading and fork overhead done only one time while the text processing happens 1.6 million times.
I hope that helps.
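The second part of the question (ending every output line with \r\n) is not covered above. One way to sketch it, assuming the awk version from the question is kept, is to set awk's output record separator. The function name to_set_commands is made up for illustration.

```shell
# Hypothetical helper: same output as the question's awk one-liner, but
# each record is terminated with CR LF instead of LF.
to_set_commands() {
    awk -F ',' -v ORS='\r\n' '{ print "set " $1, $2 }'
}

# od -c makes the line ending visible: the output ends in \r \n
printf 'k1,v1,extra\n' | to_set_commands | od -c
```

For the real file that would be to_set_commands < order_all.csv > output, which also removes the extra cat process.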

More efficient way to loop through lines in shell

I've come to learn that looping through lines in bash by
while read line; do stuff; done <file
Is not the most efficient way to do it.
What is a more time/resource efficient method?
Here's a timed example using Bash and awk. I have 1 million records in a file:
$ wc -l 1M
1000000 1M
Counting its records with bash, using while read:
$ time while read -r line ; do ((i++)) ; done < 1M ; echo $i
real 0m12.440s
user 0m11.548s
sys 0m0.884s
Using let "i++" took 15.627 secs (real) and NOPing with do : ; 10.466. Using awk:
$ time awk '{i++}END{print i}' 1M
real 0m0.128s
user 0m0.128s
sys 0m0.000s
As others have said, it depends on what you're doing.
The reason it's inefficient is that everything runs in its own process. Depending on what you are doing, that may or may not be a big deal.
If what you want to do in the loop is run another shell process, you won't get any gain from eliminating the loop. If you can do what you need without the need for a loop, you could get a gain.
awk? Perl? C(++)? Of course it depends on if you're interested in CPU time or programmer time, and the latter depends on what the programmer is used to using.
The top answer to the question you linked to pretty much explains that the biggest problem is spawning external processes for simple text processing tasks. E.g. running an instance of awk or a pipeline of sed and cut for each single line just to get a part of the string is silly.
If you want to stay in shell, use the string processing parameter expansions (${var#word}, ${var:n:m}, ${var/search/replace} etc.) and other shell features as much as you can. If you see yourself running a set of commands for each input line, it's time to rethink the structure of the script. Most of the text processing commands can process a whole file with one execution, so use that.
A trivial/silly example:
while read -r line; do
x=$(echo "$line" | awk '{print $2}')
somecmd "$x"
done < file
would be better as
awk < file '{print $2}' | while read -r x ; do somecmd "$x" ; done
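For the parameter-expansion route mentioned above, here is a small sketch that pulls the first two comma-separated fields out of a line without starting any process. The sample line is invented, and only the POSIX forms (${var#word}, ${var%%word}) are used so that it does not depend on bash-specific expansions.

```shell
line='1001,pending,2020-01-01'   # made-up sample record

key=${line%%,*}      # drop the longest suffix starting at a comma -> first field
rest=${line#*,}      # drop the shortest prefix up to the first comma
value=${rest%%,*}    # first field of what is left -> second field

printf 'set %s %s\n' "$key" "$value"
```

Inside a while read loop this replaces two echo | cut pipelines per line with three assignments.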
Choose between awk and perl; both are efficient.

What can I do to speed up this bash script?

The code I have goes through a file and multiplies all the numbers in the first column by a number. The code works, but I think it's somewhat slow. It takes 26.676s (walltime) to go through a file with 2302 lines in it. I'm using a 2.7 GHz Intel Core i5 processor. Here is the code.
sed -n 1p data.txt > data_diff.txt #outputs the header (x y)
i=2
while [ $i -lt 2303 ]; do
    NUM=`sed -n "$i"p data.txt | awk '{print $1}'`
    SEC=`sed -n "$i"p data.txt | awk '{print $2}'`
    NNUM=$(bc <<< "$NUM*0.000123981")
    echo $NNUM $SEC >> data_diff.txt
    let i=$i+1
done
Honestly, the biggest speedup you can get will come from using a single language that can do the whole task itself. This is mostly because your script invokes 5 extra processes for each line, and invoking extra processes is slow, but also text processing in bash is really not that well optimized.
I'd recommend awk, given that you have it available:
awk '{ print $1*0.000123981, $2 }'
I'm sure you can improve this to skip the header line and print it without modification.
You can also do this sort of thing with Perl, Python, C, Fortran, and many other languages, though it's unlikely to make much difference for such a simple calculation.
Your script runs 4603 separate sed processes, 4602 separate awk processes, and 2301 separate bc processes. If echo were not a built-in then it would also run 2301 echo processes. Starting a process has relatively large overhead. Not so large that you would ordinarily notice it, but you are running over 11000 short processes. The wall time consumption doesn't seem unreasonable for that.
MOREOVER, each sed that you run processes the whole input file anew, selecting from it just one line. This is horribly inefficient.
The solution is to reduce the number of processes you are running, and especially to perform only a single run through the whole input file. A fairly easy way to do that would be to convert to an awk script, possibly with a bash wrapper. That might look something like this:
awk '
NR==1 { print; next }
NR>=2303 { exit }
{ print $1 * 0.000123981, $2 }
' data.txt > data_diff.txt
Note that the line beginning with NR>=2303 artificially stops processing the input file when it reaches the 2303rd line, as your original script does; you could omit that line of the script altogether to let it simply process all the lines, however many there are.
Note, too, that that uses awk's built-in FP arithmetic instead of running bc. If you actually need the arbitrary-precision arithmetic of bc then I'm sure you can figure out how to modify the script to get that.
As an example of how to speed up the bash script (without implying that this is the right solution)
{ IFS= read -r header
  echo "$header"
  # You can drop the third name "rest" if your input file
  # only has two columns.
  while read -r num sec rest; do
    nnum=$( bc <<< "$num * 0.000123981" )
    echo "$nnum $sec"
  done
} < data.txt > data_diff.txt
Now you only have one extra call to bc per data line, necessitated by the fact that bash doesn't do floating-point arithmetic. The right answer is to use a single call to a program that can do floating-point arithmetic, as pointed out by David Z.

Performance issue with parsing large log files (~5gb) using awk, grep, sed

I am currently dealing with log files with sizes approx. 5gb. I'm quite new to parsing log files and using UNIX bash, so I'll try to be as precise as possible. While searching through log files, I do the following: provide the request number to look for, then optionally to provide the action as a secondary filter. A typical command looks like this:
fgrep '2064351200' example.log | fgrep 'action: example'
This is fine dealing with smaller files, but with a log file that is 5gb, it's unbearably slow. I've read online it's great to use sed or awk to improve performance (or possibly even combination of both), but I'm not sure how this is accomplished. For example, using awk, I have a typical command:
awk '/2064351200/ {print}' example.log
Basically my ultimate goal is to be able to print/return the records (or line numbers) that contain the strings to match (could be up to 4-5, and I've read piping is bad) in a log file efficiently.
On a side note, in bash shell, if I want to use awk and do some processing, how is that achieved? For example:
BEGIN { print "File\tOwner" }
{ print $8, "\t", \
$3}
END { print " - DONE -" }
That is a pretty simple awk script, and I would assume there's a way to put this into a one liner bash command? But I'm not sure how the structure is.
Thanks in advance for the help. Cheers.
You need to perform some tests to find out where your bottlenecks are, and how fast your various tools perform. Try some tests like this:
time fgrep '2064351200' example.log >/dev/null
time egrep '2064351200' example.log >/dev/null
time sed -e '/2064351200/!d' example.log >/dev/null
time awk '/2064351200/ {print}' example.log >/dev/null
Traditionally, egrep should be the fastest of the bunch (yes, faster than fgrep), but some modern implementations are adaptive and automatically switch to the most appropriate searching algorithm. If you have bmgrep (which uses the Boyer-Moore search algorithm), try that. Generally, sed and awk will be slower because they're designed as more general-purpose text manipulation tools rather than being tuned for the specific job of searching. But it really depends on the implementation, and the correct way to find out is to run tests. Run them each several times so you don't get messed up by things like caching and competing processes.
As @Ron pointed out, your search process may be disk I/O bound. If you will be searching the same log file a number of times, it may be faster to compress the log file first; this makes it faster to read off disk, but then requires more CPU time to process because it has to be decompressed first. Try something like this:
compress -c example2.log >example2.log.Z
time zgrep '2064351200' example2.log.Z >/dev/null
gzip -c example2.log >example2.log.gz
time zgrep '2064351200' example2.log.gz >/dev/null
bzip2 -k example.log
time bzgrep '2064351200' example.log.bz2 >/dev/null
I just ran a quick test with a fairly compressible text file, and found that bzip2 compressed best, but then took far more CPU time to decompress, so the gzip option wound up being fastest overall. Your computer will have different disk and CPU performance than mine, so your results may be different. If you have any other compressors lying around, try them as well, and/or try different levels of gzip compression, etc.
Speaking of preprocessing: if you're searching the same log over and over, is there a way to preselect out just the log lines that you might be interested in? If so, grep them out into a smaller (maybe compressed) file, then search that instead of the whole thing. As with compression, you spend some extra time up front, but then each individual search runs faster.
A note about piping: other things being equal, piping a huge file through multiple commands will be slower than having a single command do all the work. But all things are not equal here, and if using multiple commands in a pipe (which is what zgrep and bzgrep do) buys you better overall performance, go for it. Also, consider whether you're actually passing all of the data through the entire pipe. In the example you gave, fgrep '2064351200' example.log | fgrep 'action: example', the first fgrep will discard most of the file; the pipe and second command only have to process the small fraction of the log that contains '2064351200', so the slowdown will likely be negligible.
EDIT: if the log file is "live" (i.e. new entries are being added), but the bulk of it is static, you may be able to use a partial preprocess approach: compress (& maybe prescan) the log, then when scanning use the compressed (&/or prescanned) version plus a tail of the part of the log added since you did the prescan. Something like this:
# Precompress:
gzip -v -c example.log >example.log.gz
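# Column 2 of gzip -l is the uncompressed size, i.e. how many bytes of the log were captured:
compressedsize=$(gzip -l example.log.gz | awk '{if(NR==2) print $2}')
# Search the compressed file + recent additions (tail -c +N starts AT byte N, hence the +1):
{ gzip -cdfq example.log.gz; tail -c +$((compressedsize + 1)) example.log; } | egrep '2064351200'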
If you're going to be doing several related searches (e.g. a particular request, then specific actions with that request), you can save prescanned versions:
# Prescan for a particular request (repeat for each request you'll be working with):
gzip -cdfq example.log.gz | egrep '2064351200' > prescan-2064351200.log
# Search the prescanned file + recent additions:
{ cat prescan-2064351200.log; tail -c +$((compressedsize + 1)) example.log | egrep '2064351200'; } | egrep 'action: example'
If you don't know the sequence of your strings, then:
awk '/str1/ && /str2/ && /str3/ && /str4/' filename
If you know that they will appear one following another in the line:
grep 'str1.*str2.*str3.*str4' filename
(note for awk, {print} is the default action block, so it can be omitted if the condition is given)
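Since the question uses fgrep (fixed strings, not regular expressions) and also asks for line numbers, the same single-pass idea can be sketched with awk's index() function, which does a plain substring search. The function name find_both is made up for illustration, and the usage line simply reuses the file name and strings from the question.

```shell
# Hypothetical helper: print "lineno: line" for every line of file $1
# that contains both fixed strings $2 and $3, in a single pass.
# Note: awk -v interprets backslash escapes, so search strings that
# contain backslashes would need a different way of passing them in.
find_both() {
    awk -v a="$2" -v b="$3" 'index($0, a) && index($0, b) { print NR ": " $0 }' "$1"
}

# Usage (file name and strings taken from the question):
# find_both example.log '2064351200' 'action: example'
```

Whether this beats two chained fgrep processes on a 5gb file is something to measure rather than assume, for the reasons given in the timing advice above.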
Dealing with files that large is going to be slow no matter how you slice it.
As to multi-line programs on the command line,
$ awk 'BEGIN { print "File\tOwner" }
> { print $8, "\t", \
> $3}
> END { print " - DONE -" }' infile > outfile
Note the single quotes.
If you process the same file multiple times, it might be faster to read it into a database, and perhaps even create an index.
