Bash pattern matching loop super slow [duplicate] - bash

This question already has answers here:
Bash while read loop extremely slow compared to cat, why?
(4 answers)
Closed 4 years ago.
When I do this with awk it's relatively fast, even though it's Row By Agonizing Row (RBAR). I tried to make a quicker, more elegant, bug-resistant solution in Bash that would have to make far fewer passes through the file. With the code below, it takes bash about 10 seconds to get through the first 1,000 lines. In the same amount of time, awk can make 25 passes through all million lines of the file! How come bash is several orders of magnitude slower?
while read line
do
    FIELD_1=`echo "$line" | cut -f1`
    FIELD_2=`echo "$line" | cut -f2`
    if [ "$MAIN_REF" == "$FIELD_1" ]; then
        #echo "$line"
        if [ "$FIELD_2" == "$REF_1" ]; then
            ((REF_1_COUNT++))
        fi
        ((LINE_COUNT++))
        if [ "$LINE_COUNT" == "1000" ]; then
            echo $LINE_COUNT;
        fi
    fi
done < temp/refmatch

Bash is slow. That's just the way it is; it's designed to oversee the execution of specific tools, and it was never optimized for performance.
All the same, you can make it less slow by avoiding obvious inefficiencies. For example, read will split its input into separate words, so it would be both faster and clearer to write:
while read -r field1 field2 rest; do
    # Do something with field1 and field2
instead of
while read line
do
    FIELD_1=`echo "$line" | cut -f1`
    FIELD_2=`echo "$line" | cut -f2`
Your version sets up two pipelines and creates four children (at least) for every line of input, whereas using read the way it was designed requires no external processes whatsoever.
If you are using cut because your lines are tab-separated and not just whitespace-separated, you can achieve the same effect with read by setting IFS locally:
while IFS=$'\t' read -r field1 field2 rest; do
    # Do something with field1 and field2
Even so, don't expect it to be fast. It will just be less agonizingly slow. You would be better off fixing your awk script so that it doesn't require multiple passes. (If you can do that with bash, it can be done with awk and probably with less code.)
Note: I set three variables rather than two, because read puts the rest of the line into the last variable. If there are only two fields, no harm is done; setting a variable to an empty string is something bash can do reasonably rapidly.
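Putting that advice together, a minimal sketch of the whole loop (assuming MAIN_REF and REF_1 are set earlier in the script, as in the original) might look like:

REF_1_COUNT=0
LINE_COUNT=0
while IFS=$'\t' read -r field1 field2 rest; do
    if [ "$MAIN_REF" == "$field1" ]; then
        if [ "$field2" == "$REF_1" ]; then
            ((REF_1_COUNT++))
        fi
        ((LINE_COUNT++))
    fi
done < temp/refmatch
echo "$LINE_COUNT lines matched, $REF_1_COUNT of them with the reference"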

As @codeforester points out, the original bash script spawns a huge number of subprocesses.
Here's the modified version to minimize the overheads:
#!/bin/bash
while IFS=$'\t' read -r FIELD_1 FIELD_2 others; do
    if [[ "$MAIN_REF" == "$FIELD_1" ]]; then
        if [[ "$FIELD_2" == "$REF_1" ]]; then
            ((REF_1_COUNT++))
        fi
        ((LINE_COUNT++))
        if [[ "$LINE_COUNT" == "1000" ]]; then
            echo "$LINE_COUNT"
        fi
    fi
done < temp/refmatch
It runs more than 20 times faster than the original, but I'm afraid that may be close to the limit of what a bash script can do.
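For comparison, the same single pass in awk (a sketch, with the shell variables passed in via -v) stays entirely within one process:

awk -F'\t' -v main_ref="$MAIN_REF" -v ref_1="$REF_1" '
    $1 == main_ref { line_count++; if ($2 == ref_1) ref_1_count++ }
    END { print line_count, ref_1_count }
' temp/refmatch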

Related

Bash looping through file ends prematurely

I am having trouble with a Bash loop over a text file of ~20k lines.
Here is my (minimised) code:
LINE_NB=0
while IFS= read -r LINE; do
    LINE_NB=$((LINE_NB+1))
    CMD=$(sed "s/\([^ ]*\) .*/\1/" <<< "${LINE}")
    echo "[${LINE_NB}] ${LINE}: CMD='${CMD}'"
done <"${FILE}"
The while loop ends prematurely after a few hundred iterations. However, the loop works correctly if I remove the CMD=$(sed...) part. So, evidently, there is some interference I cannot spot.
As I read here, I also tried:
LINE_NB=0
while IFS= read -r -u4 LINE; do
    LINE_NB=$((LINE_NB+1))
    CMD=$(sed "s/\([^ ]*\) .*/\1/" <<< "${LINE}")
    echo "[${LINE_NB}] ${LINE}: CMD='${CMD}'"
done 4<"${FILE}"
but nothing changes. Any explanation for this behaviour and help on how can I solve it?
Thanks!
To clarify the situation for user1934428 (thanks for your interest!), I now have created a minimal script and added "set -x". The full script is as follows:
#!/usr/bin/env bash
set -x
FILE="$1"
LINE_NB=0
while IFS= read -u "$file_fd" -r LINE; do
    LINE_NB=$((LINE_NB+1))
    CMD=$(sed "s/\([^ ]*\) .*/\1/" <<< "${LINE}")
    echo "[${LINE_NB}] ${LINE}: CMD='${CMD}'" #, TIME='${TIME}' "
done {file_fd}<"${FILE}"
echo "Done."
The input file is a list of ~20k lines of the form:
S1 0.018206
L1 0.018966
F1 0.006833
S2 0.004212
L2 0.008005
I8R190 18.3791
I4R349 18.5935
...
The while loop ends prematurely at (seemingly) random points. One possible output is:
+ FILE=20k/ir-collapsed.txt
+ LINE_NB=0
+ IFS=
+ read -u 10 -r LINE
+ LINE_NB=1
++ sed 's/\([^ ]*\) .*/\1/'
+ CMD=S1
+ echo '[1] S1 0.018206: CMD='\''S1'\'''
[1] S1 0.018206: CMD='S1'
+ echo '[6510] S1514 0.185504: CMD='\''S1514'\'''
...[snip]...
[6510] S1514 0.185504: CMD='S1514'
+ IFS=
+ read -u 10 -r LINE
+ echo Done.
Done.
As you can see, the loop ends prematurely after line 6510, while the input file is ~20k lines long.
Yes, making a stable copy of the file to read from is the best start.
Learning awk and/or perl is still well worth your time. It's not as hard as it looks. :)
Aside from that, a couple of optimizations: try never to run any program inside a loop when you can avoid it. For a 20k-line file, that's 20k seds, which adds up unnecessarily. Instead, you could just use parameter expansion for this one.
# don't use all caps.
# cmd=$(sed "s/\([^ ]*\) .*/\1/" <<< "${line}") becomes:
cmd="${line%% *}" # strip everything from the first space on
Using read to handle the splitting is even better, since you were already using it anyway; just don't spawn another process if you can avoid it. As much as I love it, read is pretty inefficient; it has to do a lot of fiddling to handle all its options.
while read -u "$file_fd" -r cmd timeval; do
    echo "[$((++line_nb))] CMD='${cmd}' TIME='${timeval}'"
done {file_fd}<"${file}"
or
while read -u "$file_fd" -r -a tok; do
    echo "[$((++line_nb))] LINE='${tok[*]}' CMD='${tok[0]}' TIME='${tok[1]}'"
done {file_fd}<"${file}"
(This will sort of rebuild the line, but if there were tabs or extra spaces, etc, it will only pad with the 1st char of $IFS, which is a space by default. Shouldn't matter here.)
awk would have made short work of this, though, and been a lot faster, with better tools already built in.
awk '{printf "NR=[%d] LINE=[%s] CMD=[%s] TIME=[%s]\n",NR,$0,$1,$2 }' 20k/ir-collapsed.txt
Run some time comparisons - with and without the sed, with one read vs two, and then compare each against the awk. :)
The more things you have to do with each line, and the more lines there are in the file, the more it will matter. Make it a habit to do even small things as neatly as you can - it will pay off well in the long run.
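A rough template for those comparisons (a sketch; the two script names are hypothetical stand-ins for the sed-based and read-only variants):

# discard output so terminal drawing doesn't skew the numbers
time bash with-sed.sh 20k/ir-collapsed.txt > /dev/null
time bash read-only.sh 20k/ir-collapsed.txt > /dev/null
time awk '{printf "NR=[%d] LINE=[%s] CMD=[%s] TIME=[%s]\n",NR,$0,$1,$2 }' 20k/ir-collapsed.txt > /dev/null

The time keyword is a bash built-in, so it adds no overhead of its own to the loop being measured.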

Bash: base64 decode a file x times

During a scripting challenge, I was asked to decode a base64 file (base64.txt) X times (say, 100).
So I wrote this small bash script to do so.
for item in `cat base64.txt`;do
    for count in {1..100};do
        if [ $count -eq 1 ]; then
            current=$(echo "$item" |base64 --decode)
        else
            current=$(echo "$current" |base64 --decode)
        fi
        if [ $count -eq 100 ]; then
            echo $current
        fi
    done
done
It works as expected, and I got the intended result.
What I am looking for now is a way to improve this script, because I am far from being a specialist, and I want to see what could improve the way I approach this kind of challenge.
Could some of you please give me some advice?
decode X times (say, 100) a base64 file (base64.txt)
There is only one file, and it contains a single line.
Just read the content of the file, decode it 100 times, and output the result.
state=$(<base64.txt)
for i in {1..100}; do
    state=$(<<<"$state" base64 --decode)
done
echo "$state"
Notes:
Backticks ` are discouraged; use $(...) instead. See: bash deprecated and obsolete syntax.
for i in $(cat file) is a common antipattern in bash. See: How to read a file line by line in bash.
If the file contains only one line, there is no need to iterate over the words in the file.
In bash, echo "$item" | is a useless use of echo (and also carries a small risk of not working, e.g. when item=-e). You can use a here string instead when in bash.
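If you want to generate a matching test input yourself, a quick sketch (the plaintext hello is just a placeholder) is to apply the encoding 100 times:

state="hello"
for i in {1..100}; do
    state=$(base64 <<<"$state")    # encode; $(...) strips the trailing newline
done
printf '%s\n' "$state" > base64.txt

Running the decoding loop above against this file should print hello again.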

bash loop taking extremely long time

I have a list of times in the format HH:MM:SS that I am looping through to find the nearest time that is not yet past. The code that I have is:
for i in ${times[@]}; do
    hours=$(echo $i | sed 's/\([0-9]*\):.*/\1/g')
    minutes=$(echo $i | sed 's/.*:\([0-9]*\):.*/\1/g')
    currentHours=$(date +"%H")
    currentMinutes=$(date +"%M")
    if [[ hours -ge currentHours ]]; then
        if [[ minutes -ge currentMinutes ]]; then
            break
        fi
    fi
done
The variable times is an array of all the times that I am sorting through (it's about 20-40 entries). I'd expect this to take less than 1 second; however, it is taking upwards of 5 seconds. Any suggestions for decreasing the time spent in the regular expressions would be appreciated.
times=($(cat file.txt))
Here is the list of times stored in the text file, which is imported into the times variable using the line of code above.
6:05:00
6:35:00
7:05:00
7:36:00
8:08:00
8:40:00
9:10:00
9:40:00
10:11:00
10:41:00
11:11:00
11:41:00
12:11:00
12:41:00
13:11:00
13:41:00
14:11:00
14:41:00
15:11:00
15:41:00
15:56:00
16:11:00
16:26:00
16:41:00
16:58:00
17:11:00
17:26:00
17:41:00
18:11:00
18:41:00
19:10:00
19:40:00
20:10:00
20:40:00
21:15:00
21:45:00
One of the key things to understand in looking at bash scripts from a performance perspective is that while the bash interpreter is somewhat slow, the act of spawning an external process is extremely slow. Thus, while it can often speed up your scripts to use a single invocation of awk or sed to process a large stream of input, starting those invocations inside a tight loop will greatly outweigh the performance of those tools once they're running.
Any command substitution -- $() -- causes a second copy of the interpreter to be fork()ed off as a subshell. Invoking any command not built into bash -- date, sed, etc. -- then causes a subprocess to be fork()ed off for that process, and the executable associated with it to be exec()'d -- something that involves a great deal of OS-level overhead (the binary needs to be linked, loaded, etc.).
This loop would be better written as:
IFS=: read -r currentHours currentMinutes < <(date +"%H:%M")
while IFS=: read -r hours minutes _; do
    # 10# forces base-10 so leading zeros (08, 09) aren't treated as octal
    if (( 10#$hours >= 10#$currentHours )) && (( 10#$minutes >= 10#$currentMinutes )); then
        break
    fi
done <file.txt
In this form only one external command is run, date +"%H:%M", outside the loop. If you were only targeting bash 4.2 and newer (with built-in time formatting support), even this would be unnecessary:
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
...will directly place the current hour and minute into the variables currentHours and currentMinutes using only functionality built into modern bash releases.
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
BashFAQ #100 - How can I do native string manipulations in bash? (Subsection: "Splitting a string into fields")
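Combining both ideas, a sketch of the whole lookup with no external commands at all (bash 4.2+; as above, 10# forces decimal so values like 08 aren't parsed as octal):

printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
while IFS=: read -r hours minutes _; do
    if (( 10#$hours >= 10#$currentHours )) && (( 10#$minutes >= 10#$currentMinutes )); then
        break
    fi
done <file.txt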
To be honest, I'm not sure why it's taking such an extremely long time, but there are certainly some things that could be made more efficient.
currentHours=$(date +"%H")
currentMinutes=$(date +"%M")
for time in "${times[@]}"; do
    IFS=: read -r hours minutes seconds <<<"$time"
    # 10# forces base-10 so leading zeros aren't treated as octal
    if [[ 10#$hours -ge 10#$currentHours && 10#$minutes -ge 10#$currentMinutes ]]; then
        break
    fi
done
This uses read, a built-in command, to split the text into variables, rather than calling external commands and creating subshells.
I assume the loop runs quickly enough that it is safe to compute currentHours and currentMinutes once, before the loop, and reuse them within it.
Note that you can also just use awk to do the whole thing:
awk -F: -v currentHours="$(date +"%H")" -v currentMinutes="$(date +"%M")" '
    $1 >= currentHours && $2 >= currentMinutes { print; exit }' file.txt
Just to make the program produce some output, I added a print, so that the matching line is printed.
awk to the rescue!
awk -v time="12:12:00" '
    function pad(x) {split(x,ax,":"); return (ax[1]<10)?"0"x:x}
    BEGIN {time=pad(time)}
    time>pad($0) {next}
    {print; exit}' times
12:41:00
With the hour zero-padded, you can use a string-only comparison.
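One caveat with the earlier field-by-field comparisons: a time like 7:05 is after 6:50, yet minute 05 fails the >= 50 test. Converting each timestamp to seconds makes the comparison exact in bash as well; a sketch:

IFS=: read -r h m s < <(date +"%H:%M:%S")
now=$(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))   # 10# avoids octal parsing of 08/09
while IFS=: read -r h m s; do
    if (( 10#$h * 3600 + 10#$m * 60 + 10#$s >= now )); then
        echo "$h:$m:$s"
        break
    fi
done < file.txt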

How can I monitor the average number of lines added to a file per second in a bash shell? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 6 years ago.
I'd like to monitor the average rate at which lines are being added to a log file in a bash shell.
I can currently monitor how many lines are in the file each second via the command
watch -n 1 'wc -l log.txt'
However, this gives me the total count of lines, whereas I would prefer a rate. In other words, I would like a command that periodically outputs the number of lines added to the file since the command was started, divided by the number of seconds the command has been running.
For a rough count of lines per second, try:
tail -f log.txt | { count=0; old=$(date +%s); while read line; do ((count++)); s=$(date +%s); if [ "$s" -ne "$old" ]; then echo "$count lines per second"; count=0; old=$s; fi; done; }
(Bash required.)
Or, as spread out over multiple lines:
tail -f log.txt | {
    count=0
    old=$(date +%s)
    while read line
    do
        ((count++))
        s=$(date +%s)
        if [ "$s" -ne "$old" ]
        then
            echo "$count lines per second"
            count=0
            old=$s
        fi
    done
}
This uses date to record the time in seconds. Meanwhile, it counts the number of lines produced by tail -f log.txt. Every time another second passes, the count of lines seen during that second is printed.
Demonstration
In one terminal, run the command:
while sleep 0.1; do echo $((count++)); done >>log.txt
This command writes one line to the file log.txt roughly every tenth of a second.
In another terminal, run:
$ tail -f log.txt | { count=0; old=$(date +%s); while read line; do ((count++)); s=$(date +%s); if [ "$s" -ne "$old" ]; then echo "$count lines per second"; count=0; old=$s; fi; done; }
15 lines per second
10 lines per second
10 lines per second
10 lines per second
9 lines per second
10 lines per second
Due to buffering, the first count is off. Subsequent counts are fairly accurate.
A simple script you can deploy:
Filename="log.txt"
ln_1=$(wc -l "$Filename" | awk '{print $1}')
while true
do
    ln_2=${ln_1}
    sleep 1
    ln_1=$(wc -l "$Filename" | awk '{print $1}')
    echo "$(( ln_1 - ln_2 )) lines increased"
done
The tail command supports watching for appended lines via the --follow option, which accepts a file descriptor or file name. With this option, tail periodically checks the file for changes. The interval of the checks depends on whether the kernel supports inotify. Inotify-based implementations detect changes promptly (I would say, almost instantly). If the kernel doesn't support inotify, tail resorts to periodic polling; in that case, tail sleeps for one second by default. The sleep interval can be changed with the --sleep-interval option.
I wouldn't rely on the sleep interval in calculations, however:
When ‘tail’ uses inotify, this polling-related option is usually ignored.
Especially because Bash has a built-in seconds counter, the SECONDS variable (see info bash SECONDS):
This variable expands to the number of seconds since the shell was started. Assignment to this variable resets the count to the value assigned, and the expanded value becomes the value assigned plus the number of seconds since the assignment.
Thus, you can initialize SECONDS to 1, run a loop reading the output of tail, and calculate the speed as number_of_lines / $SECONDS. But this will produce an average over the entire execution time. An average over the last N seconds is much more practical. It is also easy to implement, as Bash allows you to reset the seconds counter.
Example
The following example implements the idea. It also features watch-like output in interactive mode.
# The number of seconds for which we calculate the average speed
timespan=8
# The number of lines
lines=0

# We'll assume that the shell is running in interactive mode,
# if the standard output descriptor (1) is attached to the terminal.
# See http://www.tldp.org/LDP/abs/html/intandnonint.html
if [ -t 1 ]; then
    is_interactive=1
    format='%d lines/sec'
else
    is_interactive=
    format='%d lines/sec\n'
fi

# Reset the built-in seconds counter.
# Also, prevent division by zero in the calculation below.
SECONDS=1

# Save cursor position, if interactive
test -n "$is_interactive" && tput sc

while read line; do
    if [[ $(( $SECONDS % $timespan )) -eq 0 ]]; then
        SECONDS=1
        lines=0
    fi
    if test -n "$is_interactive"; then
        # Restore cursor position, then delete line
        tput rc; tput el1
    fi
    printf "$format" $(( ++lines / SECONDS ))
done < <(tail -n0 -F log.txt)
P.S.
There are many other ways to get an offset in seconds. For example, you can fetch the current Unix time using the built-in printf function:
# -1 represents the current time
# %s is strftime's format string for the number of seconds since the Epoch
timestamp=$(builtin printf '%(%s)T' -1)
Another way is to invoke the date command: date +%s.
But I believe that reading from the SECONDS variable is faster and cleaner.
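A quick sketch that illustrates the cost difference (the loop counts are arbitrary):

SECONDS=0
for ((i = 0; i < 1000; i++)); do t=$(date +%s); done   # forks date 1000 times
echo "date took ${SECONDS}s"

SECONDS=0
for ((i = 0; i < 1000; i++)); do t=$SECONDS; done      # no forks at all
echo "SECONDS took ${SECONDS}s"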

Manipulating data text file with bash command?

I was given this text file, called stock.txt; its content is:
pepsi;drinks;3
fries;snacks;6
apple;fruits;9
baron;drinks;7
orange;fruits;2
chips;snacks;8
I need to use a bash script to produce this output:
Total amount for drinks: 10
Total amount for snacks: 14
Total amount for fruits: 11
Total of everything: 35
My gut tells me I will need to use sed, grep, some sort of grouping, and something else.
Where should I start?
I would break the exercise down into steps
Step 1: Read the file one line at a time
while read -r line
do
    # do something with $line
done
Step 2: Pattern match (drinks, snacks, fruits) and do some simple arithmetic. This step requires that you tokenize each line, which I'll leave as an exercise for you to figure out (one possible approach is sketched after the snippet below).
if [[ "$line" =~ "drinks" ]]
then
    echo "matched drinks"
    .
    .
    .
fi
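For reference, one common way to do the tokenizing is to let read split on the semicolons (a sketch):

while IFS=';' read -r name cate price; do
    echo "name=$name category=$cate price=$price"
done < stock.txt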
Pure Bash. A nice application for an associative array:
declare -A category # associative array
IFS=';'
while read name cate price ; do
((category[$cate]+=price))
done < stock.txt
sum=0
for cate in ${!category[#]}; do # loop over the indices
printf "Total amount of %s: %d\n" $cate ${category[$cate]}
((sum+=${category[$cate]}))
done
printf "Total amount of everything: %d\n" $sum
There is a short description of processing comma-separated files in bash here:
http://www.cyberciti.biz/faq/unix-linux-bash-read-comma-separated-cvsfile/
You could do something similar. Just change IFS from comma to semicolon.
Oh yeah, and a general hint for learning bash: man is your friend. Use this command to see the manual pages for all (or most) commands and utilities.
Example: man read shows the manual page for the read command. On most systems it will be opened in less, so you should exit the manual by pressing q (it may sound funny, but it took me a while to figure that out).
The easy way to do this is using a hash table, which is supported directly by bash 4.x and of course can be found in awk and perl. If you don't have a hash table then you need to loop twice: once to collect the unique values of the second column, once to total.
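For reference, the hash-table approach in awk is a short sketch along these lines (note that awk's for-in iteration order is unspecified, so the category lines may come out in any order):

awk -F';' '
    { sum[$2] += $3; total += $3 }
    END {
        for (c in sum) printf "Total amount for %s: %d\n", c, sum[c]
        printf "Total of everything: %d\n", total
    }
' stock.txt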
There are many ways to do this. Here's a fun one which doesn't use awk, sed or perl. The only external utilities I've used here are cut, sort and uniq. You could even replace cut with a little more effort. In fact, lines 5-9 of the loop below could have been written more easily with grep (grep $kind stock.txt), but I avoided that to show off the power of bash.
for kind in $(cut -d\; -f 2 stock.txt | sort | uniq) ; do
    total=0
    while read d ; do
        total=$(( total+d ))
    done < <(
        while read line ; do
            [[ $line =~ $kind ]] && echo $line
        done < stock.txt | cut -d\; -f3
    )
    echo "Total amount for $kind: $total"
done
We lose the strict ordering of your original output here. An exercise for you might be to find a way not to do that.
Discussion:
The first line describes a sub-shell with a simple pipeline using cut. We read the second field from the stock.txt file, with fields delimited by ;, written \; here so the shell does not interpret it. The result is a newline-separated list of values from stock.txt. This is piped to sort, then uniq. This performs our "grouping" step, since the pipeline will output an alphabetic list of items from the second column but will only list each item once no matter how many times it appeared in the input file.
Also on the first line is a typical for loop: For each item resulting from the sub-shell we loop once, storing the value of the item in the variable kind. This is the other half of the grouping step, making sure that each "Total" output line occurs once.
On the second line total is initialized to zero so that it always resets whenever a new group is started.
The third line begins the 'totaling' loop, in which, for the current kind, we find the sum of its occurrences. Here we declare that we will read the variable d from stdin on each iteration of the loop.
On the fourth line the totaling actually occurs: using shell arithmetic we add the value in d to the value in total.
Line five ends the while loop and then describes its input. We use shell input redirection via < to specify that the input to the loop, and thus to the read command, comes from a file. We then use process substitution to specify that the file will actually be the results of a command.
On the sixth line the command that will feed the while-read loop begins. It is itself another while-read loop, this time reading into the variable line. On the seventh line the test is performed via a conditional construct. Here we use [[ for its =~ operator, which is a pattern matching operator. We are testing to see whether $line matches our current $kind.
On the eighth line we end the inner while-read loop and specify that its input comes from the stock.txt file; then we pipe the output of the entire loop, which by now is simply all lines matching $kind, to cut and instruct it to show only the third field, which is the numeric field. On line nine we end the process substitution command, the output of which is a newline-delimited list of numbers from lines belonging to the group specified by kind.
Given that the total is now known and the kind is known it is a simple matter to print the results to the screen.
The below answer is OP's. As it was edited in the question itself and OP hasn't come back for 6 years, I am editing out the answer from the question and posting it as wiki here.
My answer: to get the total price, I use this:
...
PRICE=0
IFS=";" # new field separator
while read name cate price
do
    let PRICE=PRICE+$price
done < stock.txt
echo $PRICE
When I echo it, I get 35, which is correct. Now I will move on to using awk to get the sub-category results.
Whole Solution:
Thanks guys, I managed to do it myself. Here is my code:
#!/bin/bash
INPUT=stock.txt
PRICE=0
DRINKS=0
SNACKS=0
FRUITS=0
old_IFS=$IFS # save the field separator
IFS=";" # new field separator
while read name cate price
do
    if [ $cate = "drinks" ]; then
        let DRINKS=DRINKS+$price
    fi
    if [ $cate = "snacks" ]; then
        let SNACKS=SNACKS+$price
    fi
    if [ $cate = "fruits" ]; then
        let FRUITS=FRUITS+$price
    fi
    # Total
    let PRICE=PRICE+$price
done < $INPUT
echo -e "Drinks: " $DRINKS
echo -e "Snacks: " $SNACKS
echo -e "Fruits: " $FRUITS
echo -e "Price " $PRICE
IFS=$old_IFS
