bash loop taking extremely long time

I have a list of times in the format HH:MM:SS that I am looping through to find the nearest time that is not in the past. The code that I have is:
for i in ${times[@]}; do
    hours=$(echo $i | sed 's/\([0-9]*\):.*/\1/g')
    minutes=$(echo $i | sed 's/.*:\([0-9]*\):.*/\1/g')
    currentHours=$(date +"%H")
    currentMinutes=$(date +"%M")
    if [[ hours -ge currentHours ]]; then
        if [[ minutes -ge currentMinutes ]]; then
            break
        fi
    fi
done
The variable times is an array of all the times that I am sorting through (it's about 20-40 lines). I'd expect this to take less than 1 second; however, it is taking upwards of 5 seconds. Any suggestions for decreasing the time spent in the regular expressions would be appreciated.
times=($(cat file.txt))
Here is the list of times; they are stored in a text file and imported into the times array using the line above.
6:05:00
6:35:00
7:05:00
7:36:00
8:08:00
8:40:00
9:10:00
9:40:00
10:11:00
10:41:00
11:11:00
11:41:00
12:11:00
12:41:00
13:11:00
13:41:00
14:11:00
14:41:00
15:11:00
15:41:00
15:56:00
16:11:00
16:26:00
16:41:00
16:58:00
17:11:00
17:26:00
17:41:00
18:11:00
18:41:00
19:10:00
19:40:00
20:10:00
20:40:00
21:15:00
21:45:00

One of the key things to understand when looking at bash scripts from a performance perspective is that while the bash interpreter is somewhat slow, the act of spawning an external process is extremely slow. Thus, while a single invocation of awk or sed can often speed up a script by processing a large stream of input, the cost of starting those tools inside a tight loop greatly outweighs any speed they offer once they're running.
Any command substitution -- $() -- causes a second copy of the interpreter to be fork()ed off as a subshell. Invoking any command not built into bash -- date, sed, etc. -- then causes a subprocess to be fork()ed off for that command, and the executable associated with it to be exec()'d -- something that involves a great deal of OS-level overhead (the binary needs to be linked, loaded, and so on).
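If you want to see that cost on your own machine, a rough side-by-side comparison such as the following makes the difference obvious (timings will vary; the printf form needs bash 4.2 or newer):
time for i in {1..100}; do d=$(date +%H); done            # 100 command substitutions, each forking and exec'ing date
time for i in {1..100}; do printf -v d '%(%H)T' -1; done  # the same work done entirely in the current shell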
This loop would be better written as:
IFS=: read -r currentHours currentMinutes < <(date +"%H:%M")
while IFS=: read -r hours minutes _; do
    # 10# forces base-10 arithmetic so values with a leading zero (e.g. "08") aren't treated as octal
    if (( 10#$hours >= 10#$currentHours )) && (( 10#$minutes >= 10#$currentMinutes )); then
        break
    fi
done <file.txt
In this form only one external command is run, date +"%H:%M", outside the loop. If you were only targeting bash 4.2 and newer (with built-in time formatting support), even this would be unnecessary:
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
...will directly place the current hour and minute into the variables currentHours and currentMinutes using only functionality built into modern bash releases.
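Putting the two together, a sketch of the whole lookup with no external commands at all might look like this (bash 4.2+ assumed; the 10# prefix keeps values with a leading zero, such as "08", from being read as octal):
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
while IFS=: read -r hours minutes _; do
    if (( 10#$hours > 10#$currentHours )) ||
       (( 10#$hours == 10#$currentHours && 10#$minutes >= 10#$currentMinutes )); then
        printf '%s:%s\n' "$hours" "$minutes"   # first listed time that has not yet passed
        break
    fi
done <file.txt
Unlike the separate hour and minute checks above, this compares the two fields together, so a time such as 11:00 is not skipped merely because the current minute happens to be larger than 00.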
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
BashFAQ #100 - How can I do native string manipulations in bash? (Subsection: "Splitting a string into fields")

To be honest, I'm not sure why it's taking such a long time, but there are certainly some things that could be made more efficient.
currentHours=$(date +"%H")
currentMinutes=$(date +"%M")
for time in "${times[@]}"; do
    IFS=: read -r hours minutes seconds <<<"$time"
    # 10# forces base-10 arithmetic so values with a leading zero (e.g. "08") aren't treated as octal
    if (( 10#$hours >= 10#$currentHours && 10#$minutes >= 10#$currentMinutes )); then
        break
    fi
done
This uses read, a built-in command, to split the text into variables, rather than calling external commands and creating subshells.
I assume that the script runs quickly enough that it's safe to compute currentHours and currentMinutes once, outside the loop, and reuse them.
Note that you can also just use awk to do the whole thing:
awk -F: -v currentHours="$(date +"%H")" -v currentMinutes="$(date +"%M")" '
    $1 >= currentHours && $2 >= currentMinutes { print; exit }' file.txt
Just to make the program produce some output, I added a print, so that the matching line is printed.

awk to the rescue!
awk -v time="12:12:00" '
    function pad(x) { split(x, ax, ":"); return (ax[1] < 10) ? "0" x : x }
    BEGIN { time = pad(time) }
    time > pad($0) { next }
    { print; exit }' times
12:41:00
With the hour zero-padded, a plain string comparison is enough.
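A quick way to convince yourself why the padding matters (string comparison proceeds character by character):
awk 'BEGIN { print ("6:05:00" < "12:00:00"), ("06:05:00" < "12:00:00") }'
This prints "0 1": the unpadded "6..." sorts after "1...", while the padded form compares correctly.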


Bash pattern matching loop super slow [duplicate]

This question already has answers here: Bash while read loop extremely slow compared to cat, why? (4 answers)
Closed 4 years ago.
When I do this with awk it's relatively fast, even though it processes the file Row By Agonizing Row (RBAR). I tried to make a quicker, more elegant, bug-resistant solution in Bash that would only have to make far fewer passes through the file. With the code below, bash takes probably 10 seconds to get through the first 1,000 lines, yet I can make 25 passes through all million lines of the file with awk in about the same time! How come bash is several orders of magnitude slower?
while read line
do
    FIELD_1=`echo "$line" | cut -f1`
    FIELD_2=`echo "$line" | cut -f2`
    if [ "$MAIN_REF" == "$FIELD_1" ]; then
        #echo "$line"
        if [ "$FIELD_2" == "$REF_1" ]; then
            ((REF_1_COUNT++))
        fi
        ((LINE_COUNT++))
        if [ "$LINE_COUNT" == "1000" ]; then
            echo $LINE_COUNT;
        fi
    fi
done < temp/refmatch
Bash is slow. That's just the way it is; it's designed to oversee the execution of specific tools, and it was never optimized for performance.
All the same, you can make it less slow by avoiding obvious inefficiencies. For example, read will split its input into separate words, so it would be both faster and clearer to write:
while read -r field1 field2 rest; do
# Do something with field1 and field2
instead of
while read line
do
FIELD_1=`echo "$line" | cut -f1`
FIELD_2=`echo "$line" | cut -f2`
Your version sets up two pipelines and creates four children (at least) for every line of input, whereas using read the way it was designed requires no external processes whatsoever.
If you are using cut because your lines are tab-separated and not just whitespace-separated, you can achieve the same effect with read by setting IFS locally:
while IFS=$'\t' read -r field1 field2 rest; do
# Do something with field1 and field2
Even so, don't expect it to be fast. It will just be less agonizingly slow. You would be better off fixing your awk script so that it doesn't require multiple passes. (If you can do that with bash, it can be done with awk and probably with less code.)
Note: I set three variables rather than two, because read puts the rest of the line into the last variable. If there are only two fields, no harm is done; setting a variable to an empty string is something bash can do reasonably rapidly.
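For what it's worth, the counting done by the original loop fits in a single awk pass along these lines (a sketch; it assumes MAIN_REF and REF_1 are shell variables holding the values being matched and that the file is tab-separated):
awk -F'\t' -v main="$MAIN_REF" -v ref="$REF_1" '
    $1 == main { line_count++; if ($2 == ref) ref_count++ }
    END { print "lines:", line_count+0, "ref matches:", ref_count+0 }
' temp/refmatch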
As @codeforester points out, the original bash script spawns too many subprocesses.
Here's the modified version to minimize the overheads:
#!/bin/bash

while IFS=$'\t' read -r FIELD_1 FIELD_2 others; do
    if [[ "$MAIN_REF" == "$FIELD_1" ]]; then
        if [[ "$FIELD_2" == "$REF_1" ]]; then
            let REF_1_COUNT++
        fi
        let LINE_COUNT++
        if [[ "$LINE_COUNT" == "1000" ]]; then
            echo "$LINE_COUNT"
        fi
    fi
done < temp/refmatch
It runs more than 20 times faster than the original, but I'm afraid that may be close to the limit of what a bash script can do.

How can I monitor the average number of lines added to a file per second in a bash shell? [closed]

I'd like to monitor the average rate at which lines are being added to a log file in a bash shell.
I can currently monitor how many lines are in the file each second via the command
watch -n 1 'wc -l log.txt'
However, this gives me the total line count, whereas I would prefer a rate. In other words, I would like a command that, every so often, outputs the number of lines added to the file since the command started, divided by the number of seconds it has been running.
For a rough count of lines per second, try:
tail -f log.txt | { count=0; old=$(date +%s); while read line; do ((count++)); s=$(date +%s); if [ "$s" -ne "$old" ]; then echo "$count lines per second"; count=0; old=$s; fi; done; }
(Bash required.)
Or, as spread out over multiple lines:
tail -f log.txt | {
    count=0
    old=$(date +%s)
    while read line
    do
        ((count++))
        s=$(date +%s)
        if [ "$s" -ne "$old" ]
        then
            echo "$count lines per second"
            count=0
            old=$s
        fi
    done
}
This uses date to record the time in seconds. Meanwhile, it counts the number of lines produced by tail -f log.txt. Every time another second passes, the count of lines seen during that second is printed.
Demonstration
In one terminal, run the command:
while sleep 0.1; do echo $((count++)); done >>log.txt
This command writes one line to the file log.txt roughly every tenth of a second.
In another terminal, run:
$ tail -f log.txt | { count=0; old=$(date +%s); while read line; do ((count++)); s=$(date +%s); if [ "$s" -ne "$old" ]; then echo "$count lines per second"; count=0; old=$s; fi; done; }
15 lines per second
10 lines per second
10 lines per second
10 lines per second
9 lines per second
10 lines per second
Due to buffering, the first count is off. Subsequent counts are fairly accurate.
Simple script you can deploy:
Filename="log.txt"
ln_1=`wc -l $Filename | awk '{print $1}'`
while true
do
    ln_2=${ln_1}
    sleep 1
    ln_1=`wc -l $Filename | awk '{print $1}'`
    echo $(( ln_1-ln_2 )) lines increased
done
The tail command supports watching for appended lines via the --follow option, which accepts either a file descriptor or a file name. With this option, tail periodically checks the file for changes. The interval between checks depends on whether the kernel supports inotify: inotify-based implementations detect changes promptly (almost instantly), whereas without inotify support tail falls back to periodic polling, sleeping one second by default between checks. The sleep interval can be changed with the --sleep-interval option.
I wouldn't rely on the sleep interval in calculations, however:
When ‘tail’ uses inotify, this polling-related option is usually ignored.
Especially because Bash has a built-in seconds counter, the SECONDS variable (see info bash SECONDS):
This variable expands to the number of seconds since the shell was started. Assignment to this variable resets the count to the value assigned, and the expanded value becomes the value assigned plus the number of seconds since the assignment.
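A minimal demonstration of that behaviour:
SECONDS=0; sleep 2; echo "$SECONDS seconds elapsed"   # typically prints: 2 seconds elapsed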
Thus, you can initialize SECONDS to 1, run a loop reading the output of tail, and calculate the speed as number_of_lines / $SECONDS. But that produces an average over the entire execution time; an average over the last N seconds is much more practical. It is also easy to implement, as Bash allows you to reset the seconds counter.
Example
The following example implements the idea. It also features watch-like output in interactive mode.
# The number of seconds for which we calculate the average speed
timespan=8

# The number of lines
lines=0

# We'll assume that the shell is running in interactive mode,
# if the standard output descriptor (1) is attached to the terminal.
# See http://www.tldp.org/LDP/abs/html/intandnonint.html
if [ -t 1 ]; then
    is_interactive=1
    format='%d lines/sec'
else
    is_interactive=
    format='%d lines/sec\n'
fi

# Reset the built-in seconds counter.
# Also, prevent division by zero in the calculation below.
SECONDS=1

# Save cursor position, if interactive
test -n "$is_interactive" && tput sc

while read line; do
    if [[ $(( $SECONDS % $timespan )) -eq 0 ]]; then
        SECONDS=1
        lines=0
    fi
    if test -n "$is_interactive"; then
        # Restore cursor position, then delete line
        tput rc; tput el1
    fi
    printf "$format" $(( ++lines / SECONDS ))
done < <(tail -n0 -F log.txt)
P.S.
There are many other ways to get an offset in seconds. For example, you can fetch the current Unix time using the built-in printf function:
# -1 represents the current time
# %s is strftime's format string for the number of seconds since the Epoch
timestamp=$(builtin printf '%(%s)T' -1)
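For example, two such timestamps are enough to measure an interval without forking anything (a minimal sketch):
builtin printf -v start '%(%s)T' -1
sleep 3
builtin printf -v now '%(%s)T' -1
echo "$(( now - start )) seconds elapsed"   # roughly 3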
Another way is to invoke the date command: date +%s.
But I believe that reading from the SECONDS variable is faster and cleaner.

Splitting large files efficiently (currently using awk)

I have a 4 GB file that I need to do some operations on. I have a Bash script to do this, but Bash seems ill-suited to reading large data files into an array. So I decided to break up my file with awk.
My current script is:
for((i=0; i<100; i++)); do awk -v i=$i 'BEGIN{binsize=60000}{if(binsize*i < NR && NR <= binsize*(i+1)){print}}END{}' my_large_file.txt &> my_large_file_split$i.fastq; done
However the problem with this script is that it will read in and loop through this large file 100 times (which presumably will lead to about 400GB of IO).
QUESTION: Is there a better strategy for reading the large file only once? Perhaps doing the writing to files within awk instead of redirecting its output?
Assuming binsize is the number of lines you want per chunk, you could just maintain and reset a line counter as you step through the file, and set alternate output files within awk instead of using the shell to redirect.
awk -v binsize=60000 '
    BEGIN {
        filenum = 1
        outfile = "output_chunk_1.txt"
    }
    count >= binsize {
        close(outfile)
        filenum++
        outfile = "output_chunk_" filenum ".txt"
        count = 0
    }
    {
        count++
        print > outfile
    }
' my_large_file.txt
I haven't actually tested this code, so if it doesn't work verbatim, at least it should give you an idea of a strategy to use. :-)
The idea is that we'll step through the file, updating the output filename in a variable whenever the line count for a chunk reaches binsize. The close(outfile) isn't strictly necessary, as awk will of course close any open files when it exits, but it may save you a few bytes of memory per open file handle (which will only be significant if you have many, many output files).
That said, you could do almost exactly the same thing in bash alone:
#!/usr/bin/env bash

binsize=60000
filenum=1; count=0
while read -r line; do
    if [ "$count" -ge "$binsize" ]; then
        ((filenum++))
        count=0
    fi
    ((count++))
    outfile="output_chunk_${filenum}.txt"
    printf '%s\n' "$line" >> "$outfile"
done < my_large_file.txt
(Also untested.)
And while I'd expect the awk solution to be faster than bash, it might not hurt to do your own benchmarks. :)
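A rough way to run such a benchmark yourself (the script names here are hypothetical placeholders for the two versions above):
time ./split_with_awk.sh       # the awk version
time ./split_with_bash.sh      # the pure-bash version
wc -l output_chunk_*.txt       # every chunk except possibly the last should have 60000 lines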

Do I need to stay away from bash scripts for big files?

I have big log files (1-2 GB and more). I'm new to programming, and bash is useful and easy for me; when I need something, I can usually get it done (sometimes with help from people here). Simple scripts work fine, but when I need complex operations, everything runs very slowly. Maybe bash is slow, or maybe my programming skills are bad.
So do I need C for complex processing of my server log files, or do I just need to optimize my scripts?
If I just need optimization, how can I check which parts of my code are slow and which are fine?
For example, I have this while-do loop:
while read -r date month size
do
    ...
    ...
done < file.tmp
How can I use awk to make this run faster?
That depends on how you use bash. To illustrate, consider how you'd sum a possibly large number of integers.
This function does what Bash was meant for: being control logic for calling other utilities.
sumlines_fast() {
    awk '{n += $1} END {print n}'
}
It runs in 0.5 seconds on a million line file. That's the kind of bash code you can very effectively use for larger files.
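For example, a quick way to try it yourself (seq and the file name are just for illustration):
seq 1 1000000 > numbers.txt
time sumlines_fast < numbers.txt   # prints 500000500000; about half a second in the test above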
Meanwhile, this function does what Bash is not intended for: being a general purpose programming language:
sumlines_slow() {
    local i=0
    while IFS= read -r line
    do
        (( i += $line ))
    done
    echo "$i"
}
This function is slow, and takes 30 seconds to sum the same million line file. You should not be doing this for larger files.
Finally, here's a function that could have been written by someone who has no understanding of bash at all:
sumlines_garbage() {
    i=0
    for f in `cat`
    do
        i=`echo $f + $i | bc`
    done
    echo $i
}
It treats forks as being free and therefore runs ridiculously slowly. It would take something like five hours to sum the file. You should not be using this at all.

How to count number of forked (sub-?)processes

Somebody else has written (TM) some bash script that forks very many sub-processes. It needs optimization. But I'm looking for a way to measure "how bad" the problem is.
Can I / How would I get a count that says how many sub-processes were forked by this script all-in-all / recursively?
This is a simplified version of what the existing, forking code looks like - a poor man's grep:
#!/bin/bash

file=/tmp/1000lines.txt
match=$1
let cnt=0

while read line
do
    cnt=`expr $cnt + 1`
    lineArray[$cnt]="${line}"
done < $file

totalLines=$cnt
cnt=0

while [ $cnt -lt $totalLines ]
do
    cnt=`expr $cnt + 1`
    matches=`echo ${lineArray[$cnt]}|grep $match`
    if [ "$matches" ] ; then
        echo ${lineArray[$cnt]}
    fi
done
It takes the script 20 seconds to look for $1 in 1000 lines of input. This code forks way too many sub-processes. In the real code, there are longer pipes (e.g. progA | progB | progC) operating on each line using grep, cut, awk, sed and so on.
This is a busy system with lots of other stuff going on, so a count of how many processes were forked on the entire system during the run-time of the script would be of some use to me, but I'd prefer a count of processes started by this script and descendants. And I guess I could analyze the script and count it myself, but the script is long and rather complicated, so I'd just like to instrument it with this counter for debugging, if possible.
To clarify:
I'm not looking for the number of processes under $$ at any given time (e.g. via ps), but the number of processes run during the entire life of the script.
I'm also not looking for a faster version of this particular example script (I can do that). I'm looking for a way to determine which of the 30+ scripts to optimize first to use bash built-ins.
You can count the forked processes simply by trapping the SIGCHLD signal. If you can edit the script file, then you can do this:
set -o monitor # or set -m
trap "((++fork))" CHLD
The fork variable will then contain the number of forks. At the end you can print this value:
echo $fork FORKS
For a 1000 lines input file it will print:
3000 FORKS
This code forks for two reasons: once for each `expr ...` and once for each `echo ...|grep ...`. So in the reading while-loop it forks once per line read, and in the processing while-loop it forks twice per line (once for `expr ...` and once for `echo ...|grep ...`). For a 1000-line file that is 3000 forks.
But this is not exact! It only counts the forks done by the calling shell. There are more forks, because `echo ...|grep ...` forks a subshell to run the pipeline, and that subshell in turn forks twice more: once for echo and once for grep. So each of those lines really costs 3 forks, not one, and the true total is closer to 5000 forks, not 3000.
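Assembled, the instrumentation might look like this at the top of the script being measured (a sketch built only from the pieces above):
#!/bin/bash
set -m                  # same as set -o monitor, so the CHLD trap fires for child processes
fork=0
trap '((++fork))' CHLD
# ... original script body goes here ...
echo "$fork FORKS"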
If you need to count the forks of the forks (of the forks...) as well, or you cannot modify the bash script, or you want to do this from another script, a more exact solution is to use
strace -fo s.log ./x.sh
It will print lines like this:
30934 execve("./x.sh", ["./x.sh"], [/* 61 vars */]) = 0
Then you need to count the unique PIDs using something like this (the first number on each line is the PID):
awk '{n[$1]}END{print length(n)}' s.log
For this script I got 5001 (the +1 is the PID of the original bash script).
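If your awk does not support length() on an array (it is not guaranteed by POSIX), counting the unique PIDs with standard tools works just as well:
awk '{print $1}' s.log | sort -u | wc -l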
COMMENTS
Actually in this case all forks can be avoided:
Instead of
cnt=`expr $cnt + 1`
Use
((++cnt))
Instead of
matches=`echo ${lineArray[$cnt]}|grep $match`
if [ "$matches" ] ; then
echo ${lineArray[$cnt]}
fi
You can use bash's internal pattern matching:
[[ ${lineArray[cnt]} =~ $match ]] && echo ${lineArray[cnt]}
Note that bash's =~ uses ERE, not the BRE that plain grep uses, so it behaves like egrep (or grep -E), not grep.
I assume that lineArray is not pointless (otherwise the matching could be done directly in the reading loop and the array would not be needed) and that it is used for some other purpose as well. In that case I can suggest a slightly shorter version:
readarray -t lineArray <infile
for line in "${lineArray[@]}"; { [[ $line =~ $match ]] && echo "$line"; }
The first line reads the complete infile into lineArray without any explicit loop. The second line processes the array element by element.
MEASURES
Original script for 1000 lines (on cygwin):
$ time ./test.sh
3000 FORKS
real 0m48.725s
user 0m14.107s
sys 0m30.659s
Modified version
FORKS
real 0m0.075s
user 0m0.031s
sys 0m0.031s
Same on linux:
3000 FORKS
real 0m4.745s
user 0m1.015s
sys 0m4.396s
and
FORKS
real 0m0.028s
user 0m0.022s
sys 0m0.005s
So this version uses no fork (or clone) at all. I would suggest using this version only for small (<100 KiB) files; in other cases grep, egrep, or awk outperforms the pure bash solution. But this should be checked with a performance test.
For a thousand lines on linux I got the following:
$ time grep Solaris infile # Solaris is not in the infile
real 0m0.001s
user 0m0.000s
sys 0m0.001s
