Bash running time optimization - bash

I am trying to solve an optimization problem and to find the most efficient way of performing the following chain of commands:
whois -> sed -> while (exit while) -> perform action
The while loop currently looks like
while [ "$x" -eq "$smth" ]; do
x=$((x+1))
done
some action
Maybe it is more efficient to have while true with an if inside (the if condition being the same as the while condition). Also, what is the best way in bash to measure the time required for every single step?

By far the biggest performance penalty, and the most common performance problem, in Bash is unnecessary forking.
while [[ something ]]
do
var+=$(echo "$expression" | awk '{print $1}')
done
will be thousands of times slower than
while [[ something ]]
do
var+=${expression%% *}
done
The former causes two forks per iteration, while the latter causes none.
Things that cause forks include, but are not limited to: pipelines (|), command substitution ($(command)), process substitution (<(command)), explicit subshells ((...)), and running any command that is not listed in help, i.e. anything that type somecmd does not identify as a 'builtin' or 'shell keyword'.
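As for measuring the time each step takes: the time keyword can wrap a whole loop or pipeline, which makes it easy to compare variants like the two above. A minimal sketch (the expression value and the 10000-iteration count are arbitrary, purely for illustration):
# Hypothetical micro-benchmark: the same work done with and without forks.
expression="foo bar baz"
time for ((i = 0; i < 10000; i++)); do
    var=$(echo "$expression" | awk '{print $1}')   # forks every iteration
done
time for ((i = 0; i < 10000; i++)); do
    var=${expression%% *}                          # pure builtin, no forks
done
The SECONDS variable is an alternative for coarser, whole-script measurements.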

Well, for starters you could swap the x=$((x+1)) assignment for the arithmetic command form, which is a bit terser. (Note that $(( )) is arithmetic expansion, not command substitution, so it does not actually create a subshell; the real forking costs are the ones described above.)
while [ "$x" -eq "$smth" ]
do
(( x++ ))
done

Related

stop scripting in a specified time (2 mins)

#!/bin/bash
commonguess(){
for guess in $(< passwordlist)
do
try=$(echo "$guess" | sha256sum | awk '{print $1}' )
if [ "$try" == "$xxx" ]
then
echo "$name:$try"
return 0
fi
done
return 1
}
dict(){...}
brute(){...}
while IFS=':' read -r name hashing; do
commonguess || dict || brute
done
My code has been fixed, and I need to do one more thing: when I run the function brute, it should stop after 2 minutes. I know the sleep command can make the script pause; however, I have been told it is not a good idea to use "kill". So I am wondering, is there any way to do this?
You'd get a better answer if you were more specific about what brute actually consists of.
If it's pure shell itself, the cleanest way to have it manage its own execution time is to check the SECONDS variable frequently enough that you can abort by yourself if it ever goes over some threshold.
If it's external, you're not going to be able to avoid kill, or some invocation wrapper that handles the timeout on its own, which probably already exists.
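For the pure-shell case, here is a minimal sketch of the SECONDS approach; the loop body is only a placeholder for whatever brute actually does, and the 120-second budget corresponds to the 2 minutes asked about:
brute(){
    local start=$SECONDS
    while (( SECONDS - start < 120 )); do   # give up after 2 minutes
        # ... generate and test the next candidate here,
        #     returning 0 as soon as it matches "$hashing" ...
        :
    done
    return 1                                # ran out of time
}
If brute ends up calling an external program instead, the wrapper alluded to above is typically timeout from GNU coreutils (e.g. timeout 120 somecommand); it does send a signal under the hood, but you don't have to manage kill yourself.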

bc: prevent "divide by zero" runtime error on multiple operations

I'm using bc to do a series of calculations.
I'm calling it through a bash script that first puts all the expressions to be calculated in a single variable, then passes them to bc to calculate the results.
The script is something like this:
#!/bin/bash
....
list=""
for frml in `cat $frmlList`
do
value=`echo $frml`
list="$list;$value"
done
echo "scale=3;$list"| bc
The frmlList variable contains a list of expressions that are the output of another program; for simplicity I don't show every operation, but some sed operations are applied to its contents before assigning the result to the "value" variable.
In the end, the "list" variable contains a list of expressions separated by semicolon that bc understands.
Now, in my formula list it sometimes happens that there is a division by 0.
When that happens, bc stops its computation, giving a "Runtime error: divide by zero".
I would like bc not to end its work on that error, but to skip it and continue with the next formula evaluation.
Is it possible to achieve something like that?
The same thing happens in a simpler situation:
echo "scale=2;1+1;1/0;2+2;" | bc
the output is
2
Runtime error (func=(main), adr=17): Divide by zero
I would like to have something like
2
Runtime error (func=(main), adr=17): Divide by zero
4
Thank you in advance :)
OK, in the end I found a workaround that does the trick quite well.
The idea is to parallelize the execution of bc using subshells; this way, if one evaluation fails, the others can still be done.
In my script I did something like this:
#!/bin/bash
i=0
for frml in `cat $frmlList`
do
i=$((i+1))
(echo "scale=3;$value"| bc -l extensions.bc > "file_$i.tmp") &
if (( $i % 10 == 0 )); then
wait;
fi # Limit to 10 concurrent subshells.
done
#do something with the written files
I know no simple way to do this. If the expressions are independent, you can try to run them all in bc. If that fails, feed them to bc one by one, skipping the broken ones.
If expressions depend on each other, then you probably need something more powerful than bc. Or you can try to append expression after expression to an input file. If bc fails, remove the last one (maybe restore the file from a backup) and try with the next one.
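For the independent case, a minimal sketch of the one-expression-at-a-time approach (expressions.txt is just a placeholder for wherever the formulas come from); because each expression gets its own bc invocation, a divide-by-zero only loses that one result:
#!/bin/bash
while IFS= read -r expr; do
    result=$(echo "scale=3;$expr" | bc 2>/dev/null)
    if [ -n "$result" ]; then
        echo "$expr = $result"
    else
        echo "skipped (error): $expr" >&2
    fi
done < expressions.txt
Checking for empty output rather than bc's exit status sidesteps version differences in how bc reports runtime errors.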

Bash conditional on command exit code

In bash, I want to say "if a file doesn't contain XYZ, then" do a bunch of things. The most natural way to transpose this into code is something like:
if [ ! grep --quiet XYZ "$MyFile" ] ; then
... do things ...
fi
But of course, that's not valid Bash syntax. I could use backticks, but then I'd be testing grep's output rather than its exit status. The two alternatives I can think of are:
grep --quiet XYZ "$MyFile"
if [ $? -ne 0 ]; then
... do things ...
fi
And
grep --quiet XYZ "$MyFile" ||
( ... do things ...
)
I kind of prefer the second one: it's more Lispy, and the || for control flow isn't that uncommon in scripting languages. I can see arguments for the first one too, although when a person reads the first line they don't know why you're executing grep; it looks like you're executing it for its main effect, rather than just to control a branch in the script.
Is there a third, more direct way which uses an if statement and has the grep in the condition?
Yes there is:
if grep --quiet .....
then
# If grep finds something
fi
or if the grep fails
if ! grep --quiet .....
then
# If grep doesn't find something
fi
You don't need the [ ] (test) to check the return value of a command. Just try:
if ! grep --quiet XYZ "$MyFile" ; then
This is a matter of taste, since there obviously are multiple working solutions. When I deal with a problem like this, I usually apply wc -l after grep in order to count the lines that match. Then you have a single integer that you can evaluate within a test condition. If the question is only whether there is a match at all (the number of matching lines does not matter), then applying wc is probably over the top, and evaluating grep's return code seems to be the best solution. From the grep man page:
Normally, the exit status is 0 if selected lines are found and 1 otherwise. But the exit status is 2 if an error occurred, unless the -q or --quiet or --silent option is used and a selected line is found.
Note, however, that POSIX only mandates, for programs such as grep, cmp, and diff, that the exit status in case of error be greater than 1; it is therefore advisable, for the sake of portability, to use logic that tests for this general condition instead of strict equality with 2.
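If you do want to distinguish "no match" from "grep itself failed", a small sketch following the portability advice quoted above (reusing XYZ and $MyFile from the question):
grep --quiet XYZ "$MyFile"
status=$?
if [ "$status" -eq 0 ]; then
    echo "match found"
elif [ "$status" -eq 1 ]; then
    echo "no match"          # ... do things ...
else
    echo "grep failed (status $status)" >&2   # e.g. unreadable file
fi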

How to prevent code/option injection in a bash script

I have written a small bash script called "isinFile.sh" for checking if the first term given to the script can be found in the file "file.txt":
#!/bin/bash
FILE="file.txt"
if [ `grep -w "$1" $FILE` ]; then
echo "true"
else
echo "false"
fi
However, running the script like
> ./isinFile.sh -x
breaks the script, since -x is interpreted by grep as an option.
So I improved my script
#!/bin/bash
FILE="file.txt"
if [ `grep -w -- "$1" $FILE` ]; then
echo "true"
else
echo "false"
fi
using -- as an argument to grep. Now running
> ./isinFile.sh -x
false
works. But is using -- the correct and only way to prevent code/option injection in bash scripts? I have not seen it in the wild, only found it mentioned in ABASH: Finding Bugs in Bash Scripts.
grep -w -- ...
prevents anything that follows -- from being interpreted as an option.
EDIT
(I did not read the last part, sorry.) Yes, it is the standard way. The other way is to keep the argument from appearing at the start of the pattern; e.g. ".{0}-x" works too, but it is odd. So, e.g.,
grep -w ".{0}$1" ...
should work too.
There's actually another code injection (or whatever you want to call it) bug in this script: it simply hands the output of grep to the [ (aka test) command, and assumes that'll return true if it's not empty. But if the output is more than one "word" long, [ will treat it as an expression and try to evaluate it. For example, suppose the file contains the line 0 -eq 2 and you search for "0" -- [ will decide that 0 is not equal to 2, and the script will print false despite the fact that it found a match.
The best way to fix this is to use Ignacio Vazquez-Abrams' suggestion (as clarified by Dennis Williamson) -- this completely avoids the parsing problem, and is also faster (since -q makes grep stop searching at the first match). If that option weren't available, another method would be to protect the output with double-quotes: if [ "$(grep -w -- "$1" "$FILE")" ]; then (note that I also used $() instead of backquotes 'cause I find them much easier to read, and quotes around $FILE just in case it contains anything funny, like whitespace).
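Putting that advice together, a sketch of the corrected script, testing grep's exit status directly instead of its output:
#!/bin/bash
FILE="file.txt"
if grep -q -w -- "$1" "$FILE"; then
    echo "true"
else
    echo "false"
fi
The -- still guards against option injection, and -q means grep's exit status alone decides the branch, so there is nothing left for [ to misparse.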
Though not applicable in this particular case, another technique can be used to prevent filenames that start with hyphens from being interpreted as options:
rm ./-x
or
rm /path/to/-x

Easy parallelisation

I often find myself writing simple for loops to perform an operation to many files, for example:
for i in `find . | grep ".xml$"`; do bzip2 $i; done
It seems a bit depressing that on my 4-core machine only one core is getting used. Is there an easy way I can add parallelism to my shell scripting?
EDIT: To introduce a bit more context to my problems, sorry I was not more clear to start with!
I often want to run simple(ish) scripts, such as plot a graph, compress or uncompress, or run some program, on reasonably sized datasets (usually between 100 and 10,000 files). The scripts I use to solve such problems look like the one above, but might have a different command, or even a sequence of commands to execute.
For example, just now I am running:
for i in `find . | grep ".xml.bz2$"`; do find_graph -build_graph $i.graph $i; done
So my problems are in no way bzip specific! (Although parallel bzip does look cool, I intend to use it in future).
Solution: use xargs to run jobs in parallel (don't forget the -n option, otherwise xargs batches many files into a single invocation and you lose most of the parallelism):
find -name \*.xml -print0 | xargs -0 -n 1 -P 3 bzip2
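If each file needs a sequence of commands rather than a single one (like the find_graph example in the edit above), you can wrap the sequence in sh -c; a sketch:
find . -name '*.xml.bz2' -print0 |
  xargs -0 -n 1 -P 3 sh -c 'find_graph -build_graph "$1.graph" "$1"' _
Here the trailing _ becomes $0 inside the sh -c script and each file name arrives as $1.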
This Perl program fits your needs fairly well; you would just do this:
runN -n 4 bzip2 `find . | grep ".xml$"`
GNU make has a nice parallelism feature (e.g. -j 5) that would work in your case. Create a Makefile (note that the recipe line must start with a tab):
%.xml.bz2 : %.xml
	bzip2 $<

all: $(patsubst %.xml,%.xml.bz2,$(shell find . -name '*.xml'))
then do a
nice make -j 5
Replace '5' with some number, probably 1 more than the number of CPUs. You might want to use 'nice' just in case someone else wants to use the machine while you are on it.
The answer to the general question is difficult, because it depends on the details of the things you are parallelizing.
On the other hand, for this specific purpose, you should use pbzip2 instead of plain bzip2 (chances are that pbzip2 is already installed, or at least in the repositories of your distro). See here for details: http://compression.ca/pbzip2/
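For example (assuming pbzip2 is installed), the loop from the question stays sequential and pbzip2 parallelises the compression of each individual file across all cores:
for i in `find . | grep ".xml$"`; do pbzip2 "$i"; done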
I find this kind of operation counterproductive. The reason is that the more processes access the disk at the same time, the higher the read/write time goes, so the final result takes longer. The bottleneck here won't be a CPU issue, no matter how many cores you have.
Haven't you ever copied two big files at the same time on the same HD drive? It is usually faster to copy one and then the other.
I know this task involves some CPU power (bzip2 is a demanding compression method), but try measuring the CPU load first before going down the "challenging" path we technicians all tend to choose much more often than needed.
I did something like this for bash. The parallel make trick is probably a lot faster for one-offs, but here is the main code section to implement something like this in bash; you will need to modify it for your purposes, though:
#!/bin/bash
# Replace NNN with the number of loops you want to run through
# and CMD with the command you want to parallel-ize.
set -m
nodes=`grep processor /proc/cpuinfo | wc -l`
# job[o] holds the PID running in slot o, or 0 if the slot is free.
job=($(yes 0 | head -n $nodes | tr '\n' ' '))

# isin VALUE WORD...: succeed if VALUE is among the remaining arguments.
isin()
{
    local v=$1
    shift 1
    while (( $# > 0 ))
    do
        if [ $v = $1 ]; then return 0; fi
        shift 1
    done
    return 1
}

# Block until at least one slot is free, then mark finished slots as free.
dowait()
{
    while true
    do
        nj=( $(jobs -p) )
        if (( ${#nj[@]} < nodes ))
        then
            for (( o=0; o<nodes; o++ ))
            do
                if ! isin ${job[$o]} ${nj[*]}; then let job[o]=0; fi
            done
            return
        fi
        sleep 1
    done
}

let x=0
while (( x < NNN ))
do
    # Find a free slot; if there is none, wait for one.
    for (( o=0; o<nodes; o++ ))
    do
        if (( job[o] == 0 )); then break; fi
    done
    if (( o == nodes )); then
        dowait
        continue
    fi
    CMD &
    let job[o]=$!
    let x++
done
wait
If you had to solve the problem today you would probably use a tool like GNU Parallel (unless there is a specialized parallelized tool for your task like pbzip2):
find . | grep ".xml$" | parallel bzip2
To learn more:
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.
I think you could do the following:
for i in `find . | grep ".xml$"`; do bzip2 "$i" & done
But that would instantly spin off as many processes as you have files, which isn't as optimal as just running four processes at a time.
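If you want to stay in pure bash but cap the number of concurrent jobs, here is a minimal sketch using wait -n (available in bash 4.3 and later); the limit of 4 jobs is just an example:
#!/bin/bash
maxjobs=4                                   # tune to your core count

while IFS= read -r -d '' file; do
    bzip2 "$file" &
    # Once $maxjobs compressions are running, wait for any one to finish.
    while (( $(jobs -pr | wc -l) >= maxjobs )); do
        wait -n
    done
done < <(find . -name '*.xml' -print0)

wait   # let the last few jobs finish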

Resources