Is this the faster way to test cpu load using shell scripting? - bash

I'm relatively new to shell scripting and I'm in the process of writing my own health checking scripts using bash.
Is the following script to test cpu load the best I can have in terms of performance, readability and maintainability?
#!/bin/sh
getloadavg5 () {
echo $(cat /proc/loadavg | cut -f2 -d' ')
}
getnumcpus () {
echo $(cat /proc/cpuinfo | grep '^processor' | wc -l)
}
awk \
-v failthold=0.8 \
-v warnthold=0.7 \
-v loadavg=$(getloadavg5) \
-v numcpus=$(getnumcpus) \
'BEGIN {
ratio=loadavg/numcpus
if (ratio >= failthold) exit 2
if (ratio >= warnthold) exit 1
exit 0
}'

This might be more suitable for the code review stackexchange, but without condoning the use of load averages in this way, here are some ideas:
#!/bin/sh
read -r one five fifteen rest < /proc/loadavg
cpus=$(grep -c '^processor' /proc/cpuinfo)
awk \
-v failthold=0.8 \
-v warnthold=0.7 \
-v loadavg="$five" \
-v numcpus="$cpus" \
'BEGIN {
ratio=loadavg/numcpus
if (ratio >= failthold) exit 2
if (ratio >= warnthold) exit 1
exit 0
}'
It doesn't have any of the unnecessary cats/echos.
It also happens to run faster thanks to forking 1 or 2 times (depending on shell) instead of ~10, but if performance is an issue then shell scripts should be avoided in general.
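For maintainability, the threshold logic can also be wrapped in a function that takes the load and CPU count as arguments, which makes it testable without reading /proc. This is only a sketch: the name check_load is hypothetical, and the thresholds are the ones used in the script above.

```shell
#!/bin/sh
# check_load LOAD NUMCPUS  (hypothetical helper, not part of the original script)
# Returns 0 when the per-CPU load is fine, 1 at the warning threshold, 2 at the failure threshold.
check_load() {
    awk -v loadavg="$1" -v numcpus="$2" \
        -v failthold=0.8 -v warnthold=0.7 \
        'BEGIN {
            ratio = loadavg / numcpus
            if (ratio >= failthold) exit 2
            if (ratio >= warnthold) exit 1
            exit 0
        }'
}
```

In the real script it would be called as check_load "$five" "$cpus", with both values obtained as shown above.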

Related

Passing args to defined bash functions through GNU parallel

Let me show you a snippet of my Bash script and how I try to run parallel:
parallel -a "$file" \
-k \
-j8 \
--block 100M \
--pipepart \
--bar \
--will-cite \
_fix_col_number {} | _unify_null_value {} >> "$OUTPUT_DIR/$new_filename"
So, I am basically trying to process each line in a file in parallel using Bash functions defined inside my script. However, I am not sure how to pass each line to my defined functions "_fix_col_number" and "_unify_null_value". Whatever I do, nothing gets passed to the functions.
I am exporting the functions like this in my script:
declare -x NUM_OF_COLUMNS
export -f _fix_col_number
export -f _add_tabs
export -f _unify_null_value
The mentioned functions are:
_unify_null_value()
{
_string=$(echo "$1" | perl -0777 -pe "s/(?<=\t)\.(?=\s)//g" | \
perl -0777 -pe "s/(?<=\t)NA(?=\s)//g" | \
perl -0777 -pe "s/(?<=\t)No Info(?=\s)//g")
echo "$_string"
}
_add_tabs()
{
_tabs=""
for (( c=1; c<=$1; c++ ))
do
_tabs="$_tabs\t"
done
echo -e "$_tabs"
}
_fix_col_number()
{
line_cols=$(echo "$1" | awk -F"\t" '{ print NF }')
if [[ $line_cols -gt $NUM_OF_COLUMNS ]]; then
new_line=$(echo "$1" | cut -f1-"$NUM_OF_COLUMNS")
echo -e "$new_line\n"
elif [[ $line_cols -lt $NUM_OF_COLUMNS ]]; then
missing_columns=$(( NUM_OF_COLUMNS - line_cols ))
new_line="${1//$'\n'/}$(_add_tabs $missing_columns)"
echo -e "$new_line\n"
else
echo -e "$1"
fi
}
I tried removing {} from parallel. Not really sure what I am doing wrong.
I see two problems in the invocation plus additional problems with the functions:
With --pipepart there are no arguments. The blocks read from -a file are passed over stdin to your functions. Try the following commands to confirm this:
seq 9 > file
parallel -a file --pipepart echo
parallel -a file --pipepart cat
Theoretically, you could read stdin into a variable and pass that variable to your functions, ...
parallel -a file --pipepart 'b=$(cat); someFunction "$b"'
... but I wouldn't recommend it, especially since your blocks are 100MB each.
Bash interprets the pipe | in your command before parallel even sees it. To run a pipe, quote the entire command:
parallel ... 'b=$(cat); _fix_col_number "$b" | _unify_null_value "$b"' >> ...
_fix_col_number seems to assume its argument to be a single line, but receives 100MB blocks instead.
_unify_null_value does not read stdin, so _fix_col_number {} | _unify_null_value {} is equivalent to _unify_null_value {}.
That being said, your functions can be drastically improved. They start a lot of processes which becomes incredibly expensive for larger files. You can do some trivial improvements like combining perl ... | perl ... | perl ... into a single perl. Likewise, instead of storing everything in variables, you can process stdin directly: Just use f() { cmd1 | cmd2; } instead of f() { var=$(echo "$1" | cmd1); var=$(echo "$var" | cmd2); echo "$var"; }.
However, don't waste time on small things like these. A complete rewrite in sed, awk, or perl is easy and should outperform every optimization on the existing functions.
Try
n="INSERT NUMBER OF COLUMNS HERE"
tabs=$(perl -e "print \"\t\" x $n")
perl -pe "s/\r?\$/$tabs/; s/\t\K(\.|NA|No Info)(?=\s)//g;" file |
cut -f "1-$n"
If you still find this too slow, leave out file; pack the command into a function, export that function and then call parallel -a file -k --pipepart nameOfTheFunction. The option --block is not necessary as pipepart will evenly split the input based on the number of jobs (can be specified with -j).
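As an illustration of the "pack the command into a function" step, here is a rough awk variant of the same idea (awk swapped in for the perl one-liner, so it is not the command above). The function name fix_table and its single argument, the number of columns, are assumptions made for this sketch. It reads stdin, pads or truncates every line to n tab-separated columns, and blanks the ".", "NA" and "No Info" placeholders in every column after the first.

```shell
#!/bin/sh
# fix_table N  (hypothetical name): filter stdin to exactly N tab-separated columns.
fix_table() {
    awk -F'\t' -v OFS='\t' -v n="$1" '{
        sub(/\r$/, "")                      # drop a trailing carriage return, if any
        for (i = 1; i <= n; i++) {
            v = (i <= NF) ? $i : ""         # missing columns become empty
            # placeholders are only blanked when preceded by a tab, i.e. not in column 1
            if (i > 1 && (v == "." || v == "NA" || v == "No Info")) v = ""
            printf "%s%s", v, (i < n ? OFS : ORS)
        }
    }'
}
```

In bash, and assuming GNU parallel is installed, the function could then be exported with export -f fix_table and run as parallel -a file -k --pipepart fix_table 3.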

Shell Script loop is executing multiple times

I have a log file. I run tail -f on it and grep each new line as it arrives. My problem is that the loop executes multiple times. Here is my script:
AuditTypeID=$""
QueryResult=$""
tail -n 0 -F hive-server2.log \
| while read LINE
do
if [ `echo $LINE | grep -c "select *" ` -gt 0 ]
then
AuditTypeID=15
QueryResult=$(
awk '
BEGIN{ print "" }
/Executing command\(queryId/{ sub(/.*queryId=[^[:space:]]+: /,""); q=$0 }
/s3:\/\//{ print "," q }
' OFS=',' hive-server2.log \
| sed -n \$p
)
elif [ `echo $LINE | grep -c 'select count' ` -gt 0 ]
then
AuditTypeID=22
QueryResult="$(
grep -oE 'select count\(.\) from [a-zA-Z][a-zA-Z0-9]*' hive-server2.log \
| sed -n \$p
)"
fi
user=$(
cat hive-server2.log \
| grep user \
| awk -F "[. ]" '{print "," $(NF-1)}' \
| tr -d ',' \
| tr -d 'UTC'
)
Additional_Info=$(
echo -e "{\"user\":\"""${user}""\", \"query\":\"""${QueryResult}""\",\"s3Path\":\"""${s3}""\"}"
)
echo -e "$Additional_Info" > op.json
for file in /var/log/hive/op.json
do
boto-rsync $file s3://hive-log/log/script/$file.$current_time
done
done
It will filter the operations based on the keyword. For some reason it is executing multiple times. I need to save the output for only one instance and any help to simplify the logic is appreciated.
The first thing I see in your script is that the first awk scriptlet inside the if statement reparses the whole of hive-server2.log (which is probably racy, because you are tailing that same file into your script while it is still growing). This reparsing of the log is a common theme in the script, and I think it is the root cause of the issue.
One simplification ;) readily apparent is removal of the elif code -- it will never run because /select count/ will be matched by the if statement's /select */.
To truly take a stab at simplifying this, my strategy would be to rewrite the whole of this in awk. There is nothing that you are doing here that is beyond awk's built-in capabilities -- and awk can fire off external shell commands as easily as sh. The awk implementation will also likely be much faster.
I started trying to do this translation, but with the way you are specifying multiple reparsing of hive-server2.log, I frankly got lost. Having a bit of input and intended output would help here... Please post hive-server2.log and your expected output.
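Without sample input, a full rewrite is guesswork, but the classification step can at least be sketched on its own. The function below looks only at the line just read (no reparsing of the log) and tests the more specific pattern first so that the "select count" branch is reachable. The name classify_query is hypothetical, the IDs 15 and 22 are taken from the script above, and treating any other "select" as type 15 is an assumption about the intent.

```shell
#!/bin/sh
# classify_query LINE  (hypothetical helper)
# Prints the audit type ID for a log line, or returns 1 if the line is not a query of interest.
classify_query() {
    case $1 in
        *"select count"*) echo 22 ;;   # most specific pattern first
        *"select"*)       echo 15 ;;   # any other select statement
        *)                return 1 ;;  # not a line we care about
    esac
}
```

Inside the while read LINE loop this would be used as AuditTypeID=$(classify_query "$LINE") || continue, so lines that match neither pattern are skipped instead of triggering another upload.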

How do I detect a failed subprocess in a bash read statement?

In bash we can set an environment variable from a sequence of commands using read and a pipe to a subprocess. But I'm having trouble detecting errors in my processing in one edge case - a part of the subprocess pipeline producing some output before erroring.
A simplified example which takes an input file, looks for a line starting with "foo" and sets var to the first word on that line is:
set -e
set -o pipefail
set -o nounset
die() {
echo "DEAD: $1" >&2
exit 1
}
read -r var rest < <( \
cat data.txt \
| grep foo \
|| die "PIPELINE" \
) || die "OUTER"
echo "var=$var"
Running this with data.txt like
blah
zap foo awesome
bang foo
will output
var=zap
Running this on a data.txt file that doesn't contain foo outputs (to stderr)
DEAD: PIPELINE
DEAD: OUTER
This is all as expected.
We can introduce another no-op stage like cat at the end of the process
...
read -r var rest < <( \
cat data.txt \
| grep foo \
| cat \
|| die "PIPELINE" \
) || die "OUTER"
...
and everything continues to work.
But if the additional stage is paste -s -d' ' and the input does not contain "foo" the output is
var=
DEAD: PIPELINE
Which seems to show that the pipeline errors, but read succeeds with an empty line. (It looks like paste -s -d' ' outputs a line of output even when its input is empty.)
Is there a simple way to detect this failure of the pipeline, and cause the main script to error out?
I guess I could check that the variable is not empty - but this is a simplified version - I'm actually using sed and paste to join multiple lines to set multiple variables, like
read -r v1 v2 v3 rest < <( \
cat data.txt \
| grep "^foo=" \
| sed -e 's/foo=//' \
| paste -s -d' ' \
|| die "PIPELINE"
) || die "OUTER"
You could use another grep to see if the output of paste contained something:
read -r var rest < <( \
cat data.txt \
| grep foo \
| paste -s -d' ' \
| grep . \
|| die "PIPELINE" \
) || die "OUTER"
In the end I went with two different solutions depending on the context.
The first was to pipe the results to a temporary file. This will process the entire file before performing the read, and thus any failures in the pipe will cause the script to fail.
cat data.txt \
| grep "^foo=" \
| sed -e 's/foo=//' \
| paste -s -d' ' \
> "$TMP/result.txt" \
|| die "PIPELINE"
read -r var rest < $TMP/result.txt || die "OUTER"
The second was to just test that the variables were set. While this meant
there was a bunch of duplication that I wanted to avoid, it seemed the most bullet-proof solution.
read -r var rest < <( cat data.txt \
| grep "^foo=" \
| sed -e 's/foo=//' \
| paste -s -d' ' \
|| die "PIPELINE"
) || die "OUTER"
[ -n "$var" ] || die "VARIABLE NOT SET"
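A variation on the temporary-file idea that needs no file is to capture the output of the stage that can fail in a variable first, check its exit status, and only then run paste and read. The sketch below is written for plain sh and does not rely on set -e or pipefail; the function name first_foo_word and the variable names are made up for illustration.

```shell
#!/bin/sh
# first_foo_word FILE  (hypothetical helper)
# Prints the first word of the joined "foo" lines, or returns 1 if grep finds nothing.
first_foo_word() {
    # Run the stage that may fail on its own so its exit status is visible.
    matches=$(grep foo "$1") || return 1
    # paste only runs after grep succeeded, so it cannot mask the failure.
    joined=$(printf '%s\n' "$matches" | paste -s -d' ' -)
    # A here-document feeds read in the current shell, so var and rest stay set.
    read -r var rest <<EOF
$joined
EOF
    printf '%s\n' "$var"
}
```

The same pattern extends to several variables: read -r v1 v2 v3 rest works unchanged on the joined line.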

Need help escaping from awk quotations in bash script

I have an alias in my bashrc file that outputs current folder contents and system available storage, updated continuously by the watch function.
alias wtch='watch -n 0 -t "du -sch * -B 1000000 2>/dev/null | sort -h && df -h -B 1000000| head -2 | awk '{print \$4}'"'
The string worked fine until I put in the awk part. I know I need to escape the single quotes and the $4 while staying inside the double quotes, but I haven't been able to get it to work. What am I doing wrong?
This is the error I get
-bash: alias: $4}": not found
Since the quoting for the alias is making it tough, you could just make it a function instead:
wtch() {
watch -n 0 -t "du -sch * -B 1000000 2>/dev/null | sort -h && df -h -B 1000000| head -2 | awk '{print \$4}'"
}
This is a lot like issue 2 in the BashFAQ/050
Also, a minor thing but you can skip the head process at the end and just have awk do it, even exiting after the second row like
wtch() {
watch -n 0 -t "du -sch * -B 1000000 2>/dev/null | sort -h && df -h -B 1000000| awk '{print \$4} NR >= 2 {exit}'"
}
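One detail worth checking in any of these variants: the command is handed to watch as a double-quoted string, so the shell expands $4 while building that string unless the dollar sign is escaped. Inside a function, $4 is the function's fourth argument, which is normally empty, and awk then prints whole lines. A small sketch of the difference, using made-up argument values and sh -c in place of watch:

```shell
#!/bin/sh
# Simulate being called with four arguments (illustrative values only).
set -- one two three FOUR

# Unescaped: the shell substitutes its own $4 while building the string.
unescaped="awk '{print $4}'"
# Escaped: a literal $4 survives for awk to interpret later.
escaped="awk '{print \$4}'"

printf '%s\n' "$unescaped"              # prints: awk '{print FOUR}'
printf 'a b c d\n' | sh -c "$escaped"   # prints: d
```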
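In this case you can use cut instead of awk, which sidesteps the nested single quotes. Be aware that cut treats every single space as a delimiter, so on df's space-padded columns -f4 will not necessarily select the same column as awk's $4.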
alias wtch="watch -n 0 -t 'du -sch * -B 1000000 2>/dev/null | sort -h && df -h -B 1000000| head -2 | cut -d\ -f4'"
Explaining cut:
-d option defines a delimiter
-d\ means that my delimiter is space
-f selects a column
-f4 gives you the fourth column

Bash Xargs Sleep (Multiple Command Line Arguments)

OK, so I have the following script that updates Route 53 DNS entries. Unfortunately there is a limit to the number of calls per second you can make, so I need to make the final xargs command sleep for about a second between each iteration.
I've tried a couple of things like ' {../cli53 blah; sleep 10; } ' and I can't seem to get it to work. Does anyone have any suggestions, please?
#!/bin/bash
set root='dirname $0'
ec2-describe-instances -O ******* -W ******* --region eu-west-1 |
perl -ne '/^INSTANCE\s+(i-\S+).*?(\S+\.amazonaws\.com)/
and do { $dns = $2; print "$1 $dns\n" }; /^TAG.+\sName\s+(\S+)/
and print "$1 $dns\n"' |
perl -ane 'print "$F[0] CNAME $F[1] --replace\n"' |
grep -v 'i-' | xargs --verbose -n 4 /usr/local/bin/cli53 rrcreate -x 5 contoso.com
Edit: Thanks Etan for the Answer. Here is my solution for anyone else that needs it:
I had to include the -I %variable% switch in the xargs statement as well, to make sure that the input was passed as parameters to cli53, but it all looks to be working nicely now.
#!/bin/bash
set root='dirname $0'
ec2-describe-instances -O ******* -W ******* --region eu-west-1 |
perl -ne '/^INSTANCE\s+(i-\S+).*?(\S+\.amazonaws\.com)/
and do { $dns = $2; print "$1 $dns\n" }; /^TAG.+\sName\s+(\S+)/
and print "$1 $dns\n"' |
perl -ane 'print "$F[0] CNAME $F[1] --replace\n"' |
grep -v '^i-' |
xargs --verbose -n 4 -I myvar /bin/sh -c '{ /usr/local/bin/cli53 rrcreate -x 5 contoso.com 'myvar'; sleep 1; printf "\n\n"; }'
The simplest solution would be to simply put the cli53 and sleep calls in a script and use xargs to execute the script.
If you don't want to do that you should be able to do what you were trying to do with this:
... | xargs ... /bin/sh -c '{ /usr/local/bin/cli53 ... "$@"; sleep 10; }' -
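To see how the arguments flow through sh -c, here is a dry-run sketch that echoes the command instead of calling cli53. The trailing - fills $0, so the words supplied by xargs arrive as $1 onwards and are available as "$@" or $*. The function name throttled and the delay variable are hypothetical, and the record values in the usage note are placeholders.

```shell
#!/bin/sh
# throttled DELAY  (hypothetical helper)
# Reads words on stdin, groups them four at a time, and pauses DELAY seconds after each group.
throttled() {
    # delay is passed through the environment so the inner shell can read it
    delay=$1 xargs -n 4 sh -c '
        echo "would run: cli53 rrcreate -x 5 contoso.com $*"
        sleep "$delay"
    ' -
}
```

For example, printf '%s\n' web1 CNAME host1 --replace web2 CNAME host2 --replace | throttled 1 prints two "would run" lines, one second apart. Replacing the echo with the real cli53 call and "$@" gives the throttled version.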
