Bash Script is super slow - performance

I'm updating an old script to parse ARP data and get useful information out of it. We added a new router, and while I can pull the ARP data out of it, it's in a new format. I've got a file "zTempMonth" which is all the ARP data from both sets of routers, and I need to compile it down into a normalized format. The lines of code below do what I need them to logically, but they're extremely slow: it will take days to run these loops, where previously the script took 20-30 minutes. Is there a way to speed this up, or identify what's slowing it down?
Thank you in advance,
echo "Parsing zTempMonth"
while read LINE
do
    wc=`echo $LINE | wc -w`
    if [[ $wc -eq "6" ]]; then
        true
        out=$(echo $LINE | awk '{ print $2 " " $4 " " $6}')
        echo $out >> zTempMonth.tmp
    else
        false
    fi
    if [[ $wc -eq "4" ]]; then
        true
        out=$(echo $LINE | awk '{ print $1 " " $3 " " $4}')
        echo $out >> zTempMonth.tmp
    else
        false
    fi
done < zTempMonth

1. while read loops are slow.
2. Subshells in a loop are slow.
3. >> (open(f, 'a')) calls in a loop are slow.
You could speed this up and remain in pure bash, just by losing #2 and #3:
#!/usr/bin/env bash
while read -r -a line; do
    case "${#line[@]}" in
        6) printf '%s %s %s\n' "${line[1]}" "${line[3]}" "${line[5]}";;
        4) printf '%s %s %s\n' "${line[0]}" "${line[2]}" "${line[3]}";;
    esac
done < zTempMonth >> zTempMonth.tmp
But if there are more than a few lines, this will still be slower than pure awk. Consider an awk script as simple as this:
BEGIN {
    print "Parsing zTempMonth"
}
NF == 6 {
    print $2 " " $4 " " $6
}
NF == 4 {
    print $1 " " $3 " " $4
}
You could execute it like this:
awk -f thatAwkScript zTempMonth >> zTempMonth.tmp
to get the same append approach as your current script.
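One behavioral difference to be aware of: in the original script the "Parsing zTempMonth" message went to the terminal, but with the redirection above the BEGIN print lands in zTempMonth.tmp. If that matters, send it to stderr instead (the /dev/stderr name is understood by gawk, mawk, and most awks on modern systems):

BEGIN {
    # status message to stderr so it stays out of zTempMonth.tmp
    print "Parsing zTempMonth" > "/dev/stderr"
}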

When writing shell scripts, it’s almost always better to call a function directly rather than using a subshell to call the function. The usual convention that I’ve seen is to echo the return value of the function and capture that output using a subshell. For example:
#!/bin/bash
function get_path() {
    echo "/path/to/something"
}
mypath="$(get_path)"
This works fine, but there is a significant speed overhead to using a subshell and there is a much faster alternative. Instead, you can just have a convention wherein a particular variable is always the return value of the function (I use retval). This has the added benefit of also allowing you to return arrays from your functions.
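Here is a minimal sketch of that convention, including the array case (the names are illustrative):

#!/bin/bash
function get_path() {
    retval="/path/to/something"
}
function get_paths() {
    retval=("/path/one" "/path/two")   # arrays work too
}

get_path
mypath="$retval"

get_paths
for p in "${retval[@]}"; do   # no subshell needed to get the array back
    echo "$p"
done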
If you don’t know what a subshell is, for the purposes of this blog post, a subshell is another bash shell that is spawned whenever you use $() or `` and is used to execute the code you put inside.
I did some simple testing to allow you to observe the overhead. For two functionally equivalent scripts:
This one uses a subshell:
#!/bin/bash
function a() {
    echo hello
}
for (( i = 0; i < 10000; i++ )); do
    echo "$(a)"
done
This one uses a variable:
#!/bin/bash
function a() {
    retval="hello"
}
for (( i = 0; i < 10000; i++ )); do
    a
    echo "$retval"
done
The speed difference between these two is noticeable and significant.
$ for i in variable subshell; do
> echo -e "\n$i"; time ./$i > /dev/null
> done
variable
real 0m0.367s
user 0m0.346s
sys 0m0.015s
subshell
real 0m11.937s
user 0m3.121s
sys 0m0.359s
As you can see, the variable version executes in 0.367 seconds, while the subshell version takes a full 11.937 seconds!
Source: http://rus.har.mn/blog/2010-07-05/subshells/

Related

Want to respond to the output of a script, but limit responses to every second; POSIX

I am trying to write a script (for dash, so I want it POSIX compliant). What I want the script to do is print the name of my currently connected SSID (in JSON format).
This info is to be processed by my status bar (polybar, waybar, and i3status). I would like to print this information as few times as possible, and immediately on network change.
I have found the command ip monitor useful, since it prints lines whenever there is a network change. However, it is very verbose. (I have similar scripts set up; for pulseaudio, for example, I use grep to filter unwanted lines.) I do not necessarily want to filter the lines, since filtering doesn't really reduce the number of times my script prints.
I managed to write a script in bash, but the redirection is not POSIX compliant, so I can't switch to dash.
This is the code I currently have, that depends on bash's redirection
#!/bin/bash
_int="wifi"

get_text () {
    _ssid="$(iwctl station "${_int}" get-networks | awk '/>/ {print $2}' 2>/dev/null)"
    if [ -z "${_ssid}" ] ; then
        echo '{ "format": "", "mute": true, "prefix": ""}'
    else
        echo "{ \"format\":\"${_ssid}\", \"mute\": false, \"prefix\":\" \" }"
    fi
}

while : ; do
    get_text
    _time="$(date +%s)"
    while read line; do
        if [ "$_time" != "$(date +%s)" ] ; then
            break
        fi
    done < <(ip monitor)
done
If I do this, it is POSIX compliant, but the output is printed ~20 times (the number is not consistent)
#!/bin/dash
_int="wifi"

get_text () {
    _ssid="$(iwctl station "${_int}" get-networks | awk '/>/ {print $2}' 2>/dev/null)"
    if [ -z "${_ssid}" ] ; then
        echo '{ "format": "", "mute": true, "prefix": ""}'
    else
        echo "{ \"format\":\"${_ssid}\", \"mute\": false, \"prefix\":\" \" }"
    fi
}

ip monitor | while read line; do
    if echo $line | grep -q 'wifi' ; then
        get_text
    fi
done
EDIT: I found a solution as I was typing my question. My loop is now the following, which re-runs get_text only when matching lines arrive in different seconds (ip monitor emits a burst of lines for a single network event, which is what caused the repeated prints):
get_text
ip monitor | while read line; do
    if echo $line | grep -q "${_int}" && [ "$_time" != "$(date +%s)" ]; then
        get_text
        _time="$(date +%s)"
    fi
done
Do note that ip monitor does print output that does not indicate a network connection change, but I can handle a couple of prints every minute. What I am afraid of is closely spaced outputs, since those cause the infobars to be redrawn. I also have a python script that parses the arguments, so many outputs in a short time multiply my overhead.
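For reference, here is everything assembled into a single /bin/sh script (a sketch; note that _time is initialized before the loop, which my snippet above glossed over):

#!/bin/sh
_int="wifi"

get_text () {
    _ssid="$(iwctl station "${_int}" get-networks | awk '/>/ {print $2}' 2>/dev/null)"
    if [ -z "${_ssid}" ] ; then
        echo '{ "format": "", "mute": true, "prefix": ""}'
    else
        echo "{ \"format\":\"${_ssid}\", \"mute\": false, \"prefix\":\" \" }"
    fi
}

# Print once at startup, then at most once per second on matching changes.
get_text
_time="$(date +%s)"
ip monitor | while read line; do
    if echo "$line" | grep -q "${_int}" && [ "$_time" != "$(date +%s)" ]; then
        get_text
        _time="$(date +%s)"
    fi
done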

Quickly generating test data (UUIDs, large random numbers, etc) with bash scripting

I have a small bash script with a function containing a case statement which echoes random data if the 1st argument matches the case parameter.
Code is as follows:
#!/usr/bin/env bash

AC='auto-increment'
UUID='uuid'
LAT='lat'
LONG='long'
IP='ip'

generate_mock_data() {
    # ARGS: $1 - data type, $2 - loop index
    case ${1} in
        ${AC})
            echo ${2} ;;
        ${UUID})
            uuidgen ;;
        ${LAT})
            echo $((RANDOM % 180 - 90)).$(shuf -i1000000-9999999 -n1) ;;
        ${LONG})
            echo $((RANDOM % 360 - 180)).$(shuf -i1000000-9999999 -n1) ;;
        ${IP})
            echo $((RANDOM%256)).$((RANDOM%256)).$((RANDOM%256)).$((RANDOM%256)) ;;
    esac
}
# Writing data to file
headers=('auto-increment' 'uuid' 'lat' 'long' 'ip')
for i in {1..2500}; do
    for header in "${headers[@]}"; do
        echo -n $(generate_mock_data ${header} ${i}),
    done
    echo # New line
done >> file.csv
However, execution time is incredibly slow for just 2500 rows:
real 0m8.876s
user 0m0.576s
sys 0m0.868s
What am I doing wrong? Is there anything I can do to speed up the process? Or is bash not the right language for this type of operation?
I also tried profiling the entire script but after looking at the logs I didn't notice any significant bottlenecks.
It seems you can generate a UUID pretty fast with Python, so if you just execute Python once to generate 2,500 UUIDs, and you aren't a Python programmer (like me ;-), then you can patch them up with awk:
python -c 'import uuid; print("\n".join([str(uuid.uuid4()).upper() for x in range(2500)]))' |
awk '{
    lat=-90+180*rand();
    lon=-180+360*rand();
    ip=int(256*rand()) "." int(256*rand()) "." int(256*rand()) "." int(256*rand());
    print NR,$0,lat,lon,ip
}' OFS=,
This takes 0.06s on my iMac.
OFS is the "Output Field Separator"
NR is the line number
$0 means "the whole input line"
You can try the Python on its own, like this:
python -c 'import uuid; print("\n".join([str(uuid.uuid4()).upper() for x in range(2500)]))'
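One caveat worth adding (my note, not part of the original answer): in most awks, including gawk, rand() starts from a fixed seed, so the pipeline above produces identical "random" columns on every run. Seeding from the time of day in a BEGIN block makes runs differ:

python -c 'import uuid; print("\n".join([str(uuid.uuid4()).upper() for x in range(2500)]))' |
awk 'BEGIN { srand() }   # srand() with no argument seeds from the clock
{
    lat=-90+180*rand();
    lon=-180+360*rand();
    ip=int(256*rand()) "." int(256*rand()) "." int(256*rand()) "." int(256*rand());
    print NR,$0,lat,lon,ip
}' OFS=,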
Is Shell The Right Tool?
Not really, but if you avoid bad practices, you can make something relatively fast.
With ksh93, the below reliably runs in 0.5-0.6s wall-clock; with bash, 1.2-1.3s.
What Does It Look Like?
#!/usr/bin/env bash
# Comment these two lines if running with ksh93, obviously. :)
[ -z "$BASH_VERSION" ] && { echo "This requires bash 4.1 or newer" >&2; exit 1; }
[[ $BASH_VERSION = [123].* ]] && { echo "This requires bash 4.1 or newer" >&2; exit 1; }

uuid_stream() {
    python -c '
import uuid
try:
    while True:
        print(str(uuid.uuid4()).upper())
except IOError:
    pass # probably an EPIPE because we were closed.
'
}

# generate a file descriptor that emits a shuffled stream of integers
exec {large_int_fd}< <(while shuf -r -i1000000-9999999; do :; done)

# generate a file descriptor that emits an endless stream of UUIDs
exec {uuid_fd}< <(uuid_stream)

generate_mock_data() {
    typeset val
    case $1 in
        auto-increment) val="$2" ;;
        uuid)  IFS= read -r val <&"$uuid_fd" || exit ;;
        lat)   IFS= read -r val <&"$large_int_fd" || exit
               val="$((RANDOM % 180 - 90)).$val" ;;
        long)  IFS= read -r val <&"$large_int_fd" || exit
               val="$((RANDOM % 360 - 180)).$val" ;;
        ip)    val="$((RANDOM%256)).$((RANDOM%256)).$((RANDOM%256)).$((RANDOM%256))" ;;
    esac
    printf '%s' "$val"
}

for ((i=0; i<2500; i++)); do
    for header in auto-increment uuid lat long ip; do
        generate_mock_data "$header" "$i"
        printf ,
    done
    echo
done > file.csv
What's Different?
There are no command substitutions inside the inner loop. That means we don't ever use $() or any synonym for same. Each of these involves a fork() -- creating a new OS-level copy of the process -- and a wait(), with a bunch of FIFO magic to capture our output.
There are no external commands inside the inner loop. Any external command is even worse than a command substitution: They require a fork, and then additionally require an execve, with the dynamic linker and loader being invoked to pull in all the library dependencies for whichever external command is being run.
Because there is no longer a command substitution stripping trailing newlines, the function simply doesn't emit them: printf '%s' writes the value with no newline at all.
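A side note on the exec {name}< <(...) lines above: this is bash 4.1's automatic file-descriptor allocation; bash picks a free descriptor (10 or above) and stores its number in the named variable. A minimal illustration:

exec {fd}< <(printf '%s\n' one two)   # bash allocates a descriptor, e.g. 10
IFS= read -r first <&"$fd"            # read the first line from it
echo "fd=$fd first=$first"            # -> fd=10 first=one
exec {fd}<&-                          # close it when done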

String together awk commands

I'm writing a script that searches a file, gets info that it then stores into variables, and executes a program that I made using those variables as data. I actually have all of that working, but I need to take it a step further:
What I currently have is
#!/bin/sh
START=0
END=9
LOOP=10
PASS=0

for i in $(seq 0 $LOOP)
do
    LEN=$(awk '/Len =/ { print $3; exit;}' ../../Tests/shabittestvectors/SHA1ShortMsg.rsp)
    MSG=$(awk '/Msg =/ { print $3; exit; }' ../../Tests/shabittestvectors/SHA1ShortMsg.rsp)
    MD=$(awk '/MD =/ { print $3; exit; }' ../../Tests/shabittestvectors/SHA1ShortMsg.rsp)
    echo $LEN
    echo $MSG
    MD=${MD:0:-1}
    CIPHER=$(./cyassl hash -sha -i $MSG -l $LEN)
    echo $MD
    echo $CIPHER
    if [ $MD == $CIPHER ]; then
        echo "PASSED"
        PASS=$[PASS + 1]
        echo $PASS
    fi
done

if [ $PASS == $[LOOP+1] ]; then
    echo "All Tests Successful"
fi
And the input file looks like this:
Len = 0
Msg = 00
MD = da39a3ee5e6b4b0d3255bfef95601890afd80709

Len = 1
Msg = 00
MD = bb6b3e18f0115b57925241676f5b1ae88747b08a

Len = 2
Msg = 40
MD = ec6b39952e1a3ec3ab3507185cf756181c84bbe2
All the program does right now is read the first instances of the variables and loop around there. I'm hoping to use START and END to determine the lines at which it checks the file, and then increment them every time it loops to obtain the other instances of the variable names, but all of my attempts have been unsuccessful so far. Any ideas?
EDIT: Output should look something like this, provided my program "./cyassl" works as it should:
0
00
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709
PASSED
1
00
bb6b3e18f0115b57925241676f5b1ae88747b08a
bb6b3e18f0115b57925241676f5b1ae88747b08a
PASSED
2
40
ec6b39952e1a3ec3ab3507185cf756181c84bbe2
ec6b39952e1a3ec3ab3507185cf756181c84bbe2
PASSED
etc.
There's no need to make multiple passes on the input file.
#!/bin/sh

exec < ../../Tests/shabittestvectors/SHA1ShortMsg.rsp
status=pass
awk '{print $3,$6,$9}' RS= | {
    while read len msg md; do
        if test "$(./cyassl hash -sha -i $msg -l $len)" = "$md"; then
            echo passed
        else
            status=fail
        fi
    done
    test "$status" = pass && echo all tests passed
}
The awk will read from stdin (which the exec redirects from the file; personally I would skip that line and have the caller direct input appropriately) and splits the input into records of one paragraph each. A "paragraph" here means that the records are separated by blank lines (the lines must be truly blank, and cannot contain whitespace). Awk then parses each record and prints the 3rd, 6th, and 9th fields on a single line. This is a bit fragile, but for the shown input those fields represent length, message, and MD hash, respectively. All the awk is doing is rearranging the input so that it is one record per line.

Once the data is in a more readable format, a subshell reads the data one line at a time, parsing it into the variables named "len", "msg", and "md". The do loop processes once per line of input, spewing the rather verbose message "passed" with each test it runs (I would remove that, but retained it here for consistency with the original script), and setting the status if any tests fail. The braces are necessary to ensure that the value of the variable status is retained after the do loop terminates.
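If paragraph mode is unfamiliar, here is the RS= trick demonstrated on its own (a toy example, separate from the answer above):

$ printf 'a 1\nb 2\nc 3\n\nd 4\ne 5\nf 6\n' | awk '{print $2,$4,$6}' RS=
1 2 3
4 5 6

Each blank-line-separated paragraph becomes one record, and its six whitespace-separated words become fields $1 through $6.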
The following code,
inputfile="../../Tests/shabittestvectors/SHA1ShortMsg.rsp"
while read -r len msg md
do
    echo got: LEN:$len MSG:$msg MD:$md
    #cypher=$(./cyassl hash -sha -i $msg -l $len)
    #continue as you wish
done < <(perl -00 -F'[\s=]+|\n' -lE 'say qq{$F[1] $F[3] $F[5]}' < "$inputfile")
for your input data, produces:
got: LEN:0 MSG:00 MD:da39a3ee5e6b4b0d3255bfef95601890afd80709
got: LEN:1 MSG:00 MD:bb6b3e18f0115b57925241676f5b1ae88747b08a
got: LEN:2 MSG:40 MD:ec6b39952e1a3ec3ab3507185cf756181c84bbe2
If your input data is in order you can have this with a simplified bash:
#!/bin/bash

LOOP=10
PASS=0
FILE='../../Tests/shabittestvectors/SHA1ShortMsg.rsp'

for (( I = 1; I <= LOOP; ++I )); do
    read -r LEN && read -r MSG && read -r MD || break
    echo "$LEN"
    echo "$MSG"
    MD=${MD:0:-1}
    CIPHER=$(exec ./cyassl hash -sha -i "$MSG" -l "$LEN")
    echo "$MD"
    echo "$CIPHER"
    if [[ $MD == "$CIPHER" ]]; then
        echo "PASSED"
        (( ++PASS ))
    fi
done < <(exec awk '/Len =|Msg =|MD =/ { print $3 }' "$FILE")

[[ PASS -eq LOOP ]] && echo "All Tests Successful."
Just make sure you don't run it as sh (e.g. sh script.sh); run it with bash script.sh instead.

bash: how does float arithmetic work?

I'm gonna tear my hair out: I have this script:
#!/bin/bash
if [[ $# -eq 2 ]]
then
    total=0
    IFS=' '
    while read one two; do
        total=$((total+two))
    done < $2
    echo "Total: $total"
fi
It's supposed to add up my gas receipts, which I have saved in a file in this format:
3/9/13 21.76
output:
./getgas: line 9: 21.76: syntax error: invalid arithmetic operator (error token is ".76")
I read online that it's possible to do float math in bash, and I found an example script that works, containing:
function float_eval()
{
    local stat=0
    local result=0.0
    if [[ $# -gt 0 ]]; then
        result=$(echo "scale=$float_scale; $*" | bc -q 2>/dev/null)
        stat=$?
        if [[ $stat -eq 0 && -z "$result" ]]; then stat=1; fi
    fi
    echo $result
    return $stat
}
which looks awesome, and runs no problem
WTF is going on here? I can easily do this in C, but this crap is making me mad.
EDIT: I don't know anything about awk. It looks promising, but I don't even know how to run those one-liners you guys posted.
awk '{ sum += $2 } END { printf("Total: %.2f\n", sum); }' $2
Add up column 2 (that's the $2 in the awk script) of the file named by shell script argument $2 (rife with opportunities for confusion) and print the result at the end.
I don't [know] anything about awk. It looks promising but I don't even know how to run those one-liners you guys posted.
In the context of your script:
#!/bin/bash
if [[ $# -eq 2 ]]
then
    awk '{ sum += $2 } END { printf("Total: %.2f\n", sum); }' $2
else
    echo "Usage: $0 arg1 receipts-file" >&2; exit 1
fi
Or just write it on the command line, substituting the receipts file name for the $2 after the awk command. Or leave that blank and redirect from the file. Or type the dates and values in. Or, …
Your script demands two arguments, but doesn't use the first one, which is a bit puzzling.
As noted in the comments, you could simplify that to:
#!/bin/bash
exec awk '{ sum += $2 } END { printf("Total: %.2f\n", sum) }' "$@"
Or even use the shebang to full power:
#!/usr/bin/awk -f
{ sum += $2 }
END { printf("Total: %.2f\n", sum) }
The kernel will execute awk for you, and that's the awk script written out as a two-line program. Of course, if awk is in /bin/awk, then you have to fix the shebang line; the shell, by contrast, searches many places for awk and will probably find it, so there are advantages to sticking with a shell script. Both these revisions simply sum what's on standard input if no files are specified, or what is in all the files specified on the command line.
In bash you can only operate on integers. The example script you posted uses bc, which is an arbitrary-precision calculator included with most UNIX-like OSes. The script prepares an expression and pipes it to bc (the initial scale=... expression configures the number of decimal digits bc should display).
A simplified example would be:
echo -e 'scale=2\n1.234+5.67\nquit' | bc
You could also use awk:
awk 'BEGIN{print 1.234+5.67}'
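To tie this back to your script: here is your original loop with the addition handed off to bc (a sketch that keeps your structure; it assumes bc is installed, which it almost always is):

#!/bin/bash
if [[ $# -eq 2 ]]
then
    total=0
    while read -r one two; do
        # bc performs the decimal addition that $(( )) cannot
        total=$(echo "$total + $two" | bc)
    done < "$2"
    echo "Total: $total"
fi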

How can I get my bash script to work?

My bash script doesn't work the way I want it to:
#!/bin/bash
total="0"
count="0"
#FILE="$1" This is the easier way

for FILE in $*
do
    # Start processing all processable files
    while read line
    do
        if [[ "$line" =~ ^Total ]];
        then
            tmp=$(echo $line | cut -d':' -f2)
            count=$(expr $count + 1)
            total=$(expr $total + $tmp)
        fi
    done < $FILE
done
echo "The Total Is: $total"
echo "$FILE"
Is there another way to modify this script so that it reads arguments into $1 instead of $FILE? I've tried using a while loop:
while [ $1 != "" ]
do ....
done
Also, when I implement that, the code repeats itself. Is there a way to fix that as well?
Another problem I'm having is that when I have multiple files matching hi*.txt, it gives me duplicates. Why? I have files like hi1.txt and hi1.txt~, but the tilde file is 0 bytes, so my script shouldn't be finding anything in it.
What I have is fine, but it could be improved. I appreciate your awk suggestions, but they're currently beyond my level as a Unix programmer.
Strager: The files that my text editor generates automatically contain nothing; they are 0 bytes. But yeah, I went ahead and deleted them just to be sure. But no, my script is in fact reading everything twice. I suppose it's looping again when it really shouldn't. I've tried to silence that action with exit commands, but wasn't successful.
while [ "$1" != "" ]; do
# Code here
# Next argument
shift
done
This code is pretty sweet, but I'm specifying all the possible files at one time. Example: hi[145].txt, if supplied, would read all three files at once.
Suppose the user enters hi*.txt; I then get all my hi files read twice and then added again.
How can I code it so that it reads my files (just once) upon specification of hi*.txt? I really think that this is because of not having $1.
It looks like you are trying to add up the totals from the lines labelled 'Total:' in the files provided. It is always a good idea to state what you're trying to do - as well as how you're trying to do it (see How to Ask Questions the Smart Way).
If so, then you're doing in about as complicated a way as I can see. What was wrong with:
grep '^Total:' "$@" |
cut -d: -f2 |
awk '{sum += $1}
     END { print sum }'
This doesn't print out "The total is" etc; and it is not clear why you echo $FILE at the end of your version.
You can use Perl or any other suitable program in place of awk; you could do the whole job in Perl or Python - indeed, the cut work could be done by awk:
grep "^Total:" "$#" |
awk -F: '{sum += $2}
END { print sum }'
Taken still further, the whole job could be done by awk:
awk -F: '$1 ~ /^Total/ { sum += $2 }
END { print sum }' "$#"
The code in Perl wouldn't be much harder and the result might be quicker:
perl -na -F: -e '$sum += $F[1] if m/^Total:/; END { print $sum; }' "$#"
When iterating over the file name arguments provided in a shell script, you should use '"$@"' in place of '$*', as the latter notation does not preserve spaces in file names.
Your comment about '$1' is confusing to me. You could be asking to read from the file whose name is in $1 on each iteration; that is done using:
while [ $# -gt 0 ]
do
    ...process $1...
    shift
done
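For instance, that pattern fleshed out for this task might look like the sketch below (my illustration, not tested against your data; it assumes the Total: values are integers, as your use of expr implies):

#!/bin/sh
total=0
while [ $# -gt 0 ]
do
    # Sum the Total: lines of the current file ($1), then fold the
    # per-file sum into the running total.
    t=$(awk -F: '/^Total/ { sum += $2 } END { print sum + 0 }' "$1")
    total=$(expr "$total" + "$t")
    shift
done
echo "The Total Is: $total"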
HTH!
If you define a function, it'll receive the argument as $1. Why is $1 more valuable to you than $FILE, though?
#!/bin/sh
process() {
    echo "doing something with $1"
}

for i in "$@" # Note use of "$@" to not break on filenames with whitespace
do
    process "$i"
done
while [ "$1" != "" ]; do
# Code here
# Next argument
shift
done
On your problem with tilde files: those are temporary files created by your text editor. Delete them if you don't want them to be matched by your glob expression (wildcard). Otherwise, filter them out in your script (not recommended).
