Monitor disk space in an `until` loop - bash

my first question here. Hope it's a good one.
So I'm hoping to create a script that kills another script running arecord when my disk gets to a certain usage. (I should point out, I'm not exactly sure how I got to that df filter... just kinda searched around...) My plan is to run both scripts (the one recording, and the one monitoring disk usage) in separate screens.
I'm doing this all on a Raspberry Pi, btw.
So this is my code so far:
#!/bin/bash
DISK=$(df / | grep / | awk '{ print $5}' | sed 's/%//g')
until [ $DISK -ge 50 ]
do
    sleep 1
done
killall arecord
This code works when I play with the starting value ("50" changed to "30" or so). But it doesn't seem to "monitor" my disk the way I want it to. I have a bit of an idea what's going on: the variable DISK is only assigned once, not checked or redefined periodically.
In other words, I probably want something in my until loop that "gets" the disk usage from df, right? What are some good ways of going about it?
PS I'd be super interested in hearing how I might incorporate this whole script's purpose into the script running arecord itself, but that's beyond me right now... and another question...

You are only setting DISK once since it's done before the loop starts and not done as part of the looping process.
A simple fix is to incorporate the evaluation of the disk space into the loop condition itself, something like:
#!/bin/bash
until [ "$(df / | awk 'NR==2 {print $5}' | tr -d '%')" -ge 50 ]; do
    sleep 1
done
killall arecord
You'll notice I've made some minor mods to the command as well, specifically:
You can use awk itself to get the relevant line from the df output, no need for grep in a separate pipeline stage.
I prefer tr for deleting single characters; sed can do it, but it's a bit of overkill.
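As for the PS: one way to fold the two together is to have the monitoring script start arecord itself and kill it by PID once the threshold is hit. A rough sketch only (the arecord options and file name below are placeholders, substitute your own):
#!/bin/bash
# Rough sketch -- swap in your real arecord options and output file.
arecord -f cd recording.wav &        # start the recording in the background
recpid=$!                            # remember its PID
until [ "$(df / | awk 'NR==2 {print $5}' | tr -d '%')" -ge 50 ]; do
    sleep 1
done
kill "$recpid"                       # stop just this recording, not every arecord
That way you don't need separate screens or killall at all.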

Related

More efficient way to loop through lines in shell

I've come to learn that looping through lines in bash by
while read line; do stuff; done <file
is not the most efficient way to do it. https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice
What is a more time/resource efficient method?
Here's a timed example using Bash and awk. I have 1 million records in a file:
$ wc -l 1M
1000000 1M
Counting its records with bash, using while read:
$ time while read -r line ; do ((i++)) ; done < 1M ; echo $i
real 0m12.440s
user 0m11.548s
sys 0m0.884s
1000000
Using let "i++" instead took 15.627 secs (real), and a no-op loop body (do : ;) took 10.466 secs. Using awk:
$ time awk '{i++}END{print i}' 1M
1000000
real 0m0.128s
user 0m0.128s
sys 0m0.000s
As others have said, it depends on what you're doing.
The reason it's inefficient is that everything runs in its own process. Depending on what you are doing, that may or may not be a big deal.
If what you want to do in the loop is run another shell process, you won't get any gain from eliminating the loop. If you can do what you need without a loop, you could see a gain.
awk? Perl? C(++)? Of course it depends on whether you're interested in CPU time or programmer time, and the latter depends on what the programmer is used to using.
The top answer to the question you linked to pretty much explains that the biggest problem is spawning external processes for simple text processing tasks. E.g. running an instance of awk or a pipeline of sed and cut for each single line just to get a part of the string is silly.
If you want to stay in shell, use the string-processing parameter expansions (${var#word}, ${var:n:m}, ${var/search/replace}, etc.) and other shell features as much as you can; there's a small illustration of these after the example below. If you find yourself running a set of commands for each input line, it's time to rethink the structure of the script. Most text-processing commands can process a whole file in one execution, so use that.
A trivial/silly example:
while read -r line; do
    x=$(echo "$line" | awk '{print $2}')
    somecmd "$x"
done < file
would be better as
awk < file '{print $2}' | while read -r x ; do somecmd "$x" ; done
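To illustrate the parameter expansions mentioned above, here's a small self-contained example (the sample line and field layout are made up, not taken from any question here):
# Everything below happens inside the shell itself, with no extra processes.
line="2023-01-05;alice;/home/alice/report.txt"
day=${line:0:10}              # first ten characters            -> 2023-01-05
user=${line#*;}               # strip up to the first ';'       -> alice;/home/alice/report.txt
user=${user%%;*}              # strip from the next ';' onwards -> alice
path=${line##*;}              # keep only the last field        -> /home/alice/report.txt
file=${path/report/summary}   # substring replacement           -> /home/alice/summary.txt
echo "$day $user $file"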
Choose between awk and Perl; both are efficient.

What can I do to speed up this bash script?

The code I have goes through a file and multiplies all the numbers in the first column by a number. The code works, but I think it's somewhat slow. It takes 26.676s (walltime) to go through a file with 2302 lines in it. I'm using a 2.7 GHz Intel Core i5 processor. Here is the code.
#!/bin/bash
i=2
sed -n 1p data.txt > data_diff.txt #outputs the header (x y)
while [ $i -lt 2303 ]; do
    NUM=`sed -n "$i"p data.txt | awk '{print $1}'`
    SEC=`sed -n "$i"p data.txt | awk '{print $2}'`
    NNUM=$(bc <<< "$NUM*0.000123981")
    echo $NNUM $SEC >> data_diff.txt
    let i=$i+1
done
Honestly, the biggest speedup you can get will come from using a single language that can do the whole task itself. This is mostly because your script invokes 5 extra processes for each line, and invoking extra processes is slow, but also text processing in bash is really not that well optimized.
I'd recommend awk, given that you have it available:
awk '{ print $1*0.000123981, $2 }'
I'm sure you can improve this to skip the header line and print it without modification.
You can also do this sort of thing with Perl, Python, C, Fortran, and many other languages, though it's unlikely to make much difference for such a simple calculation.
Your script runs 4603 separate sed processes, 4602 separate awk processes, and 2301 separate bc processes. If echo were not a built-in then it would also run 2301 echo processes. Starting a process has relatively large overhead. Not so large that you would ordinarily notice it, but you are running over 11000 short processes. The wall time consumption doesn't seem unreasonable for that.
Moreover, each sed that you run processes the whole input file anew, selecting just one line from it. This is horribly inefficient.
The solution is to reduce the number of processes you are running, and especially to perform only a single run through the whole input file. A fairly easy way to do that would be to convert to an awk script, possibly with a bash wrapper. That might look something like this:
#!/bin/bash
awk '
NR==1 { print; next }
NR>=2303 { exit }
{ print $1 * 0.000123981, $2 }
' data.txt > data_diff.txt
Note that the line beginning with NR>=2303 artificially stops processing the input file when it reaches the 2303rd line, as your original script does; you could omit that line of the script altogether to let it simply process all the lines, however many there are.
Note, too, that that uses awk's built-in FP arithmetic instead of running bc. If you actually need the arbitrary-precision arithmetic of bc then I'm sure you can figure out how to modify the script to get that.
As an example of how to speed up the bash script itself (without implying that this is the right solution):
#!/bin/bash
{
    IFS= read -r header
    echo "$header"
    # You can drop the third name "rest" if your input file
    # only has two columns.
    while read -r num sec rest; do
        nnum=$( bc <<< "$num * 0.000123981" )
        echo "$nnum $sec"
    done
} < data.txt > data_diff.txt
Now you only have one extra call to bc per data line, necessitated by the fact that bash doesn't do floating-point arithmetic. The right answer is to use a single call to a program that can do floating-point arithmetic, as pointed out by David Z.
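If bc's precision really is needed, a hedged middle ground is to run bc once over the whole first column and glue the columns back together afterwards. A sketch, assuming the header-plus-two-columns layout from the question:
#!/bin/bash
# Sketch: one bc process for all the data lines instead of one per line.
{
    sed -n 1p data.txt                                      # header, unmodified
    paste -d' ' \
        <(awk 'NR > 1 { print $1, "* 0.000123981" }' data.txt | bc -l) \
        <(awk 'NR > 1 { print $2 }' data.txt)
} > data_diff.txt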

Performance Tuning an AWK?

I've written a simple parser in BASH to take apart csv files and dump to a (temp) SQL-input file. The performance on this is pretty terrible; when running on a modern system I'm barely cracking 100 lines per second. I realize the ultimate answer is to rewrite this in a more performance oriented language, but as a learning opportunity, I'm curious where I can improve my BASH skills.
I suspect there are gains to be made by writing to RAM instead of to a file, then flushing all the text at once to the file, but I'm not clear on where/when BASH gets upset about memory usage (the largest files I've parsed have been under 500MB).
The following code block seems to eat most of the cycles, and as I understand it, needs to be processed linearly due to checking timestamps (the data has a timestamp, but no date stamp, so I was forced to ask the user for the start-day and check if the timestamp has cycled 24:00 -> 0:00), so parallel processing didn't seem like an option.
while read p; do
    linetime=`printf "${p}" | awk '{printf $1}'`
    # THE DATA LACKS FULL DATESTAMPS, SO FORCED TO ASK USER FOR START-DAY & CHECK IF THE DATE HAS CYCLED
    if [[ "$lastline" > "$linetime" ]]
    then
        experimentdate=$(eval $datecmd)
    fi
    lastline=$linetime
    printf "$p" | awk -v varout="$projname" -v experiment_day="$experimentdate " -v singlequote="$cleanquote" '{printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values ("singlequote""varout""singlequote","singlequote""experiment_day $1""singlequote","singlequote""$1""singlequote","$2","$3");\n"}' >> $sql_input_file
Ignore the singlequote nonsense, I needed this to run on both OSX & 'nix, so I had to workaround some issues with OSX's awk and singlequotes.
Any suggestions for how I can improve performance?
You do not want to start awk for every line you process in a loop. Replace your loop with awk or replace awk with builtin commands.
Both awk invocations are only used for printing. Replace them with additional parameters to the printf command.
I did not understand the code block for datecmd (it doesn't use $linetime but does set the output variable experimentdate), but that part should be optimised too: can you use regular expressions or some other trick?
So you do not have to tune awk; rather, decide to either do everything in awk or get awk out of your while loop.
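A hedged sketch of that advice (the field names are guesses at what the columns mean, and $input_file stands in for however the loop is actually fed; $projname, $datecmd and $sql_input_file are assumed to be set up earlier, as in the question):
#!/bin/bash
# Sketch only: let read split each line into fields and let printf build the
# SQL, so no awk is started inside the loop.
while read -r linetime seconds intensity _; do
    if [[ "$lastline" > "$linetime" ]]; then
        experimentdate=$(eval "$datecmd")
    fi
    lastline=$linetime
    printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values ('%s','%s %s','%s',%s,%s);\n" \
        "$projname" "$experimentdate" "$linetime" "$linetime" "$seconds" "$intensity"
done < "$input_file" >> "$sql_input_file"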
Your performance would improve if you did all the processing with awk. Awk can read your input file directly, express conditionals, and run external commands.
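And a hedged sketch of the all-awk route (again not your exact script; it assumes whitespace-separated fields and that $datecmd is a self-contained command, since awk cannot see or change shell state):
#!/bin/bash
# Sketch: one awk process reads the whole file, detects the midnight rollover,
# runs $datecmd when needed, and prints the SQL. $projname, $datecmd, an
# initial $experimentdate, $input_file and $sql_input_file are assumed set.
awk -v proj="$projname" -v day="$experimentdate" -v datecmd="$datecmd" -v q="'" '
{
    if (lastline > $1) {          # timestamp went backwards, so the clock wrapped past midnight
        datecmd | getline day
        close(datecmd)
    }
    lastline = $1
    printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values (%s%s%s,%s%s %s%s,%s%s%s,%s,%s);\n", \
           q, proj, q, q, day, $1, q, q, $1, q, $2, $3
}' "$input_file" >> "$sql_input_file"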
Awk is not the only one either. Perl and Python would be well suited to this task.

awk command stacks in top

total_pool=1
2
3
4
.
.
.
Above is my variable "total_pool"; it holds thousands of values in a single column, and it changes every time before I fire this script.
I want to process each single entry of it in a loop.
The problem is that this script runs from a crontab every 5 minutes,
and in the output of the top command this awk sometimes gets stuck,
showing up as /bin/awk -vRS= -vFS="\n" "{print $1}" for a long, long time.
How do I stop this behavior? Any better approach?
NOTE: I cannot use arrays, as my bash version is too old and does not have array support.
So, is there any better approach to grab data from a column variable one by one?
#!/bin/sh
row=1
for POOL in ${total_pool} ;
do
    poolid=$(/bin/echo "$total_pool" | /bin/awk -vRS= -vFS="\n" "{print \$$row}")
    /usr/local/rrd/bin/rrdtool update /var/graphs/p${poolid}.rrd `NOW`:$upload
    row=`expr $row + 1`
done
Sounds like echo's standard output is being buffered. If stdbuf from coreutils is an option, you may want to use it to disable echo's output buffering.
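If you go that route, the invocation might look something like this (a sketch only; stdbuf -o0 disables stdio output buffering for the command it wraps):
poolid=$(stdbuf -o0 /bin/echo "$total_pool" | /bin/awk -vRS= -vFS="\n" "{print \$$row}")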

Very slow loop using grep or fgrep on large datasets

I’m trying to do something pretty simple; grep from a list, an exact match for the string, on the files in a directory:
#try grep each line from the files
for i in $(cat /data/datafile); do
    LOOK=$(echo $i);
    fgrep -r $LOOK /data/filestosearch >> /data/output.txt
done
The file with the matches to grep with has 20 million lines, and the directory has ~600 files, with a total of ~40Million lines
I could see this was going to be slow, but we estimated it would take 7 years. Even if I use 300 cores on our HPC, splitting the job by files to search, it looks like it could take over a week.
There are similar questions here:
Loop Running VERY Slow
Very slow foreach loop
and although they are on different platforms, I think possibly if/else might help me.
Or fgrep, which is potentially faster (but seems to be a bit slow as I'm testing it now).
Can anyone see a faster way to do this?
Thank you in advance
Sounds like the -f flag for grep would be suitable here:
-f FILE, --file=FILE
    Obtain patterns from FILE, one per line. The empty file
    contains zero patterns, and therefore matches nothing. (-f is
    specified by POSIX.)
so grep can already do what your loop is doing, and you can replace the loop with:
grep -F -r -f /data/datafile /data/filestosearch >>/data/output.txt
Now I'm not sure about the performance of 20 million patterns, but at least you aren't starting 20 million processes this way so it's probably significantly faster.
As Martin has already said in his answer, you should use the -f option instead of looping. I think it should be faster than looping.
Also, this looks like an excellent use case for GNU parallel. Check out this answer for usage examples. It looks difficult, but is actually quite easy to set up and run.
Other than that, 40 million lines should not be a very big deal for grep if there was only one string to match. It should be able to do it in a minute or two on any decent machine. I tested that 2 million lines takes 6 s on my laptop. So 40 mil lines should take 2 mins.
The problem is with the fact that there are 20 million strings to be matched. I think it must be running out of memory or something, especially when you run multiple instances of it on different directories. Can you try splitting the input match-list file? Try splitting it into chunks of 100000 words each for example.
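One way to do that splitting with standard tools (the chunk prefix below is just an example name):
# Sketch: break the 20-million-line pattern file into 100000-line chunks,
# then run one grep -F -f per chunk.
split -l 100000 /data/datafile /tmp/patterns.
for chunk in /tmp/patterns.*; do
    grep -F -r -f "$chunk" /data/filestosearch >> /data/output.txt
done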
EDIT: Just tried parallel on my machine. It is amazing. It automatically takes care of splitting the grep on to several cores and several machines.
Here's one way to speed things up:
while read i
do
    LOOK=$(echo $i)
    fgrep -r $LOOK /data/filestosearch >> /data/output.txt
done < /data/datafile
When you do for i in $(cat /data/datafile), you first spawn another process, and that process must cat out all of those lines before running the rest of the script. Plus, there's a good possibility that you'll overload the command line and lose some of the files at the end.
By using a while read loop and redirecting the input from /data/datafile, you eliminate the need to spawn a shell. Plus, your script will immediately start reading through the while loop without first having to cat out the entire /data/datafile.
If $i are a list of directories, and you are interested in the files underneath, I wonder if find might be a bit faster than fgrep -r.
while read i
do
    LOOK=$(echo $i)
    find $i -type f | xargs fgrep $LOOK >> /data/output.txt
done < /data/datafile
The xargs will take the output of find, and run as many files as possible under a single fgrep. The xargs can be dangerous if file names in those directories contain whitespace or other strange characters. You can try (depending upon the system), something like this:
find $i -type f -print0 | xargs --null fgrep $LOOK >> /data/output.txt
On the Mac it's
find $i -type f -print0 | xargs -0 fgrep $LOOK >> /data/output.txt
As others have stated, if you have the GNU version of grep, you can give it the -f flag and include your /data/datafile. Then, you can completely eliminate the loop.
Another possibility is to switch to Perl or Python which actually will run faster than the shell will, and give you a bit more flexibility.
Since you are searching for simple strings (and not regexp) you may want to use comm:
comm -12 <(sort find_this) <(sort in_this.*) > /data/output.txt
It takes up very little memory, whereas grep -f find_this can gobble up 100 times the size of 'find_this'.
On an 8-core machine this takes 100 sec on these files:
$ wc find_this; cat in_this.* | wc
3637371 4877980 307366868 find_this
16000000 20000000 1025893685
Be sure to have a reasonably new version of sort. It should support --parallel.
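For example (a sketch; --parallel and the -S buffer-size option are GNU sort features):
comm -12 <(sort --parallel=8 -S 2G find_this) \
         <(sort --parallel=8 -S 2G in_this.*) > /data/output.txt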
You can write a perl/python script that will do the job for you; it saves all the forks you otherwise need when you do this with external tools.
Another hint: you can combine the strings that you are looking for into one regular expression.
In this case grep will do only one pass over the file for all the combined strings.
Example:
Instead of
for i in ABC DEF GHI JKL
do
    grep $i file >> results
done
you can do
egrep "ABC|DEF|GHI|JKL" file >> results
