I have a short bash script to check to see if a Python program is running. The program writes out a PID file when it runs, so comparing this to the current list of running processes gives me what I need. But I'm having a problem with a variable being changed and then apparently changing back! Here's the script:
#!/bin/bash
# Test whether Home Server is currently running
PIDFILE=/tmp/montSvr.pid
isRunning=0
# does a pid file exist?
if [ -f "$PIDFILE" ]; then
# pid file exists
# now get contents of pid file
cat $PIDFILE | while read PID; do
if [ $PID != "" ]; then
PSGREP=$(ps -A | grep $PID | awk '{print $1}')
if [ -n "$PSGREP" ]; then
isRunning=1
echo "RUNNING: $isRunning"
fi
fi
done
fi
echo "Running: $isRunning"
exit $isRunning
The output I get, when the Python script is running, is:
RUNNING: 1
Running: 0
And the exit value of the bash script is 0. So isRunning is getting changed within all those if statements (ie, the code is performing as expected), but then somehow isRunning reverts to 0 again. Confused...
Commands after a pipe | are run in a subshell. Changes to variable values in a subshell do not propagate to the parent shell.
Solution: change your loop to
while read PID; do
# ...
done < "$PIDFILE"
It's the pipe that is the problem. Using a pipe in this way means that the loop runs in a sub-shell, with its own environment. Kill the cat, use this syntax instead:
while read -r PID; do
if [ -n "$PID" ]; then
# compare the PID column exactly; a plain grep would also match substrings
PSGREP=$(ps -A | awk -v pid="$PID" '$1 == pid {print $1}')
if [ -n "$PSGREP" ]; then
isRunning=1
echo "RUNNING: $isRunning"
fi
fi
done < "$PIDFILE"
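To see the subshell effect in isolation, here is a minimal sketch (the temporary file and the counter names are made up for the demonstration). It counts lines twice, once through a pipe and once through a redirect, assuming bash with its default settings:

```bash
#!/usr/bin/env bash
# Hypothetical demo file; any readable file with a few lines would do.
demo_file=$(mktemp)
printf '%s\n' one two three > "$demo_file"

piped_count=0
cat "$demo_file" | while read -r line; do
    piped_count=$((piped_count + 1))            # updates the subshell's copy only
done

redirected_count=0
while read -r line; do
    redirected_count=$((redirected_count + 1))  # runs in the current shell
done < "$demo_file"

rm -f "$demo_file"
echo "piped: $piped_count, redirected: $redirected_count"
```

With default bash settings the piped counter is still 0 after the loop, while the redirected one is 3.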
Related
I am monitoring a log file and if PATTERN didn't appear in it within THRESHOLD seconds, the script should print "error", otherwise, it should print "clear". The script is working fine, but only if the log is rolling.
I've tried reading 'timeout', but it didn't work.
log_file=/tmp/app.log
threshold=120
tail -Fn0 ${log_file} | \
while read line ; do
echo "${line}" | awk '/PATTERN/ { system("touch pattern.tmp") }'
# code to calculate how long ago pattern.tmp was touched; the result is assigned to diff
if [ ${diff} -gt ${threshold} ]; then
echo "Error"
else
echo "Clear"
fi
done
It is working as expected only when there is 'any' line printed in the app.log.
If the application got hung for any reason and the log stopped rolling, there won't be any output by the script.
Is there a way to detect the 'no output' of tail and do some command at that time?
It looks like the problem you're having is that the timing calculations inside your while loop never get a chance to run when read is blocking on input. In that case, you can pipe the tail output into a while true loop, inside of which you can do if read -t $timeout:
log_file=/tmp/app.log
threshold=120
timeout=10
tail -Fn0 "$log_file" | while true; do
if read -t $timeout line; then
echo "${line}" | awk '/PATTERN/ { system("touch pattern.tmp") }'
fi
# code to calculate how long ago pattern.tmp touched and same is assigned to diff
if [ ${diff} -gt ${threshold} ]; then
echo "Error"
else
echo "Clear"
fi
done
As Ed Morton pointed out, all caps variable names are not a good idea in bash scripts, so I used lowercase variable names.
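The "how long ago was pattern.tmp touched" step is left as a comment above. One possible way to fill it in is sketched below; the function name seconds_since_modified is made up, and the stat flags are an assumption about your platform (GNU stat takes -c %Y, BSD/macOS stat takes -f %m):

```bash
#!/usr/bin/env bash
# seconds_since_modified FILE: print how many seconds ago FILE was last modified.
# (Hypothetical helper, not part of the original script.)
seconds_since_modified() {
    local file=$1 mtime now
    # GNU stat uses -c %Y; BSD/macOS stat uses -f %m. Try one, then the other.
    mtime=$(stat -c %Y "$file" 2>/dev/null) || mtime=$(stat -f %m "$file") || return 1
    now=$(date +%s)
    echo $(( now - mtime ))
}

marker=$(mktemp)                       # stand-in for pattern.tmp
diff=$(seconds_since_modified "$marker")
echo "marker was modified $diff seconds ago"
rm -f "$marker"
```

Inside the loop you would call it as diff=$(seconds_since_modified pattern.tmp) just before the comparison with threshold.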
How about something simple like:
sleep "$threshold"
grep -q 'PATTERN' "$log_file" && { echo "Clear"; exit; }
echo "Error"
If that's not all you need then edit your question to clarify your requirements. Don't use all upper case for non exported shell variable names btw - google it.
To build further on your idea, it might be beneficial to run the awk part in the background and a continuous loop to do the checking.
#!/usr/bin/env bash
log_file="log.txt"
# threshold in seconds
threshold=10
# run the following process in the background
stdbuf -oL tail -n0 -f "$log_file" \
| awk '/PATTERN/ { system("touch pattern.tmp") }' &
while true; do
match=$(find . -type f -iname "pattern.tmp" -newermt "-${threshold} seconds")
if [[ -z "${match}" ]]; then
echo "Error"
else
echo "Clear"
fi
sleep 1 # pause between checks so the loop does not spin at full speed
done
This looks to me like a watchdog timer. I've implemented something like this by forcing a background process to update my log, so I don't have to worry about read -t. Here's a working example:
#!/usr/bin/env bash
threshold=10
grain=2
errorstate=0
while sleep "$grain"; do
date '+[%F %T] watchdog timer' >> log
done &
trap "kill -HUP $!" 0 HUP INT QUIT TRAP ABRT TERM
printf -v lastseen '%(%s)T' -1
tail -F log | while read -r line; do
printf -v now '%(%s)T' -1
if (( now - lastseen > threshold )); then
echo "ERROR"
errorstate=1
else
if (( errorstate )); then
echo "Recovered, yay"
errorstate=0
fi
fi
if [[ $line =~ .*PATTERN.* ]]; then
lastseen=$now
fi
done
Run this in one window, wait $threshold seconds for it to trigger, then in another window echo PATTERN >> log to see the recovery.
While this can be made as granular as you like (I've set it to 2 seconds in the example), it does pollute your log file.
Oh, and note that the printf '%(%s)T' format requires bash version 4.2 or later.
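If you can't count on a recent bash, a small wrapper can fall back to date. This is only a sketch; now_epoch is a made-up name, and it assumes your date supports +%s (GNU and BSD date both do):

```bash
#!/usr/bin/env bash
# now_epoch: print the current time as seconds since the epoch.
# Uses the printf builtin on bash 4.2+, otherwise forks date.
now_epoch() {
    if (( BASH_VERSINFO[0] > 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] >= 2) )); then
        printf '%(%s)T\n' -1      # -1 means "now"; passing it explicitly also works on 4.2
    else
        date +%s
    fi
}

echo "current epoch time: $(now_epoch)"
```

The builtin avoids forking a process on every log line, which is the reason to prefer it inside a tight read loop.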
I have a script that I only want to be running one time. If the script gets called a second time I'm having it check to see if a lockfile exists. If the lockfile exists then I want to see if the process is actually running.
I've been messing around with pgrep but am not getting the expected results:
#!/bin/bash
COUNT=$(pgrep $(basename $0) | wc -l)
PSTREE=$(pgrep $(basename $0) ; pstree -p $$)
echo "###"
echo $COUNT
echo $PSTREE
echo "###"
echo "$(basename $0) :" `pgrep -d, $(basename $0)`
echo sleeping.....
sleep 10
The results I'm getting are:
$ ./test.sh
###
2
2581 2587 test.sh(2581)---test.sh(2587)---pstree(2591)
###
test.sh : 2581
sleeping.....
I don't understand why I'm getting a "2" when only one process is actually running.
Any ideas? I'm sure it's the way I'm calling it. I've tried a number of different combinations and can't quite seem to figure it out.
SOLUTION:
What I ended up doing was this (a portion of my script):
function check_lockfile {
# Check for previous lockfiles
if [ -e $LOCKFILE ]
then
echo "Lockfile $LOCKFILE already exists. Checking to see if process is actually running...." >> $LOGFILE 2>&1
# is it running?
if [ $(ps -elf | grep $(cat $LOCKFILE) | grep $(basename $0) | wc -l) -gt 0 ]
then
abort "ERROR! - Process is already running at PID: $(cat $LOCKFILE). Exiting..."
else
echo "Process is not running. Removing $LOCKFILE" >> $LOGFILE 2>&1
rm -f $LOCKFILE
fi
else
echo "Lockfile $LOCKFILE does not exist." >> $LOGFILE 2>&1
fi
}
function create_lockfile {
# Check for previous lockfile
check_lockfile
#Create lockfile with the contents of the PID
echo "Creating lockfile with PID:" $$ >> $LOGFILE 2>&1
echo -n $$ > $LOCKFILE
echo "" >> $LOGFILE 2>&1
}
# Acquire lock file
create_lockfile >> $LOGFILE 2>&1 \
|| echo "ERROR! - Failed to acquire lock!"
The argument for pgrep is an extended regular expression pattern.
In your case, the command pgrep $(basename $0) evaluates to pgrep test.sh, which matches any process whose name contains test, followed by any single character, followed by sh. So it will also match btest8sh, atest_shell, etc.
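You can try the regex behaviour without pgrep at all, since bash's =~ operator also uses extended regular expressions. A small sketch (ere_matches and the sample names are made up for illustration):

```bash
#!/usr/bin/env bash
# ere_matches PATTERN NAME: succeed if NAME contains a match for the ERE PATTERN.
ere_matches() {
    local pattern=$1 name=$2
    [[ $name =~ $pattern ]]
}

for name in test.sh btest8sh atest_shell testsh; do
    # loose: the unescaped, unanchored pattern that pgrep was given
    if ere_matches 'test.sh' "$name"; then loose=match; else loose=no; fi
    # strict: dot escaped and both ends anchored
    if ere_matches '^test\.sh$' "$name"; then strict=match; else strict=no; fi
    printf '%-12s loose=%-6s strict=%s\n' "$name" "$loose" "$strict"
done
```

With pgrep itself, the -x option asks for an exact match on the process name, which has much the same effect as anchoring the pattern.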
You should create a lock file. If the lock file exists program should exit.
lock=$(basename $0).lock
if [ -e $lock ]
then
echo Process is already running with PID=`cat $lock`
exit
else
echo $$ > $lock
fi
You are already opening a lock file. Use it to make your life easier.
Write the process id to the lock file. When you see the lock file exists, read it to see what process id it is supposedly locking, and check to see if that process is still running.
Then in version 2, you can also write program name, program arguments, program start time, etc. to guard against the case where a new process starts with the same process id.
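A rough sketch of that check, assuming the lock file contains nothing but the PID (lock_holder_alive and the temporary path are made-up names):

```bash
#!/usr/bin/env bash
# lock_holder_alive LOCKFILE: succeed if LOCKFILE names a PID that still exists.
lock_holder_alive() {
    local lockfile=$1 pid
    [ -f "$lockfile" ] || return 1
    # read fails on a file without a trailing newline but still fills pid
    read -r pid < "$lockfile" || [ -n "$pid" ] || return 1
    [[ $pid =~ ^[0-9]+$ ]] || return 1
    # kill -0 sends no signal; it only checks that we could signal the PID
    kill -0 "$pid" 2>/dev/null
}

lock=$(mktemp)            # hypothetical lock file path
echo "$$" > "$lock"
if lock_holder_alive "$lock"; then
    echo "lock is held by a live process"
else
    echo "stale lock, safe to remove"
fi
rm -f "$lock"
```

Note that kill -0 also fails for a live process owned by another user, so this is only reliable when the same user runs every instance.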
Put this near the top of your script...
pid=$$
script=$(basename $0)
guard="/tmp/$script-$(id -nu).pid"
if test -f $guard ; then
echo >&2 "ERROR: Script already runs... own PID=$pid"
ps auxw | grep $script | grep -v grep >&2
exit 1
fi
trap "rm -f $guard" EXIT
echo $pid >$guard
And yes, there IS a small window for a race condition between the test and echo commands, which can be fixed by appending to the guard file, and then checking that the first line is indeed our own PID. Also, the diagnostic output in the if can be commented out in a production version.
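The append-then-check idea could look roughly like this. It is a sketch only; acquire_guard and the demo path are made-up names, and the PID is passed in as an argument to keep the function easy to try out:

```bash
#!/usr/bin/env bash
# acquire_guard FILE PID: append PID to FILE, then succeed only if PID is the first line.
acquire_guard() {
    local guard=$1 pid=$2 first
    echo "$pid" >> "$guard"        # append, so a competing writer is never overwritten
    read -r first < "$guard"
    [ "$first" = "$pid" ]          # whoever wrote the first line owns the lock
}

guard="${TMPDIR:-/tmp}/guard-demo-$$.pid"    # hypothetical path
if acquire_guard "$guard" "$$"; then
    echo "lock acquired by $$"
    rm -f "$guard"                 # only the owner removes the guard file
else
    echo "lock is owned by another process" >&2
fi
```

In a real script the rm would go into the EXIT trap of the winner, as in the snippet above.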
I'm writing a routine that will identify if a process stops running and will do something once the processes targeted is gone.
I came up with this code (as a test for my future code):
#!/bin/bash
value="aaa"
ls | grep $value
while [ $? = 0 ];
do
sleep 5
ls | grep $value
echo $?
done;
echo DONE
My problem is that for some reason, the loop never stops and echoes 1 after I delete the file "aaa".
0
0 >>> I delete the file at that point (in another terminal)
1
1
1
1
.
.
.
I would expect the output to be "DONE" as soon as I delete the file...
What's the problem?
SOLUTION:
#!/bin/bash
value="aaa"
ls | grep $value
while [ $? = 0 ];
do
sleep 5
ls | grep $value
done;
echo DONE
The value of $? changes very easily. In the current version of your code, this line:
echo $?
prints the status of the previous command (grep) -- but then it sets $? to 0, the status of the echo command.
Save the value of $? in another variable, one that won't be clobbered next time you execute a command:
#!/bin/bash
value="aaa"
ls | grep $value
status=$?
while [ $status = 0 ];
do
sleep 5
ls | grep $value
status=$?
echo $status
done;
echo DONE
If the ls | grep aaa is intended to check whether a file named aaa exists, this:
while [ -f aaa ] ; ...
is a cleaner way to do it.
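The clobbering is easy to reproduce on its own. In this minimal sketch (variable names are arbitrary) the same failing command is followed once by an immediate save and once by an echo:

```bash
#!/usr/bin/env bash
false
first=$?                          # saved immediately: the status of false

false
echo "status: $?" > /dev/null     # prints 1, but echo itself succeeds
second=$?                         # now holds the status of echo, not of false

echo "first=$first second=$second"
```

This prints first=1 second=0, which is exactly what kept the loop in the question running.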
$? is the return code of the last command, in this case your echo.
You can rewrite that loop in much simpler way like this:
while [ -f aaa ]; do
sleep 5;
echo "sleeping...";
done
You ought not duplicate the command to be tested. You can always write:
while cmd; do ...; done
instead of
cmd
while [ $? = 0 ]; do ...; cmd; done
In your case, you mention in a comment that the command you are testing is parsing the output of ps. Although there are very good arguments that you ought not to do that, and that the follow-on processing should be done by the parent of the command for which you are waiting, we'll ignore that issue for the moment. You can simply write:
while ps -ef | grep -v "grep mysqldump" |
grep mysqldump > /dev/null; do sleep 1200; done
Note that I changed the order of your pipe, since grep -v returns true whenever any line gets through it. In this case, I think it is not necessary, but I believe it is more readable. I've also discarded the output to clean things up a bit.
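As a small illustration of the while cmd; do ...; done form, here is a sketch that waits for a file to disappear. The temporary file and the background remover are made up so that the example terminates by itself:

```bash
#!/usr/bin/env bash
marker=$(mktemp)                   # hypothetical file to wait on
( sleep 2; rm -f "$marker" ) &     # stand-in for whatever removes it later

polls=0
while [ -e "$marker" ]; do         # the test itself is the loop condition
    polls=$((polls + 1))
    sleep 1
done
wait                               # reap the background remover
echo "DONE after $polls polls"
```

Because the test is the loop condition, there is no $? to save and nothing to duplicate before the loop.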
Presumably your objective is to wait until a filename containing the string $value is present in the local directory and not necessarily a single filename.
Try:
#!/bin/bash
value="aaa"
while ! ls *$value*; do
sleep 5
done
echo DONE
Your original code failed because $? is filled with the return code of the echo command upon every iteration following the first.
BTW, if you intend to use ps instead of ls in the future, you will pick up your own grep unless you are clever. Use ps -ef | grep [m]ysqlplus.
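Why the bracket trick works can be shown with canned text instead of a real process list. In this sketch the two variables simulate what the command column of ps would contain in each case (the mysqlplus arguments are invented):

```bash
#!/usr/bin/env bash
# Simulated command columns; a real run would read them from ps -ef.
# When you run "grep mysqlplus", that grep is itself in the process list:
list_plain='mysqlplus --batch
grep mysqlplus'
# When you run "grep [m]ysqlplus", its command line contains the brackets:
list_bracket='mysqlplus --batch
grep [m]ysqlplus'

plain=$(printf '%s\n' "$list_plain" | grep -c 'mysqlplus')
bracket=$(printf '%s\n' "$list_bracket" | grep -c '[m]ysqlplus')
echo "plain pattern: $plain matches, bracket pattern: $bracket match"
```

The regex [m]ysqlplus matches the text mysqlplus, but the literal text [m]ysqlplus on grep's own command line does not contain mysqlplus, so grep no longer finds itself. Quoting the pattern also keeps the shell from treating it as a glob.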
I've written (well, remixed to arrive at) this Bash script
# pkill.sh
trap onexit 1 2 3 15 ERR
function onexit() {
local exit_status=${1:-$?}
echo Problem killing $kill_this
exit $exit_status
}
export kill_this=$1
for X in `ps acx | grep -i $1 | awk {'print $1'}`; do
kill $X;
done
It works fine, but any errors are shown on the display. I only want the echo Problem killing... to show in case of error. How can I "catch" (hide) the error when executing the kill statement?
Disclaimer: Sorry for the long example, but when I make them shorter I inevitably have to explain "what I'm trying to do."
# pkill.sh
trap onexit 1 2 3 15 ERR
function onexit() {
local exit_status=${1:-$?}
echo Problem killing $kill_this
exit $exit_status
}
export kill_this=$1
for X in `ps acx | grep -i $1 | awk {'print $1'}`; do
# call onexit ourselves; at this point $? is still kill's exit status
kill "$X" 2>/dev/null || onexit $?
done
You can redirect stderr and stdout to /dev/null via something like pkill.sh > /dev/null 2>&1. If you only want to suppress the output from the kill command, apply it to that line only, e.g., kill $X > /dev/null 2>&1.
What this does is send the standard output (stdout) from kill $X to /dev/null (that's the > /dev/null), and additionally send stderr (the 2) to wherever stdout (the 1) is now pointing.
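Here is a short sketch of the different redirections, using a made-up function noisy as a stand-in for a command that writes to both streams:

```bash
#!/usr/bin/env bash
# noisy: hypothetical stand-in for a command that writes to both streams.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

only_stdout=$(noisy 2>/dev/null)        # stderr discarded, stdout captured
nothing=$(noisy >/dev/null 2>&1)        # both discarded
# Order matters: here 2>&1 copies the current stdout (the capture) first,
# so stderr is captured and only stdout is discarded.
only_stderr=$(noisy 2>&1 >/dev/null)

printf '[%s] [%s] [%s]\n' "$only_stdout" "$nothing" "$only_stderr"
```

This prints [to stdout] [] [to stderr]. For the kill case, 2>/dev/null alone is enough, since kill reports its errors on stderr.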
For my own notes, here's my new code using Paul Creasey's answer:
# pkill.sh: this is dangerous and should not be run as root!
trap onexit 1 2 3 15 ERR
#--- onexit() -----------------------------------------------------
# @param $1 integer (optional) Exit status. If not set, use `$?'
function onexit() {
local exit_status=${1:-$?}
echo Problem killing $kill_this
exit $exit_status
}
export kill_this=$1
for X in `ps acx | grep -i "$1" | awk {'print $1'}`; do
kill $X 2>/dev/null
done
Thanks all!
I have a Cygwin bash script that I need to watch and terminate under certain conditions - specifically, after a certain file has been created. I'm having difficulty figuring out how exactly to terminate the script with the same level of completeness that Ctrl+C does, however.
Here's a simple script (called test1) that does little more than wait around to be terminated.
#!/bin/bash
test -f kill_me && rm kill_me
touch kill_me
tail -f kill_me
If this script is run in the foreground, Ctrl+C will terminate both the tail and the script itself. If the script is run in the background, a kill %1 (assuming it is job 1) will also terminate both tail and the script.
However, when I try to do the same thing from a script, I'm finding that only the bash process running the script is terminated, while tail hangs around disconnected from its parent. Here's one way I tried (test2):
#!/bin/bash
test -f kill_me && rm kill_me
(
touch kill_me
tail -f kill_me
) &
while true; do
sleep 1
test -f kill_me && {
kill %1
exit
}
done
If this is run, the bash subshell running in the background is terminated OK, but tail still hangs around.
If I use an explicitly separate script, like this, it still doesn't work (test3):
#!/bin/bash
test -f kill_me && rm kill_me
# assuming test1 above is included in the same directory
./test1 &
while true; do
sleep 1
test -f kill_me && {
kill %1
exit
}
done
tail is still hanging around after this script is run.
In my actual case, the process creating files is not particularly instrumentable, so I can't get it to terminate of its own accord; by finding out when it has created a particular file, however, I can at that point know that it's OK to terminate it. Unfortunately, I can't use a simple killall or equivalent, as there may be multiple instances running, and I only want to kill the specific instance.
/bin/kill (the program, not the bash builtin) interprets a negative PID as “kill the process group” which will get all the children too.
Changing
kill %1
to
/bin/kill -- -$$
works for me.
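One caveat: -$$ only names a process group if the script is the leader of its own group, which is typically the case when it is started from an interactive shell but not necessarily when another script launches it. A quick way to check is sketched below; pgid_of is a made-up helper, and it assumes a ps that accepts the POSIX -o pgid= and -p options (Cygwin's own ps may not):

```bash
#!/usr/bin/env bash
# pgid_of PID: print the process group ID of PID (assumes a POSIX-style ps).
pgid_of() {
    ps -o pgid= -p "$1" | tr -d ' '
}

my_pgid=$(pgid_of "$$")
if [ "$my_pgid" = "$$" ]; then
    echo "this shell leads group $my_pgid, so kill -- -$$ reaches it and its children"
else
    echo "this shell is in group $my_pgid, so there is no group $$ to signal"
fi
```

In the second case you could signal -"$my_pgid" instead, but be aware that this also hits whatever else shares that group, including the calling script.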
Adam's link put me in a direction that will solve the problem, albeit not without some minor caveats.
The script doesn't work unmodified under Cygwin, so I rewrote it, and with a couple more options. Here's my version:
#!/bin/bash
function usage
{
echo "usage: $(basename $0) [-c] [-<sigspec>] <pid>..."
echo "Recursively kill the process tree(s) rooted by <pid>."
echo "Options:"
echo " -c Only kill children; don't kill root"
echo " <sigspec> Arbitrary argument to pass to kill, expected to be signal specification"
exit 1
}
kill_parent=1
sig_spec=-9
function do_kill # <pid>...
{
kill "$sig_spec" "$@"
}
function kill_children # pid
{
local target=$1
local pid=
local ppid=
local i
# Returns alternating ids: first is pid, second is parent
for i in $(ps -f | tail -n +2 | cut -b 10-24); do
if [ ! -n "$pid" ]; then
# first in pair
pid=$i
else
# second in pair
ppid=$i
(( ppid == target && pid != $$ )) && {
kill_children $pid
do_kill $pid
}
# reset pid for next pair
pid=
fi
done
}
test -n "$1" || usage
while [ -n "$1" ]; do
case "$1" in
-c)
kill_parent=0
;;
-*)
sig_spec="$1"
;;
*)
kill_children $1
(( kill_parent )) && do_kill $1
;;
esac
shift
done
The only real downside is the somewhat ugly message that bash prints out when it receives a fatal signal, namely "Terminated", "Killed" or "Interrupted" (depending on what you send). However, I can live with that in batch scripts.
This script looks like it'll do the job:
#!/bin/bash
# Author: Sunil Alankar
##
# recursive kill. kills the process tree down from the specified pid
#
# foreach child of pid, recursive call dokill
dokill() {
local pid=$1
local itsparent=""
local aprocess=""
local x=""
# next line is a single line
for x in `/bin/ps -f | sed -e '/UID/d;s/[a-zA-Z0-9_-]\{1,\} \{1,\}\([0-9]\{1,\}\) \{1,\}\([0-9]\{1,\}\) .*/\1 \2/g'`
do
if [ "$aprocess" = "" ]; then
aprocess=$x
itsparent=""
continue
else
itsparent=$x
if [ "$itsparent" = "$pid" ]; then
dokill $aprocess
fi
aprocess=""
fi
done
echo "killing $1"
kill -9 $1 > /dev/null 2>&1
}
case $# in
1) PID=$1
;;
*) echo "usage: rekill <top pid to kill>";
exit 1;
;;
esac
dokill $PID
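Both scripts boil down to the same walk over (pid, parent pid) pairs. Here is a sketch of just that walk, with the killing left out so it can be tried on canned data; descendants and the sample table are made up for illustration:

```bash
#!/usr/bin/env bash
# descendants ROOT TABLE: print every descendant of ROOT, deepest first.
# TABLE is a string of "pid ppid" lines, one pair per line.
descendants() {
    local target=$1 table=$2 pid ppid
    while read -r pid ppid; do
        if [ "$ppid" = "$target" ]; then
            descendants "$pid" "$table"   # grandchildren before the child itself
            echo "$pid"
        fi
    done <<< "$table"
}

# Canned example: 2 and 4 are children of 1, 3 is a child of 2, 5 is unrelated.
table=$'2 1\n3 2\n4 1\n5 9'
tree=$(descendants 1 "$table" | tr '\n' ' ')
echo "descendants of 1: $tree"
```

On a system whose ps supports POSIX output formats, the table could come from something like ps -e -o pid= -o ppid= (an assumption; as noted above, Cygwin's ps differs), and each printed PID could then be passed to kill.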