Checking via ssh if a specific program is still running, in parallel - bash

I have several machines where I have a program running. Every 30 seconds or so I want to check if those programs are still running. I use the following command to do that.
ssh ${USER}@${HOSTS[i]} "bash -c 'if [[ -z \"\$(pgrep -u ${USER} program)\" ]]; then exit 1; else exit 0; fi'"
Now, running this on >100 machines takes a long time, and I want to speed it up by checking in parallel. I am aware of '&' and 'parallel', but I am unsure how to retrieve the return value of each check (whether the program is still running on that host or not).

The following lets all connections complete before starting any in the next batch, and thus can potentially wait for more than 30 seconds -- but should give you a good idea of how to do what you're looking for:
hosts=( host1 host2 host3 )
user=someuser
script="script you want to run on each remote host"

last_time=$(( SECONDS - 30 ))
while (( ( SECONDS - last_time ) >= 30 )) || \
      sleep $(( 30 - (SECONDS - last_time) )); do
  last_time=$SECONDS
  declare -A pids=( )
  for host in "${hosts[@]}"; do
    ssh "${user}@${host}" "$script" & pids[$!]="$host"
  done
  for pid in "${!pids[@]}"; do
    wait "$pid" || {
      echo "Failure monitoring host ${pids[$pid]} at time $SECONDS" >&2
    }
  done
done
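For completeness, since the question mentions GNU parallel: if it is installed on the controlling machine, the fan-out plus a per-host result can be had in one line. This is only a sketch, assuming passwordless ssh and the same ${USER} and ${HOSTS[@]} variables as above:
# each host is checked concurrently; --tag prefixes every output line with the host it came from
parallel -j0 --tag "ssh ${USER}@{} 'pgrep -u ${USER} program > /dev/null' && echo running || echo NOT running" ::: "${HOSTS[@]}"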
Now, bigger picture: Don't do that.
Almost every operating system has a process supervision framework. Ubuntu has Upstart; Fedora and CentOS 7 have systemd; MacOS X has launchd; runit, daemontools, and others can be installed anywhere (and are very, very easy to use -- look at the run scripts at http://smarden.org/runit/runscripts.html for examples).
Using these tools is the Right Way to monitor a process and ensure that it restarts whenever it exits: unlike this (very high-overhead) solution, they have almost no overhead at all, since they rely on the operating system notifying a process's parent when that process exits, rather than polling for a process (and that only after all the overhead of connecting via SSH, negotiating a pair of session keys, starting a shell to run your script, etc., etc.).
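For illustration only (the path and program name below are placeholders, not taken from the question), a runit-style run script is nothing more than a tiny shell script that execs the daemon; runsv then restarts it automatically every time it exits:
#!/bin/sh
# sketch of /etc/sv/program/run -- supervises a hypothetical "program"
exec 2>&1                    # send stderr wherever stdout (the log) goes
exec /usr/local/bin/program  # exec, so the supervisor's child IS the program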
Yes, this may be a small private project. Still, you're making extra complexity (and thus, extra bugs) for yourself -- and if you learn to use the tools to do this right, you'll know how to do things right when you have something that isn't a small private project.

Related

Shell script: How to loop run two programs?

I'm running an Ubuntu server to mine crypto. It's not a very stable coin yet, and its main node gets disconnected sometimes. When this happens, the program crashes with a fatal error.
At first I wrote a loop script so it would keep running after a crash and just try again after 15 seconds:
while true; do
    ./miner <somecodetoconfiguretheminer> && break
    sleep 15
done
This works, but is inefficient. Sometimes the loop will keep spinning for 30 minutes until the main node is back up, which costs me 30 minutes of unused hashing power. So I want it to run a second miner for 15 minutes to mine another coin, then check whether the first miner is working again yet.
So basically: Start -> Mine coin 1 -> if crash -> Mine coin 2 for 15 minutes -> go to Start
I tried the script below but the server just becomes unresponsive once the first miner disconnects:
while true; do
    ./miner1 <somecodetoconfiguretheminer> && break
    timeout 900 ./miner2
    sleep 15
done
I've read through several topics/questions on how && break, timeout, and while true work, but I can't figure out what I'm missing here.
Thanks in advance for the help!
A much simpler solution would be to run both of the programs all the time, and lower the priority of the less-preferred one. On Linux and similar systems, that is:
nice -10 ./miner2loop.sh &
./miner1loop.sh
Then the scripts can be similar to your first one.
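A minimal sketch of what each wrapper script could look like (the miner arguments are the same placeholders as in the question); each loop simply restarts its own miner after a crash:
# miner1loop.sh -- miner2loop.sh is identical apart from the binary it starts
while true; do
    ./miner1 <somecodetoconfiguretheminer>
    sleep 15
done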
Okay, so after trial and error, and some help, I found out that there is nothing wrong with my initial code. timeout appears to behave differently on my Linux instance when used in a terminal than when used in a bash script. Used in a terminal it behaves as it should: it counts down and then kills the process it started. Used in a bash script, however, it acts as if I had typed 'sleep': it counts down and then simply stops.
Apparently this has to do with my Ubuntu instance (running on a VPS), even though I have the latest version of coreutils and everything else kept up to date through apt-get. This is the case for me on DigitalOcean as well as Google Compute.
The solution is to use a timeout implementation as a function inside the bash script, as found in another Stack Overflow thread. I named the function timeout2 so as not to trigger the misbehaving timeout command:
#!/bin/bash

# Executes command with a timeout
# Params:
#   $1 timeout in seconds
#   $2 command
# Returns 1 if timed out, 0 otherwise
timeout2() {
    time=$1
    # start the command in a subshell to avoid problem with pipes
    # (spawn accepts one command)
    command="/bin/sh -c \"$2\""
    expect -c "set echo \"-noecho\"; set timeout $time; spawn -noecho $command; expect timeout { exit 1 } eof { exit 0 }"
    if [ $? = 1 ] ; then
        echo "Timeout after ${time} seconds"
    fi
}

while true; do
    ./miner1 <parameters for miner> && break
    sleep 5
    timeout2 300 "./miner2 <parameters for miner>"
done

How to free up memory after running a process in a shell

I have a script that runs a Java process that loads data into a database every 10 seconds in a loop. The script seems to work perfectly, but after a couple of days I start getting memory issues. If I stop the script everything frees up, and I can start it again and it will run happily for another couple of days.
RUNME=Y
PROPERTIES=someproprties.properties
CHECKFILE=somelockfile.lock

touch $CHECKFILE
while [ "$RUNME" = "Y" ]; do
    if [ -f $CHECKFILE ]
    then
        #Run Process
        $DR_HOME/bin/dr -cp $CP_PLUGIN -Xmx64g --engine parallelism=1 --runjson $HOME_DIR/workflows/some_dataflow.dr --overridefile $PROPERTIES 1> /dev/null 2>> $LOG_FILE
        #Give Process a little time to finish up before moving on
        sleep 10s
    else
        RUNME=N
    fi
done
I had assumed that once the process had run, any memory it had allocated would become available again, so that the next iteration of the loop could use it. Given that this does not seem to be the case, is there a way I can force the release of memory after running the process? I appreciate that this may be something I need to address in the actual Java process rather than in the shell, but as this is the area I have more control over, I thought I would at least ask.
To check which processes are running and the memory they use:
sid=$(ps -p $$ -osid=)
while ....
    ps --sid $sid -o pid,tty,%cpu,vsz,etime,command
vsz shows the virtual memory size used by each process.
Then, if it is really the bash process that is growing, it may be the environment that is growing; but from the script shown, that can't be the case.
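A hedged sketch of how that check could be wired into the original loop to log memory use over time (appending to the existing $LOG_FILE is just an assumption; use any file you like):
# capture the session id once, before the while loop starts
sid=$(ps -p $$ -osid=)
# inside the loop, right after the dr process returns, append a snapshot of every
# process still in this session (including any leftover children) with its virtual size
ps --sid "$sid" -o pid,tty,%cpu,vsz,etime,command >> "$LOG_FILE"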

Introduce timeout in a bash for-loop

I have a task that fits very well inside a bash for loop. The situation, though, is that a few of the iterations seem not to terminate. What I'm looking for is a way to introduce a timeout so that if an iteration of the command hasn't terminated after e.g. two hours, it is killed and the loop moves on to the next iteration.
Rough outline:
for somecondition; do
    while time-run(command) < 2h do
        continue command
    done
done
One (tedious) way is to start the process in the background, then start another background process that attempts to kill the first one after a fixed timeout.
timeout=7200 # two hours, in seconds

for somecondition; do
    command & command_pid=$!
    ( sleep $timeout & wait; kill $command_pid 2>/dev/null ) & sleep_pid=$!
    wait $command_pid
    kill $sleep_pid 2>/dev/null # If command completes prior to the timeout
done
The wait command blocks until the original command completes, whether naturally or because it was killed after the sleep completes. The wait immediately after sleep is used in case the user tries to interrupt the process, since sleep ignores most signals, but wait is interruptible.
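If the GNU coreutils timeout utility (the same one used in the miner question above) is available, the bookkeeping with a killer subshell is unnecessary; a minimal sketch:
for somecondition; do
    timeout 2h command   # command is killed after two hours; exit status 124 indicates a timeout
done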
If I'm understanding your requirement properly, you have a process that needs to run, but you want to make sure that if it gets stuck it moves on, right? I don't know if this will fully help you out, but here is something I wrote a while back to do something similar (I've since improved this a bit, but I only have access to a gist at present, I'll update with the better version later).
#!/bin/bash
######################################################
# Program: logGen.sh
# Date Created: 22 Aug 2012
# Description: parses logs in real time into daily error files
# Date Updated: N/A
# Developer: @DarrellFX
######################################################

#Prefix for pid file
pidPrefix="logGen"

#output directory
outDir="/opt/Redacted/logs/allerrors"

#Simple function to see if running on primary
checkPrime ()
{
    if /sbin/ifconfig eth0:0|/bin/grep -wq inet;then isPrime=1;else isPrime=0;fi
}

#function to kill previous instances of this script
killScript ()
{
    /usr/bin/find /var/run -name "${pidPrefix}.*.pid" |while read pidFile;do
        if [[ "${pidFile}" != "/var/run/${pidPrefix}.${$}.pid" ]];then
            /bin/kill -- -$(/bin/cat ${pidFile})
            /bin/rm ${pidFile}
        fi
    done
}

#Check to see if primary
#If so, kill any previous instance and start log parsing
#If not, just kill leftover running processes
checkPrime
if [[ "${isPrime}" -eq 1 ]];then
    echo "$$" > /var/run/${pidPrefix}.$$.pid
    killScript
    commands && commands && commands #Where the actual command to run goes.
else
    killScript
    exit 0
fi
I then set this script to run on cron every hour. Every time the script is run, it
creates a lock file, named after a variable that describes the script, which contains the pid of that instance of the script
calls the function killScript which:
uses the find command to find all lock files for that version of the script (this lets more than one of these scripts be set to run in cron at once, for different tasks). For each file it finds, it kills the processes of that lock file and removes the lock file (it automatically checks that it's not killing itself)
Starts doing whatever it is I need to run and not get stuck (I've omitted that as it's hideous bash string manipulation that I've since redone in python).
If this doesn't get you squared away, let me know.
A few notes:
the checkPrime function is poorly done, and should either return a status, or just exit the script itself
there are better ways to create lock files and be safe about it, but this has worked for me thus far (famous last words)
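One of those "better ways", sketched purely for illustration (the lock path is a placeholder): let flock hold a lock on an open file descriptor, so the lock vanishes automatically when the script dies and there is no pid file to clean up. Note that this variant refuses to start a second copy instead of killing the old one:
#!/bin/bash
# take an exclusive, non-blocking lock on fd 200; bail out if another instance holds it
exec 200>/var/run/logGen.lock
flock -n 200 || { echo "another instance is already running" >&2; exit 1; }
# ... rest of the script runs while the lock is held ...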

ssh and bring up many processes in parallel + solaris

I have many processes, and each of them takes a lot of time to come up (5-10 minutes).
I am running my script on abc@abc1 and ssh to xyz@xyz1 to bring up the daemons.
There, on the other machine (xyz@xyz1), I want to bring up 10 processes in parallel (calling their startup scripts).
Then after 10 minutes I will check their status: are they up or down?
I am doing this because I want the execution time of (my) script to be minimal.
How can I do this using a shell script in the minimum amount of time?
Thanks
Something like this should get your processes started:
for cmd in bin/proc1 bin/proc2 bin/procn; do
    logfile=var/${cmd#bin/}.out
    ssh xyz@xyz1 "bash -c '$cmd > $logfile 2>&1 &' && echo 'started $cmd in the background. See $logfile for its output.'"
done
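And to cover the second half of the question (checking after roughly 10 minutes whether the daemons actually came up), a hedged sketch along the same lines, assuming pgrep on the remote host can find each process by name:
sleep 600   # give the daemons their 5-10 minutes to come up
for cmd in proc1 proc2 procn; do
    if ssh xyz@xyz1 "pgrep -f '$cmd' > /dev/null"; then
        echo "$cmd is up"
    else
        echo "$cmd is DOWN"
    fi
done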

Why can't I use job control in a bash script?

In this answer to another question, I was told that
in scripts you don't have job control
(and trying to turn it on is stupid)
This is the first time I've heard this, and I've pored over the bash.info section on Job Control (chapter 7), finding no mention of either of these assertions. [Update: The man page is a little better, mentioning 'typical' use, default settings, and terminal I/O, but no real reason why job control is particularly ill-advised for scripts.]
So why doesn't script-based job-control work, and what makes it a bad practice (aka 'stupid')?
Edit: The script in question starts a background process, starts a second background process, then attempts to put the first process back into the foreground so that it has normal terminal I/O (as if run directly), which can then be redirected from outside the script. Can't do that to a background process.
As noted by the accepted answer to the other question, there exist other scripts that solve that particular problem without attempting job control. Fine. And the lambasted script uses a hard-coded job number — Obviously bad. But I'm trying to understand whether job control is a fundamentally doomed approach. It still seems like maybe it could work...
What he meant is that job control is turned off by default in non-interactive mode (i.e. in a script).
From the bash man page:
JOB CONTROL
Job control refers to the ability to selectively stop (suspend)
the execution of processes and continue (resume) their execution at a
later point.
A user typically employs this facility via an interactive interface
supplied jointly by the system’s terminal driver and bash.
and
set [--abefhkmnptuvxBCHP] [-o option] [arg ...]
...
-m Monitor mode. Job control is enabled. This option is on by
default for interactive shells on systems that support it (see
JOB CONTROL above). Background processes run in a separate
process group and a line containing their exit status is
printed upon their completion.
When he said "is stupid" he meant that not only:
is job control meant mostly for facilitating interactive control (whereas a script can work directly with the pids), but also,
to quote his original answer, it "... relies on the fact that you didn't start any other jobs previously in the script which is a bad assumption to make". Which is quite correct.
UPDATE
In answer to your comment: yes, nobody will stop you from using job control in your bash script -- there is no hard case for forcefully disabling set -m (i.e. yes, job control from the script will work if you want it to). Remember that in the end, especially in scripting, there is always more than one way to skin a cat, but some ways are more portable, more reliable, or make it simpler to handle error cases, parse the output, etc.
Your particular circumstances may or may not warrant a way different from what lhunath (and other users) deem "best practices".
Job control with bg and fg is useful only in interactive shells. But & in conjunction with wait is useful in scripts too.
On multiprocessor systems, spawning background jobs can greatly improve the script's performance, e.g. in build scripts where you want to start at least one compiler per CPU, or when processing images with ImageMagick tools in parallel, etc.
The following example runs up to 8 parallel gcc's to compile all source files in an array:
#!/bin/bash
...
for ((i = 0, end=${#sourcefiles[@]}; i < end;)); do
    for ((cpu_num = 0; cpu_num < 8; cpu_num++, i++)); do
        if ((i < end)); then gcc ${sourcefiles[$i]} & fi
    done
    wait
done
There is nothing "stupid" about this. But you'll require the wait command, which waits for all background jobs before the script continues. The PID of the last background job is stored in the $! variable, so you may also wait ${!}. Note also the nice command.
Sometimes such code is useful in makefiles:
buildall:
	for cpp_file in *.cpp; do gcc -c $$cpp_file & done; wait
This gives much finer control than make -j.
Note that & is a line terminator like ; (write command& not command&;).
Hope this helps.
Job control is useful only when you are running an interactive shell, i.e., you know that stdin and stdout are connected to a terminal device (/dev/pts/* on Linux). Then, it makes sense to have something in the foreground, something else in the background, etc.
Scripts, on the other hand, don't have such a guarantee. Scripts can be made executable and run without any terminal attached. It doesn't make sense to have foreground or background processes in this case.
You can, however, run other commands non-interactively in the background (appending "&" to the command line) and capture their PIDs with $!. Then you use kill to kill or suspend them (simulating Ctrl-C or Ctrl-Z on the terminal, as if the shell were interactive). You can also use wait (instead of fg) to wait for the background process to finish.
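A small sketch of that pattern (the long-running command is a placeholder):
some_long_command &   # start it in the background, non-interactively
pid=$!                # remember its PID
kill -STOP "$pid"     # suspend it, as Ctrl-Z would in an interactive shell
kill -CONT "$pid"     # resume it
wait "$pid"           # block until it finishes, instead of fg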
It could be useful to turn on job control in a script to set traps on
SIGCHLD. The JOB CONTROL section in the manual says:
The shell learns immediately whenever a job changes state. Normally,
bash waits until it is about to print a prompt before reporting
changes in a job's status so as to not interrupt any other output. If
the -b option to the set builtin command is enabled, bash reports
such changes immediately. Any trap on SIGCHLD is executed for each
child that exits.
(emphasis is mine)
Take the following script, as an example:
dualbus@debian:~$ cat children.bash
#!/bin/bash
set -m
count=0 limit=3
trap 'counter && { job & }' CHLD
job() {
  local amount=$((RANDOM % 8))
  echo "sleeping $amount seconds"
  sleep "$amount"
}
counter() {
  ((count++ < limit))
}
counter && { job & }
wait
dualbus@debian:~$ chmod +x children.bash
dualbus@debian:~$ ./children.bash
sleeping 6 seconds
sleeping 0 seconds
sleeping 7 seconds
Note: CHLD trapping seems to be broken as of bash 4.3
In bash 4.3, you could use 'wait -n' to achieve the same thing,
though:
dualbus@debian:~$ cat waitn.bash
#!/home/dualbus/local/bin/bash
count=0 limit=3
trap 'kill "$pid"; exit' INT
job() {
  local amount=$((RANDOM % 8))
  echo "sleeping $amount seconds"
  sleep "$amount"
}
for ((i=0; i<limit; i++)); do
  ((i>0)) && wait -n; job & pid=$!
done
dualbus@debian:~$ chmod +x waitn.bash
dualbus@debian:~$ ./waitn.bash
sleeping 3 seconds
sleeping 0 seconds
sleeping 5 seconds
You could argue that there are other ways to do this in a more
portable way, that is, without CHLD or wait -n:
dualbus@debian:~$ cat portable.sh
#!/bin/sh
count=0 limit=3
trap 'counter && { brand; job & }; wait' USR1
unset RANDOM; rseed=123459876$$
brand() {
  [ "$rseed" -eq 0 ] && rseed=123459876
  h=$((rseed / 127773))
  l=$((rseed % 127773))
  rseed=$((16807 * l - 2836 * h))
  RANDOM=$((rseed & 32767))
}
job() {
  amount=$((RANDOM % 8))
  echo "sleeping $amount seconds"
  sleep "$amount"
  kill -USR1 "$$"
}
counter() {
  [ "$count" -lt "$limit" ]; ret=$?
  count=$((count+1))
  return "$ret"
}
counter && { brand; job & }
wait
dualbus@debian:~$ chmod +x portable.sh
dualbus@debian:~$ ./portable.sh
sleeping 2 seconds
sleeping 5 seconds
sleeping 6 seconds
So, in conclusion, set -m is not that useful in scripts, since the only interesting feature it brings to scripts is being able to work with SIGCHLD, and there are other ways to achieve the same thing that are either shorter (wait -n) or more portable (sending signals yourself).
Bash does support job control, as you say. In shell script writing, there is often an assumption that you can't rely on the fact that you have bash, but that you have the vanilla Bourne shell (sh), which historically did not have job control.
I'm hard-pressed these days to imagine a system in which you are honestly restricted to the real Bourne shell. Most systems' /bin/sh will be linked to bash. Still, it's possible. One thing you can do is instead of specifying
#!/bin/sh
You can do:
#!/bin/bash
That, and your documentation, would make it clear your script needs bash.
Possibly off-topic, but I quite often use nohup when I ssh into a server to run a long-running job, so that if I get logged out the job still completes.
I wonder if people are confusing stopping and starting jobs from a master interactive shell with spawning background processes? The wait command allows you to spawn a lot of things and then wait for them all to complete, and like I said I use nohup all the time. It's more complex than this and very underused; sh supports this mode too. Have a look at the manual.
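A typical example of that nohup pattern (the job name and log path are made up for the sketch):
nohup ./long_job > long_job.out 2>&1 &   # keeps running even if the ssh session drops
echo $! > long_job.pid                   # save the PID so the job can be checked on later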
You've also got
kill -STOP pid
I quite often do that if I want to suspend the currently running sudo, as in:
kill -STOP $$
But woe betide you if you've jumped out to the shell from an editor - it will all just sit there.
I tend to use mnemonic -KILL etc. because there's a danger of typing
kill - 9 pid # note the space
and in the old days you could sometimes bring the machine down because it would kill init!
jobs DO work in bash scripts
BUT, you ... NEED to watch out for the spawned stuff,
like:
ls -1 /usr/share/doc/ | while read -r doc ; do ... done
jobs will have different context on each side of the |
A way around this is to use for instead of while:
for doc in `ls -1 /usr/share/doc` ; do ... done
The script below should demonstrate how to use jobs in a script,
with the mention that my commented note is ... real (I don't know why that behaviour occurs):
#!/bin/bash
for i in `seq 7` ; do ( sleep 100 ) & done
jobs
while [ `jobs | wc -l` -ne 0 ] ; do
    for jobnr in `jobs | awk '{print $1}' | cut -d\[ -f2- | cut -d\] -f1` ; do
        kill %$jobnr
    done
    #this is REALLY ODD ... but while won't exit without this ... dunno why
    jobs >/dev/null 2>/dev/null
done
sleep 1
jobs
