Get ssh remote command to terminate so that xargs with parallel option can continue - bash

I'm running a command similar to the following
getHosts | xargs -I{} -P3 -n1 ssh {} 'startServer; sleep 5; grep -m 1 "server up" <(tail -f log)'
The problem is that ssh sometimes seems to hang for a while, even well after the server has come up. Is there anything about this command that might cause it not to terminate, so that parallel execution can continue? When I run the command in a remote shell, the check for the server coming up seems reliable and exits promptly when "server up" is written to the logs.

Instead of the remote command being
startServer; sleep 5; grep -m 1 "server up" <(tail -f log)
I'd use
grep -m 1 "server up" <(tail -F log -n 0) & startServer ; wait
Differences:
Start tailing the log before attempting to restart the server, so that we don't miss any messages. We start at the end of the log so we don't see any previous "server up" messages.
Use tail's -F option instead of -f, so that if the log file is rotated we will follow the new file, instead of continuing to uselessly follow the old file.
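Putting both changes together, the full pipeline from the question would look something like this (a sketch only, reusing the question's getHosts, startServer and log names):
getHosts | xargs -I{} -P3 -n1 ssh {} 'grep -m 1 "server up" <(tail -F log -n 0) & startServer; wait'
wait returns as soon as grep sees the line, so the remote command ends and ssh exits, letting xargs move on to the next host.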

Two ways I could see it failing to terminate:
Remote end hangs on startServer
The server generates so many messages after "server up" that the line has already scrolled past the last 10 lines tail -f shows by default, so grep never sees it and waits forever
ssh could also fail to connect for a variety of reasons: host down, keys lost, etc. I would add some error checking, in the form of writing to a log and/or something like
|| echo "Failed to do stuff" | mail -s SUBJECT TO@WHO.com
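If you want that kind of error handling around every host, one way is to wrap the per-host work in a function and let xargs call it. This is only a sketch: the 120-second timeout, start-failures.log and the mail address are placeholders, and mail has to be configured on the machine running the script.
start_one() {
    host=$1
    if ! timeout 120 ssh "$host" 'grep -m 1 "server up" <(tail -F log -n 0) & startServer; wait'; then
        echo "$(date): $host did not report \"server up\"" >> start-failures.log
        echo "Failed to start server on $host" | mail -s "startServer failed on $host" you@example.com
    fi
}
export -f start_one
getHosts | xargs -P3 -n1 -I{} bash -c 'start_one "$1"' _ {}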

Related

How to wait for message to appear in log in shell

Could you please provide a neat solution to block execution of the script until a text snippet appears in a given file?
Wait forever
grep -q 'ProducerService started' <(tail -f logs/batch.log)
Wait with timeout
timeout 30s grep -q 'ProducerService started' <(tail -f logs/batch.log)
Wait with timeout, notify error
timeout 30s grep -q 'ProducerService started' <(tail -f logs/batch.log) || exit 1
Use inotifywait
inotifywait efficiently waits for changes to files.
For example:
stop the process that needs to wait
inotifywait -q -e modify /path/to/file/containing/snippet
check for the change in the file
if the change matches the snippet, restart the script
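A minimal sketch of that idea, assuming inotify-tools is installed, reusing the file path from the step above and the snippet from the question; the 30-second re-check interval is just a safety net:
until grep -q 'ProducerService started' /path/to/file/containing/snippet; do
    inotifywait -q -t 30 -e modify /path/to/file/containing/snippet >/dev/null
done
echo 'snippet found, continuing'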

Captured output of command on remote host (SSH via Cron) is blank

Below is a script which logs into a remote host (a Cisco IOS-XR router) and runs a single command via SSH. The idea is to grab the result of the command (an integer) so that it can be graphed by Cacti. Cacti runs this script every 5 minutes as part of its normal poll routine:
#!/bin/bash
if [[ -z $1 ]]
then
exit 1
fi
HOST="$1"
USER="cact-ssh-user"
TIMEOUT=10s
export SSHPASS="aaaaaaaaaaaaa"
CMD="show controllers np struct IPV4-LEAF-FAST-P np0 | in Entries"
RAW_OUTPUT=$(timeout $TIMEOUT sshpass -e ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null $USER@$HOST "$CMD" 2>/dev/null)
GRT_UCASTV4_USED=$(echo -n "$RAW_OUTPUT" | grep "Entries" | awk '{print $6}' | tr -d "," | tr -d " ")
echo -n "ucastv4_used:$GRT_UCASTV4_USED"
This command works fine via an interactive shell (when I run the script on the Cacti server using /path/to/script/script.sh 10.0.0.1). However, when the Cacti cronjob runs, the output is simply blank. On my SSH session to the Cacti server the output is:
$ ./script 10.0.0.1
ucastv4_used:1234
In the Cacti log the output is: 05/22/2017 03:35:21 PM - SPINE: Poller[0] Host[69] TH[1] DS[6837] SCRIPT: /opt/scripts/cacti-scripts/asr9001-get-tcam-ucast-usage.sh 10.0.0.1, output: ucastv4_used:
I have su'ed to the Cacti user and the script works just fine. So this seems to be specific to it running as a cronjob; the output from the SSH command is being redirected somewhere magically and I don't know where or why.
To try and debug this I have added the following lines to the script (directly under #!/bin/bash) and waited for the Cacti 5 minute poll interval to run (I can see in the Cacti log when the script is called every 5 minutes):
exec >/tmp/stdout.log 2>/tmp/stderr.log
set -x
The stdout.log just contains ucastv4_used: (the same as cacti.log), and the stderr.log file contains the login banner of the remote SSH host and nothing else. Where has the SSH output gone?
I have tried to change the SSH line in the script to output to a file and then read from there:
timeout $TIMEOUT sshpass -e ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null $USER@$HOST "$CMD" > /tmp/output 2>/dev/null
GRT_UCASTV4_USED=$(grep "Entries" /tmp/output | awk '{print $6}' | tr -d "," | tr -d " ")
The file /tmp/output is empty and so the GRT_UCASTV4_USED variable is empty also. stdout.log ends up being the same as before: ucastv4_used:
I have also tried to change #!/bin/bash to #!/bin/bash -i to force an interactive session. This kind of works, in that with -i, if I add echo $PS1 to the script, I can see in the stdout.log file that $PS1 is set, whereas without -i it prints nothing. However, there is still no output from the SSH command. Where is the output of the SSH command going?
I have also tried to use ssh ..... | tee /tmp/output so that the output should show up in /tmp/output and /tmp/stdout.log but both are blank.
I can see on the remote router that the SSH session is coming in and running the command. This is from debug ssh server:
RP/0/RSP0/CPU0:May 22 14:52:57.976 UTC: SSHD_[65909]: (open_master_file) command added show controllers np struct IPV4-LEAF-FAST-P np0 | in Entries
Also, since this is working via my interactive session with the Cacti server, I am guessing the issue is there and not on the router. I am also confident that Cacti itself is not the problem: I can trigger spine to poll this router host from my interactive SSH session and the script works fine (further pointing to the issue that somehow, in a non-interactive shell, the SSH output is evaporating):
$ cd /usr/local/spine/bin
$ ./spine -V 7 69 69
...
05/22/2017 04:06:56 PM - SPINE: Poller[0] Host[69] TH[1] DS[6837] SCRIPT: /opt/scripts/cacti-scripts/asr9001-get-tcam-ucast-usage.sh 10.0.0.1, output: ucastv4_used:658809
So it seems that the SSH output is being redirected somewhere and I can't "get it" or the router somehow knows this is a non-interactive SSH client and isn't sending anything back. How else can I debug this?
Update 1
Using debug ssh server on the Cisco router I have captured the debug logs when I am running the script via my interactive SSH session to the Cacti server and when it runs via Cacti's poll interval/cron job. I have diff'ed the output and the only interesting looking difference I can find (besides stuff like the SSH PID changing and ephemeral source port of the Cacti server changing etc.) is the following:
*** 132,145 ****
(sshd_interactive_shell) *** removing alarm
sshd_interactive_shell - ptyfd = 46
event_contex_init done
! sshd_ptytonet - Channel 1 Received EOT (bytes:1)
! sshd_ptytonet - Channel 1 exec command executed sending CHANNEL_CLOSE
! (close_channel), pid:182260085, sig rcvd:1, state:10 chan_id:1
! addrem_ssh_info_tuple: REMOVE Inside the critical Section %pid:182260085
! Cleanup sshd process 182260085, session id 1, channel_id 1
! addrem_ssh_info_tuple: REMOVE exiting the Critical Section %pid:182260085
close_channel: Accounting stopped: scriptaccount
! In delete channel code, pid:182260085, sig rcvd:1, state:10 chan_id:1
Sending Exit Status: 0 sig: 1
Sending Channel EOF msg
Sending Channel close msg for remote_chan_id = 0 chan_id = 1
--- 134,147 ----
(sshd_interactive_shell) *** removing alarm
sshd_interactive_shell - ptyfd = 46
event_contex_init done
! Pad_len = 6, Packlen = 12
! sshd_nettopty: EOF received. Disconnecting session
! (close_channel), pid:182329717, sig rcvd:1, state:10 chan_id:1
! addrem_ssh_info_tuple: REMOVE Inside the critical Section %pid:182329717
! Cleanup sshd process 182329717, session id 1, channel_id 1
! addrem_ssh_info_tuple: REMOVE exiting the Critical Section %pid:182329717
close_channel: Accounting stopped: scriptaccount
! In delete channel code, pid:182329717, sig rcvd:1, state:10 chan_id:1
Sending Exit Status: 0 sig: 1
Sending Channel EOF msg
Sending Channel close msg for remote_chan_id = 0 chan_id = 1
The top half is my interactive session with the Cacti server. I note in the top half sshd_ptytonet - Channel 1 Received EOT (bytes:1), whereas via the cronjob the debug shows sshd_nettopty: EOF received. Disconnecting session. Is the non-interactive session simply passing my SSH command to the remote host and quitting as quickly as possible (so it's not waiting for the SSH server to respond with the command output)?
First, tell the SSH client not to allocate a PTY, with the -T option, because obviously cron doesn't have one.
Then give it something endless on stdin, so that ssh doesn't see an immediate EOF and tear the session down before the output comes back; we have /dev/zero for exactly this purpose.
RAW_OUTPUT=$(timeout $TIMEOUT sshpass -e ssh -T -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null $USER@$HOST "$CMD" </dev/zero 2>/dev/null)
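To test the change without waiting for the poller, one idea (my own, not from the original answer) is to approximate the cron environment by detaching stdin from the terminal; with -T and the /dev/zero redirect in place the value should still come back:
./script.sh 10.0.0.1 < /dev/null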

Docker kill an infinite process in a container after X amount of time

I am using the code found in this docker issue to start a container and run a process in it for up to 20 seconds; whether the process completes, fails to execute, or times out, the container is killed regardless.
The code I am using currently is this:
#!/bin/bash
set -e
to=$1
shift
cont=$(docker run -d "$@")
code=$(timeout "$to" docker wait "$cont" || true)
docker kill $cont &> /dev/null
echo -n 'status: '
if [ -z "$code" ]; then
    echo timeout
else
    echo exited: $code
fi
echo output:
# pipe to sed simply for pretty nice indentation
docker logs $cont | sed 's/^/\t/'
docker rm $cont &> /dev/null
Which is almost perfect; however, if you run an infinite process (for example, this Python infinite loop):
while True:
    print "infinite loop"
the whole system jams up and the app crashes. After reading around a bit I think it has something to do with the stdout buffer, but I have absolutely no idea what that means?
The problem you have is with a process that is writing massive amounts of data to stdout.
These messages get logged into a file which grows infinitely.
Have a look at (depending on your system's location for log files):
sudo find /var/lib/docker/containers/ -name '*.log' -ls
You can remove old log files if they are of no interest.
One possibility is to start your docker run -d daemon
under a ulimit restriction on the max size a file can be.
Add to the start of your script, for example:
ulimit -f 20000 -c 0
This limits file sizes to 20000*1024 bytes and disables core dumps, which you can otherwise expect from infinite loops whose writes are forced to fail.
Please add & at the end of
cont=$(docker run -d "$@") &
It will run the process in the background.
I don't know Docker well, but if it still fails to stop, you may also add the following just after that line:
mypid=$!
sleep 20 && kill $mypid
Regards

bash script parallel ssh remote command

I have a script that fires remote commands on several different machines through SSH connections. The script goes something like:
for server in list; do
echo "output from $server"
ssh to server execute some command
done
The problem with this is evidently the time: it needs to establish an ssh connection, fire the command, wait for the answer, and print it, one server after another. What I would like is a script that tries to establish all the connections at once and prints "output from $server" and the output of the command as soon as it gets it, so not necessarily in list order.
I've been googling this for a while but didn't find an answer. I cannot cancel the ssh session after the command runs, as one thread suggested, because I need the output, and I cannot use GNU parallel as suggested in other threads. Also I cannot use any other tool; I cannot bring/install anything onto this machine, and the only usable tool is GNU bash, version 4.1.2(1)-release.
Another question is: how are ssh sessions like this limited? If I simply paste 5 or more lines of "ssh connect, do some command" it actually doesn't do anything, or executes only the first one from the list (it works if I paste 3-4 lines). Thank you
Have you tried this?
for server in list; do
ssh user@server "command" &
done
wait
echo finished
Update: Start subshells:
for server in list; do
(echo "output from $server"; ssh user@server "command"; echo End $server) &
done
wait
echo All subshells finished
There are several parallel SSH tools that can handle that for you:
http://code.google.com/p/pdsh/
http://sourceforge.net/projects/clusterssh/
http://code.google.com/p/sshpt/
http://code.google.com/p/parallel-ssh/
Also, you could be interested in configuration deployment solutions such as Chef, Puppet, Ansible, Fabric, etc. (see this summary).
A third option is to use a terminal broadcast such as pconsole
If you can only use GNU commands, you can write your script like this:
for server in $servers ; do
( { echo "output from $server" ; ssh user@$server "command" ; } | \
sed -e "s/^/$server:/" ) &
done
wait
and then sort the output to reconcile the lines.
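Since every line is prefixed with its server name, the interleaved output is easy to group afterwards. For example, assuming the loop above is saved as parallel-run.sh (a name I made up):
./parallel-run.sh 2>&1 | sort -t: -k1,1 -s
The -s (stable) flag keeps each server's lines in their original order while grouping servers together.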
I started with the shell hacks mentioned in this thread, then proceeded to something somewhat more robust: https://github.com/bearstech/pussh
It's my daily workhorse, and I basically run anything against 250 servers in 20 seconds (it's actually rate limited, otherwise the connection rate kills my ssh-agent). I've been using this for years.
See for yourself from the man page (clone it and run 'man ./pussh.1'): https://github.com/bearstech/pussh/blob/master/pussh.1
Examples
Show all servers' rootfs usage in descending order:
pussh -f servers df -h / |grep /dev |sort -rn -k5
Count the number of processors in a cluster:
pussh -f servers grep ^processor /proc/cpuinfo |wc -l
Show the processor models, sorted by occurrence:
pussh -f servers sed -ne "s/^model name.*: //p" /proc/cpuinfo |sort |uniq -c
Fetch a list of installed packages in one file per host:
pussh -f servers -o packages-for-%h dpkg --get-selections
Mass copy a file tree (broadcast):
tar czf files.tar.gz ... && pussh -f servers -i files.tar.gz tar -xzC /to/dest
Mass copy several remote file trees (gather):
pussh -f servers -o '|(mkdir -p %h && tar -xzC %h)' tar -czC /src/path .
Note that the pussh -u feature (upload and execute) was the main reason why I programmed this; no other tool seemed able to do it. I still wonder if that's the case today.
You may like the parallel-ssh project with the pssh command:
pssh -h servers.txt -l user command
It will output one line per server when the command is successfully executed. With the -P option you can also see the output of the command.
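For example (a sketch; servers.txt holds one host per line and the 30-second timeout is arbitrary):
pssh -h servers.txt -l user -P -t 30 uptime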

How do I kill a backgrounded/detached ssh session?

I am using the program synergy together with an ssh tunnel.
It works, I just have to open a console and type these two commands:
ssh -f -N -L localhost:12345:otherHost:12345 otherUser@OtherHost
synergyc localhost
Because I'm lazy I made a Bash script which is run with one mouse click on an icon:
#!/bin/bash
ssh -f -N -L localhost:12345:otherHost:12345 otherUser@OtherHost
synergyc localhost
The Bash script above works as well, but now I also want to kill synergy and the ssh tunnel with one mouse click, so I have to save the PIDs of synergy and ssh into files to kill them later:
#!/bin/bash
mkdir -p /tmp/synergyPIDs || exit 1
rm -f /tmp/synergyPIDs/ssh || exit 1
rm -f /tmp/synergyPIDs/synergy || exit 1
[ ! -e /tmp/synergyPIDs/ssh ] || exit 1
[ ! -e /tmp/synergyPIDs/synergy ] || exit 1
ssh -f -N -L localhost:12345:otherHost:12345 otherUser@OtherHost
echo $! > /tmp/synergyPIDs/ssh
synergyc localhost
echo $! > /tmp/synergyPIDs/synergy
But the files of this script are empty.
How do I get the PIDs of ssh and synergy?
(I try to avoid ps aux | grep ... | awk ... | sed ... combinations, there has to be an easier way.)
With all due respect to the users of pgrep, pkill, ps | awk, etc, there is a much better way.
Consider that if you rely on ps -aux | grep ... to find a process you run the risk of a collision. You may have a use case where that is unlikely, but as a general rule, it's not the way to go.
SSH provides a mechanism for managing and controlling background processes. But like so many SSH things, it's an "advanced" feature, and many people (it seems, from the other answers here) are unaware of its existence.
In my own use case, I have a workstation at home on which I want to leave a tunnel that connects to an HTTP proxy on the internal network at my office, and another one that gives me quick access to management interfaces on co-located servers. This is how you might create the basic tunnels, initiated from home:
$ ssh -fNT -L8888:proxyhost:8888 -R22222:localhost:22 officefirewall
$ ssh -fNT -L4431:www1:443 -L4432:www2:443 colocatedserver
These cause ssh to background itself, leaving the tunnels open. But if the tunnel goes away, I'm stuck, and if I want to find it, I have to parse my process list and hope I've got the "right" ssh (in case I've accidentally launched multiple ones that look similar).
Instead, if I want to manage multiple connections, I use SSH's ControlMaster config option, along with the -O command-line option for control. For example, with the following in my ~/.ssh/config file,
host officefirewall colocatedserver
ControlMaster auto
ControlPath ~/.ssh/cm_sockets/%r@%h:%p
the ssh commands above, when run, will leave spoor in ~/.ssh/cm_sockets/ which can then provide access for control, for example:
$ ssh -O check officefirewall
Master running (pid=23980)
$ ssh -O exit officefirewall
Exit request sent.
$ ssh -O check officefirewall
Control socket connect(/home/ghoti/.ssh/cm_socket/ghoti@192.0.2.5:22): No such file or directory
And at this point, the tunnel (and controlling SSH session) is gone, without the need to use a hammer (kill, killall, pkill, etc).
Bringing this back to your use-case...
You're establishing the tunnel through which you want synergyc to talk to synergys on TCP port 12345. For that, I'd do something like the following.
Add an entry to your ~/.ssh/config file:
Host otherHosttunnel
HostName otherHost
User otherUser
LocalForward 12345 otherHost:12345
RequestTTY no
ExitOnForwardFailure yes
ControlMaster auto
ControlPath ~/.ssh/cm_sockets/%r@%h:%p
Note that the command line -L option is handled with the LocalForward keyword, and the Control{Master,Path} lines are included to make sure you have control after the tunnel is established.
Then, you might modify your bash script to something like this:
#!/bin/bash
if ! ssh -f -N otherHosttunnel; then
echo "ERROR: couldn't start tunnel." >&2
exit 1
else
synergyc localhost
ssh -O exit otherHosttunnel
fi
The -f option backgrounds the tunnel, leaving a socket on your ControlPath to close the tunnel later. If the ssh fails (which it might due to a network error or ExitOnForwardFailure), there's no need to exit the tunnel, but if it did not fail (else), synergyc is launched and then the tunnel is closed after it exits.
You might also want to look into whether the SSH option LocalCommand could be used to launch synergyc from right within your ssh config file.
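For reference, that would mean adding something like this to the otherHosttunnel block above (an untested sketch; note that per ssh_config(5) the LocalCommand runs synchronously on the local machine after the connection is up, so it may not combine well with -f):
PermitLocalCommand yes
LocalCommand synergyc localhost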
Quick summary: Will not work.
My first idea is that you need to start the processes in the background to get their PIDs with $!.
A pattern like
some_program &
some_pid=$!
wait $some_pid
might do what you need... except that then ssh won't be in the foreground to ask for passphrases any more.
Well then, you might need something different after all. ssh -f probably spawns a new process your shell can never learn about just from invoking it. Ideally, ssh itself would offer a way to write its PID into some file.
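One partial exception, already shown in the ControlMaster answer above: when a control socket is in use, ssh -O check prints the master's PID, which could be parsed if you really need it (a sketch using the host alias from that answer):
ssh -O check officefirewall 2>&1 | sed -n 's/.*pid=\([0-9]*\).*/\1/p'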
Just came across this thread and wanted to mention the "pidof" Linux utility:
$ pidof init
1
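Applied to the question's script, that could look like this (a sketch; note that pidof prints every matching PID, so this is only clean if exactly one ssh and one synergyc are running):
ssh -f -N -L localhost:12345:otherHost:12345 otherUser@OtherHost
synergyc localhost
pidof ssh > /tmp/synergyPIDs/ssh
pidof synergyc > /tmp/synergyPIDs/synergy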
You can use lsof to show the pid of the process listening to port 12345 on localhost:
lsof -t -i @localhost:12345 -sTCP:listen
Examples:
PID=$(lsof -t -i @localhost:12345 -sTCP:listen)
lsof -t -i @localhost:12345 -sTCP:listen >/dev/null && echo "Port in use"
Well, I don't want to add an & at the end of the commands, as the connection will die if the console window is closed... so I ended up with a ps-grep-awk-sed combo:
ssh -f -N -L localhost:12345:otherHost:12345 otherUser@otherHost
echo `ps aux | grep -F 'ssh -f -N -L localhost' | grep -v -F 'grep' | awk '{ print $2 }'` > /tmp/synergyPIDs/ssh
synergyc localhost
echo `ps aux | grep -F 'synergyc localhost' | grep -v -F 'grep' | awk '{ print $2 }'` > /tmp/synergyPIDs/synergy
(you could integrate grep into awk, but I'm too lazy now)
You can drop the -f, which makes ssh run in the background, and instead run it with eval and force it into the background yourself.
You can then grab the pid. Make sure to put the & within the eval statement.
eval "ssh -N -L localhost:12345:otherHost:12345 otherUser@OtherHost & "
tunnelpid=$!
Another option is to use pgrep to find the PID of the newest ssh process
ssh -fNTL 8073:localhost:873 otherUser@OtherHost
tunnelPID=$(pgrep -n -x ssh)
synergyc localhost
kill -HUP $tunnelPID
This is more of a special case for synergyc (and most other programs that try to daemonize themselves). Using $! would work, except that synergyc does a clone() syscall during execution that gives it a new PID other than the one bash thinks it has. If you want to get around this so that you can use $!, then you can tell synergyc to stay in the foreground and then background it yourself.
synergyc -f -n mydesktop remoteip &
synergypid=$!
synergyc also does a few other things like autorestart that you may want to turn off if you are trying to manage it.
Based on the very good answer of @ghoti, here is a simpler script (for testing) utilising the SSH control sockets without the need for extra configuration:
#!/bin/bash
if ssh -fN -MS /tmp/mysocket -L localhost:12345:otherHost:12345 otherUser@otherHost; then
synergyc localhost
ssh -S /tmp/mysocket -O exit otherHost
fi
synergyc will only be started if the tunnel has been established successfully, and the tunnel itself will be closed as soon as synergyc returns.
The solution does, however, lack proper error reporting.
You could look for the ssh process that is bound to your local port, using this line:
netstat -tpln | grep 127\.0\.0\.1:12345 | awk '{print $7}' | sed 's#/.*##'
It returns the PID of the process using port 12345/TCP on localhost. So you don't have to filter all ssh results from ps.
If you just need to check whether that port is bound, use:
netstat -tln | grep 127\.0\.0\.1:12345 >/dev/null 2>&1
It returns 1 if nothing is bound, or 0 if something is listening on this port.
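Tying that back to the question's script, the PID found this way can be saved and killed later (a sketch reusing the question's file layout):
tunnelpid=$(netstat -tpln 2>/dev/null | grep 127\.0\.0\.1:12345 | awk '{print $7}' | sed 's#/.*##')
echo "$tunnelpid" > /tmp/synergyPIDs/ssh
# later, to tear the tunnel down:
kill "$(cat /tmp/synergyPIDs/ssh)"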
There are many interesting answers here, but nobody mentioned that the manpage of SSH describes this exact case (see the TCP FORWARDING section), and the solution it offers is much simpler:
ssh -fL 12345:localhost:12345 user@remoteserver sleep 10
synergyc localhost
Now in detail:
First we start SSH with a tunnel; thanks to -f it will initiate the connection and only then fork to the background (unlike solutions with ssh ... & ; pid=$!, where ssh is sent to the background and the next command is executed before the tunnel is created). On the remote machine it will run sleep 10, which will wait 10 seconds and then end.
Within 10 seconds, we should start our desired command, in this case synergyc localhost. It will connect to the tunnel and SSH will then know that the tunnel is in use.
After 10 seconds pass, sleep 10 command will finish. But the tunnel is still in use by synergyc, so SSH will not close the underlying connection until the tunnel is released (i.e. until synergyc closes socket).
When synergyc is closed, it will release the tunnel, and SSH in turn will terminate itself, closing a connection.
The only downside of this approach is that if the program we use closes and re-opens the connection for some reason, SSH will close the tunnel right after the connection is closed, and the program won't be able to reconnect. If this is an issue, you should use the approach described in @doak's answer, which uses a control socket to properly terminate the SSH connection and uses -f to make sure the tunnel is created when SSH forks to the background.
