How to get the PID of a process in a pipeline - bash

Consider the following simplified example:
my_prog|awk '...' > output.csv &
my_pid="$!" #Gives the PID for awk instead of for my_prog
sleep 10
kill $my_pid #my_prog still has data in its buffer that awk never saw. Data is lost!
In bash, $! gives the PID of the last command in the pipeline, so $my_pid points to awk. However, I need the PID of my_prog. If I kill awk, my_prog does not know to flush its output buffer and data is lost. So, how would one obtain the PID of my_prog? Note that ps aux | grep my_prog will not work, since there may be several instances of my_prog running.
NOTE: changed cat to awk '...' to help clarify what I need.

Just had the same issue. My solution:
process_1 | process_2 &
PID_OF_PROCESS_2=$!
PID_OF_PROCESS_1=`jobs -p`
Just make sure process_1 is the first background process. Otherwise, you need to parse the full output of jobs -l.
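A concrete sketch of that pattern applied to the question (assuming this is the only background job in the script):
my_prog | awk '...' > output.csv &
awk_pid=$!                 # $! is the PID of the last pipeline member (awk)
my_prog_pid=$(jobs -p)     # jobs -p prints the job leader's PID, i.e. my_prog
sleep 10
kill "$my_prog_pid"        # signal my_prog directly so it can flush (assuming it handles TERM)
wait                       # let awk consume whatever is still in the pipe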

I was able to solve it by explicitly naming the pipe using mkfifo.
Step 1: mkfifo capture.
Step 2: Run this script
my_prog > capture &
my_pid="$!" #Now, I have the PID for my_prog!
awk '...' capture > out.csv &
sleep 10
kill $my_pid #kill my_prog
wait #wait for awk to finish.
I don't like having to manage the fifo, though. Hopefully someone has an easier solution.
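One way to reduce the fifo housekeeping is to create it under a temporary name and remove it automatically on exit (a sketch; the mktemp naming and the trap are my additions, not part of the answer):
fifo=$(mktemp -u /tmp/capture.XXXXXX)  # generate a unique name (file not created yet)
mkfifo "$fifo"
trap 'rm -f "$fifo"' EXIT              # clean up the fifo when the script exits
my_prog > "$fifo" &
my_pid=$!                              # PID of my_prog
awk '...' "$fifo" > out.csv &
sleep 10
kill "$my_pid"
wait                                   # wait for awk to finish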

Here is a solution without wrappers or temporary files. This only works for a background pipeline whose output is captured away from stdout of the containing script, as in your case. Suppose you want to do:
cmd1 | cmd2 | cmd3 >pipe_out &
# do something with PID of cmd2
If only bash could provide ${PIPEPID[n]}!! The replacement "hack" that I found is the following:
PID=$( { cmd1 | { cmd2 0<&4 & echo $! >&3 ; } 4<&0 | cmd3 >pipe_out & } 3>&1 | head -1 )
If needed, you can also close fd 3 (for the cmd* processes) and fd 4 (for cmd2) with 3>&- and 4<&-, respectively. If you do, make sure cmd2 closes fd 4 only after its fd 0 has been redirected from it.
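The same command laid out across lines with comments (identical redirections, just spread out for readability):
PID=$(
  {
    cmd1 |
      { cmd2 0<&4 &     # run cmd2 in the background, restoring its stdin from fd 4
        echo $! >&3     # ship cmd2's PID out on fd 3
      } 4<&0 |          # fd 4 saves this group's stdin: the pipe coming from cmd1
      cmd3 >pipe_out &
  } 3>&1 | head -1      # fd 3 of the group is tied to the pipe read by head
)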

Add a shell wrapper around your command and capture the pid. For my example I use iostat.
#!/bin/sh
echo $$ > /tmp/my.pid
exec iostat 1
exec replaces the shell with the new process, preserving the PID.
test.sh | grep avg
While that runs:
$ cat /tmp/my.pid
22754
$ ps -ef | grep iostat
userid 22754 4058 0 12:33 pts/12 00:00:00 iostat 1
So you can:
sleep 10
kill `cat /tmp/my.pid`
Is that more elegant?
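Applied to the question's my_prog, the same idea might look like this (a sketch; wrapper.sh and the pid-file path are names I made up):
#!/bin/sh
# wrapper.sh: record our PID, then become my_prog via exec
echo $$ > /tmp/my_prog.pid
exec my_prog
Then:
./wrapper.sh | awk '...' > output.csv &
sleep 10
kill "$(cat /tmp/my_prog.pid)"   # signals my_prog itself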

Improving @Marvin's and @Nils Goroll's answers with a one-liner that extracts the PIDs of all commands in the pipeline into a shell array variable:
# run some command
ls -l | rev | sort > /dev/null &
# collect pids
pids=(`jobs -l % | egrep -o '^(\[[0-9]+\]\+| ) [ 0-9]{5} ' | sed -e 's/^[^ ]* \+//' -e 's! $!!'`)
# use them for something
echo pid of ls -l: ${pids[0]}
echo pid of rev: ${pids[1]}
echo pid of sort: ${pids[2]}
echo pid of first command e.g. ls -l: $pids
echo pid of last command e.g. sort: ${pids[-1]}
# wait for last command in pipe to finish
wait ${pids[-1]}
In my solution ${pids[-1]} contains the value normally available in $! (negative array subscripts require bash 4.3 or newer). Please note the use of jobs -l %, which outputs just the "current" job, which by default is the last one started.
Sample output:
pid of ls -l: 2725
pid of rev: 2726
pid of sort: 2727
pid of first command e.g. ls -l: 2725
pid of last command e.g. sort: 2727
UPDATE 2017-11-13: Improved the pids=... command so that it works better with complex (multi-line) commands.

Based on your comment, I still can't see why you'd prefer killing my_prog to having it complete in an orderly fashion. Ten seconds is a pretty arbitrary measurement on a multiprocessing system, where my_prog could generate 10k lines or 0 lines of output depending on system load.
If you want to limit the output of my_prog to something more determinate, try
my_prog | head -1000 | awk '...'
without detaching from the shell. In the worst case, head will close its input and my_prog will get a SIGPIPE. In the best case, change my_prog so it gives you the amount of output you want.
added in response to comment:
Insofar as you have control over my_prog, give it an optional -s duration argument. Then somewhere in your main loop you can put the predicate:
if (duration_exceeded()) {
    exit(0);
}
where exit will in turn properly flush the output FILEs. If desperate and there is no place to put the predicate, this could be implemented using alarm(3), which I am intentionally not showing because it is bad.
The core of your trouble is that my_prog runs forever. Everything else here is a hack to get around that limitation.
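If my_prog cannot be changed but does handle SIGTERM by exiting cleanly, GNU coreutils timeout bounds the runtime without any PID bookkeeping (my suggestion, not from the answer):
timeout 10 my_prog | awk '...' > output.csv
# timeout sends my_prog a SIGTERM after 10 seconds; buffered output survives
# only if my_prog catches the signal and exits through its normal cleanup path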

With inspiration from @Demosthenex's answer: using subshells:
$ ( echo $BASHPID > pid1; exec vmstat 1 5 ) | tail -1 &
[1] 17371
$ cat pid1
17370
$ pgrep -fl vmstat
17370 vmstat 1 5
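The same trick applied to the question's pipeline (a sketch; the pid-file name is arbitrary):
( echo $BASHPID > /tmp/my_prog.pid; exec my_prog ) | awk '...' > output.csv &
sleep 10
kill "$(cat /tmp/my_prog.pid)"   # signals my_prog itself, thanks to exec
wait                             # give awk time to drain the pipe and exit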

My solution was to query jobs and parse it using perl.
Start two pipelines in the background:
$ sleep 600 | sleep 600 | sleep 600 | sleep 600 | sleep 600 &
$ sleep 600 | sleep 600 | sleep 600 | sleep 600 | sleep 600 &
Query background jobs:
$ jobs
[1]- Running sleep 600 | sleep 600 | sleep 600 | sleep 600 | sleep 600 &
[2]+ Running sleep 600 | sleep 600 | sleep 600 | sleep 600 | sleep 600 &
$ jobs -l
[1]- 6108 Running sleep 600
6109 | sleep 600
6110 | sleep 600
6111 | sleep 600
6112 | sleep 600 &
[2]+ 6114 Running sleep 600
6115 | sleep 600
6116 | sleep 600
6117 | sleep 600
6118 | sleep 600 &
Parse the jobs list of the second job, %2. The parsing is probably error-prone, but in these cases it works. We aim to capture the first number followed by a space. It is stored in the variable pids as an array by using the parentheses:
$ pids=($(jobs -l %2 | perl -pe '/(\d+) /; $_=$1 . "\n"'))
$ echo $pids
6114
$ echo ${pids[*]}
6114 6115 6116 6117 6118
$ echo ${pids[2]}
6116
$ echo ${pids[4]}
6118
And for the first pipeline:
$ pids=($(jobs -l %1 | perl -pe '/(\d+) /; $_=$1 . "\n"'))
$ echo ${pids[2]}
6110
$ echo ${pids[4]}
6112
We could wrap this into a little alias/function:
function pipeid() { jobs -l ${1:-%%} | perl -pe '/(\d+) /; $_=$1 . "\n"'; }
$ pids=($(pipeid)) # PIDs of last job
$ pids=($(pipeid %1)) # PIDs of first job
I have tested this in bash and zsh. Unfortunately, in bash I could not pipe the output of pipeid into another command, probably because that pipeline is run in a subshell, which cannot query the job list of the parent shell.
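For instance, to signal only the first member of the question's pipeline (a sketch; bash arrays are 0-indexed, zsh arrays 1-indexed):
my_prog | awk '...' > output.csv &
pids=($(pipeid))     # all PIDs of the most recent background job
kill "${pids[0]}"    # the first PID listed is my_prog (bash indexing)
wait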

I was desperately looking for a good solution to get all the PIDs of a pipeline job, and one promising approach failed miserably (see previous revisions of this answer).
So, unfortunately, the best I could come up with is parsing the jobs -l output using GNU awk:
function last_job_pids {
    if [[ -z "${1}" ]] ; then
        return
    fi

    jobs -l | awk '
        /^\[/ { delete pids; pids[$2]=$2; seen=1; next; }
        //    { if (seen) { pids[$1]=$1; } }
        END   { for (p in pids) print p; }'
}
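Usage might look like this (a sketch; note that the guard above makes the argument mandatory even though the function body never uses it):
sleep 600 | sleep 600 | sleep 600 &
for pid in $(last_job_pids x); do   # 'x' merely satisfies the -z guard
    echo "pipeline member: $pid"
done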


How to evaluate results of pipes in a bash script

I need help with the following problem:
I'd like to kill all instances of a program, let's say xpdf.
At the prompt the following works as intended:
$ ps -e | grep xpdf | sed -n -e "s/^[^0-9]*\([0-9]*\)[^0-9]\{1,\}.*$/\1/p" | xargs kill -SIGTERM
(the sed step is required to extract the PIDs).
However, it may be the case that no xpdf process is running. Then it would be difficult to embed the line in a script, because the script aborts immediately with an error message from kill. What can I do about that?
I tried (in a script)
#!/bin/bash
#
set -x
test=""
echo "test = < $test >"
test=`ps -e | grep xpdf | sed -n -e "s/^[^0-9]*\([0-9]*\)[^0-9]\{1,\}.*$/\1/p"`
echo "test = < $test >"
if [ -z "$test" ]; then echo "xpdf läuft nicht";
else echo "$test" | xargs -d" " kill -SIGTERM
fi
When running the script above I get
$ Kill_ps
+ test=
+ echo 'test = < >'
test = < >
++ ps -e
++ grep xpdf
++ sed -n -e 's/^[^0-9]*\([0-9]*\)[^0-9]\{1,\}.*$/\1/p'
+ test='21538
24654
24804
24805'
+ echo 'test = < 21538
24654
24804
24805 >'
test = < 21538
24654
24804
24805 >
+ '[' -z '21538
24654
24804
24805' ']'
+ xargs '-d ' kill -SIGTERM
+ echo '21538
24654
24804
24805'
kill: failed to parse argument: '21538
24654
24804
24805
Something unexpected happens: test contains more PIDs than there are matching processes.
At the prompt:
$ ps -e | grep xpd
21538 pts/3 00:00:00 xpdf.real
24654 pts/2 00:00:00 xpdf.real
When running the script again, the 24* PIDs change.
So here are my questions:
Where do the additional PIDs come from?
What can I do to handle the situation in which no process I want to kill is running (and why does xargs not accept echo "$test" as input)? I want my script not to abort.
Thanks in advance.
You can use pkill(1) or killall(1) instead of parsing the ps output. (Parsing human-readable output in scripts is not recommended; your trace above is an example of why.)
Usage:
pkill xpdf
killall xpdf # Note that on solaris, it kills all the processes that you can kill. https://unix.stackexchange.com/q/252349#comment435182_252356
pgrep xpdf # Lists pids
ps -C xpdf -o pid= # Lists pids
Note that tools like pkill, killall, and pgrep match patterns in much the same way as ps -e | grep does. So if you do pkill sh, it will try to kill sh, bash, ssh, etc., since they all match the pattern.
But to answer your question about keeping xargs from running when there is nothing on its input: you can use the -r option of (GNU) xargs.
-r, --no-run-if-empty
If the standard input does not contain any nonblanks, do not run the command.
Normally, the command is run once even if there is no input.
This option is a GNU extension.
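Putting the two suggestions together (a sketch; -r is a GNU extension):
pgrep xpdf | xargs -r kill -SIGTERM   # -r: run kill only if pgrep found something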
This answer is for the case where killall or pkill (suggested in another answer) are not enough for you: for example, if you really want to print "xpdf is not running" when there is no PID to kill, or if you apply kill -SIGTERM because you want to be sure which signal you send to your PIDs.
You could use a bash loop instead of xargs and sed. It's pretty simple to iterate over column-based output:
count=0
while read -r uid pid ppid trash; do
    kill -SIGTERM "$pid"
    (( count++ ))
done < <(ps -ef | grep "[x]pdf")   # the [x] trick keeps grep from matching its own process
[[ $count -le 0 ]] && echo "xpdf is not running"
There is a quicker way using pgrep (the previous loop was to illustrate how to iterate over column-based command output with bash):
count=0
while read -r; do
    kill -SIGTERM "$REPLY"
    (( count++ ))
done < <(pgrep xpdf)
[[ $count -le 0 ]] && echo "xpdf is not running"
If your version of xargs provides --no-run-if-empty, you can still use it with pgrep (as suggested in this answer), but that option is not available in the BSD or POSIX versions; that is the case on macOS, for example.
awk could also do the trick (with only one command after the pipe):
ps -ef | awk 'BEGIN {c=0} {if($0 ~ "xpdf" && !($0 ~ "awk")){c++; system("kill -SIGTERM "$2)}} END {if(c <= 0){print "xpdf is not running"}}'

Assigning Process IDs to a variable in a shell script

I want to assign all process ids to a variable.
For example, the result of
pgrep abc
29845
29846
I want to assign these two IDs to a variable like this:
a='29845 29846'
The variable a should contain the two process IDs separated by a space.
The whole purpose of this is to kill all the process ids
Thanks
Something like this:
cat file
29845
29846
var=$(awk '{printf "%s ",$1}' file)
echo $var
29845 29846
You can skip grep and use only awk.
I tested the commands by starting sleep 320 & three times.
You can assign the output of a command like this:
procs=$(pgrep sleep | tr '\n' ' ')
When you want to kill the processes, consider
pgrep sleep | xargs kill -9
or
pkill sleep
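In bash 4+ you can also collect the PIDs into a real array, which avoids word-splitting surprises (a sketch, my addition):
mapfile -t pids < <(pgrep sleep)   # one PID per array element
echo "${pids[@]}"                  # e.g. 29845 29846
kill "${pids[@]}"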

How to suspend the main command when piping the output to another delay command

I have two custom scripts that implement their own tasks: one outputs some URLs (represented by the cat command below) and another receives a URL and parses it via network requests (represented by the sleep command below).
Here is the prototype:
Case 1:
cat urls.txt | xargs -I{} sleep 1 && echo "END: {}"
The output is END: {} and the sleep works.
Case 2:
cat urls.txt | xargs -I{} echo "BEGIN: {}" && sleep 1 && echo "END: {}"
The output is
BEGIN: https://www.example.com/1
BEGIN: https://www.example.com/2
BEGIN: https://www.example.com/3
END: {}
but it seems to sleep for only 1 second in total.
Q1: I'm a little confused: why this output?
Q2: Are there any solutions to execute the full pipelined xargs delay command for every cat line output?
You can put the commands into a separate script:
worker.sh
#!/bin/bash
echo "BEGIN: $*" && sleep 1 && echo "END: $*"
set execute permission:
chmod +x worker.sh
and call it with xargs:
cat urls.txt | xargs -I{} ./worker.sh {}
output
BEGIN: https://www.example.com/1
END: https://www.example.com/1
BEGIN: https://www.example.com/2
END: https://www.example.com/2
BEGIN: https://www.example.com/3
END: https://www.example.com/3
Between BEGIN and END the script sleeps for one second.
Thanks to shellter's and UtLox's reminders, I found that xargs is the key.
Here is my finding: the shell/zsh interpreter splits sleep 1 and echo "END: {}" off as a separate chain of commands, so xargs never receives my intended two &&-joined commands as one utility command, and the {} in the END expression is never replaced. This can be verified with xargs -t.
cat urls.txt | xargs -I{} -t echo "BEGIN: {}" && sleep 1 && echo "END: {}"
Inspired by UtLox's answer, I found I could achieve what I expected with sh -c in xargs.
cat urls.txt | xargs -I{} -P 5 sh -c 'echo "BEGIN: {}" && sleep 1 && echo "END: {}"'
As for -P 5, it runs the utility command in up to five subprocesses in parallel, to make better use of the available bandwidth.
Done!
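A slightly safer variant of the same idea passes the URL as a positional argument instead of splicing {} into the shell string, so URLs containing quotes or other shell metacharacters cannot break the command (my addition, not from the answer):
cat urls.txt | xargs -I{} -P 5 sh -c 'echo "BEGIN: $1" && sleep 1 && echo "END: $1"' _ {}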

In my bash script, how do I write a while loop that only exits if the output of "tail" doesn't contain a string?

I'm using Amazon Linux with the bash shell. In my bash script, how do I construct a while loop that will spin as long as the command
tail -10 /usr/java/jboss/standalone/log/server.log
does not contain the string "FrameworkServlet 'myprojectDispatcher': initialization completed"?
You can use:
tail -n 10 -f /usr/java/jboss/standalone/log/server.log |
awk '/FrameworkServlet.*myprojectDispatcher.*initialization completed/{exit} 1'
awk will exit when it encounters the search string; otherwise it keeps copying its input to stdout.
However, keep in mind that tail's output may be buffered when piped; to avoid that behavior, try the GNU stdbuf utility:
stdbuf -i0 -o0 -e0 tail -n 10 -f /usr/java/jboss/standalone/log/server.log |
awk '/FrameworkServlet.*myprojectDispatcher.*initialization completed/{exit} 1'
You can try this:
#!/bin/bash
MATCH="FrameworkServlet 'myprojectDispatcher': initialization completed"
while :
do
if tail /usr/java/jboss/standalone/log/server.log | grep -q "$MATCH"; then
exit 0
else
sleep 1
fi
done
while ! grep -q "FrameworkServlet 'myprojectDispatcher': initialization completed" /usr/java/jboss/standalone/log/server.log; do
# wait a second
sleep 1
done
# do the stuff
echo "we got it!"

restricting xargs from reading stdin to buffer

It looks like xargs reads input lines from stdin even when it is already running the maximum number of processes it is allowed to run.
Here is an example:
#!/bin/bash
function xTrigger()
{
    for ii in `seq 1 100`; do echo $ii; sleep 2; done
}
function xRunner()
{
    sleep 10;
    echo $1;
}
export -f xTrigger
export -f xRunner
bash -c "xTrigger" | xargs -n 1 -P 1 -i bash -c "xRunner {}"
Twenty seconds after starting the above pipeline, I ran killall xTrigger. But xargs had already buffered everything xTrigger printed, so xRunner continued to print 1..10. What I want is for it to print only 1, 2.
Is there any way to change this behaviour and get xargs to read from stdin only when it wants to start a new command, so that xTrigger would block on its echo until xargs reads from it? My stdin has very dynamic content, so this would be very useful.
I am trying to stick to xargs because it is stable and elegant; I want to write extra code only if there is no easy way of doing it with xargs.
Thanks for all your help!
Don't you have to kill the Bash PID of xTrigger()?
bash -c "echo $$; xTrigger" | xargs -n 1 -P 1 bash -c 'xRunner "$#"' _
kill -HUP <PID>
On my system, xargs will halt if one of the commands it runs exits with status 255 or is killed by a signal. Therefore, you should be sending the signal to the bash PID that is running xRunner.
I got xTrigger to generate the next trigger only when there are no 'bash -c xRunner' jobs running. It works great now:
#!/bin/bash
function xTrigger()
{
    for ii in `seq 1 100`; do
        echo $ii;
        # wait until no xRunner job is left before emitting the next trigger
        # (pgrep -fl lists PID plus command line; filter out the xargs process itself)
        while [[ $(pgrep -fl xRunner | grep -v xargs | wc -l) -ge 1 ]]; do
            sleep 2;
        done
    done
}
function xRunner()
{
    sleep 10;
    echo $1;
}
export -f xTrigger
export -f xRunner
bash -c "xTrigger" | xargs -n 1 -P 1 -i bash -c "xRunner {}"
