Rsync Multiple Sleeping Processes? - bash

My rsync script for creating daily incremental backups is working pretty well now, but I have noticed after a week or so that I am left with hundreds of sleeping rsync processes. Does this have to do with my script? Is there a command I can add to the script to stop this?

Here is the Bash script:
#!/bin/bash
LinkDest=/home/backup/files/backupdaily/monday
WeekDay=$(date +%A)
case $WeekDay in
    Monday)
        rsync -avz --delete --exclude backup --exclude virtual_machines /home /home/backup/files/backupdaily/monday
        ;;
    Tuesday|Wednesday|Thursday|Friday|Saturday)
        rsync -avz --exclude backup --exclude virtual_machines --link-dest=$LinkDest /home /home/backup/files/backupdaily/$WeekDay
        ;;
    Sunday)
        exit 0
        ;;
esac
Here is my entry in crontab -e, logged in as root:
#Backup Schedule
# Daily
* 0 * * * /usr/local/src/backup/backup_daily_v3.sh
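Note that the schedule * 0 * * * matches every minute from 00:00 to 00:59, so cron starts this script 60 times each night; 0 0 * * * would run it once. If overlapping runs are the source of the leftover processes, a minimal guard sketch (using the stock flock utility; the lock file path is illustrative) could sit at the top of the script:
#!/bin/bash
# take an exclusive, non-blocking lock on fd 200; if a previous run still
# holds it, exit instead of starting another rsync
exec 200>/var/lock/backup_daily.lock
flock -n 200 || exit 0
# ... case statement with the rsync commands as above ...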

This is the Process View
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ COMMAND
1096 root 20 0 116M 1720 716 S 0.0 0.0 14:26.33 |- SCREEN
5169 root 20 0 105M 1428 1084 S 0.0 0.0 0:00.07 | |- /bin/bash
4012 root 20 0 105M 1188 968 S 0.0 0.0 0:00.00 | |- /bin/bash
1097 root 20 0 105M 980 676 S 0.0 0.0 0:00.34 | |- /bin/bash

Related

Gearman worker in shell hangs as a zombie

I have a Gearman worker in a shell script started with perp in the following way:
runuid -s gds \
/usr/bin/gearman -h 127.0.0.1 -t 1000 -w -f gds-rel \
-- xargs /home/gds/gds-rel-worker.sh < /dev/null 2>/dev/null
The worker only does some input validation and calls another shell script run.sh that invokes bash, curl, Terragrunt, Terraform, Ansible and gcloud to provision and update resources in GCP like this:
./run.sh --release 1.2.3 2>&1 >> /var/log/gds-release
The script is intended to run unattended. The problem I have is that after the job finishes successfully (that is, both shell scripts run.sh and gds-rel-worker.sh), the Gearman job remains executing because the child process becomes a zombie (see the last line below).
root 144748 1 0 Apr29 ? 00:00:00 perpboot -d /etc/perp
root 144749 144748 0 Apr29 ? 00:00:00 \_ tinylog -k 8 -s 100000 -t -z /var/log/perp/perpd-root
root 144750 144748 0 Apr29 ? 00:00:00 \_ perpd /etc/perp
root 2492482 144750 0 May14 ? 00:00:00 \_ tinylog (gearmand) -k 10 -s 100000000 -t -z /var/log/perp/gearmand
gearmand 2492483 144750 0 May14 ? 00:00:08 \_ /usr/sbin/gearmand -L 127.0.0.1 -p 4730 --verbose INFO --log-file stderr --keepalive --keepalive-idle 120 --keepalive-interval 120 --keepalive-count 3 --round-robin --threads 36 --worker-wakeup 3 --job-retries 1
root 2531800 144750 0 May14 ? 00:00:00 \_ tinylog (gds-rel-worker) -k 10 -s 100000000 -t -z /var/log/perp/gds-rel-worker
gds 2531801 144750 0 May14 ? 00:00:00 \_ /usr/bin/gearman -h 127.0.0.1 -t 1000 -w -f gds-rel -- xargs /home/gds/gds-rel-worker.sh
gds 2531880 2531801 0 May14 ? 00:00:00 \_ [xargs] <defunct>
So far I have traced the problem to run.sh, because if I replace its call with something simpler (e.g. echo "Hello"; sleep 5) the worker does not hang. Unfortunately, I have no clue what is causing the problem. The script run.sh is rather long and complex, but has been working without a problem so far. Tracing the worker process I see this:
getpid() = 2531801
write(2, "gearman: ", 9) = 9
write(2, "gearman_worker_work", 19) = 19
write(2, " : ", 3) = 3
write(2, "gearman_wait(GEARMAN_TIMEOUT) ti"..., 151) = 151
write(2, "\n", 1) = 1
sendto(5, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(5, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
sendto(5, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}], 2, 1000) = 1 ([{fd=5, revents=POLLIN}])
sendto(5, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(5, "\0RES\0\0\0\6\0\0\0\0\0RES\0\0\0(\0\0\0QH:terra-"..., 8192, MSG_NOSIGNAL, NULL, NULL) = 105
pipe([6, 7]) = 0
pipe([8, 9]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fea38480a50) = 2531880
close(6) = 0
close(9) = 0
write(7, "1.2.3\n", 18) = 6
close(7) = 0
read(8, "which: no terraform-0.14 in (/us"..., 1024) = 80
read(8, "Identity added: /home/gds/.ssh/i"..., 1024) = 54
read(8, 0x7fff6251f5b0, 1024) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2531880, si_uid=1006, si_status=0, si_utime=0, si_stime=0} ---
read(8,
So the worker continues reading standard output even though the child has finished successfully and presumably closed it. Any ideas how to catch what causes this problem?
I was able to solve it. The script run.sh was starting ssh-agent, which kept running after the script exited and held the write end of the output pipe open; since Gearman redirects all outputs, the worker continued reading the open file descriptor even after the script had completed successfully.
I found it by examining the open file descriptors of the Gearman worker process after it hung:
# ls -l /proc/2531801/fd/*
lr-x------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/0 -> /dev/null
l-wx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/1 -> 'pipe:[9356665]'
l-wx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/2 -> 'pipe:[9356665]'
lr-x------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/3 -> 'pipe:[9357481]'
l-wx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/4 -> 'pipe:[9357481]'
lrwx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/5 -> 'socket:[9357482]'
lr-x------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/8 -> 'pipe:[9369888]'
Then I identified the processes using the inode of the pipe behind file descriptor 8, which the Gearman worker continued reading:
# lsof | grep 9369888
gearman 2531801 gds 8r FIFO 0,13 0t0 9369888 pipe
ssh-agent 2531899 gds 9w FIFO 0,13 0t0 9369888 pipe
And finally I listed the files opened by ssh-agent and found what was behind file descriptor 3:
# ls -l /proc/2531899/fd/*
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/0 -> /dev/null
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/1 -> /dev/null
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/2 -> /dev/null
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/3 -> 'socket:[9346577]'
# lsof | grep 9346577
ssh-agent 2531899 gds 3u unix 0xffff89016fd34000 0t0 9346577 /tmp/ssh-0b14coFWhy40/agent.2531898 type=STREAM
As a solution I added a kill of the ssh-agent before exiting the run.sh script, and now there are no jobs hanging due to zombie processes.
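A minimal sketch of that fix, assuming run.sh starts its own agent (the key path is illustrative):
#!/bin/bash
# start an agent for this run; this exports SSH_AUTH_SOCK and SSH_AGENT_PID
eval "$(ssh-agent -s)"
# kill the agent on any exit, so it cannot keep the inherited pipe fd open
trap 'kill "$SSH_AGENT_PID"' EXIT
ssh-add /home/gds/.ssh/id_rsa   # illustrative key path
# ... the terraform/ansible/gcloud provisioning steps ...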

verifying where 'kworker/n:n' (in ps -aux) is invoked from

In the output of 'ps -aux', I couldn't find a way to verify where the 'kworker/...' threads are created and which modules/functions are related to them.
Please let me know how I can find out where a kworker comes from, given its PID or otherwise.
I've tried checking the files in /proc, but nothing is shown about this.
$ ps -aux | grep kworker
root 15 0.0 0.0 0 0 ? S Aug12 0:00 [kworker/1:0]
root 16 0.0 0.0 0 0 ? S< Aug12 0:00 [kworker/1:0H]
root 85 0.0 0.0 0 0 ? S< Aug12 0:09 [kworker/0:1H]
root 3562 0.0 0.0 0 0 ? S< Aug12 0:00 [kworker/0:2H]
root 5578 0.0 0.0 0 0 ? S 11:13 0:01 [kworker/0:0]
root 5579 0.0 0.0 0 0 ? S 11:13 0:00 [kworker/u4:1]
root 8789 0.1 0.0 0 0 ? S 12:19 0:10 [kworker/0:2]
root 30236 0.0 0.0 0 0 ? S 08:39 0:01 [kworker/u4:0]
A good solution for these kinds of problems that I'm familiar with is the perf tool (it's not always enabled by default and you may need to install perf on your device).
Step 1: Set perf to record workqueue events:
perf record -e 'workqueue:*' -ag -T
Step 2: Run it as long as you think you need to catch the event (10 seconds should be ok if this event is frequent enough, but you can let it run longer, depending on the available free space you have left on your device) and then stop it with Ctrl + C.
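If you prefer a fixed trace window instead of stopping it by hand, perf record can also wrap a command and stop when that command exits, so a sleep works as a timer:
perf record -e 'workqueue:*' -ag -T -- sleep 10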
Step 3: Print the captured events (on Linux versions < 4.1 I think it should be -f and not -F):
perf script -F comm,pid,tid,time,event,trace
This will display something like this: 
task-name pid/tid timestamp event
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
turtle   9201/9201 1473.339166:  workqueue:workqueue_queue_work: work struct=0xef20d4c4 function=pm_runtime_work workqueue=0xef1cb600 req_cpu=8 cpu=1
turtle   9201/9201 1473.339176: workqueue:workqueue_activate_work: work struct 0xef20d4c4
kworker/0:3  24223/24223 1473.339221: workqueue:workqueue_execute_start: work struct 0xef20d4c4: function pm_runtime_work
kworker/0:3  24223/24223 1473.339248:  workqueue:workqueue_execute_end: work struct 0xef20d4c4
Step 4: Analyzing the table above:
In the first row, a task named turtle (pid 9201) is pushing the work pm_runtime_work to the workqueue.
In the third row, we can see that the kworker/0:3 (pid 24223) is executing that work.
Summary: Now back to your questions, we see that kworker/0:3 has been requested by turtle task to run the pm_runtime_work function.
Now, if you want to dig further, you'll have to step into the code and see what the pm_runtime_work function does. Good luck!
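For example, to see only the lines for a particular kworker (kworker/0:3 in the trace above), filter the script output:
perf script -F comm,pid,tid,time,event,trace | grep 'kworker/0:3'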

If elif else not working in bash ChromeOS

I am trying to make a bash script which basically takes a bunch of .debs, unpacks them, and places binaries and libs in /usr/local/opt/{lib}bin.
The script checks whether / is mounted as ro or rw, and if mounted as ro remounts it as rw.
On Chromebooks, however, in order to mount / as rw you need to remove_rootfs_verification for the partition in question. When rootfs_verification is enabled for /, the script fails to echo the message shown below and should exit 1; instead it carries on.
Here is the part of the script I am referring to
### ChromeOS specific!!!
# The following assumes rootfs_verification for / has already been removed
if grep $rootfs /proc/mounts | grep ro; then
    mount -o remount,rw / &> mount.out
elif grep -iw 'read-write' mount.out; then
    echo '\nrootfs_verification for the root partition must be removed in order to remount,rw /
To remove rootfs_verification run the following command and then reboot the system:
"sudo /usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification --partitions 4"'
else
    exit 1
fi
The entire WIP script can be found here https://pastebin.com/ekEPSvYy
This is what happens when I execute it:
localhost /usr/local # ./kvm_install.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 124 0 124 0 0 322 0 --:--:-- --:--:-- --:--:-- 345
100 135 100 135 0 0 170 0 --:--:-- --:--:-- --:--:-- 170
100 60384 100 60384 0 0 57950 0 0:00:01 0:00:01 --:--:-- 344k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 143 0 143 0 0 407 0 --:--:-- --:--:-- --:--:-- 412
100 154 100 154 0 0 202 0 --:--:-- --:--:-- --:--:-- 202
100 1298k 100 1298k 0 0 929k 0 0:00:01 0:00:01 --:--:-- 3020k
/dev/root / ext2 ro,seclabel,relatime,block_validity,barrier,user_xattr,acl 0 0
./kvm_install.sh: line 31: /etc/env.d/30kvm: Read-only file system
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 66802 100 66802 0 0 69657 0 --:--:-- --:--:-- --:--:-- 74555
./kvm_install.sh: line 39: ar: command not found
tar (child): control.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
md5sum: md5sums: No such file or directory
Basically what happens here is that ar cannot be found because the script was unable to add the PATH variables to /etc/env.d/30kvm, since the root partition cannot be remounted rw while rootfs_verification is enabled on /.
I tried wrapping the elif grep command in [[ ]] as some suggested here, but that didn't work and added further syntax issues.
I am in the process of learning the basics of bash scripting. I apologize if the script is written poorly.
Thanks
I ultimately ended up doing this.
if grep $rootfs /proc/mounts | grep 'ro,'; then
    mount -o remount,rw / &> mount.out
    if grep 'read-write' mount.out; then
        echo 'something to echo' && exit 1
    fi
fi
It is not pretty, but it works until I find or learn a better way to implement the check.
To make the output of a command a variable, do:
variable="$(command)"
If you want to use grep, use the syntax:
command | grep text
If you want to do if statements, use the syntax:
if [ some test ]; then
    commands
elif [ some test ]; then
    commands
else
    commands
fi
For the test expressions in the brackets, check this chart.
grep -iw is not valid on a Chromebook.
rootfs has a different path depending on a lot of things; if you want to save it as $rootfs, use this command:
rootfs="$(rootdev -s)"
Also, you made some mistakes with "|", "||" and "&&":
command1 | command2    # pipe command1's output into command2
command1 || command2   # run command2 only if command1 fails
command1 && command2   # run command2 only if command1 succeeds
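Putting those pieces together, a minimal, untested sketch of the check (it assumes rootdev -s resolves the root device as described above, and that the remount fails with a non-zero exit status while rootfs_verification is enabled):
rootfs="$(rootdev -s)"
if grep "$rootfs" /proc/mounts | grep -q 'ro,'; then
    if ! mount -o remount,rw /; then
        echo 'rootfs_verification for the root partition must be removed; run the following and then reboot:'
        echo 'sudo /usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification --partitions 4'
        exit 1
    fi
fi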

How to get bash to print the output without the fields with zero size when running smem command?

Here is the 'smem' command I run on a Red Hat/CentOS Linux system. I expect the output to be printed without the rows whose swap field is zero, while keeping the heading columns.
smem -kt -c "pid user command swap"
PID User Command Swap
7894 root /sbin/agetty --noclear tty1 0
9666 root ./nimbus /opt/nimsoft 0
7850 root /sbin/auditd 236.0K
7885 root /usr/sbin/irqbalance --fore 0
11205 root nimbus(hdb) 0
10701 root nimbus(spooler) 0
8446 trapsanalyzer1 /opt/traps/analyzerd/analyz 0
50316 apache /usr/sbin/httpd -DFOREGROUN 0
50310 apache /usr/sbin/httpd -DFOREGROUN 0
3971 root /usr/sbin/lvmetad -f 36.0K
63988 root su - 0
7905 ntp /usr/sbin/ntpd -u ntp:ntp - 4.0K
7876 dbus /usr/bin/dbus-daemon --syst 44.0K
9672 root nimbus(controller) 0
7888 root /usr/lib/systemd/systemd-lo 0
63990 root -bash 0
59978 postfix pickup -l -t unix -u 0
3977 root /usr/lib/systemd/systemd-ud 736.0K
9016 postfix qmgr -l -t unix -u 0
50303 root /usr/sbin/httpd -DFOREGROUN 0
3941 root /usr/lib/systemd/systemd-jo 52.0K
8199 root //usr/lib/vmware-caf/pme/bi 0
8598 daemon /opt/quest/sbin/.vasd -p /v 0
8131 root /usr/sbin/vmtoolsd 0
7881 root /usr/sbin/NetworkManager -- 8.0K
8364 root /opt/puppetlabs/puppet/bin/ 0
8616 daemon /opt/quest/sbin/.vasd -p /v 0
23290 root /usr/sbin/rsyslogd -n 3.8M
64091 root python /bin/smem -kt -c pid 0
7887 polkitd /usr/lib/polkit-1/polkitd - 0
8363 root /usr/bin/python2 -Es /usr/s 0
53606 root /usr/share/metricbeat/bin/m 0
24631 nagios /usr/local/ncpa/ncpa_passiv 0
24582 nagios /usr/local/ncpa/ncpa_listen 0
7886 root /opt/traps/bin/authorized 76.0K
7872 root /opt/traps/bin/pmd 12.0K
8374 root /opt/puppetlabs/puppet/bin/ 0
7883 root /opt/traps/bin/trapsd 64.0K
----------------------------------------------------
54 10 5.1M
Like this?:
$ awk '$NF!=0' file
PID User Command Swap
7850 root /sbin/auditd 236.0K
...
7883 root /opt/traps/bin/trapsd 64.0K
----------------------------------------------------
54 10 5.1M
But instead of using the form awk ... file, you'd probably want to pipe smem straight into awk:
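smem -kt -c "pid user command swap" | awk '$NF!=0'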
Could you please try the following; as an extra precaution it strips any trailing whitespace from each line (in case it is there), always prints the header line (FNR==1), skips rows whose last field is 0, and prints everything else.
smem -kt -c "pid user command swap" | awk 'FNR==1{print;next} {sub(/[[:space:]]+$/,"")} $NF==0{next} 1'

Parse command output with awk and count results

I have output from the 'multipath -ll' command.
From RHEL:
mpath114 (3600507680283095ea8000000000004fa) dm-28 IBM,2145
[size=200G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=50][active]
\_ 19:0:0:40 sdea 128:32 [active][ready]
\_ 20:0:1:40 sdeb 128:48 [active][ready]
\_ 20:0:1:41 sdec 128:16 [failed][faulty]
\_ round-robin 0 [prio=10][enabled]
\_ 20:0:0:40 sdba 67:64 [active][ready]
\_ 19:0:1:40 sdgg 131:192 [active][ready]
mpath131 (3600507680283095ea800000000000504) dm-39 IBM,2145
[size=10G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=50][active]
\_ 20:0:1:1 sdbl 67:240 [active][ready]
\_ 19:0:0:1 sdc 8:32 [active][ready]
\_ round-robin 0 [prio=10][enabled]
\_ 19:0:1:1 sdet 129:80 [active][ready]
\_ 20:0:0:1 sdk 8:160 [active][ready]
[..]
Or from a SLES server:
mpathmzp (36005076801c7061ef800000000000089) dm-0 IBM,2145
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=enabled
| `- 67:0:2:0 sde 8:64 active ready running
| `- 68:0:0:1 sdl 8:76 failed faulty running
`-+- policy='round-robin 0' prio=10 status=enabled
|- 67:0:3:0 sdc 8:32 active ready running
`- 68:0:0:0 sdd 8:48 active ready running
[..]
I would like to parse it (preferably with awk or bash) to display a summary of the configuration.
It should print the pseudo multipath device, the total number of paths, the number of active paths, and the number of failed paths (if any).
Sample:
dm-39, 10G, Total: 4 paths, active: 4, failed: 0
dm-28, 200G, Total: 5 paths, active: 4, failed: 1
Same for the SLES:
dm-0, 10G, Total: 4 paths, active: 3, failed: 1
If possible, I'd also like to sort the output so that the devices with no failed paths and the most active paths are on top, ending with the devices that have failed paths.
Thanks for helping!
This awk should do:
multipath -ll | awk 'NR>1 {r=f=0;for (i=1;i<=NF;i++) if ($i~/ready/) r++; else if ($i~/faulty/) f++;split($5,a,"=|]");print $3,a[2]"\tTotal: "r+f" paths, active: "r,"failed: "f}' RS="mpath" OFS=", "
dm-28, 200G Total: 5 paths, active: 4, failed: 1
dm-39, 10G Total: 4 paths, active: 4, failed: 0
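For the requested ordering (devices with no failed paths and the most active paths first), the summary lines can be piped through sort; this sketch assumes single-digit path counts, so a plain lexical sort on the failed and active fields is enough:
multipath -ll | awk 'NR>1 {r=f=0;for (i=1;i<=NF;i++) if ($i~/ready/) r++; else if ($i~/faulty/) f++;split($5,a,"=|]");print $3,a[2]"\tTotal: "r+f" paths, active: "r,"failed: "f}' RS="mpath" OFS=", " | sort -t, -k4,4 -k3,3r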
