I have a Gearman worker in a shell script started with perp in the following way:
runuid -s gds \
/usr/bin/gearman -h 127.0.0.1 -t 1000 -w -f gds-rel \
-- xargs /home/gds/gds-rel-worker.sh < /dev/null 2>/dev/null
The worker only does some input validation and calls another shell script run.sh that invokes bash, curl, Terragrunt, Terraform, Ansible and gcloud to provision and update resources in GCP like this:
./run.sh --release 1.2.3 2>&1 >> /var/log/gds-release
The script is intended to run unattended. The problem I have is that after the job finishes successfully (that's both shell scripts run.sh and gds-rel-worker.sh) the Gearman job remains executing, because the child process becomes zombie (see last line below).
root 144748 1 0 Apr29 ? 00:00:00 perpboot -d /etc/perp
root 144749 144748 0 Apr29 ? 00:00:00 \_ tinylog -k 8 -s 100000 -t -z /var/log/perp/perpd-root
root 144750 144748 0 Apr29 ? 00:00:00 \_ perpd /etc/perp
root 2492482 144750 0 May14 ? 00:00:00 \_ tinylog (gearmand) -k 10 -s 100000000 -t -z /var/log/perp/gearmand
gearmand 2492483 144750 0 May14 ? 00:00:08 \_ /usr/sbin/gearmand -L 127.0.0.1 -p 4730 --verbose INFO --log-file stderr --keepalive --keepalive-idle 120 --keepalive-interval 120 --keepalive-count 3 --round-robin --threads 36 --worker-wakeup 3 --job-retries 1
root 2531800 144750 0 May14 ? 00:00:00 \_ tinylog (gds-rel-worker) -k 10 -s 100000000 -t -z /var/log/perp/gds-rel-worker
gds 2531801 144750 0 May14 ? 00:00:00 \_ /usr/bin/gearman -h 127.0.0.1 -t 1000 -w -f gds-rel -- xargs /home/gds/gds-rel-worker.sh
gds 2531880 2531801 0 May14 ? 00:00:00 \_ [xargs] <defunct>
So far I have traced the problem to run.sh, because if I replace its call with something simpler (e.g. echo "Hello"; sleep 5) the worker does not hang. Unfortunately, I have no clue what is causing the problem. The script run.sh is rather long and complex, but has been working without a problem so far. Tracing the worker process I see this:
getpid() = 2531801
write(2, "gearman: ", 9) = 9
write(2, "gearman_worker_work", 19) = 19
write(2, " : ", 3) = 3
write(2, "gearman_wait(GEARMAN_TIMEOUT) ti"..., 151) = 151
write(2, "\n", 1) = 1
sendto(5, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(5, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
sendto(5, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}], 2, 1000) = 1 ([{fd=5, revents=POLLIN}])
sendto(5, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(5, "\0RES\0\0\0\6\0\0\0\0\0RES\0\0\0(\0\0\0QH:terra-"..., 8192, MSG_NOSIGNAL, NULL, NULL) = 105
pipe([6, 7]) = 0
pipe([8, 9]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fea38480a50) = 2531880
close(6) = 0
close(9) = 0
write(7, "1.2.3\n", 18) = 6
close(7) = 0
read(8, "which: no terraform-0.14 in (/us"..., 1024) = 80
read(8, "Identity added: /home/gds/.ssh/i"..., 1024) = 54
read(8, 0x7fff6251f5b0, 1024) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2531880, si_uid=1006, si_status=0, si_utime=0, si_stime=0} ---
read(8,
So the worker continues reading standard output even though the child has finished successfully and presumably closed it. Any ideas how to catch what causes this problem?
I was able to solve it. The script run.sh was starting ssh-agent, which opens a socket and since Gearman redirects all outputs the worker continued reading the open file descriptor even after the script successfully completed.
I found it by examining the open file descriptors for the Gearman worker process after it hang:
# ls -l /proc/2531801/fd/*
lr-x------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/0 -> /dev/null
l-wx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/1 -> 'pipe:[9356665]'
l-wx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/2 -> 'pipe:[9356665]'
lr-x------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/3 -> 'pipe:[9357481]'
l-wx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/4 -> 'pipe:[9357481]'
lrwx------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/5 -> 'socket:[9357482]'
lr-x------. 1 gds devops 64 May 17 11:26 /proc/2531801/fd/8 -> 'pipe:[9369888]'
Then identified the processes using file node for the pipe in file descriptor 8 that German worker continued reading:
# lsof | grep 9369888
gearman 2531801 gds 8r FIFO 0,13 0t0 9369888 pipe
ssh-agent 2531899 gds 9w FIFO 0,13 0t0 9369888 pipe
And finally listed files opened by ssh-agent and found what stands behind file descriptor 3:
# ls -l /proc/2531899/fd/*
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/0 -> /dev/null
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/1 -> /dev/null
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/2 -> /dev/null
lrwx------. 1 root root 64 May 17 11:14 /proc/2531899/fd/3 -> 'socket:[9346577]'
# lsof | grep 9346577
ssh-agent 2531899 gds 3u unix 0xffff89016fd34000 0t0 9346577 /tmp/ssh-0b14coFWhy40/agent.2531898 type=STREAM
As a solution I added kill of the ssh-agent before exit from run.sh script and now there are no jobs hanging due to zombie process.
when starting the tcl script a directory is created via a bash command. at the end of my script i want to read the directory name of the latest dirs. but my script does not find the newest directory but only the 2nd newest
bind pub "-|-" !aa pub:aaa
proc pub:aaa {nick host handle channel arg} {
set home "/home/user"
set bb [exec bash -c "start.sh"]
after 3000
set latest [exec bash -c "ls -td $home/jpg/*/ | head -n1"]
putnow "PRIVMSG $channel :$latest"
}
before starting it has the following folders in the directory:
drwxr-xr-x 2 user user 4096 Jun 24 18:30 aaa
drwxr-xr-x 2 user user 4096 Jun 24 18:14 bbb
after starting it has the following folders in the directory
drwxr-xr-x 2 user user 4096 Jun 24 18:30 aaa
drwxr-xr-x 2 user user 4096 Jun 24 18:14 bbb
drwxr-xr-x 2 user user 4096 Jun 24 18:35 ccc
output is :
<#testbot> aaa
it should be so
<#testbot> ccc
he finds the directory created during which the tcl script is not running
how can I display the newest, newly created directory?
regards
Instead of trying to exec out to a shell to find the most recently modified directory, I'd do it in pure tcl:
proc latest_directory {path {time mtime}} {
set dirs {}
foreach dir [glob -nocomplain -type d $path/*] {
file stat $dir s
lappend dirs $s($time) $dir
}
if {[llength $dirs] == 0} {
error "No directories found in $path"
} else {
return [lindex [lsort -integer -decreasing -stride 2 $dirs] 1]
}
}
# Then in pub:aaa
set latest [latest_directory $home/jpg]
As for why you're not getting ccc... hard to say for sure without seeing your start.sh script, but if it ends up running stuff in the background that continues after it exits, maybe it takes more than 3 seconds to create that directory?
Here is the 'smem' command I run on the Redhat/CentOS Linux system. I expect the output be printed without the fields with zero size however I would expect the heading columns.
smem -kt -c "pid user command swap"
PID User Command Swap
7894 root /sbin/agetty --noclear tty1 0
9666 root ./nimbus /opt/nimsoft 0
7850 root /sbin/auditd 236.0K
7885 root /usr/sbin/irqbalance --fore 0
11205 root nimbus(hdb) 0
10701 root nimbus(spooler) 0
8446 trapsanalyzer1 /opt/traps/analyzerd/analyz 0
50316 apache /usr/sbin/httpd -DFOREGROUN 0
50310 apache /usr/sbin/httpd -DFOREGROUN 0
3971 root /usr/sbin/lvmetad -f 36.0K
63988 root su - 0
7905 ntp /usr/sbin/ntpd -u ntp:ntp - 4.0K
7876 dbus /usr/bin/dbus-daemon --syst 44.0K
9672 root nimbus(controller) 0
7888 root /usr/lib/systemd/systemd-lo 0
63990 root -bash 0
59978 postfix pickup -l -t unix -u 0
3977 root /usr/lib/systemd/systemd-ud 736.0K
9016 postfix qmgr -l -t unix -u 0
50303 root /usr/sbin/httpd -DFOREGROUN 0
3941 root /usr/lib/systemd/systemd-jo 52.0K
8199 root //usr/lib/vmware-caf/pme/bi 0
8598 daemon /opt/quest/sbin/.vasd -p /v 0
8131 root /usr/sbin/vmtoolsd 0
7881 root /usr/sbin/NetworkManager -- 8.0K
8364 root /opt/puppetlabs/puppet/bin/ 0
8616 daemon /opt/quest/sbin/.vasd -p /v 0
23290 root /usr/sbin/rsyslogd -n 3.8M
64091 root python /bin/smem -kt -c pid 0
7887 polkitd /usr/lib/polkit-1/polkitd - 0
8363 root /usr/bin/python2 -Es /usr/s 0
53606 root /usr/share/metricbeat/bin/m 0
24631 nagios /usr/local/ncpa/ncpa_passiv 0
24582 nagios /usr/local/ncpa/ncpa_listen 0
7886 root /opt/traps/bin/authorized 76.0K
7872 root /opt/traps/bin/pmd 12.0K
8374 root /opt/puppetlabs/puppet/bin/ 0
7883 root /opt/traps/bin/trapsd 64.0K
----------------------------------------------------
54 10 5.1M
Like this?:
$ awk '$NF!=0' file
PID User Command Swap
7850 root /sbin/auditd 236.0K
...
7883 root /opt/traps/bin/trapsd 64.0K
----------------------------------------------------
54 10 5.1M
But instead of using the form awk ... file you'd probably like to smem ... | awk '$NF!=0'.
Could you please try following, for extra precautions removing the space from last fields(in case it is there).
smem -kt -c "pid user command swap" | awk 'FNR==1{print;next} {sub(/[[:space:]]+$/,"")} $NF==0{next} 1'
It's a simple question, I'm writing a bash script called from cron grouping files in tar file and classifing into a dir structure.
These dir needs a special owner and permissions, and I call mkdir command thru su:
#!/bin/bash
... # shortened code
$PERMS=750
$DIR=/home/luser/0/01/012/0123
$OWNER=luser
... # shortened code
su -c "mkdir -m $PERMS -p $DIR" $OWNER
Output for ll -R /home/luser/0
/home/luser/0:
total 4
drwxr-xr-x 3 luser luser 4096 Jan 7 18:13 01
/home/luser/0/01:
total 4
drwxr-xr-x 3 luser luser 4096 Jan 7 18:13 012
/home/luser/0/01/012:
total 4
drwxr-x--- 2 luser luser 4096 Jan 7 18:13 0123
/home/luser/0/01/012/0123:
total 0
Only the deepest dir has permissions (750) setting rightly.
I don't know how deep it's the last directory and set permissions for all home's file it's too hard (too much files).
PS: I'm googled about that, but I find nothing.
You can restrict the permissions on the parent directories via umask. Here is an example:
PERMS=750
UMASK=$(echo "$PERMS" | tr "01234567" "76543210")
DIR=/home/luser/0/01/012/0123
OWNER=luser
su -c "umask $UMASK; mkdir -m $PERMS -p $DIR" $OWNER
In action:
> PERMS=750
> UMASK=$(echo "$PERMS" | tr "01234567" "76543210")
> (umask $UMASK; mkdir -m $PERMS -p 1/2/3/4)
> ll -R .
.:
drwxr-x--- 3 luser luser 4096 Jan 7 1:38 1/
./1:
drwxr-x--- 3 luser luser 4096 Jan 7 1:38 2/
./1/2:
drwxr-x--- 3 luser luser 4096 Jan 7 1:38 3/
./1/2/3:
drwxr-x--- 2 luser luser 4096 Jan 7 1:38 4/
I am trying to get my Capistrano deploy script working, but it is not doing the symlinking as it is configured to do as shown below.
set :linked_files, %w{config/database.yml}
set :linked_dirs, %w{log tmp vendor/bundle public/system}
When it runs the related command, I get the following:
WARN [SKIPPING] No Matching Host for /usr/bin/env [ -f /path/to/shared/config/database.yml ]
If I run this command on the server, either through ssh or through logging onto the server and running the command, I get no response from the command.
user: ~
$ [ -f /path/to/shared/config/database.yml ]
user: ~
$
The file does exist in the specified location and has permissions.
user: ~
$ ll /path/to/shared/config/
total 4.0K
drwxrwxr-x 2 user group 33 Nov 30 10:58 .
drwxrwxr-x 7 user group 89 Nov 30 10:58 ..
-rwxrwxr-x 1 user group 805 Nov 30 10:58 database.yml
user: ~
Shouldn't this return a true or a false, instead of nothing? Is there a configuration I may have changed that suppresses the output? I get no response at all whether the file exists or not.
In your response to the actual question you ask, test (which is what [ is an alias for) does in fact not return output to stdout. It returns an exit code.
user: ~
$ [ -f /path/to/shared/config/database.yml ] # if the file exists
user: ~
$ echo $?
0
user: ~
$ [ -f /path/to/shared/config/database.yml ] # if the file does not exist
user: ~
$ echo $?
1
test -f /path/to/file (or [ -f /path/to/file ]) yields an exit code of 0 if the file exists or 1 if it does not. If you want to check that a file is there and echo the path to it, try:
[ -f /path/to/file ] && echo "/path/to/file"