Orphaned process when Pacemaker kills the main LSB monitor script due to timeout

In our Pacemaker + Corosync cluster:
Last updated: Thu Oct 22 21:16:33 2015
Last change: Thu Oct 22 17:25:13 2015 via cibadmin on aws015
Stack: corosync
Current DC: aws015 (2887647247) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
16 Resources configured
We have the following situation. We wrote a Python LSB script that checks the status of an application, and configured it as a resource:
primitive pm2_app_gardenscapesDynamo_lsb lsb:pm2_app_gardenscapesDynamo \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s" \
op monitor interval="30s" timeout="60s" on-fail="restart" \
meta failure-timeout="10s" migration-threshold="1"
The check is performed by a utility that can hang (the LSB script launches the utility and waits for its reply). So when Pacemaker reaches the timeout, it kills our Python script, but the hung utility stays in memory and never dies.
Is it possible to prevent this situation?

You need to upgrade to pacemaker 1.1.12 or more recent.
This happens because Pacemaker starts resource agents in their own process group. When an operation times out, Pacemaker 1.1.10 kills only the RA itself, leaving any child processes it may have started as orphans.
Version 1.1.12 instead kills the entire process group.
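That difference can be sketched in plain shell, with `sh -c` standing in for the resource agent and a background `sleep` for the hung utility it launched:

```shell
# Sketch of the 1.1.12 fix: run the "RA" in its own session/process group,
# then signal the *negative* PGID so its children die with it.
setsid sh -c 'sleep 300 & wait' &   # RA gets its own process group
ra_pid=$!
sleep 1                             # let it start
pgid=$(ps -o pgid= -p "$ra_pid" | tr -d ' ')

# kill "$ra_pid"                    # 1.1.10 behaviour: the sleep is orphaned
kill -TERM -- "-$pgid"              # 1.1.12 behaviour: whole group is killed
```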
The relevant code is in lib/common/mainloop.c, in the function child_kill_helper.
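Until an upgrade is possible, one workaround (a sketch, not from the original answer) is to give the utility its own deadline inside the LSB script, so the agent reaps it before Pacemaker's 60s op timeout fires. `check_app` is a placeholder for the real utility, and the 30s budget is an arbitrary value below the op timeout:

```shell
# A monitor action for the LSB script: bound the flaky utility with its own
# timeout so it can never outlive the agent. check_app must be an executable
# on PATH (shell functions are not visible to timeout).
monitor() {
    if timeout -s KILL 30 check_app >/dev/null 2>&1; then
        return 0   # LSB status: running
    else
        return 3   # LSB status: not running -> on-fail="restart" takes over
    fi
}
```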

Related

Avoid waiting for user when checking the Apache Tomcat status

As part of a bash script I check the recently installed Apache Tomcat status with
sudo systemctl status tomcat
The output is as follows
● tomcat.service
Loaded: loaded (/etc/systemd/system/tomcat.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2023-01-30 16:25:48 UTC; 3min 9s ago
Process: 175439 ExecStart=/opt/tomcat/bin/startup.sh (code=exited, status=0/SUCCESS)
Main PID: 175447 (java)
Tasks: 30 (limit: 4546)
Memory: 253.0M
CPU: 9.485s
CGroup: /system.slice/tomcat.service
└─175447 /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -Djava.util.logging.config.file=/opt/tomcat/conf/logging.properties -Djava.uti>
Jan 30 16:25:48 vps-06354c04 systemd[1]: Starting tomcat.service...
Jan 30 16:25:48 vps-06354c04 startup.sh[175439]: Tomcat started.
Jan 30 16:25:48 vps-06354c04 systemd[1]: Started tomcat.service.
Jan 30 16:25:48 vps-06354c04 systemd[1]: /etc/systemd/system/tomcat.service:1: Assignment outside of section. Ignoring.
Jan 30 16:25:48 vps-06354c04 systemd[1]: /etc/systemd/system/tomcat.service:2: Assignment outside of section. Ignoring.
This is the info I expect to see, but after printing it, systemctl keeps waiting for the user to type a key, breaking the automation I expect to deliver.
How can I avoid this behaviour?
I'm pretty sure the --no-pager option would keep that from happening. I just confirmed that on my own system on a different service. Otherwise, it goes interactive.
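For reference, a non-interactive invocation might look like this (tomcat is the unit from the question; `--full` is an optional extra to stop long lines being truncated):

```shell
# --no-pager keeps systemctl from handing its output to less.
systemctl status tomcat --no-pager --full || true   # prefix with sudo if needed

# Alternatively, disable the pager for every systemd tool in the script:
export SYSTEMD_PAGER=cat
```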
I don't recall ever seeing systemctl status asking for input, so perhaps it's the sudo used in this command doing that, in which case you could ask your system administrator to enable passwordless sudo on the account that runs this command.
A general solution for automating user input in shell scripts is expect, but for a simple case where you only need to send a single value once, you can often get by with piping it in via echo (e.g., echo 'foo' | sudo systemctl status tomcat). Never do this for sensitive information such as passwords, though, because the value can be visible to other users on the system.

ansible wait until output is received from remote host

I want to automate patching my servers, and using the following playbook:
- name: Patch Upgrade
block:
- name: Patch upgrade process
ansible.netcommon.cli_command:
command: patch install {{ node_patch }} patches_repository
check_all: True
prompt:
- "[yes] ?"
- "[yes] ?"
answer:
- 'yes'
- 'yes'
register: result
until: result.stdout.find("The system is going down for reboot NOW") != -1
During patching the output is similar to this:
ISE/admin#patch install ise-patchbundle-10.1.0.0-Ptach3-19110111.SPA.x86_64.tar.gz FTP_repository
% Warning: Patch installs only on this node. Install with Primary Administration node GUI to install on all nodes in deployment. Continue? (yes/no) [yes] ? yes
Save the current ADE-OS run configuration? (yes/no) [yes] ? yes
Generating configuration...
Saved the ADE-OS run Configuration to startup successfully
Initiating Application Patch installation...
Getting bundle to local machine...
Unbundling Application Package...
Verifying Application Signature...
patch successfully installed
% This application Install or Upgrade requires reboot, rebooting now...
Broadcast message from root@ISE (pts/1) (Fri Feb 14 01:06:21 2020):
Trying to stop processes gracefully. Reload lasts approximately 3 mins
Broadcast message from root@ISE (pts/1) (Fri Feb 14 01:06:21 2020):
Trying to stop processes gracefully. Reload takes approximately 3 mins
Broadcast message from root@ISE (pts/1) (Fri Feb 14 01:06:41 2020):
The system is going down for reboot NOW
Broadcast message from root@ISE (pts/1) (Fri Feb 14 01:06:41 2020):
The system is going down for reboot NOW
Each line is sent one after the other with no specific wait time, and the prompts are handled without issue as the patching starts. I want the upgrade task to keep running until the line "The system is going down for reboot NOW" is received; then it should proceed to another task where it waits for the host to come back up.
Unfortunately it's not working as I am getting this instead:
fatal: [serv-1]: FAILED! =>
msg: 'The conditional check ''result.stdout.find("The system is going down for reboot NOW") != -1'' failed. The error was: error while evaluating conditional (result.stdout.find("The system is going down for reboot NOW") != -1): ''dict object'' has no attribute ''stdout'''
How can I fix this?
The broadcast notification "The system is going down for reboot NOW" is triggered and owned by journald (syslog on older OSes), not by the patch command, so it is never reported in the command's result. The error happens when the host restarts: at that point result has stderr instead of stdout.
One option would be to monitor the journal (or syslog) for the restart, but there are some caveats:
Those broadcast messages can be disabled, or routed to a different output than the log or the console.
Not all patches trigger a reboot; in that case the "reboot" message will never appear, and the wait would hang.
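A hedged sketch of that monitoring approach: poll a log file for the broadcast line, with a timeout so a reboot-less patch doesn't hang the task forever. The function name, the /var/log/messages path, and the 600s budget are all assumptions:

```shell
# wait_for_reboot_msg LOGFILE [TIMEOUT_SECS]
# Polls the log once a second until the broadcast line appears.
# Returns 1 if it never does (e.g. a patch that needs no reboot).
wait_for_reboot_msg() {
    deadline=$(( $(date +%s) + ${2:-600} ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        grep -q 'The system is going down for reboot NOW' "$1" 2>/dev/null \
            && return 0
        sleep 1
    done
    return 1
}
# e.g.: wait_for_reboot_msg /var/log/messages 600 && echo "node is rebooting"
```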

Can't start Cloudera Manager, site not reachable

I have a small cluster with three nodes on my home server for learning purpose.
It was working fine after it was initially set up.
I hadn't used it for a month, and today when I tried to use it I found the Cloudera Manager GUI cannot be accessed. The network between the three nodes is fine; they can ping each other.
On the master node where CM is installed, I tried service cloudera-scm-server start and it shows me [OK] in green; when I check the status it shows the following info:
[root@pocnnr1n1 ~]# service cloudera-scm-server status -l
● cloudera-scm-server.service - LSB: Cloudera SCM Server
Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-server; bad; vendor preset: disabled)
Active: active (exited) since Fri 2017-09-15 20:58:24 EDT; 18min ago
Docs: man:systemd-sysv-generator(8)
Process: 107428 ExecStop=/etc/rc.d/init.d/cloudera-scm-server stop (code=exited, status=1/FAILURE)
Process: 107467 ExecStart=/etc/rc.d/init.d/cloudera-scm-server start (code=exited, status=0/SUCCESS)
Sep 15 20:58:19 pocnnr1n1.raymond.com systemd[1]: Starting LSB: Cloudera SCM Server...
Sep 15 20:58:19 pocnnr1n1.raymond.com su[107494]: (to cloudera-scm) root on none
Sep 15 20:58:24 pocnnr1n1.raymond.com cloudera-scm-server[107467]: Starting cloudera-scm-server: [ OK ]
Sep 15 20:58:24 pocnnr1n1.raymond.com systemd[1]: Started LSB: Cloudera SCM Server.
So, is the Cloudera Manager service started or stopped?
When I try to access CM through GUI, it shows below in chrome:
This site can’t be reached
192.168.211.251 refused to connect. ERR_CONNECTION_REFUSED
Can anyone help me to fix it? Thank you very much.
This indicates that the Cloudera Manager startup ran into an error. Check the Cloudera Manager log file, which should be in the /var/log/cloudera-scm-server directory. Since this is a POC cluster, I assume that when you set it up you did not use an external database like MySQL, but rather the embedded PostgreSQL database. If that's the case, make sure the embedded database process is running when you start the Cloudera Manager Server. To check its status, you can do
service cloudera-scm-server-db status
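If the embedded database is fine, the server log usually names the real failure. A hedged sketch of where to look (the log path comes from the answer above; 7180 is the CM web UI port from the error message):

```shell
# An "active (exited)" LSB unit only means the init script returned 0; the
# Java server may have died right afterwards. The log has the real story.
log_dir="${CM_LOG_DIR:-/var/log/cloudera-scm-server}"
tail -n 100 "$log_dir/cloudera-scm-server.log" 2>/dev/null | grep -iE 'error|exception'

# And check whether anything is actually listening on the CM web UI port:
ss -ltn 2>/dev/null | grep -w 7180 || echo "nothing listening on 7180"
```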
In my case, mariadb failed to start because of dead processes, probably left over from a previous failed attempt. I killed those processes and restarted mariadb successfully; after that, cloudera-scm-server started fine.
Thank you. I hope this helps later viewers.

JMeter distributed testing with 2 slave systems

I am running a jmeter test with one master and two slave systems.
the values I provided in master system are:
no of threads: 750
ramp up: 420 seconds
loop count: 1
when I press ctrl+shift+R, the test execution begins on both "A" & "B" remote systems and the message
"Starting the test on host XXX.XXX.X.XXX @ Mon Feb 8 08:08:21 IST 2016"
is displayed on cmd prompt of both systems.
But after some time I found there was no response from the server. I checked for activity in the "summary listener", but there was none.
I checked the generated "summary.xlsx" file and found all the requests from system "A" have been served and only some of the requests from system "B" were served.
When I checked system A's cmd prompt it says
"Finished the test on host XXX.XXX.X.XXX @ Mon Feb 8 08:08:21 IST 2016".
(I think it is ok, because all its requests were served).
When I checked system B's cmd prompt I DIDN'T find the message
"Finished the test on host XXX.XXX.X.XXX @ Mon Feb 8 08:08:21 IST 2016".
Hoping that system B's requests would eventually be executed, I left it for 8 hours. But to my surprise, when I checked in the morning it was exactly where I had last seen it: no further requests from system B were executed, and the server log showed no activity either. I also didn't find the message
"Finished the test on host XXX.XXX.X.XXX @ Mon Feb 8 08:08:21 IST 2016"
on system B.
Please suggest how I can get all the requests from both slave systems served without running into this problem.
I can bet the issue is that the machines are in different subnets. Read the following step-by-step manual, especially the limitations section:
RMI cannot communicate across subnets without a proxy; therefore neither can jmeter without a proxy.
So, make sure that both A and B are in the same subnet with master.
I assume you are able to run a standalone (non-distributed) test on slave B without issues. If you have not checked that, please verify it works first.
In that case, read this site: https://cloud.google.com/compute/docs/tutorials/how-to-configure-ssh-port-forwarding-set-up-load-testing-on-compute-engine/. It has good information on JMeter communication during distributed testing.
I would check if the RMI ports on slave B are open.
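A quick way to verify that from the master, assuming bash and JMeter's default server_port of 1099 (SLAVE_B is a placeholder address):

```shell
# Try a plain TCP connect to the slave's RMI registry port. If this fails,
# jmeter-server on the slave is unreachable (firewall, subnet, or not running).
SLAVE_B="${SLAVE_B:-192.168.1.20}"   # placeholder; put slave B's IP here
if timeout 3 bash -c "exec 3<>/dev/tcp/$SLAVE_B/1099" 2>/dev/null; then
    echo "RMI port open on $SLAVE_B"
else
    echo "RMI port unreachable on $SLAVE_B"
fi
```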

Apache won't start -- says httpd module is loaded but isn't running

So I've been working with several Virtual Hosts on OS X 10.8.2. I'm using the Apache2 installation and MySQL to run name-based virtual hosts. They have all been working perfectly fine until last night. Suddenly, all of my virtual hosts redirect to a "Cannot connect to" page.
After fiddling around and eventually checking the error logs, I've concluded that Apache is NOT actually running. For example, ps aux | grep apache only returns the grep process. However, if I try sudo /usr/sbin/apachectl start I get "org.apache.httpd: Already loaded" in response.
I've checked my httpd.conf file and it looks perfectly fine. I can't see any changes to it. I also ran the syntax check command (which escapes my brain at the exact moment), and it returned OK. The only thing I found in my error logs, the last thing, was from yesterday, Feb 21, and it says: "[Thu Feb 21 21:46:02 2013] [notice] caught SIGTERM, shutting down"
Ever since then, my Apache error logs contain nothing (because it's not running). I've rebooted and tried restarting Apache; I'm at a total loss as to why it thinks it's running even though it is not.
Any ideas?
In /var/log/system.log when I try to start and restart Apache:
Feb 23 09:27:00 Baileys-MacBook-Pro com.apple.launchd[1] (org.apache.httpd[8766]): Exited with code: 1
Feb 23 09:27:00 Baileys-MacBook-Pro com.apple.launchd[1] (org.apache.httpd): Throttling respawn: Will start in 10 seconds
Feb 23 09:27:10 Baileys-MacBook-Pro com.apple.launchd[1] (org.apache.httpd[8767]): Exited with code: 1
Feb 23 09:27:10 Baileys-MacBook-Pro com.apple.launchd[1] (org.apache.httpd): Throttling respawn: Will start in 10 seconds
Feb 23 09:27:16 Baileys-MacBook-Pro.local sudo[8769]: bailey : TTY=ttys000 ; PWD=/private/var/log ; USER=root ; COMMAND=/usr/sbin/apachectl start
Feb 23 09:27:20 Baileys-MacBook-Pro com.apple.launchd[1] (org.apache.httpd[8772]): Exited with code: 1
Feb 23 09:27:20 Baileys-MacBook-Pro com.apple.launchd[1] (org.apache.httpd): Throttling respawn: Will start in 10 seconds
Feb 23 09:27:20 Baileys-MacBook-Pro.local sudo[8773]: bailey : TTY=ttys000 ; PWD=/private/var/log ; USER=root ; COMMAND=/usr/sbin/apachectl restart
Feb 23 09:27:20 Baileys-MacBook-Pro com.apple.launchd[1] (org.apache.httpd[8777]): Exited with code: 1
Feb 23 09:27:20 Baileys-MacBook-Pro com.apple.launchd[1] (org.apache.httpd): Throttling respawn: Will start in 10 seconds
Feb 23 09:27:26 Baileys-MacBook-Pro.local sudo[8778]: bailey : TTY=ttys000 ; PWD=/private/var/log ; USER=root ; COMMAND=/usr/bin/vi system.log
This problem persists after rebooting. Ever since the other day, it will not start but believes the httpd module is loaded.
I'm trying to find out via Google, but -- does anyone know how Apache checks if it's loaded? I know a lot of services lock files to run; is it possible Apache has a lock file somewhere that's still locked despite Apache not currently running?
NOTE: I've posted this on ServerFault, as well -- I'm posting this here as well because so far I'm not getting anything on ServerFault and I've been looking at Apache posts on StackOverflow, so I'm assuming Apache questions are fine for Stack.
I can reproduce the issue (kinda) by starting Apache when there's another process already listening on the same port that Apache wants to bind to (usually that's port 80). So check if there's perhaps another process listening on that port:
sudo lsof -i tcp:80 | grep LISTEN
EDIT: Perhaps easier: you can start Apache manually in debug mode to see what the reason is it won't start:
sudo /usr/sbin/httpd -k start -e Debug -E /dev/stdout
In my case (something already listening on port 80), it will produce:
(48)Address already in use: make_sock: could not bind to address 0.0.0.0:80
In my case I got:
(2)No such file or directory: httpd: could not open error log file
/private/var/log/apache2/error_log. Unable to open logs
Creating the apache2 directory got it running.
I don't know if this is relevant, but since I faced the same problem and found an alternate solution, let me put in my 2c anyway.
I looked into this post when I got the same issue. It turned out the httpd.conf file was the culprit: I had changed it to install something, and although I removed the installer files, I forgot to change httpd.conf back. I hope you didn't face the same problem.
Regarding the question on port 80: I have seen Skype hog that port (as well as 443, God knows why), and I had better results after turning it off. Make sure you don't have Skype running on port 80.
robertklep's pointer:
sudo /usr/sbin/httpd -k start -e Debug -E /dev/stdout
solved a related problem for me. Same symptoms, different cause, I think.
I set up a test virtual host with SSL & a self-signed certificate.
I had generated a private key with a passphrase.
So httpd was waiting for a passphrase (which I wasn't supplying).
When I started with the debug option, I got the prompt, supplied the passphrase & httpd started up.
So, I will redo the private key without a passphrase...
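Re-keying isn't strictly necessary: the passphrase can simply be stripped from the existing key. A sketch, demonstrated here on a throwaway key ("pass:demo" and the temp paths are illustrative; on the real key you would run the second command against server.key):

```shell
# Generate an encrypted demo key, then re-export it without the passphrase
# so httpd can read it unattended at startup.
tmp=$(mktemp -d)
openssl genrsa -aes256 -passout pass:demo -out "$tmp/server.key" 2048
openssl rsa -in "$tmp/server.key" -passin pass:demo -out "$tmp/server-nopass.key"
# In httpd.conf, point SSLCertificateKeyFile at the passphrase-free key
# (and protect it with restrictive file permissions).
```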
