When a recipe fails and the client run hangs halfway, the installed Chef client becomes unusable.
You can then exit the machine, reboot, clean up Chef pid files and so forth, but each time the Chef client is started the following message is shown:
Chef client is running, will wait for it to finish and then run.
Chef should be able to recover from this after a reboot, but this is not the case.
What is the best way to recover from a client run that hangs halfway? Currently I delete the VM and create a new one, but that is not a real solution.
Is it possible to recover when it hangs halfway?
Stop the service if it is running: sudo service chef-client stop
If the issue persists:
Find out whether any chef-client process or service is still running: ps aux | grep chef
Kill the process.
If the issue persists:
Look up your chef-client settings in /etc/chef/client.rb and/or /etc/init.d/chef-client,
then locate the pid_file and lockfile paths
and delete those files.
If the issue still persists, you might be running an old version of chef-client, as I was.
In that case, look in your Chef cache folder.
I had to delete /var/chef/cache/chef-client-running.pid
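The pid-file cleanup could be scripted roughly like this. It is only a sketch: the function name clean_stale_pid is made up for illustration, and the file to pass in is whatever pid_file or lockfile path your own configuration uses (the cache path above is just the one from my setup). It removes the file only when the recorded process is gone, so it will not pull the lock out from under a run that is still alive.

```shell
#!/bin/sh
# Hypothetical helper: remove a pid file only if the process it names is gone.
# Run it as root; for a process owned by another user, `kill -0` fails for an
# unprivileged caller and the file would be wrongly treated as stale.
clean_stale_pid() {
    pid_file="$1"

    if [ ! -f "$pid_file" ]; then
        echo "no pid file at $pid_file"
        return 0
    fi

    pid=$(cat "$pid_file")

    # kill -0 sends no signal; it only checks whether the process exists.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is still running; stop or kill it first"
        return 1
    fi

    rm -f "$pid_file"
    echo "removed stale pid file $pid_file"
}
```

Example call, as root: clean_stale_pid /var/chef/cache/chef-client-running.pid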
If a timeout takes place, converging should work fine over and over again. If you need to remove the client, you can run sudo rm -rf /etc/chef on the client machine. All the options are described here in detail.
I ran into the same issue where the kitchen converge process would get hung, typically due to an improper msi installation in my case. I would then have to reboot the kitchen machine.
I noticed that an elevated scheduled task was running. Stopping it killed the process, but the issue returned when I ran the converge again ("chef client 2092 is running, will wait for it to finish and then run").
It then occurred to me that the last attempt to reboot, which normally resolves the issue, didn't succeed because of an open file preventing the reboot. Rebooting the machine did resolve the issue in my case. It's not a perfect solution but it does work as a work-around in my case.
The com.canonical.multipassd service is constantly logging errors on my Mac and multipass won't work at all, even after reinstalling, rebooting, and updating my Mac.
In an attempt to use my GPU in a Linux VM through multipass, I tried to install the AMDGPU driver for my card (Radeon Pro 5300 4GB). I had installed multipass through brew and made some progress, but the ./amdgpu-install process was returning various errors as a result of missing dependencies. Having started to resolve the missing dependencies, in an attempt to build the driver again, the build just stopped halfway through and I couldn't terminate the process or get the VM to respond at all (didn't take a screenshot sorry).
Because of this, I closed the VM shell and tried to get multipass to shut down the VM. Multipass stopped responding altogether - the application just spun, and it didn't respond at all in terminal. I force quit multipass in Activity Monitor. That still didn't fix it, so I (somewhat stupidly) force quit 'hyperkit' and 'multipassd'. This is where everything went really wrong.
Having force quit 'multipassd', I tried to re-open multipass, but it returned the error below
list failed: cannot connect to the multipass socket
Please ensure multipassd is running and '/var/run/multipass_socket' is accessible
I looked this up and tried a few suggested solutions. I uninstalled multipass with Brew. I deleted the application, and reinstalled with brew. I also tried brew remove multipass, and tried installing using the .pkg from the multipass website. When that didn't fix it, I restarted my computer and reset NVRAM on startup. That also didn't make a difference, so I have just updated my Mac to MacOS 11.4, and it is still not fixed.
The console logs suggest that multipassd is still doing something, as it is continually logging in the system.log:
May 26 09:39:15 <myName> com.apple.xpc.launchd[1] (com.canonical.multipassd[2131]): Service exited with abnormal code: 1
May 26 09:39:15 <myName> com.apple.xpc.launchd[1] (com.canonical.multipassd): Service only ran for 0 seconds. Pushing respawn out by 10 seconds.
In the multipass log, this message is also being generated about once every 10 seconds:
[error] [daemon] Caught an unhandled exception: Invalid MAC address
[warning] [Qt] QMutex: destroying locked mutex
These messages are being generated even after resetting NVRAM and rebooting. I think they're the cause of my issue launching multipass, but I haven't found any solution to stop them, and I can't identify any process that is still running related to multipass. As far as brew is concerned, multipass is not installed, but its logs are still filling up...
Happy to provide console or terminal output if needed - nothing else on my Mac seems to be broken, I just can't use multipass now. I do have a time machine backup, so if that is guaranteed to fix it, I might just resort to the backup, but I'm not sure that would necessarily fix it, and I would rather find an alternative solution.
As this has probably made clear, I'm very new to Linux and VMs... any solutions greatly appreciated!
Fixed it!! I hadn't properly uninstalled it - the 'proper' uninstall script can be run using
sudo sh "/Library/Application Support/com.canonical.multipass/uninstall.sh"
Reinstalling multipass after running this command worked fine.
I was trying to start AppScale on an EC2 instance to deploy my Python app. First I installed appscale-tools and initialized the cluster, which created the AppScaleFile. Then I ran the appscale up command, which gets stuck at "Waiting for head node to initialize...". Here is the screenshot:
Screenshot
It is some sort of an infinite loop. Any help?
If using Ubuntu you may be running into an issue with the version of monit that they distribute, as described here:
https://bugs.launchpad.net/ubuntu/+source/monit/+bug/1786910
If so, then this AppScale blog has more information on workarounds:
https://blog.appscale.com/monit-bug-impacting-appscale-deployments
To check if you are using the broken monit (version 1:5.16-2ubuntu0.1):
# dpkg -l monit
To downgrade (on all hosts running AppScale):
# sudo apt-get install monit=1:5.16-2
# sudo apt-mark hold monit
Keep in mind that this is reverting a security fix, so may not always be an acceptable solution.
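If you have several hosts to check, the version test can be scripted. The following is a minimal sketch: the function name monit_is_affected is hypothetical, and the version string is the one quoted above. On a real host you would feed it the output of dpkg-query -W -f='${Version}' monit, which prints only the installed version.

```shell
#!/bin/sh
# Version string quoted above as the broken monit build.
AFFECTED_MONIT_VERSION='1:5.16-2ubuntu0.1'

# Hypothetical helper: succeeds (exit 0) when the given version string
# is exactly the affected build, fails otherwise.
monit_is_affected() {
    [ "$1" = "$AFFECTED_MONIT_VERSION" ]
}

# Example use on a host (left commented out so the sketch has no side effects):
#   installed=$(dpkg-query -W -f='${Version}' monit)
#   if monit_is_affected "$installed"; then
#       echo "monit $installed is the affected build"
#   fi
```

The check is an exact string match on purpose, since only that one build is named as broken here.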
I'm unable to install PIAF 3 or PIAF 3.1.6. It hangs at 'stopping ntp services'. Both times I tried a clean install from scratch. After 5 minutes I press Ctrl-C and it picks back up, but it fails later when it again hangs at 'stopping ntp services'.
Any ideas anyone?
CentOS 6.7 64Bit minimal
Green 3.1.6 033015
running on a DELL desktop T20.
Open a new terminal and run:
service ntpd stop
You will see the installation continue from where it was stuck.
Repeat this whenever the script hangs at 'stopping ntp services'.
It works perfectly fine for me; I hope it solves your problem as well. Keep me posted.
Very often when running ansible-playbook on the Vagrant VM from Windows, I need to stop in the middle of something by pressing Ctrl+C. This happens if ansible becomes unresponsive or there is some bug we need to fix ASAP, so there is no point in waiting until the provisioner completes.
The problem is that Ctrl+C does not work: two ruby.exe processes get stuck in the process tree. Any subsequent vagrant commands fail until you manually kill these ruby processes.
I also have to kill all stuck python ansible processes on the VM before running a new provision.
Is there any way to handle this more gently?
I found this problem as well on Windows when using Puppet Apply. The only way I can cleanly kill it is by opening another terminal/cmd and running vagrant ssh -- sudo pkill puppet. That gracefully terminates the process and lets me regain control of my first terminal.
In short the solution is:
Take a terminal that works.
I found one that works: gitbash v2.32.0.windows..
The latest available gitbash is currently v2.38.1, but only the old one works correctly with Vagrant (Oracle VM). The strange thing is that the latest one (gitbash v2.38.1) works fine for SSH connections to AWS EC2 instances.
Alternatively, Windows PowerShell works fine with Vagrant (Oracle VM).
In case my bad experience helps someone, here it is.
The following terminals DON'T WORK
gitbash v2.38.1 (latest for now)
gitbash v2.36.0
ConEmu v220807 Alpha (latest for now)
cmder v1.3.20.1282 (latest for now)
I propose using vagrant halt.
I'm working on a bash setup script for CentOS 6.4 machines. On a brand new install I'm running into an issue that seems to be reproducible, but the scenario is unusual.
The setup script is run as root. The first step is to run yum update with no options:
yum update
This completes successfully with a zero exit code. The next step is to retrieve the EPEL rpm using wget:
wget http://dl.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
However, this is consistently failing when resolving the host name every time this is run from a clean CentOS install:
wget: unable to resolve host address “dl.fedoraproject.org”
When executing these commands in succession from the command line however, no issues are encountered and wget is able to retrieve the EPEL rpm:
sudo yum update
sudo wget http://dl.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
Is there anything that happens during the yum update that could cause the DNS lookup to fail without exiting the script first? If I rerun the script after the first failure, it passes on the second time around.
This can happen when the Time to Live of the DNS record expires on the system or on a caching DNS server before the next wget invocation, and the following attempt to resolve the name from the authoritative server fails. See http://en.wikipedia.org/wiki/Time_to_live#DNS_records. Of course, it's also possible that the caching DNS server becomes inaccessible.
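Whatever the underlying cause turns out to be, the question notes that a second run succeeds, so one workaround is to retry the download a few times instead of failing on the first lookup error. The following is a sketch only: the retry function is a hypothetical helper, not something yum or wget provides, and the attempt count and delay in the usage line are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between tries. Returns 0 on the first success, or 1
# once every attempt has failed.
retry() {
    attempts="$1"
    delay="$2"
    shift 2

    n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}
```

In the setup script this could be used as: retry 5 3 wget http://dl.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm

This hides a transient resolver failure but does not explain it, so it is still worth checking /etc/resolv.conf right after yum update if the failure keeps recurring.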