How do I fix startup delays in chef-solo?

Using chef-solo, calling it directly, with logging set as high as possible, this is what I see in a tail of the log:
[2015-03-24T12:21:48+11:00] INFO: Forking chef instance to converge...
[2015-03-24T12:21:48+11:00] DEBUG: Fork successful. Waiting for new chef pid: 5571
[2015-03-24T12:21:48+11:00] DEBUG: Forked instance now converging
[2015-03-24T12:21:49+11:00] INFO: *** Chef 11.8.2 ***
[2015-03-24T12:21:49+11:00] INFO: Chef-client pid: 5571
[2015-03-24T12:25:41+11:00] DEBUG: Building node object for localhost.localdomain
...
and then it continues and goes on to work perfectly. Notice that between the last two lines there is a delay that varies between 3 and 5 minutes, with no explanation of what it's doing and nothing obvious in netstat or top. I'm at a loss for how to troubleshoot this.
I thought it might be a proxy thing, but setting the correct proxies in /etc/chef/client.rb changed nothing. Any ideas how I can get rid of this delay?

The first thing Chef does when it starts, whether it's chef-solo or chef-client, is profile the system with Ohai.
One difference between chef-solo and chef-client is that the debug log level shows the Ohai output with chef-client, but not with chef-solo.
Depending on your system's configuration, this profiling can take a long time, as it runs through a plethora of plugins. In particular, if you have a Linux system that is joined to Active Directory, it can take a while to retrieve the user/group records via AD, which is why Ohai supports disabling plugins. Likewise, running Chef and Ohai on a Windows system can take a long time.
To disable plugins, you need to edit the appropriate application's configuration file.
chef-solo uses /etc/chef/solo.rb by default
chef-client uses /etc/chef/client.rb by default
Add the following line to the appropriate config:
Ohai::Config[:disabled_plugins] = [:Passwd]
to disable the user/group lookup that might use Active Directory.
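For example, a minimal /etc/chef/solo.rb along these lines might look like the sketch below; the cookbook path is only a placeholder, and if the symbol form doesn't take effect, note that older Ohai 6 releases (as bundled with Chef 11.8) named plugins with lowercase strings such as "passwd" rather than the :Passwd symbol used by Ohai 7 and later:
log_level     :info
cookbook_path ["/var/chef/cookbooks"]   # placeholder path
# Skip the user/group enumeration that can stall on AD-joined hosts
Ohai::Config[:disabled_plugins] = [:Passwd]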
Also, I see from the output that you're using Chef 11.8.2, which was released on December 3, 2013 (over a year ago as of this answer). It's possible that a relevant performance improvement has been introduced since then.
However, if you're not specifically beholden to chef-solo, you might try using chef-client in local mode. There is more information about how to switch in a blog post by Julian Dunn on Chef's site. If you need further assistance, I strongly suggest the Chef IRC channel on irc.freenode.net or the Chef mailing list.
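For reference, switching is largely a matter of pointing chef-client at a local chef-repo. A minimal client.rb sketch for local mode (the repo path is a placeholder) could look like:
local_mode     true
chef_repo_path "/var/chef-repo"   # expects cookbooks/, roles/, data_bags/ here
after which something like chef-client -z -o 'recipe[mycookbook]' (run list illustrative) converges the node without a Chef server.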

Related

how to unlock a vagrant machine while it is being provisioned

Our vagrant box takes ~1h to provision, so when vagrant up is run for the first time, at the very end of the provisioning process I would like to package the box into an image in a local folder so it can be used as a base box the next time it needs to be rebuilt. I'm using the vagrant-triggers plugin to place the code right at the end of the :up process.
Relevant (shortened) Vagrantfile:
pre_built_box_file_name = 'image.vagrant'
pre_built_box_path = 'file://' + File.join(Dir.pwd, pre_built_box_file_name)
# File.file? needs a plain filesystem path, not the file:// URL
pre_built_box_exists = File.file?(File.join(Dir.pwd, pre_built_box_file_name))

Vagrant.configure(2) do |config|
  config.vm.box     = 'ubuntu/trusty64'
  config.vm.box_url = pre_built_box_path if pre_built_box_exists

  # After the first (full) provisioning run, package the box for re-use
  config.trigger.after :up do
    unless pre_built_box_exists
      system("echo 'Building gett vagrant image for re-use...'; vagrant halt; vagrant package --output #{pre_built_box_file_name}; vagrant up;")
    end
  end
end
The problem is that vagrant locks the machine while the current (vagrant up) process is running:
An action 'halt' was attempted on the machine 'gett',
but another process is already executing an action on the machine.
Vagrant locks each machine for access by only one process at a time.
Please wait until the other Vagrant process finishes modifying this
machine, then try again.
I understand the dangers of two processes provisioning or modifying the machine at any given time, but this is a special case where I'm certain the provisioning has completed.
How can I manually "unlock" vagrant machine during provisioning so I can run vagrant halt; vagrant package; vagrant up; from within config.trigger.after :up?
Or is there at least a way to start vagrant up without locking the machine?
vagrant
This issue was fixed in GH #3664 (2015). If it is still happening, it is probably related to plugins (such as the AWS plugin), so try running without plugins.
vagrant-aws
If you're using AWS, then follow this bug/feature report: #428 - Unable to ssh into instance during provisioning, which is currently pending.
However, there is a pull request that fixes the issue:
Allow status and ssh to run without a lock #457
So apply the fix manually, or wait until it's included in the next release.
If you're getting this error for machines that are no longer valid, try running the vagrant global-status --prune command.
Definitely more of a hack than a solution, but I'd rather have a hack than nothing.
I ran into this issue and nothing suggested here worked for me. Even though this question is 6 years old, it's what came up in a Google search (along with precious little else), so I thought I'd share what solved it for me in case anyone else lands here.
My Setup
I'm using Vagrant with the ansible-local provisioner on a local VirtualBox VM, which provisions remote AWS EC2 instances (i.e. ansible-local runs on the VirtualBox instance, Vagrant provisions the VirtualBox instance, and Ansible handles the cloud). This setup is largely because my host OS is Windows and it's a little easier to take Microsoft out of the equation on this one.
My Mistake
I ran an Ansible shell task with a command that doesn't terminate without user input (and did not run it with & to put it in the background).
My Frustration
Even in the Linux subsystem, trying ps aux | grep ruby or ps aux | grep vagrant was unhelpful because the PID would change every time. There's probably a reason for this, likely something to do with how the subsystem works, but I don't know what it is.
My Solution
Just kill the AWS EC2 instances manually, from the console or the CLI; pick your flavor. The terminal where you were running vagrant provision or vagrant up should then finally complete and spit out the summary output, even if you had Ctrl+C'd out of the command.
Hoping this helps someone!

vagrant / puppet init.d script reports start when no start occurred

I'm struggling with a fairly major problem. I've tried multiple workarounds to get this working, but there is something happening between Puppet and the actual server that is just boggling my mind.
Basically, I have an init.d script, /etc/init.d/rserve, which is copied over correctly and works perfectly when used from the command line on the server (i.e. sudo service rserve start|stop|status); the script returns correct exit codes, based on testing the different commands with echo $?.
The puppet service statement is as follows:
service { 'rserve':
  ensure  => running,
  enable  => true,
  require => [File["/etc/init.d/rserve"], Package['r-base'], Exec['install-r-packages']],
}
When Puppet hits this service, it runs its status method, sees that the service isn't running, sets it to running and presumably starts it. The output from Puppet is below:
==> twine: debug: /Schedule[weekly]: Skipping device resources because running on a host
==> twine: debug: /Schedule[puppet]: Skipping device resources because running on a host
==> twine: debug: Service[rserve](provider=upstart): Could not find rserve.conf in /etc/init
==> twine: debug: Service[rserve](provider=upstart): Could not find rserve.conf in /etc/init.d
==> twine: debug: Service[rserve](provider=upstart): Could not find rserve in /etc/init
==> twine: debug: Service[rserve](provider=upstart): Executing '/etc/init.d/rserve status'
==> twine: debug: Service[rserve](provider=upstart): Executing '/etc/init.d/rserve start'
==> twine: notice: /Stage[main]/Etl/Service[rserve]/ensure: ensure changed 'stopped' to 'running'
Now when I actually check for the service using sudo service rserve status or ps aux | grep Rserve, the service is in fact NOT running, and a quick sudo service rserve start shows the init.d script is working fine: Rserve starts and is visible with ps aux.
Is there something I'm missing here? I've even tried starting the service by creating a Puppet Exec running "sudo service rserve start", which still reports that it executed successfully, but the service is still not running on the server.
tl;dr puppet says a service started when it hasn't and there's seemingly nothing wrong with the init.d script, its exit codes or otherwise.
Update 1
In the comments below you can see I tried isolating the service in its own test.pp file and running it with puppet apply on the server, with the same result.
Update 2
I've now tried creating a .sh file with the command to start Rserve, run via a separate Vagrant provisioner, and can finally see an error. However, the error is confusing, as it does not occur when simply running sudo service rserve start; something in the way Vagrant executes .sh commands, or the user it executes them as, is causing an option to be dropped from the command inside the init.d script when it is executed.
The error is R and Rserve specific: it complains about a missing --no-save flag that needs to be passed to R, when that flag is in fact present in the init.d script and is passed correctly when I SSH into the Vagrant box and use the init.d commands directly.
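For reference, the provisioning hook I'm describing is just the usual Vagrant shell provisioner, along the lines of the sketch below (the script path here is illustrative, not my actual file). Note that the shell provisioner runs the script as root by default (privileged: true), so its environment can differ from an interactive SSH session:
Vagrant.configure(2) do |config|
  # privileged defaults to true, so the script runs as root with a non-interactive
  # environment (different PATH/HOME, no TTY) compared to an SSH shell
  config.vm.provision "shell", path: "scripts/start-rserve.sh", privileged: true
end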
Update 3
I've managed to get the whole process working at this point; however, it's one of those situations where the steps to get it working didn't really reveal any understanding of why the original problem existed. I'm going to replicate the broken version and see if I can figure out what exactly was going on, using one of the methods mentioned in the comments, so that I can potentially post an answer that will help someone out later on. If anyone has insight into why this might have been happening, feel free to answer in the meantime. To clarify the situation a bit, here are some details:
The service's dependencies were installed correctly using puppet
The service used a script in /etc/init.d on ubuntu to start|stop the Rserve service
The software in question is R (r-base) and Rserve (a communication layer between other langs and R)
Running the command sudo service rserve start from the command-line worked as expected
The init.d script returned correct error codes
A service {} block was being used to start the service from puppet
Puppet reported starting the service when the service wasn't started
Adding a provision option to the Vagrantfile for a .sh file containing sudo service rserve start revealed that some arguments in the init.d script were being ignored when run by Vagrant's provisioning, but not when run by a user at the shell.

Running Chef cookbooks on ExaData

I am trying to run a Chef cookbook on an ExaData server and I'm running into issues. I was able to bootstrap my ExaData servers; however, when I run chef-client on the target nodes, I get an error like this. I went back and got verbose output of the error, and still don't have any idea what the issue is. I am able to ping, traceroute, and nc to and from the ExaData server and the Chef server. None of the files from the cookbook are transferred, and none of the files are downloaded from the remote Zabbix repository. The Chef run completes the role and recipes, but nothing is installed. Is there something different about ExaData compared to regular RHEL distributions that would cause issues?
--EDIT - 2013-07-15--
Looking at a "successful" chef-client run on a regular RHEL 6.2 OS (whereas ExaData runs RHEL 5.8), I saw fewer errors. There do seem to be a lot of libraries missing from ExaData that are needed to run chef-client. From what I have heard and read in other posts, ExaData is a stripped-down version of RHEL 5.8, containing only what is needed to run databases.
According to a comment in the Chef IRC logs, the 404 message means the client is attempting to use a feature that your server version doesn't support.
If you add the setting enable_reporting false to your client.rb file, it should disable the request to the /reports URL.
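For example, a client.rb along these lines (the server URL and node name are placeholders) would turn reporting off:
chef_server_url  "https://chef.example.com"
node_name        "exadata-node-01"
# Older Chef servers don't implement the /reports endpoint; disabling
# reporting stops the client from calling it and getting the 404
enable_reporting false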

Vagrant box broken after sleep/shutdown

This has happened several times now; the scenario is as follows: I create/provision a vagrant box with puppet and work on it for some time, a couple of days, sometimes a week. At the end of my day I either close the lid on my MacBook (putting it to sleep) or shut it down. At a certain point, vagrant up gives an error:
[default] Mounting shared folders...
[default] -- v-root: /vagrant
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!
mkdir -p /vagrant
Provisioning, reloading, and halt/up all fail at this point. I have to destroy and build the box again, which costs some time and becomes very annoying.
I found this post, which describes the problem and states that an ntp service should fix it. So I've added that to my puppet config, but the problem still occurs.
I also found a similar issue on GitHub, which is fixed, but I'm running a different OS than the one described there, so it's not the same issue. I did post my problem there, without response so far.
The debug log is saved as a gist: https://gist.github.com/pkruithof/5116426
Does anyone know what this problem might be, and how I can fix it?
UPDATE
I think this has been fixed somewhere along the road in Vagrant, because I haven't had this issue in about 6 months now. Therefore I'm closing this question.
sudo vim /etc/NetworkManager/NetworkManager.conf
Add the following lines:
[keyfile]
unmanaged-devices=interface-name:vboxnet0
Then run vagrant reload.
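For context, vboxnet0 is typically the host-only adapter that VirtualBox creates when the Vagrantfile declares a private network, e.g. something like the illustrative line below; the unmanaged-devices entry simply tells NetworkManager to leave that interface alone:
Vagrant.configure(2) do |config|
  # a private_network with a static IP is what creates/uses the vboxnet0 host-only adapter
  config.vm.network "private_network", ip: "192.168.56.10"
end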

Smartfoxserver 2X linux 64 running on EC2 via dotcloud - how to install?

I am currently trying to deploy SmartFoxServer 2X on EC2 using dotcloud. I have been able to detect the private IP of the Amazon web instance, and using the dotcloud tools I have been able to determine the correct port. However, I am having difficulty installing the server proper via the command line so that I can log into it using the AdminTool.
My postinstall is fairly straightforward:
./SFS2X/sfs2x-service start-launchd
I find that on dotcloud push there is a fair amount of promising output in my Cygwin terminal, but the push hangs after saying that the sfs2x-service has been launched correctly, until it times out.
Consequently, my question is: has anyone found a way to install SFS2X on EC2 via dotcloud successfully? I managed to have partial success with SFS Pro, with a complete push to dotcloud, by calling ./jre/bin/java -jar installer.jar in my postinstall. Do I need to do extra legwork and build an installer jar for SFS2X? What would be the best way to do this?
I do understand that there is a standard approach to deploying SFS2X on EC2 using RightScale; however, I am interested in deployment using the dotcloud platform.
Thanks in advance.
The reason it is hanging is that you are trying to start your process in the postinstall script, and this is not the correct place to do that. The postinstall script is supposed to finish; if it doesn't, the deployment will time out and then be cancelled.
Once the postinstall script has finished, the platform will complete the rest of your deployment.
See this page for more information about the dotCloud postinstall script:
http://docs.dotcloud.com/0.9/guides/hooks/#post-install
Pay attention to the warning at the end:
Warning:
If your post-install script returns an error (non-zero exit code), or if it runs for more than 10 minutes, the platform will consider that your build has failed, and the new version of your code will not be deployed.
Instead of putting this in the postinstall script, you should add it as a background process, so that it starts up once the deployment process is complete.
See this page for more information on adding background processes to dotCloud services:
http://docs.dotcloud.com/0.9/guides/daemons/
TL;DR: You need to create a supervisord.conf file, add it to the root of your project, and add your service to it.
Example (you will need to change it to fit your situation):
[program:smartfoxserver]
command = /home/dotcloud/current/SFS2X/sfs2x-service start-launchd
Also, make sure you have the correct dotCloud service specified in your dotcloud.yml so that the correct binaries and libraries are installed for your SmartFoxServer application.
