Test Kitchen Fails Almost Every Time on Transferring Files - amazon-ec2

Almost every time I run kitchen converge with the EC2 driver, it creates the server and establishes an SSH connection, but then, after detecting the Chef Omnibus installation, it fails while transferring files with an unhelpful error. I've tried using different versions of net-ssh and reinstalling ChefDK. I've gotten it to converge successfully once out of maybe 30 attempts and can't figure out what was different that time.
Has anybody else run into this problem?
-----> Starting Kitchen (v1.10.2)
-----> Creating <default-rhel7>...
If you are not using an account that qualifies under the AWS
free-tier, you may be charged to run these suites. The charge
should be minimal, but neither Test Kitchen nor its maintainers
are responsible for your incurred costs.
Instance <i-167bf188> requested.
Polling AWS for existence, attempt 0...
Attempting to tag the instance, 0 retries
EC2 instance <i-167bf188> created.
Waited 0/600s for instance <i-167bf188> to become ready.
Waited 5/600s for instance <i-167bf188> to become ready.
Waited 10/600s for instance <i-167bf188> to become ready.
Waited 15/600s for instance <i-167bf188> to become ready.
Waited 20/600s for instance <i-167bf188> to become ready.
Waited 25/600s for instance <i-167bf188> to become ready.
Waited 30/600s for instance <i-167bf188> to become ready.
Waited 35/600s for instance <i-167bf188> to become ready.
Waited 40/600s for instance <i-167bf188> to become ready.
Waited 45/600s for instance <i-167bf188> to become ready.
Waited 50/600s for instance <i-167bf188> to become ready.
Waited 55/600s for instance <i-167bf188> to become ready.
EC2 instance <i-167bf188> ready.
Waiting for SSH service on 10.254.105.26:22, retrying in 3 seconds
Waiting for SSH service on 10.254.105.26:22, retrying in 3 seconds
Waiting for SSH service on 10.254.105.26:22, retrying in 3 seconds
[SSH] Established
Finished creating <default-rhel7> (1m56.70s).
-----> Converging <default-rhel7>...
Preparing files for transfer
Preparing dna.json
Preparing current project directory as a cookbook
Removing non-cookbook files before transfer
Preparing validation.pem
Preparing client.rb
-----> Chef Omnibus installation detected (install only if missing)
Transferring files to <default-rhel7>
C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/ruby_compat.rb:25:in `select': closed stream (IOError)
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/ruby_compat.rb:25:in `io_select'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/packet_stream.rb:75:in `available_for_read?'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/packet_stream.rb:87:in `next_packet'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/session.rb:193:in `block in poll_message'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/session.rb:188:in `loop'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/transport/session.rb:188:in `poll_message'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:474:in `dispatch_incoming_packets'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:225:in `preprocess'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:206:in `process'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `block in loop'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `loop'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:170:in `loop'
from C:/Users/AlexKiaie/AppData/Local/chefdk/gem/ruby/2.1.0/gems/net-ssh-3.2.0/lib/net/ssh/connection/session.rb:119:in `close'
from C:/opscode/chefdk/embedded/lib/ruby/gems/2.1.0/gems/test-kitchen-1.10.2/lib/kitchen/transport/ssh.rb:115:in `close'
from C:/opscode/chefdk/embedded/lib/ruby/gems/2.1.0/gems/test-kitchen-1.10.2/lib/kitchen/transport/ssh.rb:97:in `cleanup!'
from C:/opscode/chefdk/embedded/lib/ruby/gems/2.1.0/gems/test-kitchen-1.10.2/lib/kitchen/instance.rb:274:in `cleanup!'
from C:/opscode/chefdk/embedded/lib/ruby/gems/2.1.0/gems/test-kitchen-1.10.2/lib/kitchen/command.rb:209:in `run_action_in_thread'
from C:/opscode/chefdk/embedded/lib/ruby/gems/2.1.0/gems/test-kitchen-1.10.2/lib/kitchen/command.rb:173:in `block (2 levels) in run_action'
from C:/opscode/chefdk/embedded/lib/ruby/gems/2.1.0/gems/logging-2.1.0/lib/logging/diagnostic_context.rb:450:in `call'
from C:/opscode/chefdk/embedded/lib/ruby/gems/2.1.0/gems/logging-2.1.0/lib/logging/diagnostic_context.rb:450:in `block in create_with_logging_context'

I had a similar issue. After a lot of digging, I found an interesting message in /var/log/secure:
"localhost sshd[1081]: error: no more sessions".
By default sshd allows 10 sessions per connection (the MaxSessions setting); these are SSH channels, not interactive logins. If sessions are opened faster than they are closed, or are never closed properly, you hit that limit and get this error.
I then went into my .kitchen.yml and added:
max_ssh_sessions: 1
to the transport section. So it now looks like:
transport:
  ssh_key: ./kitchen.pem
  # need to get this key from vault, then place it on the kitchen ecs container
  connection_timeout: 10
  connection_retries: 5
  max_ssh_sessions: 1
  username: centos
Test Kitchen is noticeably slower with this setting, but it now works 100% of the time. What I think is happening is that Kitchen opens multiple SSH sessions in parallel to speed up installing the required tools, e.g. the yum installs for ansible/git/whatever and the /tmp/install.sh that bootstraps Chef.
Hope this helps someone. It took me a little while to find out.
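If throttling Kitchen to a single session is too slow, the limit can instead be raised on the instance side. A rough sketch, assuming a RHEL/CentOS image where sshd runs under systemd (the paths and the value 30 are assumptions, not something taken from the logs above):
# show the effective limit (MaxSessions defaults to 10 on most distributions)
sudo sshd -T | grep -i maxsessions
# raise the limit instead of throttling Test Kitchen
echo 'MaxSessions 30' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart sshd
Raising MaxSessions keeps the parallel transfer speed, but it would have to be baked into the AMI or applied via user data before Kitchen connects.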

Related

Next.js build command gets stuck sometimes in Ubuntu 20.04 AWS

When I run next build on my server, it gets stuck at the point shown below, and the website also crashes and returns a 504 Gateway Timeout error:
info - Loaded env from /home/ubuntu/MYPROJECT/.env
info - Using webpack 4. Reason: future.webpack5 option not enabled https://nextjs.org/docs/messages/webpack5
info - Checking validity of types
info - Using external babel configuration from /home/ubuntu/MYPROJECT/.babelrc
info - Creating an optimized production build..
This does not happen every time, maybe once in every 5 to 10 deployments.
I then have to stop and start the EC2 instance, and right afterwards the build command always runs very quickly. Could that mean the server has memory issues? How could I resolve this?
I have a t2.micro instance with 8GB of storage.
Usage of /: 76.2% of 7.69GB
Memory usage: 50%
Could it be that upgrading the instance would resolve this issue?
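For what it's worth, a t2.micro has only 1 GiB of RAM, and an out-of-memory build would match these symptoms. A minimal sketch of checking memory pressure during a build, assuming a standard Ubuntu AMI (the 2G swap-file size is just an example):
free -m                          # memory and swap before the build
vmstat 5 > vmstat.log &          # sample memory/swap activity while the build runs
next build                       # or however the build is normally invoked (npm run build, etc.)
kill %1                          # stop the vmstat sampler afterwards
# a swap file is a common stop-gap on small instances:
sudo fallocate -l 2G /swapfile && sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile
If vmstat shows heavy swapping, or the kernel OOM killer shows up in dmesg, a larger instance or a swap file is the likely fix.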

Chef reboot resource causes chef run to time out before Windows server is able to start back up after update and resume recipe

I am trying to create a domain on a Windows 2012 R2 server, and it requires a reboot before the recipe can proceed:
reboot "reboot server" do
reason "init::chef - continue provisioning after reboot"
action :reboot_now
end
I receive the following error, which indicates a timeout; it occurs before I see the OS come back to life after the update:
Failed to complete #converge action: [WinRM::WinRMAuthorizationError] on default-windows2012r2
Does anyone out there know how to make the Chef run continue after the OS is back up? I hear that :reboot_now is supposed to do the trick... ^^^ but as you can see, it isn't :)
P.S. This also causes Windows to update... Goal: get Chef to resume after the update is complete and the server is back up.
Update: The server actually seems to be rebooting twice and exiting the chef run on the second reboot. If I remove the ONE reboot resource block that I have, then it does not reboot at all (which makes no sense to me)... Here is the output from the chef run:
Chef Client finished, 2/25 resources updated in 19 seconds
[2018-10-29T08:04:11-07:00] WARN: Rebooting server at a recipe's request. Details: {:delay_mins=>0, :reason=>"init::chef - continue provisioning after reboot", :timestamp=>2018-10-29 08:04:11 -0700, :requested_by=>"reboot server"}
Running handlers:
[2018-10-29T08:04:11-07:00] ERROR: Running exception handlers
Running handlers complete
[2018-10-29T08:04:11-07:00] ERROR: Exception handlers complete
Chef Client failed. 2 resources updated in 20 seconds
[2018-10-29T08:04:11-07:00] FATAL: Stacktrace dumped to C:/Users/vagrant/AppData/Local/Temp/kitchen/cache/chef-stacktrace.out
[2018-10-29T08:04:11-07:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2018-10-29T08:04:11-07:00] FATAL: Chef::Exceptions::Reboot: Rebooting server at a recipe's request. Details: {:delay_mins=>0, :reason=>"init::chef - continue provisioning after reboot", :timestamp=>2018-10-29 08:04:11 -0700, :requested_by=>"reboot server"}
^^^ that repeats twice ^^^
Update #2: I even commented out every line except the reboot block and am seeing the same issue... This is ridiculous, and I'm confident my code isn't the problem (considering that all I am using now is a reboot resource).
Update #3: I generated an entirely new cookbook and called it "reboot"... it contains the following code:
reboot 'app_requires_reboot' do
  action :request_reboot
  reason 'Need to reboot when the run completes successfully.'
end
And unfortunately, it too reboots the Windows server twice... Here are the logs:
Recipe: reboot::default
* reboot[app_requires_reboot] action request_reboot[2018-10-29T10:21:41-07:00] WARN: Reboot requested:'app_requires_reboot'
- request a system reboot to occur if the run succeeds
Running handlers:
Running handlers complete
Chef Client finished, 1/1 resources updated in 03 seconds
[2018-10-29T10:21:41-07:00] WARN: Rebooting server at a recipe's request. Details: {:delay_mins=>0, :reason=>"Need to reboot when the run completes successfully.", :timestamp=>2018-10-29 10:21:41 -0700, :requested_by=>"app_requires_reboot"}
Running handlers:
[2018-10-29T10:21:41-07:00] ERROR: Running exception handlers
Running handlers complete
[2018-10-29T10:21:41-07:00] ERROR: Exception handlers complete
Chef Client failed. 1 resources updated in 03 seconds
[2018-10-29T10:21:41-07:00] FATAL: Stacktrace dumped to C:/Users/vagrant/AppData/Local/Temp/kitchen/cache/chef-stacktrace.out
[2018-10-29T10:21:41-07:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2018-10-29T10:21:41-07:00] FATAL: Chef::Exceptions::Reboot: Rebooting server at a recipe's request. Details: {:delay_mins=>0, :reason=>"Need to reboot when the run completes successfully.", :timestamp=>2018-10-29 10:21:41 -0700, :requested_by=>"app_requires_reboot"}
This seems like an issue with Chef now... this is bad... Has anyone ever successfully rebooted Windows with Chef before? And why does a single reboot block reboot the server twice?
Update number 4 will be after I have thrown my computer out the window
The issue was resolved with the following logic:
reboot "reboot server" do
reason "init::chef - continue provisioning after reboot"
action :nothing
only_if {reboot_pending?}
end
Adding the only_if guard allows the recipe to skip that step if the OS does not report a pending Windows update.
I had forgotten that Windows actually does track whether a system update/reboot is required.
As part of my powershell_script block, I included the following:
notifies :reboot_now, 'reboot[reboot server]', :immediately

Cannot re-register Mesos agent

After updating the isolation property of mesos-slave, it fails to re-register:
6868 status_update_manager.cpp:177] Pausing sending status updates
6877 slave.cpp:915] New master detected at master#192.168.1.1:5050
6867 status_update_manager.cpp:177] Pausing sending status updates
6877 slave.cpp:936] No credentials provided. Attempting to register without authentication
6877 slave.cpp:947] Detecting new master
6869 slave.cpp:1217] Re-registered with master master#192.168.1.1:5050
6866 status_update_manager.cpp:184] Resuming sending status updates
6869 slave.cpp:1253] Forwarding total oversubscribed resources {}
6874 slave.cpp:4141] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration.
6874 slave.cpp:904] Re-detecting master
6874 slave.cpp:947] Detecting new master
6874 status_update_manager.cpp:177] Pausing sending status updates
6869 status_update_manager.cpp:177] Pausing sending status updates
6871 slave.cpp:915] New master detected at master#192.168.1.1:5050
6871 slave.cpp:936] No credentials provided. Attempting to register without authentication
6871 slave.cpp:947] Detecting new master
6872 slave.cpp:1217] Re-registered with master master#192.168.1.1:5050
6872 slave.cpp:1253] Forwarding total oversubscribed resources {}
6871 status_update_manager.cpp:184] Resuming sending status updates
6871 slave.cpp:4141] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration.
It seems to be stuck in an infinite loop. Any idea how to start a fresh slave? I've tried removing the work_dir and restarting the mesos-slave process, but without any success.
The situation was caused by an accidental rename of the work_dir. After restarting, the mesos-slave was not able to reconnect or to kill running tasks. I tried running a cleanup recovery on the slaves:
echo 'cleanup' > /etc/mesos-slave/recover
service mesos-slave restart
# after recovery finishes
rm /etc/mesos-slave/recover
service mesos-slave restart
This partially helped, but there are still many zombie tasks in Marathon, because the Mesos master is not able to retrieve any information about those tasks. Looking at the metrics, I found that some slaves are marked as "inactive".
UPDATE: the following appears in the master logs:
Cannot kill task service_mesos-kafka_kafka.e0e3e128-ef0e-11e6-af93-fead7f32c37c
of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon) at
scheduler-e76665b1-de85-48a3-b9fd-5e736b64a9d8#192.168.1.10:52192
because the agent cac09818-0d75-46a9-acb1-4e17fdb9e328-S10 at
slave(1)#192.168.1.1:5051 (w10.example.net) is disconnected.
Kill will be retried if the agent re-registers
After restarting the current mesos-master:
Cannot kill task service_mesos-kafka_kafka.e0e3e128-ef0e-11e6-af93-fead7f32c37c
of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon)
at scheduler-9e9753be-99ae-40a6-ab2f-ad7834126c33#192.168.1.10:39972
because it is unknown; performing reconciliation
Performing explicit task state reconciliation for 1 tasks
of framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon)
at scheduler-9e9753be-99ae-40a6-ab2f-ad7834126c33#192.168.1.10:39972
Dropping reconciliation of task service_mesos-kafka_kafka.e0e3e128-ef0e-11e6-af93-fead7f32c37c
for framework ecd3a4be-d34c-46f3-b358-c4e26ac0d131-0000 (marathon)
at scheduler-9e9753be-99ae-40a6-ab2f-ad7834126c33#192.168.1.10:39972
because there are transitional agents
The split-brain situation was caused by having more than one work_dir. In most cases it might be enough to move data from the incorrect work_dir:
mv /tmp/mesos/slaves/* /var/lib/mesos/slaves/
Then force re-registration:
rm -rf /var/lib/mesos/meta/slaves/latest
service mesos-slave restart
Currently running tasks won't survive (they won't be recovered). Tasks from the old executors should be marked as TASK_LOST and scheduled for cleanup, which avoids the problem of zombie tasks that Mesos is unable to kill (because they were running under a different work_dir).
If the mesos-slave is still registered as inactive, restart the current Mesos master.
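To avoid the same split-brain later, it also helps to pin the work_dir explicitly rather than relying on the /tmp default. A small sketch, assuming the Mesosphere packages (which read per-flag files from /etc/mesos-slave, the same mechanism the recover trick above uses):
echo '/var/lib/mesos' | sudo tee /etc/mesos-slave/work_dir
sudo service mesos-slave restart
# the agent's state endpoint should reflect the flag it picked up (5051 is the default port)
curl -s http://localhost:5051/state | grep -o '"work_dir":"[^"]*"'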

How do I fix startup delays in chef-solo

Using chef-solo, calling it directly with logging set as high as possible, this is what I see in a tail of the log:
[2015-03-24T12:21:48+11:00] INFO: Forking chef instance to converge...
[2015-03-24T12:21:48+11:00] DEBUG: Fork successful. Waiting for new chef pid: 5571
[2015-03-24T12:21:48+11:00] DEBUG: Forked instance now converging
[2015-03-24T12:21:49+11:00] INFO: *** Chef 11.8.2 ***
[2015-03-24T12:21:49+11:00] INFO: Chef-client pid: 5571
[2015-03-24T12:25:41+11:00] DEBUG: Building node object for localhost.localdomain
...
and then it continues and goes on to work perfectly. Notice that between the last two lines there is a delay that varies between 3 and 5 minutes, with no explanation of what it is doing and nothing obvious in netstat or top. I'm at a loss for how to troubleshoot this.
I thought it might be a proxy issue, but setting the correct proxies in /etc/chef/client.rb changed nothing. Any ideas how I can get rid of this delay?
The first thing Chef does when it starts, whether it's chef-solo or chef-client, is profile the system with Ohai.
One difference between chef-solo and chef-client is that the debug log level shows the Ohai output with chef-client, but it does not with chef-solo.
Depending on your system's configuration, this profiling can take a long time, as it runs through a plethora of plugins. In particular, if you have a Linux system that is connected to Active Directory, it can take a while to retrieve the user/group records via AD, which is why Ohai supports disabling plugins. Running Chef and Ohai on a Windows system can also take a long time.
To disable plugins, you need to edit the appropriate application's configuration file.
chef-solo uses /etc/chef/solo.rb by default
chef-client uses /etc/chef/client.rb by default
Add the following line to the appropriate config:
Ohai::Config[:disabled_plugins] = [:Passwd]
to disable the user/group lookup that might use Active Directory.
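To confirm that Ohai (and the user/group lookup in particular) is where the time goes, it is worth timing it in isolation first; a quick sketch, assuming the ohai binary that ships with Chef is on the node's PATH:
time ohai > /dev/null            # profile a full Ohai run by itself
time getent passwd > /dev/null   # isolate the user enumeration (slow when AD/LDAP is involved)
If the second command alone takes minutes, disabling the Passwd plugin as shown above should remove most of the delay.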
Also, I see from the output that you're using Chef 11.8.2, which came out December 3, 2013 (over a year ago as of this answer). It's possible that a performance improvement has been introduced since then.
However, if you're not specifically beholden to chef-solo, you might try using chef-client in local mode. There is more information about how to switch in a blog post by Julian Dunn on Chef's site. If you need further assistance, I strongly suggest the Chef IRC channel on irc.freenode.net or the Chef mailing list.
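For reference, a minimal local-mode invocation looks something like the sketch below; the cookbook name is a placeholder, and -z/-o are the short forms of --local-mode and --override-runlist:
chef-client -z -o 'recipe[my_cookbook]'   # run from inside the chef-repo; no Chef server needed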

How to handle postgresql connection problems

I have run into a number of cases where the PostgreSQL connection is in a bad state. OK, I'll make this complicated by having a master with a couple of slave servers... and to complicate matters even more, the master is on OS X with PostgreSQL installed via Homebrew.
So things like
pg_basebackup: could not connect to server: could not connect to server: Connection refused
Is the server running on host "XX.XX.XX.XX" and accepting
TCP/IP connections on port 5432?
crop up. The master's postgres/server.log was showing:
FATAL: pre-existing shared memory block (key xx, ID yyy) is still in use.
This was despite running
launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist
launchctl load ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist
to stop and start the service. [head scratch] That got me worrying... So I ran
launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist
I query the master's application and can still edit data! WAIT, PostgreSQL was supposed to be unloaded!
So launchctl is NOT working as expected... This explains why some syncing has fallen by the wayside.
Question 1: How can we ensure that postgres is really stopped on OS X? Although it may be risky, the only alternative appears to be killing the process recorded in postmaster.pid.
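A sketch of how one might verify this independently of launchctl, assuming a Homebrew install with the default data directory (adjust -D if yours differs):
pg_ctl -D /usr/local/var/postgres status         # reports the postmaster PID if one is running
pgrep -fl postgres                               # double-check for any surviving backends
pg_ctl -D /usr/local/var/postgres stop -m fast   # controlled stop that does not depend on launchctl
A fast shutdown rolls back open transactions but does not corrupt the data directory, so it is generally safer than deleting postmaster.pid by hand.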
Then things probably went haywire in the flow... The master's log states:
LOG: could not send data to client: Broken pipe
and indeed one of the VPS slaves is complaining:
$ service postgresql stop
Error: pid file is invalid, please manually kill the stale server process.
Question 2: How can those be killed (Ubuntu 14.04) without harming the database and the WAL process that should be updating the slave? (Or is there a more effective/saner way of handling master-slave replication?)
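For Question 2, a sketch of the less destructive route on Ubuntu 14.04 using the Debian cluster tools instead of a raw kill (the 9.3/main cluster name is an assumption; pg_lsclusters shows the real one):
pg_lsclusters                                          # list clusters and their reported status
sudo pg_ctlcluster 9.3 main stop                       # clean shutdown via the Debian wrapper
head -1 /var/lib/postgresql/9.3/main/postmaster.pid    # PID recorded in the allegedly stale pid file
ps -p "$(head -1 /var/lib/postgresql/9.3/main/postmaster.pid)"   # is that process actually alive?
A clean (smart or fast) shutdown does not endanger the data directory or the WAL stream; the risky move is removing the pid file while a postmaster is still running.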
