I created a playbook to reboot my remote servers. I use wait_for to wait for the remote servers to come back up before I continue. So I have the following code:
---
- hosts: hostName
  tasks:
    - name: reboot
      shell: reboot
      async: 1
      poll: 0

    - name: wait for server to come up
      local_action: wait_for
      args:
        host: hostName
        port: 22
        state: started
        delay: 10
        timeout: 600
My target server was up about 5 minutes after the reboot was initiated. However, the playbook got stuck at this task until it timed out and generated an error.
My questions are:
1. How does wait_for work here? Does it send an SSH connection request to the target host and time out if it cannot connect after 600 seconds? Or does it keep pinging the target host until it times out?
2. What could be the problem I am having?
You'll be better off using wait_for_connection in this case. For example, given that the play is running against - hosts: hostName:
- name: Wait 600 seconds, but only start checking after 10 seconds
  wait_for_connection:
    delay: 10
    timeout: 600
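In context, the whole play could look roughly like this (a sketch only; the async reboot task is kept from the question):
- hosts: hostName
  tasks:
    - name: reboot
      shell: reboot
      async: 1
      poll: 0

    - name: Wait 600 seconds, but only start checking after 10 seconds
      wait_for_connection:
        delay: 10
        timeout: 600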
Q: How does wait_for work here?
A: wait_for is waiting for a port to become available.
Q: Does it send the ssh connection request to the target host and time out if it cannot connect to the target host after 600 seconds?
A: No. It's testing the port.
Q: Does it keep pinging the target host till it times out?
A: No. It tries to create a socket. See wait_for.py
s = socket.create_connection((host, port), connect_timeout)
Q: What could be the problem I am having?
A: It's not clear from the data available. Do not run wait_for as local_action. Make sure the host rebooted successfully.
Related
I'm deploying Helm charts using community.kubernetes.helm without trouble, but I've run into conditions where the connection is refused, and it's not clear how best to configure a retries/wait/until. Every now and then, helm can't communicate with the cluster; here's an example (DNS/IP faked) showing that the issue is as simple as not being able to connect to the cluster:
fatal: [localhost]: FAILED! => {"changed": false, "command":
"/usr/local/bin/helm --kubeconfig /var/opt/kubeconfig
--namespace=gpu-operator list --output=yaml --filter gpu-operator", "msg": "Failure when executing Helm command. Exited 1.\nstdout:
\nstderr: Error: Kubernetes cluster unreachable: Get
"https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial
tcp 192.168.1.1:443: connect: connection refused\n", "stderr": "Error:
Kubernetes cluster unreachable: Get
"https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial
tcp 192.168.1.1:443: connect: connection refused\n", "stderr_lines":
["Error: Kubernetes cluster unreachable: Get
"https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial
tcp 192.168.1.1:443: connect: connection refused"], "stdout": "",
"stdout_lines": []}
In my experience, try/retry works. I agree that it would be ideal to figure out why I can't connect to the service, but it would be even better to work around this with a catch-all until block that retries until it succeeds or gives up after N tries, pausing N seconds between attempts.
Here's an example of the ansible block:
- name: deploy Nvidia GPU Operator
  block:
    - name: deploy gpu operator
      community.kubernetes.helm:
        name: gpu-operator
        chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
        create_namespace: yes
        release_namespace: gpu-operator
        kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
      until: ???
      retries: 5
      delay: 3
  when: GPU_NODE is defined
I would really appreciate any suggestions/pointers.
I discovered that registering the output and then testing until a field of it is defined gets Ansible to rerun the task. The key is knowing what a successful output looks like. For helm, the module defines a status when it works correctly. So this is what you need to add:
register: _gpu_result
until: _gpu_result.status is defined
ignore_errors: true
retries: 5
delay: 3
retries/delay are up to you
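Put together, the task from the question would look roughly like this (a sketch; the names, paths, and variables are the ones from the question above):
- name: deploy Nvidia GPU Operator
  block:
    - name: deploy gpu operator
      community.kubernetes.helm:
        name: gpu-operator
        chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
        create_namespace: yes
        release_namespace: gpu-operator
        kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
      register: _gpu_result
      until: _gpu_result.status is defined
      ignore_errors: true
      retries: 5
      delay: 3
  when: GPU_NODE is defined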
I am running an Ansible playbook to restart some of our servers, but we need to sleep for 40 minutes between each server restart. If I sleep for 40 minutes in my playbook, it sleeps for a while, but then my session on the Ubuntu box in prod gets terminated and the whole script stops. Is there anything I can add to the playbook so that it keeps my session alive while the whole playbook is running?
# This will restart servers
---
- hosts: tester
  serial: "{{ num_serial }}"
  tasks:
    - name: copy files
      copy: src=conf.prod dest=/opt/process/config/conf.prod owner=goldy group=goldy
    - name: stop server
      command: sudo systemctl stop server_one.service
    - name: start server
      command: sudo systemctl start server_one.service
    - name: sleep for 40 minutes
      pause: minutes=40
I want to sleep for 40 minutes without my Linux session being terminated, and then move on to restarting the next set of servers.
I am running Ansible version 2.6.3.
You can run your ansible script inside screen in order to keep the session alive even after disconnection.
Basically what you want to do is ssh into the production server, run screen, then execute the playbook inside the newly created session.
If you ever get disconnected, you can connect back to the server, then run screen -r to get back into your saved session.
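For example, the workflow could look like this (the user, host, session name, and playbook name are all placeholders):
ssh user@prod-box                # placeholder user/host
screen -S restart-session        # start a named screen session
ansible-playbook restart.yml     # placeholder playbook, run inside screen
# detach with Ctrl-a d; after a disconnect, reattach with:
screen -r restart-session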
I have been trying to restart the network service in fire-and-forget mode, because I will obviously get disconnected from the SSH connection after I restart the network on the VM, so I wanted a timeout-based process to handle that.
To do that, I put this inside my restart-networking tasks:
- name: Restart network
  become: true
  service: name=network state=restarted
  async: 1000
  poll: 0
When Ansible gets to this point I get this error:
fatal: [build]: FAILED! => {"failed": true, "msg": "async mode is not supported with the service module"}
I found that this is an Ansible bug that is not yet fixed in a released version; the fix is still only in the development branch, which I don't want to use because that could also mean more bugs in Ansible.
So, in my opinion, I have two options: either wait for the new Ansible release with the bug fix, or change to async: 0 and poll: 0 so the task waits for the service restart to finish (which it never will) and press CTRL+C at that point to stop it manually.
I don't want to go down either of those routes because they are not very practical for me, so I was wondering whether there is a better solution at this point.
Try this as a temporary workaround:
- name: Restart network
  become: yes
  shell: sleep 2 && service network restart
  async: 1
  poll: 0
And don't forget to wait_for port 22 after this task to avoid a host unreachable error.
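A rough sketch of such a follow-up task (the delegation and the address lookup are my assumptions, not part of the original answer):
- name: wait for SSH to come back after the network restart
  wait_for:
    host: "{{ ansible_host | default(inventory_hostname) }}"   # assumption: probe the target's inventory address
    port: 22
    delay: 10
    timeout: 300
  delegate_to: localhost
  become: false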
I want to test if the host I am provisioning can reach a specific server and connect to a specific TCP port. If it can't the playbook should fail.
How can I do that?
There is the wait_for module for this.
To check that target.host can access remote.host:8080:
- hosts: target.host
  tasks:
    - wait_for: host=remote.host port=8080 timeout=1
    - debug: msg=ok
There are a lot of other examples in the documentation.
Using wait_for is fine; however, it requires that the service is actually running and answering.
If you just want to check whether the port is open in your firewall, you can use curl.
- name: Check if host is reachable
  shell:
    cmd: "/usr/bin/curl --connect-timeout 10 --silent --show-error remote.host:8080"
    warn: no
    executable: /bin/bash
  register: res
  failed_when: res.rc in [28] or res.stderr is search("No route to host")
When the port is open but the service is not running, you get curl: (7) Failed connect to 10.192.147.224:27019; Connection refused, which you would consider OK.
A connection blocked by a firewall will return curl: (28) Connection timed out after 10001 milliseconds.
I was wondering whether it is possible to tell Ansible to set up a VPN connection before executing the rest of the playbook. I've googled around but haven't seen much on this.
You could combine a local playbook that sets up a VPN with the playbook that runs your tasks against the server.
Depending on the job, you can use Ansible or a shell script to connect the VPN. There could also be another playbook to disconnect afterwards.
As a result you will have three playbooks and one to combine them via include:
- include: connect_vpn.yml
- include: do_stuff.yml
- include: disconnect_vpn.yml
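For example, connect_vpn.yml could be a local play along these lines (a sketch only; it assumes an OpenVPN client and a profile at ./client.ovpn, and the address to probe is a placeholder):
# connect_vpn.yml -- sketch; assumes OpenVPN is installed locally and ./client.ovpn exists
- hosts: localhost
  connection: local
  become: true
  tasks:
    - name: bring the VPN up in the background
      command: openvpn --config ./client.ovpn --daemon

    - name: wait until a host behind the VPN answers on SSH
      wait_for:
        host: 10.0.0.1        # placeholder address inside the VPN
        port: 22
        timeout: 60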
Check How To Use Ansible and Tinc VPN to Secure Your Server Infrastructure.
Basically, you need to install the thisismitch/ansible-tinc playbook and create a hosts inventory file with the nodes you want to include in the VPN, for example:
[vpn]
prod01 vpn_ip=10.0.0.1 ansible_host=162.243.125.98
prod02 vpn_ip=10.0.0.2 ansible_host=162.243.243.235
prod03 vpn_ip=10.0.0.3 ansible_host=162.243.249.86
prod04 vpn_ip=10.0.0.4 ansible_host=162.243.252.151
[removevpn]
Then you should review the contents of the /group_vars/all file such as:
---
netname: nyc3
physical_ip: "{{ ansible_eth1.ipv4.address }}"
vpn_interface: tun0
vpn_netmask: 255.255.255.0
vpn_subnet_cidr_netmask: 32
where:
physical_ip is the IP address you want tinc to bind to;
vpn_netmask is the netmask that will be applied to the VPN interface.
If you're using Amazon Web Services, check out the ec2_vpc_vpn module, which can create, modify, and delete VPN connections. It uses the boto3/botocore libraries.
For example:
- name: create a VPN connection
  ec2_vpc_vpn:
    state: present
    vpn_gateway_id: vgw-XXXXXXXX
    customer_gateway_id: cgw-XXXXXXXX

- name: delete a connection
  ec2_vpc_vpn:
    vpn_connection_id: vpn-XXXXXXXX
    state: absent
For other cloud services, check the list of Ansible Cloud Modules.
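As a rough sketch, you can also register the result of the create task and reuse the returned connection id for the teardown (the variable name _vpn is a placeholder of mine; this assumes the module's result includes vpn_connection_id, as its return values suggest):
- name: create a VPN connection and remember its id
  ec2_vpc_vpn:
    state: present
    vpn_gateway_id: vgw-XXXXXXXX
    customer_gateway_id: cgw-XXXXXXXX
  register: _vpn    # placeholder variable name

- name: delete the same connection later
  ec2_vpc_vpn:
    vpn_connection_id: "{{ _vpn.vpn_connection_id }}"
    state: absent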