I created a role to set up our new servers but am running into one issue. The play triggers a Python script that submits information about the server to our API. The script eventually triggers a job from the API, and that job reboots the server. The play does not end until the Python script completes, but Ansible loses the connection during the reboot (since the play itself didn't initiate it), and the playbook fails. Here is what I have already tried.
- name: Run setup.py
  command: "{{ run_setup_py }} --username {{ username }} --password {{ password }} --ip {{ ansible_host }} --hostname {{ host_name }}"
  async: 1800
  poll: 60
This fails once the async timeout is reached; it appears Ansible never recognizes that the script completed, and the task fails. I attempted a few other async variants, such as:
- name: Run setup.py
  command: "{{ run_setup_py }} --username {{ username }} --password {{ password }} --ip {{ ansible_host }} --hostname {{ host_name }}"
  async: 600
  poll: 0
  register: run_setup

- name: check on async task
  async_status:
    jid: "{{ run_setup.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 1000
  delay: 450
Neither of the following worked either. For some reason, wait_for_connection at the play level skipped the Python script entirely and caused later plays to fail.
- name: Wait until remote system is reachable
  wait_for_connection:
    delay: 180
    sleep: 15
  delegate_to: localhost

- name: Wait until remote system is reachable
  wait_for_connection:
    delay: 180
    sleep: 15
I attempted adding ignore_unreachable: yes at the playbook level. Ansible attempted to reconnect immediately but failed because the server was still in POST.
The script runs perfectly when invoked directly on the remote host, so it isn't an issue with the script itself. The remaining steps of our setup cannot run until after the script has finished.
At this point, any answer as to how to maintain Ansible's connection across the reboot would be greatly appreciated. Ideally it wouldn't waste time, e.g. by checking the connection repeatedly rather than pausing for a fixed period.
My apologies if any information is missing or confusing; I've only been using Ansible for about a month. I'm currently on ansible-core 2.12.
You need wait_for instead of wait_for_connection; it runs locally:
- name: Run setup.py
  command: "{{ run_setup_py }} --username {{ username }} --password {{ password }} --ip {{ ansible_host }} --hostname {{ host_name }}"

- name: Wait for the reboot and reconnect
  wait_for:
    port: 22
    host: '{{ (ansible_ssh_host|default(ansible_host))|default(inventory_hostname) }}'
    search_regex: OpenSSH
    delay: 10
    timeout: 60
  connection: local
- name: Check the Uptime of the servers
  shell: "uptime"
  register: Uptime

- name: Show uptime
  debug:
    var: Uptime
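If the play still drops while setup.py is in flight (since the API job reboots the server before the command returns), one variation is to fire the script off asynchronously so nothing holds the SSH connection during the reboot, and only then wait for SSH to go down and come back. A minimal sketch reusing the variables from the question; the async, delay, and timeout values are illustrative:

- name: Run setup.py (fire and forget)
  command: "{{ run_setup_py }} --username {{ username }} --password {{ password }} --ip {{ ansible_host }} --hostname {{ host_name }}"
  async: 1800  # illustrative; allow up to 30 minutes for the script and API job
  poll: 0      # return immediately; don't hold the SSH connection open

- name: Wait for SSH to go down (reboot started)
  wait_for:
    port: 22
    host: "{{ ansible_host }}"
    state: stopped
    timeout: 600  # illustrative
  connection: local

- name: Wait for SSH to come back
  wait_for:
    port: 22
    host: "{{ ansible_host }}"
    search_regex: OpenSSH
    delay: 10
    timeout: 1800  # illustrative
  connection: local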
Greetings for the day,
I was trying to install CyberPanel using Ansible by writing a playbook. The playbook was this:
---
- name: Installing cyberpanel
  hosts: ansible_client
  user: ubuntu
  become: yes
  become_user: root
  become_method: sudo
  tasks:
    - name: Installing screen
      apt:
        name: screen
        state: present

    - name: Download the script
      get_url:
        url: https://cyberpanel.net/install.sh
        dest: /root/installer.sh

    - name: Execute the script
      become: yes
      become_method: su
      become_user: root
      become_exe: sudo su -
      expect:
        command: >-
          screen -S cyberpanel-installation
          sh installer.sh
        echo: yes
        responses:
          (.*) Please enter the number(>*): "1"
          'Full installation \[Y/n\]:': "Y"
          (.*) Remote MySQL(.*): "N"
          (.*)Enter specific version such as:(.*): ""
          (.*)Choose(.*)password(.*): "r"
          'Please select \[Y/n\]:': "Y"
          (.*)Please type Yes or no(.*): "no"
          'Would you like to restart your server now? \[y/N\]:': "y"
      async: 1800
      poll: 5
      register: output

    - name: 'Checking the status'
      async_status:
        jid: "{{ output.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 150
      delay: 60

    - name: debugging
      debug:
        var: output
The playbook doesn't have any errors or conflicts.
The installation itself works: because the playbook runs the installer inside a screen session, the session stays detached on the destination server, and after attaching to it there (once the playbook has stopped executing) we can see the installation in progress; it completes successfully within 20-30 minutes.
The issue is that the playbook stops execution after about 1 minute with a return code (rc) of 0.
As you can see, I am using the async method with both poll=0 and poll>0 for the long-running execution of the script. It is not working; the playbook still times out.
I also increased the SSH timeout to check whether any SSH timeout was taking place, and there is none.
I also tried using the timeout attribute instead of the async method; that didn't work for me either.
Anybody with a helping hand is well appreciated.
I have a users.yaml file with information regarding 400+ users. I need Ansible to create these users during provisioning. I tried with the async keyword (if that's the right word to use; tell me if I'm wrong) and poll: 15, but it takes ~10 minutes.
- name: Add FTP users asynchronously
  ansible.builtin.user:
    name: "{{ item.name }}"
    home: "{{ item.home }}"
    shell: /sbin/nologin
    groups: ftp-users
    create_home: yes
    append: no
  loop: "{{ ftp_users }}"
  async: 60
  poll: 15
  tags: users
I also tried using poll: 0, but many users aren't created.
Your current use of async fits the single long-running task use case, where you want to minimize the chance of the connection being dropped because of a timeout. You are asking Ansible to start a job, disconnect from the target, and then reconnect every 15 seconds to check whether the job is done (until you reach the 60-second timeout). Nothing is launched in parallel: the next iteration of the loop only starts once the current one is done.
What you want to do instead is run those tasks in parallel as fast as possible and then check back later whether they are done. In this case, you have to use poll: 0 on your task and later check for completion with the async_status module, as described in the ansible async guide. Note that you also need to clean up the async job cache, as Ansible will not do it automagically for you in that case.
In your case, this would give:
- name: Add FTP users asynchronously
  ansible.builtin.user:
    name: "{{ item.name }}"
    home: "{{ item.home }}"
    shell: /sbin/nologin
    groups: ftp-users
    create_home: yes
    append: no
  loop: "{{ ftp_users }}"
  async: 60
  poll: 0
  register: add_user

- name: Wait until all commands are done
  async_status:
    jid: "{{ item.ansible_job_id }}"
  register: async_poll_result
  until: async_poll_result.finished
  retries: 60
  delay: 1
  loop: "{{ add_user.results }}"

- name: clean async job cache
  async_status:
    jid: "{{ item.ansible_job_id }}"
    mode: cleanup
  loop: "{{ add_user.results }}"
That said, although this is a direct answer on how to use async for parallel jobs, I'm not entirely sure it will fix your actual performance problem, which could stem from other issues (slow DNS, slow network, pipelining not enabled where it could be, SSH master connection not configured...).
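For reference, the connection tuning mentioned above lives in ansible.cfg; a minimal sketch, with illustrative values rather than recommendations:

# ansible.cfg (sketch; values are illustrative)
[ssh_connection]
# fewer SSH round trips per task
pipelining = True
# reuse one SSH connection per host instead of reconnecting for every task
ssh_args = -o ControlMaster=auto -o ControlPersist=60s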
I have a 3-node Kubernetes cluster running on Ubuntu 20.04 LTS KVM guests, and the KVM host is also Ubuntu 20.04 LTS. I ran the playbooks on the KVM host.
I have the following inventory extract:
nodes:
  hosts:
    sea_r:
      ansible_host: 192.168.122.60
    spring_r:
      ansible_host: 192.168.122.92
    island_r:
      ansible_host: 192.168.122.93
  vars:
    ansible_user: root
and I have been trying a lot with async_status, but it always fails:
- name: root commands
  hosts: nodes
  tasks:
    - name: bash commands
      ansible.builtin.shell: |
        apt update
      args:
        chdir: /root
        executable: /bin/bash
      async: 2000
      poll: 2
      register: output

    - name: check progress
      ansible.builtin.async_status:
        jid: "{{ output.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 200
      delay: 5
with this error:
fatal: [sea_r]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'ansible_job_id' ...
If instead I try the following,
- name: root commands
  hosts: nodes
  tasks:
    - name: bash commands
      ansible.builtin.shell: |
        apt update
      args:
        chdir: /root
        executable: /bin/bash
      async: 2000
      poll: 2
      register: output

    - debug: msg="{{ output.stdout_lines }}"
    - debug: msg="{{ output.stderr_lines }}"
I get no errors.
I also tried the following variation,
- name: check progress
  ansible.builtin.async_status:
    jid: "{{ item.ansible_job_id }}"
  with_items: "{{ output }}"
  register: job_result
  until: job_result.finished
  retries: 200
  delay: 5
which was suggested as a solution to a similar error. That also does not help; I just get a slightly different error:
fatal: [sea_r]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'ansible.utils.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'ansible_job_id' ...
At the beginning and the end of the playbook, I resume and pause my 3 KVM server nodes like so:
- name: resume vms
  hosts: local_vm_ctl
  tasks:
    - name: resume vm servers
      shell: |
        virsh resume kub3
        virsh resume kub2
        virsh resume kub1
        virsh list --state-paused --state-running
      args:
        chdir: /home/bi
        executable: /bin/bash
      environment:
        LIBVIRT_DEFAULT_URI: qemu:///system
      register: output

    - debug: msg="{{ output.stdout_lines }}"
    - debug: msg="{{ output.stderr_lines }}"
and like so:
- name: pause vms
  hosts: local_vm_ctl
  tasks:
    - name: suspend vm servers
      shell: |
        virsh suspend kub3
        virsh suspend kub2
        virsh suspend kub1
        virsh list --state-paused --state-running
      args:
        chdir: /home/bi
        executable: /bin/bash
      environment:
        LIBVIRT_DEFAULT_URI: qemu:///system
      register: output

    - debug: msg="{{ output.stdout_lines }}"
    - debug: msg="{{ output.stderr_lines }}"
but I don't see how these plays could have anything to do with said error.
Any help will be much appreciated.
You get an undefined error for your job id because:
- You use poll: X on your initial task, so Ansible connects every X seconds to check whether the task is finished.
- When Ansible exits that task and enters your next async_status task, the job is already done. And since you used a non-zero poll value, the async status cache was automatically cleared.
- Since the cache was cleared, the job id does not exist anymore.
Your above scenario is meant to avoid timeouts with your target on long-running tasks, not to run tasks concurrently and check on their status later. For that second requirement, you need to run the async task with poll: 0 and clean up the cache yourself.
See the documentation for more explanation on the above concepts:
ansible async guide
ansible async_status module
I made an example with your above task and fixed it to use the dedicated apt module (note that you could add a name option to the module with one package or a list of packages, and Ansible would do both the cache update and the install in a single step). Also, retries * delay on the async_status task should be equal to or greater than async on the initial task if you want to make sure you won't miss the end.
- name: Update apt cache
  ansible.builtin.apt:
    update_cache: true
  async: 2000
  poll: 0
  register: output

- name: check progress
  ansible.builtin.async_status:
    jid: "{{ output.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 400
  delay: 5

- name: clean async job cache
  ansible.builtin.async_status:
    jid: "{{ output.ansible_job_id }}"
    mode: cleanup
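For reference, the single-step variant mentioned above (cache update plus install in one task) could look like this; "screen" is just an example package name:

- name: Update apt cache and install packages
  ansible.builtin.apt:
    name: screen  # a single package or a list; example value
    update_cache: true
    state: present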
This is more useful for launching a bunch of long-lasting tasks in parallel. Here is a useless yet functional example:
- name: launch some loooooong tasks
  shell: "{{ item }}"
  loop:
    - sleep 30
    - sleep 20
    - sleep 35
  async: 100
  poll: 0
  register: long_cmd

- name: wait until all commands are done
  async_status:
    jid: "{{ item.ansible_job_id }}"
  register: async_poll_result
  until: async_poll_result.finished
  retries: 50
  delay: 2
  loop: "{{ long_cmd.results }}"

- name: clean async job cache
  async_status:
    jid: "{{ item.ansible_job_id }}"
    mode: cleanup
  loop: "{{ long_cmd.results }}"
You have poll: 2 on your task, which tells Ansible to internally poll the async job every 2 seconds and return the final status in the registered variable. In order to use async_status you should set poll: 0 so that the task does not wait for the job to finish.
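Concretely, the only change needed in the initial task is the poll value; a sketch based on the task from the question:

- name: bash commands
  ansible.builtin.shell: |
    apt update
  args:
    chdir: /root
    executable: /bin/bash
  async: 2000
  poll: 0  # fire and forget; the async_status task does the polling
  register: output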
I am using the following Ansible playbook to shut down a list of remote Ubuntu hosts all at once:
- hosts: my_hosts
  become: yes
  remote_user: my_user
  tasks:
    - name: Confirm shutdown
      pause:
        prompt: >-
          Do you really want to shutdown machine(s) "{{ play_hosts }}"? Press
          Enter to continue or Ctrl+C, then A, then Enter to abort ...

    - name: Cancel existing shutdown calls
      command: /sbin/shutdown -c
      ignore_errors: yes

    - name: Shutdown machine
      command: /sbin/shutdown -h now
Two questions on this:
Is there any module available which can handle the shutdown in a more elegant way than having to run two custom commands?
Is there any way to check that the machines are really down? Or is it an anti-pattern to check this from the same playbook?
I tried something with the net_ping module, but I am not sure whether this is its real purpose:
- name: Check that machine is down
  become: no
  net_ping:
    dest: "{{ ansible_host }}"
    count: 5
    state: absent
This, however, fails with
FAILED! => {"changed": false, "msg": "invalid connection specified, expected connection=local, got ssh"}
In more restricted environments where ping messages are blocked, you can listen on the SSH port until it goes down. In my case I have set the timeout to 60 seconds.
- name: Save target host IP
  set_fact:
    target_host: "{{ ansible_host }}"

- name: wait for ssh to stop
  wait_for:
    port: 22
    host: "{{ target_host }}"
    delay: 10
    state: stopped
    timeout: 60
  delegate_to: 127.0.0.1
There is no shutdown module. You can use a single fire-and-forget call:
- name: Shutdown server
  become: yes
  shell: sleep 2 && /sbin/shutdown -c && /sbin/shutdown -h now
  async: 1
  poll: 0
As for net_ping, it is for network appliances such as switches and routers. If you rely on ICMP messages to test the shutdown process, you can use something like this:
- name: Store actual host to be used with local_action
  set_fact:
    original_host: "{{ ansible_host }}"

- name: Wait for ping loss
  local_action: shell ping -q -c 1 -W 1 {{ original_host }}
  register: res
  retries: 5
  until: ('100.0% packet loss' in res.stdout)
  failed_when: ('100.0% packet loss' not in res.stdout)
  changed_when: no
This will wait for 100% packet loss or fail after 5 retries.
Here you want to use local_action, because otherwise the commands are executed on the remote host (which is supposed to be down).
And you need the trick of storing ansible_host in a temporary fact, because ansible_host is replaced with 127.0.0.1 when the task is delegated to the local host.
I'm provisioning a new server via Terraform and using Ansible as the provisioner on my local system.
Terraform provisions a system on EC2, and then it runs the Ansible playbook providing the IP of the newly built system as the inventory.
I want to use Ansible to wait for the system to finish booting and prevent further tasks from being attempted until a connection can be established. Up until this point I have been using a manual pause, which is inconvenient and imprecise.
Ansible doesn't seem to do what the documentation says it will (unless I'm wrong, a very possible scenario). Here's my code:
- name: waiting for server to be alive
  wait_for:
    state: started
    port: 22
    host: "{{ ansible_ssh_host | default(inventory_hostname) }}"
    delay: 10
    timeout: 300
    connect_timeout: 300
    search_regex: OpenSSH
  delegate_to: localhost
What happens in this step is that the task doesn't wait any more than 10 seconds to make the connection, and it fails. If the server has already booted and I run the playbook again, it works fine and performs as expected.
I've also tried do_until-style loops, which never seem to work. All examples given in the documentation use shell output, and I don't see how that approach would work for non-shell modules.
I also can't seem to get any debug information if I try to register a result and print it out using the debug module.
Anyone have any suggestions as to what I'm doing wrong?
When you use delegate_to or local_action, {{ ansible_ssh_host }} resolves to localhost, so your task is always running with the following parameter:
host: localhost
It waits for 10 seconds, checks the SSH connection to the local host, and proceeds (because that connection is most likely open).
If you use gather_facts: false (which I believe you do), you can add a set_fact task beforehand to store the target host value in a variable:
- set_fact:
    host_to_wait_for: "{{ ansible_ssh_host | default(inventory_hostname) }}"
and change the line to:
host: "{{ host_to_wait_for }}"
You can proof-test the variables with the following playbook:
---
- hosts: all
  gather_facts: false
  tasks:
    - set_fact:
        host_to_wait_for: "{{ ansible_ssh_host | default(inventory_hostname) }}"

    - debug: msg="ansible_ssh_host={{ ansible_ssh_host }}, inventory_hostname={{ inventory_hostname }}, host_to_wait_for={{ host_to_wait_for }}"
      delegate_to: localhost
Alternatively, you can find a way to provide the IP address of the EC2 instance to Ansible as a variable and use it as the value for the host: parameter. For example, if you run Ansible from the CLI, pass ${aws_instance.example.public_ip} in the --extra-vars argument.
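For example, from Terraform's local-exec provisioner the invocation could look roughly like this (provision.yml and target_host are illustrative names; the trailing comma turns the bare IP into a one-host ad-hoc inventory):

ansible-playbook provision.yml \
  -i "${aws_instance.example.public_ip}," \
  --extra-vars "target_host=${aws_instance.example.public_ip}"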
As techraf indicates, your inventory lookup is actually grabbing the localhost address because of the delegation, so it's not running against the correct machine.
I think your best solution might be to have Terraform pass a variable to the playbook containing the instance's IP address. Example:
terraform passes -e "new_ec2_host=<IP_ADDR>"
Ansible task:
- name: waiting for server to be alive
  wait_for:
    state: started
    port: 22
    host: "{{ new_ec2_host }}"
    delay: 10
    timeout: 300
    connect_timeout: 300
    search_regex: OpenSSH
  delegate_to: localhost