Ansible connection becomes unreachable after a while - shell

I am running a task in ansible, and the task runs a script present on a remote host.
    ---
    - hosts: remote_host
      become: yes
      gather_facts: true
      connection: ssh
      tasks:
        - name: Run a script
          shell: bash /root/script.sh
This initially establishes a connection successfully, and the task runs the script for a while, before the task fails with an "unreachable" error.
The Ansible playbook itself is triggered by a Jenkins job, and the duration of the run is passed as a Jenkins parameter.
When I pass the duration as 30 minutes, it runs throughout without interruptions.
But it fails after a while for a 1 hour duration.
Could this be an issue with maintaining the connection for that long?

It might be hitting the default connection timeout if the script stays at one step for a long time while executing.
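One way to avoid depending on a single SSH session staying open for the whole hour is to run the task asynchronously and let Ansible poll for the result. The following is only a sketch built on the playbook from the question; the `async` and `poll` values are assumptions chosen for illustration and should be sized to the real run duration.

```yaml
---
- hosts: remote_host
  become: yes
  tasks:
    - name: Run a script without holding one connection open
      shell: bash /root/script.sh
      async: 7200   # assumed upper bound on the script's runtime, in seconds
      poll: 60      # assumed interval between status checks, in seconds
```

If the drop turns out to come from an idle-connection timeout somewhere on the network path, SSH keepalives (for example `-o ServerAliveInterval=30` passed through `ansible_ssh_common_args`) may be another thing to try.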

Related

How to run a task from a playbook on a specific host?

I'm writing an Ansible playbook to manage backup and I want two different tasks:
    - name: Setup local machine for backup
      cron:
        cron_file: /etc/cron.d/backup
        hour: 4
        minute: 0
        job: /root/do_backup.sh
        state: present
        name: backup
    - name: Setup backup server for new machine
      shell:
        cmd: "mkdir /backups/{{inventory_hostname}}"
Is it possible to tell Ansible that the second task is intended to be executed on another machine in my inventory?
I don't want a dedicated playbook, because some later tasks should be executed after the task on backup server.
I'm answering my own question: task delegation is what I'm looking for:
    - name: Setup backup server for new machine
      delegate_to: backup-server
      shell:
        cmd: "mkdir /backups/{{inventory_hostname}}"
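As a possible variation (not part of the original answer), the same delegated task can be written with the `file` module instead of `mkdir`, so that a second run does not fail when the directory already exists. This is a sketch that keeps the `backup-server` host name used above.

```yaml
- name: Setup backup server for new machine
  delegate_to: backup-server
  file:
    # state: directory creates the path if missing and is a no-op otherwise
    path: "/backups/{{ inventory_hostname }}"
    state: directory
```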

Ansible is getting stuck on some server which is in bad ps state?

I have an Ansible playbook as shown below, and it works fine most of the time. But recently I have noticed that it gets stuck on some of the servers from the ALL group and just sits there. It doesn't even move forward to the other servers in the ALL list.
    # This will copy files
    ---
    - hosts: ALL
      serial: "{{ num_serial }}"
      tasks:
        - name: copy files
          shell: "(ssh -o StrictHostKeyChecking=no abc.com 'ls -1 /var/lib/jenkins/workspace/copy/stuff/*' | parallel -j20 'scp -o StrictHostKeyChecking=no abc.com:{} /data/records/')"
        - name: sleep for 5 sec
          pause: seconds=5
So when I started debugging, I noticed that it is getting stuck on the actual server: I can ssh (log in) fine, but when I run the ps command it just hangs and I don't get my cursor back. That means Ansible is also getting stuck executing the above scp command on that server.
So my question is: even if I have some server in that state, why doesn't Ansible just time out and move on to the next server? Is there anything we can do here so that Ansible doesn't pause everything while waiting for that server to respond?
Note that the server is up and running and I can ssh in fine, but when we run the ps command it just hangs, and because of that Ansible is also hanging.
Is there any way to run the command ps aux | grep app on all the servers in the ALL group and make a list of all the servers that executed it fine (and if it hangs on some server, time out and move on to the next server in the ALL list), and then pass that list on to my playbook above? Can we do all this in one playbook?
Ansible doesn't have this feature and it might even be dangerous to have it. My suggestion in this case would be: see the failure, rebuild the server, run again.
It's possible to build the feature you want in your playbook. What you could do is have a dummy async task that triggers the issue, and then verify the outcome of that. If the async task didn't finish in a reasonable time, use the meta: end_host task to move on to the next host.
You might need to mark some of those tasks with ignore_errors: yes.
Sorry that I cannot give you a complete answer as I've never tried to do this.
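The idea in this answer might be sketched roughly as follows. This is untested against a host in the hung state described in the question; the job timeout, retry count, and delay are assumptions, and `meta: end_host` needs a reasonably recent Ansible release. Fact gathering is switched off on the assumption that it could hang on such a host as well.

```yaml
---
- hosts: ALL
  gather_facts: false
  tasks:
    - name: Start the probe command in the background
      # "|| true" keeps a grep with no matches from counting as a failure
      shell: ps aux | grep app || true
      async: 60   # assumed maximum lifetime of the probe job, in seconds
      poll: 0     # do not wait here; check on the job in the next task
      register: probe_job

    - name: Give the probe about 30 seconds to finish
      async_status:
        jid: "{{ probe_job.ansible_job_id }}"
      register: probe_result
      until: probe_result.finished
      retries: 6
      delay: 5
      ignore_errors: true

    - name: Stop processing hosts where the probe never finished
      meta: end_host
      when: not (probe_result.finished | default(false) | bool)

    # The copy task from the question would follow here and would only
    # run on hosts that passed the probe.
```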
You can use strategies to achieve your goal. By default:
"Plays run with a linear strategy, in which all hosts will run each task before any host starts the next task"
By using the free strategy, each host will run until the end of the play as fast as it can. For example:
    ---
    - hosts: ALL
      strategy: free
      tasks:
        - name: copy files
          shell: "(ssh -o StrictHostKeyChecking=no abc.com 'ls -1 /var/lib/jenkins/workspace/copy/stuff/*' | parallel -j20 'scp -o StrictHostKeyChecking=no abc.com:{} /data/records/')"
        - name: sleep for 5 sec
          pause: seconds=5
Another option would be to use timeout to run your command, then register the result to check whether the command executed successfully or not. For example, timeout 5 sleep 10 returns 124 because of the timeout, while timeout 5 sleep 3 returns 0 because the command terminates before the timeout occurs. In an Ansible playbook, you could use something like:
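    tasks:
      # sh -c wraps the whole pipeline so that rc is 124 on a timeout
      # rather than the exit status of grep
      - shell: timeout 5 sh -c 'ps aux | grep app'
        register: result
        ignore_errors: True
      - debug:
          msg: timeout occurred
        when: result.rc == 124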
As suggested by Alassane Ndiaye, you can try the snippet below, where the copy task runs only when the check did not time out:
    tasks:
      # sh -c wraps the whole pipeline so that rc is 124 on a timeout
      - shell: timeout 5 sh -c 'ps aux | grep app'
        register: result
        ignore_errors: True
      - name: Run your shell command
        shell: "(ssh -o StrictHostKeyChecking=no abc.com 'ls -1 /var/lib/jenkins/workspace/copy/stuff/*' | parallel -j20 'scp -o StrictHostKeyChecking=no abc.com:{} /data/records/')"
        # run only when the check did not time out
        when: result.rc != 124

Ansible use different hostname if first fails

I have a number of raspberry pis that I swap out (only one running at a time) and run ansible against. Most pis respond to ping raspberrypi but I have one that responds to ping raspberrypi.local
Rather than remembering to manually ping the correct hostname before executing the playbook, is there a way in ansible to run a playbook against a different hostname if the first fails?
Currently my playbook is
    ---
    - hosts: raspberrypi
and /etc/ansible/hosts
    [raspberrypi]
    raspberrypi
    #raspberrypi.local
If I uncomment the second hostname and the first fails, then the playbook will fail and not run on the .local hostname.
I am not sure if this is directly possible in ansible.
But a hack I can think of is to create a list of hosts, store it in a variable, and ping each host from localhost. If the ping is successful, create a custom hosts group and execute the task you want to do.
Also, are you executing your playbook with serial: 1?
Hope this helps.
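The hack described in this answer could look something like the sketch below. The variable name `candidate_hosts` and the group name `reachable_pi` are made up for illustration, and the `ping -c 1 -W 2` flags assume a Linux control machine.

```yaml
---
- hosts: localhost
  gather_facts: false
  vars:
    candidate_hosts:        # hypothetical variable holding both names
      - raspberrypi
      - raspberrypi.local
  tasks:
    - name: Ping each candidate hostname once from the control machine
      command: "ping -c 1 -W 2 {{ item }}"
      loop: "{{ candidate_hosts }}"
      register: ping_results
      failed_when: false    # an unreachable name is expected, not an error
      changed_when: false

    - name: Add every name that answered to an in-memory group
      add_host:
        name: "{{ item.item }}"
        groups: reachable_pi
      loop: "{{ ping_results.results }}"
      when: item.rc == 0

- hosts: reachable_pi
  tasks:
    - name: Placeholder for the real tasks
      ping:
```

Since only one Pi is running at a time, the second play should end up targeting whichever name answered.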
You could run the play against both host groups.
    - hosts: raspberrypi:raspberrypi.local

Ansible wait_for for connecting machine to actually login

In my working environment, virtual machines are created, and login access information is added to them after creation. There can be delays, so having my Ansible script check that SSH is available is not enough; I actually need to check whether Ansible can get inside the remote machine via SSH.
Here is my old script which fails me:
    - name: wait for instances to listen on port:22
      wait_for:
        state: started
        host: "{{ item }}"
        port: 22
      # a bare variable name is treated as a literal string by current Ansible
      with_items: "{{ myservers }}"
How can I rewrite this task snippet so that it waits until the local machine can SSH into the remote machines (again, not only checking that SSH is ready on the remote, but that it can actually authenticate to it)?
This is somewhat ugly, but given your needs it might work:
    - local_action: command ssh myuser@{{ inventory_hostname }} exit
      register: log_output
      until: log_output.stdout.find("Last login") > -1
      retries: 10
      delay: 5
The first line would cause your ansible host to try to ssh into the target host and immediately issue an "exit" to return control back to ansible. Any output from that command gets stored in the log_output variable. The until clause will check the output for the string 'Last login' (you may want to change this to something else depending on your environment), and Ansible will retry this task up to 10 times with a 5 second delay between attempts.
Bruce P's answer was close to what I needed, but my ssh doesn't print any banner when running a command, so checking stdout is problematic.
    - local_action: command ssh "{{ hostname }}" exit
      register: ssh_test
      until: ssh_test.rc == 0
      retries: 25
      delay: 5
So instead I use the return code to check for success
As long as your Ansible user is already installed on the image you are using to create the new server instance, the wait_for command works well.
If that is not the case, then you need to poll the system that adds that user to the newly created instance for when you should continue - of course that system will have to have something to poll against...
The (very ugly) alternative is to put a static pause in your script that will wait the appropriate amount of time between the instance being created and the user being added like so:
    - pause: seconds=1
Try not to, though; static pauses are a bad way of solving this issue.
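Another option that none of the answers above mention is the `wait_for_connection` module, which retries until Ansible can actually connect to the host and run a module there, rather than only checking that the port is open. A minimal sketch, assuming a release that ships this module and a hypothetical inventory group named `myservers`; the timing values are placeholders.

```yaml
---
- hosts: myservers          # hypothetical inventory group
  gather_facts: false       # fact gathering would itself need a working login
  tasks:
    - name: Wait until Ansible can log in and run a module
      wait_for_connection:
        delay: 5      # seconds to wait before the first attempt
        sleep: 5      # seconds between attempts
        timeout: 300  # give up after this many seconds

    - name: Gather facts once the host is usable
      setup:
```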

Using ansible to launch a long running process on a remote host

I'm new to Ansible. I'm trying to start a process on a remote host using a very simple Ansible Playbook.
Here is what my playbook looks like:
    -
      hosts: somehost
      gather_facts: no
      user: ubuntu
      tasks:
        - name: change directory and run jetty server
          shell: cd /home/ubuntu/code; nohup ./run.sh
          async: 45
run.sh calls a java server process with a few parameters.
My understanding was that using async my process on the remote machine would continue to run even after the playbook has completed (which should happen after around 45 seconds.)
However, as soon as my playbook exits, the process started by run.sh on the remote host terminates as well.
Can anyone explain what's going on and what I am missing here?
Thanks.
I have ansible playbook to deploy my Play application. I use the shell's command substitution to achieve this and it does the trick for me. I think this is because command substitution spawns a new sub-shell instance to execute the command.
    -
      hosts: somehost
      gather_facts: no
      user: ubuntu
      tasks:
        - name: change directory and run jetty server
          # command substitution starts run.sh in a sub-shell in the background
          shell: dummy=$(nohup ./run.sh &)
          args:
            chdir: /home/ubuntu/code
Give async a longer time, say 6 months or a year or even more, and this should be fine.
Or convert this process to an initscript and use the service module.
and add poll: 0
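Taken together, the two suggestions above (a generous async value plus poll: 0) might look like the following sketch based on the playbook from the question. The one-year figure is only an illustrative upper bound.

```yaml
-
  hosts: somehost
  gather_facts: no
  user: ubuntu
  tasks:
    - name: change directory and run jetty server
      shell: nohup ./run.sh
      args:
        chdir: /home/ubuntu/code
      async: 31536000  # about one year in seconds; an illustrative upper bound
      poll: 0          # fire and forget: do not wait for the job to finish
```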
I'd concur. Since it's long running, I'd call it a service and run it like so. Just create an init.d script, push that out with a 'copy' then run the service.
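The init script route could be sketched like this. The service name `myapp` and the local file `files/myapp.init` are hypothetical, and the init script itself still has to be written separately.

```yaml
-
  hosts: somehost
  gather_facts: no
  user: ubuntu
  become: yes               # writing to /etc/init.d needs root
  tasks:
    - name: Push the init script to the remote host
      copy:
        src: files/myapp.init    # hypothetical local init script
        dest: /etc/init.d/myapp  # hypothetical service name
        mode: "0755"

    - name: Start the service and enable it at boot
      service:
        name: myapp
        state: started
        enabled: yes
```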

Resources