Ansible retry: get attempt number

I am using a task that connects via SSH to a device. Latency is not always constant, and sometimes, when the prompt is not displayed in time, the task fails.
Assuming it is possible to control the timeout value for this task, is it possible to dynamically increase this timeout in proportion to the number of the attempt being performed?
Something like this:

- name: task_name
  connection: local
  task_module:
    args...
    timeout: "{{ 10 * attempt_number }}"
  retries: 3
  delay: 2
  register: result
  until: result is success

I don't think it's possible to get the current attempt number while the task is running, and it's quite unclear why you're trying to achieve such a thing.
Can you elaborate a little more?

Yes, it's possible; here are the docs.
When you run a task with until and register the result as a variable, the registered variable will include a key called "attempts", which records the number of retries for the task.
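For illustration, a minimal sketch of that behavior (the wait_for module and port 22 are my placeholder assumptions, not from the question):

# Sketch only: any module that can fail transiently works here; wait_for
# and port 22 are placeholder assumptions.
- name: Flaky connectivity check with retries
  ansible.builtin.wait_for:
    host: "{{ inventory_hostname }}"
    port: 22
    timeout: 10
  connection: local
  register: result
  retries: 3
  delay: 2
  until: result is success

- name: Show how many attempts the task needed
  ansible.builtin.debug:
    msg: "Task succeeded after {{ result.attempts }} attempt(s)"

Note, though, that as far as I know module arguments are templated once per task, not once per retry, so result.attempts cannot grow the timeout within a single task. One way to get a real per-attempt attempt_number is a recursive include_tasks loop, sketched below under that assumption (the file name retry_with_backoff.yml and the 3-attempt cap are made up):

# retry_with_backoff.yml - hypothetical helper file, invoked with attempt_number
- name: Try the operation, scaling the timeout with the attempt number
  ansible.builtin.wait_for:
    host: "{{ inventory_hostname }}"
    port: 22
    timeout: "{{ 10 * (attempt_number | int) }}"   # 10s, then 20s, then 30s
  connection: local
  register: result
  ignore_errors: true

- name: Recurse with the next attempt number if the attempt failed
  ansible.builtin.include_tasks: retry_with_backoff.yml
  vars:
    attempt_number: "{{ (attempt_number | int) + 1 }}"
  when: result is failed and (attempt_number | int) < 3

The loop would be started with a single include_tasks of that file with attempt_number: 1.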

Related

Until loop module fails with "dict object has no attribute state"

Here is my Ansible task:

- name: wait until response has key word "PIPELINE_STATE_SUCCEEDED"
  uri:
    url: https://abcd.com/response
    method: GET
  register: ABCD
  until: ABCD.json.state == "PIPELINE_STATE_SUCCEEDED"
  retries: 30
  delay: 600
When I run this script (the retries and delay add up to 300 minutes for the task to pass), after a few retries the script suddenly emits the error message below and breaks:
'dict object' has no attribute 'state'
I also tried decreasing the delay and increasing the retries, but the problem is the same. I have several other tasks in the same playbook that use a similar module, except that the delay is significantly smaller in those (their retries and delay add up to around 30 minutes).
Any idea why this could be happening?
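No accepted fix is recorded here, but the error itself is suggestive: until is templated after every attempt, so a single response without a JSON body or without a state key (an error page during a transient outage, say) makes ABCD.json.state fail hard instead of simply counting as a not-yet-successful attempt. A defensive condition, my assumption rather than a confirmed answer, would look like this:

# Sketch only: guard each attribute access so a malformed response counts
# as "not succeeded yet" instead of a hard templating error.
- name: wait until response has key word "PIPELINE_STATE_SUCCEEDED"
  uri:
    url: https://abcd.com/response
    method: GET
  register: ABCD
  until: (ABCD.json | default({})).state | default('') == "PIPELINE_STATE_SUCCEEDED"
  retries: 30
  delay: 600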

Store Timestamp As A Constant Value

I'm trying to save the timestamp of when a playbook runs in a variable.
I plan to use this variable across the playbook, but I'm facing issues, particularly since
the lookup plugin runs and generates a new value each time it is referenced.

module_defaults:
  group/ns.col.session:
    sid: "{{ lookup('pipe', 'date \"+%Y-%m-%d-%H%M\"') }}"

The value is looked up at the time that it is invoked.
I could use set_fact:, but it only works inside of the tasks: block, and I'd like to set the value before any task can run, i.e. right after hosts:.
- hosts:
    - localhost
  module_defaults:
    group/ns.col.session:
      sid: .............
How do I achieve this WITHOUT using set_fact and WITHOUT using lookup()?
In other words, how do I save or copy the value of a lookup into some variable?
I've already reviewed Constant Date and Time,
but the solutions proposed there do not satisfy my constraints.
The whole reason behind this is to not have any task run anywhere, nor to modify the playbook to run a task on the Ansible controller node alone just to look up the timestamp, but rather to have the value preserved in a variable at the very beginning.
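To make the lazy-evaluation behaviour concrete, here is a small self-contained illustration (my own, not from the question; seconds are added to the date format so the drift is visible):

# Demonstrates the problem described above: a var holding a lookup is
# re-evaluated at every reference, so the two debug tasks can print
# different timestamps.
- hosts: localhost
  gather_facts: false
  vars:
    sid: "{{ lookup('pipe', 'date +%Y-%m-%d-%H%M%S') }}"
  tasks:
    - name: First reference to sid
      ansible.builtin.debug:
        var: sid
    - name: Second reference to sid, possibly a different value
      ansible.builtin.debug:
        var: sid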

Ansible WinRM The maximum number of concurrent operations for this user has been exceeded

We are using Ansible playbooks to automate long-running scripts on many systems within our network. Some of those systems are Windows 10, while the others are Windows 7. The long-running operations are launched using the async mechanism, and the async_status module is used to poll the results of the tasks every 30 seconds.
- name: Long running operation
  win_command:
    cmd: cmd
    _raw_params: python long_running_script.py
  async: 2140000
  poll: 0
  register: async_sleeper

- name: Status poll
  async_status:
    jid: "{{ async_sleeper.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 100001
  delay: 30
The Windows 10 systems have the following default configuration for WinRM:

MaxConcurrentOperations = 4294967295
MaxConcurrentOperationsPerUser = 1500

Every ~12.5 hours the playbook errors out with "maximum number of concurrent operations for this user has been exceeded", and this corresponds neatly to the quota times our poll interval: 1500 polls × 30 seconds ≈ 12.5 hours.
But clearly async_status is not a concurrent operation. It is supposed to be a short-lived check of whether the process is still running, and it should exit afterwards. So at any given point the number of concurrent operations should not exceed 2. Task Manager on the client machine does not show any lingering processes. So what is happening? Does "concurrent operations" refer to a count of operations rather than to actual concurrency? We know we can increase the quota, but we do not want to do that on production systems without getting to the root of this problem.
It would help to know:
What does "concurrent operation" really mean?
What is the industry best practice for overcoming this problem?
What has changed in Windows 10 such that this error is not seen in the other version of the OS?
We ran some experiments, and it turns out that the value of MaxConcurrentOperationsPerUser is indeed a counter and does not have to be "concurrent".
This behavior differs between Windows 7, where it behaves as its name implies, and Windows 10, where it is a counter.
So if we set the value to 30 and poll the status of a long-running operation every 30 seconds, the operation will error out in 15 minutes (30 operations × 30 seconds).
This issue may be addressed or fixed in the future, but we are leaving it here for anyone else who might face it.
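The quota increase the question mentions but does not show could look roughly like this (an assumed sketch: the value 8192 is arbitrary, and the WSMan: drive path should be verified on the target OS before relying on it):

# Assumed sketch of raising the per-user quota; not from the original answer.
# Changing WinRM config over a WinRM connection is risky: a service restart,
# if required, would drop the very connection Ansible is using.
- name: Raise the per-user WinRM operation quota
  ansible.windows.win_shell: >-
    Set-Item -Path WSMan:\localhost\Service\MaxConcurrentOperationsPerUser
    -Value 8192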

Using a synchronized counter variable in Ansible

My playbook creates multiple instances of an application in AWS. I want each instance to be tagged with a counter variable, to maintain the count and ID of each instance (I do not want to use the instance ID or any other random ID). Since the provisioning happens in parallel, I am failing to get a consistent counter variable.
I have tried using a variable global to the play and incrementing it, but it always returns the initial value, as set_fact is executed only once.
I have also tried putting a variable in a file, then reading and incrementing it for every host. This leads to a race condition, and I see the same values for different hosts. Is there any way to do this?
Assuming that your ec2.ini file has

all_instances = True

to get stopped instances, they already ARE tagged, in a sense:
webserver[1] is always going to be the same host until your inventory changes.
However, you can still tag your instances as you want; but if your inventory changes, it might be difficult to tag new instances with unique numbers.
- name: Loop over webserver instances and tag sequentially
  ec2_tag:
    state: present
    tags:
      myTag: "webserver{{ item }}"
    resource: "{{ hostvars[groups['webserver'][item | int]]['ec2_id'] }}"
  with_sequence: start=0 end="{{ groups['webserver'] | length - 1 }}"
  delegate_to: localhost

N.B.: item is a string, so we have to use [item | int] when indexing into the groups['webserver'] array.
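A related variation (my addition, not part of the answer above): since webserver[1] is stable until the inventory changes, each host can compute its own index from its position in the group, avoiding any shared counter at all. The ec2_id variable is assumed to come from the same dynamic inventory as in the loop above:

# Assumed sketch: every host derives a stable counter from its position in
# the group, so no shared state or file locking is needed.
- name: Tag each instance with its position in the webserver group
  ec2_tag:
    state: present
    resource: "{{ ec2_id }}"
    tags:
      myTag: "webserver{{ groups['webserver'].index(inventory_hostname) }}"
  delegate_to: localhost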

Job with multiple tasks on different servers

I need to have a job with multiple tasks that run on different machines, one after another (not simultaneously). While the current job is running, another identical job can arrive in the queue, but it should not be started until the previous one has finished. So I came up with this 'solution', which might not be the best, but it gets the job done :). I just have one problem.
I figured out I would need a JobQueue (either MongoDB or Redis) with the following structure:

{
    hostname: 'host where to execute the task',
    running: FALSE,
    task: 'current task number',
    tasks: [
        {task_id: 1, commands: 'run these commands', hostname: 'aaa'},
        {task_id: 2, commands: 'another command', hostname: 'bbb'}
    ]
}
Hosts:
search for jobs with the same hostname and running == FALSE
execute the task that is set in that job
upon finishing, the host sets running = FALSE, checks whether there are any other tasks to perform, increments the task number, and sets the hostname to the machine of the next task
Because jobs can accumulate, imagine a situation where the jobs queued for one host look like this: A, B, A.
Since I have to run all the jobs for the specified machine, how do I make sure the third job (the second A) is not started while the first A is still running?
{
    _id: ObjectId("xxxx"),  // unique, generated by MongoDB, indexed, sortable
    hostname: 'host where to execute the task',
    running: FALSE,
    task: 'current task number',
    tasks: [
        {task_id: 1, commands: 'run these commands', hostname: 'aaa'},
        {task_id: 2, commands: 'another command', hostname: 'bbb'}
    ]
}
The question is how the next available "worker" knows whether it is safe to start the next job on a particular host.
You probably need some sort of sortable (indexed) field to indicate the arrival order of the jobs. If you are using MongoDB, you can let it generate _id, which will already be unique, indexed, and in time order, since its first four bytes are a timestamp.
You can now query to see if there is a job to run for a particular host like so:
// pseudo code - shell syntax, not actual code
var jobToRun = db.queue.findOne({hostname: <myHostName>}, {}, {sort: {_id: 1}});
if (jobToRun.running == FALSE) {
    // try to atomically "lock" the job: only succeeds if running is still FALSE
    var myJob = db.queue.findAndModify({
        query: {_id: jobToRun._id, running: FALSE},
        update: {$set: {running: TRUE}}
    });
    if (myJob == null) {
        print("Someone else already grabbed it");
    } else {
        /* now we know that we updated this and we can run it */
    }
} else {
    /* sleep and try again */
}
What this does is check for the oldest/earliest job for a specific host. It then looks to see whether that job is running. If it is, do nothing (sleep and try again); otherwise it tries to "lock" the job by doing a findAndModify on _id and running == FALSE, setting running to TRUE. If a document is returned, it means this process succeeded with the update and can now start the work. Since two threads can be trying to do this at the same time, getting back null means that another thread already flipped the document to running, so wait and try again.
I would advise storing a timestamp somewhere to indicate when a job started "running", so that if a worker dies without completing a task it can be "found"; otherwise it will be "blocking" all the jobs behind it for the same host.
What I described works for a queue where you remove the job once it is finished, rather than setting running back to FALSE. If you set running back to FALSE so that other "tasks" can be done, then you will probably also be updating the tasks array to indicate what has been done.
