I ran into quite a specific issue when setting AWS CloudWatch alarms via Ansible using the EC2 dynamic inventory.
I've successfully set up aws-scripts-mon to monitor disk and RAM usage on my machines.
I've also managed to set RAM usage alarms with the Ansible ec2_metric_alarm module.
The problem I'm facing now is that when setting alarms for disk usage, the Filesystem dimension parameter is required, but it is not returned in the EC2 dynamic inventory variables.
Some of my machines have the filesystem set to /dev/xvda1 and others have something like /dev/disk/by-uuid/abcd123-def4-....
My current "solution" is as follows:
- name: "Disk > 60% (filesystem by fixed uuid)"
  ec2_metric_alarm:
    state: present
    name: "{{ ec2_tag_Name }}-Disk"
    region: "{{ ec2_region }}"
    dimensions:
      InstanceId: '{{ ec2_id }}'
      MountPath: "/"
      Filesystem: '/dev/disk/by-uuid/abcd123-def4-...'
    namespace: "System/Linux"
    metric: DiskSpaceUtilization
    statistic: Average
    comparison: ">="
    threshold: 60.0
    unit: Percent
    period: 300
    evaluation_periods: 1
    description: Triggered when Disk utilization is more than 60% for 5 minutes
    alarm_actions: ['arn:aws:sns:us-west-2:1234567890:slack']
  when: ec2_tag_Name in ['srv1', 'srv2']
- name: "Disk > 60% (filesystem /dev/xvda1)"
  ec2_metric_alarm:
    state: present
    name: "{{ ec2_tag_Name }}-Disk"
    region: "{{ ec2_region }}"
    dimensions:
      InstanceId: '{{ ec2_id }}'
      MountPath: "/"
      Filesystem: '/dev/xvda1'
    namespace: "System/Linux"
    metric: DiskSpaceUtilization
    statistic: Average
    comparison: ">="
    threshold: 60.0
    unit: Percent
    period: 300
    evaluation_periods: 1
    description: Triggered when Disk utilization is more than 60% for 5 minutes
    alarm_actions: ['arn:aws:sns:us-west-2:1234567890:slack']
  when: ec2_tag_Name not in ['srv1', 'srv2']
The only difference between those two tasks is the Filesystem dimension and the when condition (in vs. not in).
Is there any way to obtain the Filesystem value so I can use it as comfortably as, say, ec2_id? My biggest concern is that I have to track filesystem values when creating new machines and maintain lists of machines according to those values.
I couldn't find a nice solution to this problem, and ended up writing a bash script that generates a YAML file containing UUID variables.
Run the script on the remote machine using the script module:
- script: ../files/get_disk_uuid.sh > /disk_uuid.yml
Fetch the created file from the remote host.
Use the include_vars module to import the variables from the file. Ansible variable names cannot contain hyphens, so hyphens are replaced with underscores: the disk label 'cloudimg-rootfs' becomes 'cloudimg_rootfs'. Unlabeled disks get the variable names disk0, disk1, disk2, ...
Scripts aren't ideal. One day I'd like to write a module that accomplishes this same task.
#!/bin/bash
# get_disk_uuid.sh - emit a YAML mapping of disk label (or diskN) to UUID
echo '---'
disk_num=0
for entry in $(blkid -o export -s LABEL -s UUID); do
  if [[ $entry == LABEL* ]]; then
    label=${entry:6}
  elif [[ $entry == UUID* ]]; then
    uuid=${entry:5}
  fi
  if [ "$label" ] && [ "$uuid" ]; then
    # labeled disk: use the label (hyphens -> underscores) as the key
    printf '%s' "$label" | tr - _
    printf ': %s\n' "$uuid"
    label=''
    uuid=''
  elif [ "$uuid" ]; then
    # unlabeled disk: use disk0, disk1, disk2, ...
    printf 'disk%d: %s\n' "$disk_num" "$uuid"
    disk_num=$((disk_num + 1))
    label=''
    uuid=''
  fi
done
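Put together, the three steps above might look like the following tasks (a sketch only; the remote path /disk_uuid.yml mirrors the script task from the answer, and the fetched/ destination directory is an assumption):

```yaml
- name: generate the UUID vars file on the remote host
  script: ../files/get_disk_uuid.sh > /disk_uuid.yml

- name: fetch the generated file back to the controller
  fetch:
    src: /disk_uuid.yml
    dest: "fetched/{{ inventory_hostname }}/disk_uuid.yml"
    flat: yes

- name: load the UUID variables
  include_vars: "fetched/{{ inventory_hostname }}/disk_uuid.yml"
```

After include_vars, a label such as cloudimg_rootfs (or disk0 for an unlabeled disk) is available as a host variable holding the UUID.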
According to the Ansible documentation, the setup module is
This module is automatically called by playbooks to gather useful variables about remote hosts that can be used in playbooks. It can also be executed directly by /usr/bin/ansible to check what variables are available to a host. Ansible provides many facts about the system, automatically.
Among its parameters is gather_subset:
If supplied, restrict the additional facts collected to the given subset. Possible values: all, min, hardware, network, virtual, ohai, and facter. Can specify a list of values to specify a larger subset. Values can also be used with an initial ! to specify that that specific subset should not be collected. For instance: !hardware,!network,!virtual,!ohai,!facter. If !all is specified then only the min subset is collected. To avoid collecting even the min subset, specify !all,!min. To collect only specific facts, use !all,!min, and specify the particular fact subsets. Use the filter parameter if you do not want to display some collected facts.
I want to know the exact list of facts that the min subset would collect.
Thanks
What is available and can be collected will depend on your environment and setup.
You could just run a short test with
---
- hosts: localhost
  become: false
  gather_facts: true
  gather_subset:
    - "min"
  tasks:
    - name: Show Gathered Facts
      debug:
        msg: "{{ ansible_facts }}"
and check the output. For example, on a RHEL system the keys are
_ansible_facts_gathered: true
ansible_apparmor:
  status:
ansible_architecture:
ansible_cmdline:
  BOOT_IMAGE:
  LANG:
  elevator:
  quiet:
  rhgb:
  ro:
  root:
ansible_date_time:
  date: ''
  day: ''
  epoch: ''
  hour:
  iso8601: ''
  iso8601_basic:
  iso8601_basic_short:
  iso8601_micro: ''
  minute: ''
  month: ''
  second: ''
  time:
  tz:
  tz_offset: ''
  weekday:
  weekday_number: ''
  weeknumber: ''
  year: ''
ansible_distribution:
ansible_distribution_file_parsed:
ansible_distribution_file_path:
ansible_distribution_file_search_string:
ansible_distribution_file_variety:
ansible_distribution_major_version: ''
ansible_distribution_release:
ansible_distribution_version: ''
ansible_dns:
  nameservers:
    -
  search:
    -
ansible_domain:
ansible_effective_group_id:
ansible_effective_user_id:
ansible_env:
  HISTCONTROL:
  HISTSIZE: ''
  HOME:
  HOSTNAME:
  KRB5CCNAME:
  LANG:
  LESSOPEN: ''
  LOGNAME:
  LS_COLORS:
  MAIL:
  PATH:
  PWD:
  SELINUX_LEVEL_REQUESTED: ''
  SELINUX_ROLE_REQUESTED: ''
  SELINUX_USE_CURRENT_RANGE: ''
  SHELL:
  SHLVL: ''
  SSH_CLIENT:
  SSH_CONNECTION:
  SSH_TTY:
  TERM:
  TZ:
  USER:
  XDG_RUNTIME_DIR:
  XDG_SESSION_ID: ''
  _:
ansible_fips:
ansible_fqdn:
ansible_hostname:
ansible_kernel:
ansible_kernel_version: ''
ansible_local: {}
ansible_lsb: {}
ansible_machine:
ansible_machine_id:
ansible_nodename:
ansible_os_family:
ansible_pkg_mgr:
ansible_proc_cmdline:
  BOOT_IMAGE:
  LANG:
  elevator:
  quiet:
  rhgb:
  ro:
  root:
ansible_python:
  executable:
  has_sslcontext:
  type:
  version:
    major:
    micro:
    minor:
    releaselevel:
    serial:
  version_info:
    -
ansible_python_version:
ansible_real_group_id:
ansible_real_user_id:
ansible_selinux:
  config_mode:
  mode:
  policyvers:
  status:
  type:
ansible_selinux_python_present:
ansible_service_mgr:
ansible_ssh_host_key_ecdsa_public:
ansible_ssh_host_key_ed25519_public:
ansible_ssh_host_key_rsa_public:
ansible_system:
ansible_system_capabilities:
  - ''
ansible_system_capabilities_enforced: ''
ansible_user_dir:
ansible_user_gecos:
ansible_user_gid:
ansible_user_id:
ansible_user_shell:
ansible_user_uid:
ansible_userspace_architecture:
ansible_userspace_bits: ''
gather_subset:
  - min
module_setup: true
For min, /ansible/modules/setup.py tries to gather information from
minimal_gather_subset = frozenset(['apparmor', 'caps', 'cmdline', 'date_time',
                                   'distribution', 'dns', 'env', 'fips', 'local',
                                   'lsb', 'pkg_mgr', 'platform', 'python', 'selinux',
                                   'service_mgr', 'ssh_pub_keys', 'user'])
so that can be considered "the exact list of facts that the min subset would collect".
As one can see, the blocks of information come from different modules; for example, ansible_distribution comes from facts/system/distribution.py.
In my case, for example, the module for ansible_env, facts/system/env.py, will create keys which cannot be found in any other environment.
For more background on what is collected for specific environments and setups, have a look into /ansible/module_utils/facts.
Further Q&A
How Ansible gather_facts and sets variables
Getting full name of the OS using ansible_facts
Q: "I want to know the exact list of facts that the min subset would collect."
A: Run the module separately with ansible. You'll see the list of facts collected by this module:
shell> ansible localhost -m setup -a 'gather_subset=min'
As a side note, the facts differ among systems. For example, compare the collected minimal facts on FreeBSD and Ubuntu:
shell> grep PRETTY_NAME /etc/os-release
PRETTY_NAME="FreeBSD 13.0-RELEASE"
Store the output in a file
shell> ansible localhost -m setup -a 'gather_subset=min' > /scratch/freebsd.json
shell> cat /scratch/freebsd.json
localhost | SUCCESS => {
"ansible_facts": {
Remove 'localhost | SUCCESS =>' from the file so that valid JSON remains
shell> cat /scratch/freebsd.json
{
"ansible_facts": {
Create a file with the keys (variables) only
shell> jq '.ansible_facts | keys[]' freebsd.json > freebsd_keys.txt
Repeat the procedure in Ubuntu and create ubuntu_keys.txt
shell> grep DISTRIB_DESCRIPTION /etc/lsb-release
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
Diff the files
shell> diff ubuntu_keys.txt freebsd_keys.txt
3d2
< "ansible_cmdline"
6,8d4
< "ansible_distribution_file_parsed"
< "ansible_distribution_file_path"
< "ansible_distribution_file_variety"
25d20
< "ansible_machine_id"
29d23
< "ansible_proc_cmdline"
44,45d37
< "ansible_system_capabilities"
< "ansible_system_capabilities_enforced"
52d43
< "ansible_userspace_architecture"
There are also differences among Linux distributions. For example, Ubuntu 20.04 and CentOS 8:
shell> diff ubuntu_keys.txt centos_keys.txt
38d37
< "ansible_ssh_host_key_ecdsa_public_keytype"
40d38
< "ansible_ssh_host_key_ed25519_public_keytype"
42d39
< "ansible_ssh_host_key_rsa_public_keytype"
I have a bunch of servers on different subnets. I want to configure an NFS mountpoint for each one, selecting the NFS server based on which subnet the host is on. I can easily do it with multiple code blocks, each with a different when condition, but with more than just a couple of networks this results in an awful lot of code duplication:
- name: mount /home for 192.168.1.0
  mount:
    path: /home
    src: nfsserver-1.domain.net:/vol/home
    fstype: nfs
    opts: tcp,hard,intr,bg
    state: mounted
  when: ansible_default_ipv4.network == '192.168.1.0'

- name: mount /home for 192.168.2.0
  mount:
    path: /home
    src: nfsserver-2.domain.net:/vol/home
    fstype: nfs
    opts: tcp,hard,intr,bg
    state: mounted
  when: ansible_default_ipv4.network == '192.168.2.0'
How can I register a variable based on another variable? When I try the code block below, it fails because multiple tasks register detected_nfs_server and conflict with each other. (The variable doesn't only get registered when the when condition matches.)
- name: detected_nfs_server initialize to blank
  shell: echo ''
  register: detected_nfs_server
  changed_when: False

- name: detected_nfs_server nfsserver-1.domain.net
  shell: echo 'nfsserver-1.domain.net'
  register: detected_nfs_server
  when: ansible_default_ipv4.network == '192.168.1.0'
  changed_when: False

- name: detected_nfs_server nfsserver-2.domain.net
  shell: echo 'nfsserver-2.domain.net'
  register: detected_nfs_server
  when: ansible_default_ipv4.network == '192.168.2.0'
  changed_when: False

- name: Fail detected_nfs_server
  fail:
    msg: "'domain.net' not in detected_nfs_server.stdout"
  when: "'domain.net' not in detected_nfs_server.stdout"

- name: mount /home
  mount:
    path: /home
    src: "{{ detected_nfs_server.stdout }}:/vol/home"
    fstype: nfs
    opts: tcp,hard,intr,bg
    state: mounted
So far, the best solution I've found is to run the shell script shown below. This works fine, but it relies on a shell script instead of doing the variable manipulation inside Ansible. Is there a good way to register a variable in Ansible based on another variable, rather than depending on a shell script?
- name: detect_nfs_server
  shell: if [ '{{ ansible_default_ipv4.network }}' = '192.168.1.0' ] ; then echo 'nfsserver-1.domain.net' ; elif [ '{{ ansible_default_ipv4.network }}' = '192.168.2.0' ] ; then echo 'nfsserver-2.domain.net' ; else echo 'domain not detected' ; fi
  register: detected_nfs_server
  changed_when: False

- name: fail if domain not detected
  fail:
    msg: 'domain not detected'
  when: detected_nfs_server.stdout == 'domain not detected'

- name: mount /home
  mount:
    path: /home
    src: "{{ detected_nfs_server.stdout }}:/vol/home"
    fstype: nfs
    opts: tcp,hard,intr,bg
    state: mounted
There are more options. It's possible to use a dictionary (stored in group_vars, perhaps). For example:
vars:
  detected_nfs_server:
    192.168.1.0: nfsserver-1.domain.net
    192.168.2.0: nfsserver-2.domain.net
    192.168.3.0: nfsserver-3.domain.net
tasks:
  - name: mount /home
    mount:
      path: /home
      src: "{{ detected_nfs_server[ansible_default_ipv4.network] }}:/vol/home"
      fstype: nfs
      opts: tcp,hard,intr,bg
      state: mounted
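One caveat: if a host's default subnet has no entry in the dictionary, the src template fails with an undefined-key error. A guard task (a sketch, reusing the variable names above) preserves the explicit failure message the question asked for:

```yaml
- name: fail if the subnet is not mapped to an NFS server
  fail:
    msg: "no NFS server mapped for {{ ansible_default_ipv4.network }}"
  when: ansible_default_ipv4.network not in detected_nfs_server
```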
The next option is to generate the server's name from the subnet itself (here the third octet of ansible_default_ipv4.network selects the server number). For example:
- set_fact:
    detected_nfs_server: "{{ 'nfsserver-' ~
                             ansible_default_ipv4.network.split('.').2 ~
                             '.domain.net' }}"

- name: mount /home
  mount:
    path: /home
    src: "{{ detected_nfs_server }}:/vol/home"
    fstype: nfs
    opts: tcp,hard,intr,bg
    state: mounted
One option might be to have a vars file, loaded for the play, that lists the NFS server for each subnet. Then you can reference the vars:
nfs1 = 192.168.1.x
nfs2 = 192.168.2.x
nfs3 = 192.168.3.x
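Spelled out as an Ansible vars file, that might look like the following (a sketch; server names, subnets, and the export path are placeholders):

```yaml
# group_vars/all.yml (sketch)
nfs_server_by_subnet:
  192.168.1.0: nfs1.domain.net
  192.168.2.0: nfs2.domain.net
  192.168.3.0: nfs3.domain.net
```

A mount task can then look the server up by the host's subnet:

```yaml
- name: mount /home
  mount:
    path: /home
    src: "{{ nfs_server_by_subnet[ansible_default_ipv4.network] }}:/vol/home"
    fstype: nfs
    opts: tcp,hard,intr,bg
    state: mounted
```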
I am new to Ansible and currently working on a play which checks whether the disk space of remote machines has reached a 70% threshold. If it has, the play should throw an error.
I found a good example at: Using ansible to manage disk space
but in that example the mount names are hard-coded, and my requirement is to pass them dynamically. So I wrote the code below, which doesn't seem to work:
- name: test for available disk space
  assert:
    that:
      - not {{ item.mount == '{{mountname}}' and ( item.size_available <
        item.size_total - ( item.size_total|float * 0.7 ) ) }}
  with_items: '{{ansible_mounts}}'
  ignore_errors: yes
  register: disk_free
- name: Fail the play
  fail: msg="disk space has reached 70% threshold"
  when: disk_free|failed
This play works when I use:
item.mount == '/var/app'
Is there any way to enter mountname dynamically? And can I enter multiple mount names?
I am using Ansible 2.3 on RHEL.
Thanks in advance :)
Try this:
- name: Ensure that free space on {{ mountname }} is greater than 30%
  assert:
    that: mount.size_available > mount.size_total|float * 0.3
    msg: disk space has reached 70% threshold
  vars:
    mount: "{{ ansible_mounts | selectattr('mount','equalto',mountname) | list | first }}"
The that clause takes a raw Jinja2 expression, so don't use curly brackets in it.
And why use a separate fail task, when assert can fail with a message?
For those who cannot use selectattr (like me), here is a variant of the first answer using when and with_items to select the mount point to check.
- name: 'Ensure that free space on {{ mountname }} is greater than 30%'
  assert:
    that: item.size_available > item.size_total|float * 0.3
    msg: 'disk space has reached 70% threshold'
  when: item.mount == mountname
  with_items: '{{ ansible_mounts }}'
Note: To be able to use the {{ ansible_mounts }} variable you need to set gather_facts to yes; fact gathering can be limited with gather_subset=!all,hardware.
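A minimal play restricted that way might look like this (a sketch; the hosts pattern is a placeholder):

```yaml
- hosts: all
  gather_facts: yes
  gather_subset:
    - "!all"
    - "hardware"
  tasks:
    - name: show the mounts fact gathered by the hardware subset
      debug:
        var: ansible_mounts
```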
I'm running Ansible 2.5 and was able to get Konstantin Suvorov's solution to work with a slight modification, by adding with_items. Sample code below:
- name: Ensure that free space on the tested volume is greater than 15%
  assert:
    that:
      - mount.size_available > mount.size_total|float * 0.15
    msg: Disk space has reached 85% threshold
  vars:
    mount: "{{ ansible_mounts | selectattr('mount','equalto',item.mount) | list | first }}"
  with_items:
    - "{{ ansible_mounts }}"
A simple ask: I want to delete some files if partition utilization goes over a certain percentage.
I have access to "size_total" and "size_available" via "ansible_mounts", i.e.:
ansible myhost -m setup -a 'filter=ansible_mounts'
myhost | success >> {
    "ansible_facts": {
        "ansible_mounts": [
            {
                "device": "/dev/mapper/RootVolGroup00-lv_root",
                "fstype": "ext4",
                "mount": "/",
                "options": "rw",
                "size_available": 5033046016,
                "size_total": 8455118848
            },
How do I access those values, and how would I perform actions conditionally based on them using Ansible?
Slava's answer was definitely on the right track; here is what I used:
- name: test for available disk space
  assert:
    that:
      - not {{ item.mount == '/' and ( item.size_available < item.size_total - ( item.size_total|float * 0.8 ) ) }}
      - not {{ item.mount == '/var' and ( item.size_available < item.size_total - ( item.size_total|float * 0.8 ) ) }}
  with_items: ansible_mounts
  ignore_errors: yes
  register: disk_free

- name: free disk space
  command: "/some/command/that/fixes/it"
  when: disk_free|failed
The assert task simply tests for a condition; by setting ignore_errors and registering the result of the test in a new variable, we can perform a conditional task later in the play instead of just failing when the assert fails.
The tests themselves could probably be written more efficiently, but at the cost of readability, so I didn't use a multiple-list loop in the example. In this case the task loops over each item in the list of mounted filesystems (an Ansible-created fact called ansible_mounts).
By negating the test we avoid failing on filesystem mounts not in our list; simple math handles the rest. The part that tripped me up was that the size_available and size_total variables were strings, so a Jinja filter converts them to a float before calculating the percentage.
In my case, all I care about is the root partition. But I found, when using the example from frameloss above, that I needed a negated 'or' condition, because every mount point gets tested against the assertion; if more than one mount point existed, the assertion would always fail. In my example, I test size_available against a fraction of size_total directly, rather than calculating it as frameloss did.
Secondly, at least in the version of Ansible I used, it was necessary to include the {{ }} around the variable in with_items. A mistake I made that wasn't in the example above was not aligning the 'when' clause at the same indentation as the 'fail' directive. (If that mistake is made, the solution does not work...)
# This works with ansible 2.2.1.0
- hosts: api-endpoints
  become: True
  tasks:
    - name: Test disk space available
      assert:
        that:
          - item.mount != '/' or {{ item.mount == '/' and item.size_available > (item.size_total|float * 0.4) }}
      with_items: '{{ ansible_mounts }}'
      ignore_errors: yes
      register: disk_free

    - name: Fail when disk space needs attention
      fail:
        msg: 'Disk space needs attention.'
      when: disk_free|failed
I didn't test it, but I suggest trying something like this:
- file:
    dest: /path/to/big/file
    state: absent
  when: "{% for point in ansible_mounts %}{% if point.mount == '/' and point.size_available < (point.size_total / 100 * 15) %}true{% endif %}{% endfor %} == 'true'"
In this example, we iterate over the mount points to find "/", then check whether utilization goes over 85 percent (i.e. less than 15 percent of the space is available) and print "true" if it does. That string comparison then decides whether the file should be deleted.
Inspired by examples from the following blog: https://blog.codecentric.de/en/2014/08/jinja2-better-ansible-playbooks-templates/
My solution:
- name: cleanup logs, free disk space below 20%
  sudo: yes
  command: find /var -name "*.log" \( \( -size +50M -mtime +7 \) -o -mtime +30 \) -exec truncate {} --size 0 \;
  when: "item.mount == '/var' and ( item.size_available < item.size_total * 0.2 )"
  with_items: "{{ ansible_mounts }}"
This will truncate any *.log files on the /var volume that are either older than 7 days and larger than 50M, or older than 30 days, if free disk space falls below 20%.
When I create new EC2 instances, I use an Ansible dynamic inventory to create new CloudWatch metric alarms. So far so good:
- name: set AWS CloudWatch alarms
  hosts: tag_env_production
  vars:
    alarm_slack: 'arn:aws:sns:123:metrics-alarms-slack'
  tasks:
    - name: "CPU > 70%"
      ec2_metric_alarm:
        state: present
        name: "{{ ec2_tag_Name }}-CPU"
        region: "{{ ec2_region }}"
        dimensions:
          InstanceId: '{{ ec2_id }}'
        namespace: "AWS/EC2"
        metric: CPUUtilization
        statistic: Average
        comparison: ">="
        threshold: 70.0
        unit: Percent
        period: 300
        evaluation_periods: 1
        description: Triggered when CPU utilization is more than 70% for 5 minutes
        alarm_actions: ['{{ alarm_slack }}']
      when: ec2_tag_group == 'lazyservers'
Executing as follows:
ansible-playbook -v ec2_alarms.yml -i inventories/ec2/ec2.py
After creating the new instances I drop the old ones (manually). The problem is that I'd need to delete the alarms for the metrics attached to the old instances.
Am I missing something, or is there no way to do this via the dynamic inventory?
My current idea is to delete the metrics for instances that are in the "Terminating" state, but the downside is that if I run the playbook after those instances are terminated, they simply won't be visible.
Before deleting the instance, delete the alarm. Try something like this:
- name: delete alarm
  ec2_metric_alarm:
    state: absent
    region: ap-southeast-2
    name: "cpu-low"
    metric: "CPUUtilization"
    namespace: "AWS/EC2"
    statistic: Average
    comparison: "<="
    threshold: 5.0
    period: 300
    evaluation_periods: 3
    unit: "Percent"
    description: "This will alarm when a bamboo slave's cpu usage average is lower than 5% for 15 minutes"
    dimensions: {'InstanceId':'{{ instance_id }}'}
    alarm_actions: ["action1","action2"]
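Since a CloudWatch alarm is identified by its name, a deletion task can likely be much shorter. A sketch (it assumes the alarm names follow the "{{ ec2_tag_Name }}-CPU" convention from the question, and that name plus region is enough to match the existing alarm):

```yaml
- name: delete the CPU alarm before terminating the instance
  ec2_metric_alarm:
    state: absent
    region: "{{ ec2_region }}"
    name: "{{ ec2_tag_Name }}-CPU"
```

Run this against the dynamic inventory while the old instances are still up, then terminate them.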