How can I show progress for a long-running Ansible task?

I have some Ansible tasks that perform unfortunately long operations - things like running a synchronization operation against an S3 folder. It's not always clear whether they're progressing or just stuck (or whether the ssh connection has died), so it would be nice to have some sort of progress output displayed. If the command's stdout/stderr were displayed directly, I'd see that, but Ansible captures the output.
Piping output back is a difficult problem for Ansible to solve in its current form. But are there any Ansible tricks I can use to provide some sort of indication that things are still moving?
Current ticket is https://github.com/ansible/ansible/issues/4870

I came across this problem today on OS X, where I was running a docker command through the shell module that took a long time to build, with no output while it built. It was very frustrating not knowing whether the command had hung or was just progressing slowly.
I decided to pipe the output (and error) of the shell command to a port, which could then be listened to via netcat in a separate terminal.
myplaybook.yml
- name: run some long-running task and pipe to a port
  shell: myLongRunningApp > /dev/tcp/localhost/4000 2>&1
And in a separate terminal window:
$ nc -lk 4000
Output from my
long
running
app will appear here
Note that I pipe the error output to the same port; I could as easily pipe to a different port.
Also, I ended up setting a variable called nc_port which will allow for changing the port in case that port is in use. The ansible task then looks like:
shell: myLongRunningApp > /dev/tcp/localhost/{{nc_port}} 2>&1
Note that the command myLongRunningApp is being executed on localhost (i.e. that's the host set in the inventory) which is why I listen to localhost with nc.
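Putting the pieces together, the parameterised task might look something like this (a sketch only; the default of 4000 is just the port used above, and myLongRunningApp is still a placeholder):
- name: run some long-running task and stream its output to a port
  vars:
    nc_port: 4000    # assumed default; override with -e nc_port=5000
  shell: myLongRunningApp > /dev/tcp/localhost/{{ nc_port }} 2>&1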

Ansible has since implemented the following:
---
# Requires ansible 1.8+
- name: 'YUM - async task'
  yum:
    name: docker-io
    state: installed
  async: 1000
  poll: 0
  register: yum_sleeper

- name: 'YUM - check on async task'
  async_status:
    jid: "{{ yum_sleeper.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 30
For further information, see the official documentation on the topic (make sure you're selecting your version of Ansible).

There are a couple of things you can do, but as you rightly point out, Ansible in its current form doesn't really offer a good solution.
Official-ish solutions:
One idea is to mark the task as async and poll it. Obviously this is only suitable if the task can run that way without causing failures elsewhere in your playbook. The async docs are here and here's an example lifted from them:
- hosts: all
  remote_user: root
  tasks:
  - name: simulate long running op (15 sec), wait for up to 45 sec, poll every 5 sec
    command: /bin/sleep 15
    async: 45
    poll: 5
This can at least give you a 'ping' to know that the task isn't hanging.
The only other officially endorsed method would be Ansible Tower, which has progress bars for tasks but isn't free.
Hacky-ish solutions:
Beyond the above, you're pretty much going to have to roll your own. Your specific example of syncing an S3 bucket could be monitored fairly easily with a script periodically calling the AWS CLI and counting the number of items in the bucket, but that's hardly a good, generic solution.
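As a rough illustration of that idea (a sketch only; the bucket name and prefix are placeholders, and the object count is just a crude proxy for progress):
#!/bin/bash
# Watch the destination bucket and print how many objects have arrived so far.
BUCKET="my-bucket"    # placeholder
PREFIX="backups/"     # placeholder
while true; do
    COUNT=$(aws s3 ls "s3://${BUCKET}/${PREFIX}" --recursive | wc -l)
    echo "$(date '+%H:%M:%S') ${COUNT} objects in s3://${BUCKET}/${PREFIX}"
    sleep 10
done
Run it in a second terminal while the sync task executes; when the count stops growing for a long time, something is probably stuck.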
The only thing I could imagine being somewhat effective would be watching the incoming ssh session from one of your nodes.
To do that you could configure the ansible user on that machine to connect via screen and actively watch the session. Alternatively, you could use the log_output option in the sudoers entry for that user, which would let you tail the resulting log file. Details of log_output can be found in the sudoers man page.

If you're on Linux you may use systemd-run to create a transient unit and inspect the output with journalctl, like:
sudo systemd-run --unit foo \
  bash -c 'for i in {0..10}; do
             echo "$((i * 10))%"; sleep 1;
           done;
           echo "Complete"'
And in another session
sudo journalctl -xf --unit foo
It would output something like:
Apr 07 02:10:34 localhost.localdomain systemd[1]: Started /bin/bash -c for i in {0..10}; do echo "$((i * 10))%"; sleep 1; done; echo "Complete".
-- Subject: Unit foo.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit foo.service has finished starting up.
--
-- The start-up result is done.
Apr 07 02:10:34 localhost.localdomain bash[10083]: 0%
Apr 07 02:10:35 localhost.localdomain bash[10083]: 10%
Apr 07 02:10:36 localhost.localdomain bash[10083]: 20%
Apr 07 02:10:37 localhost.localdomain bash[10083]: 30%
Apr 07 02:10:38 localhost.localdomain bash[10083]: 40%
Apr 07 02:10:39 localhost.localdomain bash[10083]: 50%
Apr 07 02:10:40 localhost.localdomain bash[10083]: 60%
Apr 07 02:10:41 localhost.localdomain bash[10083]: 70%
Apr 07 02:10:42 localhost.localdomain bash[10083]: 80%
Apr 07 02:10:43 localhost.localdomain bash[10083]: 90%
Apr 07 02:10:44 localhost.localdomain bash[10083]: 100%
Apr 07 02:10:45 localhost.localdomain bash[10083]: Complete
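To tie this back to the original question, the same trick can be wrapped in an Ansible shell task (a rough sketch, not a tested recipe; myLongRunningApp and the unit name s3sync are placeholders, and --wait needs a reasonably recent systemd so the task blocks until the unit finishes):
- name: run the long task inside a transient systemd unit so its output lands in the journal
  become: true
  shell: systemd-run --unit s3sync --wait myLongRunningApp
While the play runs, sudo journalctl -f --unit s3sync on the target host shows the live output, just as above.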

Related

Memory builds up over time on a Kubernetes pod, causing the JVM to be unable to start

We are running a kubernetes environment and we have a pod that is encountering memory issues. The pod runs only a single container, and this container is responsible for running various utility jobs throughout the day.
The issue is that this pod's memory usage grows slowly over time. There is a 6 GB memory limit for this pod, and eventually, the memory consumption grows very close to 6GB.
A lot of our utility jobs are written in Java, and when the JVM spins up for them, they require -Xms256m in order to start. Yet, since the pod's memory is growing over time, eventually it gets to the point where there isn't 256MB free to start the JVM, and the Linux oom-killer kills the java process. Here is what I see from dmesg when this occurs:
[Thu Feb 18 17:43:13 2021] Memory cgroup stats for /kubepods/burstable/pod4f5d9d31-71c5-11eb-a98c-023a5ae8b224/921550be41cd797d9a32ed7673fb29ea8c48dc002a4df63638520fd7df7cf3f9: cache:8KB rss:119180KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:119132KB inactive_file:8KB active_file:0KB unevictable:4KB
[Thu Feb 18 17:43:13 2021] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Thu Feb 18 17:43:13 2021] [ 5579] 0 5579 253 1 4 0 -998 pause
[Thu Feb 18 17:43:13 2021] [ 5737] 0 5737 3815 439 12 0 907 entrypoint.sh
[Thu Feb 18 17:43:13 2021] [13411] 0 13411 1952 155 9 0 907 tail
[Thu Feb 18 17:43:13 2021] [28363] 0 28363 3814 431 13 0 907 dataextract.sh
[Thu Feb 18 17:43:14 2021] [28401] 0 28401 768177 32228 152 0 907 java
[Thu Feb 18 17:43:14 2021] Memory cgroup out of memory: Kill process 28471 (Finalizer threa) score 928 or sacrifice child
[Thu Feb 18 17:43:14 2021] Killed process 28401 (java), UID 0, total-vm:3072708kB, anon-rss:116856kB, file-rss:12056kB, shmem-rss:0kB
Based on research I've been doing, here for example, it seems like it is normal on Linux to grow in memory consumption over time as various caches grow. From what I understand, cached memory should also be freed when new processes (such as my java process) begin to run.
My main question is: should this pod's memory be getting freed in order for these java processes to run? If so, are there any steps I can take to begin to debug why this may not be happening correctly?
Aside from this concern, I've also been trying to track down what is responsible for the growing memory in the first place. I was able to narrow it down to a certain job that runs every 15 minutes. I noticed that after every time it ran, used memory for the pod grew by roughly 0.1 GB.
I was able to figure this out by running this command (inside the container) before and after each execution of the job:
cat /sys/fs/cgroup/memory/memory.usage_in_bytes | numfmt --to si
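For what it's worth, a tiny wrapper along these lines (a sketch; it assumes the same cgroup v1 path that is readable inside the container, and the script name memdelta.sh is hypothetical) makes the before/after comparison less error-prone:
#!/bin/bash
# memdelta.sh - print the change in cgroup memory usage caused by running one command.
usage() { cat /sys/fs/cgroup/memory/memory.usage_in_bytes; }
BEFORE=$(usage)
"$@"                                   # run the job passed as arguments
AFTER=$(usage)
echo "memory delta: $(( (AFTER - BEFORE) / 1024 / 1024 )) MiB"
Invoked as ./memdelta.sh ./run_my_job.sh, it prints one delta per job run.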
From there I narrowed down the piece of bash code from which the memory seems to consistently grow. That code looks like this:
while [ "z${_STATUS}" != "z0" ]
do
RES=`$CURL -X GET "${TS_URL}/wcs/resources/admin/index/dataImport/status?jobStatusId=${JOB_ID}"`
_STATUS=`echo $RES | jq -r '.status.status' || exit 1`
PROGRES=`echo $RES | jq -r '.status.progress' || exit 1`
[ "x$_STATUS" == "x1" ] && exit 1
[ "x$_STATUS" == "x3" ] && exit 3
[ $CNT -gt 10 ] && PrintLog "WC Job ($JOB_ID) Progress: $PROGRES Status: $_STATUS " && CNT=0
sleep 10
((CNT++))
done
[ "z${_STATUS}" == "z0" ] && STATUS=Success || STATUS=Failed
This piece of code seems innocuous to me at first glance, so I do not know where to go from here.
I would really appreciate any help, I've been trying to get to the bottom of this issue for days now.
I did eventually get to the bottom of this so I figured I'd post my solution here. I mentioned in my original post that I narrowed down my issue to the while loop that I posted above in my question. Each time the job in question ran, that while loop would iterate maybe 10 times. After the while loop completed, I noticed that utilized memory increased by 100MB each time pretty consistently.
On a hunch, I had a feeling the CURL command within the loop could be the culprit. And in fact, it did turn out that CURL was eating up my memory and not releasing it for whatever reason. Instead of looping and running the following CURL command:
RES=`$CURL -X GET "${TS_URL}/wcs/resources/admin/index/dataImport/status?jobStatusId=${JOB_ID}"`
I replaced this command with a simple python script that utilized the requests module to check our job statuses instead.
I am still not sure why CURL was the culprit in this case. After running CURL --version it appears that the underlying library being used is libcurl/7.29.0. Maybe there is a bug in that library version causing some issues with memory management, but that is just a guess.
In any case, switching from CURL to Python's requests module has resolved my issue.
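For illustration, the replacement script could look something like this (a sketch only; the endpoint, field names and exit codes mirror the loop above, and reading TS_URL and JOB_ID from the environment is an assumption):
#!/usr/bin/env python
# Poll the job status endpoint with requests instead of spawning curl in a loop.
import os
import sys
import time

import requests

url = "{}/wcs/resources/admin/index/dataImport/status".format(os.environ["TS_URL"])
job_id = os.environ["JOB_ID"]

while True:
    res = requests.get(url, params={"jobStatusId": job_id}, timeout=30).json()
    status = str(res["status"]["status"])
    progress = res["status"]["progress"]
    print("WC Job ({}) Progress: {} Status: {}".format(job_id, progress, status))
    if status == "0":
        sys.exit(0)              # success
    if status in ("1", "3"):
        sys.exit(int(status))    # failure codes from the original script
    time.sleep(10)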

collectd - exec plugin: Unable to parse command

I'm trying to return a value from a simple script. However, I'm getting the following error.
Feb 26 09:26:37 localhost systemd[1]: Starting Collectd statistics daemon...
Feb 26 09:26:37 localhost collectd[834]: plugin_load: plugin "exec" successfully loaded.
Feb 26 09:26:37 localhost collectd[834]: Systemd detected, trying to signal readyness.
Feb 26 09:26:37 localhost systemd[1]: Started Collectd statistics daemon.
Feb 26 09:26:37 localhost collectd[834]: Initialization complete, entering read-loop.
Feb 26 09:26:37 localhost collectd[834]: exec plugin: Unable to parse command, ignoring line: "73"
Feb 26 09:26:47 localhost collectd[834]: exec plugin: Unable to parse command, ignoring line: "74"
Feb 26 09:26:57 localhost collectd[834]: exec plugin: Unable to parse command, ignoring line: "73"
Feb 26 09:27:07 localhost collectd[834]: exec plugin: Unable to parse command, ignoring line: "73"
My config is
LoadPlugin exec
<Plugin exec>
Exec "cwagent" "/opt/aws/amazon-cloudwatch-agent/bin/supervisor.sh"
</Plugin>
and my script is
#!/bin/bash
VALUE=$(/bin/systemctl status | wc -l)
echo "$VALUE"
I realise that this is probably a silly mistake I'm making. I have spent a bit of time playing around and googling to try to understand the problem. But I'm afraid I've made little progress. Grateful for any advice :¬)
A couple of things. First, your plugin is forked off by collectd with the expectation that it keeps running and producing consumable output, so you need to use a while loop as laid out here: https://collectd.org/wiki/index.php/Plugin:Exec
Second, your output format is wrong. I found this bit of the documentation badly written, because it isn't completely clear how the plugin name, metric name and data source type are built up from the string. Taking the example from the page above:
echo "PUTVAL \"$HOSTNAME/exec-magic/gauge-magic_level\" interval=$INTERVAL N:$VALUE"
Then:
exec-magic is the plugin name
magic_level is the metric name
gauge is the data source type from collectd types
N: is the abbreviation for "now" as defined in the exec plugin
So putting this together, you'd have something similar to:
#!/bin/bash
HOSTNAME="${COLLECTD_HOSTNAME:-localhost}"
INTERVAL="${COLLECTD_INTERVAL:-60}"
while sleep "$INTERVAL"; do
    VALUE=$(/bin/systemctl status | wc -l)
    echo "PUTVAL \"${HOSTNAME}/cwagent/counter-line_count\" interval=$INTERVAL N:$VALUE"
done
In this case you are using the simple counter type and returning a single value equivalent to the number of lines you counted in your command.

How to find the date using the internet (i.e. NTP) from bash?

How can I get the date and time from the internet using bash, without installing anything extra?
I am basically looking for an equivalent of bash $ date, but using an NTP (or any other way) to get the correct date and time from the internet. All the methods I find (such as ntpd) are meant to correct the system time, which is not my purpose.
date has a lot of options for formatting, but I'm assuming that you just want the date and time:
ntpdate -q time.google.com | sed -n 's/ ntpdate.*//p'
(or any other time server)
If you have ntpd installed and configured then you can use the NTP query command ntpq -crv, which will return:
associd=0 status=04ff leap_none, sync_uhf_radio, 15 events, stale_leapsecond_values,
version="ntpd 4.2.6p5#1.2349-o Mon Feb 6 07:22:46 UTC 2017 (1)",
processor="x86_64", system="Linux/4.10.13-1.el6.elrepo.x86_64", leap=00,
stratum=1, precision=-23, rootdelay=0.000, rootdisp=1.000, refid=PPS,
reftime=dd2c9f10.f25911ee Wed, Aug 2 2017 19:57:20.946,
clock=dd2c9f11.f4251b0a Wed, Aug 2 2017 19:57:21.953, peer=6516, tc=4,
mintc=3, offset=-0.005, frequency=-17.045, sys_jitter=0.110,
clk_jitter=0.007, clk_wander=0.003, tai=37, leapsec=201701010000,
expire=201706010000
You want the line starting with clock, which gives the time, date etc. - you would be best parsing this out with awk or sed if you just want the date stamp rather than everything else.
You do not need to be a root user to run the command. It won't set anything, but will query your local server (presuming you're running ntp) and present the details.
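For example, a rough one-liner based on the output format shown above (a sketch; ntpq's exact field layout can vary between versions, so treat the regex as an assumption):
# Extract the human-readable part of the clock= line
ntpq -crv | sed -n 's/^ *clock=[0-9a-f.]* \(.*\), peer=.*/\1/p'
With the output above this prints "Wed, Aug 2 2017 19:57:21.953".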

How do I read / understand ansible logs on the target host (written to syslog)?

When you execute ansible on some host, it will write to syslog on that host, something like this:
Dec 1 15:00:22 run-tools python: ansible-<stdin> Invoked with partial=False links=None copy_links=None perms=None owner=False rsync_path=None dest_port=22 _local_rsync_path=rsync group=False existing_only=False archive=True _substitute_controller=False verify_host=False dirs=False private_key=None dest= compress=True rsync_timeout=0 rsync_opts=None set_remote_user=True recursive=None src=/etc/ansible/repo/external/golive/ checksum=False times=None mode=push ssh_args=None delete=False
Dec 1 15:00:22 run-tools python: ansible-<stdin> Invoked with partial=False links=None copy_links=None perms=None owner=False rsync_path=None dest_port=22 _local_rsync_path=rsync group=False existing_only=False archive=True _substitute_controller=False verify_host=False dirs=False private_key=None dest= compress=True rsync_timeout=0 rsync_opts=None set_remote_user=True recursive=None src=/etc/ansible/repo/external/golive/ checksum=False times=None mode=push ssh_args=None delete=False
Dec 1 15:00:22 run-tools python: ansible-<stdin> Invoked with partial=False links=None copy_links=None perms=None owner=False rsync_path=None dest_port=22 _local_rsync_path=rsync group=False existing_only=False archive=True _substitute_controller=False verify_host=False dirs=False private_key=None dest= compress=True rsync_timeout=0 rsync_opts=None set_remote_user=True recursive=None src=/etc/ansible/repo/external/golive/ checksum=False times=None mode=push ssh_args=None delete=False
Dec 1 15:00:56 run-tools python: ansible-<stdin> Invoked with filter=* fact_path=/etc/ansible/facts.d
Dec 1 15:09:56 run-tools python: ansible-<stdin> Invoked with checksum_algorithm=sha1 mime=False get_checksum=True path=/usr/local/bin/check_open_files_generic.sh checksum_algo=sha1 follow=False get_md5=False
Dec 1 15:09:56 run-tools python: ansible-<stdin> Invoked with directory_mode=None force=False remote_src=None path=/usr/local/bin/check_open_files_generic.sh owner=root follow=False group=root state=None content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None dest=/usr/local/bin/check_open_files_generic.sh selevel=None original_basename=check_open_files_generic.sh regexp=None validate=None src=check_open_files_generic.sh seuser=None recurse=False delimiter=None mode=0755 backup=None
Dec 1 15:20:03 run-tools python: ansible-<stdin> Invoked with warn=True executable=None _uses_shell=False _raw_params=visudo -c removes=None creates=None chdir=None
Is there any documentation or explanation of these logs that would help me understand how to read them? Specifically I would like to be able to see what exactly ansible did, which files it touched etc. Is it possible to find it there? Or reconfigure ansible so that it writes this kind of information in there?
Is it possible to configure these logs at all? How?
I am not aware of documentation that explains the contents of syslog messages specifically. However, you can look at some of the logging code in AnsibleModule.log() to see what's going on. Basically, it's reporting module names and the parameters they were called with.
For configuring logs, there are some good suggestions in response to this related question. The summary is that you can get more information - including your request about what ansible did - by specifying a log path and running with the verbose -v flag. For more fine-grained control, you can attack the problem from two different angles:
From the playbook side, you can use the debug module or tailor your handling of changed/failed results to suit your needs. Both of those changes can add useful context to your log output.
Outside of playbooks, you can use Ansible callback plugins to control logging. Here is an example of a callback plugin which intercepts logs and outputs something more human readable.
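For a minimal sketch of what such a plugin can look like (a notification-style callback; the plugin name log_progress and the output format are placeholders, not anything Ansible ships):
# callback_plugins/log_progress.py
from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):
    """Print a short, human-readable line for every completed task."""

    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'notification'
    CALLBACK_NAME = 'log_progress'

    def v2_runner_on_ok(self, result):
        self._display.display("OK      %s | %s" % (result._host.get_name(),
                                                   result._task.get_name()))

    def v2_runner_on_failed(self, result, ignore_errors=False):
        self._display.display("FAILED  %s | %s" % (result._host.get_name(),
                                                   result._task.get_name()))
Dropping the file into a callback_plugins/ directory next to your playbook is usually enough for Ansible to load it.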

SGE submitted job state doesn't change from "qw"

I'm using Sun Grid Engine on Ubuntu 14.04 to queue my jobs to be run on a multicore CPU.
I've installed and set up SGE on my system. I created a "hello_world" dir containing two shell scripts, "hello_world.sh" and "hello_world_qsub.sh": the first includes a simple command, and the second includes the qsub command to submit the first script as a job to be run.
Here's what "hello_world.sh" includes:
#!/bin/bash
echo "Hello world" > /home/theodore/tmp/hello_world/hello_world_output.txt
And here's what "hello_world_qsub.sh" includes:
#!/bin/bash
qsub \
-e /home/hello_world/hello_world_qsub.error \
-o /home/hello_world/hello_world_qsub.log \
./hello_world.sh
After making the second script executable and running it with the "./hello_world_qsub.sh" command from the specified dir, the output is reasonable:
Your job 1 ("hello_world.sh") has been submitted
But the output of "qstat" command is frustrating:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1 0.50000 hello_worl mhr qw 05/16/2016 20:26:23 1
And the "state" column always remains on "qw" and never changes to "r".
Here's the output of "qstat -j 1" command:
==============================================================
job_number: 1
exec_file: job_scripts/1
submission_time: Mon May 16 20:26:23 2016
owner: mhr
uid: 1000
group: mhr
gid: 1000
sge_o_home: /home/mhr
sge_o_log_name: mhr
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
sge_o_shell: /bin/bash
sge_o_workdir: /home/mhr/hello_world
sge_o_host: localhost
account: sge
stderr_path_list: NONE:NONE:/home/hello_world/hello_world_qsub.error
mail_list: mhr#localhost
notify: FALSE
job_name: hello_world.sh
stdout_path_list: NONE:NONE:/home/hello_world/hello_world_qsub.log
jobshare: 0
env_list:
script_file: ./hello_world.sh
scheduling info: queue instance "mainqueue#localhost" dropped because it is temporarily not available
All queues dropped because of overload or full
And here's the output of "qhost" command:
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
localhost - - - - - - -
What should I do to make my jobs run and finish their task?
From your qhost output, it looks like your machine "localhost" is properly configured in SGE. However, on "localhost" sge_execd is either not running or not configured properly. If it were, qhost would report statistics for "localhost".
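A quick sketch of what to check on the execution host (the service name is an assumption based on Ubuntu's gridengine packaging; adjust it to your installation):
# Is the execution daemon running at all?
pgrep -l sge_execd

# If not, try starting it (Ubuntu 14.04 gridengine packaging)
sudo service gridengine-exec start

# Afterwards, qhost should show real load/memory figures for localhost instead of "-"
qhost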
