collectd - exec plugin: Unable to parse command - metrics

I'm trying to return a value from a simple script. However, I'm getting the following error.
Feb 26 09:26:37 localhost systemd[1]: Starting Collectd statistics daemon...
Feb 26 09:26:37 localhost collectd[834]: plugin_load: plugin "exec" successfully loaded.
Feb 26 09:26:37 localhost collectd[834]: Systemd detected, trying to signal readyness.
Feb 26 09:26:37 localhost systemd[1]: Started Collectd statistics daemon.
Feb 26 09:26:37 localhost collectd[834]: Initialization complete, entering read-loop.
Feb 26 09:26:37 localhost collectd[834]: exec plugin: Unable to parse command, ignoring line: "73"
Feb 26 09:26:47 localhost collectd[834]: exec plugin: Unable to parse command, ignoring line: "74"
Feb 26 09:26:57 localhost collectd[834]: exec plugin: Unable to parse command, ignoring line: "73"
Feb 26 09:27:07 localhost collectd[834]: exec plugin: Unable to parse command, ignoring line: "73"
My config is
LoadPlugin exec
<Plugin exec>
    Exec "cwagent" "/opt/aws/amazon-cloudwatch-agent/bin/supervisor.sh"
</Plugin>
and my script is
#!/bin/bash
VALUE=$(/bin/systemctl status | wc -l)
echo "$VALUE"
I realise that this is probably a silly mistake I'm making. I have spent a bit of time playing around and googling to try to understand the problem. But I'm afraid I've made little progress. Grateful for any advice :¬)

A number of things. First, your plugin is forked off by collectd with the expectation that it keeps running and producing consumable output, so you need to use a while loop, as laid out here: https://collectd.org/wiki/index.php/Plugin:Exec
Second, your output format is wrong. I found this bit of the documentation badly written because it isn't completely clear how the gauge name and metric name are constructed from the string. Taking the example on the page above:
echo "PUTVAL \"$HOSTNAME/exec-magic/gauge-magic_level\" interval=$INTERVAL N:$VALUE"
Then:
exec-magic is the plugin name
magic_level is the metric name
gauge is the data source type from collectd's types.db
N: is the abbreviation for "now" as defined in the exec plugin
So putting this together, you'd have something similar to:
#!/bin/bash
HOSTNAME="${COLLECTD_HOSTNAME:-localhost}"
INTERVAL="${COLLECTD_INTERVAL:-60}"
while sleep "$INTERVAL"; do
    VALUE=$(/bin/systemctl status | wc -l)
    echo "PUTVAL \"${HOSTNAME}/cwagent/counter-line_count\" interval=$INTERVAL N:$VALUE"
done
In this case you are using the simple counter type and returning a single value equivalent to the number of lines you counted in your command.
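A quick way to sanity-check the script before handing it to collectd is to run it by hand with a short interval; it should print one well-formed PUTVAL line per iteration (the value here is illustrative):
$ COLLECTD_INTERVAL=2 /opt/aws/amazon-cloudwatch-agent/bin/supervisor.sh
PUTVAL "localhost/cwagent/counter-line_count" interval=2 N:73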

Related

Memory builds up over time on Kubernetes pod, causing JVM to be unable to start

We are running a kubernetes environment and we have a pod that is encountering memory issues. The pod runs only a single container, and this container is responsible for running various utility jobs throughout the day.
The issue is that this pod's memory usage grows slowly over time. There is a 6 GB memory limit for this pod, and eventually, the memory consumption grows very close to 6GB.
A lot of our utility jobs are written in Java, and when the JVM spins up for them, they require -Xms256m in order to start. Yet, since the pod's memory is growing over time, eventually it gets to the point where there isn't 256MB free to start the JVM, and the Linux oom-killer kills the java process. Here is what I see from dmesg when this occurs:
[Thu Feb 18 17:43:13 2021] Memory cgroup stats for /kubepods/burstable/pod4f5d9d31-71c5-11eb-a98c-023a5ae8b224/921550be41cd797d9a32ed7673fb29ea8c48dc002a4df63638520fd7df7cf3f9: cache:8KB rss:119180KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:119132KB inactive_file:8KB active_file:0KB unevictable:4KB
[Thu Feb 18 17:43:13 2021] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Thu Feb 18 17:43:13 2021] [ 5579] 0 5579 253 1 4 0 -998 pause
[Thu Feb 18 17:43:13 2021] [ 5737] 0 5737 3815 439 12 0 907 entrypoint.sh
[Thu Feb 18 17:43:13 2021] [13411] 0 13411 1952 155 9 0 907 tail
[Thu Feb 18 17:43:13 2021] [28363] 0 28363 3814 431 13 0 907 dataextract.sh
[Thu Feb 18 17:43:14 2021] [28401] 0 28401 768177 32228 152 0 907 java
[Thu Feb 18 17:43:14 2021] Memory cgroup out of memory: Kill process 28471 (Finalizer threa) score 928 or sacrifice child
[Thu Feb 18 17:43:14 2021] Killed process 28401 (java), UID 0, total-vm:3072708kB, anon-rss:116856kB, file-rss:12056kB, shmem-rss:0kB
Based on research I've been doing, here for example, it seems like it is normal on Linux to grow in memory consumption over time as various caches grow. From what I understand, cached memory should also be freed when new processes (such as my java process) begin to run.
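(For reference, the split between reclaimable page cache and anonymous memory can be read from the cgroup's stat file; this is a cgroup v1 path, matching the usage_in_bytes file used below:)
grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/memory.stat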
My main question is: should this pod's memory be getting freed in order for these java processes to run? If so, are there any steps I can take to begin to debug why this may not be happening correctly?
Aside from this concern, I've also been trying to track down what is responsible for the growing memory in the first place. I was able to narrow it down to a certain job that runs every 15 minutes. I noticed that each time it ran, used memory for the pod grew by ~0.1 GB.
I was able to figure this out by running this command (inside the container) before and after each execution of the job:
cat /sys/fs/cgroup/memory/memory.usage_in_bytes | numfmt --to si
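As a minimal sketch, that before/after measurement can be wrapped around a job like this (the job path is a placeholder):
mem_used() { numfmt --to=si < /sys/fs/cgroup/memory/memory.usage_in_bytes; }
before=$(mem_used)
/path/to/the_utility_job.sh   # placeholder for the job under test
after=$(mem_used)
echo "memory used before: $before, after: $after"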
From there I narrowed down the piece of bash code from which the memory seems to consistently grow. That code looks like this:
while [ "z${_STATUS}" != "z0" ]
do
RES=`$CURL -X GET "${TS_URL}/wcs/resources/admin/index/dataImport/status?jobStatusId=${JOB_ID}"`
_STATUS=`echo $RES | jq -r '.status.status' || exit 1`
PROGRES=`echo $RES | jq -r '.status.progress' || exit 1`
[ "x$_STATUS" == "x1" ] && exit 1
[ "x$_STATUS" == "x3" ] && exit 3
[ $CNT -gt 10 ] && PrintLog "WC Job ($JOB_ID) Progress: $PROGRES Status: $_STATUS " && CNT=0
sleep 10
((CNT++))
done
[ "z${_STATUS}" == "z0" ] && STATUS=Success || STATUS=Failed
This piece of code seems innocuous to me at first glance, so I do not know where to go from here.
I would really appreciate any help, I've been trying to get to the bottom of this issue for days now.
I did eventually get to the bottom of this so I figured I'd post my solution here. I mentioned in my original post that I narrowed down my issue to the while loop that I posted above in my question. Each time the job in question ran, that while loop would iterate maybe 10 times. After the while loop completed, I noticed that utilized memory increased by 100MB each time pretty consistently.
On a hunch, I suspected the CURL command within the loop could be the culprit. And in fact, it turned out that CURL was eating up my memory and not releasing it for whatever reason. Instead of looping and running the following CURL command:
RES=`$CURL -X GET "${TS_URL}/wcs/resources/admin/index/dataImport/status?jobStatusId=${JOB_ID}"`
I replaced this command with a simple python script that utilized the requests module to check our job statuses instead.
I am still not sure why CURL was the culprit in this case. After running CURL --version it appears that the underlying library being used is libcurl/7.29.0. Maybe there is a bug in that library version causing issues with memory management, but that is just a guess.
In any case, switching from CURL to python's requests module has resolved my issue.

Communicating with Systemd service through socket mapped to stdin

I'm creating my first background service and I want to communicate with it through a socket.
I have the following script /tmp/myservice.sh:
#! /usr/bin/env bash
while read received_cmd
do
    echo "Received command ${received_cmd}"
done
And the following socket /etc/systemd/user/myservice.socket
[Unit]
Description=Socket to communicate with myservice
[Socket]
ListenSequentialPacket=/tmp/myservice.socket
And the following service:
[Unit]
Description=A simple service example
[Service]
ExecStart=/bin/bash /tmp/myservice.sh
StandardError=journal
StandardInput=socket
StandardOutput=socket
Type=simple
The idea is to understand how to communicate with a background service, here using a Unix socket. The script works well when launched from the shell and reading stdin, and I thought that by setting StandardInput=socket it would read from the socket in the same way.
Nevertheless, when I run nc -U /tmp/myservice.socket the command returns right away and I have the following output:
$ journalctl --user -u myservice
-- Logs begin at Sat 2020-10-24 17:26:25 BST, end at Thu 2020-10-29 14:00:53 GMT. --
Oct 29 08:40:16 shiny systemd[1689]: Started A simple service example.
Oct 29 08:40:16 shiny bash[21941]: /tmp/myservice.sh: line 3: read: read error: 0: Invalid argument
Oct 29 08:40:16 shiny systemd[1689]: myservice.service: Succeeded.
Oct 29 08:40:16 shiny systemd[1689]: Started A simple service example.
Oct 29 08:40:16 shiny bash[21942]: /tmp/myservice.sh: line 3: read: read error: 0: Invalid argument
Oct 29 08:40:16 shiny systemd[1689]: myservice.service: Succeeded.
Oct 29 08:40:16 shiny systemd[1689]: Started A simple service example.
Oct 29 08:40:16 shiny bash[21943]: /tmp/myservice.sh: line 3: read: read error: 0: Invalid argument
Oct 29 08:40:16 shiny systemd[1689]: myservice.service: Succeeded.
Oct 29 08:40:16 shiny systemd[1689]: Started A simple service example.
Oct 29 08:40:16 shiny bash[21944]: /tmp/myservice.sh: line 3: read: read error: 0: Invalid argument
Oct 29 08:40:16 shiny systemd[1689]: myservice.service: Succeeded.
Oct 29 08:40:16 shiny systemd[1689]: Started A simple service example.
Oct 29 08:40:16 shiny bash[21945]: /tmp/myservice.sh: line 3: read: read error: 0: Invalid argument
Oct 29 08:40:16 shiny systemd[1689]: myservice.service: Succeeded.
Oct 29 08:40:16 shiny systemd[1689]: myservice.service: Start request repeated too quickly.
Oct 29 08:40:16 shiny systemd[1689]: myservice.service: Failed with result 'start-limit-hit'.
Oct 29 08:40:16 shiny systemd[1689]: Failed to start A simple service example.
Did I misunderstand how sockets work? Why does read fail to read from the socket? Should I use another mechanism to communicate with my background service? (As I said, it's my first background service, so I may be doing unconventional things here.)
The only thing I have seen work with a shell script is ListenStream= rather than ListenSequentialPacket=. (Obviously, this means you lose packet boundaries, but shell read is usually oriented to reading newline-terminated lines from streams, so this is not usually a problem.)
But the most important thing that is missing, is the extra Accept line:
[Socket]
ListenStream=...
Accept=true
As I understand it, without this the service will be passed a socket on which it must first do a socket accept() call, to get the actual connection socket (hence the read error). The service must also then handle all further connections.
By using Accept=true, a new service will be started for each new connection and will be passed the immediately usable socket. Note, however, that this means the service must now be templated, i.e. called myservice@.service rather than myservice.service.
(For datagram sockets, Accept must be left defaulted to false). See man systemd.socket.
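Putting that together, the pair of units might look like this - a sketch using the paths from the question, with the templated service name:
# /etc/systemd/user/myservice.socket
[Unit]
Description=Socket to communicate with myservice
[Socket]
ListenStream=/tmp/myservice.socket
Accept=true
# /etc/systemd/user/myservice@.service
[Unit]
Description=A simple service example
[Service]
ExecStart=/bin/bash /tmp/myservice.sh
StandardInput=socket
StandardOutput=socket
StandardError=journal
After a systemctl --user daemon-reload and systemctl --user start myservice.socket, each nc -U /tmp/myservice.socket connection should get its own service instance, reading the connection on stdin.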

Difference between journalctl -u test.service and journalctl CONTAINER_NAME=test

I have a systemd service file which runs a docker container with the journald log driver.
ExecStart=/usr/bin/docker run \
    --name ${CONTAINER_NAME} \
    -p ${PORT}:8080 \
    --add-host ${DNS} \
    -v /etc/localtime:/etc/localtime:ro \
    --log-driver=journald \
    --log-opt tag="docker.{{.Name}}" \
    ${RESPOSITORY_NAME}/${CONTAINER_NAME}
ExecStop=-/usr/bin/docker stop ${CONTAINER_NAME}
When I check the logs via journalctl I see two different _TRANSPORT values.
With journalctl -u test.service I see _TRANSPORT=stdout, and with journalctl CONTAINER_NAME=test I see _TRANSPORT=journal.
What is the difference?
The difference here is in how the logs get to systemd-journald before they are logged.
As of right now, the supported transports (at least according to the _TRANSPORT field in systemd-journald) are: audit, driver, syslog, journal, stdout and kernel (see systemd.journal-fields(7)).
In your case, everything logged to stdout by commands executed by the ExecStart= and ExecStop= directives is logged under the _TRANSPORT=stdout transport.
However, Docker is internally capable of using the journald logging driver which, among other things, introduces several custom journal fields - one of them being CONTAINER_ID=. It's just a different method of delivering data to systemd-journald - instead of relying on systemd to catch and send everything from stdout to systemd-journald, Docker internally sends everything straight to systemd-journald by itself.
This can be achieved by using the sd-journal API (as described in sd-journal(3)). Docker uses the go-systemd Go bindings for the sd-journal C library.
Simple example:
hello.c
#include <stdio.h>
#include <systemd/sd-journal.h>

int main(void)
{
    printf("Hello from stdout\n");                     /* arrives via _TRANSPORT=stdout */
    sd_journal_print(LOG_INFO, "Hello from journald"); /* arrives via _TRANSPORT=journal */
    return 0;
}
# gcc -o /var/tmp/hello hello.c -lsystemd
# cat > /etc/systemd/system/hello.service << EOF
[Service]
ExecStart=/var/tmp/hello
EOF
# systemctl daemon-reload
# systemctl start hello.service
Now if I check the journal, I'll see both messages:
# journalctl -u hello.service
-- Logs begin at Mon 2019-09-30 22:08:02 CEST, end at Fri 2020-03-27 17:11:29 CET. --
Mar 27 17:08:28 localhost systemd[1]: Started hello.service.
Mar 27 17:08:28 localhost hello[921852]: Hello from journald
Mar 27 17:08:28 localhost hello[921852]: Hello from stdout
Mar 27 17:08:28 localhost systemd[1]: hello.service: Succeeded.
But each of them arrived using a different transport:
# journalctl -u hello.service _TRANSPORT=stdout
-- Logs begin at Mon 2019-09-30 22:08:02 CEST, end at Fri 2020-03-27 17:12:29 CET. --
Mar 27 17:08:28 localhost hello[921852]: Hello from stdout
# journalctl -u hello.service _TRANSPORT=journal
-- Logs begin at Mon 2019-09-30 22:08:02 CEST, end at Fri 2020-03-27 17:12:29 CET. --
Mar 27 17:08:28 localhost systemd[1]: Started hello.service.
Mar 27 17:08:28 localhost hello[921852]: Hello from journald
Mar 27 17:08:28 localhost systemd[1]: hello.service: Succeeded.
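To confirm how any single entry arrived, the verbose output format shows all of its fields, including _TRANSPORT and (for Docker's journald driver) custom fields such as CONTAINER_NAME= and CONTAINER_ID=:
# journalctl CONTAINER_NAME=test -o verbose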

How can I show progress for a long-running Ansible task?

I have some Ansible tasks that perform unfortunately long operations - things like running a synchronization operation with an S3 folder. It's not always clear whether they're progressing or just stuck (or whether the ssh connection has died), so it would be nice to have some sort of progress output displayed. If the command's stdout/stderr were displayed directly, I'd see that, but Ansible captures the output.
Piping output back is a difficult problem for Ansible to solve in its current form. But are there any Ansible tricks I can use to provide some sort of indication that things are still moving?
Current ticket is https://github.com/ansible/ansible/issues/4870
I came across this problem today on OSX, where I was running a docker shell command which took a long time to build and there was no output whilst it built. It was very frustrating to not understand whether the command had hung or was just progressing slowly.
I decided to pipe the output (and error) of the shell command to a port, which could then be listened to via netcat in a separate terminal.
myplaybook.yml
- name: run some long-running task and pipe to a port
  shell: myLongRunningApp > /dev/tcp/localhost/4000 2>&1
And in a separate terminal window:
$ nc -lk 4000
Output from my
long
running
app will appear here
Note that I pipe the error output to the same port; I could as easily pipe to a different port.
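For example, to send stderr to a second listener instead (port 4001 being an arbitrary choice):
shell: myLongRunningApp > /dev/tcp/localhost/4000 2> /dev/tcp/localhost/4001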
Also, I ended up setting a variable called nc_port which will allow for changing the port in case that port is in use. The ansible task then looks like:
shell: myLongRunningApp > /dev/tcp/localhost/{{nc_port}} 2>&1
Note that the command myLongRunningApp is being executed on localhost (i.e. that's the host set in the inventory) which is why I listen to localhost with nc.
Ansible has since implemented the following:
---
# Requires ansible 1.8+
- name: 'YUM - async task'
  yum:
    name: docker-io
    state: installed
  async: 1000
  poll: 0
  register: yum_sleeper

- name: 'YUM - check on async task'
  async_status:
    jid: "{{ yum_sleeper.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 30
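The interval between retries can be tuned with delay:, which defaults to 5 seconds.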
For further information, see the official documentation on the topic (make sure you're selecting your version of Ansible).
There's a couple of things you can do, but as you have rightly pointed out, Ansible in its current form doesn't really offer a good solution.
Official-ish solutions:
One idea is to mark the task as async and poll it. Obviously this is only suitable if it is capable of running in such a manner without causing failure elsewhere in your playbook. The async docs are here and here's an example lifted from them:
- hosts: all
  remote_user: root
  tasks:
    - name: simulate long running op (15 sec), wait for up to 45 sec, poll every 5 sec
      command: /bin/sleep 15
      async: 45
      poll: 5
This can at least give you a 'ping' to know that the task isn't hanging.
The only other officially endorsed method would be Ansible Tower, which has progress bars for tasks but isn't free.
Hacky-ish solutions:
Beyond the above, you're pretty much going to have to roll your own. Your specific example of synching an S3 bucket could be monitored fairly easily with a script periodically calling the AWS CLI and counting the number of items in a bucket, but that's hardly a good, generic solution.
The only thing I could imagine being somewhat effective would be watching the incoming ssh session from one of your nodes.
To do that you could configure the ansible user on that machine to connect via screen and actively watch the session. Alternatively, you could use the log_output option in the sudoers entry for that user, allowing you to tail the log file. Details of log_output can be found in the sudoers man page.
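As a rough sketch, enabling it might look like this in sudoers (the log directory here is an assumption; check your distro's default in the man page):
Defaults log_output
Defaults iolog_dir=/var/log/sudo-io/%{user}
Sessions recorded this way can then be replayed with sudoreplay.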
If you're on Linux you may use systemd-run to create a transient unit and inspect the output with journalctl, like:
sudo systemd-run --unit foo \
  bash -c 'for i in {0..10}; do
    echo "$((i * 10))%"; sleep 1;
  done;
  echo "Complete"'
And in another session
sudo journalctl -xf --unit foo
It would output something like:
Apr 07 02:10:34 localhost.localdomain systemd[1]: Started /bin/bash -c for i in {0..10}; do echo "$((i * 10))%"; sleep 1; done; echo "Complete".
-- Subject: Unit foo.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit foo.service has finished starting up.
--
-- The start-up result is done.
Apr 07 02:10:34 localhost.localdomain bash[10083]: 0%
Apr 07 02:10:35 localhost.localdomain bash[10083]: 10%
Apr 07 02:10:36 localhost.localdomain bash[10083]: 20%
Apr 07 02:10:37 localhost.localdomain bash[10083]: 30%
Apr 07 02:10:38 localhost.localdomain bash[10083]: 40%
Apr 07 02:10:39 localhost.localdomain bash[10083]: 50%
Apr 07 02:10:40 localhost.localdomain bash[10083]: 60%
Apr 07 02:10:41 localhost.localdomain bash[10083]: 70%
Apr 07 02:10:42 localhost.localdomain bash[10083]: 80%
Apr 07 02:10:43 localhost.localdomain bash[10083]: 90%
Apr 07 02:10:44 localhost.localdomain bash[10083]: 100%
Apr 07 02:10:45 localhost.localdomain bash[10083]: Complete

Parsing entry name from a log

Writing bash parsing scripts is my own personal nightmare, so here I am.
The server log format is below:
197 INFO Thu Mar 27 10:10:32 2014
seq_1_1..JobControl (DSWaitForJob): Waiting for job job_1_1_1 to finish
198 INFO Thu Mar 27 10:10:36 2014
seq_1_1..JobControl (DSWaitForJob): Job job_1_1_1 has finished, status = 3 (Aborted)
199 WARNING Thu Mar 27 10:10:36 2014
seq_1_1..JobControl (#job_1_1_1): Job job_1_1_1 did not finish OK, status = 'Aborted'
From here I need to parse out the string which follows the format:
Job job_name has finished, status = 3 (Aborted)
So from the output above I should get: job_1_1_1
What would a script for that look like, given that I get this server log as the output of some command?
Thanks xx
Using grep -P:
grep -oP '\w+(?= has finished, status = 3)' file
job_1_1_1
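The (?=...) is a lookahead: it requires " has finished, status = 3" to follow the matched word but excludes it from the match, so only the job name is printed. If the log arrives as the output of a command rather than in a file, pipe it through the same filter (the command name is a placeholder):
some_log_command | grep -oP '\w+(?= has finished, status = 3)'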
