Monit: 'Matching' functionality isn't working - bash

I have a process that's kicked off from a custom script. The process does not write a '.pid' file, so I am trying to use 'matching'. However, it seems to be breaking on the whitespace (it just stops after '/bin/bash'), no matter how I format the command. The commands themselves work fine outside of Monit.
Here is what I am trying to use:
check process example_process matching "example_process"
    start program = "/bin/bash -c 'nohup /mnt1/path/to/custom/bin/run.sh &'"
    stop program = "/usr/bin/killall example_process"
    if cpu > 80% for 2 cycles then alert
    if cpu > 95% for 5 cycles then restart
    if totalmem > 500.0 MB for 5 cycles then restart
    if children > 3 then restart
Errors logged:
[UTC Jun 18 02:01:46] info : 'system_ip-10-0-11-189' Monit started
[UTC Jun 18 02:01:46] error : 'example_process' process is not running
[UTC Jun 18 02:01:46] info : 'example_process' trying to restart
[UTC Jun 18 02:01:46] info : 'example_process' start: /bin/bash
[UTC Jun 18 02:02:16] error : 'example_process' failed to start
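One workaround worth trying (a sketch; the wrapper path and name are hypothetical, and this is not from the Monit docs) is to take the shell quoting out of the monitrc line entirely and have Monit execute an argument-free wrapper script:

#!/bin/bash
# /mnt1/path/to/custom/bin/monit_start.sh -- hypothetical wrapper; Monit then
# executes a single token, so its parser never has to handle spaces or quotes
nohup /mnt1/path/to/custom/bin/run.sh >/dev/null 2>&1 &

and in monitrc:

start program = "/mnt1/path/to/custom/bin/monit_start.sh"

Note that Monit typically logs only the executable name of the start program (hence start: /bin/bash above), so the truncated log line by itself doesn't prove the arguments were dropped.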

Related

Memory builds up over time on Kubernetes pod causing JVM unable to start

We are running a kubernetes environment and we have a pod that is encountering memory issues. The pod runs only a single container, and this container is responsible for running various utility jobs throughout the day.
The issue is that this pod's memory usage grows slowly over time. There is a 6 GB memory limit for this pod, and eventually memory consumption grows very close to 6 GB.
A lot of our utility jobs are written in Java, and when the JVM spins up for them, they require -Xms256m in order to start. Yet, since the pod's memory grows over time, it eventually gets to the point where there isn't 256 MB free to start the JVM, and the Linux oom-killer kills the java process. Here is what I see from dmesg when this occurs:
[Thu Feb 18 17:43:13 2021] Memory cgroup stats for /kubepods/burstable/pod4f5d9d31-71c5-11eb-a98c-023a5ae8b224/921550be41cd797d9a32ed7673fb29ea8c48dc002a4df63638520fd7df7cf3f9: cache:8KB rss:119180KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:119132KB inactive_file:8KB active_file:0KB unevictable:4KB
[Thu Feb 18 17:43:13 2021] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Thu Feb 18 17:43:13 2021] [ 5579] 0 5579 253 1 4 0 -998 pause
[Thu Feb 18 17:43:13 2021] [ 5737] 0 5737 3815 439 12 0 907 entrypoint.sh
[Thu Feb 18 17:43:13 2021] [13411] 0 13411 1952 155 9 0 907 tail
[Thu Feb 18 17:43:13 2021] [28363] 0 28363 3814 431 13 0 907 dataextract.sh
[Thu Feb 18 17:43:14 2021] [28401] 0 28401 768177 32228 152 0 907 java
[Thu Feb 18 17:43:14 2021] Memory cgroup out of memory: Kill process 28471 (Finalizer threa) score 928 or sacrifice child
[Thu Feb 18 17:43:14 2021] Killed process 28401 (java), UID 0, total-vm:3072708kB, anon-rss:116856kB, file-rss:12056kB, shmem-rss:0kB
Based on research I've been doing (here, for example), it seems normal on Linux for memory consumption to grow over time as various caches grow. From what I understand, cached memory should also be freed when new processes (such as my java process) begin to run.
My main question is: should this pod's memory be getting freed in order for these java processes to run? If so, are there any steps I can take to begin to debug why this may not be happening correctly?
Aside from this concern, I've also been trying to track down what is responsible for the growing memory in the first place. I was able to narrow it down to a certain job that runs every 15 minutes. I noticed that after every run, used memory for the pod grew by ~0.1 GB.
I was able to figure this out by running this command (inside the container) before and after each execution of the job:
cat /sys/fs/cgroup/memory/memory.usage_in_bytes | numfmt --to si
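For reference, here is a minimal sketch of that before/after measurement (the job path is a placeholder):

#!/bin/bash
# Hypothetical helper: report how much cgroup memory one run of the job adds.
USAGE=/sys/fs/cgroup/memory/memory.usage_in_bytes
BEFORE=$(cat "$USAGE")
/path/to/utility_job.sh               # placeholder for the 15-minute job
AFTER=$(cat "$USAGE")
echo "job grew cgroup memory by $(( (AFTER - BEFORE) / 1024 / 1024 )) MB"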
From there I narrowed down the piece of bash code from which the memory seems to consistently grow. That code looks like this:
while [ "z${_STATUS}" != "z0" ]
do
RES=`$CURL -X GET "${TS_URL}/wcs/resources/admin/index/dataImport/status?jobStatusId=${JOB_ID}"`
_STATUS=`echo $RES | jq -r '.status.status' || exit 1`
PROGRES=`echo $RES | jq -r '.status.progress' || exit 1`
[ "x$_STATUS" == "x1" ] && exit 1
[ "x$_STATUS" == "x3" ] && exit 3
[ $CNT -gt 10 ] && PrintLog "WC Job ($JOB_ID) Progress: $PROGRES Status: $_STATUS " && CNT=0
sleep 10
((CNT++))
done
[ "z${_STATUS}" == "z0" ] && STATUS=Success || STATUS=Failed
This piece of code seems innocuous to me at first glance, so I do not know where to go from here.
I would really appreciate any help, I've been trying to get to the bottom of this issue for days now.
I did eventually get to the bottom of this so I figured I'd post my solution here. I mentioned in my original post that I narrowed down my issue to the while loop that I posted above in my question. Each time the job in question ran, that while loop would iterate maybe 10 times. After the while loop completed, I noticed that utilized memory increased by 100MB each time pretty consistently.
On a hunch, I suspected the curl command within the loop could be the culprit, and in fact it turned out that curl was eating up my memory and not releasing it for whatever reason. Instead of looping and running the following curl command:
RES=`$CURL -X GET "${TS_URL}/wcs/resources/admin/index/dataImport/status?jobStatusId=${JOB_ID}"`
I replaced this command with a simple Python script that used the requests module to check our job statuses instead.
I am still not sure why curl was the culprit here. After running curl --version, it appears the underlying library is libcurl/7.29.0. Maybe there is a bug in that library version causing memory-management issues, but that is just a guess.
In any case, switching from curl to Python's requests module has resolved my issue.

Scylla fails to mount RAID volume after restarting EC2 instance

I am new to Scylla. I have followed the installation steps on the Scylla website to set up a small 4-node Scylla cluster in my AWS account. I am using the Scylla AMI on my EC2 instances.
If I stop one of the EC2 instances and then start it up again, I get the message Failed mounting RAID volume! when I try to restart Scylla.
I believe I have to remount the RAID volume by running this:
scylla_raid_setup --raiddev /dev/md0 --disks /dev/nvme1n1,/dev/nvme2n1 --update-fstab --root /var/lib/scylla --volume-role all
However, when I then try to start Scylla I get the following error message:
A dependency job for scylla-server.service failed. See 'journalctl -xe' for details.
It seems that the mount failed; here are the logs:
-- Subject: Unit var-lib-scylla.mount has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit var-lib-scylla.mount has failed.
--
-- The result is dependency.
Dependency failed for Scylla Server.
-- Subject: Unit scylla-server.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit scylla-server.service has failed.
--
-- The result is dependency.
May 05 13:23:56 systemd[1]: Dependency failed for Scylla JMX.
-- Subject: Unit scylla-jmx.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit scylla-jmx.service has failed.
--
-- The result is dependency.
May 05 13:23:56 systemd[1]: Job scylla-jmx.service/start failed with result 'dependency'.
May 05 13:23:56 systemd[1]: Dependency failed for Run Scylla Housekeeping daily mode.
-- Subject: Unit scylla-housekeeping-daily.timer has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit scylla-housekeeping-daily.timer has failed.
--
-- The result is dependency.
May 05 13:23:56 polkitd[4226]: Unregistered Authentication Agent for unix-process:7668:53288 (system bus name :1.20, object path /org/freedesktop/PolicyKit1/AuthenticationAge
May 05 13:23:56 systemd[1]: Job scylla-housekeeping-daily.timer/start failed with result 'dependency'.
May 05 13:23:56 sudo[7666]: pam_unix(sudo:session): session closed for user root
May 05 13:23:56 systemd[1]: Dependency failed for Run Scylla Housekeeping restart mode.
-- Subject: Unit scylla-housekeeping-restart.timer has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit scylla-housekeeping-restart.timer has failed.
--
-- The result is dependency.
May 05 13:23:56 systemd[1]: Job scylla-housekeeping-restart.timer/start failed with result 'dependency'.
May 05 13:23:56 systemd[1]: Job scylla-server.service/start failed with result 'dependency'.
May 05 13:23:56 systemd[1]: Job var-lib-scylla.mount/start failed with result 'dependency'.
May 05 13:23:56 systemd[1]: Job dev-disk-by\x2duuid-67fde517\x2d892a\x2d4a3f\x2d9e19\x2dac71c9bdd533.device/start failed with result 'timeout'.
What should my next step be?
Here are the disks:
Disk /dev/nvme1n1: 7500.0 GB, 7500000000000 bytes, 14648437500 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/nvme2n1: 7500.0 GB, 7500000000000 bytes, 14648437500 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/nvme0n1: 10.7 GB, 10737418240 bytes, 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000b0301
If I include nvme0n1 in the disks for scylla_raid_setup, it returns: /dev/nvme0n1 is busy.
Otherwise, this is what scylla_raid_setup outputs:
Creating RAID0 for scylla using 2 disk(s): /dev/nvme2n1,/dev/nvme1n1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
meta-data=/dev/md0 isize=512 agcount=32, agsize=114438912 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=3662043136, imaxpct=5
= sunit=256 swidth=512 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
My /etc/fstab file looks like this:
UUID=0a84de8e-5bfe-43e7-992b-5bfff8cdce43 / xfs defaults 0 0
UUID="67fde517-892a-4a3f-9e19-ac71c9bdd533" /var/lib/scylla xfs noatime,nofail 0 0
UUID="24aab0fc-dc32-48de-bf6b-5a3d5bcd1f00" /var/lib/scylla xfs noatime,nofail 0 0
I removed one of the entries and tried restarting Scylla. But it still failed to start :(
After running systemctl start var-lib-scylla.mount:
May 06 14:18:18 ip-172-31-14-126.ec2.internal polkitd[4760]: Registered Authentication Agent for unix-process:7789:57998 (system bus name :1.34 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8)
May 06 14:19:48 ip-172-31-14-126.ec2.internal systemd[1]: Job dev-disk-by\x2duuid-17c356e1\x2d1ec9\x2d47d1\x2d8e98\x2d45182b7a9454.device/start timed out.
May 06 14:19:48 ip-172-31-14-126.ec2.internal systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-17c356e1\x2d1ec9\x2d47d1\x2d8e98\x2d45182b7a9454.device.
-- Subject: Unit dev-disk-by\x2duuid-17c356e1\x2d1ec9\x2d47d1\x2d8e98\x2d45182b7a9454.device has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit dev-disk-by\x2duuid-17c356e1\x2d1ec9\x2d47d1\x2d8e98\x2d45182b7a9454.device has failed.
--
-- The result is timeout.
May 06 14:19:48 ip-172-31-14-126.ec2.internal systemd[1]: Dependency failed for /var/lib/scylla.
-- Subject: Unit var-lib-scylla.mount has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit var-lib-scylla.mount has failed.
--
-- The result is dependency.
May 06 14:19:48 systemd[1]: Job var-lib-scylla.mount/start failed with result 'dependency'.
May 06 14:19:48 systemd[1]: Job dev-disk-by\x2duuid-17c356e1\x2d1ec9\x2d47d1\x2d8e98\x2d45182b7a9454.device/start failed with result 'timeout'.
May 06 14:19:48 polkitd[4760]: Unregistered Authentication Agent for unix-process:7789:57998 (system bus name :1.34, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8) (disconnected from bus)
May 06 14:19:48 sudo[7787]: pam_unix(sudo:session): session closed for user root
Here are the steps you can try:
1) List all the disks
$ fdisk -l
2) Recreate the RAID
$ sudo /usr/lib/scylla/scylla_raid_setup --disks /dev/nvme2n1,/dev/nvme3n1,/dev/nvme0n1,/dev/nvme1n1,… <list all the disks you want in the RAID volume, comma-delimited> --raiddev /dev/md0 --update-fstab --root /var/lib/scylla --volume-role all
(Alternative approach)
udevadm settle
mdadm --create --verbose --force --run /dev/md0 --level=0 -c1024 --raid-devices=<NUMBER OF DISKS> /dev/nvme0n1 … <specify the disks, space-separated>
udevadm settle
3) Format the raid0 disk with XFS
$ mkfs.xfs /dev/md0 -f -K
4) Clear the old entry from fstab
$ vi /etc/fstab ## delete the /var/lib/scylla line
5) Add the new line to fstab
$ echo "`blkid /dev/md0 | awk '{print $2}'` /var/lib/scylla xfs noatime 0 0" >> /etc/fstab
6) Reload the daemon
$ systemctl daemon-reload
7) Mount the file-system
$ systemctl start var-lib-scylla.mount
8) Recreate the directories
$ mkdir -p "/var/lib/scylla/data"
$ mkdir -p "/var/lib/scylla/commitlog"
$ mkdir -p "/var/lib/scylla/hints"
$ mkdir -p "/var/lib/scylla/coredump"
9) Change the permissions
$ chown -R scylla:scylla "/var/lib/scylla"
10) Start Scylla
$ systemctl start scylla-server
Let me know if you run into issues...
You should probably check the contents of /etc/fstab and see whether you have 2 (or more) entries for Scylla (/var/lib/scylla). If you do, that is probably the cause of the mount failure; there should be only 1 entry.
If there is more than 1 entry, or no entry for Scylla in /etc/fstab, the Scylla service will fail to start, and that is the error you see in the logs.
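As a quick check, something like this (a sketch; blkid and the systemd mount unit are standard, the rest assumes the steps above were already run):

$ grep -n '/var/lib/scylla' /etc/fstab    # more than one match means duplicates
$ blkid /dev/md0                          # keep only the fstab line with this UUID
$ systemctl daemon-reload
$ systemctl start var-lib-scylla.mount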

Bash script for monitoring logs based upon last update time

I have a directory on a RHEL 6 server where logs are being written as below. As you can see, there are 4 logs already written within 1 minute. I just want to write a script that checks every 15 minutes (via cron) and, if the log files are not updating, sends an email alert like "Adapter is in hang status, Restart Required". I know basic Linux commands and have some knowledge of cron. This is how I am trying:
-rw-r--r-- 1 root root 11M Oct 6 00:32 Adapter.log.3
-rw-r--r-- 1 root root 11M Oct 6 00:32 Adapter.log.2
-rw-r--r-- 1 root root 10M Oct 6 00:32 Adapter.log.1
-rw-r--r-- 1 root root 6.3M Oct 6 00:32 Adapter.log
$ ll Adapter.log >/tmp/test.txt
$ cat /tmp/test.txt | awk '{print $6,$7,$8}'
Oct 6 03:10
Now how can I get the time of the same log file after 15 minutes, so that I can compare the time difference and write a script to send the alert?
Given the description, it looks like the timestamp can be checked every 15 minutes:
If the file was updated in the last 15 minutes, do nothing.
If the file was updated 15 to 30 minutes ago, send an email alert.
If the file was updated more than 30 minutes ago, do nothing, as the error was already reported in a previous cycle.
Consider placing the following into cron, on a 15-minute interval:
find /path/to/log/Adapter.log* -mmin +15 -mmin -30 | xargs -L1 send-alert
This solution will work in most situations. However, if the system load is very high, cron execution may be delayed, which skews the age test. In that case, an extra file storing the time of the last check is needed.
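Putting it together, a minimal sketch of the cron job (the log directory, recipient address, and use of mailx are assumptions to adapt):

#!/bin/bash
# check_adapter.sh -- hypothetical script, run from cron every 15 minutes:
#   */15 * * * * /usr/local/bin/check_adapter.sh
LOGDIR=/path/to/log                   # placeholder: directory holding Adapter.log*
RECIPIENT=ops@example.com             # placeholder address

# Alert only for files last modified 15-30 minutes ago, so each hang is
# reported once rather than on every later cycle.
STALE=$(find "$LOGDIR" -name 'Adapter.log*' -mmin +15 -mmin -30)
if [ -n "$STALE" ]; then
    echo "Adapter is in hang status, Restart Required" \
        | mailx -s "Adapter hang alert on $(hostname)" "$RECIPIENT"
fi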

Autohotkey unable to kill a process

I have been facing some issues with svchost going out of control at times and making my system unstable. Mostly I just kill it manually, but I decided to write an AHK script to do that automatically every time it starts using too much memory.
#NoEnv ; Recommended for performance and compatibility with future AutoHotkey releases.
#Warn ; Enable warnings to assist with detecting common errors.
#SingleInstance force
;--------------------------------------------------------------
; Variables
;--------------------------------------------------------------
minMemMB = 200
minCPUPercentage = 50

Loop
{
    for process in ComObjGet("winmgmts:").ExecQuery("Select IDProcess, PercentProcessorTime, WorkingSet from Win32_PerfFormattedData_PerfProc_Process where Name like '%svchost%'")
    { ; braces added: without them, only the PID assignment ran per process
        PID = % process.IDProcess
        CPU = % process.PercentProcessorTime
        MEM = % Round(process.WorkingSet/1000000) ; working set in MB
        FormatTime, TIME ; current time in the default format
        if (CPU > minCPUPercentage or MEM > minMemMB)
        {
            ; closing a SYSTEM-owned svchost generally requires running elevated
            Process, Close, %PID%
            sleep, 2000
            ; Process, Close sets ErrorLevel to the PID on success, 0 on failure
            if ErrorLevel = %PID%
                FileAppend,
(
Killed, %PID% , %CPU% , %MEM%, %TIME% `r`n
), log.csv
            else
                FileAppend,
(
Failed, %PID% , %CPU% , %MEM%, %TIME% `r`n
), log.csv
        }
    }
}
My code works fine at identifying when svchost has exceeded the amount of memory it should be using, but it fails to kill it. My log is full of entries like this:
Failed 624 0 1036 11:15 PM Wednesday May 13 2015
Failed 7408 68 65 12:36 AM Thursday May 14 2015
Failed 7408 92 121 12:37 AM Thursday May 14 2015
Failed 7408 80 142 12:39 AM Thursday May 14 2015
Failed 7408 55 176 12:39 AM Thursday May 14 2015
Failed 7408 99 149 12:46 AM Thursday May 14 2015
Failed 7408 80 150 12:53 AM Thursday May 14 2015
Can someone help me with this?
Should I use run + taskkill instead?
Or is there a WMI command I can use?
Thanks.
Killing svchost.exe (the service host process) is probably a bad idea. An instance of svchost usually takes care of multiple services, and if you kill it, all the services running under it will stop.
You should instead try to find out which service is causing the system to become unstable, then find out which program that service belongs to, and then update or uninstall that program... or, in the worst case, stop the service.
You could also disable the service to keep it from starting automatically when Windows does.
I recommend Process Explorer from Sysinternals. Just hover over an svchost process and a tooltip will show you which services it currently hosts.
To disable or stop a service go here: Win+R -> services.msc -> Enter
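Without installing anything, running tasklist /svc /fi "imagename eq svchost.exe" in a command prompt will likewise list the services hosted by each svchost instance.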

Parsing entry name from a log

Writing bash parsing scripts is my own personal nightmare, so here I am.
The server log format is below:
197 INFO Thu Mar 27 10:10:32 2014
seq_1_1..JobControl (DSWaitForJob): Waiting for job job_1_1_1 to finish
198 INFO Thu Mar 27 10:10:36 2014
seq_1_1..JobControl (DSWaitForJob): Job job_1_1_1 has finished, status = 3 (Aborted)
199 WARNING Thu Mar 27 10:10:36 2014
seq_1_1..JobControl (#job_1_1_1): Job job_1_1_1 did not finish OK, status = 'Aborted'
From here I need to parse out the string which follows the format:
Job job_name has finished, status = 3 (Aborted)
So from the output above I should get: job_1_1_1
What would the script look like, given that this server log is the output of a command?
Thanks xx
Using grep -P:
grep -oP '\w+(?= has finished, status = 3)' file
job_1_1_1
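Here -o prints only the matched text and -P enables Perl-compatible regular expressions, so the (?=...) lookahead requires " has finished, status = 3" to follow the job name without including it in the output. Since the log comes from a command, the file argument can be replaced by a pipe (some_command | grep -oP ...). If the local grep lacks -P (it is a GNU extension), a sed sketch of the same extraction:

sed -n 's/.*Job \([^ ]*\) has finished, status = 3.*/\1/p' file

Both print job_1_1_1 for the sample log above.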
