Scylla fails to mount RAID volume after restarting EC2 instance

I am new to Scylla. I followed the installation steps on the Scylla website to set up a small 4-node Scylla cluster in my AWS account. I am using the Scylla AMI on my EC2 instances.
If I stop one of the EC2 instances and then start it again, I get the message "Failed mounting RAID volume!" when I try to restart Scylla.
I believe I have to remount the RAID volume by running this:
scylla_raid_setup --raiddev /dev/md0 --disks /dev/nvme1n1,/dev/nvme2n1 --update-fstab --root /var/lib/scylla --volume-role all
However, when I then try to start Scylla I get the following error message:
A dependency job for scylla-server.service failed. See 'journalctl -xe' for details.
It seems that the mount failed, here are the logs:
-- Subject: Unit var-lib-scylla.mount has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit var-lib-scylla.mount has failed.
--
-- The result is dependency.
Dependency failed for Scylla Server.
-- Subject: Unit scylla-server.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit scylla-server.service has failed.
--
-- The result is dependency.
May 05 13:23:56 systemd[1]: Dependency failed for Scylla JMX.
-- Subject: Unit scylla-jmx.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit scylla-jmx.service has failed.
--
-- The result is dependency.
May 05 13:23:56 systemd[1]: Job scylla-jmx.service/start failed with result 'dependency'.
May 05 13:23:56 systemd[1]: Dependency failed for Run Scylla Housekeeping daily mode.
-- Subject: Unit scylla-housekeeping-daily.timer has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit scylla-housekeeping-daily.timer has failed.
--
-- The result is dependency.
May 05 13:23:56 polkitd[4226]: Unregistered Authentication Agent for unix-process:7668:53288 (system bus name :1.20, object path /org/freedesktop/PolicyKit1/AuthenticationAge
May 05 13:23:56 systemd[1]: Job scylla-housekeeping-daily.timer/start failed with result 'dependency'.
May 05 13:23:56 sudo[7666]: pam_unix(sudo:session): session closed for user root
May 05 13:23:56 systemd[1]: Dependency failed for Run Scylla Housekeeping restart mode.
-- Subject: Unit scylla-housekeeping-restart.timer has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit scylla-housekeeping-restart.timer has failed.
--
-- The result is dependency.
May 05 13:23:56 systemd[1]: Job scylla-housekeeping-restart.timer/start failed with result 'dependency'.
May 05 13:23:56 systemd[1]: Job scylla-server.service/start failed with result 'dependency'.
May 05 13:23:56 systemd[1]: Job var-lib-scylla.mount/start failed with result 'dependency'.
May 05 13:23:56 systemd[1]: Job dev-disk-by\x2duuid-67fde517\x2d892a\x2d4a3f\x2d9e19\x2dac71c9bdd533.device/start failed with result 'timeout'.
What should my next step be?
Here are the disks:
Disk /dev/nvme1n1: 7500.0 GB, 7500000000000 bytes, 14648437500 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/nvme2n1: 7500.0 GB, 7500000000000 bytes, 14648437500 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/nvme0n1: 10.7 GB, 10737418240 bytes, 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000b0301
If I include nvme0n1 in disks for scylla_raid_setup then it returns: /dev/nvme0n1 is busy.
Otherwise, this is what scylla_raid_setup outputs:
Creating RAID0 for scylla using 2 disk(s): /dev/nvme2n1,/dev/nvme1n1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
meta-data=/dev/md0 isize=512 agcount=32, agsize=114438912 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=3662043136, imaxpct=5
= sunit=256 swidth=512 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
My /etc/fstab file looks like this:
UUID=0a84de8e-5bfe-43e7-992b-5bfff8cdce43 / xfs defaults 0 0
UUID="67fde517-892a-4a3f-9e19-ac71c9bdd533" /var/lib/scylla xfs noatime,nofail 0 0
UUID="24aab0fc-dc32-48de-bf6b-5a3d5bcd1f00" /var/lib/scylla xfs noatime,nofail 0 0
I removed one of the entries and tried restarting Scylla. But it still failed to start :(
After running systemctl start var-lib-scylla.mount:
May 06 14:18:18 ip-172-31-14-126.ec2.internal polkitd[4760]: Registered Authentication Agent for unix-process:7789:57998 (system bus name :1.34 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8)
May 06 14:19:48 ip-172-31-14-126.ec2.internal systemd[1]: Job dev-disk-by\x2duuid-17c356e1\x2d1ec9\x2d47d1\x2d8e98\x2d45182b7a9454.device/start timed out.
May 06 14:19:48 ip-172-31-14-126.ec2.internal systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-17c356e1\x2d1ec9\x2d47d1\x2d8e98\x2d45182b7a9454.device.
-- Subject: Unit dev-disk-by\x2duuid-17c356e1\x2d1ec9\x2d47d1\x2d8e98\x2d45182b7a9454.device has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit dev-disk-by\x2duuid-17c356e1\x2d1ec9\x2d47d1\x2d8e98\x2d45182b7a9454.device has failed.
--
-- The result is timeout.
May 06 14:19:48 ip-172-31-14-126.ec2.internal systemd[1]: Dependency failed for /var/lib/scylla.
-- Subject: Unit var-lib-scylla.mount has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit var-lib-scylla.mount has failed.
--
-- The result is dependency.
May 06 14:19:48 systemd[1]: Job var-lib-scylla.mount/start failed with result 'dependency'.
May 06 14:19:48 systemd[1]: Job dev-disk-by\x2duuid-17c356e1\x2d1ec9\x2d47d1\x2d8e98\x2d45182b7a9454.device/start failed with result 'timeout'.
May 06 14:19:48 polkitd[4760]: Unregistered Authentication Agent for unix-process:7789:57998 (system bus name :1.34, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8) (disconnected from bus)
May 06 14:19:48 sudo[7787]: pam_unix(sudo:session): session closed for user root

Here are the steps you can try:
1) List all the disks
$ fdisk -l
2) Recreate the RAID
$ sudo /usr/lib/scylla/scylla_raid_setup --disks <comma-separated list of all the disks you want in the RAID volume, e.g. /dev/nvme2n1,/dev/nvme3n1,/dev/nvme0n1,/dev/nvme1n1> --raiddev /dev/md0 --update-fstab --root /var/lib/scylla --volume-role all
(Alternative approach)
$ udevadm settle
$ mdadm --create --verbose --force --run /dev/md0 --level=0 -c1024 --raid-devices=<NUMBER OF DISKS> /dev/nvme0n1 ... <REMAINING DISKS, SPACE-SEPARATED>
$ udevadm settle
3) Format the RAID0 device with XFS (this erases any data on it)
$ mkfs.xfs /dev/md0 -f -K
4) Clear the old entry from fstab
$ vi /etc/fstab ## delete the /var/lib/scylla line
5) Add the new line to fstab
$ echo "`blkid /dev/md0 | awk '{print $2}'` /var/lib/scylla xfs noatime 0 0" >> /etc/fstab
6) Reload the daemon
$ systemctl daemon-reload
7) Mount the file-system
$ systemctl start var-lib-scylla.mount
8) Recreate the directories
$ mkdir -p "/var/lib/scylla/data"
$ mkdir -p "/var/lib/scylla/commitlog"
$ mkdir -p "/var/lib/scylla/hints"
$ mkdir -p "/var/lib/scylla/coredump"
9) Change the permissions
$ chown -R scylla:scylla "/var/lib/scylla"
10) Start Scylla
$ systemctl start scylla-server
Let me know if you run into issues...
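Step 5 above relies on picking the UUID field out of the blkid output. As a rough illustration of what that one-liner produces, here is a minimal Python sketch. The function name and the sample blkid line are assumptions made for the example (the UUID is simply one of those shown in the question's /etc/fstab), not output captured from the real machine.

```python
import re


def fstab_line_from_blkid(blkid_output, mount_point="/var/lib/scylla",
                          fs_type="xfs", options="noatime"):
    """Build an /etc/fstab line from one line of `blkid <device>` output."""
    # \b keeps us from matching the tail of PARTUUID="..." by accident.
    match = re.search(r'\bUUID="([^"]+)"', blkid_output)
    if match is None:
        raise ValueError("no UUID field found in blkid output")
    return f'UUID="{match.group(1)}" {mount_point} {fs_type} {options} 0 0'


# Assumed sample of what `blkid /dev/md0` prints; the UUID is illustrative.
SAMPLE = '/dev/md0: UUID="24aab0fc-dc32-48de-bf6b-5a3d5bcd1f00" TYPE="xfs"'

print(fstab_line_from_blkid(SAMPLE))
```

The point of the sketch is only that the UUID written to /etc/fstab has to be the one the freshly created /dev/md0 reports now, not one left over from an earlier array.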

You should probably check the contents of /etc/fstab and see whether you have two (or more) entries for Scylla (/var/lib/scylla). If you do, that's probably the cause of the mount failure; there should be only one entry.
If there is more than one entry, or no entry at all, for Scylla in /etc/fstab, the Scylla service will fail to start, and that's the error you see in the logs.
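The duplicate-entry check described above can also be scripted. The following is a small sketch, not a Scylla tool: the function name is made up for illustration, and the sample text is the /etc/fstab content quoted in the question.

```python
from collections import defaultdict


def duplicate_mount_points(fstab_text):
    """Return {mount_point: [device specs]} for mount points listed more than once."""
    seen = defaultdict(list)
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # not a well-formed fstab entry
        spec, mount_point = fields[0], fields[1]
        if mount_point in ("none", "swap"):
            continue  # swap entries legitimately share these placeholders
        seen[mount_point].append(spec)
    return {mp: specs for mp, specs in seen.items() if len(specs) > 1}


# The fstab content from the question.
FSTAB = """\
UUID=0a84de8e-5bfe-43e7-992b-5bfff8cdce43 / xfs defaults 0 0
UUID="67fde517-892a-4a3f-9e19-ac71c9bdd533" /var/lib/scylla xfs noatime,nofail 0 0
UUID="24aab0fc-dc32-48de-bf6b-5a3d5bcd1f00" /var/lib/scylla xfs noatime,nofail 0 0
"""

for mount_point, specs in duplicate_mount_points(FSTAB).items():
    print(mount_point, "is listed", len(specs), "times:", ", ".join(specs))
```

On a real host you would pass in the contents of /etc/fstab instead of the FSTAB constant.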

Memory builds up over time on Kubernetes pod, leaving the JVM unable to start

We are running a Kubernetes environment and we have a pod that is encountering memory issues. The pod runs only a single container, and this container is responsible for running various utility jobs throughout the day.
The issue is that this pod's memory usage grows slowly over time. There is a 6 GB memory limit for this pod, and eventually, the memory consumption grows very close to 6GB.
A lot of our utility jobs are written in Java, and when the JVM spins up for them, they require -Xms256m in order to start. Yet, since the pod's memory is growing over time, eventually it gets to the point where there isn't 256MB free to start the JVM, and the Linux oom-killer kills the java process. Here is what I see from dmesg when this occurs:
[Thu Feb 18 17:43:13 2021] Memory cgroup stats for /kubepods/burstable/pod4f5d9d31-71c5-11eb-a98c-023a5ae8b224/921550be41cd797d9a32ed7673fb29ea8c48dc002a4df63638520fd7df7cf3f9: cache:8KB rss:119180KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:119132KB inactive_file:8KB active_file:0KB unevictable:4KB
[Thu Feb 18 17:43:13 2021] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Thu Feb 18 17:43:13 2021] [ 5579] 0 5579 253 1 4 0 -998 pause
[Thu Feb 18 17:43:13 2021] [ 5737] 0 5737 3815 439 12 0 907 entrypoint.sh
[Thu Feb 18 17:43:13 2021] [13411] 0 13411 1952 155 9 0 907 tail
[Thu Feb 18 17:43:13 2021] [28363] 0 28363 3814 431 13 0 907 dataextract.sh
[Thu Feb 18 17:43:14 2021] [28401] 0 28401 768177 32228 152 0 907 java
[Thu Feb 18 17:43:14 2021] Memory cgroup out of memory: Kill process 28471 (Finalizer threa) score 928 or sacrifice child
[Thu Feb 18 17:43:14 2021] Killed process 28401 (java), UID 0, total-vm:3072708kB, anon-rss:116856kB, file-rss:12056kB, shmem-rss:0kB
Based on the research I've been doing, it seems normal for memory consumption on Linux to grow over time as various caches grow. From what I understand, cached memory should also be freed when new processes (such as my java process) begin to run.
My main question is: should this pod's memory be getting freed in order for these java processes to run? If so, are there any steps I can take to begin to debug why this may not be happening correctly?
Aside from this concern, I've also been trying to track down what is responsible for the growing memory in the first place. I was able to narrow it down to a certain job that runs every 15 minutes. I noticed that every time it ran, used memory for the pod grew by ~0.1 GB.
I was able to figure this out by running this command (inside the container) before and after each execution of the job:
cat /sys/fs/cgroup/memory/memory.usage_in_bytes | numfmt --to si
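Alongside the total from memory.usage_in_bytes, the cgroup v1 memory controller also exposes a memory.stat file in the same directory, which splits usage into page cache ("cache") and anonymous memory ("rss"). A minimal sketch of reading it follows. The helper names are made up for illustration, and the sample text is an assumption: it only reuses the cache and rss figures from the dmesg line above, converted from KB to bytes.

```python
def parse_memory_stat(text):
    """Parse cgroup v1 memory.stat text ("<name> <bytes>" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def cache_share(stats):
    """Fraction of (cache + rss) that is page cache, which the kernel can reclaim."""
    cache = stats.get("cache", 0)
    rss = stats.get("rss", 0)
    total = cache + rss
    return cache / total if total else 0.0


# Assumed sample: cache 8 KB and rss 119180 KB, as in the dmesg output above.
SAMPLE_STAT = "cache 8192\nrss 122040320\nswap 0\n"

stats = parse_memory_stat(SAMPLE_STAT)
print(f"cache={stats['cache']} rss={stats['rss']} cache share={cache_share(stats):.4%}")
```

Inside the container the real input would be the contents of /sys/fs/cgroup/memory/memory.stat. A large cache share would suggest reclaimable memory, while growth in rss would point at processes holding on to memory.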
From there I narrowed down the piece of bash code from which the memory seems to consistently grow. That code looks like this:
while [ "z${_STATUS}" != "z0" ]
do
    RES=`$CURL -X GET "${TS_URL}/wcs/resources/admin/index/dataImport/status?jobStatusId=${JOB_ID}"`
    _STATUS=`echo $RES | jq -r '.status.status' || exit 1`
    PROGRES=`echo $RES | jq -r '.status.progress' || exit 1`
    [ "x$_STATUS" == "x1" ] && exit 1
    [ "x$_STATUS" == "x3" ] && exit 3
    [ $CNT -gt 10 ] && PrintLog "WC Job ($JOB_ID) Progress: $PROGRES Status: $_STATUS " && CNT=0
    sleep 10
    ((CNT++))
done
[ "z${_STATUS}" == "z0" ] && STATUS=Success || STATUS=Failed
This piece of code seems innocuous to me at first glance, so I do not know where to go from here.
I would really appreciate any help, I've been trying to get to the bottom of this issue for days now.
I did eventually get to the bottom of this, so I figured I'd post my solution here. I mentioned in my original post that I had narrowed the issue down to the while loop posted above in my question. Each time the job in question ran, that while loop would iterate maybe 10 times. After the while loop completed, I noticed that utilized memory had increased by about 100MB, pretty consistently.
I had a hunch that the CURL command within the loop could be the culprit. And in fact, it did turn out that CURL was eating up my memory and not releasing it, for whatever reason. Instead of looping and running the following CURL command:
RES=`$CURL -X GET "${TS_URL}/wcs/resources/admin/index/dataImport/status?jobStatusId=${JOB_ID}"`
I replaced this command with a simple Python script that uses the requests module to check our job statuses instead.
I am still not sure why CURL was the culprit in this case. After running CURL --version, it appears that the underlying library being used is libcurl/7.29.0. Maybe there is a bug in that library version that causes memory management issues, but that is just a guess.
In any case, switching from CURL to Python's requests module has resolved my issue.
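The replacement script itself is not shown, so the following is only a sketch of what such a polling loop might look like. It assumes the status endpoint returns JSON shaped the way the jq filters in the question imply (a "status" object with "status" and "progress" fields), and it uses the standard library's urllib rather than requests so that it runs without extra packages. All function names are made up for illustration.

```python
import json
import time
import urllib.request


def fetch_status(url):
    """GET the status URL and return the inner "status" object of the JSON reply."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["status"]


def wait_for_job(fetch, poll_seconds=10, sleep=time.sleep):
    """Poll until the job reports status "0", mirroring the bash loop.

    Status "1" or "3" is treated as failure, as in the original script,
    but is raised as an exception instead of calling exit.
    """
    while True:
        status = fetch()
        code = str(status["status"])
        if code == "0":
            return status
        if code in ("1", "3"):
            raise RuntimeError(f"job failed with status {code}")
        sleep(poll_seconds)
```

A caller would build the URL from its own equivalents of TS_URL and JOB_ID and run something like `wait_for_job(lambda: fetch_status(url))`. Passing the fetch function in as a parameter keeps the loop testable without a server.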

How do systemd journal cursors work?

I'm having trouble with systemd journal cursors.
If I SeekTail(), I get a value for the cursor and can keep calling Next() and it behaves exactly as expected.
However, if I call SeekCursor() and then Next(), it jumps back to the head and starts reading all over again. Why would it do that? I can verify that it located the cursor correctly, but it's as though SeekCursor only worked for that specific entry and nothing after it. This is not what I would expect from reading the man pages and other documentation.
I'm using go-systemd from the CoreOS project, which is a simple wrapper for the systemd C API.
But the Go wrapper is not the issue; the C library is. I can see that journalctl does the same thing on Ubuntu.
For example: append to the journal, show the tail output, get the full entry detail in JSON, then jump to the cursor and follow from there:
matthewh@xen:~$ echo "Cursor example" | systemd-cat
matthewh@xen:~$ journalctl -f
-- Logs begin at Mon 2017-07-03 08:56:12 NZST. --
May 31 17:50:31 xen code.desktop[6771]: [main 17:50:31] update#setState idle
May 31 17:55:01 xen CRON[4468]: pam_unix(cron:session): session opened for user root by (uid=0)
May 31 17:55:01 xen CRON[4469]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 31 17:55:01 xen CRON[4468]: pam_unix(cron:session): session closed for user root
May 31 17:55:03 xen smokeping[2839]: RRDs::update ERROR: /var/lib/smokeping/Local/LocalMachine.rrd: illegal attempt to update using time 1527746103 when last update time is 4073847643 (minimum one second step)
May 31 17:55:22 xen cat[4479]: Hello
May 31 17:59:28 xen cat[4539]: Cursor example
May 31 18:00:03 xen smokeping[2839]: RRDs::update ERROR: /var/lib/smokeping/Local/LocalMachine.rrd: illegal attempt to update using time 1527746403 when last update time is 4073847643 (minimum one second step)
May 31 18:00:06 xen cat[4547]: Cursor example
May 31 18:01:09 xen cat[4597]: Cursor example
^C
matthewh@xen:~$ journalctl -f -o json-pretty -n1
{
"__CURSOR" : "s=b7f2a0f19c9946abab26788729a244c5;i=52a5;b=1ba1d5cabb5840adb02eedc4aba5b4d6;m=2d96b77f94;t=56d7a319ee462;x=8afac4ada39ae1fb",
"__REALTIME_TIMESTAMP" : "1527746469487714",
"__MONOTONIC_TIMESTAMP" : "195802136468",
"_BOOT_ID" : "1ba1d5cabb5840adb02eedc4aba5b4d6",
"_UID" : "1000",
"_GID" : "1000",
"_CAP_EFFECTIVE" : "0",
"_MACHINE_ID" : "f899a862e4aa4775b8995564d8da565d",
"_HOSTNAME" : "xen",
"_TRANSPORT" : "stdout",
"PRIORITY" : "6",
"_COMM" : "cat",
"MESSAGE" : "Cursor example",
"_STREAM_ID" : "d1fbcc3ff027401e9dc95b5648f9322e",
"_PID" : "4597"
}
^C
matthewh@xen:~$ journalctl -f --cursor="s=b7f2a0f19c9946abab26788729a244c5;i=52a5;b=1ba1d5cabb5840adb02eedc4aba5b4d6;m=2d96b77f94;t=56d7a319ee462;x=8afac4ada39ae1fb"
-- Logs begin at Mon 2017-07-03 08:56:12 NZST. --
May 31 18:01:09 xen cat[4597]: Cursor example
-- Reboot --
Feb 04 13:03:03 xen systemd-journald[420]: Runtime journal (/run/log/journal/) is 8.0M, max 241.0M, 233.0M free.
Feb 04 13:03:03 xen kernel: Initializing cgroup subsys cpuset
Feb 04 13:03:03 xen kernel: Initializing cgroup subsys cpu
Feb 04 13:03:03 xen kernel: Initializing cgroup subsys cpuacct
Feb 04 13:03:03 xen kernel: Linux version 4.4.0-116-generic (buildd@lgw01-amd64-021) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9) ) #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 (Ubuntu 4.4.0-116.140-generic 4.4.98)
Feb 04 13:03:03 xen kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-116-generic root=UUID=f95a581f-2afb-4428-bade-c913f1c51741 ro quiet splash vt.handoff=7
Feb 04 13:03:03 xen kernel: KERNEL supported cpus:
Feb 04 13:03:03 xen kernel: Intel GenuineIntel
Feb 04 13:03:03 xen kernel: AMD AuthenticAMD
^C
Note the "-- Reboot --" text and the fact that it jumped back to entries from months earlier. But before that, it printed my systemd-cat entry, so the cursor itself was found.
What am I doing wrong? Is it a bug or an oversight on my part?
Oddly enough, I have a CoreOS server that I was able to test this on, and it behaves differently: it behaves as expected. The version of journalctl is the same on both, and all the configuration is untouched stock standard.
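When debugging this, it can help to confirm which entry a cursor string actually encodes. The cursor is documented as opaque, so the field meanings below (boot ID in b=, hexadecimal monotonic and realtime timestamps in m= and t=) are an assumption based on how current systemd versions format it, and may change. This sketch does not explain the seek behaviour; it only decodes the cursor from the question so it can be compared with the JSON fields of the same entry.

```python
from datetime import datetime, timezone

# Assumed meaning of each cursor key; not a documented, stable format.
FIELD_NAMES = {
    "s": "seqnum_id",
    "i": "seqnum",
    "b": "boot_id",
    "m": "monotonic_usec",
    "t": "realtime_usec",
    "x": "xor_hash",
}
HEX_INTEGER_KEYS = ("i", "m", "t")


def parse_cursor(cursor):
    """Split a journal cursor into named fields, decoding the hex counters."""
    parsed = {}
    for part in cursor.split(";"):
        key, value = part.split("=", 1)
        name = FIELD_NAMES.get(key, key)
        parsed[name] = int(value, 16) if key in HEX_INTEGER_KEYS else value
    return parsed


# The cursor from the question.
CURSOR = ("s=b7f2a0f19c9946abab26788729a244c5;i=52a5;"
          "b=1ba1d5cabb5840adb02eedc4aba5b4d6;m=2d96b77f94;"
          "t=56d7a319ee462;x=8afac4ada39ae1fb")

entry = parse_cursor(CURSOR)
when = datetime.fromtimestamp(entry["realtime_usec"] / 1_000_000, tz=timezone.utc)
print("boot id: ", entry["boot_id"])
print("realtime:", entry["realtime_usec"], "->", when.isoformat())
print("monotonic:", entry["monotonic_usec"])
```

The decoded realtime, monotonic and boot ID values can be checked against the __REALTIME_TIMESTAMP, __MONOTONIC_TIMESTAMP and _BOOT_ID fields in the json-pretty output above.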

How to do server monitoring? [closed]

Having no experience in DevOps, I've just been given a project where I have to do the whole thing.
So, how do I keep an eye on disk usage, memory, database space and access time, API response times, etc.?
It's practically impossible for any admin to keep an eye on running processes at all times; this is where server monitoring comes in handy.
Try Monit. It can be easily installed with:
apt-get install monit -y
To configure monitoring, edit:
nano /etc/monit/monitrc
Use the example config to configure what you would like to monitor. The status is accessible over HTTP or HTTPS as well, but you don't really need to access it, because Monit will alert you if anything goes wrong on your server. For example, you will get an email if memory consumption rises above what you specified in the config file above, if the CPU is overloaded, or if a certain website is down.
Let's dig into it a little bit.
Type monit status to get output like the following:
The Monit daemon 5.3.2 uptime: 1h 32m
System 'myhost.mydomain.tld'
status Running
monitoring status Monitored
load average [0.03] [0.14] [0.20]
cpu 3.5%us 5.9%sy 0.0%wa
memory usage 26100 kB [10.4%]
swap usage 0 kB [0.0%]
data collected Thu, 30 Aug 2017 18:35:00
You can monitor virtually anything: Apache, nginx, MySQL, disks, processes, etc.
Sample monit status:
File 'mysql_bin'
status Accessible
monitoring status Monitored
permission 755
uid 0
gid 0
timestamp Fri, 05 May 2017 22:33:39
size 16097088 B
checksum 6d7b5ffd8563f8ad44dde35ae4b8bd52 (MD5)
data collected Mon, 28 Aug 2017 06:21:02
File 'apache_rc'
status Accessible
monitoring status Monitored
permission 755
uid 0
gid 0
timestamp Fri, 05 May 2017 11:21:22
size 9974 B
checksum 55b2bc7ce5e4a0835877dbfd98c2646b (MD5)
data collected Mon, 28 Aug 2017 06:21:02
Filesystem 'Server01'
status Accessible
monitoring status Monitored
permission 660
uid 0
gid 6
filesystem flags 0x1000
block size 4096 B
blocks total 5006559 [19556.9 MB]
blocks free for non superuser 2615570 [10217.1 MB] [52.2%]
blocks free total 2875653 [11233.0 MB] [57.4%]
inodes total 1281120
inodes free 1085516 [84.7%]
data collected Mon, 28 Aug 2017 06:23:02
Filesystem 'Media'
status Accessible
monitoring status Monitored
permission 660
uid 0
gid 6
filesystem flags 0x1000
block size 4096 B
blocks total 4414923 [17245.8 MB]
blocks free for non superuser 3454811 [13495.4 MB] [78.3%]
blocks free total 3684839 [14393.9 MB] [83.5%]
inodes total 1130496
inodes free 1130384 [100.0%]
data collected Mon, 28 Aug 2017 06:23:02
System 'mywebsite.com'
status Resource limit matched
monitoring status Monitored
load average [0.01] [0.10] [0.61]
cpu 2.7%us 0.2%sy 0.0%wa
memory usage 1150372 kB [28.5%]
swap usage 184356 kB [35.2%]
data collected Mon, 28 Aug 2017 06:21:02
Set up alerts!
Don't forget that you will receive an email alert for every rule you configured, e.g. when your website "mywebsite" is down, when free disk space drops below 20%, on disk failure, when CPU usage exceeds x%, etc.
Install Monit and check its manual with man monit.
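To make a rule like "free disk space below 20%" concrete, here is a minimal Python sketch of the check such a rule boils down to. It is not how Monit implements it; the function names and the 20% default are illustrative assumptions. The sample figures are the block counts for filesystem 'Server01' in the status output above.

```python
import shutil


def free_percent(total, free):
    """Free space as a percentage of the total (0.0 when total is 0)."""
    return 100.0 * free / total if total else 0.0


def low_on_space(path="/", threshold=20.0):
    """True when the filesystem holding `path` has less than `threshold` % free."""
    usage = shutil.disk_usage(path)
    return free_percent(usage.total, usage.free) < threshold


# Block counts reported for filesystem 'Server01' in the status output above.
print(f"Server01 free: {free_percent(5006559, 2875653):.1f}%")
print("root filesystem low on space:", low_on_space("/"))
```

A monitoring tool repeats this kind of check on a schedule and sends the alert for you, which is the part worth not reinventing.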
You can use Windows Performance Analyzer. Xperf is also helpful.
Here is the link:
https://msdn.microsoft.com/en-us/library/windows/hardware/hh162945.aspx
#!/bin/sh
# Regenerates a plain HTML status page every 100 seconds.
file="/var/www/html/index.html"
linebreak="--------------------------------------------------------------------------------------------"
while true
do
    echo "<html>" > "$file"
    echo "<head>" >> "$file"
    # Single quotes keep the attribute quotes; the browser reloads every 100 seconds
    echo '<meta http-equiv="refresh" content="100">' >> "$file"
    echo "</head>" >> "$file"
    echo "<body>" >> "$file"
    echo "<pre>" >> "$file"
    date >> "$file"
    echo "$linebreak" >> "$file"
    uptime >> "$file"
    echo "$linebreak" >> "$file"
    # Line 3 of top's batch output is the CPU summary
    top -b -n1 -u nobody | sed -n '3p' >> "$file"
    echo "$linebreak" >> "$file"
    free -m >> "$file"
    echo "$linebreak" >> "$file"
    df -h >> "$file"
    echo "$linebreak" >> "$file"
    iptables -nL >> "$file"
    echo "$linebreak" >> "$file"
    echo "</pre>" >> "$file"
    echo "</body>" >> "$file"
    echo "</html>" >> "$file"
    sleep 100
done
I use this script to monitor information such as temperature, disk usage, RAM, the firewall and so on.
It writes the results to the index page of an Apache server, so I can open the server's homepage and see everything.
The script regenerates the results every 100 seconds, and the web page refreshes itself every 100 seconds too.
With this script and Apache you can monitor the server from anywhere in the world, from a mobile device or a PC.
Mo 28. Aug 14:36:03 CEST 2017
--------------------------------------------------------------------------------------------
14:36:03 up 1:34, 4 users, load average: 0,10, 0,09, 0,11
--------------------------------------------------------------------------------------------
%Cpu(s): 14,8 us, 1,6 sy, 0,7 ni, 82,2 id, 0,5 wa, 0,0 hi, 0,1 si, 0,0 st
--------------------------------------------------------------------------------------------
total used free shared buff/cache available
Mem: 3949 1027 756 74 2165 2542
Swap: 4093 0 4093
--------------------------------------------------------------------------------------------
Filesystem Size Used Avail Use% Mounted on
udev 2,0G 0 2,0G 0% /dev
tmpfs 395M 6,0M 389M 2% /run
/dev/sda1 21G 6,2G 14G 32% /
tmpfs 2,0G 43M 1,9G 3% /dev/shm
tmpfs 5,0M 4,0K 5,0M 1% /run/lock
tmpfs 2,0G 0 2,0G 0% /sys/fs/cgroup
Sharepoint 476G 300G 176G 64% /media/sf_Sharepoint
tmpfs 395M 92K 395M 1% /run/user/1000
--------------------------------------------------------------------------------------------
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
--------------------------------------------------------------------------------------------

Monit: 'Matching' functionality isn't working

I have a process that's kicked off from a custom script. The process does not end with '.pid' so I am trying to use 'matching'. However it seems to be breaking on the whitespace (just stops after 'bin/bash'), no matter how I format the command. The commands themselves do work fine, outside of monit.
Here is what I am trying to use:
check process example_process matching "example_process"
  start program = "/bin/bash -c 'nohup /mnt1/path/to/custom/bin/run.sh &'"
  stop program = "/usr/bin/killall example_process"
  if cpu > 80% for 2 cycles then alert
  if cpu > 95% for 5 cycles then restart
  if totalmem > 500.0 MB for 5 cycles then restart
  if children > 3 then restart
Errors logged:
[UTC Jun 18 02:01:46] info : 'system_ip-10-0-11-189' Monit started
[UTC Jun 18 02:01:46] error : 'example_process' process is not running
[UTC Jun 18 02:01:46] info : 'example_process' trying to restart
[UTC Jun 18 02:01:46] info : 'example_process' start: /bin/bash
[UTC Jun 18 02:02:16] error : 'example_process' failed to start

My command-line program does not find its config directory on CMD shell, but it does on PowerShell

In my script I am using a command-line program that does not find its config directory when run from the CMD shell, but does when run from PowerShell.
Although this question may look specific to one program (a command-line hash analysis tool named OCLHashCat), I think it is mostly a matter of Windows shell behavior and/or the variables involved. Let me explain.
This is the contents of the command line program's directory (OCLHashCat):
d:\Programas\HashCat\OCLHashCat>dir
El volumen de la unidad D es Datos
El número de serie del volumen es: 57E9-ACA0
Directorio de d:\Programas\HashCat\OCLHashCat
07/10/2014 09:28 am <DIR> .
07/10/2014 09:28 am <DIR> ..
06/10/2014 11:56 pm <DIR> charsets
06/10/2014 11:56 pm <DIR> docs
06/10/2014 11:57 pm 4 eula.accepted
02/10/2014 12:11 pm 1.210.228 example.dict
02/10/2014 12:11 pm 220.796 example0.hash
02/10/2014 12:11 pm 36 example400.hash
02/10/2014 12:11 pm 36 example500.hash
06/10/2014 11:56 pm <DIR> extra
02/10/2014 12:11 pm 33.685.504 hashcat.hcstat
06/10/2014 11:56 pm <DIR> kernels
06/10/2014 11:56 pm <DIR> masks
02/10/2014 12:11 pm 72 oclExample0.cmd
02/10/2014 12:11 pm 66 oclExample0.sh
02/10/2014 12:11 pm 68 oclExample400.cmd
02/10/2014 12:11 pm 61 oclExample400.sh
02/10/2014 12:11 pm 61 oclExample500.cmd
02/10/2014 12:11 pm 55 oclExample500.sh
15/11/2014 11:46 pm 128 oclHashcat.dictstat
07/10/2014 02:52 am 11.448 oclHashcat.log
07/10/2014 02:52 am <DIR> oclHashcat.outfiles
07/10/2014 02:03 am 0 oclHashcat.pot
07/10/2014 09:28 am 400 oclHashcat.restore
02/10/2014 12:11 pm 388.744 oclHashcat32.bin
02/10/2014 12:11 pm 419.840 oclHashcat32.exe
02/10/2014 12:11 pm 383.136 oclHashcat64.bin
02/10/2014 12:11 pm 432.128 oclHashcat64.exe
06/10/2014 11:56 pm <DIR> rules
As you can see, the directory kernels is right there.
And the home dir of OCLHashCat is in my path:
C:\>oclhashcat64
oclHashcat v1.31 starting...
Usage: oclhashcat64 [options]... hash|hashfile|hccapfile [dictionary|mask|directory]...
Try --help for more help.
But if I try to run, it can not find some of its own files/directories:
C:\Temporal>oclhashcat64 Test.hccap -m 2500 -a 3 ?d?d?d?d?d?d?d?d
oclHashcat v1.31 starting...
Device #1: Bonaire, 1024MB, 1050Mhz, 12MCU
Device #2: Tahiti, 3072MB, 900Mhz, 28MCU
Hashes: 1 hashes; 1 unique digests, 1 unique salts
Bitmaps: 8 bits, 256 entries, 0x000000ff mask, 1024 bytes
Applicable Optimizers:
* Zero-Byte
* Single-Hash
* Single-Salt
* Brute-Force
Watchdog: Temperature abort trigger set to 90c
Watchdog: Temperature retain trigger set to 80c
Device #1: Kernel ./kernels/4098/m02500.Bonaire_1573.4_1573.4 (VM).kernel not found in cache! Building may take a while...
ERROR: ./kernels/4098/m02500.VLIW1.llvmir: No such file or directory
Note the final error: it can not find a file, but I have checked that such file exists:
C:\Temporal>dir d:\Programas\HashCat\OCLHashCat\kernels\4098\m02500.VLIW1.llvmir
El volumen de la unidad D es Datos
El número de serie del volumen es: 57E9-ACA0
Directorio de d:\Programas\HashCat\OCLHashCat\kernels\4098
02/10/2014 12:11 pm 326.912 m02500.VLIW1.llvmir
1 archivos 326.912 bytes
And, if I CHDir into the program's directory:
d:\Programas\HashCat\OCLHashCat>oclhashcat64 -m 2500 "c:\Temporal\Test.hccap" -a 3 ?d?d?d?d?d?d?d?d
oclHashcat v1.31 starting...
Device #1: Bonaire, 1024MB, 1050Mhz, 12MCU
Device #2: Tahiti, 3072MB, 900Mhz, 28MCU
Hashes: 1 hashes; 1 unique digests, 1 unique salts
Bitmaps: 8 bits, 256 entries, 0x000000ff mask, 1024 bytes
Applicable Optimizers:
* Zero-Byte
* Single-Hash
* Single-Salt
* Brute-Force
Watchdog: Temperature abort trigger set to 90c
Watchdog: Temperature retain trigger set to 80c
Device #1: Kernel ./kernels/4098/m02500.Bonaire_1573.4_1573.4 (VM).kernel (259320 bytes)
Device #1: Kernel ./kernels/4098/markov_le_v1.Bonaire_1573.4_1573.4 (VM).kernel (92404 bytes)
Device #1: Kernel ./kernels/4098/bzero.Bonaire_1573.4_1573.4 (VM).kernel (30496 bytes)
Device #2: Kernel ./kernels/4098/m02500.Tahiti_1573.4_1573.4 (VM).kernel (259428 bytes)
Device #2: Kernel ./kernels/4098/markov_le_v1.Tahiti_1573.4_1573.4 (VM).kernel (92388 bytes)
Device #2: Kernel ./kernels/4098/bzero.Tahiti_1573.4_1573.4 (VM).kernel (30492 bytes)
[s]tatus [p]ause [r]esume [b]ypass [q]uit =>
That is: everything works like a charm.
In PowerShell everything works perfectly, wherever I call the program from. Example:
PS C:\Temporal> oclHashcat64.exe Test.hccap -m 2500 -a 3 ?d?d?d?d?d?d?d?d
oclHashcat v1.31 starting...
Device #1: Bonaire, 1024MB, 1050Mhz, 12MCU
Device #2: Tahiti, 3072MB, 900Mhz, 28MCU
Hashes: 1 hashes; 1 unique digests, 1 unique salts
Bitmaps: 8 bits, 256 entries, 0x000000ff mask, 1024 bytes
Applicable Optimizers:
* Zero-Byte
* Single-Hash
* Single-Salt
* Brute-Force
Watchdog: Temperature abort trigger set to 90c
Watchdog: Temperature retain trigger set to 80c
Device #1: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/m02500.Bonaire_1573.4_1573.4 (VM).kernel (259320 bytes)
Device #1: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/markov_le_v1.Bonaire_1573.4_1573.4 (VM).kernel (92404 bytes)
Device #1: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/bzero.Bonaire_1573.4_1573.4 (VM).kernel (30496 bytes)
Device #2: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/m02500.Tahiti_1573.4_1573.4 (VM).kernel (259428 bytes)
Device #2: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/markov_le_v1.Tahiti_1573.4_1573.4 (VM).kernel (92388 bytes)
Device #2: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/bzero.Tahiti_1573.4_1573.4 (VM).kernel (30492 bytes)
[s]tatus [p]ause [r]esume [b]ypass [q]uit =>
However, there are some reasons why I need the classic CMD shell, such as usage with Cygwin (which should allow me to use GNU Screen, but that is another matter).
I think the problem comes from that ./kernels... reference, which makes the command-line program (OCLHashCat) search for the directory relative to the directory I am running it from, instead of relative to its own location (the program's directory tree).
Could anyone please give me some ideas to try?
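The difference between the two lookups can be reproduced in a few lines. This is only an illustration of the general mechanism (a relative path is resolved against the current working directory unless the program joins it to its own install directory). It is not OCLHashCat's code and does not explain why PowerShell behaves differently; the helper names and the temporary directories are made up for the demo.

```python
import os
import tempfile


def resolve(relative, base_dir):
    """Resolve `relative` against an explicit base directory instead of the cwd."""
    return os.path.normpath(os.path.join(base_dir, relative))


def demo():
    """Return (found via cwd, found via install dir) for a ./kernels lookup."""
    with tempfile.TemporaryDirectory() as install, tempfile.TemporaryDirectory() as work:
        # "install" stands in for the program directory and contains kernels/
        os.mkdir(os.path.join(install, "kernels"))
        previous = os.getcwd()
        os.chdir(work)  # run "from somewhere else", like C:\Temporal
        try:
            found_from_cwd = os.path.isdir("./kernels")
            found_from_install = os.path.isdir(resolve("./kernels", install))
        finally:
            os.chdir(previous)
    return found_from_cwd, found_from_install


print(demo())
```

The first value corresponds to what happens when the lookup uses the working directory, the second to a lookup anchored at the program's own directory, which matches the full d:\Programas\HashCat\OCLHashCat/kernels/... paths seen in the PowerShell output.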
EXTRA INFO: The program OCLHashCat has Linux and Windows versions, so it could be some compilation/programming problem or equivalent.
EXTRA INFO 2: This program has changed version 4-5 times in the last year, and I keep having this problem with it.
EXTRA INFO on request:
PS C:\Temporal> get-command oclHashcat64.exe | fl *
HelpUri :
FileVersionInfo : File: d:\Programas\HashCat\OCLHashCat\oclHashcat64.exe
InternalName:
OriginalFilename:
FileVersion:
FileDescription:
Product:
ProductVersion:
Debug: False
Patched: False
PreRelease: False
PrivateBuild: False
SpecialBuild: False
Language:
Path : d:\Programas\HashCat\OCLHashCat\oclHashcat64.exe
Extension : .exe
Definition : d:\Programas\HashCat\OCLHashCat\oclHashcat64.exe
Visibility : Public
OutputType : {System.String}
Name : oclHashcat64.exe
CommandType : Application
ModuleName :
Module :
Parameters :
ParameterSets :
So, the path in PowerShell seems correct.
EXTRA INFO about SSH: By SSHing into my computer (Windows 7 SP1 running Bitvise SSH Server) the behavior is exactly the same. It doesn't work for standard shell:
login as: Luis-
Luis-#Windu-'s password:
Microsoft Windows [Versión 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. Reservados todos los derechos.
C:\Users\Luis->cd \Temporal
C:\Temporal>oclhashcat64 Test.hccap -m 2500 -a 3 ?d?d?d?d?d?d?d?d
oclHashcat v1.31 starting...
WARN: ADL_Overdrive6_FanSpeed_Get(): -5
Device #1: Bonaire, 1024MB, 1050Mhz, 12MCU
Device #2: Tahiti, 3072MB, 900Mhz, 28MCU
Hashes: 1 hashes; 1 unique digests, 1 unique salts
Bitmaps: 8 bits, 256 entries, 0x000000ff mask, 1024 bytes
Applicable Optimizers:
* Zero-Byte
* Single-Hash
* Single-Salt
* Brute-Force
Watchdog: Temperature abort trigger set to 90c
Watchdog: Temperature retain trigger set to 80c
Device #1: Kernel ./kernels/4098/m02500.Bonaire_1573.4_1573.4 (VM).kernel not found in cache! Building may take a while...
ERROR: ./kernels/4098/m02500.VLIW1.llvmir: No such file or directory
but it does work in PowerShell:
login as: Luis-
Luis-#Windu-'s password:
Microsoft Windows [Versión 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. Reservados todos los derechos.
C:\Users\Luis->powershell
Windows PowerShell
Copyright (C) 2009 Microsoft Corporation. Reservados todos los derechos.
PS C:\Users\Luis-> cd \
PS C:\> cd .\Temporal
PS C:\Temporal> oclHashcat64.exe Test.hccap -m 2500 -a 3 ?d?d?d?d?d?d?d?d
oclHashcat v1.31 starting...
WARN: ADL_Overdrive6_FanSpeed_Get(): -5
Device #1: Bonaire, 1024MB, 1050Mhz, 12MCU
Device #2: Tahiti, 3072MB, 900Mhz, 28MCU
Hashes: 1 hashes; 1 unique digests, 1 unique salts
Bitmaps: 8 bits, 256 entries, 0x000000ff mask, 1024 bytes
Applicable Optimizers:
* Zero-Byte
* Single-Hash
* Single-Salt
* Brute-Force
Watchdog: Temperature abort trigger set to 90c
Watchdog: Temperature retain trigger set to 80c
Device #1: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/m02500.Bonaire_1573.4_1573.4 (VM).kernel (259320 bytes)
Device #1: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/markov_le_v1.Bonaire_1573.4_1573.4 (VM).kernel (92404 bytes)
Device #1: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/bzero.Bonaire_1573.4_1573.4 (VM).kernel (30496 bytes)
Device #2: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/m02500.Tahiti_1573.4_1573.4 (VM).kernel (259428 bytes)
Device #2: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/markov_le_v1.Tahiti_1573.4_1573.4 (VM).kernel (92388 bytes)
Device #2: Kernel d:\Programas\HashCat\OCLHashCat/kernels/4098/bzero.Tahiti_1573.4_1573.4 (VM).kernel (30492 bytes)
[s]tatus [p]ause [r]esume [b]ypass [q]uit =>
EXTRA INFO on request:
C:\Temporal>oclhashcat64 "c:\Temporal\Test.hccap" -m 2500 -a 3 ?d?d?d?d?d?d?d?d
oclHashcat v1.31 starting...
Device #1: Bonaire, 1024MB, 1050Mhz, 12MCU
Device #2: Tahiti, 3072MB, 900Mhz, 28MCU
Hashes: 1 hashes; 1 unique digests, 1 unique salts
Bitmaps: 8 bits, 256 entries, 0x000000ff mask, 1024 bytes
Applicable Optimizers:
* Zero-Byte
* Single-Hash
* Single-Salt
* Brute-Force
Watchdog: Temperature abort trigger set to 90c
Watchdog: Temperature retain trigger set to 80c
Device #1: Kernel ./kernels/4098/m02500.Bonaire_1573.4_1573.4 (VM).kernel not found in cache! Building may take a while...
ERROR: ./kernels/4098/m02500.VLIW1.llvmir: No such file or directory
EXTRA INFO about running on CygWin:
Luis#Windu /cygdrive/c/Temporal
$ oclhashcat64 Test.hccap -m 2500 -a 3 ?d?d?d?d?d?d?d?d
oclHashcat v1.31 starting...
Device #1: Bonaire, 1024MB, 1050Mhz, 12MCU
Device #2: Tahiti, 3072MB, 900Mhz, 28MCU
Hashes: 1 hashes; 1 unique digests, 1 unique salts
Bitmaps: 8 bits, 256 entries, 0x000000ff mask, 1024 bytes
Applicable Optimizers:
* Zero-Byte
* Single-Hash
* Single-Salt
* Brute-Force
Watchdog: Temperature abort trigger set to 90c
Watchdog: Temperature retain trigger set to 80c
Device #1: Kernel D:\Programas\HashCat\OCLHashCat/kernels/4098/m02500.Bonaire_1573.4_1573.4 (VM).kernel (259320 bytes)
Device #1: Kernel D:\Programas\HashCat\OCLHashCat/kernels/4098/markov_le_v1.Bonaire_1573.4_1573.4 (VM).kernel (92404 bytes)
Device #1: Kernel D:\Programas\HashCat\OCLHashCat/kernels/4098/bzero.Bonaire_1573.4_1573.4 (VM).kernel (30496 bytes)
Device #2: Kernel D:\Programas\HashCat\OCLHashCat/kernels/4098/m02500.Tahiti_1573.4_1573.4 (VM).kernel (259428 bytes)
Device #2: Kernel D:\Programas\HashCat\OCLHashCat/kernels/4098/markov_le_v1.Tahiti_1573.4_1573.4 (VM).kernel (92388 bytes)
Device #2: Kernel D:\Programas\HashCat\OCLHashCat/kernels/4098/bzero.Tahiti_1573.4_1573.4 (VM).kernel (30492 bytes)
[s]tatus [p]ause [r]esume [b]ypass [q]uit =>
So we can say that the program works OK under CygWin. Possibly because it was originally written for Linux?
Maybe I could at least use this as some sort of workaround.
This program apparently finds its home directory from the command line instead of calling GetModuleFileName. Unlike cmd, PowerShell doesn't use the lpApplicationName parameter of CreateProcess. Instead, it modifies the command line to use the full path. For example, it replaces "oclHashcat64.exe" with "d:\Programas\HashCat\OCLHashCat\oclHashcat64.exe". In cmd you'd have to actually type out the full path.
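To make the difference concrete, here is a minimal Python sketch of what the program appears to be doing. This is only my guess based on the output in the question, not oclHashcat's actual source, and the function names (home_from_argv0, kernel_path) are made up for illustration. It uses ntpath so the Windows path handling behaves the same on any platform.

```python
import ntpath


def home_from_argv0(argv0):
    """Guess the install directory from the first command-line token.

    Assumption: the program takes the directory part of argv[0] and
    falls back to "." when the token contains no directory at all.
    """
    directory = ntpath.dirname(argv0)
    return directory if directory else "."


def kernel_path(argv0, relative):
    """Build a kernel file path the way the logged paths suggest."""
    return home_from_argv0(argv0) + "/" + relative


if __name__ == "__main__":
    rel = "kernels/4098/m02500.VLIW1.llvmir"
    # cmd passes the token exactly as typed, so only "." is available.
    print(kernel_path("oclhashcat64", rel))
    # PowerShell rewrites the token to the full path before launching.
    print(kernel_path(r"d:\Programas\HashCat\OCLHashCat\oclHashcat64.exe", rel))
```

With the bare name the result is relative to whatever directory you happen to be in (C:\Temporal in the question), which matches the "./kernels/..." error. With the full path it points into the install directory, which matches the PowerShell and CygWin output.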
As a workaround you can use the console API to add an input alias for cmd.exe. The old doskey program provides a command-line interface for this API. This way when you type oclHashcat64 into the console, cmd.exe will instead read the full path that's set in the alias:
doskey /exename=cmd.exe oclHashcat64="D:\Programas\HashCat\OCLHashCat\oclHashcat64.exe" $*
You can save aliases (i.e. macros) to a file using doskey /macros:all > aliases. Then load them using doskey /macrofile=aliases. You can also add a command in HKCU\Software\Microsoft\Command Processor\AutoRun to load your aliases when cmd.exe starts.
Another option is to create a Windows shortcut (i.e. a shell32 link file) in some directory that's on your PATH. Use the full path to the executable in the command line, and leave the start in directory empty (i.e. inherit the shell's working directory). Append .LNK to the PATHEXT environment variable to avoid having to type the .lnk extension. (I find link files to be more convenient than using a batch file as a glorified shortcut. Plus they don't install a Ctrl-C handler like batch files do, which is one less thing to annoy me.)
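The reason appending .LNK to PATHEXT helps can be sketched as a simplified model of how a bare command name is looked up: each search directory is tried with each PATHEXT extension in order. This is a rough approximation, not cmd.exe's real implementation, and the C:\Tools directory and the resolve_command function are hypothetical names used only for the example.

```python
import ntpath


def resolve_command(name, search_dirs, pathext, existing_files):
    """Return the first candidate path that exists, or None.

    Simplified model: for every directory, try the name with each
    PATHEXT extension in order. Comparison is case-insensitive,
    as file names are on Windows.
    """
    existing = {f.lower() for f in existing_files}
    for directory in search_dirs:
        for ext in pathext:
            candidate = ntpath.join(directory, name + ext)
            if candidate.lower() in existing:
                return candidate
    return None


if __name__ == "__main__":
    files = [r"C:\Tools\oclHashcat64.lnk"]  # hypothetical shortcut location
    dirs = [r"C:\Tools"]
    default_ext = [".COM", ".EXE", ".BAT", ".CMD"]
    # Without .LNK in PATHEXT the shortcut is never considered.
    print(resolve_command("oclHashcat64", dirs, default_ext, files))
    # With .LNK appended, the bare name resolves to the shortcut.
    print(resolve_command("oclHashcat64", dirs, default_ext + [".LNK"], files))
```

Because .LNK goes at the end of the list, a real .exe or .bat with the same name in the same directory would still be found first.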
