I find debugging monit to be a major pain. Monit's shell environment has basically nothing in it (no paths or other environment variables). Also, there are no log files that I can find.
The problem is that if the start or stop command in the monit script fails, it is difficult to discern what is wrong with it. Often it is not as simple as running the command in a shell, because the shell environment is different from the monit environment.
What are some techniques that people use to debug monit configurations?
For example, I would be happy to have a monit shell, to test my scripts in, or a log file to see what went wrong.
I've had the same problem. Using monit's verbose command-line option helps a bit, but I found the best way was to create an environment as similar as possible to the monit environment and run the start/stop program from there.
# monit runs as superuser
$ sudo su
# the -i option ignores the inherited environment
# this PATH is what monit supplies by default
$ env -i PATH=/bin:/usr/bin:/sbin:/usr/sbin /bin/sh
# try running start/stop program here
$
I've found the most common problems are environment variable related (especially PATH) or permission-related. You should remember that monit usually runs as root.
Also if you use as uid myusername in your monit config, then you should change to user myusername before carrying out the test.
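A minimal sketch of the same idea as a reusable helper, in case you test often. The name run_like_monit is my own and not part of monit, and /etc/init.d/myservice in the comments is only a placeholder:

```shell
#!/bin/sh
# Sketch: run a command line under an environment close to the one monit provides.
# run_like_monit is a hypothetical helper name, not a monit command.
run_like_monit() {
    # env -i drops every inherited variable; PATH is the default quoted above.
    env -i PATH=/bin:/usr/bin:/sbin:/usr/sbin /bin/sh -c "$*"
}

# As root:
#   run_like_monit '/etc/init.d/myservice start'   # placeholder command
# For a check that uses "as uid myusername":
#   sudo -u myusername env -i PATH=/bin:/usr/bin:/sbin:/usr/sbin /bin/sh -c '...'
```

If the command works in your login shell but fails here, the difference is most likely a missing variable or a PATH entry.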
Be sure to always double check your conf and monitor your processes by hand before letting monit handle everything. systat(1), top(1) and ps(1) are your friends to figure out resource usage and limits. Knowing the process you monitor is essential too.
Regarding the start and stop scripts, I use a wrapper script to redirect output and inspect the environment and other variables. Something like this:
$ cat monit-wrapper.sh
#!/bin/sh
{
echo "MONIT-WRAPPER date"
date
echo "MONIT-WRAPPER env"
env
echo "MONIT-WRAPPER $*"
"$@"
R=$?
echo "MONIT-WRAPPER exit code $R"
} >/tmp/monit.log 2>&1
Then in monit :
start program = "/home/billitch/bin/monit-wrapper.sh my-real-start-script and args"
stop program = "/home/billitch/bin/monit-wrapper.sh my-real-stop-script and args"
You still have to decide what information you want in the wrapper, such as process info, user ID, system resource limits, etc.
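As a rough sketch of that extra information, something like the following could be called from inside the wrapper. dump_context is a made-up helper name and the field selection is only a suggestion:

```shell
#!/bin/sh
# Sketch: extra context a wrapper could record before running the real command.
# dump_context is a hypothetical helper; add or drop fields as needed.
dump_context() {
    echo "MONIT-WRAPPER id";     id         # user and groups the script runs as
    echo "MONIT-WRAPPER cwd";    pwd        # working directory
    echo "MONIT-WRAPPER umask";  umask      # default file creation mask
    echo "MONIT-WRAPPER limits"; ulimit -a  # resource limits of this shell
}
# Inside the wrapper's { ... } block, call: dump_context
```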
You can start Monit in verbose/debug mode by adding MONIT_OPTS="-v" to /etc/default/monit (don't forget to restart; /etc/init.d/monit restart).
You can then capture the output using tail -f /var/log/monit.log
[CEST Jun 4 21:10:42] info : Starting Monit 5.17.1 daemon with http interface at [*]:2812
[CEST Jun 4 21:10:42] info : Starting Monit HTTP server at [*]:2812
[CEST Jun 4 21:10:42] info : Monit HTTP server started
[CEST Jun 4 21:10:42] info : 'ocean' Monit 5.17.1 started
[CEST Jun 4 21:10:42] debug : Sending Monit instance changed notification to monit@example.io
[CEST Jun 4 21:10:42] debug : Trying to send mail via smtp.sendgrid.net:587
[CEST Jun 4 21:10:43] debug : Processing postponed events queue
[CEST Jun 4 21:10:43] debug : 'rootfs' succeeded getting filesystem statistics for '/'
[CEST Jun 4 21:10:43] debug : 'rootfs' filesytem flags has not changed
[CEST Jun 4 21:10:43] debug : 'rootfs' inode usage test succeeded [current inode usage=8.5%]
[CEST Jun 4 21:10:43] debug : 'rootfs' space usage test succeeded [current space usage=59.6%]
[CEST Jun 4 21:10:43] debug : 'ws.example.com' succeeded testing protocol [WEBSOCKET] at [ws.example.com]:80/faye [TCP/IP] [response time 114.070 ms]
[CEST Jun 4 21:10:43] debug : 'ws.example.com' connection succeeded to [ws.example.com]:80/faye [TCP/IP]
monit -c /path/to/your/config -v
By default, monit logs to your system message log and you can check there to see what's happening.
Also, depending on your config, you might be logging to a different place
tail -f /var/log/monit
http://mmonit.com/monit/documentation/monit.html#LOGGING
Assuming defaults (as of whatever old version of monit I'm using), you can tail the logs as such:
CentOS:
tail -f /var/log/messages
Ubuntu:
tail -f /var/log/syslog
Mac OSX
tail -f /var/log/system.log
Windows
Here be Dragons
But there is a neato project I found while searching on how to do this out of morbid curiosity: https://github.com/derFunk/monit-windows-agent
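If you are not sure which of these files your system uses, a small sketch like this can narrow it down. find_monit_logs is a hypothetical helper, and the default candidate list is just the paths mentioned in this thread:

```shell
#!/bin/sh
# Sketch: print each candidate log file that exists and mentions monit.
# With no arguments, the paths named in this thread are tried.
find_monit_logs() {
    if [ "$#" -eq 0 ]; then
        set -- /var/log/monit.log /var/log/monit /var/log/messages \
               /var/log/syslog /var/log/system.log
    fi
    for f in "$@"; do
        # Only regular, readable files; -i also matches "Monit".
        if [ -f "$f" ] && [ -r "$f" ] && grep -qi monit "$f" 2>/dev/null; then
            echo "$f"
        fi
    done
}
# Usage (probably as root, since system logs are often not world-readable):
#   find_monit_logs
```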
Yeah monit isn't too easy to debug.
Here are a few best practices.
Use a wrapper script that sets up your log file. Write your command arguments in there while you are at it:
shell:
#!/usr/bin/env bash
logfile=/var/log/myjob.log
touch ${logfile}
echo $$ ": ################# Starting " $(date) "########### pid " $$ >> ${logfile}
echo "Command: the-command $*" >> ${logfile} # log your command arguments
{
exec the-command "$@"
} >> ${logfile} 2>&1
That helps a lot.
The other thing I find that helps is to run monit with '-v', which gives you verbosity. So the workflow is
get your wrapper working from the shell "sudo my-wrapper"
then try and get it going from monit, run from the command line with "-v"
then try and get it going from monit, running in the background.
You can also try running monit validate once processes are running, to try and find out if any of them are having problems (and sometimes get more information than you would get in the log files if there are any problems). Beyond that, there's not much more you can do.
Related
I have a simple example of a service unit and bash script on Red Hat Enterprise Linux 7 using Type=notify that I am trying to get working.
When the service unit is configured to start the script as root, things work as expected. When adding User=testuser it fails. While the script initially starts (as seen on process list) the systemctl service never receives the notify message indicating ready so it hangs and eventually times out.
[Unit]
Description=My Test
[Service]
Type=notify
User=testuser
ExecStart=/home/iatf/test.sh
[Install]
WantedBy=multi-user.target
Test.sh (owned by testuser with execute permission)
#!/bin/bash
systemd-notify --status="Starting..."
sleep 5
systemd-notify --ready --status="Started"
while [ 1 ] ; do
systemd-notify --status="Processing..."
sleep 3
systemd-notify --status="Waiting..."
sleep 3
done
When run as root, systemctl status test displays the correct status and status messages as sent from my test.sh bash script. When User=testuser is set, the service hangs and then times out, and journalctl -xe reports:
Jul 15 13:37:25 tstcs03.ingdev systemd[1]: Cannot find unit for notify message of PID 7193.
Jul 15 13:37:28 tstcs03.ingdev systemd[1]: Cannot find unit for notify message of PID 7290.
Jul 15 13:37:31 tstcs03.ingdev systemd[1]: Cannot find unit for notify message of PID 7388.
Jul 15 13:37:34 tstcs03.ingdev systemd[1]: Cannot find unit for notify message of PID 7480.
I am not sure what those PIDs are as they do not appear on ps -ef list
This appears to be a known limitation in the notify service type.
From a pull request to the systemd man pages
Due to current limitations of the Linux kernel and the systemd, this
command requires CAP_SYS_ADMIN privileges to work
reliably. I.e. it's useful only in shell scripts running as a root
user.
I've attempted some hacky workarounds with sudo and friends, but they don't work under systemd, generally failing with
No status data could be sent: $NOTIFY_SOCKET was not set
This refers to the socket that systemd-notify is trying to send data to. It's defined in the service environment, but I could not get it reliably exposed to a sudo environment.
You could also try using a Python workaround described here
python -c "import systemd.daemon, time; systemd.daemon.notify('READY=1'); time.sleep(5)"
It's basically just a sleep, which is not reliable, and the whole point of using notify is reliable services.
In my case - I just refactored to use root as the user - with the actual service as a child under the main service with the desired user
sudo -u USERACCOUNT_LOGGED notify-send "hello"
I have a Solaris system with 3 users ( root, cfruntime , cfdev)
After a successful installation of ColdFusion 2018, the owner of the coldfusion2018 installation is cfruntime.
As cfdev I try starting ColdFusion using the following command
sudo /disktwo/coldfusion2018/cfusion/bin/coldfusion start
This, however, doesn't appear to start ColdFusion normally, but it also doesn't generate any abnormal error or log output.
Looking at the startup script /disktwo/coldfusion2018/cfusion/bin/coldfusion, the following lines actually start ColdFusion:
CFSTART='su $RUNTIME_USER -c "LD_LIBRARY_PATH=$LD_LIBRARY_PATH;
export LD_LIBRARY_PATH;
cd $CF_DIR/bin;
$JAVA_EXECUTABLE -classpath $CLASSPATH $JVM_ARGS
com.adobe.coldfusion.bootstrap.Bootstrap -start &"'
eval $CFSTART > /dev/null
An interesting observation I made was that if I removed the & at the end of CFSTART, ColdFusion would start normally (although I then need to put it in the background with Ctrl-Z, bg).
The ColdFusion process doesn't appear to persist after the startup script exits if it is started as cfdev or cfruntime, but it starts normally if the script is run as root.
Any thoughts?
Adding a nohup before the $JAVA_EXECUTABLE command and sending the output to >/dev/null 2>&1 did the trick for me
CFSTART='su $RUNTIME_USER -c "LD_LIBRARY_PATH=$LD_LIBRARY_PATH;
export LD_LIBRARY_PATH;
cd $CF_DIR/bin;
nohup $JAVA_EXECUTABLE -classpath $CLASSPATH $JVM_ARGS
com.adobe.coldfusion.bootstrap.Bootstrap -start > /dev/null 2>&1 &"'
It appears that switching to the runtime user with su $RUNTIME_USER and starting the process in the background caused all jobs started by that shell to be closed once the startup script completed: a hangup signal (SIGHUP) is sent to all jobs started from that terminal.
nohup prevents $JAVA_EXECUTABLE from being terminated when it receives the hangup signal (SIGHUP).
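The same pattern in isolation, as a sketch rather than anything ColdFusion-specific. start_detached is a name I made up for illustration:

```shell
#!/bin/sh
# Sketch: launch a command so it survives the hangup sent when the
# starting shell or terminal goes away.
# start_detached is a hypothetical helper name.
start_detached() {
    # nohup makes the child ignore SIGHUP; sending stdout and stderr to
    # /dev/null keeps nohup from writing nohup.out and frees the terminal.
    nohup "$@" > /dev/null 2>&1 &
    echo "$!"   # PID of the background job
}
# Usage: pid=$(start_detached /path/to/long-running-command arg1 arg2)
```

Printing the PID makes it easy to check later with kill -0 whether the process is still alive.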
I'm trying to write a bash script.
The script should check if the MC server is running. If it crashed or stopped it will start the server automatically.
I'll use crontab to run the script every minute. I think I could even run it every second without stressing the CPU too much. I would also like to know when the server was restarted, so I'm going to print the date to the "RestartLog" file.
This is what I have so far:
#!/bin/sh
ps auxw | grep start.sh | grep -v grep > /dev/null
if [ $? != 0 ]
then
cd /home/minecraft/minecraft/ && ./start.sh && echo "Server restarted on: $(date)" >> /home/minecraft/minecraft/RestartLog.txt > /dev/null
fi
I've just started learning Bash and I'm not sure if this is the right way to do it.
Using cron is possible, though there are other (better) solutions (monit, supervisord etc.). But that is not the question; you asked for "the right way". The right way is difficult to define, but understanding the limits and problems in your code may help you.
A normal cron job runs at most once per minute. That means that your Minecraft server may be down for 59 seconds before it is restarted.
#!/bin/sh
You should have the #! at the beginning of the line. Don't know if this is a cut/paste problem, but it is rather important. Also, you might want to use #!/bin/bash instead of #!/bin/sh to actually use bash.
ps auxw | grep start.sh | grep -v grep > /dev/null
Some may suggest using ps -ef instead, but that is a matter of taste. You can even use ps -ef | grep [s]tart.sh to avoid the second grep. The main problem with this line, however, is that you are parsing the process list for a fairly generic start.sh. This may be OK if you have a dedicated server for this, but if there are more users on the server, you run the risk that someone else runs a start.sh for something completely different.
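To see why the bracket trick works, here is a small sketch that counts matches in a process listing read from stdin. listed is a hypothetical helper, and the user name minecraft in the usage comment is an assumption based on the home directory in the question:

```shell
#!/bin/sh
# Sketch: count lines of a process listing (on stdin) that match a pattern.
# The pattern '[s]tart.sh' matches the text "start.sh", but a grep process
# whose own arguments contain "[s]tart.sh" does not match it.
listed() {
    grep -c -- "$1"
}
# Usage, restricted to one user's processes to reduce false positives:
#   ps -u minecraft -o args= | listed '[s]tart.sh'
```

Limiting ps to a single user only narrows the search; it does not make a generic name like start.sh unique.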
if [ $? != 0 ]
then
There was already a comment about the use of $? and clean code.
cd /home/minecraft/minecraft/ && ./start.sh && echo "Server restarted on: $(date)" >> /home/minecraft/minecraft/RestartLog.txt > /dev/null
It is a good idea to keep a log of the restarts. In this line, you make the execution of ./start.sh dependent on the cd succeeding. Also, the echo only gets executed after ./start.sh exits.
So that leaves me with a question: does start.sh keep running as long as the server runs (in which case the ps test is OK, but the && echo makes no sense), or does start.sh exit while leaving the Minecraft server in the background (in which case the ps/grep check won't work correctly, but it makes sense to echo the log record only if start.sh exits successfully)?
fi
(no remarks for the fi)
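One more detail on the restart line: with both >> RestartLog.txt and > /dev/null on the same echo, the last redirection wins, so the message goes to /dev/null and the log stays empty. A possible rearrangement, sketched under the assumption that start.sh stays in the foreground while the server runs (restart_server is a name I made up):

```shell
#!/bin/sh
# Sketch: write the log record before launching, with a single redirection
# on the echo. Assumes start.sh keeps running as long as the server does.
# restart_server is a hypothetical helper name.
restart_server() {
    dir=$1
    (
        # Subshell, so the cd does not affect the caller.
        cd "$dir" || exit 1
        echo "Server restarted on: $(date)" >> RestartLog.txt
        ./start.sh > /dev/null 2>&1 &
    )
}
# Usage inside the "if" of the cron script:
#   restart_server /home/minecraft/minecraft
```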
If start.sh blocks until the server exits or crashes, you'd be better off simply restarting it in an infinite loop without involving cron. Simply type in a console (or put into another script):
#!/bin/bash
cd /home/minecraft/minecraft/
while sleep 3; do
echo "$(date) server (re)start" >> restart.log
./start.sh # blocks until server crashes
done
But if it doesn't block (i.e. if start.sh starts the server and then returns, but the server keeps running), you would need to implement a different check to verify if the server is actually still running, other than ps|grep start.sh
PS: To kill the infinite loop you have to Ctrl+C twice: Once to stop ./start.sh and once to exit from the immediate sleep.
You can use monit for this task. See the documentation. It is available on most Linux distributions and has a straightforward config. Find some examples in this post.
For your app it will look something like
check process minecraftserver
matching "start.sh"
start program = "/home/minecraft/minecraft/start.sh"
stop program = "/home/minecraft/minecraft/stop.sh"
I wrote this answer because sometimes the most efficient solution is already there and you don't have to code anything. Also follow the suggestions of William Pursell and use the init system of your OS (systemd,upstart,system-v,etc.) to host your scripts.
Find more:
Shell Script For Process Monitoring
I want to put my screen into screensaver (locked) state every 50 minutes (3000 seconds).
cat /home/rest.sh
while true;do
sleep 3000
xscreensaver-command --lock 1>/dev/null
done
sh /home/rest.sh & can make it run.
Now I want to set it up as a daemon.
sudo vim /etc/systemd/system/screensave.service
[Unit]
Description=screensave
[Service]
User=root
ExecStart=/bin/bash /home/rest.sh
StandardError=journal
[Install]
WantedBy=multi-user.target
To set it and enable as daemon.
systemctl enable screensave.service
I find that the service is not running as a daemon.
sudo journalctl -u screensave
Jan 24 12:16:50 user systemd[1]: Started screensave.
Jan 24 12:17:22 user bash[621]: xscreensaver-command: warning: $DISPLAY is not set: defaulting to ":0.0".
Jan 24 12:17:22 user bash[621]: No protocol specified
Jan 24 12:17:22 user bash[621]: xscreensaver-command: can't open display :0.0
How to run it as a daemon after $DISPLAY is set ?
This is a very common FAQ. A system daemon cannot easily connect to the X session of any individual user. On a multi-user system, how do you tell which user's session to connect to, anyway? On a single-user system, what should the daemon do if no session is running (as it often isn't at the time the daemon starts up)?
Trying to run a system daemon as any particular user won't work, and giving individual users access to a system daemon is a recipe for security problems. It can be done, but the solution is complex, and probably not something you want to attempt on your own. (Briefly, have the daemon listen to commands on a socket; create a user-space program which knows how to talk to the socket, and build some sort of authorization and authentication so the daemon knows whom it's talking to and can verify that this user is allowed to connect to this display.)
The drop-dead simple solution is to run this from your desktop environment's startup scripts instead. Most desktops have something like "session start-up items" or "autorun on login" hooks.
I'm not running Linux and can't check right now, but the steps to daemonize a process are: close stdin, stdout and stderr; change the current working directory to /; and fork twice and call setsid so that the process becomes a new session leader.
Try adding something like this at the beginning of the script. The first thing to check is that the exec command creates a new session leader process, using ps -Cbash -o sid,pgid,pid,ppid,comm,args
# checking if current process is a session leader to avoid infinite call
if [[ $(ps -p $$ -osid=) != $$ ]]; then
( cd / ; exec setsid /bin/bash /home/rest.sh & ) </dev/null 1>&0 2>&0 &
exit
fi
I've written a script that works fine to start and stop a server.
#!/bin/bash
PID_FILE='/var/run/rserve.pid'
start() {
touch $PID_FILE
eval "/usr/bin/R CMD Rserve"
PID=$(ps aux | grep Rserve | grep -v grep | awk '{print $2}')
echo "Starting Rserve with PID $PID"
echo $PID > $PID_FILE
}
stop () {
pkill Rserve
rm $PID_FILE
echo "Stopping Rserve"
}
case $1 in
start)
start
;;
stop)
stop
;;
*)
echo "usage: rserve {start|stop}" ;;
esac
exit 0
If I start it by running
rserve start
and then start monit it will correctly capture the PID and the server:
The Monit daemon 5.3.2 uptime: 0m
Remote Host 'localhost'
status Online with all services
monitoring status Monitored
port response time 0.000s to localhost:6311 [DEFAULT via TCP]
data collected Mon, 13 May 2013 20:03:50
System 'system_gauss'
status Running
monitoring status Monitored
load average [0.37] [0.29] [0.25]
cpu 0.0%us 0.2%sy 0.0%wa
memory usage 524044 kB [25.6%]
swap usage 4848 kB [0.1%]
data collected Mon, 13 May 2013 20:03:50
If I stop it, it will properly kill the process and unmonitor it. However if I start it again, it won't start the server again:
ps ax | grep Rserve | grep -vc grep
1
monit stop localhost
ps ax | grep Rserve | grep -vc grep
0
monit start localhost
[UTC May 13 20:07:24] info : 'localhost' start on user request
[UTC May 13 20:07:24] info : monit daemon at 4370 awakened
[UTC May 13 20:07:24] info : Awakened by User defined signal 1
[UTC May 13 20:07:24] info : 'localhost' start: /usr/bin/rserve
[UTC May 13 20:07:24] info : 'localhost' start action done
[UTC May 13 20:07:34] error : 'localhost' failed, cannot open a connection to INET[localhost:6311] via TCP
Here is the monitrc:
check host localhost with address 127.0.0.1
start = "/usr/bin/rserve start"
stop = "/usr/bin/rserve stop"
if failed host localhost port 6311 type tcp with timeout 15 seconds for 5 cycles
then restart
I also had problems starting and stopping a process via a shell script.
One solution might be to add "/bin/bash" in the config like this:
start program = "/bin/bash /usr/bin/rserve start"
stop program = "/bin/bash /usr/bin/rserve stop"
It worked for me.
monit is a silent killer. It does not tell you anything. Here are things I would check which monit won't help you identify
Check permissions of all the files you are reading / writing. If you are redirecting output to a file, make sure that file is writable by uid and gid you are using to execute the program
Again check exec permission on the program you are trying to run
Specify full path to any program you are trying to execute ( not strictly necessary, but you don't have to worry about path not being set if you always specify full path )
Make sure you can run the program outside of monit without any error before trying to investigate why monit is not starting.
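The first three checks can be scripted. This is only a sketch; preflight is a hypothetical helper, and it should be run as the same user monit will use for the program:

```shell
#!/bin/sh
# Sketch: pre-flight checks for a start/stop program and the file it logs to.
# preflight is a hypothetical helper name.
# Usage: preflight /full/path/to/program /path/to/logfile
preflight() {
    prog=$1 logfile=$2 status=0
    case $prog in
        /*) ;;                                   # absolute path, fine
        *)  echo "not an absolute path: $prog"; status=1 ;;
    esac
    [ -x "$prog" ] || { echo "not executable: $prog"; status=1; }
    # The log file must be writable, or creatable in its directory.
    if [ -e "$logfile" ]; then
        [ -w "$logfile" ] || { echo "not writable: $logfile"; status=1; }
    else
        [ -w "$(dirname "$logfile")" ] || { echo "cannot create: $logfile"; status=1; }
    fi
    return "$status"
}
```

It returns non-zero and prints one line per failed check, so it can also go at the top of a wrapper script.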
If the Monit log is displaying
failed to start (exit status -1) -- no output
Then it may be that you're trying to run a script without any of the Bash infrastructure. You can run such a command by wrapping it in /bin/bash -c, like so:
check process my-process
matching "my-process-name"
start program = "/bin/bash -c '/etc/init.d/my-init-script'"
When monit starts it checks for its own pidfile and checks if the process with
matching PID is running already - if it does, then it just wakes up this
process.
In your case, check whether this PID is being used by some other process:
ps -ef |grep 4370
If it is, you need to remove the file below (usually under the /run directory) and start monit again:
monit.pid
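A sketch of that check as a helper, so you don't have to read the PID by hand. pidfile_owner is a made-up name, and /run/monit.pid in the usage comment is an assumption; use whatever pidfile path your monit is configured with:

```shell
#!/bin/sh
# Sketch: report which command, if any, owns the PID recorded in a pidfile.
# pidfile_owner is a hypothetical helper; the pidfile path is passed in.
pidfile_owner() {
    pid=$(cat "$1" 2>/dev/null) || return 1
    [ -n "$pid" ] || return 1
    # Prints the command name for that PID, or nothing if no such process.
    ps -p "$pid" -o comm=
}
# Usage: pidfile_owner /run/monit.pid   # path is an assumption
# If this prints a command other than monit, the pidfile is stale.
```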
For me, the issue was that the stop command was not being run, even though I had explicitly specified "then restart" in the configuration.
The solution was just to change:
start program = "/etc/init.d/.... restart"