OS version: SUSE Linux Enterprise Server 12 SP2
An Apache Ignite (2.8.0) service runs on this machine as one process with a child process; the child provides the computation capability and consumes a lot of memory. The service keeps getting killed unexpectedly, sometimes two weeks after the last startup, sometimes only 3 hours after it.
(There are other processes on this machine, e.g. ZooKeeper.)
I've gathered some useful info, but not enough:
It is NOT the OOM killer. There is no exception in the Ignite service log or in /var/log/messages. Ignite disappeared without any exception log, so it seems it was killed with signal 9.
I loaded a kernel module (insmod) to record who the killer is, and found that PID 1 killed the child process with signal 0 (the parent process disappeared soon after, without any killer info). I also logged the output of ps -ef every 10 seconds and found that the child process name became "defunct" before it disappeared.
Memory and swap were not starved when Ignite was killed; usage was smooth and steady beforehand. I am very sure of this from the monitoring charts and my own logs.
I don't know how to proceed with this. Any help is appreciated.
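As background for the "defunct" observation: ps shows a process as defunct when it is in state Z, meaning it has already exited and is waiting for its parent to reap it. A minimal sketch of checking that state directly from /proc rather than from ps output is shown here, assuming a Linux /proc filesystem; the helper names (parse_proc_stat, read_proc_stat, is_zombie) are illustrative and not part of any tool mentioned above.

```python
from pathlib import Path
from typing import NamedTuple


class ProcStat(NamedTuple):
    pid: int
    comm: str
    state: str
    ppid: int


def parse_proc_stat(line: str) -> ProcStat:
    """Parse the first four fields of a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    rest = line[close_idx + 1:].split()
    return ProcStat(pid, comm, rest[0], int(rest[1]))


def read_proc_stat(pid: int) -> ProcStat:
    """Read and parse /proc/<pid>/stat (raises if the process is gone)."""
    return parse_proc_stat(Path(f"/proc/{pid}/stat").read_text())


def is_zombie(stat: ProcStat) -> bool:
    """State 'Z' is what ps reports as <defunct>."""
    return stat.state == "Z"
```

Polling read_proc_stat for the child PID and logging the state and parent PID would show when the child exits and whether it has been re-parented, which may help narrow down the time window to correlate with other logs.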
Related
I have set up a Hadoop cluster with Hortonworks Data Platform 2.5. I'm using 1 master and 5 slave (worker) nodes.
Every few days one (or more) of my worker nodes gets a high load and seems to restart the whole CentOS operating system automatically. After the restart the Hadoop components are no longer running and have to be restarted manually via the Ambari management UI.
Here is a screenshot of the "crashed" node (reboot after the high load value ~4 hours ago):
Here is a screenshot of one of the other "healthy" worker nodes (all other workers have similar values):
The node crashes alternate between the 5 worker nodes, the master node seems to run without problems.
What could cause this problem? Where are these high load values coming from?
This seems to be a Kernel problem, as the log file (e.g. /var/spool/abrt/vmcore-127.0.0.1-2017-06-26-12:27:34/backtrace) says something like
Version: 3.10.0-327.el7.x86_64
BUG: unable to handle kernel NULL pointer dereference at 00000000000001a0
After running sudo yum update, I had this kernel version:
[root@myhost ~]# uname -r
3.10.0-514.26.2.el7.x86_64
Since the operating system update, the problem hasn't occurred again. I will keep observing the issue and give feedback if necessary.
I have installed DC/OS (3 master and 7 slave servers, all CentOS 7).
I noticed a problem: when one of the slave servers shuts down, Mesos/Marathon only restarts the killed application instances after 5 minutes.
For example, I run 8 instances of a simple web application in Mesos/Marathon. When I shut down a slave server or deactivate its network interface, Marathon shows that some instances are killed. From that moment, Mesos/Marathon waits 5 minutes and then starts the killed instances on another online slave server.
My question is: how can I change this time? 5 minutes is too long. I read the DC/OS documentation but I can't find the variable responsible for this.
I would be very thankful for your help.
You can have a look at the Marathon command-line flags. Based on your description, I guess the default for either task_launch_timeout or scale_apps_interval could be responsible for this.
I'm unsure, though, whether this can be configured on the fly or only during installation in DC/OS. I saw that there's a quite recent enhancement request to make Marathon flags passable via environment variables.
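To make the two flags concrete, here is a small sketch that assembles them into a Marathon argument list. Both flags take a value in milliseconds; the helper name marathon_timeout_args and the 60000 ms value in the usage note are assumptions for illustration only, and how the arguments actually reach Marathon in a DC/OS installation is not covered here.

```python
from typing import List


def marathon_timeout_args(task_launch_timeout_ms: int,
                          scale_apps_interval_ms: int) -> List[str]:
    """Build the command-line arguments for the two Marathon flags
    mentioned above. Both values are in milliseconds."""
    values = {
        "--task_launch_timeout": task_launch_timeout_ms,
        "--scale_apps_interval": scale_apps_interval_ms,
    }
    args: List[str] = []
    for flag, value in values.items():
        if value <= 0:
            raise ValueError(f"{flag} must be a positive number of milliseconds")
        args.extend([flag, str(value)])
    return args
```

For example, marathon_timeout_args(60000, 60000) yields the arguments for a one-minute value on both flags. Whether either flag is really behind the 5-minute delay is still a guess and would need to be verified on a test cluster.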
Hadoop YARN launches instances of YarnChild in child VM to execute the actual tasks. Those tasks communicate with their ApplicationMaster (AM) through the umbilical interface.
My question is: what happens if the AM dies and the ResourceManager (RM) fails to bring it back up (say, due to some code defect in the AM)? In such a case, the child tasks would (a) notice the absence of the AM through missed heartbeats and then (b) go to the RM to get the new AM location, which in this case they will not get. So, what happens to these orphaned tasks? I have a scenario where I would like to terminate them. Is that the default behavior, and does their NodeManager (NM) terminate them?
From Hadoop: The Definitive Guide, Chapter 6, Failures, Failures in YARN:
After a crash, a new resource manager instance is brought up (by admin), and it recovers from the saved state. The state consists of node managers in system, as well as running applications. Here tasks are not part of resource manager's state, as they are managed by application.
Also, it is said that the resource manager is designed to be able to recover from crashes.
All child tasks related to that particular ApplicationMaster would be in a halted state. The Hadoop admin should either restart the ApplicationMaster or kill it. The NodeManager doesn't terminate the failed ApplicationMaster.
If you want to kill an application, you can use the yarn application -kill application_id command. It will kill all running and queued jobs under the application.
If you want to kill a single task in YARN, you can use hadoop job -kill-task <task-id>.
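If the kill needs to be scripted, the yarn application -kill command above can be wrapped as in this minimal sketch. It assumes the yarn CLI is on the PATH and that application IDs follow the usual application_<clusterTimestamp>_<sequence> form; the function names are illustrative.

```python
import re
import subprocess
from typing import List

# Usual shape of a YARN application ID, e.g. application_1498480000000_0042.
APP_ID_RE = re.compile(r"^application_\d+_\d+$")


def kill_command(app_id: str) -> List[str]:
    """Return the argv for `yarn application -kill <app_id>`."""
    if not APP_ID_RE.match(app_id):
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return ["yarn", "application", "-kill", app_id]


def kill_application(app_id: str) -> int:
    """Run the kill command and return the CLI's exit code."""
    return subprocess.run(kill_command(app_id), check=False).returncode
```

Validating the ID before building the command keeps a mistyped or empty argument from ever reaching the CLI.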
Can someone tell me how to kill a container? I see that nodes are still running containers even after the application has finished, and I want to know the command to kill them. Because of this issue, my subsequent applications stay in the ACCEPTED state.
Thanks
hadoop job -list
This gives you the jobs that are running, with their JobIDs.
To kill a job:
hadoop job -kill JobID
If the YARN application is finished and some containers are still running, I'd say this is a bug somewhere. Is this an MR app? I don't think there are any commands to kill containers, and in any case those should be handled by the NodeManager. The ResourceManager and NodeManager should kill all containers when the application is finished.
You didn't provide any info on what this app is, the Hadoop version, the operating system, etc. Having said that, I once had a problem on my Ubuntu hosts, which were hitting the HADOOP-9752 bug that prevented the NodeManager from killing a container.
Being a level 1 novice in Linux (Ubuntu 9), shell and cron, I've had some difficulty figuring this out. Each night, I'd like to take a snapshot of our Cassandra nodes and restart the process.
Why? Because our team is hunting down a memory leak that requires a process restart every 3 weeks or so. The root cause has been difficult to track down. In the meantime, I'd like to put these cron jobs in place to reduce service interruption.
Thanks in advance to anyone who has already figured some of this out!
The general procedure is:
Run nodetool drain (http://www.riptano.com/docs/0.6/utilities/nodetool#nodetool-drain) on the node
Run nodetool snapshot
Kill the cassandra process
Start the cassandra process
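The four steps above can be sketched as a small script suitable for calling from cron. This is only an outline under stated assumptions: nodetool is on the PATH and talks to the local node without extra options, and the stop and start commands (shown in the usage note as service cassandra stop/start) are hypothetical placeholders for however Cassandra is managed on your hosts.

```python
import subprocess
from typing import Callable, List, Sequence


def restart_steps(stop_cmd: Sequence[str],
                  start_cmd: Sequence[str]) -> List[List[str]]:
    """Return the commands in the order given above:
    drain, snapshot, stop the process, start the process."""
    return [
        ["nodetool", "drain"],
        ["nodetool", "snapshot"],
        list(stop_cmd),
        list(start_cmd),
    ]


def _run(cmd: List[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def run_steps(steps: List[List[str]],
              runner: Callable[[List[str]], int] = _run) -> None:
    """Run each step in order and stop at the first failure, so the
    process is never killed after a failed drain or snapshot."""
    for step in steps:
        code = runner(step)
        if code != 0:
            raise RuntimeError(f"step failed ({code}): {' '.join(step)}")
```

A nightly run would then call run_steps(restart_steps(["service", "cassandra", "stop"], ["service", "cassandra", "start"])) from a script scheduled in the crontab of a user allowed to manage the service. Passing a fake runner makes it possible to dry-run the sequence without touching the node.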
When running nodetool snapshot, it is very important that you have JNA set up and working. This includes:
Having jna.jar in Cassandra's lib directory and either:
Running Cassandra as root, or
Increasing the memory locking limit using 'ulimit -l' or something like /etc/security/limits.conf
If this is all correct, you should see a message about "mlockall" succeeding in the logs on startup.
The other thing to keep an eye on is your disk space usage; this will grow as compactions occur and the old SSTables are replaced (but their snapshots remain).