Hi, I am pretty new to the concept of Hadoop High Availability. I have done all the basic configuration needed for High Availability. When I manually killed the NameNode process on one machine, the other node became active and this node went to standby mode. But when I shut down the machine on which the active NameNode is running, the other node does not go to the active state.
Any help is appreciated.
Thanks in advance.
It may be that when you kill the process, the NameNode goes through a graceful shutdown, which includes notifying the other NameNode to take its place, and the other one does so immediately.
On the other hand, when you shut down the machine, the graceful shutdown of the NameNode may not be executed, so the other NameNode does not yet know that it should take over. Given enough time, though, it should.
As far as I know, if the node 'dies' in an unexpected way, it stops sending heartbeats to its ZooKeeper, so you must give ZooKeeper enough time to establish that the active NameNode is dead.
In a scenario where only the NameNode process dies, another process on that machine, the 'Failover controller', notifies the passive NameNode of the death, which is why the failover is fast.
Please take a look at this thread, since it is well written.
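To illustrate the two timings described above, here is a toy Python model. It is not Hadoop or ZooKeeper code: the class `FailoverMonitor`, the node names `nn1`/`nn2`, and the 5-second session timeout are all made up for the example. It only shows why a killed process can fail over at once (the lock is released explicitly), while a powered-off machine fails over only after the heartbeat has been missing for longer than the timeout.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Toy model of lock-based failover (hypothetical, not the real controller)."""
    session_timeout: float        # seconds of silence before the session expires (assumed)
    last_heartbeat: float = 0.0   # time of the last heartbeat from the active node
    active: str = "nn1"
    standby: str = "nn2"

    def heartbeat(self, now: float) -> None:
        """The active node reports that it is alive."""
        self.last_heartbeat = now

    def release(self) -> None:
        """Process died but its host is up: the lock is handed over at once."""
        self.active, self.standby = self.standby, self.active

    def tick(self, now: float) -> str:
        """Host is gone: fail over only once the session timeout has elapsed."""
        if now - self.last_heartbeat > self.session_timeout:
            self.active, self.standby = self.standby, self.active
            self.last_heartbeat = now
        return self.active


monitor = FailoverMonitor(session_timeout=5.0)
monitor.heartbeat(0.0)
print(monitor.tick(3.0))  # nn1: heartbeat is only 3 s old, no failover yet
print(monitor.tick(6.0))  # nn2: more than 5 s of silence, standby takes over
```

If the standby still does not take over long after the timeout, the delay alone does not explain it, and the failover controller logs on the standby are the next place to look.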
This is in relation to my previous post (here) regarding the OOM I'm experiencing on a driver after running some Spark steps.
I have a cluster with 2 nodes in addition to the master, running the job in client mode. It's a small job that is not very memory intensive.
I've paid particular attention to the hadoop processes via htop. They are the user-generated ones and also the highest memory consumers. The main culprit is the amazon.emr.metric.server process, followed by the state pusher process.
As a test I killed the process. The memory usage shown by Ganglia dropped quite drastically, and I was then able to run 3-4 consecutive jobs before the OOM happened again. This behaviour repeats if I manually kill the process.
My question really is regarding the default behaviour of these processes and whether what I'm witnessing is the norm or whether something crazy is happening.
I'm testing "shutting down servers using UPS" while a Hadoop task is running, and I have two questions.
1. Can a running task be saved so that it continues the remaining work after all nodes have rebooted?
2. If "1" is not supported, is it safe to start the shutdown process while Hadoop tasks are running? Or is there anything I have to do to protect the Hadoop cluster?
No, you can't "save" the task in an intermediate state. If you shut down Hadoop while some jobs are running, you could end up with intermediate data from abandoned jobs occupying space. Apart from that, it is fine to shut down the system while jobs are running.
It is not possible to save the state of running tasks with Hadoop as of now. It would be extremely difficult, since all resource allocation is based on the current load of the system. After restarting the entire cluster the workload might be entirely different, so restoring the old state does not make sense.
To answer your second question: Hadoop was designed to tolerate node failures, temporary problems accessing files, and network outages. Individual tasks might fail, and the system then restarts them on another node. It is safe to shut down nodes from the cluster's point of view. The only thing to keep in mind is that the job will ultimately fail and you need to resubmit it after bringing the cluster back to life. One problem with shutting down the cluster using the power switch is that temporary files do not get cleaned up. This is usually not a major problem.
I am trying to write a simple Mesos framework that can relaunch tasks that don't succeed.
The basic algorithm, which seems to be mostly working, is to read in a task list (e.g. shell commands) and then launch executors, waiting to hear back status messages. If I get TASK_FINISHED, that particular task is done. If I get TASK_FAILED/TASK_KILLED, I can retry the task elsewhere (or maybe give up).
The case I'm not sure about is TASK_LOST (or even slave lost). I want to ensure that I don't launch another copy of a task that is already running. After getting TASK_LOST, is it possible that the executor is still running somewhere, but a network problem has disconnected the slave from the master? Does Mesos deal with this case somehow, perhaps by having the executor kill itself (and the task) when it is unable to contact the master?
More generally, how can I make sure I don't have two of the same task running in this context?
Let me provide some background first and then try to answer your question.
1) The difference between TASK_LOST and other terminal unsuccessful states is that restarting a lost task could end in TASK_FINISHED, while failed or killed will most probably not.
2) Until you get a TASK_LOST you should assume your task is running. Imagine a Mesos Agent (Slave) dies for a while, but the tasks may still be running and will be successfully reconciled, even though the connection is temporarily lost.
3) Now to your original question. The problem is that it is extremely hard to have exactly one instance running (see e.g. [1] and [2]). If you have lost connection to your task, that can mean either a (temporary) network partition or that your task has died. You basically have to choose between two alternatives: either accepting that multiple instances may run at the same time, or accepting periods when no instance is running.
4) It's not easy to guarantee that two tasks are not running concurrently. When you get a TASK_LOST update from Mesos, it means your task is either dead or orphaned (it will be killed once reconciled). Now imagine that a slave running your task is disconnected from the Mesos Master by a network partition. You will get a TASK_LOST update, and the Master ensures the task is killed once the connection is restored, but until then your task keeps running on the disconnected slave. If you started another instance as soon as you got the TASK_LOST update, the guarantee is violated.
5) Things you may want to look at:
recovery_timeout on Mesos slaves regulates when tasks commit suicide if the mesos-slave process dies.
slave_reregister_timeout on the Mesos Master specifies how much time slaves have to reregister with the Mesos Master and have their tasks reconciled (basically, when you get TASK_LOST updates for unreachable tasks).
[1] http://antirez.com/news/78
[2] http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
You can assume that TASK_LOST really means your task is lost and there is nothing you can do but launch another instance.
Two things to keep in mind though:
Your framework may register with a failover timeout, which means that if your framework cannot communicate with the slave for any reason (unstable network, slave died, scheduler died, etc.), Mesos will kill this framework's tasks after they fail to recover within that timeout. You will get the TASK_LOST status once the task is actually considered dead (e.g. when the failover timeout expires).
When no failover timeout is used, tasks are killed immediately when connectivity is lost for any reason.
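The status handling discussed in this thread can be sketched in Python as a plain decision function. This is only a sketch: it does not use the Mesos API, and the names `Action`, `on_status_update`, `max_attempts` and `allow_duplicates` are made up for the example. Only the task state strings come from Mesos. The `allow_duplicates` flag encodes the trade-off described above: relaunch at once on TASK_LOST and risk two copies, or hold back and risk a period with no copy running.

```python
from enum import Enum


class Action(Enum):
    DONE = "done"        # task succeeded, nothing more to do
    RETRY = "retry"      # launch the task again, possibly elsewhere
    GIVE_UP = "give_up"  # retry budget exhausted
    WAIT = "wait"        # do nothing yet


def on_status_update(state: str, attempts: int, max_attempts: int = 3,
                     allow_duplicates: bool = False) -> Action:
    """Decide what a scheduler could do with one status update (hypothetical helper)."""
    if state == "TASK_FINISHED":
        return Action.DONE
    if state in ("TASK_FAILED", "TASK_KILLED"):
        # The task definitely stopped, so a retry cannot create a duplicate.
        return Action.RETRY if attempts < max_attempts else Action.GIVE_UP
    if state == "TASK_LOST":
        # The old copy may still run on a partitioned slave until it is reconciled.
        if not allow_duplicates:
            return Action.WAIT
        return Action.RETRY if attempts < max_attempts else Action.GIVE_UP
    # Non-terminal states such as TASK_STAGING or TASK_RUNNING.
    return Action.WAIT


print(on_status_update("TASK_FAILED", attempts=1))                       # Action.RETRY
print(on_status_update("TASK_LOST", attempts=1))                         # Action.WAIT
print(on_status_update("TASK_LOST", attempts=1, allow_duplicates=True))  # Action.RETRY
```

How long to stay in `WAIT` after TASK_LOST is up to the framework. If the tasks are idempotent, setting `allow_duplicates=True` is the simpler choice.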
I have a 23 node cluster running CoreOS Stable 681.2.0 on AWS across 4 availability zones. All nodes are running etcd2 and flannel. Of the 23 nodes, 8 are dedicated etcd2 nodes, the rest are specifically designated as etcd2 proxies.
Scheduled to the cluster are 3 nginx plus containers, a private Docker registry, SkyDNS, and 4 of our application containers. The application containers register themselves with etcd2, and the nginx containers pick up any changes, render the necessary files, and finally reload.
This all works perfectly until a single etcd2 node is unavailable for any reason.
If the cluster of voting etcd2 members loses connectivity to even a single other voting etcd2 member, all of the services scheduled to the fleet become unstable. Scheduled services begin stopping and starting without my intervention.
As a test, I began stopping the EC2 instances which host voting etcd2 nodes until quorum was lost. After the first etcd2 node was stopped, the above symptoms began. After a second node was stopped, services remained unstable, with no further observable change. Then, after the third was stopped, quorum was lost and all units were unscheduled. I then started all three etcd2 nodes again, and within 60 seconds the cluster had returned to a stable state.
Subsequent tests yield identical results.
Am I hitting a known bug in etcd2, fleet or CoreOS?
Is there a setting I can modify to keep units scheduled onto a node even if etcd is unavailable for any reason?
I've experienced the same thing. In my case, when I ran 1 specific unit it caused everything to blow up. Scheduled and perfectly fine running units were suddenly lost without any notice, even machines dropping out of the cluster.
I'm still not sure what the exact problem was, but I think it might have had something to do with etcd vs etcd2. I had a dependency of etcd.service in the unit file, which (I think, not sure) caused CoreOS to try and start etcd.service, while etcd2.service was already running. This might have caused the conflict in my case, and messed up the etcd registry of units and machines.
Something similar might be happening to you, so I suggest you check on each host whether you're running etcd or etcd2, and check your unit files to see which one they depend on.
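The unit file check can be scripted. Below is a minimal Python sketch that takes the text of a unit file and reports which of etcd.service and etcd2.service it names in its dependency directives. The function name `etcd_dependencies` and the sample unit are made up for the example, and the list of directives it inspects is my own choice and may not be complete for your units.

```python
# systemd directives that can pull in or order against another unit.
DEPENDENCY_KEYS = ("Requires", "Wants", "BindsTo", "After", "Before", "PartOf")
ETCD_UNITS = ("etcd.service", "etcd2.service")


def etcd_dependencies(unit_text: str) -> set:
    """Return the etcd unit names referenced by dependency directives."""
    found = set()
    for line in unit_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep or key.strip() not in DEPENDENCY_KEYS:
            continue  # not a dependency directive (or a commented-out line)
        for name in value.split():
            if name in ETCD_UNITS:
                found.add(name)
    return found


# Hypothetical unit file used only for illustration.
sample_unit = """[Unit]
Description=My app
Requires=etcd.service
After=etcd.service docker.service

[Service]
ExecStart=/usr/bin/docker run my/app
"""

print(etcd_dependencies(sample_unit))  # {'etcd.service'}
```

A unit that reports `etcd.service` on a host where etcd2.service is the one actually running would match the conflict I described.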
When running Hadoop jobs, I noticed that sometimes the number of completed tasks decreases and the number of canceled tasks increases.
How is this possible? Why does this happen?
I've only experienced this when our cluster was in a strange state, so I'm not sure if this is the same issue. Basically, map tasks would complete, and then the reducers would start... and then mappers would be reprocessed.
I believe the problem is that mapper output stays on the node that ran the mapper, waiting for reducers to pick it up. If that node has issues or dies, the JobTracker decides that it needs to rerun that task, even though it had completed. Our issue was that the system our NameNode was on was having some non-Hadoop-related issues, and once those were resolved the problem seemed to go away.
Sorry if my experience is not relevant to your issue. Could you post more details? Do you see any error messages? Is there anything weird in your JobTracker or NameNode logs?
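To make the explanation above concrete, here is a toy Python model of why the completed count can go down. It is not JobTracker code: the function `lose_node`, the task dictionaries and the node names are all invented for the example. It assumes that a completed map task whose output sits on a lost node goes back to pending while reducers still need that output.

```python
def lose_node(tasks: list, lost_node: str, reducers_pending: bool = True) -> tuple:
    """Requeue completed map tasks whose output was on the lost node.

    Returns (completed_count, canceled_count) after the node loss.
    """
    canceled = 0
    for task in tasks:
        if task["node"] != lost_node:
            continue
        if task["state"] == "completed" and reducers_pending:
            # Output was on the lost node and reducers have not fetched it all.
            task["state"] = "pending"
            task["node"] = None
            canceled += 1
    completed = sum(1 for t in tasks if t["state"] == "completed")
    return completed, canceled


map_tasks = [
    {"id": "m1", "node": "node-a", "state": "completed"},
    {"id": "m2", "node": "node-a", "state": "completed"},
    {"id": "m3", "node": "node-b", "state": "completed"},
]

# Before the failure 3 maps are completed; afterwards only 1 is.
print(lose_node(map_tasks, "node-a"))  # (1, 2)
```

In this model the two requeued maps are the ones that would show up as canceled, which matches the counters moving in opposite directions.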