How to stop a running task and continue it in a Hadoop cluster - hadoop

I'm testing "shutting down servers using UPS" while hadoop task is running, and I have two questions.
I wonder if running task can be saved, and then it continues the remaining work again after rebooting. (at all nodes)
If "1" is not supported, is it safe to start shutting down process while hadoop tasks are running? Or, is there anything I have to do to preserve hadoop system? (cluster?)

No, you can't "save" a task in an intermediate state. If you shut down Hadoop while some jobs are running, you could end up with intermediate data from the abandoned jobs occupying space. Apart from that, you can shut down the system while jobs are running.

It is not possible to save the state of running tasks with Hadoop as of now. It would be an extremely difficult feature to add, since all resource allocation happens based on the current load of the system, and after restarting your entire cluster there might be an entirely different workload, so restoring the old state would not make sense.
Answering your second question: Hadoop was designed to tolerate node failures, temporary problems with accessing files, and network outages. Individual tasks might fail, and the system then restarts them on another node. It is safe to shut down nodes from the cluster's point of view; the only thing to keep in mind is that the job will ultimately fail and you will need to re-submit it after bringing the cluster back to life. One problem that might arise from shutting down the cluster with the power switch is that temporary files do not get cleaned up, but this is usually not a major problem.
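Since running jobs will not survive the shutdown, it can help to record which applications are still active before powering off so you know what to re-submit afterwards. Below is a minimal sketch, assuming a YARN-based cluster with the standard `yarn` CLI on the PATH; it is only a convenience check, not part of any official shutdown procedure:

```python
import subprocess

def running_yarn_applications():
    """Return the IDs of YARN applications currently in the RUNNING state.

    Assumes the `yarn` CLI is available on this node.
    """
    out = subprocess.run(
        ["yarn", "application", "-list", "-appStates", "RUNNING"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Application IDs start with "application_"; everything else is header text.
    return [line.split()[0] for line in out.splitlines()
            if line.startswith("application_")]

if __name__ == "__main__":
    apps = running_yarn_applications()
    if apps:
        print("Still running; re-submit these after the cluster comes back:", apps)
    else:
        print("No running applications; safe to begin the shutdown.")
```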

Related

AWS EMR Metric Server - Cluster Driver is throwing Insufficient Memory Error

This is in relation to my previous post (here) regarding the OOM I'm experiencing on a driver after running some Spark steps.
I have a cluster with 2 nodes in addition to the master, running the job as client. It's a small job that is not very memory intensive.
I've paid particular attention to the Hadoop processes via htop; they are the user-generated ones and also the highest memory consumers. The main culprit is the amazon.emr.metric.server process, followed by the state pusher process.
As a test I killed that process, and the memory shown by Ganglia dropped quite drastically, after which I was able to run 3-4 consecutive jobs before the OOM happened again. This behaviour repeats whenever I manually kill the process.
My question is really about the default behaviour of these processes: is what I'm witnessing the norm, or is something crazy happening?
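To reproduce the "highest memory consumers" observation without sitting in htop, one option is to dump the top processes by resident memory. A diagnostic sketch only, assuming GNU ps on the EMR nodes; the process names (e.g. the EMR metric server) will vary between EMR releases:

```python
import subprocess

def top_memory_processes(n=5):
    """Return the top-n processes by resident set size, like sorting htop by MEM%."""
    out = subprocess.run(
        ["ps", "-eo", "rss,pid,comm,args", "--sort=-rss"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    header, rows = out[0], out[1:n + 1]
    return [header] + rows

if __name__ == "__main__":
    for line in top_memory_processes():
        print(line)
```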

ensuring that a mesos task is not running after a TASK_LOST status update

I am trying to write a simple Mesos framework that can relaunch tasks that don't succeed.
The basic algorithm, which seems to be mostly working, is to read in a task list (e.g. shell commands) and then launch executors, waiting to hear back status messages. If I get TASK_FINISHED, that particular task is done. If I get TASK_FAILED/TASK_KILLED, I can retry the task elsewhere (or maybe give up).
The case I'm not sure about is TASK_LOST (or even a lost slave). I am hoping to ensure that I don't launch another copy of a task that is already running. After getting TASK_LOST, is it possible that the executor is still running somewhere, but a network problem has disconnected the slave from the master? Does Mesos deal with this case somehow, perhaps by having the executor kill itself (and the task) when it is unable to contact the master?
More generally, how can I make sure I don't have two of the same task running in this context?
Let me provide some background first and then try to answer your question.
1) The difference between TASK_LOST and the other terminal unsuccessful states is that restarting a lost task could end in TASK_FINISHED, while a failed or killed task most probably will not.
2) Until you get a TASK_LOST you should assume your task is running. Imagine a Mesos Agent (Slave) disconnecting for a while: its tasks may still be running and will be successfully reconciled, even though the connection is temporarily lost.
3) Now to your original question. The problem is that it is extremely hard to have exactly one instance running (see e.g. [1] and [2]). If you have lost the connection to your task, that can mean either a (temporary) network partition or that your task has died. You basically have to choose between two alternatives: either accept the possibility of multiple instances running at the same time, or accept periods when no instance is running at all.
4) It is not easy to guarantee that two copies of a task are not running concurrently. When you get a TASK_LOST update from Mesos, it means your task is either dead or orphaned (it will be killed once reconciled). Now imagine a situation where the slave running your task is disconnected from the Mesos Master due to a network partition: you will get a TASK_LOST update and the Master ensures the task is killed once the connection is restored, but until then your task keeps running on the disconnected slave, which violates the exactly-one-instance guarantee if you have already started another copy after receiving the TASK_LOST update.
5) Things you may want to look at:
recovery_timeout on Mesos slaves regulates when tasks commit suicide if the mesos-slave process dies.
slave_reregister_timeout on the Mesos Master specifies how much time slaves have to re-register with the Mesos Master and have their tasks reconciled (after which you get TASK_LOST updates for unreachable tasks).
[1] http://antirez.com/news/78
[2] http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
You can assume that TASK_LOST really means your task is lost and there is nothing you can do but launch another instance.
Two things to keep in mind though:
Your framework may register with a failover timeout, which means that if your framework cannot communicate with the slave for any reason (network unstable, slave died, scheduler died, etc.), Mesos will kill the tasks for this framework after they fail to recover within that timeout. You will get a TASK_LOST status only after the task is actually considered dead (e.g. when the failover timeout expires).
When not using a failover timeout, tasks will be killed immediately when connectivity is lost for any reason.
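To make the retry bookkeeping described above concrete, here is a minimal sketch of the decisions a scheduler's status-update callback might make. It is plain Python rather than the actual Mesos bindings, the state names simply mirror the Mesos task states discussed above, and the `reconciled` flag is a hypothetical signal you would derive from task reconciliation or the expiry of the re-registration/failover timeouts:

```python
from enum import Enum

class TaskState(Enum):
    STAGING = "staging"
    RUNNING = "running"
    FINISHED = "TASK_FINISHED"
    FAILED = "TASK_FAILED"
    KILLED = "TASK_KILLED"
    LOST = "TASK_LOST"

class RetryTracker:
    """Tracks per-task state and decides whether a task should be relaunched.

    TASK_LOST is treated as terminal only after reconciliation, mirroring the
    first answer above: until the master has confirmed the task is gone, it may
    still be running on a partitioned agent.
    """

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.attempts = {}   # task_id -> number of launches so far
        self.state = {}      # task_id -> last observed TaskState

    def record_launch(self, task_id):
        self.attempts[task_id] = self.attempts.get(task_id, 0) + 1
        self.state[task_id] = TaskState.STAGING

    def on_status_update(self, task_id, new_state, reconciled=False):
        """Return True if the caller should launch another copy of the task."""
        self.state[task_id] = new_state
        retries_left = self.attempts.get(task_id, 0) < self.max_retries
        if new_state in (TaskState.FAILED, TaskState.KILLED):
            return retries_left
        if new_state == TaskState.LOST:
            # Only relaunch once reconciliation (or the relevant timeout) has
            # confirmed the old copy is gone; otherwise two copies could run.
            return reconciled and retries_left
        return False  # STAGING / RUNNING / FINISHED: nothing to do

# Example:
tracker = RetryTracker()
tracker.record_launch("task-1")
print(tracker.on_status_update("task-1", TaskState.LOST))                   # False: wait
print(tracker.on_status_update("task-1", TaskState.LOST, reconciled=True))  # True: relaunch
```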

Hadoop High Availability Not Working

Hi, I am pretty new to the concept of Hadoop High Availability, and I have done all the basic configuration needed for it. When I manually killed the NameNode process on one machine, the other node became active and this node went into standby mode. But when I shut down the machine running the active node, the other node does not go into the active state.
Any help is appreciated, thanks in advance.
It may be that when you kill the process, the NameNode goes through a graceful shutdown, which includes notifying the other NameNode to take its place, which the other one does immediately.
On the other hand, when you shut down the machine, the graceful shutdown of the NameNode may not be executed, so the other NameNode does not yet know that it should take over. Given enough time, though, it should.
As far as I am concerned, if the node dies in an unexpected way, it stops sending heartbeats to its ZooKeeper, so you must give that process (ZooKeeper) enough time to make sure the active NameNode is really dead.
In the scenario where only the NameNode process dies, there is another process on that machine, the Failover Controller, which notifies the passive NameNode of the active one's death; that is why failover is fast in that case.
Please take a look at this thread, since it is well written.
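To see whether the standby eventually takes over (or never does), you can poll the HA state of both NameNodes while powering the active machine off. A minimal sketch, assuming the standard `hdfs haadmin` CLI; `nn1` and `nn2` are placeholder service IDs that depend on your dfs.ha.namenodes configuration:

```python
import subprocess
import time

NAMENODE_IDS = ["nn1", "nn2"]  # placeholders; use the IDs from your HA configuration

def service_state(nn_id):
    """Ask the cluster for a NameNode's HA state via `hdfs haadmin -getServiceState`."""
    result = subprocess.run(
        ["hdfs", "haadmin", "-getServiceState", nn_id],
        capture_output=True, text=True,
    )
    # Prints "active" or "standby" on success; the command fails if the NameNode is unreachable.
    return result.stdout.strip() if result.returncode == 0 else "unreachable"

if __name__ == "__main__":
    # Poll for a couple of minutes to watch the failover (or the lack of one).
    for _ in range(24):
        print({nn: service_state(nn) for nn in NAMENODE_IDS})
        time.sleep(5)
```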

Why does a number of completed tasks in Mapreduce decrease?

When running Hadoop jobs, I noticed that sometimes the number of completed tasks decreases and the number of canceled tasks increases.
How is this possible? Why does this happen?
I've only experienced this when our cluster was in a strange state, so I'm not sure if this is the same issue. Basically, map tasks would complete, and then the reducers would start... and then mappers would be reprocessed.
I believe the problem is that mapper output hangs around on the data node waiting for reducers to pick it up. If that node has issues or dies, the JobTracker decides it needs to rerun that task, even if it had already completed. Our issue was that the system our NameNode was on was having some non-Hadoop-related issues, and once those were resolved the behaviour seemed to go away.
Sorry if my experience is not relevant to your issue. Perhaps you can post more details? Do you see any error messages? Is there anything weird in your JobTracker or NameNode logs?

What's best practice for HA gearman job servers

From gearman's main page, they mention running with multiple job servers so if a job server dies, the clients can pick up a new job server. Given the statement and diagram below, it seems that the job servers do not communicate with each other.
Our question is what happens to those jobs that are queued in the job server that died? What is the best practice to have high-availability for these servers to make sure jobs aren't interrupted in a failure?
You are able to run multiple job servers and have the clients and workers connect to the first available job server they are configured with. This way if one job server dies, clients and workers automatically fail over to another job server. You probably don't want to run too many job servers, but having two or three is a good idea for redundancy.
Source
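As an illustration of that failover behaviour from the client side, here is a minimal sketch assuming the python-gearman client library; the host names are hypothetical, and the point is simply that every job server is listed so the library can fall back to whichever one is reachable:

```python
import gearman

# Both job servers are listed; if one dies, the client connects to the other.
JOB_SERVERS = ["gearman1.example.com:4730", "gearman2.example.com:4730"]  # hypothetical hosts

client = gearman.GearmanClient(JOB_SERVERS)

# The job is queued on whichever job server the client managed to reach.
request = client.submit_job("resize_image", "path/to/image.png")
print(request.state, request.result)
```

Workers take the same list of job servers, so they too keep working as long as at least one server is up.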
As far as I know there is no proper way to handle this at the moment, but as long as you run both job servers with persistent queues (using MySQL or another datastore; just don't point both servers at the same actual queue), you can simply restart a job server and it will load its queue from the database. This allows all the queued tasks to be handed to available workers, even after the server has died.
There is, however, no automagical way of doing this when a job server goes down, so if both the job server and the datastore go down (for example, a server running both locally goes down), the tasks are left in limbo until it comes back online.
The persistent queue is only read on startup (and inserted into / deleted from as tasks are submitted and completed).
I'm not sure about the complexity required to add such functionality to gearmand, or whether it's actually wanted, but simple "task added, task handed out, task completed" notifications between servers shouldn't be too complicated to handle.
