We use Spring to run scheduled tasks, which works fine on a single node. We want to run these scheduled tasks in a cluster of N nodes such that each task is executed by at most one node at a time. This is for an enterprise use case and we may expect up to 10 to 20 nodes.
I looked into various options:
1. Use Quartz, which seems to be a popular choice for running scheduled tasks in a cluster. Drawback: a database dependency, which I want to avoid.
2. Use ZooKeeper and always run the scheduled tasks only on the leader/master node. Drawback: task execution load is not distributed.
3. Use ZooKeeper and have the scheduled tasks fire on all nodes, but before a task runs, acquire a distributed lock and release it once execution is complete. Drawback: the system clocks on all nodes need to be in sync, which may be an issue if the application is overloaded, causing clock drift.
4. Use ZooKeeper and let the master node keep producing tasks as per the schedule and assign each to a random worker. A new task is not assigned if the previous scheduled task has not been worked on yet. Drawback: this appears to add too much complexity.
I am leaning towards #3, which appears to be a safe solution, assuming the ZooKeeper ensemble runs on a separate cluster with its system clocks kept in sync via NTP. This also assumes that if the system clocks are in sync, then all nodes have an equal chance of acquiring the lock to execute a task.
EDIT: After some more thought I realize this may not be a safe solution either, since the system clocks need to be in sync on the nodes where the scheduled tasks run, not just on the ZooKeeper cluster nodes. I say not safe because the nodes running the tasks can be overloaded by GC pauses and other factors, and there is a possibility of their clocks drifting out of sync. But then again, I would think this is a standard problem with distributed systems.
Could you please advise whether my understanding of each of the options is accurate? Or maybe there is a better approach than the listed options to solve this problem.
Well, you can improve #3 like this.
ZooKeeper provides watchers. That is, you can set a watch on a given ZNode (say at path /some/path), and all the nodes in your cluster watch that same ZNode. Whenever a node decides (per its schedule, or however else) that the scheduled task should now run:
First, it creates a PERSISTENT_SEQUENTIAL child node under /some/path (which all the nodes are watching). You can also set that node's data as you wish; it may be a JSON string describing the task to be run. The new ZNode path will look like /some/path/prefix_<sequence-number>.
Then, all the nodes in the cluster are notified about the created child node. Each of them fetches the newly created ZNode's data and decodes the task.
Now, each node tries to acquire a distributed lock. Whichever node acquires it first executes the task. Once done, that node should record that the task was executed (say, by creating a child ZNode named success under /some/path/prefix_<sequence-number>), and then release the lock.
Whenever a node is about to execute a task, before trying to acquire the distributed lock it should check whether that ZNode already has a success child.
This design ensures that no task is run twice, by checking for the child node named success under the ZNode that was created to announce the task.
I have used the above design for an enterprise solution. Actually for a distributed command framework ;-)
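To make it concrete, here is a minimal sketch of that flow, assuming Apache Curator as the ZooKeeper client. The connection string and the JSON payload are illustrative, and the lock child name is my own; the watch, the prefix_<sequence-number> child and the success marker come from the design above.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.PathChildrenCache;
import org.apache.curator.framework.recipes.cache.PathChildrenCacheEvent;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class DistributedTaskRunner {

    private static final String TASKS_PATH = "/some/path"; // the ZNode all nodes watch

    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",            // placeholder ensemble
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every node watches the same parent ZNode for new task announcements.
        PathChildrenCache watcher = new PathChildrenCache(client, TASKS_PATH, true);
        watcher.getListenable().addListener((c, event) -> {
            if (event.getType() == PathChildrenCacheEvent.Type.CHILD_ADDED) {
                runOnce(c, event.getData().getPath(), new String(event.getData().getData()));
            }
        });
        watcher.start();

        // When this node's local schedule fires, announce the task as a
        // PERSISTENT_SEQUENTIAL child whose data describes the task.
        client.create()
              .withMode(CreateMode.PERSISTENT_SEQUENTIAL)
              .forPath(TASKS_PATH + "/prefix_", "{\"task\":\"report\"}".getBytes());
    }

    private static void runOnce(CuratorFramework client, String taskPath, String taskJson) throws Exception {
        InterProcessMutex lock = new InterProcessMutex(client, taskPath + "/lock");
        lock.acquire();
        try {
            // Skip if another node already marked this task as executed.
            if (client.checkExists().forPath(taskPath + "/success") != null) {
                return;
            }
            // executeTask(taskJson);  // hypothetical: run the decoded task here
            client.create().forPath(taskPath + "/success");
        } finally {
            lock.release();
        }
    }
}
```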
ZooKeeper and etcd aren't the best tools for this use case.
If your environment allows you to use Akka, it would be easier to use Akka Cluster with a smallest-mailbox router (or whatever cluster router you prefer), and then push scheduled jobs to the cluster's ActorRef. It is easier to set up, and you can run thousands of nodes in a cluster with it (it uses SWIM, the protocol Cassandra and Nomad use).
Scalecube would also do it rather easily; again, it uses SWIM.
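If it helps, a rough sketch of that idea with the classic Akka Java API might look like the following; the actor, system, and router names are made up, and it assumes an application.conf with akka.actor.provider = "cluster" and seed nodes already configured.

```java
import java.util.Collections;

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.cluster.routing.ClusterRouterPool;
import akka.cluster.routing.ClusterRouterPoolSettings;
import akka.routing.SmallestMailboxPool;

public class ClusterScheduling {

    // Hypothetical worker actor: executes one scheduled job per message.
    static class JobWorker extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(String.class, job -> System.out.println("Running job: " + job))
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("scheduler");

        // Cluster-aware pool: up to 100 workers overall, 5 per node, local routees
        // allowed, no role restriction. Smallest-mailbox picks the least busy worker.
        ActorRef jobRouter = system.actorOf(
                new ClusterRouterPool(
                        new SmallestMailboxPool(0),
                        new ClusterRouterPoolSettings(100, 5, true, Collections.<String>emptySet()))
                        .props(Props.create(JobWorker.class)),
                "jobRouter");

        // Push each scheduled job to the router; one worker somewhere in the cluster handles it.
        jobRouter.tell("nightly-cleanup", ActorRef.noSender());
    }
}
```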
Does a stateless node mean just being independent of the other nodes? Can you explain this concept with respect to Hadoop?
The explanation can be as follows: each mapper/reducer has no idea about all the other mappers/reducers (i.e. about their current states, their particular outputs if any, etc.). Such statelessness is not great for certain data-processing workloads (e.g. graph data) but allows easy parallelization (a particular map/reduce task can run on any node, so a failed mapper/reducer is not an issue; just start a new one on the same input split or mapper outputs).
I would say that statefulness of nodes in computing infrastructures has a slightly different meaning from the one you have defined. Remember there is always a coordination process running somewhere, so there is no complete independence between the nodes.
What it actually means in computing infrastructures is that a node does not store anything about the computation it is performing on persistent storage. Consider the following: you have a master running on some machine delegating tasks to workers; the workers keep the information in RAM and retrieve it from RAM as needed for the computation. Workers also write their results into RAM. You can consider such worker nodes stateless, since whenever a worker node fails (from a power cut, for example) it has no mechanism that would allow it to recover execution from the point where it stopped. But the master will still know that the node has failed and will delegate the task to another machine in the cluster.
Regarding Hadoop, the architecture is stateful: first of all, whenever a job starts executing it transfers all the metadata to the worker nodes (the jar file, split locations, etc.). Secondly, when a task is scheduled on a node that does not contain the input data, the data is transferred there. Additionally, the intermediate data is stored on disk precisely for failure-recovery reasons, so the failure-recovery mechanisms can resume the job from the point where execution stopped.
Problem statement and question:
(In the context of Hadoop and jobs that can fork child jobs.) Oozie launcher jobs stay RUNNING until their child jobs complete. When too many jobs are scheduled, the child jobs cannot run because all of the resources are allocated to the parent jobs. Parent jobs can't release resources until their child jobs finish, and the child jobs can't run until they get resources: a typical deadlock. What is an efficient way to avoid these deadlocks? (Efficient way = best possible resource allocation.)
Background
On a Hadoop 2.5.0-cdh5.3.0 cluster configured with YARN and its Fair Scheduler, I use Oozie to schedule jobs. I understand and account for one extra launcher job (a map task in a container managed by its own ApplicationMaster) in resource allocations.
Current solutions and the problems associated with them:
Pre-calculate the total maximum resources required to comfortably run a job, then use this value to limit the maximum number of jobs allowed to run simultaneously.
Problems :
Under-utilization of resources, as the resources are reserved in advance.
Problems with scaling the cluster up and down. Say a node dies or I add a new node; then I have to update this setting (I don't think this is the Hadoop way of managing a cluster).
Separate the Oozie launcher (parent) jobs from the child jobs by putting them into different job queues, say "oozie_pool" and "default_pool". In the Fair Scheduler config I gave appropriate weights to each queue (oozie_pool=10%; default_pool=90%); a sketch of this is shown after the problem note below.
Problem:
At this time the Fair Scheduler is not robust. In my setup, it occasionally stops preempting. One cause is that it runs into an infinite polling loop when a task requests X units of memory but no single node has X units of free memory, while the sum of the available memory across all nodes is greater than X.
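For reference, a simplified sketch of the kind of fair-scheduler.xml I mean for the queue separation above; the queue names match the description, but the weights and the maxRunningApps cap are illustrative values, not my production settings.

```xml
<?xml version="1.0"?>
<!-- Illustrative sketch only: separate the Oozie launchers from the child jobs -->
<allocations>
  <queue name="oozie_pool">
    <weight>1.0</weight>                 <!-- roughly 10% of the cluster for launcher jobs -->
    <maxRunningApps>20</maxRunningApps>  <!-- optional extra cap on concurrent launchers -->
  </queue>
  <queue name="default_pool">
    <weight>9.0</weight>                 <!-- roughly 90% for the actual child jobs -->
  </queue>
</allocations>
```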
In the following, I refer to this article: Understanding the Parallelism of a Storm Topology by Michael G. Noll
It seems to me that a worker process may host an arbitrary number of executors (threads) to run an arbitrary number of tasks (instances of topology components). Why should I configure more than one worker per cluster node?
The only reason I see is that a worker can only run a subset of at most one topology. Hence, if I want to run multiple topologies on the same cluster, I would need to configure the same number of workers per cluster node as the number of topologies to be run.
(For example, I would want to be flexible in case some cluster nodes fail. If only one cluster node remains, I need at least as many worker processes on it as there are topologies running on the cluster in order to keep all topologies running.)
Is there any other reason? Especially, is there any reason to configure more than one worker per cluster node if running only one topology? (Better fail-safety, etc.)
To balance the cost of one supervisor daemon per node against the risk and impact of a worker crashing. If you have one large, monolithic worker JVM, one crash impacts everything running in that worker, and badly behaving parts of your worker affect more residents. By having more than one worker per node, you keep your supervisor efficient and get something of a bulkhead pattern, avoiding the all-or-nothing approach.
The shared resources I refer to could be yours or Storm's; several pieces of Storm's architecture are shared per JVM and can create contention problems. I am referring to the receive and send threads, and the underlying network pieces, specifically. Documented here.
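To make the knob itself concrete, here is a minimal sketch of asking for several workers when submitting a topology; the topology name is a placeholder, and the package names assume org.apache.storm (Storm 1.x or later; older releases used backtype.storm).

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class MultiWorkerTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // builder.setSpout(...) and builder.setBolt(...) go here as usual (omitted).

        Config conf = new Config();
        // Several worker JVMs per topology: a crash of one worker only takes down the
        // executors in that JVM, and the send/receive threads are no longer shared by
        // everything running on the node.
        conf.setNumWorkers(4);

        StormSubmitter.submitTopology("example-topology", conf, builder.createTopology());
    }
}
```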
I am running a job timing analysis. I have a pre-configured cluster with 8 nodes. I want to run a given job with 8 nodes, 6 nodes, 4 nodes, and 2 nodes respectively and note down the corresponding run times. Is there a way I can do this programmatically, i.e. by using appropriate settings in the job configuration in Java code?
There are a couple of ways; I would prefer them in the following order.
Exclude files can be used to prevent some of the TaskTrackers/DataNodes from connecting to the JobTracker/NameNode. Check this FAQ. The properties to use are mapreduce.jobtracker.hosts.exclude.filename and dfs.hosts.exclude. Note that once the files have been changed, the NameNode and the JobTracker have to be refreshed using the mradmin and dfsadmin commands with the refreshNodes option, and it might take some time for the cluster to settle because data blocks have to be moved off the excluded nodes.
Another way is to stop the TaskTracker on the nodes. Then map/reduce tasks will not be scheduled on those nodes, but data will still be fetched from all the DataNodes, so the DataNodes also need to be stopped. Make sure that the NameNode gets out of safe mode and that the replication factor is set appropriately (with 2 DataNodes, the replication factor can't be 3).
The Capacity Scheduler can also be used to limit the usage of the cluster by a particular job. But when resources are free/idle, the scheduler will allocate resources beyond the capacity for better utilization of the cluster. I am not sure whether this can be disabled.
Are you good with scripting? If so, play around with the daemons' start scripts. Since this is an experimental setup, I think restarting Hadoop for each experiment should be fine.
I'm trying to run the Fair Scheduler, but it's not assigning map tasks to some nodes even with only one job running. My understanding is that the Fair Scheduler will use the configured slot limits unless multiple jobs exist, at which point the fairness calculations kick in. I've also tried setting all queues to FIFO in fair-scheduler.xml, but I get the same results.
I've set the scheduler in all mapred-site.xml files with the mapreduce.jobtracker.taskscheduler parameter (although I believe only the JobTracker needs it), and some nodes have no problem receiving and running map tasks. However, other nodes either never get any map tasks, or get one round of map tasks (i.e. all slots filled once) and then never get any again.
I tried this as a prerequisite to developing my own LoadManager, so I went ahead and put a debug LoadManager together. From the log messages, I can see that the problem nodes keep requesting map tasks and that their slots are empty. However, they are never assigned any.
All nodes work perfectly with the default scheduler. I only started having this issue when I enabled the Fair Scheduler.
Any ideas? Does someone have this working, and has taken a step that I've missed?
EDIT: It's worth noting that the Fair Scheduler web UI page shows the correct Fair Share count, but the Running column is always lower. I'm using the default per-user pools and have only 1 user and 1 job at a time.
The reason was the undocumented mapred.fairscheduler.locality.delay parameter. The problematic nodes were located on a different rack with HDFS disabled, making all tasks on those nodes non-rack-local. Because of this, they were incurring large delays due to the Fair Scheduler's delay scheduling algorithm, described here.
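For anyone hitting the same thing, this is roughly how that parameter can be overridden in mapred-site.xml; setting it to 0 effectively disables delay scheduling, though whether 0 is the right value depends on how much data locality you are willing to trade away.

```xml
<!-- Sketch: lower (or disable) the Fair Scheduler's delay-scheduling wait.
     0 disables the delay entirely; a small positive value in milliseconds is a milder option. -->
<property>
  <name>mapred.fairscheduler.locality.delay</name>
  <value>0</value>
</property>
```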