Is it possible to schedule any map reduce job on some specific nodes, instead of all nodes, in a Hadoop cluster? Say for example, on 4 slave nodes out of 10 available nodes. I tried searching on Google but didn't find any relevant result. This page says that by default all the jobs get scheduled on the complete cluster.
Reason of my requirement:
I have to implement a distributed relational database as a graduate level assignment work. I am using Hadoop and as per the assignment requirement we have to replicate data to the connected machines of the cluster. Now one of our replication model asks to run the query on a subset of available machines.
Suppose to process some data on hadoop cluster ,you have submitted a map reduce job ,now what it does is job tracker which plays the role of a master by assigning ,monitoring and coordinating different tasks for different task tracker.
Job tracker will talk to namenode which is again play role of a master , for the data which need to be processed ,since namenode holds all the information of a metadata ,so it would provide all the information where that particular data in residing in terms of which block is residing at which datanode to job tracker.
As a part of hadoop framework job tracker would invoke the task tracker of those datanodes where your data blocks are located ,worst scenario task tracker of that node where is closest to the datanode where some of the data blocks are there.
So summary is you we can't control which particular machines would be used that would depend where your data blocks are residing for that particular job. if it is located in 4 machines so at that moment 4 machines would be used if 10 then 10 would be used
Related
I am learning about Amazon EMR lately, and according to my knowledge the EMR cluster lets us choose 3 nodes.
Master which runs the Primary Hadoop daemons like NameNode,Job Tracker and Resource manager.
Core which runs Datanode and Tasktracker daemons.
Task which only runs TaskTracker only.
My question to you guys in why does EMR provide task nodes? Where as hadoop suggests that we should have Datanode daemon and Tasktracker daemon on the same node. What is Amazon's logic behind doing this? You can keep data in S3 stream it to HDFS on the core nodes, do the processing on HDFS other than sharing data from HDFS to task nodes which will increase IO over head in that case. Because as far as my knowledge in hadoop, TaskTrackers run on DataNodes which have data blocks for that particular task then why have TaskTrackers on different nodes?
According to AWS documentation [1]
The node types in Amazon EMR are as follows:
Master node: A node that manages the cluster by running software
components to coordinate the distribution of data and tasks among
other nodes for processing. The master node tracks the status of tasks
and monitors the health of the cluster. Every cluster has a master
node, and it's possible to create a single-node cluster with only the
master node.
Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your
cluster. Multi-node clusters have at least one core node.
Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
According to AWS documentation [2]
Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors.
Task nodes don't run the Data Node daemon, nor do they store data in HDFS.
Some Use cases are:
You can use Task nodes for processing streams from S3. In this case Network IO won't increase as the used data isn't on HDFS.
Task nodes can be added or removed as no HDFS daemons are running. Hence, no data on task nodes. Core nodes have HDFS daemons running and keep adding and removing new nodes isn't a good practice.
Resources:
[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters
[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html#emr-plan-task
One use case is if you use spot instances as task nodes. If its cheap enough, it may be worth while to add some compute power to your EMR cluster. This would be mostly for non-sensitive tasks.
Traditional Hadoop assumes all your workload requires high I/O, with EMR you can choose instance type based on your workload. For high IO needs example up to 100Gbps go with C type or R type, and you can use placement groups. And keep your Core Nodes to Task nodes ratio to 1:5 or lower, this will keep the I/O optimal and if you want higher throughput select C's or R's as your Core and Task. (edited - explaining barely any perf loss with EMR)
Task node's advantage it can scale up/down faster and can minimize compute cost. Traditional Hadoop Cluster it's hard to scale either ways since slaves also part of HDFS.
Task nodes are optional since core nodes can run Map and Reduce.
Core nodes takes longer to scale up/down depending on the tasks hence given the option of Task node for quicker auto scaling.
Reference: https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/
The reason why Hadoop suggest that we should have DataNode and Tasktracker Daemons on the same nodes is because it wants our processing power as close to data as possible.
But there also comes Rack level optimization when you have to deal with multi-nodes cluster. In my point of view AWS reducing I/O overhead by providing task nodes in the same rack in which Datanodes exists.
And the reason to provide Task nodes are that we need more processing over our data than to just focusing on storing them on HDFS. We would always want more TaskTracker than the Daemon nodes. So AWS has provided you the opportunity to increase it using a complete node benefiting RackLevel optimization.
And the way you want to get data into your cluster(using S3 and only core nodes) is a good option if you want good performance but using only a transient cluster.
As per my understanding, files stored in HDFS are divided into blocks and and each block is replicated to multiple nodes, 3 by default. How does Hadoop framework choose the node to run a map job, out of all the nodes on which a particular block is replicated.
As I know, there will be same amounts of map tasks as amounts of blocks.
See manual here.
Usually, framework choose those nodes close to the input block for reducing network bandwidth for map task.
That's all I know.
In Mapreduce 1 it depends on how many map task are running in that datanode which hosts a replica, because the number of map tasks is fixed in MR1. In MR2 there are no fixed slots, so it depends on number of tasks already running in that node.
I just want to clarify this quote "Code moves near data for computation",
does this mean all java MR written by developer deployed to all servers in cluster ?
If 1 is true, if someone changes a MR program, how its distributed to all the servers ?
Thanks
Hadoop put MR job's jar to the HDFS - its distributed file system. The task trackers which needed it will take it from there. So it distributed to some nodes and then loaded on-demand by nodes which actually needs them. Usually this needs mean that node is going to process local data.
Hadoop cluster is "stateless" in relation to the jobs. Each time job is viewed as something new and "side effects" of the previous job are not used.
Indeed, when some small number of files (or splits to be precise) are to be processed on large cluster, optimization of sending jar to only few hosts where data indeed reside might somewhat reduce the job latency. I do not know if such optimization is planned.
In hadoop cluster you use the same nodes for data and computation. That means your hdfs datanode is setup on the same cluster used by task tracker for computation. So now when you execute MR jobs job tracker looks where your data is stored. Whereas in other computation model data is not stored in the same cluster and you may have to move data while you are doing your computation on some compute node.
After you start a job all the map functions will get splits of your input file. These map functions are executed so that split of input file is closer to them or in other words in the same rack. This is what we mean by computation is done closer to data.
So to clarify your question, every time you run MR job its code is copied to all the nodes. So if we change a code a new code is copied to all the nodes.
I have Hadoop running on a cluster that has non-dedicated nodes (i.e. it shares nodes with other applications/users). When the other users are using a cluster's node, it is not allowed to run Hadoop jobs in that node. Thus, it is possible that only a few nodes are available in a given moment, and that this few nodes do not have all data blocks (replicas) need by the Hadoop job.
I also have a big Network-Attached Storage that is used for backup. So, I am wondering if there is a way to use it as a secondary storage for Hadoop. For example, if some data block is missing in the cluster, Hadoop would get the block from the secondary/backup storage.
Any ideas?
Thanks in advance!
I am not aware about such a "mixed" storage mode for the hadoop. So I do not think that your scenario is directly supported by hadoop.
For me it looks like you need more "elastic" solution. If EMR would be available open source - it might be good choice - where NAS would play the role of S3.
I would suggest the following solution in Your case:
Install and run data nodes on all available servers. They are not as resource hungy as task trackers - since they are only sequentially read/write data.
Install task trackers on all machines also, but run only on these which are not used now. Hadoop is smart enough to preserve data locality when possible. In the same time hadoop will takes change in number of task trackers much easier then disappearing data nodes.
Alternatively you can build cluster of task trackers only, not use HDFS and run jobs against the NAS.
In all cases the main interference with other users I still expect is network congestions - during shuffle stage hadoop is usually saturating the network.
I am running a job timing analysis. I have a pre configured cluster with 8 nodes. I want to run a given job with 8 nodes, 6 nodes , 4 nodes and 2 nodes respectively and note down the corresponding run times. Is there a way i can do this programatically, i.e by using appropriate settings in the Job configuration in Java code ?
There are a couple of ways. Would prefer in the same order.
exclude files can be used to not allow some of the task trackers/data nodes connect to the job tracker/ name node. Check this faq. The properties to be used are mapreduce.jobtracker.hosts.exclude.filename and dfs.hosts.exclude. Note than once the files have been changed, the name node and the job tracker have to be refreshed using the mradmin and dfsadmin commands with the refreshNodes option and it might take some time for the cluster to settle because data blocks have to be moved from the excluded nodes.
Another way is to stop the task tracker on the nodes. Then the map/reduce tasks will not be scheduled on that node. But, the data will still be fetched from all the data nodes. So, the data nodes also need to be stopped. Make sure that the name node gets out of safe mode and the replication factor is also set properly (with 2 data nodes, the replication factor can't be 3).
A Capacity Scheduler can also be used to limit the usage of a cluster by a particular job. But, when resources are free/idle then the scheduler will allocate resources beyond capacity for better utilization of the cluster. I am not sure if this can be stopped.
Well are you good with scripting ? If so play around with start scripts of the daemons. Since this is an experimental setup, I think restarting hadoop for each experiment should be fine.