Setting up a Hadoop cluster

Do the worker nodes in a Hadoop cluster need Hadoop installed on each one?
What if I only need the computing power of some PCs: can I use just MapReduce without installing HDFS on each node?

When you say worker nodes, that includes both the DataNode and the TaskTracker, so in that sense you do need them on each machine if you wish to run MR jobs.
But the main point here is what you would do with MR alone. Running MR jobs on data stored in the local filesystem is not of much use, because you can't harness the distributed storage and parallelism that Hadoop provides in that situation.

To use the computing power of a node you need to run a TaskTracker on that node, so Hadoop must be installed.
If you don't need HDFS, you can run only the TaskTracker and not start the DataNode.
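For illustration only (not part of the original answer), here is a minimal sketch of a job driver that points MapReduce at the local filesystem instead of HDFS. The property name fs.default.name is the classic Hadoop 1 key (Hadoop 2 uses fs.defaultFS), and the input/output paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalFsJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the job at the local filesystem instead of HDFS
            // (fs.default.name is the Hadoop 1 key; Hadoop 2 uses fs.defaultFS).
            conf.set("fs.default.name", "file:///");

            Job job = Job.getInstance(conf, "local-fs-example");
            job.setJarByClass(LocalFsJobDriver.class);
            // With no mapper or reducer set, the identity implementations
            // simply copy (offset, line) records from input to output.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // Hypothetical local paths; adjust to your machine.
            FileInputFormat.addInputPath(job, new Path("file:///tmp/mr-input"));
            FileOutputFormat.setOutputPath(job, new Path("file:///tmp/mr-output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Even without HDFS, the worker nodes still need Hadoop installed so the job's tasks can be handed to their TaskTrackers.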

Related

How is a huge amount of data input into Hadoop?

I am new to big data and Hadoop. I would like to know whether the NameNode, DataNode, secondary NameNode, JobTracker, and TaskTracker are different systems. If I want to process 1000 PB of data, how is the data divided, who performs that task, and where should I input the 1000 PB of data?
Yes, the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker are all separate processes (JVMs, you could call them). You can start them all on one physical machine (pseudo-distributed/local mode) or on different physical machines (distributed mode). These are all part of Hadoop 1.
Hadoop 2 introduced containers with YARN, in which the JobTracker and TaskTracker are replaced by the more flexible ResourceManager, ApplicationMaster, NodeManager, and so on. You can find more info on the hadoop-yarn-site.
Data is stored in HDFS (Hadoop Distributed File System) in blocks, 64 MB by default. When data is loaded into HDFS, Hadoop distributes it across the cluster in blocks of the configured size. When a job runs, the code is distributed to the nodes in the cluster so that processing happens where the data resides, except during the shuffle and sort phases.
I hope this gives you a general idea of how Hadoop and HDFS work. The following are some links to start with; a minimal mapper example is sketched after these links.
MapReduce programming
Cluster setup
Hadoop commands
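As a concrete starting point for the "MapReduce programming" link above, here is a minimal, illustrative word-count mapper using the standard Hadoop Java MapReduce API (the class name is my own, not from the linked page):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every token in each input line; a reducer would
    // then sum the counts for each word.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }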

Amazon EMR - What is the need for Task nodes when we have Core nodes?

I have been learning about Amazon EMR lately, and as I understand it an EMR cluster lets us choose three node types:
Master, which runs the primary Hadoop daemons such as the NameNode, JobTracker, and ResourceManager.
Core, which runs the DataNode and TaskTracker daemons.
Task, which runs only the TaskTracker.
My question is: why does EMR provide task nodes when Hadoop suggests we should have the DataNode daemon and the TaskTracker daemon on the same node? What is Amazon's logic behind this? You can keep data in S3, stream it to HDFS on the core nodes, and do the processing there, rather than shipping data from HDFS to the task nodes, which would increase I/O overhead. As far as I know, in Hadoop the TaskTrackers run on the DataNodes that hold the data blocks for a particular task, so why have TaskTrackers on different nodes?
According to AWS documentation [1]
The node types in Amazon EMR are as follows:
Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.
Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
According to AWS documentation [2]
Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors.
Task nodes don't run the Data Node daemon, nor do they store data in HDFS.
Some use cases are:
You can use task nodes for processing streams from S3. In this case network I/O won't increase, since the data being used isn't on HDFS.
Task nodes can be added or removed freely because no HDFS daemons run on them, so no data lives on task nodes. Core nodes do run HDFS daemons, and repeatedly adding and removing such nodes isn't good practice.
Resources:
[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters
[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html#emr-plan-task
One use case is using spot instances as task nodes. If they are cheap enough, it may be worthwhile to add some compute power to your EMR cluster this way. This is best suited to tasks that can tolerate interruption.
Traditional Hadoop assumes your entire workload requires high I/O; with EMR you can choose the instance type based on your workload. For high I/O needs, for example up to 100 Gbps, go with C-type or R-type instances, and you can use placement groups. Keep your core-to-task node ratio at 1:5 or lower; this keeps I/O optimal, and if you want higher throughput select C or R instances for both core and task nodes.
The advantage of task nodes is that they can scale up and down faster, which helps minimize compute cost. A traditional Hadoop cluster is hard to scale in either direction, since the slaves are also part of HDFS.
Task nodes are optional, since core nodes can also run map and reduce tasks.
Core nodes take longer to scale up or down depending on the running tasks, hence the option of task nodes for quicker auto scaling.
Reference: https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/
The reason Hadoop suggests running the DataNode and TaskTracker daemons on the same nodes is that it wants the processing power as close to the data as possible.
But rack-level optimization also comes into play when you deal with a multi-node cluster. In my view, AWS reduces I/O overhead by provisioning task nodes in the same rack as the DataNodes.
And the reason to provide task nodes is that we often need more processing power over our data than just the capacity to store it on HDFS; we usually want more TaskTrackers than DataNode hosts. So AWS gives you the opportunity to add that capacity as whole nodes while still benefiting from rack-level optimization.
And the way you propose to get data into your cluster (using S3 and only core nodes) is a good option if you want good performance but are using only a transient cluster.

DataNode and TaskTracker on separate machines?

I am fairly new to Hadoop and I have the following questions about the Hadoop framework. Can anybody please guide me on this?
Are the DataNode and TaskTracker located on physically separate machines in a production environment?
When does Hadoop split a file into blocks? Does this happen when you copy a file from the local filesystem into HDFS?
Short answer
1) Most of the time, but not necessarily.
2) Yes.
Long Answer
1)
An installation of Hadoop on a cluster will have 2 main types of nodes:
Master Nodes
Data Nodes
Master Nodes typically run at least:
CLDB
Zookeeper
JobTracker
Data Nodes typically run at least:
TaskTracker
The DataNode service can run on a different node than the TaskTracker service. However, the Hadoop docs for the DataNode service recommend running the DataNode and TaskTracker on the same nodes so that MapReduce operations are performed close to the data.
For the MapR distribution of Hadoop, the two server roles typically run:
MapR Control Node
ZooKeeper *
CLDB *
JobTracker *
HBaseMaster
NFS Gateway
Webserver
MapR Data Node
TaskTracker *
RegionServer (sometimes)
Zookeeper (sometimes)
2)
While most filesystems store data in blocks, HDFS distributes and replicates the blocks across DataNodes. When you first store data in HDFS, it breaks the data into blocks and stores them across different nodes according to the specified replication factor. However, if you add new DataNodes to the cluster, it will not automatically rebalance old blocks across them unless the replication factor is not met.
(Thanks to @javadba for clarifying this!)
Given that TrinitronX has already answered #1 - though the short answer should be NO: the DataNode and TaskTracker MAY be on different physical machines, but it is uncommon - you are best off starting with "slave" machines that run both a DataNode and a TaskTracker.
So this is an answer to the second part of the question:
2) When does Hadoop split a file into blocks? Does this happen when you copy a file from the local filesystem into HDFS?
Yes. The file is broken into blocks upon loading into HDFS.
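As a hedged illustration of this (the paths are hypothetical), the HDFS Java API lets you copy a file in and then list the blocks it was split into, along with the DataNodes holding each replica:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInspector {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical paths; the file is split into blocks as it is written to HDFS.
            Path src = new Path("file:///tmp/bigfile.dat");
            Path dst = new Path("/user/demo/bigfile.dat");
            fs.copyFromLocalFile(src, dst);

            FileStatus status = fs.getFileStatus(dst);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            System.out.println("Block size: " + status.getBlockSize()
                    + ", replication: " + status.getReplication());
            for (BlockLocation block : blocks) {
                // Each block reports the hosts holding its replicas.
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }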
The DataNode and JobTracker can be run on different machines.
Hadoop always stores files as blocks during all operations on Hadoop.
Refer to:
1. Hadoop JobTracker and TaskTracker
2. Hadoop block size and replication

Differences between MapReduce and YARN

I was reading about Hadoop and MapReduce with respect to straggler problems and the papers on this problem,
but yesterday I found that there is Hadoop 2 with YARN,
and unfortunately no paper discusses the straggler problem in YARN.
So I want to know: what is the difference between MapReduce and YARN with regard to stragglers?
Does YARN suffer from the straggler problem?
And when the MR master asks the ResourceManager for resources, will the ResourceManager give the MR master all the resources it needs, or does it depend on the cluster's computing capacity?
Thanks so much.
Here is a comparison of MapReduce 1.0 and MapReduce 2.0 (YARN).
MapReduce 1.0
In a typical Hadoop cluster, racks are interconnected via core switches, and core switches connect to top-of-rack switches. Enterprises using Hadoop should consider using 10GbE, bonded Ethernet, and redundant top-of-rack switches to mitigate risk in the event of failure. A file is broken into 64 MB chunks by default and distributed across DataNodes. Each chunk has a default replication factor of 3, meaning there will be 3 copies of the data at any given time. Hadoop is "rack aware": HDFS replicates chunks onto nodes in different racks, the JobTracker assigns tasks to the nodes closest to the data, and rack awareness helps the NameNode determine the 'closest' chunk to a client during reads. The administrator supplies a script which tells Hadoop which rack each node is in, for example: /enterprisedatacenter/rack2.
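As a hedged sketch of how that script is wired in (the script path and hostnames are hypothetical), Hadoop 2 reads the script location from the net.topology.script.file.name property (topology.script.file.name in Hadoop 1) and resolves hosts to racks through a DNSToSwitchMapping such as ScriptBasedMapping:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.net.DNSToSwitchMapping;
    import org.apache.hadoop.net.ScriptBasedMapping;

    public class RackResolutionDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hypothetical path to the administrator-supplied topology script,
            // which maps a host or IP to a rack path such as /enterprisedatacenter/rack2.
            conf.set("net.topology.script.file.name", "/etc/hadoop/conf/rack-topology.sh");

            // ScriptBasedMapping is the mapping implementation that invokes the script.
            DNSToSwitchMapping mapping = new ScriptBasedMapping(conf);
            List<String> racks = mapping.resolve(Arrays.asList("datanode01", "datanode02"));
            System.out.println(racks); // e.g. [/enterprisedatacenter/rack1, /enterprisedatacenter/rack2]
        }
    }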
Limitations of MapReduce 1.0: Hadoop can scale up to about 4,000 nodes. When it exceeds that limit, it exhibits unpredictable behavior such as cascading failures and serious deterioration of the overall cluster. Another issue is multi-tenancy: it is impossible to run frameworks other than MapReduce 1.0 on the cluster.
MapReduce 2.0
MapReduce 2.0 has two components – YARN that has cluster resource management capabilities and MapReduce.
In MapReduce 2.0, the JobTracker is divided into three services:
ResourceManager, a persistent YARN service that receives and runs applications on the cluster. A MapReduce job is an application.
JobHistoryServer, to provide information about completed jobs
ApplicationMaster, to manage each MapReduce job; it is terminated when the job completes.
The TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a node. The NodeManager is responsible for launching containers, each of which can hold a map or reduce task.
This new architecture breaks up the JobTracker model by letting the new ResourceManager manage resource usage across applications, with ApplicationMasters taking responsibility for managing the execution of jobs. This change removes a bottleneck and lets Hadoop clusters scale to configurations larger than 4,000 nodes. The architecture also allows simultaneous execution of a variety of programming models, such as graph processing, iterative processing, machine learning, and general cluster computing, including traditional MapReduce.
You say "Differences between MapReduce and YARN". MapReduce and YARN definitely different. MapReduce is Programming Model, YARN is architecture for distribution cluster. Hadoop 2 using YARN for resource management. Besides that, hadoop support programming model which support parallel processing that we known as MapReduce. Before hadoop 2, hadoop already support MapReduce. In short, MapReduce run above YARN Architecture. Sorry, i don't mention in part of straggler problem.
"when MRmaster asks resource manger for resources?"
when user submit MapReduce Job. After MapReduce job has done, resource will be back to free.
"resource manger will give MRmaster all resources it needs or it is according to cluster computing capabilities"
I don't get this question point. Obviously, the resources manager will give all resource it needs no matter what cluster computing capabilities. Cluster computing capabilities will influence on processing time.
There is no YARN in MapReduce 1; YARN comes with MapReduce 2.
If by the straggler problem you mean that the first task waits on 'something', which then causes further waiting along the chain of tasks that depend on it, then I guess this problem always exists in MR jobs. Getting allocated resources naturally contributes to this problem, along with all the other things that may cause components to wait for something.
Tez, which is intended as a drop-in replacement for the MR job runtime, does things differently. Instead of running tasks the same way the current MR ApplicationMaster does, it uses a DAG of tasks, which does a much better job of avoiding bad straggler situations.
You need to understand the relationship between MR and YARN. YARN is simply a dumb resource scheduler, meaning it doesn't schedule 'tasks'. What it gives the MR ApplicationMaster is a set of resources (in a sense, just a combination of memory, CPU, and location). It is then the MR ApplicationMaster's responsibility to decide what to do with those resources.
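To make that concrete, here is a hedged, illustrative sketch (not from the original answer) of how an ApplicationMaster asks YARN for exactly such a bundle of memory, CPU, and location using the AMRMClient API; the hostnames, rack, and sizes are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class ResourceRequestSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // This code would normally run inside an ApplicationMaster launched by YARN.
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(conf);
            rmClient.start();
            rmClient.registerApplicationMaster("", 0, "");

            // A YARN "resource" is just memory plus vcores; locality is expressed
            // as preferred hosts and racks on the request (all values hypothetical).
            Resource capability = Resource.newInstance(2048, 2); // 2 GB, 2 vcores
            ContainerRequest request = new ContainerRequest(
                    capability,
                    new String[] {"datanode01"},                  // preferred hosts
                    new String[] {"/enterprisedatacenter/rack2"}, // preferred racks
                    Priority.newInstance(1));
            rmClient.addContainerRequest(request);

            // A real ApplicationMaster would now call allocate() in a heartbeat loop
            // and decide what to run in each granted container.
            rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
            rmClient.stop();
        }
    }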

Is it possible to specify which tasktrackers to use in a MapReduce job?

We have two types of jobs in our Hadoop cluster. One job uses MapReduce HBase scanning; the other is pure manipulation of raw files in HDFS. Within our HDFS cluster, some of the datanodes are also HBase regionservers, but others aren't. We would like to run the HBase scans only on the regionservers (to take advantage of data locality), and run the other type of jobs on all the datanodes. Is this possible at all? Can we specify which tasktrackers to use in the MapReduce job configuration?
Any help is appreciated.
