Queries regarding MapReduce execution in Hadoop

Assume the data is not present on the TaskTracker's node but on some other machine.
How will the TaskTracker know which node contains the data?
Does it talk to that DataNode directly? Or will it contact its own local DataNode, which then takes responsibility for copying that data?

How will the TaskTracker know which node contains the data?
The TaskTracker does not know it. The JobTracker contacts the NameNode, gets the block locations of the input data, and tries its best to assign each task to a TaskTracker on the same node as the data (or as close as possible).
Does it talk to that DataNode directly? Or will it contact its own local DataNode, which then takes responsibility for copying that data?
It talks to the DataNode directly.
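For illustration, the same block-location information the JobTracker gets from the NameNode is exposed through the public FileSystem API. A minimal sketch, assuming a reachable HDFS; the path /user/example/input.txt is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml/hdfs-site.xml from the classpath; adjust as needed.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input file; replace with a real HDFS path.
        Path input = new Path("/user/example/input.txt");
        FileStatus status = fs.getFileStatus(input);

        // The NameNode returns, for each block, the DataNodes holding a replica.
        // The scheduler uses exactly this kind of information to place tasks
        // on (or near) the nodes that already hold the data.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s racks=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()),
                    String.join(",", block.getTopologyPaths()));
        }
        fs.close();
    }
}
```

Each block lists the DataNodes that hold a replica; the scheduler tries to run the corresponding task on one of those hosts, then on the same rack, and only then anywhere else.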

Related

What is the connection/relationship between DataNodes in HDFS and the NodeManager in YARN?

I am reading the basics about YARN and the Hadoop file system. Some blogs online told me that YARN is just a resource-management system and HDFS is about storage. But I encountered the following lines in the book Hadoop: The Definitive Guide:
From that passage, I infer that there should be some connection between the location of DataNodes and NodeManager nodes; maybe they can be on the same machine. That contradicts what I learned from the blogs.
Can anyone help explain this?
I googled a lot for "connection between Datanode and Node Manager" and cannot find a direct answer.
YARN is the OS, the compute power.
HDFS is the disk.
It is beneficial to move the compute to a node where the data is located. A node will often run both a NodeManager, which manages the compute (YARN), and a DataNode (HDFS). So a container and the files for a YARN/Hadoop job can be colocated on one node/server. You can also have a NodeManager on a node that isn't a DataNode, and you could have a DataNode that isn't a NodeManager. The two are independent, but it frequently makes sense to colocate them to take advantage of data locality. After all, who wants an OS without a disk? (There is actually a use case for this, but let's not get into "compute nodes".)
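If you want to see the overlap on a running cluster, here is a rough sketch using the YARN and HDFS client APIs. It assumes fs.defaultFS points at HDFS (so the cast to DistributedFileSystem succeeds); class and host names are placeholders:

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ColocationCheck {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // Hosts that run a YARN NodeManager (compute).
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();
        Set<String> nodeManagerHosts = new HashSet<>();
        for (NodeReport report : yarn.getNodeReports(NodeState.RUNNING)) {
            nodeManagerHosts.add(report.getNodeId().getHost());
        }
        yarn.stop();

        // Hosts that run an HDFS DataNode (storage).
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
        Set<String> dataNodeHosts = new HashSet<>();
        for (DatanodeInfo dn : dfs.getDataNodeStats()) {
            dataNodeHosts.add(dn.getHostName());
        }
        dfs.close();

        // Nodes that run both: these are where data-local containers can run.
        Set<String> colocated = new HashSet<>(nodeManagerHosts);
        colocated.retainAll(dataNodeHosts);
        System.out.println("NodeManager-only hosts: " + difference(nodeManagerHosts, dataNodeHosts));
        System.out.println("DataNode-only hosts:    " + difference(dataNodeHosts, nodeManagerHosts));
        System.out.println("Colocated hosts:        " + colocated);
    }

    private static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```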

Amazon EMR - What is the need for task nodes when we have core nodes?

I have been learning about Amazon EMR lately, and as far as I know an EMR cluster lets us choose three node types:
Master, which runs the primary Hadoop daemons like the NameNode, JobTracker, and ResourceManager.
Core, which runs the DataNode and TaskTracker daemons.
Task, which runs only the TaskTracker.
My question is: why does EMR provide task nodes, when Hadoop suggests that the DataNode and TaskTracker daemons should run on the same node? What is Amazon's logic behind this? You can keep data in S3, stream it to HDFS on the core nodes, and do the processing there, rather than shipping data from HDFS to task nodes, which would increase I/O overhead. As far as I know, in Hadoop TaskTrackers run on the DataNodes that hold the data blocks for a particular task, so why have TaskTrackers on different nodes?
According to AWS documentation [1]
The node types in Amazon EMR are as follows:
Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.
Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
According to AWS documentation [2]
Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors.
Task nodes don't run the Data Node daemon, nor do they store data in HDFS.
Some use cases are:
You can use task nodes for processing streams from S3. In this case, network I/O won't increase, because the data being used isn't on HDFS.
Task nodes can be added or removed freely, because no HDFS daemons run on them and hence no data is stored on them. Core nodes have HDFS daemons running, and repeatedly adding and removing such nodes isn't good practice.
Resources:
[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters
[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html#emr-plan-task
One use case is using spot instances as task nodes. If they are cheap enough, it may be worthwhile to add some extra compute power to your EMR cluster. This is mostly for non-critical tasks, since spot instances can be reclaimed at any time.
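For concreteness, this is roughly how the master/core/task split (with task nodes on the spot market) looks when creating a cluster through the AWS SDK for Java v1. The instance types, counts, release label, and role names are illustrative placeholders, not recommendations:

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;
import com.amazonaws.services.elasticmapreduce.model.InstanceRoleType;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.MarketType;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class CreateEmrCluster {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // One master, two core nodes (HDFS + tasks), and a group of task nodes
        // (tasks only, no HDFS) bought on the spot market.
        InstanceGroupConfig master = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.MASTER)
                .withInstanceType("m5.xlarge")
                .withInstanceCount(1);
        InstanceGroupConfig core = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.CORE)
                .withInstanceType("m5.xlarge")
                .withInstanceCount(2);
        InstanceGroupConfig task = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.TASK)
                .withInstanceType("m5.xlarge")
                .withInstanceCount(4)
                .withMarket(MarketType.SPOT);   // cheap, interruptible compute

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("example-cluster")
                .withReleaseLabel("emr-6.10.0")
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceGroups(master, core, task)
                        .withKeepJobFlowAliveWhenNoSteps(true));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started cluster: " + result.getJobFlowId());
    }
}
```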
Traditional Hadoop assumes all of your workload requires high I/O; with EMR you can choose the instance type based on your workload. For high I/O needs (for example, up to 100 Gbps), go with C-type or R-type instances, and you can use placement groups. Keep your core-to-task node ratio at 1:5 or lower; this keeps I/O optimal, and if you want higher throughput, select C or R instances for both core and task nodes.
The advantage of task nodes is that they can scale up/down faster and can minimize compute cost. A traditional Hadoop cluster is hard to scale either way, since the worker nodes are also part of HDFS.
Task nodes are optional, since core nodes can also run map and reduce tasks.
Core nodes take longer to scale up/down depending on the tasks, hence the option of task nodes for quicker auto scaling.
Reference: https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/
The reason Hadoop suggests running the DataNode and TaskTracker daemons on the same nodes is to keep the processing power as close to the data as possible.
But rack-level optimization also comes into play when you deal with a multi-node cluster. In my view, AWS reduces the I/O overhead by placing task nodes in the same rack as the DataNodes.
And the reason to provide task nodes is that we often need more processing power over our data than storage for it on HDFS; we would always want more TaskTrackers than DataNode daemons. So AWS gives you the opportunity to add them as whole nodes while still benefiting from rack-level optimization.
And the way you want to get data into your cluster (using S3 and only core nodes) is a good option if you want good performance from a transient cluster.

How to shard on user ID with HDFS?

I'd like to use a Hadoop/HDFS-based system, but I'm a bit concerned because I think I will want all data for one user on the same physical machine. Is there a way of accomplishing this in the Hadoop universe?
During the HDFS write process, the first replica of a data block is written to the node from which the client is accessing the cluster, provided that node is a DataNode.
To solve your problem, the edge nodes would also have to be DataNodes. Edge nodes are the nodes from which users interact with the cluster.
But using DataNodes as edge nodes has some disadvantages. One of them is data distribution: the data distribution will not be even, and if the node fails, cluster rebalancing will be very costly.
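A minimal sketch of the "edge node is also a DataNode" idea, assuming the client runs on a machine that is itself a DataNode; the path is a placeholder. It writes a small file and then checks whether a replica of each block landed on the local host. Note this only pins the first replica; HDFS still places the remaining replicas elsewhere:

```java
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteLocalityCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical per-user path: each user's data is written from a
        // client running on a DataNode, so the first replica of every block
        // lands on that same machine.
        Path userFile = new Path("/data/users/alice/events.log");
        try (FSDataOutputStream out = fs.create(userFile, true)) {
            out.write("example record\n".getBytes(StandardCharsets.UTF_8));
        }

        String localHost = InetAddress.getLocalHost().getHostName();
        FileStatus status = fs.getFileStatus(userFile);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            for (String host : block.getHosts()) {
                if (host.equalsIgnoreCase(localHost)) {
                    System.out.println("Block at offset " + block.getOffset()
                            + " has a replica on this machine (" + localHost + ")");
                }
            }
        }
        fs.close();
    }
}
```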

Does data remain in HDFS when the Hadoop cluster is down?

I am new to Qubole and wanted to know whether data remains in HDFS after the Hadoop cluster goes down.
Any help is appreciated.
Thank you.
No, data in HDFS is gone once the cluster goes down. We don't back up or restore HDFS. The model of computation on EC2/S3 is that long-lived data always lives in S3, and HDFS is used only for intermediate and control data. We also sometimes use HDFS (and local disk) as a cache.
That depends on what is down in the cluster. There are several daemons in Hadoop: the NameNode, DataNode, ResourceManager, ApplicationMaster, etc.
So if the NameNode (master node) is down, the data remains as is in the cluster, BUT you will not be able to access it at all, because the NameNode holds the metadata for the data stored on the DataNodes.
If a DataNode (worker node) is down, you will not be able to access the data from that node, but by default each block is stored in three locations in the cluster for fault tolerance, so you can still access the data from the other two nodes.
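As a small illustration of the "three locations" point, the per-file replication factor can be inspected (and changed) through the FileSystem API; the path below is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; replace with a real file in your cluster.
        Path file = new Path("/user/example/data.csv");

        // Target replication factor for this file (3 by default, so the
        // loss of a single DataNode does not make the data unreachable).
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor of " + file + ": " + replication);

        // The factor can also be changed per file, e.g. raised for hot data.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```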

Decommission/Recommission Hadoop Node

I can find information about the process of decommissioning a node in a Hadoop cluster, but not about what happens to the data on the decommissioned node.
The only info I found is the answer given by Tariq at "Does decommissioning a node remove data from that node?"
My question is about recommissioning an old node.
If the data is lost, it is the same as commissioning a new node. Problem solved.
But if the data is not lost, then when we recommission the node, the blocks present on it become redundant, since those blocks were already re-replicated during the decommission process. This leads to an inconsistency in the replica counts of those blocks.
How does the Hadoop framework take care of this situation?
