DataNode and TaskTracker on separate machines? - hadoop

I am fairly new to Hadoop and I have the following questions about the Hadoop framework. Can anybody please guide me on these?
Are the DataNode and TaskTracker located on physically separate machines in a production environment?
When does Hadoop split a file into blocks? Does this happen when you copy a file from the local filesystem into HDFS?

Short answer
1) Most of the time, but not necessarily.
2) Yes.
Long Answer
1)
An installation of Hadoop on a cluster will have 2 main types of nodes:
Master Nodes
Data Nodes
Master Nodes typically run at least:
CLDB
ZooKeeper
JobTracker
Data Nodes typically run at least:
TaskTracker
The DataNode service can run on a different node than the TaskTracker service. However, the Hadoop docs for the DataNode service recommend running DataNode and TaskTracker on the same nodes so that MapReduce operations are performed close to the data.
For the MapR distribution of Hadoop, the two server roles typically run:
MapR Control Node
ZooKeeper *
CLDB *
JobTracker *
HBaseMaster
NFS Gateway
Webserver
MapR Data Node
TaskTracker *
RegionServer (sometimes)
ZooKeeper (sometimes)
2)
While most filesystems store data in blocks, HDFS also distributes and replicates those blocks across DataNodes. When you first store data in HDFS, it breaks the data into blocks and stores them on different nodes according to the specified replication factor. However, if you add new DataNodes to the cluster, HDFS will not automatically rebalance old blocks across them unless the replication factor is not met.
(Thanks to #javadba for clarifying this!)
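To make the block-splitting step concrete, here is a minimal Python sketch. It is not Hadoop code: the function names split_into_blocks and place_replicas are made up for illustration, and the round-robin replica placement is an assumption chosen for simplicity (real HDFS uses its own rack-aware placement policy).

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the blocks a file would be cut into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes.

    ASSUMPTION: plain round-robin placement, used only to illustrate the idea.
    This is NOT the placement policy HDFS actually applies.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    placement = []
    for block in range(num_blocks):
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        placement.append(replicas)
    return placement
```

For example, a 200 MB file with a 64 MB block size gives three full blocks plus one 8 MB block; only the last block is smaller than the block size.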

TrinitronX has already answered #1, though the short answer should be NO: the DataNode and TaskTracker MAY be on different physical machines, but that is uncommon. You are best off starting with "slave" machines that each run a DataNode plus a TaskTracker.
So this is an answer to the second part of the question.
2) When does Hadoop split a file into blocks? Does this happen when you copy a file from local filesystem into HDFS?
Yes. The file is broken into blocks upon loading into HDFS.

The DataNode and JobTracker can be run on different machines.
Hadoop always stores files as blocks, during all operations on Hadoop.
Refer to:
1. Hadoop Job tracker and task tracker
2. Hadoop block size and Replication

Related

How is a huge amount of data input into Hadoop?

I am new to big data and Hadoop. I would like to know: are the name node, data node, secondary name node, job tracker, and task tracker different systems? If I want to process 1000 PB of data, how is the data divided, who does that task, and where should I input the 1000 PB of data?
Yes, the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker are all separate daemons, each running in its own Java virtual machine (JVM). You can start them all on one physical machine (pseudo-distributed mode) or on different physical machines (fully distributed mode). These are all Hadoop 1 components.
Hadoop 2 introduced YARN and containers, replacing the JobTracker and TaskTracker with the more efficient ResourceManager, ApplicationMaster, NodeManager, etc. You can find more info at hadoop-yarn-site.
Data is stored in HDFS (Hadoop Distributed File System) in blocks, 64 MB by default in Hadoop 1 (128 MB in Hadoop 2). When data is loaded into HDFS, Hadoop splits it into blocks of the configured size and spreads them across the cluster. When a job runs, the code is shipped to the nodes in the cluster so that processing happens where the data resides, except during the shuffle and sort phases, which move data between nodes.
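The "processing happens where the data resides" idea can be pictured with a toy scheduler. This is only a Python illustration, not how the JobTracker or YARN actually schedules work: assign_tasks, block_locations, and slots are hypothetical names, and the greedy first-fit policy is an assumption.

```python
def assign_tasks(block_locations, slots):
    """Assign one task per block, preferring a node that stores the block.

    block_locations: dict mapping block id -> list of nodes holding a replica
    slots:           dict mapping node -> number of free task slots
    Returns a dict mapping block id -> (node, is_local).
    """
    free = dict(slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, replicas in block_locations.items():
        # First choice: a node that already holds a replica and has a free slot.
        node = next((n for n in replicas if free.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            # Fallback: any node with a free slot; the block must then be
            # read over the network.
            node = next((n for n, s in free.items() if s > 0), None)
        if node is None:
            raise RuntimeError("no free task slots left in the cluster")
        free[node] -= 1
        assignment[block] = (node, is_local)
    return assignment
```

In this sketch a task only runs on a node without a replica when every node that holds the block is already busy.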
I hope this gives you a general idea of how Hadoop and HDFS work. Following are some links to start with:
Map Reduce programming
cluster setup
hadoop commands

Does HBase really scale linearly?

I started to learn HBase and I don't understand how it scales linearly.
The problem is that before you install HBase you have to have an HDFS cluster. The HDFS cluster has a master node, and there can be only one active master in the whole cluster, so it is a bottleneck. Of course we can run one more master node (it is possible to run only one more), but it will be in the standby state.
As I understand it, HBase uses the HDFS cluster to store data. So to me it makes no sense to run more than one HMaster, because all requests will go to the active HDFS master, whose performance can suffer if there are too many requests.
Also, I don't properly understand whether we need to install HBase on the same nodes as HDFS or separately. What are the benefits of running HBase separately from HDFS?
To me it seems logical to install the HBase cluster on the same nodes as HDFS, as in the following example:
HDFS active master - HMaster
HDFS standby master - HMaster backup
HDFS Data node - HRegion server
To me this is the most logical structure, because if we separate the HDFS master from the HMaster, the probability of losing the HBase cluster will be two times bigger.
I would be very happy if someone could share information about all of this, because I really don't understand how HBase can scale linearly and how it works with HDFS.
First, if you want, you can install HBase over any supported file system. It is not mandatory to use it over HDFS, but using it with HDFS gives it advantages such as
fault tolerance, data replication, checksums, etc.
That's why it is recommended to use HBase over HDFS.
Moreover, although the NameNode is a bottleneck in HDFS, it does not affect HBase efficiency, because not every operation depends on the HDFS NameNode. For instance, RegionServers serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly, while region assignment and DDL operations (creating and deleting tables) are handled by the HBase Master process. This means that reading and writing data is independent of creating and deleting tables.
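One way to picture why reads and writes bypass the master: each region covers a contiguous range of row keys, so a client only needs to find the RegionServer that owns a given key. Below is a minimal Python sketch of that lookup. It is not the HBase client API; RegionLocator, the region boundaries, and the server names are invented for illustration, and it assumes the client already knows the list of regions.

```python
import bisect


class RegionLocator:
    """Toy lookup from row key to the server owning that key range."""

    def __init__(self, regions):
        # regions: list of (start_key, server) pairs. Each region covers the
        # keys from its start key up to (not including) the next start key.
        self._regions = sorted(regions)
        self._starts = [start for start, _ in self._regions]
        if not self._starts or self._starts[0] != "":
            raise ValueError("the first region must start at the empty key")

    def locate(self, row_key):
        # Find the last region whose start key is <= row_key.
        idx = bisect.bisect_right(self._starts, row_key) - 1
        return self._regions[idx][1]


# Hypothetical three-region table spread over three servers.
locator = RegionLocator([("", "rs1"), ("g", "rs2"), ("p", "rs3")])
server_for_apple = locator.locate("apple")  # falls in the first region
```

In this picture, splitting regions and adding RegionServers spreads the key ranges over more machines, while the per-request lookup stays the same.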
You can refer to https://www.mapr.com/blog/in-depth-look-hbase-architecture for more details about the HBase architecture.
Also see this webinar on HBase by Lars George: https://m.youtube.com/watch?v=_HLoH_PgrLk
Hope this clears up your doubts.

Setting up hadoop cluster

Do the worker nodes in a Hadoop cluster need Hadoop installed on each one?
What if I only need the computing power of some PCs? Can I use only MapReduce without installing HDFS on each node?
When you say worker nodes, that includes both DataNodes and TaskTrackers. So in that sense you need them on each machine if you wish to run MR jobs.
But the main point here is what you would do with MR alone. Running MR jobs on data stored in the local FS is not going to be of much use, as you can't harness the distributed data storage and parallelism provided by Hadoop in that situation.
To use the computing power of a node you need to run a TaskTracker on that node. Hence, Hadoop must be installed.
If you don't need HDFS, you can run only the TaskTracker and not start the DataNode.

Hadoop doesn't use one node for job

I've got a four node YARN cluster set up and running. I recently had to format the namenode due to a minor problem.
Later I ran Hadoop's PI example to verify every node was still taking part in the calculation, which they all did. However when I start my own job now one of the nodes is not being used at all.
I figured this might be because this node doesn't have any data to work on. So I tried to balance the cluster using the balancer. This doesn't work and the balancer tells me the cluster is balanced.
What am I missing?
While processing, your ApplicationMaster negotiates with the ResourceManager for containers, and the scheduler tries to place those containers on or near the DataNodes that hold the input data. Assuming your replication factor is 3 (the default), HDFS will try to place one whole copy on a single DataNode and distribute the rest across all the DataNodes.
1) Change the replication factor to 1 (since you are only trying to benchmark, reducing replication should not be a big issue).
2) Make sure your client (the machine from which you run your -copyFromLocal command) does not have a DataNode running on it. If it does, HDFS will tend to place most of the data on this node, since writing locally has lower latency.
3) Control the file distribution using dfs.blocksize property.
4) Check the status of your datanodes using hdfs dfsadmin -report.
Make sure your node is joining the ResourceManager. Look into the NodeManager log on the problem node and see if there are errors. Look at the ResourceManager Web UI (:8088 by default) and make sure the node is listed there.
Make sure the node is bringing enough resources to the pool to be able to run a job. Check yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb in yarn-site.xml on the node. The memory should be more than the minimum memory requested by a container (see yarn.scheduler.minimum-allocation-mb).
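The resource check above boils down to simple arithmetic. Here is a rough Python sketch, assuming each container asks for a fixed amount of memory and vcores; containers_that_fit is a made-up helper, not part of YARN, and whether vcores are actually taken into account depends on how the scheduler is configured.

```python
def containers_that_fit(node_memory_mb, node_vcores,
                        container_memory_mb, container_vcores=1):
    """Estimate how many containers of a given size a node could host.

    node_memory_mb / node_vcores correspond to the values a node advertises
    via yarn.nodemanager.resource.memory-mb and
    yarn.nodemanager.resource.cpu-vcores.
    ASSUMPTION: both memory and vcores limit placement, and every container
    has the same size.
    """
    if container_memory_mb <= 0 or container_vcores <= 0:
        raise ValueError("container size must be positive")
    by_memory = node_memory_mb // container_memory_mb
    by_vcores = node_vcores // container_vcores
    return max(0, min(by_memory, by_vcores))
```

If this returns 0 for the problem node (for example, the node advertises 512 MB while containers request 1024 MB), that node cannot take any work in this simplified model, which matches the symptom of a node never being used.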

Hadoop Distributed File System

I have a doubt about the HDFS architecture.
Is there any difference between the Name Node and the Job tracker, and between the Data Node and the Task tracker?
Are they the same, or does each have some specific functionality?
I came to know that the Name Node is considered the master node. It keeps the namespace in RAM, with all the metadata information.
Is there any difference between Name Node and Job tracker?
Two unrelated components. Namenode is part of the HDFS, while Jobtracker is part of mapreduce. Apples and oranges. Ditto for datanode (HDFS) and tasktracker (mapreduce).
Hadoop core consists of two systems: the HDFS filesystem and the mapreduce components. HDFS is a file system; it consists at minimum of one namenode (the central catalog) and several datanodes (the actual storage). Mapreduce consists of the job tracker (central 'brain' of mapreduce) and several tasktrackers (executors).
While they are deployed together and gain synergy from how they interact (data locality for compute), they are distinct. There is no point in asking what is common or different between them.

Resources