I have a doubt about the HDFS architecture.
Is there any difference between the NameNode and the JobTracker? And between the DataNode and the TaskTracker?
Are they the same, or does each have its own specific functionality?
I have come to know that the NameNode is considered the master node. It keeps the namespace in RAM, which holds all the metadata information.
They are two unrelated components. The NameNode is part of HDFS, while the JobTracker is part of MapReduce. Apples and oranges. The same goes for the DataNode (HDFS) and the TaskTracker (MapReduce).
Hadoop core consists of two systems: the HDFS filesystem and the MapReduce framework. HDFS is a file system; it consists of at least one NameNode (the central catalog) and several DataNodes (the actual storage). MapReduce consists of the JobTracker (the central 'brain' of MapReduce) and several TaskTrackers (the executors).
While they are deployed together and benefit from how they interact (data locality for compute), they are distinct systems, so there is little point in asking what is common or different between them.
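To make the separation concrete, here is a minimal Java sketch (the paths, class name, and job name are made up for illustration) that touches both subsystems through their separate client APIs: FileSystem for HDFS and Job for MapReduce, using the Hadoop 2 style mapreduce API.

    // Minimal sketch: HDFS and MapReduce are driven through separate client APIs.
    // All paths and names below are illustrative only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoSubsystems {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // HDFS side: the client asks the NameNode for metadata, then streams
            // the file's blocks directly to DataNodes.
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path("/tmp/input.txt"), new Path("/data/input.txt"));

            // MapReduce side: the client submits a job to the JobTracker
            // (or the ResourceManager under YARN), which schedules tasks on
            // TaskTrackers/NodeManagers.
            Job job = Job.getInstance(conf, "example-job");
            job.setJarByClass(TwoSubsystems.class);
            FileInputFormat.addInputPath(job, new Path("/data/input.txt"));
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));
            job.waitForCompletion(true);
        }
    }

Note that the two halves share nothing but the Configuration object: you can use the HDFS half without ever submitting a job, and vice versa.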
Related
I am new to big data and Hadoop. I would like to know whether the NameNode, DataNode, secondary NameNode, JobTracker, and TaskTracker are different systems. If I want to process 1000 PB of data, how is the data divided, which component does that, and where should I put the 1000 PB of data?
Yes, the NameNode, DataNode, secondary NameNode, JobTracker, and TaskTracker are all separate daemons (each runs in its own JVM). You can start them all on one physical machine (pseudo-distributed/local mode) or on different physical machines (distributed mode). These are all part of Hadoop 1.
Hadoop 2 introduced containers with YARN, in which the JobTracker and TaskTracker are replaced by the more efficient ResourceManager, ApplicationMaster, NodeManager, etc. You can find more info on hadoop-yarn-site.
Data is stored in HDFS (Hadoop Distributed File System) in blocks, 64 MB by default. When data is loaded into HDFS, Hadoop distributes the blocks roughly evenly across the cluster using the defined block size. When a job runs, the code is distributed to the nodes in the cluster so that processing happens where the data resides, except during the shuffle and sort phases.
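As a rough illustration of the block mechanics (the path, record contents, and sizes below are just placeholders), the HDFS Java API lets you write a file with an explicit block size and replication factor; the client chops the stream into blocks as it writes, and the NameNode chooses the DataNodes that receive each block replica.

    // Sketch: writing to HDFS with an explicit block size and replication factor.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            long blockSize = 64L * 1024 * 1024; // 64 MB, the Hadoop 1 default
            short replication = 3;              // default replication factor
            int bufferSize = 4096;

            Path out = new Path("/data/example/part-0000"); // illustrative path
            try (FSDataOutputStream stream =
                     fs.create(out, true, bufferSize, replication, blockSize)) {
                for (int i = 0; i < 1000000; i++) {
                    stream.writeBytes("record-" + i + "\n");
                }
            }
        }
    }

If you omit the explicit arguments, fs.create(out) falls back to the configured defaults (dfs.block.size and dfs.replication in Hadoop 1).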
I hope this gives you a general idea of how Hadoop and HDFS work. The following are some links to start with:
MapReduce programming
Cluster setup
Hadoop commands
If I copy a set of files to HDFS in a 7-node Hadoop cluster, will HDFS take care of automatically balancing the data across the 7 nodes? Is there any way I can tell HDFS to constrain/force data to a particular node in the cluster?
The NameNode is 'the' master that decides where to place data blocks on the different nodes in the cluster. You should not alter this behavior, as doing so is not recommended. If you copy files to the Hadoop cluster, the NameNode will automatically take care of distributing the blocks almost equally across all the DataNodes.
If you want to force a change in this behaviour (not recommended), these posts could be useful:
How to put files to specific node?
How to explicitly define datanodes to store a particular given file in HDFS?
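If you just want to verify the placement the NameNode chose, rather than force it, something like the following sketch (the paths are made up) copies a file in and then prints which DataNodes ended up with each block.

    // Sketch: copy a file into HDFS and ask the NameNode where the blocks landed.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockPlacement {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Let the NameNode pick the DataNodes; the client just copies the file.
            Path target = new Path("/data/input/file1.txt");
            fs.copyFromLocalFile(new Path("/tmp/file1.txt"), target);

            // Each BlockLocation lists the DataNodes holding a replica of that block,
            // so you can see how the file was spread across the 7 nodes.
            FileStatus status = fs.getFileStatus(target);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }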
I am fairly new to Hadoop and I have the following questions about the Hadoop framework. Can anybody please guide me on this?
Are the DataNode and TaskTracker physically located on separate machines in a production environment?
When does Hadoop split a file into blocks? Does this happen when you copy a file from the local filesystem into HDFS?
Short answer
Most of the time, but not necessarily.
Yes.
Long Answer
1)
An installation of Hadoop on a cluster will have two main types of nodes:
Master Nodes
Data Nodes
Master Nodes typically run at least:
NameNode (CLDB is the equivalent in MapR)
Zookeeper
JobTracker
Data Nodes typically run at least:
DataNode
TaskTracker
The DataNode service can run on a different node than the TaskTracker service. However, the Hadoop docs for the DataNode service recommend running the DataNode and TaskTracker on the same nodes so that MapReduce operations are performed close to the data.
For the MapR distribution of Hadoop, the two server roles typically run:
MapR Control Node
ZooKeeper *
CLDB *
JobTracker *
HBaseMaster
NFS Gateway
Webserver
MapR Data Node
TaskTracker *
RegionServer (sometimes)
Zookeeper (sometimes)
2)
While most filesystems store data in blocks, HDFS distributes and replicates the blocks across DataNodes. When you first store data in HDFS, it breaks the data into blocks and stores them across different nodes according to the specified replication factor. However, if you add new DataNodes to the cluster, it will not automatically rebalance existing blocks onto them unless the replication factor is not being met.
(Thanks to @javadba for clarifying this!)
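On the replication point, here is a small sketch (hypothetical path and factor) of raising the replication of an existing file; the NameNode then schedules the extra replicas on DataNodes of its choosing, which may include newly added nodes.

    // Sketch: change the replication factor of a file that is already in HDFS.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChangeReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/input/file1.txt"); // illustrative path

            // Raise replication from the default (3) to 5; the NameNode will
            // pick DataNodes for the extra replicas in the background.
            boolean accepted = fs.setReplication(file, (short) 5);
            System.out.println("replication change accepted: " + accepted);
        }
    }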
TrinitronX has already answered #1, though the short answer there should arguably be no: the DataNode and TaskTracker MAY be on different physical machines, but it is uncommon. You are best off starting with each "slave" machine running both a DataNode and a TaskTracker.
So this is an answer to the second part of the question:
2) When does Hadoop split a file into blocks? Does this happen when you copy a file from the local filesystem into HDFS?
Yes. The file is broken into blocks upon loading into HDFS.
The DataNode and JobTracker can run on different machines.
Hadoop always stores files as blocks during all operations on Hadoop.
Refer to:
1. Hadoop JobTracker and TaskTracker
2. Hadoop block size and replication
Do the worker nodes in a Hadoop cluster need Hadoop installed on each one?
What if I only need the computing power of some PCs? Can I use just MapReduce without installing HDFS on each node?
When you say worker nodes, that includes both the DataNode and the TaskTracker. So in that sense, yes, you need Hadoop installed on each machine if you wish to run MR jobs.
But the main point here is what you would do with MR alone. Running MR jobs on data stored in the local FS is not going to be of much use, because you cannot harness the distributed data storage and parallelism that Hadoop provides in that situation.
To use the computing power of a node, you need to run a TaskTracker on that node. Hence, Hadoop must be installed.
If you don't need HDFS, you can run only the TaskTracker and not start the DataNode.
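As a rough sketch of such a compute-only setup (Hadoop 1 property names; the JobTracker address and shared-mount paths are hypothetical), the job client can point MapReduce at the JobTracker while using a non-HDFS filesystem, provided every TaskTracker node can see the same paths, for example via a shared NFS mount.

    // Sketch: submitting a job to a cluster that runs only TaskTrackers (no HDFS).
    // Hostname and paths are hypothetical; every node must see the same mount.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ComputeOnlyJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "file:///");                       // no HDFS; local/shared FS
            conf.set("mapred.job.tracker", "jobtracker.example.com:9001"); // JobTracker address

            Job job = new Job(conf, "compute-only-example");
            job.setJarByClass(ComputeOnlyJob.class);
            FileInputFormat.addInputPath(job, new Path("/mnt/shared/input"));
            FileOutputFormat.setOutputPath(job, new Path("/mnt/shared/output"));
            job.waitForCompletion(true);
        }
    }

You lose data locality this way, which is exactly the trade-off described above.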
We have two types of jobs in our Hadoop cluster. One uses MapReduce to scan HBase; the other is pure manipulation of raw files in HDFS. Within our HDFS cluster, some of the DataNodes are also HBase RegionServers, but others aren't. We would like to run the HBase scans only on the RegionServers (to take advantage of data locality) and run the other type of job on all the DataNodes. Is this possible at all? Can we specify which TaskTrackers to use in the MapReduce job configuration?
Any help is appreciated.