In Hadoop, will files be copied to master nodes or slave nodes? - hadoop

Should we copyFromLocal/put files to HDFS before running a MapReduce job? When I ran a MapReduce example, I was taught to format HDFS on the master node and copyFromLocal the files to that HDFS space on the master.
Then why do some tutorials say the master node only returns metadata to the client, and the laptop (client) copies the file blocks to the data nodes, not to the master? e.g. http://www.youtube.com/watch?v=ziqx2hJY8Hg at 25:50. My understanding based on this tutorial is that the file (split into blocks) is copied to the slave nodes, so we do not need to copyFromLocal/put files to the master node. I am confused. Can anybody explain where files are copied/replicated to?

Blocks are not copied to the master node. The master (NameNode) sends metadata to the client containing the DataNode locations where the client should place each block. No actual block data is transferred to the NameNode. I found this comic to be a good HDFS explanation.

The master node (NameNode) in Hadoop deals only with the metadata (the DataNode <-> block mapping). It does not deal with the actual files; those are stored only on the DataNodes.
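To make the write path concrete, here is a minimal Java sketch of an upload through the HDFS API (the fs.defaultFS address and both paths are hypothetical). The NameNode only records metadata and tells the client which DataNodes should receive each block; the block data is streamed from the client to those DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml
        conf.set("fs.defaultFS", "hdfs://master:8020");

        FileSystem fs = FileSystem.get(conf);
        // Equivalent to: hadoop fs -copyFromLocal /tmp/input.txt /user/hadoop/input.txt
        // The NameNode returns target DataNodes for each block; the data itself
        // is streamed from this client to the DataNodes, never to the NameNode.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/hadoop/input.txt"));
        fs.close();
    }
}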

Related

How to take a backup of a datanode in a Hadoop cluster

I find many solutions for backing up the metadata on the NameNode, but I would like to know how to back up the DataNodes. Leaving the replication factor aside, what is the detailed process for backing up DataNodes at the production level for a 20-node cluster?
The distcp command in Hadoop can copy data from a source cluster to a target cluster.
For example:
hadoop distcp hftp://cdh57-namenode:50070/hbase hdfs://CDH59-nameservice/hbase
This command copies the hbase folder from cdh57-namenode to CDH59-nameservice.
More information can be obtained from this link:
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_admin_distcp_data_cluster_migrate.html
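If you need to drive such a copy from Java rather than from the shell, a minimal single-process sketch uses FileUtil.copy between the two FileSystems. This is much slower than distcp, which runs as a distributed job; the cluster URIs and paths below are hypothetical:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CrossClusterCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical source and target clusters
        FileSystem srcFs = FileSystem.get(URI.create("hdfs://cdh57-namenode:8020"), conf);
        FileSystem dstFs = FileSystem.get(URI.create("hdfs://CDH59-nameservice"), conf);

        // Recursively copy /hbase; 'false' means do not delete the source
        FileUtil.copy(srcFs, new Path("/hbase"),
                      dstFs, new Path("/hbase"),
                      false, conf);
    }
}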

Copy files/chunks from HDFS to local file system of slave nodes

In Hadoop, I understand that the master node (NameNode) is responsible for placing the blocks of data on the slave machines (DataNodes).
When we use -copyToLocal or -get from the master, files can be copied from HDFS to the local storage of the master node. Is there any way for the slaves to copy the blocks (data) that are stored on them to their own local file systems?
For example, a 128 MB file could be split between 2 slave nodes, each storing 64 MB. Is there any way for a slave to identify and load its chunk of data into its local file system? If so, how can this be done programmatically? Can the commands -copyToLocal or -get be used in this case as well? Please help.
Short Answer: No
The data/files cannot be copied directly from the DataNodes. The reason is that DataNodes store the data but do not have any metadata about the stored files. To them, blocks are just bags of bits and bytes. The metadata of the files is stored in the NameNode. This metadata contains all the information about the files (name, size, etc.). Along with this, the NameNode keeps track of which blocks of a file are stored on which DataNodes. The DataNodes are also not aware of the ordering of the blocks when a file is split into multiple blocks.
Can the commands -copyToLocal or -get be used in this case also?
Yes, you can simply run these from the slave. The slave will then contact the namenode (if you've configured it properly) and download the data to your local filesystem.
What it doesn't do is a "short-circuit" copy, in which it would just copy the raw blocks between directories. There is also no guarantee it will read the blocks from the local machine at all, as your commandline client doesn't know its location.
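For the "programmatically" part, a minimal sketch of the same fetch through the HDFS FileSystem API (the HDFS and local paths are hypothetical). Even when run on a slave, it goes through HDFS rather than reading the local block files directly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchToLocal {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath on this node
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent to: hadoop fs -copyToLocal /user/hadoop/big.file /data/local/big.file
        // The client asks the NameNode for block locations and then reads each
        // block from whichever DataNode holds it (possibly this very machine).
        fs.copyToLocalFile(new Path("/user/hadoop/big.file"),
                           new Path("/data/local/big.file"));
        fs.close();
    }
}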
HDFS blocks are stored only on the slaves' local file systems. You can dig down into the directory defined by the property dfs.datanode.data.dir (dfs.data.dir in older releases).
But you won't get any benefit from reading the block files directly (without the HDFS API). Also, reading and editing the block files directly can corrupt the file on HDFS.
If you want to store data on a different slave's local disk, you will have to implement your own logic for maintaining block metadata (which the NameNode already does for you).
Can you elaborate on why you want to distribute blocks yourself when Hadoop already takes care of all the challenges of distributed data?
You can copy a particular file or directory from one cluster (or path) to another by using distcp.
Usage: hadoop distcp <source URI> <target URI>

Where is the data stored in HDFS? Is there a way to change where it's stored?

I'm a novice. I have a 3-node cluster. The NameNode, JobTracker and Secondary NameNode run on one node, and the two DataNodes (HData1, HData2) run on the other two nodes. If I store data from my local system into HDFS, how do I find out on which node it resides? Is there a way I can explicitly specify on which DataNode it has to be stored?
Thanks in advance!
Yes, you can find it using hadoop fsck <path> (add -files -blocks -locations to see which DataNodes hold each block).
You can refer to the links below; a programmatic sketch follows them.
how does hdfs choose a datanode to store
How to explicitly define datanodes to store a particular given file in HDFS?
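The same placement information that fsck prints is also available programmatically through FileSystem.getFileBlockLocations. A minimal sketch (the file path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/input.txt");  // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports its byte range and the DataNodes holding its replicas
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}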

HDFS' Location Awareness

Introduction
According to several documentation sources (1, 2, 3), HDFS' location awareness is about knowing the physical location of nodes and replicating data on different racks to reduce the impact of rack-level issues caused by, e.g., power supply and/or switch failures.
Question
How does HDFS know the physical location of nodes and racks and subsequently decide to replicate data to nodes located on other racks?
Rack-awareness is configured when the cluster is set up. This can be done either manually for each node or through a script.
Each DataNode is given a network location, which is simply a string, much like a file system path.
Example:
datacenter-1/rack-1/node1
datacenter-1/rack-1/node2
datacenter-1/rack-2/node3
The NameNode then builds a network topology (basically a tree structure) using the network locations of each DataNode. This topology is then used to determine block replica placement.
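Besides the usual topology script (net.topology.script.file.name), the mapping can also be supplied in Java by implementing Hadoop's DNSToSwitchMapping interface and pointing net.topology.node.switch.mapping.impl at the class. A minimal sketch against Hadoop 2.x (the static host-to-rack table is purely illustrative, and the interface may differ slightly across versions):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Illustrative host-to-rack mapping; real deployments usually read this
// from a file or rely on the script-based mapping instead.
public class StaticRackMapping implements DNSToSwitchMapping {
    private static final Map<String, String> RACKS = new HashMap<>();
    static {
        RACKS.put("node1", "/datacenter-1/rack-1");
        RACKS.put("node2", "/datacenter-1/rack-1");
        RACKS.put("node3", "/datacenter-1/rack-2");
    }

    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<>();
        for (String name : names) {
            // Unknown hosts fall back to a default rack
            racks.add(RACKS.getOrDefault(name, "/default-rack"));
        }
        return racks;
    }

    @Override
    public void reloadCachedMappings() { /* nothing cached */ }

    @Override
    public void reloadCachedMappings(List<String> names) { /* nothing cached */ }
}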
Somebody needs to know where DataNodes are located in the network topology and use that information to make an intelligent decision about where data replicas should exist in the cluster. That "somebody" is the NameNode.
The NameNode stores this information and also maintains the file system namespace.
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

Does decommissioning a node remove data from that node?

In Hadoop, if I decommission a node, Hadoop will redistribute the files across the cluster so they are properly replicated. Will the data be deleted from the decommissioned node?
I am trying to balance the data across the disks on a particular node. I plan to do this by decommissioning the node and then recommissioning it. Do I need to delete the data from that node after decommissioning is complete, or will it be enough to simply recommission it (remove it from the excludes file and run hadoop dfsadmin -refreshNodes)?
UPDATE: It worked for me to decommission a node, delete all the data on that node, and then recommission it.
AFAIK, data is not removed from a DataNode when you decommission it. Further writes on that DataNode will not be possible though. When you decommission a DataNode, the replicas held by that DataNode are marked as "decommissioned" replicas, which are still eligible for read access.
But why do you want to perform this decommissioning/recommissioning cycle? Why don't you just specify all the disks as a comma-separated value for the dfs.data.dir property in your hdfs-site.xml and restart the DataNode daemon? Run the balancer after the restart.
Hadoop currently doesn't support doing this automatically, but there might be hacks around to do it.
Decommissioning and then re-replicating will, in my opinion, be slower than manually moving blocks across the different disks.
You can do the balancing manually across the disks, something like this:
1. Take down HDFS, or only the DataNode you are targeting.
2. Use the UNIX mv command to move the individual block and meta pairs from one directory to another on the host machine, e.g. move pairs of blk data files and blk.meta files across the disks on the same host.
3. Restart HDFS or the DataNode.
Reference link for the procedure
Addendum:
You probably need to move pairs of blk_* and blk_*.meta files to and from the dfs/current directory of each data disk, e.g. the pair blk_3340211089776584759 and blk_3340211089776584759_1158.meta.
If you don't want to do this manually, you can probably write a custom script to detect how much space is occupied in the dfs/current directory of each of your data disks and rebalance them accordingly, i.e. move pairs of blk_* and blk_*.meta from one to another.
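A hedged sketch of such a script in Java (the two disk paths are hypothetical, the "move one pair at a time" policy is an assumption, and the DataNode must be stopped first). It preserves each block's subdirectory path relative to the storage directory, which newer Hadoop block layouts require:

import java.io.IOException;
import java.nio.file.*;
import java.util.Optional;
import java.util.stream.Stream;

public class MoveOneBlockPair {
    public static void main(String[] args) throws IOException {
        // Hypothetical dfs.data.dir entries on the same DataNode host
        Path fullDisk  = Paths.get("/disk1/dfs/current");
        Path emptyDisk = Paths.get("/disk2/dfs/current");

        // Find one block file (blk_<id> without the .meta suffix) on the full disk
        Optional<Path> block;
        try (Stream<Path> files = Files.walk(fullDisk)) {
            block = files.filter(p -> p.getFileName().toString().startsWith("blk_"))
                         .filter(p -> !p.toString().endsWith(".meta"))
                         .findFirst();
        }
        if (!block.isPresent()) {
            System.out.println("No block files found under " + fullDisk);
            return;
        }

        Path blk = block.get();
        // The matching metadata file is named blk_<id>_<genstamp>.meta
        Path meta;
        try (Stream<Path> siblings = Files.list(blk.getParent())) {
            meta = siblings.filter(p -> p.getFileName().toString()
                                 .startsWith(blk.getFileName().toString() + "_")
                                 && p.toString().endsWith(".meta"))
                           .findFirst()
                           .orElseThrow(() -> new IOException("No .meta for " + blk));
        }

        // Move the pair together, keeping the path relative to the storage directory
        Path relDir = fullDisk.relativize(blk.getParent());
        Path targetDir = emptyDisk.resolve(relDir);
        Files.createDirectories(targetDir);
        Files.move(blk, targetDir.resolve(blk.getFileName()));
        Files.move(meta, targetDir.resolve(meta.getFileName()));
        System.out.println("Moved " + blk.getFileName() + " and its .meta to " + targetDir);
    }
}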
