hdfs filesystem difference between namenode and datanode - hadoop

Please help me with one small query of mine.
When we issue an hdfs dfs command, does it show the filesystem of the namenode or of the datanode?
How can we see the filesystem of the namenode and the datanode separately?
In my project, when I issue hdfs dfs -ls it shows me files and directories. If I create a file, is it created by default on a datanode of HDFS's choosing, or somewhere else?
TIA

dfs commands communicate with both the namenode and the datanodes.
The namenode has no "filesystem" to list content from - it's just metadata in memory. There is, of course, a local disk directory on the namenode, used for checkpointing and backups, but the primary operations run against the in-memory store for quick lookup.
A single datanode holds pieces of files called blocks. The namenode maintains the block locations and how files are collectively assembled from those blocks throughout the cluster.
See Persistence of FileSystem Metadata in the HDFS design documentation.
Creating HDFS files cannot target specific datanodes, and again, each file is split into blocks spread across many datanodes, whose locations are recorded in the namenode's memory.
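To see the two views from the client side, a minimal sketch (the paths are hypothetical; on older installs the equivalent commands are hadoop fs -ls and hadoop fsck):
hdfs dfs -ls /user/hduser
hdfs fsck /user/hduser/somefile.txt -files -blocks -locations
hdfs getconf -confKey dfs.datanode.data.dir
The -ls listing is answered entirely from the namenode's metadata, the fsck report shows which datanodes actually hold each block of the file, and the getconf line prints the local directories datanodes use to store those blocks (assuming the client has the cluster configuration).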

Related

Hadoop - data balanced automatically on copying to HDFS?

If I copy a set of files to HDFS in a 7-node Hadoop cluster, will HDFS automatically balance the data across the 7 nodes? Is there any way I can tell HDFS to constrain/force data to a particular node in the cluster?
The NameNode is 'the' master that decides where to put data blocks on the different nodes in the cluster. In theory you should not alter this behaviour, as it is not recommended. If you copy files to the Hadoop cluster, the NameNode will automatically take care of distributing them almost equally across all the DataNodes.
If you want to force a change to this behaviour (not recommended), these posts could be useful; a balancer example follows the links:
How to put files to specific node?
How to explicitly define datanodes to store a particular given file in HDFS?
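If the data does become unevenly distributed (for example after adding new nodes), the standard tool for evening it out again is the balancer; a minimal sketch:
hdfs balancer -threshold 10
The -threshold value is the allowed deviation, in percent, of each datanode's utilization from the cluster average; on Hadoop 1.x installs the command is hadoop balancer instead.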

When and who exactly creates the input splits for MapReduce in Hadoop?

When I copy a data file to HDFS using the -copyFromLocal command, the data gets copied into HDFS. When I view this file through the web browser, it shows that the replication factor is 3 and the file is at "/user/hduser/inputData/TestData.txt" with a size of 250 MB.
I have 3 CentOS servers as DataNodes, and a CentOS Desktop as both NameNode and client.
When I copy from local to the above-mentioned path, where exactly does it copy to?
Does it copy to the NameNode, or to the DataNodes as blocks of 64 MB?
Or will it not replicate until I run a MapReduce job, with the map phase preparing the splits and replicating the data to the DataNodes?
Please clarify my queries.
1. When I copy from local to the above-mentioned path, where exactly does it copy to?
Ans: The data gets copied to HDFS, the Hadoop Distributed File System, which consists of datanodes and a namenode. The data that you copy resides on the datanodes as blocks (of at most 64 MB each), and the information about which blocks reside on which datanode, along with their replicas, is stored on the namenode.
2. Does it copy to the namenode or to the datanodes as splits of 64 MB?
Ans: Your file will be stored on the datanodes as blocks of 64 MB, and the location and order of those blocks is stored on the namenode.
3. Will it not replicate until I run a MapReduce job, with the map preparing splits and replicating to the datanodes?
Ans: This is not true. As soon as the data is copied into HDFS, the filesystem replicates it according to the configured replication factor, regardless of the process used to copy the data.
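As a hedged check (the path comes from the question above), you can confirm the replication factor, block size, and block placement right after the copy, before any MapReduce job runs:
hdfs dfs -stat "%r %o %b" /user/hduser/inputData/TestData.txt
hadoop fsck /user/hduser/inputData/TestData.txt -files -blocks -locations
%r is the replication factor, %o the block size, and %b the file length in bytes; fsck additionally lists every block and the datanodes holding its replicas.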

Size of the NAMENODE and DATANODE in hadoop

What is the size of the NAMENODE and the DATANODE, and are a block and a datanode different or not in Hadoop?
If the input file is 200 MB in size, how many datanodes will be used and how many blocks will be created?
NameNode, SecondaryNameNode, JobTracker, DataNode are components in a hadoop cluster.
Blocks are how data is stored on HDFS.
A datanode can simply be seen as a hard disk (lots of storage - TBs) with some RAM and a processor, attached to the network.
Just like you can store many files on a hard disk, you can do that here as well.
Now, if you want to store a 200 MB file on Hadoop:
a v1.0 system or lower will split it into 4 blocks of up to 64 MB each (three full 64 MB blocks plus one 8 MB block);
a v2.0 system or higher will split it into 2 blocks of up to 128 MB each (one full 128 MB block plus one 72 MB block).
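To verify this on a running cluster, a minimal sketch (the file path is hypothetical):
hdfs getconf -confKey dfs.blocksize
hadoop fsck /path/to/200mb-file -files -blocks
The first command prints the configured block size in bytes (134217728 for 128 MB; the property is dfs.block.size on Hadoop 1.x), and the second lists how many blocks the file was actually split into.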
After this small explanation, I'd refer you to #charles_Babbage's suggestion to start with a book or with tutorials on YouTube.

When I store files in HDFS, will they be replicated?

I am new to Hadoop.
When I store Excel files using the hadoop fs -put command, they are stored in HDFS.
The replication factor is 3.
My question is: does it take 3 copies and store them on 3 nodes, one on each?
Here is a comic about how HDFS works.
https://docs.google.com/file/d/0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1/edit?pli=1
"Does it take 3 copies and store them on 3 nodes, one on each?"
The answer is: NO.
Replication is done by pipelining,
that is, the client copies a block of the file to datanode1, datanode1 copies it to datanode2, and datanode2 copies it to datanode3.
See the Replication Pipelining section here:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Replication+Pipelining
Your HDFS client (hadoop fs in this case) is given the block names and the datanode locations for storing those blocks by the NameNode (the first location being the closest, if the NameNode can determine this from the rack-awareness script).
The client then copies each block to the closest datanode. That datanode is responsible for copying the block to a second datanode (preferably on another rack), and finally the second copies it to a third (on the same rack as the second).
So your client only copies data to one of the datanodes, and the framework takes care of the replication between datanodes.
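To see the rack/datanode layout the NameNode uses for these placement decisions, one quick check (it may require HDFS superuser access) is:
hdfs dfsadmin -printTopology
This prints each rack and the datanodes registered under it; with no rack-awareness script configured, all nodes show up under /default-rack.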
It will store the original file as one block (or more, in the case of large files). These blocks will be replicated to two other nodes.
Edit: My answer applies to Hadoop 2.2.0. I have no experience with prior versions.
Yes, it will be replicated on 3 nodes (at most 3 nodes).
The Hadoop client is going to break the data file into smaller "blocks" and place those blocks on different machines throughout the cluster. The more blocks you have, the more machines will be able to work on this data in parallel. At the same time, these machines may be prone to failure, so it is safer to ensure that every block of data is on multiple machines at once to avoid data loss.
So each block will be replicated in the cluster as it is loaded. The standard setting for Hadoop is to keep three copies of each block in the cluster. This can be configured with the dfs.replication parameter in the file hdfs-site.xml.
And replicating data is not a drawback of Hadoop at all; in fact, it is an integral part of what makes Hadoop effective. Not only does it give you a good degree of fault tolerance, it also helps run your map tasks close to the data to avoid putting extra load on the network (read about data locality).
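For reference, the cluster-wide default mentioned above is set in hdfs-site.xml roughly like this (a minimal sketch; three replicas is the standard setting):
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>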
Yes, it makes n (the replication factor) copies in HDFS.
Use this command to find out where a file is located: which racks it is stored on, and the block names on all of them:
hadoop fsck /path/to/your/directory -files -blocks -locations -racks
Use this command to load data into HDFS with a specific replication factor:
hadoop fs -Ddfs.replication=1 -put big.file /tmp/test1.file
With -Ddfs.replication=1 you define how many replica copies are created while loading the data into HDFS.

How is data copied to HDFS when there's multiple dfs.data.dir

dfs.data.dir allows more than one directory for the datanode to store data blocks in. When data is copied to HDFS, how is it distributed across those directories?
When dfs.data.dir has multiple values, data is written to the directories in a round-robin fashion. If the disk holding one of the directories is full, the round-robin writes continue on the remaining directories.
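A minimal sketch of such a configuration in hdfs-site.xml (the mount points are hypothetical; on newer Hadoop versions the property is named dfs.datanode.data.dir):
<property>
  <name>dfs.data.dir</name>
  <value>/data/disk1/dfs/data,/data/disk2/dfs/data</value>
</property>
Blocks written to this datanode then alternate between the two directories.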
