Does Hadoop create multiple copies of input files, one copy per node - hadoop

If I wish to copy a file from a local directory to a HDFS, do I need to physically copy the file on each Hadoop node?
Or if I use the hadoop dfs command, Hadoop will internally create a copy of this file on each node?
Am I correct to assume that each node needs to have a copy of the file?

When you will copy the file (any data) Hadoop (HDFS) will store it on any Datanode and metadata information will be stored on Namenode. The replication of the file (data) will be taken care by Hadoop, you need not to copy it multiple times.
You can use of the below command to copy files from local to HDFS
hdfs dfs -put <source> <destination>
hdfs dfs -copyFromLocal <source> <destination>
The replication factor configuration is stored in hdfs-site.xml file.
Am I correct to assume that each node needs to have a copy of the file?
This is not necessarily true. HDFS creates replica as per the configuration found in the hdfs-site.xml file. The default for the replication is 3.

Yeah hadoop distributed file system replicates data in minimum 3 datanodes. But nowadays trend is on spark which is also run on top of hadoop. And this is 100 times faster than hadoop.
spark http://spark.apache.org/downloads.html

You are not required to copy the file from local machine to every node in the cluster.
You could use client utiles like hadoop fs or hadoop dfs commands to do so.
It is not necessary that your file will be copied to all the nodes in the cluster, the number of replications is controlled by the dfs.replication property from the hdfs-site.xml configuration file, where its default value is 3, means that 3 copies of your file be stored across the cluster on some random nodes.
Please refer the more details below,
hadoop dfs command first contacts the Namenode with the given
files's details.
The Namenode computes the number of blocks that the file has to
splitted according to the block size configured in hdfs-site.xml
The Namenode returns the list of chosen Datanodes for every
computed block of the given file. This count of Datanodes in every
list is equal to the replication factor configured in the
hdfs-site.xml
Then the hadoop client starts storing every blocks of the file to the
given Datanode through Hadoop Streaming.
For each block, the hadoop client just prepares the data pipe line
in which all chosen Datanodes chosen to store the block are formed
as a Data queue.
The hadoop client just copies the current block only to the first
Datanode in the queue.
Upon completion of the copy, the first Datanode cascades the block to
second Datanode in the Queue and so on.
All the block details of the files and the details of Datanodes which
have the copy of them are maintained in Namenode's metadata.

You do not need to copy files manually to all nodes.
Hadoop will take care of distributing data to different nodes.
you can use simple commands to upload data to HDFS
hadoop fs -copyFromLocal </path/to/local/file> </path/to/hdfs>
OR
hadoop fs -put </path/to/local/file> </path/to/hdfs>
You can read more how data is internally written on HDFS here : http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
You can also download file from HDFS to local filessytem wothout manually copying files from each datanode using command:
hadoop fs -copyToLocal </path/to/hdfs/file> </path/to/local>
OR
hadoop fs -get </path/to/hdfs/file> </path/to/local>

Related

how do you create a hdfs data directory?

everytime my hadoop server reboots, I have to format the namenode to start the hadoop. This removes all of the files in my hadoop installation.
I need to move my hadoop hdfs location from /tmp file to permenant location where whenever the server reboots, I don't have to format the namenode etc.
I am very new to hadoop.
How do I create a hdfs file in another directory?
How do I reference this data directory in config file so that I don't have to format the namenode?
These two properties of the hdfs-site.xml determine where local files are stored.
The defaults are under /tmp
dfs.namenode.name.dir
dfs.datanode.data.dir
You typically have to format a namenode only when the HDFS processes failed to terminate correctly (such as a power failure or forced shutdown). It is encouraged to run a standby Namenode to prevent these scenarios.

hadoop on windows, how to add D:\folder1 and E:\folder1 to hdfs?

hadoop fs -put popularNames.txt /user/hadoop/dir1/popularNames.txt
My folders are very huge, about 3 TB.
I don't want to copy the folder, I want to set the hdfs to the location. How to make it?
HDFS: Hadoop distributed file system.
You can't add a link to point to a location, because the data must be present in the HDFS(not on local). The whole point of using hadoop is distributed computation, which is possible when your data is distributed on a cluster.
hadoop fs -put had to be used to move the file from your local to the hdfs in order to use hadoop framework.

How to read Hadoop HDFS 64b compressed files from one Hadoop cluster on another Hadoop cluster

I have 64 b compressed Hadoop HDFS files(FSImage, edit logs, blk_*.meta files etc) of one Hadoop cluster and I want to read them on my local VM. As an option I'm thinking like:
Taking backup of my DataNode, NameNode and SecondaryNameNode
Format the DataNode, NameNode and SecondaryNameNode
Copy the files into respective location
Map the configuration properties
Restart the HDFS.
This might be the one of the way. Please help me to figure out if there is any alternative for the above approach. My apologies that i can not share the data.

Explanation of the hadoop file system

Can any one help me understand the data storage concept of hadoop?
As I understand it, hadoop deals with fs image and data blocks, and fsimage and edit logs paths are stored hdfs-site.xml. But what about the data blocks? Can anyone help me in this? I am little bit confused where the /user and /tmp dir is actually present in the filesystem.
I used this link to set up a single node hadoop cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Files are split into blocks and stored in the Hadoop Distributed File System (HDFS). Consult the HDFS module of Yahoo's Hadoop Tutorial for a description of HDFS. The directories stored in HDFS can be viewed by typing the following command into a terminal: hadoop dfs -ls
The Namenode's FSImage keeps track of which Datanode has which files. In the hdfs-site.xml file, the configuration 'dfs.data.dir' defines where the datanode stores the underlying files on the filesystem. This can be a comma separated list of directories (think multiple disks).

hadoop putting data on master not relecting on all clusters

I have set up a hadoop cluster of 3 nodes for testing. Everything works fine, but when i upload the file into master node using the following command
./bin/hadoop dfs -copyFromLocal localfolderName hdfsfolderName
The files are reflected only in data node of master node.
I thought that hadoop will split the input files and spread the chunks across all the slave nodes.
I want to know if i am missing any configuration, or its how the hadoop behaves??

Resources