hadoop: putting data on master node not reflected on the other cluster nodes

I have set up a Hadoop cluster of 3 nodes for testing. Everything works fine, but when I upload a file to the master node using the following command
./bin/hadoop dfs -copyFromLocal localfolderName hdfsfolderName
the files end up only on the master node's DataNode.
I thought that Hadoop would split the input files and spread the chunks across all the slave nodes.
I want to know whether I am missing some configuration, or whether this is simply how Hadoop behaves.
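One way to check whether the slave DataNodes are even registered with the NameNode, and whether they received any blocks, is the admin report (a quick sketch using the same old-style CLI as above):
./bin/hadoop dfsadmin -report     # lists every live DataNode with its used/remaining capacity
# if only the master's DataNode shows up here, the slaves never joined the cluster,
# which would explain why all the data stays on the master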

Related

How many DataNodes are used to run mappers for many small files in one Hadoop job?

I have one NameNode (Hadoop-Master) and three DataNodes (Hadoop-Master, Hadoop-Slave1, Hadoop-Slave2); Hadoop-Master runs both the NameNode and a DataNode. The OS is Ubuntu 16.04, the Hadoop version is 2.7.6, the block size is 128 MB, and HDFS is configured with a replication factor of 2.
I have ten files in HDFS (/file1, /file2, ... /file10), each 127 MB, so each file fits in a single block. The files are distributed across the three DataNodes, but I don't understand why every file has a replica on Hadoop-Master (as a DataNode), while the other two DataNodes (Hadoop-Slave1 and Hadoop-Slave2) each hold replicas of only some of the files.
I wrote a Java MapReduce program to process all ten files, using the following code to add them to the Job:
FileInputFormat.addInputPath(job, new Path("/file1"));
FileInputFormat.addInputPath(job, new Path("/file2"));
... ...
FileInputFormat.addInputPath(job, new Path("/file10"));
After the Job finishes, I read the console log. It seems that only Hadoop-Master does all the mapper and reducer work; the other two DataNodes do not appear in the console log at all.
I actually want all the DataNodes to process these files in parallel, for example Hadoop-Master processing file1~file4, Hadoop-Slave1 processing file5~file7, and Hadoop-Slave2 processing file8~file10.
What should I do to make all the DataNodes take part in the Job?

Hadoop : swap DataNode & NameNode without losing any HDFS data

I have a cluster of 5 machines:
1 big NameNode
4 standard DataNodes
I want to swap my current NameNode with a DataNode without losing the data stored in HDFS, so that my cluster would become:
1 standard NameNode
3 standard DataNodes
1 big DataNode
Does someone know a simple way to do that?
Thank you very much
1. Decommission the DataNode where the NameNode will be moved.
2. Stop the cluster.
3. Create a tar of dfs.name.dir from the current NameNode.
4. Copy all Hadoop config files from the current NN to the target NN.
5. Replace the name/IP of the target NameNode by modifying core-site.xml.
6. Restore the tarball of dfs.name.dir on the target node. Make sure the full path is the same.
7. Now start the cluster by starting the new NameNode and one fewer DataNode.
8. Verify that everything is working properly.
9. Add the old NameNode back as a DataNode by configuring it as a DataNode.
I would suggest uninstalling and reinstalling Hadoop on both nodes so that the previous configuration does not cause any problems. A rough command sketch of these steps follows.
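Assuming Hadoop 1.x-style scripts, that dfs.name.dir is /data/dfs/name, and that the target node is reachable as new-nn (all of these names are placeholders; adjust for your setup):
# on the current NameNode, after decommissioning the target DataNode
bin/stop-all.sh                                      # stop the cluster
tar czf namedir.tar.gz -C /data/dfs/name .           # snapshot the NameNode metadata
scp namedir.tar.gz conf/*.xml new-nn:/tmp/           # copy metadata and config to the target node

# on the target (new) NameNode
mkdir -p /data/dfs/name                              # must be the same dfs.name.dir path
tar xzf /tmp/namedir.tar.gz -C /data/dfs/name        # restore the metadata
# edit core-site.xml so fs.default.name points at this host, then:
bin/start-dfs.sh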

transfer file from local machine of 1 cluster to hdfs of another cluster

I have 2 Hadoop clusters (A and B) and want to transfer a file from the local filesystem of cluster A to HDFS on cluster B. Is there a way to do it?
I tried copyFromLocal and put, but it looks like they don't copy the file over to HDFS on cluster B; they report that the operation is not supported:
copyFromLocal: Not supported
FYI: the connection looks open, as I am able to read cluster B's HDFS from a machine in cluster A (hadoop fs -ls hdfs://NNofB:port/path).
I'm not sure there is a direct way from HDFS to HDFS, but you could get the data from HDFS on a node in cluster A, scp it to a node in cluster B, then put it into HDFS from that node in cluster B.
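For example, a rough sketch of that route (nodeB, user and the paths are placeholders):
hadoop fs -get /path/on/clusterA/file /tmp/file                 # pull from cluster A's HDFS to local disk
scp /tmp/file user@nodeB:/tmp/file                              # ship it to a node that belongs to cluster B
ssh user@nodeB 'hadoop fs -put /tmp/file /path/on/clusterB/'    # push it into cluster B's HDFS
If the file only exists on the local disk of the cluster A machine (as in the question), you can skip the -get step and start from the scp.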

Does Hadoop create multiple copies of input files, one copy per node

If I wish to copy a file from a local directory to HDFS, do I need to physically copy the file to each Hadoop node?
Or, if I use the hadoop dfs command, will Hadoop internally create a copy of this file on each node?
Am I correct to assume that each node needs to have a copy of the file?
When you copy a file (any data), HDFS stores it on one or more DataNodes, and the metadata is stored on the NameNode. Replication of the data is handled by Hadoop; you do not need to copy it multiple times.
You can use either of the commands below to copy files from the local filesystem to HDFS:
hdfs dfs -put <source> <destination>
hdfs dfs -copyFromLocal <source> <destination>
The replication factor configuration is stored in the hdfs-site.xml file.
Am I correct to assume that each node needs to have a copy of the file?
This is not necessarily true. HDFS creates replicas according to the configuration in the hdfs-site.xml file; the default replication factor is 3.
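A quick way to see and change the replication factor of a file that is already in HDFS (a sketch; /user/me/data.txt is a placeholder path):
hdfs dfs -stat %r /user/me/data.txt       # print the current replication factor of the file
hdfs dfs -setrep -w 2 /user/me/data.txt   # change it to 2 and wait for re-replication to finish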
Yes, by default HDFS replicates data to 3 DataNodes. As a side note, the current trend is toward Spark, which also runs on top of Hadoop and is advertised as up to 100 times faster than classic MapReduce.
Spark: http://spark.apache.org/downloads.html
You are not required to copy the file from your local machine to every node in the cluster.
You can use client utilities such as the hadoop fs or hadoop dfs commands to do so.
It is not necessary for your file to be copied to all the nodes in the cluster; the number of replicas is controlled by the dfs.replication property in the hdfs-site.xml configuration file, whose default value is 3, meaning 3 copies of your file will be stored across the cluster on nodes chosen by the NameNode.
In more detail:
1. The hadoop dfs command first contacts the NameNode with the given file's details.
2. The NameNode computes the number of blocks the file has to be split into, according to the block size configured in hdfs-site.xml.
3. The NameNode returns a list of chosen DataNodes for every computed block of the file. The number of DataNodes in each list equals the replication factor configured in hdfs-site.xml.
4. The Hadoop client then starts streaming each block of the file to the chosen DataNodes.
5. For each block, the client prepares a data pipeline in which all the DataNodes chosen to store that block form a queue.
6. The client copies the current block only to the first DataNode in the queue.
7. Upon completion of that copy, the first DataNode cascades the block to the second DataNode in the queue, and so on.
8. All block details of the files, and the DataNodes that hold copies of them, are maintained in the NameNode's metadata.
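You can see the outcome of this process for a particular file, i.e. which blocks it was split into and which DataNodes hold each replica, with fsck (a sketch; /user/me/data.txt is a placeholder path):
hadoop fsck /user/me/data.txt -files -blocks -locations
# for each block this prints the block id plus the DataNodes holding a replica,
# i.e. the nodes that formed that block's write pipeline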
You do not need to copy files manually to all nodes.
Hadoop will take care of distributing data to different nodes.
You can use simple commands to upload data to HDFS:
hadoop fs -copyFromLocal </path/to/local/file> </path/to/hdfs>
OR
hadoop fs -put </path/to/local/file> </path/to/hdfs>
You can read more about how data is internally written to HDFS here: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
You can also download files from HDFS to the local filesystem, without manually copying files from each DataNode, using either of these commands:
hadoop fs -copyToLocal </path/to/hdfs/file> </path/to/local>
OR
hadoop fs -get </path/to/hdfs/file> </path/to/local>

How to remove a hadoop node from DFS but not from Mapred?

I am fairly new to Hadoop. For running some benchmarks, I need a variety of Hadoop configurations for comparison.
I want to know how to remove a Hadoop slave from DFS (i.e. stop running the DataNode daemon) but not from MapReduce (keep the TaskTracker running), or vice versa.
AFAIK, there is a single slaves file for such Hadoop nodes, not separate slaves files for DFS and MapReduce.
Currently, I start both DFS and MapReduce on the slave node and then kill the DataNode on the slave, but it takes a while for that node to show up as a 'dead node' in the HDFS web UI. Can any parameter be tuned to make this timeout shorter?
Thanks!
Try using dfs.hosts and dfs.hosts.exclude in hdfs-site.xml, and mapred.hosts and mapred.hosts.exclude in mapred-site.xml. These properties control which hosts are allowed to connect to the NameNode and the JobTracker.
Once the lists of nodes in those files have been updated appropriately, the NameNode and the JobTracker have to be refreshed using the hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes commands, respectively.
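A minimal sketch of the HDFS-only removal, assuming dfs.hosts.exclude and mapred.hosts.exclude already point at /etc/hadoop/conf/dfs.exclude and /etc/hadoop/conf/mapred.exclude (the paths and hostname are placeholders):
echo "slave2.example.com" >> /etc/hadoop/conf/dfs.exclude   # take slave2 out of HDFS only
hadoop dfsadmin -refreshNodes                               # NameNode re-reads the include/exclude lists
# leave mapred.exclude untouched so the TaskTracker on slave2 keeps accepting tasks;
# for the reverse case, edit mapred.exclude instead and run: hadoop mradmin -refreshNodes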
Instead of using the slaves file to start all processes on your cluster, you can start only the required daemons on each machine if you have just a few nodes.
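For example, a minimal sketch using the Hadoop 1.x per-daemon scripts (paths relative to the Hadoop install directory):
bin/hadoop-daemon.sh start tasktracker   # on a node that should only run MapReduce tasks
bin/hadoop-daemon.sh start datanode      # on a node that should only serve HDFS storage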