Transfer a file from the local machine of one cluster to the HDFS of another cluster

I have two Hadoop clusters (A and B) and want to transfer a file from the local file system of cluster A to the HDFS of cluster B. Is there a way to do it?
I tried copyFromLocal and put, but they don't copy the file to the HDFS of cluster B and report that the operation is not supported:
copyFromLocal: Not supported
FYI: the connection appears to be open, since I can read the HDFS of cluster B from a local machine of cluster A (hadoop fs -ls hdfs://NNofB:port/path).

I'm not sure whether there is a direct HDFS-to-HDFS way, but you could get the data from HDFS onto a node in cluster A, scp it to a node in cluster B, and then put it into HDFS from that node in cluster B.
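A minimal sketch of that relay for the case in the question, where the file already sits on a local disk in cluster A, so the initial get from HDFS is not needed. The function name relay_commands, the host user@nodeB.example.com, the /tmp staging directory, and the example paths are all hypothetical placeholders. The function only prints the two commands so they can be reviewed before being run by hand.

```shell
# Sketch only: relay a local file through a node of cluster B.
# EDGE_B and STAGING are hypothetical placeholders, not values from the question.
EDGE_B="user@nodeB.example.com"   # assumed: a cluster B node with the hadoop client
STAGING="/tmp"                    # assumed: scratch directory on that node

# Print the commands that move local file $1 into HDFS path $2 on cluster B.
relay_commands() {
    local_file=$1
    hdfs_dest=$2
    name=$(basename "$local_file")
    # Step 1: copy the file to the local disk of the cluster B node.
    printf 'scp %s %s:%s/%s\n' "$local_file" "$EDGE_B" "$STAGING" "$name"
    # Step 2: on that node, load the staged copy into cluster B's HDFS.
    printf 'ssh %s hadoop fs -put %s/%s %s\n' "$EDGE_B" "$STAGING" "$name" "$hdfs_dest"
}

relay_commands /data/report.csv /user/hadoop/report.csv
```

For the example call this prints an scp command followed by an ssh ... hadoop fs -put command. Remember to delete the staged copy on the cluster B node afterwards.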

Related

Hadoop on Windows: how to add D:\folder1 and E:\folder1 to HDFS?

hadoop fs -put popularNames.txt /user/hadoop/dir1/popularNames.txt
My folders are huge, about 3 TB.
I don't want to copy the folders; I want to point HDFS at their existing location. How can I do that?
HDFS stands for Hadoop Distributed File System.
You can't add a link that points to a local location, because the data must be present in HDFS itself (not on the local disk). The whole point of using Hadoop is distributed computation, which is only possible when your data is distributed across the cluster.
You have to use hadoop fs -put to copy the files from your local disk into HDFS in order to use the Hadoop framework.

Reading a file in Spark in cluster mode in Amazon EC2

I'm trying to execute a Spark program in cluster mode on Amazon EC2 using
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster --class com.mycompany.SimpleApp ./spark.jar
And the class has a line that tries to read a file:
JavaRDD<String> logData = sc.textFile("/user/input/CHANGES.txt").cache();
I'm unable to read this text file in cluster mode, even though I can read it in standalone mode. In cluster mode, Spark tries to read from HDFS, so I put the file in HDFS at /root/persistent-hdfs using
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
And I can see the file using hadoop fs -ls /wordcount/input. But Spark is still unable to read the file. Any idea what I'm doing wrong? Thanks.
You might want to check the following points:
Is the file really in the persistent HDFS?
It seems that you only copied the input file from /app/hadoop/tmp/input.txt to /wordcount/input/input.txt, all on the node's disk. I believe you have misunderstood what the hadoop commands do.
Instead, try putting the file explicitly into the persistent HDFS (/root/persistent-hdfs/), and then load it using the hdfs://... prefix.
Is the persistent HDFS server up?
It seems that Spark only starts the ephemeral HDFS server by default. To switch to the persistent HDFS server, do the following:
1) Stop the ephemeral HDFS server: /root/ephemeral-hdfs/bin/stop-dfs.sh
2) Start the persistent HDFS server: /root/persistent-hdfs/bin/start-dfs.sh
Please try these steps; I hope they help.
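To make the hdfs://... prefix concrete, here is a small sketch that assembles the fully qualified URI to pass to sc.textFile. The function name hdfs_uri is made up for illustration, and the host and port are left as placeholders because they are not given in the question. The real values are whatever fs.default.name (fs.defaultFS on newer Hadoop versions) is set to in the configuration of the persistent HDFS.

```shell
# Sketch: build a fully qualified HDFS URI from host, port and absolute path.
# hdfs_uri is a hypothetical helper; host and port below are placeholders.
hdfs_uri() {
    printf 'hdfs://%s:%s%s\n' "$1" "$2" "$3"
}

# Prints hdfs://<master-ip>:<port>/wordcount/input/input.txt
hdfs_uri "<master-ip>" "<port>" /wordcount/input/input.txt
```

Passing the full URI removes any doubt about which file system (local disk, ephemeral HDFS, or persistent HDFS) Spark resolves the path against.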

Does Hadoop create multiple copies of input files, one copy per node?

If I wish to copy a file from a local directory to HDFS, do I need to physically copy the file to each Hadoop node?
Or, if I use the hadoop dfs command, will Hadoop internally create a copy of this file on each node?
Am I correct to assume that each node needs to have a copy of the file?
When you copy a file (any data) to HDFS, Hadoop stores it on the Datanodes, and the metadata is stored on the Namenode. Hadoop takes care of replicating the file (data); you do not need to copy it multiple times.
You can use either of the commands below to copy files from the local file system to HDFS:
hdfs dfs -put <source> <destination>
hdfs dfs -copyFromLocal <source> <destination>
The replication factor is configured in the hdfs-site.xml file.
Am I correct to assume that each node needs to have a copy of the file?
Not necessarily. HDFS creates replicas according to the configuration in the hdfs-site.xml file. The default replication factor is 3.
Yes, HDFS replicates data to 3 Datanodes by default. The current trend, however, is Spark, which can also run on top of Hadoop and is advertised as up to 100 times faster than Hadoop MapReduce for in-memory workloads.
Spark: http://spark.apache.org/downloads.html
You are not required to copy the file from local machine to every node in the cluster.
You can use client utilities such as the hadoop fs or hadoop dfs commands to do so.
Your file will not necessarily be copied to all the nodes in the cluster. The number of replicas is controlled by the dfs.replication property in the hdfs-site.xml configuration file. Its default value is 3, which means that 3 copies of each block of your file are stored on Datanodes chosen by the Namenode.
More details follow:
1) The hadoop dfs command first contacts the Namenode with the details of the given file.
2) The Namenode computes the number of blocks the file has to be split into, according to the block size configured in hdfs-site.xml.
3) The Namenode returns a list of chosen Datanodes for every computed block of the file. The number of Datanodes in each list equals the replication factor configured in hdfs-site.xml.
4) The Hadoop client then starts storing each block of the file on the given Datanodes by streaming the data to them.
5) For each block, the Hadoop client prepares a data pipeline in which the Datanodes chosen to store the block form a queue.
6) The Hadoop client copies the current block only to the first Datanode in the queue.
7) The first Datanode forwards the data it receives to the second Datanode in the queue, and so on down the pipeline.
8) The details of all blocks of the file, and of the Datanodes that hold copies of them, are maintained in the Namenode's metadata.
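The block and replica arithmetic behind these steps can be sketched in a few lines. The function names block_count and replica_count are hypothetical, and the 300 MB file, 128 MB block size, and replication factor of 3 are assumed example values, not taken from any real cluster.

```shell
# Blocks needed for a file: ceil(file_size / block_size), in integer arithmetic.
# $1 = file size in bytes, $2 = block size in bytes.
block_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Block replicas stored cluster-wide: blocks * replication factor.
# $1 = file size in bytes, $2 = block size in bytes, $3 = replication factor.
replica_count() {
    echo $(( $(block_count "$1" "$2") * $3 ))
}

mb=$((1024 * 1024))
block_count   $((300 * mb)) $((128 * mb))      # prints 3
replica_count $((300 * mb)) $((128 * mb)) 3    # prints 9
```

So the assumed 300 MB file occupies 3 blocks (two full ones and a partial one), and with 3 replicas per block the cluster stores 9 block copies in total, spread over the Datanodes rather than one full copy per node.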
You do not need to copy files manually to all nodes.
Hadoop will take care of distributing data to different nodes.
You can use simple commands to upload data to HDFS:
hadoop fs -copyFromLocal </path/to/local/file> </path/to/hdfs>
OR
hadoop fs -put </path/to/local/file> </path/to/hdfs>
You can read more about how data is written to HDFS internally here: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
You can also download a file from HDFS to the local file system, without manually copying files from each Datanode, using:
hadoop fs -copyToLocal </path/to/hdfs/file> </path/to/local>
OR
hadoop fs -get </path/to/hdfs/file> </path/to/local>

Not able to access HDFS from the Datanodes in cluster

I have installed Cloudera CDH4 on a 3-node cluster and am facing a problem when trying to access the data in HDFS from the slave nodes (Datanodes).
When I try to create a new folder in HDFS using
hadoop fs -mkdir Flume
(where Flume is the folder name), I am not able to put data or create the folder in the cluster's HDFS from either of the slaves, although it works from the master node. Flume, Hive, Pig, and all the other processes are running on the slaves (Datanodes).
I tried:
restarting the cluster
formatting the Namenode
Still not working!
Secondly, when I run
hadoop fs -ls /
the results are not from HDFS but from the current directory path of the slave node from which I am running the command.
Also, how can I check that HDFS is installed and working properly on the slave nodes (Datanodes) of the cluster, other than by creating a directory in HDFS?
Could anybody help?
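Please verify the fs.default.name property in core-site.xml on each slave node. If it is missing, Hadoop falls back to the local file system (file:///), which would explain why hadoop fs -ls / lists local directories instead of HDFS.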
<property>
<name>fs.default.name</name>
<value>hdfs://namenode:9000</value>
</property>
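One way to check the setting on each node is to read the value straight out of the file. This is a rough sketch: conf_value is a hypothetical helper, and it assumes the <value> line directly follows the <name> line, as in the snippet above. The demo runs against a throwaway copy of that snippet; on a real node, point it at the node's own core-site.xml, whose location depends on the installation.

```shell
# Print the value of property $2 in Hadoop config file $1.
# Assumes <value> is on the line right after <name>.
conf_value() {
    grep -F -A 1 "<name>$2</name>" "$1" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Demo against a temporary file holding the property shown above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<property>
<name>fs.default.name</name>
<value>hdfs://namenode:9000</value>
</property>
EOF
conf_value "$sample" fs.default.name   # prints hdfs://namenode:9000
rm -f "$sample"
```

If the helper prints nothing, or prints a file:/// URI, on a slave node, that node's client is not pointed at the Namenode.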

Hadoop: data put on the master node is not reflected across the cluster

I have set up a Hadoop cluster of 3 nodes for testing. Everything works fine, but when I upload a file from the master node using the following command
./bin/hadoop dfs -copyFromLocal localfolderName hdfsfolderName
the files show up only in the Datanode of the master node.
I thought that Hadoop would split the input files and spread the chunks across all the slave nodes.
I want to know whether I am missing some configuration, or is this how Hadoop behaves?
