Transfer of files from an unsecured HDFS cluster to a secured HDFS cluster - hadoop

I want to transfer files from an unsecured HDFS cluster to a Kerberized cluster. I am using distcp to transfer the files, with the following command:
hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true hdfs://<ip>:8020/<sourcedir> hdfs://<ip>:8020/<destinationdir>
I am getting the following error after executing the above command on the Kerberized cluster:
java.io.EOFException: End of File Exception between local host is: "<xxx>"; destination host is: "<yyy>"; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException

This error occurs because the cluster is blocked for RPC communication. In such cases, the webhdfs protocol can be used instead, so the above distcp command can be rewritten as:
hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true hdfs://xxx:8020/src_path webhdfs://yyy:50070/target_path
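If you run the copy from the Kerberized cluster, the client also needs a valid Kerberos ticket before distcp can talk to the secure NameNode. A minimal sketch, with a hypothetical principal:
kinit myuser@EXAMPLE.COM   # hypothetical principal; obtain a Kerberos ticket first
klist                      # confirm the ticket is valid
hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true hdfs://xxx:8020/src_path webhdfs://yyy:50070/target_path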

Related

Hadoop copy file to remote server

How do I achieve this: initiate a Hadoop command from a server with the Hadoop client installed, to move data from the Hadoop cluster to a remote Linux server that does not have the Hadoop client installed?
Since you can't send the blocks of a file directly to a remote host, you would have to do this:
hadoop fs -get /path/src/file
scp ./file user@host:/path/dest/file
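If you would rather not stage the file on local disk first, the same idea can be done as a stream over ssh; a sketch assuming you have ssh access from the Hadoop client machine to the remote host:
hadoop fs -cat /path/src/file | ssh user@host 'cat > /path/dest/file'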

Reading a file in Spark in cluster mode in Amazon EC2

I'm trying to execute a Spark program in cluster mode on Amazon EC2 using
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster --class com.mycompany.SimpleApp ./spark.jar
And the class has a line that tries to read a file:
JavaRDD<String> logData = sc.textFile("/user/input/CHANGES.txt").cache();
I'm unable to read this txt file in cluster mode, even though I'm able to read it in standalone mode. In cluster mode, it tries to read from HDFS. So I put the file in HDFS at /root/persistent-hdfs using
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
And I can see the file using hadoop fs -ls /wordcount/input. But Spark is still unable to read the file. Any idea what I'm doing wrong? Thanks.
You might want to check the following points:
Is the file really in the persistent HDFS?
It seems that you just copied the input file from /app/hadoop/tmp/input.txt to /wordcount/input/input.txt, all on the node's own disk. I believe you have misunderstood the functionality of the hadoop commands.
Instead, you should try putting the file explicitly into the persistent HDFS (root/persistent-hdfs/) and then loading it using the hdfs://... prefix.
Is the persistent HDFS server up?
It seems that Spark only starts the ephemeral HDFS server by default. In order to switch to the persistent HDFS server, you must do the following:
1) Stop the ephemeral HDFS server: /root/ephemeral-hdfs/bin/stop-dfs.sh
2) Start the persistent HDFS server: /root/persistent-hdfs/bin/start-dfs.sh
Please try these steps; I hope they help.
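As a quick way to verify the above, you can check which filesystem URI the persistent HDFS is configured with and confirm the file is really there before pointing Spark at it. This is a rough sketch; the exact property name and NameNode port depend on your persistent-hdfs configuration, so treat hdfs://<master-ip>:<port> as a placeholder:
grep -A 1 fs.default.name /root/persistent-hdfs/conf/core-site.xml   # shows the hdfs://... URI of the persistent HDFS
/root/persistent-hdfs/bin/hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
/root/persistent-hdfs/bin/hadoop fs -ls /wordcount/input
In the Spark code, sc.textFile("hdfs://<master-ip>:<port>/wordcount/input/input.txt") should then resolve against the persistent HDFS rather than the ephemeral one.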

Hadoop distcp not working

I am trying to copy data from one HDFS cluster to another. Any suggestion why the first command works but the second one does not?
(works)
hadoop distcp hdfs://abc.net:8020/foo/bar webhdfs://def.net:14000/bar/foo
(does not work)
hadoop distcp webhdfs://abc.net:50070/foo/bar webhdfs://def:14000/bar/foo
Thanks!
If the two clusters are running incompatible versions of HDFS, you can use the webhdfs protocol to distcp between them.
hadoop distcp webhdfs://namenode1:50070/source/dir webhdfs://namenode2:50070/destination/dir
When using webhdfs, the NameNode URI and the NameNode HTTP port should be provided for both the source and the destination.
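Before launching the distcp, it can also help to confirm that webhdfs is actually answering on each NameNode's HTTP port (or on the HttpFS port, such as 14000 in the question, if that is what the destination exposes). A quick check with placeholder hosts and paths:
curl -i "http://namenode1:50070/webhdfs/v1/source/dir?op=LISTSTATUS"
curl -i "http://namenode2:50070/webhdfs/v1/destination/dir?op=LISTSTATUS"
If either call does not return an HTTP 200 with a JSON listing, fix the connectivity or the port before retrying distcp.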

Connection time out during map reduce process

I am transferring data from one cluster to another using the distcp command. I am getting the below problem during the map reduce process:
java.net.ConnectException: Connection timed out
I am using below command:
/home/hadoop/hadoop/bin/hadoop distcp -update -skipcrccheck "hftp://source:50070//hive/warehouse//tablename" "hdfs://destination:9000//hive/warehouse//tablename"
How can I solve this problem? Solutions will be appreciated.
If you are trying to transfer data from one HDFS cluster to another running a compatible Hadoop version, you do not need hftp at all.
Note that hftp is not an FTP protocol; it is a read-only, HTTP-based way of reading from HDFS, mainly useful when the two clusters run different Hadoop versions.
For HDFS to HDFS, try using the hdfs:// scheme on both sides, pointing at the NameNode RPC ports (here assuming the default 8020 on the source) rather than the HTTP port 50070:
/home/hadoop/hadoop/bin/hadoop distcp -update -skipcrccheck "hdfs://source:8020/hive/warehouse/tablename" "hdfs://destination:9000/hive/warehouse/tablename"
If the versions are incompatible, keep hftp:// on the source, which is served from the NameNode HTTP port (50070), and run the distcp from the destination cluster.
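Since the failure is a connection timeout, it is also worth confirming that the machines running the distcp map tasks can reach the relevant ports on both clusters. A quick probe, assuming nc (netcat) is installed and using the ports from the commands above as placeholders:
nc -zv source 8020        # NameNode RPC port on the source, if using hdfs://
nc -zv source 50070       # NameNode HTTP port, if you keep the hftp:// source
nc -zv destination 9000   # NameNode RPC port on the destination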

distcp not working between hadoop version 2.0.0 and 0.20

During distcp between two versions of Hadoop I am getting the below error:
Server IPC version 9 cannot communicate with client version 3
I am using the below command:
hadoop distcp
Solutions will be appreciated.
distcp does not work between different Hadoop versions when using hdfs:// on both the source and the destination.
You must run the distcp on the destination cluster and use the hftp:// protocol (a read-only protocol) for the source cluster.
Note: the default ports are different for different protocols, so the command ends up looking like:
hadoop distcp hftp://<source>:50070/<src path> hdfs://<dest>:8020/<dest path>
or, with placeholder host names:
hadoop distcp hftp://foo.company.com:50070/data/baz hdfs://bar.company.com:8020/data/
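As a sanity check before running the full distcp, you can try listing the source path over hftp from the destination (2.0.0) cluster; the hosts and paths below are the same placeholders as above:
hadoop fs -ls hftp://foo.company.com:50070/data/baz
If that listing works but the distcp still fails, the problem is more likely on the write side (the hdfs:// destination) than in the cross-version read.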
