distcp not working between Hadoop versions 2.0.0 and 0.20 - hadoop

During distcp between two versions of Hadoop I am getting the error below:
Server IPC version 9 cannot communicate with client version 3
I am using the command below:
hadoop distcp
Solutions will be appreciated.

distcp does not work between incompatible HDFS versions when both sides use hdfs://.
You must run distcp on the destination cluster and use the hftp:// protocol (a read-only protocol) for the source cluster.
Note: the default ports differ between protocols, so the command ends up looking like:
hadoop distcp hftp://<source>:50070/<src path> hdfs://<dest>:8020/<dest path>
or, with placeholder values:
hadoop distcp hftp://foo.company.com:50070/data/baz hdfs://bar.company.com:8020/data/
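If you are not sure which HTTP port the source NameNode exposes for hftp, you can usually read it from the configuration. A minimal check, assuming a Hadoop 2.x client (hdfs getconf is not available on very old releases):
# Print the NameNode HTTP address; hftp:// talks to this port.
# On older 1.x-style configs the key is dfs.http.address instead.
hdfs getconf -confKey dfs.namenode.http-address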

Related

How to check the hadoop distribution used in my cluster?

How can I tell whether my cluster has been set up with Hortonworks, Cloudera, or a plain installation of the Hadoop components?
Also, how can I find the port numbers of the various services?
It is difficult to identify the Hadoop distribution from port numbers, since the Apache, Hortonworks, and Cloudera distros use different port numbers.
One option is to check for cluster management service agents (a quick test follows below): Cloudera Manager's agent start-up script is /etc/init.d/cloudera-scm-agent, Hortonworks' Ambari agent start-up script is /etc/init.d/ambari-agent, and vanilla Apache Hadoop has no such agents on the server.
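Assuming the default init-script locations named above:
# Lists whichever agent scripts exist; no output means neither agent is installed
ls /etc/init.d/cloudera-scm-agent /etc/init.d/ambari-agent 2>/dev/null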
Another option is to check the Hadoop classpath; the command below prints it.
`hadoop classpath`
Most Hadoop distributions include the distro name in the classpath. If the classpath contains neither of the keywords below, the distribution/setup is a plain Apache installation (see the sketch after this list).
hdp - (Hortonworks)
cdh - (Cloudera)
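A minimal sketch of that keyword check, assuming a POSIX shell with GNU grep:
# Split the classpath into one entry per line and report the first distro keyword;
# no output suggests a plain Apache installation
hadoop classpath | tr ':' '\n' | grep -m1 -oE 'cdh|hdp'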
The simplest way is to run the hadoop version command: the output shows which version of Hadoop you have, plus the distribution and its version. If you see cdh in the output, it is Cloudera; hdp means Hortonworks.
For example, I am on a Cloudera cluster, and the hadoop version command there produces output like the sample below.
The first line shows the Hadoop version, followed by the distribution and its version.
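As an illustration only (the original answer included a screenshot; the version numbers below are made up):
$ hadoop version
Hadoop 2.0.0-cdh4.7.0
Subversion ...
Compiled by jenkins on ...
This command was run using /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.7.0.jar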
Hope this will help.
The hdfs version command will also give you the Hadoop version and its distribution.

Hadoop distcp not working

I am trying to copy data from one HDFS cluster to another. Any suggestion why the 1st command works but the 2nd does not?
(works)
hadoop distcp hdfs://abc.net:8020/foo/bar webhdfs://def.net:14000/bar/foo
(does not work)
hadoop distcp webhdfs://abc.net:50070/foo/bar webhdfs://def:14000/bar/foo
Thanks!
If the two clusters are running incompatible versions of HDFS, you can use the webhdfs protocol to distcp between them.
hadoop distcp webhdfs://namenode1:50070/source/dir webhdfs://namenode2:50070/destination/dir
When using webhdfs, both the source and destination addresses must give the NameNode URI with the NameNode HTTP port.
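To confirm that webhdfs is actually reachable on a given host and port before starting the copy, you can hit the REST endpoint directly (hostname and path here are placeholders):
# A 200 response with a FileStatuses JSON body confirms the webhdfs address is right
curl -i "http://namenode1:50070/webhdfs/v1/source/dir?op=LISTSTATUS"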

Which Hadoop 0.23.8 jars are needed for HBase 0.94.8

I'm using Hadoop 0.23.8 pseudo distributed and HBase 0.94.8. My HBase master is failing with:
Server IPC version 5 cannot communicate with client version 4
I think this is because HBase is using hadoop-core-1.0.4.jar in its lib folder.
Now http://cloudfront.blogspot.in/2012/06/how-to-configure-habse-in-pseudo.html#.UYfPYkAW38s suggests I should replace this jar by copying:
the hadoop-core-*.jar from your HADOOP_HOME ...
but there are no hadoop-core-*.jar files in 0.23.8.
Will this process work for 0.23.8, and if so, which jars should I be using?
TIA!
I gave up on this and am using Hadoop 2.2.0, which works well (ish) with HBase.
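For anyone who still wants to attempt the jar swap on 0.23.8: that release split hadoop-core into per-module jars, so the rough equivalent of the linked procedure would look like the sketch below. The paths are assumptions based on the standard 0.23 tarball layout; back up HBase's lib directory first.
# Remove the bundled Hadoop 1.x jar that causes the IPC version mismatch
mv $HBASE_HOME/lib/hadoop-core-1.0.4.jar /tmp/
# Copy in the split jars that replaced hadoop-core in the 0.23 line (assumed paths)
cp $HADOOP_HOME/share/hadoop/common/hadoop-common-0.23.8.jar $HBASE_HOME/lib/
cp $HADOOP_HOME/share/hadoop/hdfs/hadoop-hdfs-0.23.8.jar $HBASE_HOME/lib/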

Import data between Hadoop clusters with different versions using the command line

Can you tell me the exact command to import data between the HDFS of two clusters running different Hadoop versions, one on Hadoop 2.0.4-alpha and the other on 2.4.0? How can I use the distcp command in this case?
When the versions differ, use hftp instead of the plain hdfs scheme. You can see examples on the Cloudera website. Use hftp for the source cluster address and hdfs for the destination cluster address.
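A minimal example, run from the destination (2.4.0) cluster; the hostnames are placeholders, 50070 is the default NameNode HTTP port that hftp uses, and 8020 the default hdfs RPC port:
hadoop distcp hftp://source-nn:50070/path/to/data hdfs://dest-nn:8020/path/to/data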

Getting check-sum mismatch during data transfer between two different versions of Hadoop

I am new to Hadoop. I am transferring data between Hadoop 0.20 and Hadoop 2.2.0 using the distcp command.
During the transfer I am getting the error below:
Check-sum mismatch between
hftp://10.0.3.28:50070/hive/warehouse/staging_precall_cdr/operator=idea/PRECALL_CDR_Assam_OCT_JAN.csv
and
hdfs://10.0.20.118:9000/user/hive/warehouse/PRECALL_CDR_Assam_OCT_JAN.csv
I have tried -skipcrccheck and -Ddfs.checksum.type=CRC32 as well, but neither solved it.
Solutions will be appreciated.
This looks like a known issue with copying data between the 0.20 and 2.2.0 Hadoop versions, tracked in Jira: https://issues.apache.org/jira/browse/HDFS-3054
One workaround is to preserve the block size and checksum type during the distcp copy with -pbc:
hadoop distcp -pbc <SRC> <DEST>
OR
Or skip the CRC check with the -skipcrccheck option (which distcp requires to be combined with -update):
hadoop distcp -skipcrccheck -update <SRC> <DEST>
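Since -skipcrccheck turns off checksum verification, a cheap sanity check afterwards is to compare byte counts on the two clusters (paths taken from the question; run from the 2.2.0 side, where hadoop fs -du -s is available):
# The byte counts printed for the two files should match
hadoop fs -du -s hftp://10.0.3.28:50070/hive/warehouse/staging_precall_cdr/operator=idea/PRECALL_CDR_Assam_OCT_JAN.csv
hadoop fs -du -s hdfs://10.0.20.118:9000/user/hive/warehouse/PRECALL_CDR_Assam_OCT_JAN.csv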
