Getting check-sum mismatch during data transfer between two different versions of Hadoop

I am new to Hadoop. I am transferring data between Hadoop 0.20 and Hadoop 2.2.0 using the distcp command.
During the transfer I am getting the error below:
Check-sum mismatch between
hftp://10.0.3.28:50070/hive/warehouse/staging_precall_cdr/operator=idea/PRECALL_CDR_Assam_OCT_JAN.csv
and
hdfs://10.0.20.118:9000/user/hive/warehouse/PRECALL_CDR_Assam_OCT_JAN.csv
I have also tried -skipcrccheck and -Ddfs.checksum.type=CRC32, but neither solved the problem.
Any solutions will be appreciated.

This looks like a known issue when copying data between Hadoop 0.20 and 2.2.0; see https://issues.apache.org/jira/browse/HDFS-3054.
A workaround is to preserve the block size and checksum type during the distcp copy using -pbc:
hadoop distcp -pbc <SRC> <DEST>
or to skip the CRC check using the -skipcrccheck option:
hadoop distcp -skipcrccheck -update <SRC> <DEST>
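Applied to the paths from the question, the first workaround would look like the following (a sketch only: it assumes the command is run on the destination 2.2.0 cluster, reading from the older cluster over hftp as described in the answers further down):
hadoop distcp -pbc hftp://10.0.3.28:50070/hive/warehouse/staging_precall_cdr/operator=idea/PRECALL_CDR_Assam_OCT_JAN.csv hdfs://10.0.20.118:9000/user/hive/warehouse/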

Related

Hadoop distcp does not skip CRC checks

I have an issue with skipping CRC checks between source and target paths when running distcp.
I copy and decrypt files on demand, so their checksums differ; that is expected.
My command looks like the following:
hadoop distcp -skipcrccheck -update -direct sftp://path s3a://path
When hadoop distcp starts, it prints its configuration, which includes skipCRC=true.
But the job fails with this error:
Mismatch in length of source:sftp://path (95066273) and target:s3a://path/.distcp.tmp.attempt_1675828993400_0012_m_000001_1 (95065888)
hadoop version - Hadoop 3.2.1-amzn-5
Has anyone had any luck with skipping CRC checks?
I updated EMR to 6.9.0 with Hadoop 3.3.3, which was supposed to help based on this Jira, but it didn't, and the job still fails on CRC validation.

Hadoop distcp not working

I am trying to copy data from one HDFS cluster to another. Any suggestion why the first command works but not the second?
(works)
hadoop distcp hdfs://abc.net:8020/foo/bar webhdfs://def.net:14000/bar/foo
(does not work)
hadoop distcp webhdfs://abc.net:50070/foo/bar webhdfs://def:14000/bar/foo
Thanks!
If the two clusters are running incompatible versions of HDFS, you can use the webhdfs protocol to distcp between them:
hadoop distcp webhdfs://namenode1:50070/source/dir webhdfs://namenode2:50070/destination/dir
When using webhdfs, the NameNode URI and NameNode HTTP port should be provided for both the source and the destination.

Getting an error during distcp between two Hadoop clusters of different versions

I am using distcp between Hadoop 0.20 and Hadoop 2.2.0. I am getting an error during data transfer between these clusters using the distcp command below:
hadoop distcp -skipcrccheck -update
I am getting the error below:
HTTP_OK expected, received 400
Any solutions will be appreciated.

distcp not working between Hadoop versions 2.0.0 and 0.20

During distcp between two versions of Hadoop I am getting the error below:
Server IPC version 9 cannot communicate with client version 3
I am using the command below:
hadoop distcp
Any solutions will be appreciated.
distcp does not work between versions from hdfs:// to hdfs://.
You must run distcp on the destination cluster and use the hftp:// protocol (a read-only protocol) for the source cluster.
Note: the default ports are different for different protocols, so the command ends up looking like:
hadoop distcp hftp://<source>:50070/<src path> hdfs://<dest>:8020/<dest path>
or, if you prefer fake values:
hadoop distcp hftp://foo.company.com:50070/data/baz hdfs://bar.company.com:8020/data/

Copying directories in HDFS using the Java API

How do I copy a directory in HDFS to another directory in HDFS?
I found the copyFromLocalFile functions that copy from the local FS to HDFS, but I want both the source and the destination to be in HDFS.
Thanks
Use the distcp command.
The canonical use case for distcp is for transferring data between two HDFS clusters.
If the clusters are running identical versions of Hadoop, the hdfs scheme is appropriate:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
If you want to do it through Java code, see the class org.apache.hadoop.tools.DistCp and call it appropriately; a sketch follows.
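A minimal sketch of that approach, assuming the Hadoop 2.x API (Hadoop 3.x replaced this DistCpOptions constructor with DistCpOptions.Builder) and placeholder cluster and path names:

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class DirCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder paths; for a copy within one cluster both can use the same authority.
        Path src = new Path("hdfs://namenode1/foo");
        Path dst = new Path("hdfs://namenode2/bar");
        // Hadoop 2.x constructor taking the list of source paths and the target path.
        DistCpOptions options = new DistCpOptions(Collections.singletonList(src), dst);
        // Submits the distributed copy job and waits for it to finish.
        new DistCp(conf, options).execute();
    }
}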
You can try FileUtil.copy
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileUtil.html
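For small directory trees, a rough sketch using FileUtil.copy (the paths are placeholders; note that this copies through the client JVM rather than as a distributed job, so it suits small amounts of data better than distcp):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class SimpleDirCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("/data/src");  // placeholder source directory
        Path dst = new Path("/data/dst");  // placeholder destination directory
        // Recursive copy within the same HDFS; 'false' keeps the source in place.
        FileUtil.copy(fs, src, fs, dst, false, conf);
    }
}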
