Hadoop distcp does not skip CRC checks

I have an issue with skipping CRC checks between source and target paths when running distcp.
I copy and decrypt files on demand, so their checksums differ, which is expected.
My command looks like this:
hadoop distcp -skipcrccheck -update -direct sftp://path s3a://path
When hadoop distcp starts, it prints its configuration, which includes skipCRC=true.
But the job fails with this error:
Mismatch in length of source:sftp://path (95066273) and target:s3a://path/.distcp.tmp.attempt_1675828993400_0012_m_000001_1 (95065888)
Hadoop version: Hadoop 3.2.1-amzn-5
Has anyone had any luck skipping CRC checks?
I updated EMR to 6.9.0 with Hadoop 3.3.3, which was supposed to help based on this Jira, but it didn't, and the job still fails on CRC validation.
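Note that the error quoted above reports a mismatch in length rather than a checksum mismatch. The sketch below is only an illustration in Python, assuming (the thread does not confirm this) that the length comparison is a separate validation step that -skipcrccheck does not disable. The function validate_copy and its parameters are hypothetical names, not part of Hadoop.

```python
import zlib


def validate_copy(source: bytes, target: bytes, skip_crc: bool) -> None:
    """Hypothetical post-copy validation: length first, then optional CRC."""
    # The length comparison runs regardless of the skip_crc setting.
    if len(source) != len(target):
        raise ValueError(
            f"Mismatch in length of source ({len(source)}) "
            f"and target ({len(target)})"
        )
    # The checksum comparison is the only step that skip_crc turns off.
    if not skip_crc and zlib.crc32(source) != zlib.crc32(target):
        raise ValueError("Checksum mismatch between source and target")
```

Under that assumption, a copy that changes the byte count (decryption often does) would fail validation even with the CRC check skipped.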


What should be the correct flow of data in Hadoop and Mahout?

I am working with Hadoop, Hive and Mahout.
I am processing some data with a MapReduce job in Hadoop, and the results are used for recommendations in Mahout.
I want to know the correct workflow for this setup: once Hadoop processes the data and stores it in HDFS, how does Mahout read that data, and after Mahout processes it, where does Mahout put the recommendation output?
Note: I am using Hadoop to process the data, and my colleague is working with Mahout on a different machine.
I hope my question is clear.
If you want Mahout to take its input from HDFS, follow these steps.
First, copy the input file to HDFS with the command:
hadoop dfs -copyFromLocal input /
Then run the Mahout recommendation command, which takes its input from HDFS and saves its output in HDFS.
Assuming your JAVA_HOME is set appropriately and Mahout is installed properly, we're ready to build the command. Enter the following:
$ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i hdfs://localhost:9000/inputfile -o hdfs://localhost:9000/output --numRecommendations 25
Running the command executes a series of jobs, the final product of which is an output file written to the directory specified in the command. The output file contains two columns: the userID and an array of itemIDs and scores.
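To read that output back in another program, a minimal Python sketch follows. It assumes each output line looks like userID, a tab, then a bracketed list of itemID:score pairs separated by commas. That layout is an assumption to check against your own output files, and parse_recommendation_line is a hypothetical helper name.

```python
def parse_recommendation_line(line):
    """Parse one line of the assumed form 'userID<TAB>[itemID:score,...]'."""
    user_id, _, payload = line.rstrip("\n").partition("\t")
    payload = payload.strip()
    # Drop the surrounding brackets of the array column, if present.
    if payload.startswith("[") and payload.endswith("]"):
        payload = payload[1:-1]
    recommendations = []
    for pair in payload.split(","):
        if not pair:
            continue  # an empty array yields a single empty string
        item_id, _, score = pair.partition(":")
        recommendations.append((int(item_id), float(score)))
    return int(user_id), recommendations
```

For example, parse_recommendation_line("3\t[101:4.5,102:3.0]") gives the user ID 3 and two (itemID, score) pairs.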
It all depends on how Mahout is configured to run. Mahout can run in local mode or distributed mode, and this is controlled by the "MAHOUT_LOCAL" variable.
MAHOUT_LOCAL set to anything other than an empty string to force
mahout to run locally even if
For example, if we don't set MAHOUT_LOCAL and try to execute any Mahout algorithm, the console shows the following:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop,
When running in distributed mode, Mahout treats all paths as HDFS paths. So after Mahout processes your data, the final output is also stored in HDFS.
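The rule described above (any non-empty MAHOUT_LOCAL value forces local mode) can be sketched in Python as below. This only mirrors the description in this answer; runs_locally is a hypothetical helper, not part of Mahout.

```python
import os


def runs_locally(env=None):
    """Return True when MAHOUT_LOCAL is set to a non-empty string."""
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    # Unset and empty both mean "not local", i.e. run on Hadoop.
    return env.get("MAHOUT_LOCAL", "") != ""
```

Calling runs_locally() with no argument checks the current shell environment, which is a quick way to predict where input and output paths will be resolved.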

Getting Check-sum mismatch during data transfer between two different versions of Hadoop

I am new to Hadoop. I am transferring data between Hadoop 0.20 and Hadoop 2.2.0 using the distcp command.
During the transfer I get the error below:
Check-sum mismatch between
I have also tried -skipcrccheck and -Ddfs.checksum.type=CRC32, but neither solved the problem.
Any solutions would be appreciated.
It looks like a known issue in Jira when copying data between Hadoop 0.20 and 2.2.0: https://issues.apache.org/jira/browse/HDFS-3054.
A workaround is to preserve the block size and checksum type during the distcp copy using -pbc.
hadoop distcp -pbc <SRC> <DEST>
Or skip the CRC check using the -skipcrccheck option:
hadoop distcp -skipcrccheck -update <SRC> <DEST>
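If you launch distcp from a script, the two variants above can be assembled as in the following Python sketch. It uses only the flags already shown in this thread (-pbc, -skipcrccheck, -update). The function build_distcp_command is a hypothetical helper, and it only builds the argument list without running anything.

```python
def build_distcp_command(src, dest, preserve=None, skip_crc=False):
    """Build the argv list for one of the distcp invocations shown above."""
    cmd = ["hadoop", "distcp"]
    if preserve:
        # e.g. preserve="bc" produces -pbc (block size and checksum type).
        cmd.append("-p" + preserve)
    if skip_crc:
        # The thread pairs -skipcrccheck with -update, so both are added.
        cmd += ["-skipcrccheck", "-update"]
    cmd += [src, dest]
    return cmd
```

The resulting list could then be passed to subprocess.run on a machine where the hadoop command is available.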

Running Pig 0.7.0 fails with ERROR 2998: Unhandled internal error

I have to connect Pig to a Hadoop build that was modified slightly from Hadoop 0.20.0. I chose Pig 0.7.0 and set PIG_CLASSPATH.
When I run Pig, an error is reported like this:
ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage
So I copied hadoop-core.jar from $HADOOP_HOME to overwrite hadoop20.jar in $PIG_HOME/lib, then ran "ant". Now I can run Pig, but when I use dump or store, I get another error:
Pig Stack Trace
ERROR 2998: Unhandled internal error. org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(Lorg/apache/hadoop/mapreduce/Job;Lorg/apache/hadoop/fs/Path;)V
java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(Lorg/apache/hadoop/mapreduce/Job;Lorg/apache/hadoop/fs/Path;)V
at org.apache.pig.builtin.BinStorage.setStoreLocation(BinStorage.java:369)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Has anyone encountered this error, or is my way of compiling wrong?
There is a section about this issue in the Pig FAQ which should give you a good idea what's wrong. Here is the outline taken from this page:
This usually happens when you are connecting hadoop cluster other than standard Apache hadoop 20.2 release. Pig bundles standard hadoop 20.2 jars in release. If you want to connect to other version of hadoop cluster, you need to replace bundled hadoop 20.2 jars with compatible jars. You can try:
do "ant"
copy hadoop jars from your hadoop installation to overwrite ivy/lib/Pig/hadoop-core-0.20.2.jar and ivy/lib/Pig/hadoop-test-0.20.2.jar
do "ant" again
cp pig.jar to overwrite pig-*-core.jar
Some other tricks is also possible. You can use "bin/pig -secretDebugCmd" to inspect the command line of Pig. Make sure you are using the right version of hadoop.
As pointed out in this FAQ section, if nothing works I would advise just upgrading to a more recent version of Pig, later than 0.9.1. Pig 0.7 is a bit old.
The Pig (core) jar has a bundled Hadoop dependency, which may differ from the version you want to use. If you have an old Pig version (< 0.9), you have the option of building a jar without Hadoop:
ant jar-withouthadoop
cp $PIG_HOME/build/pig-x.x.x-dev-withouthadoop.jar $PIG_HOME
Then start Pig:
cd $PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/hadoop-core-x.x.x.jar:$HADOOP_HOME/lib/*:$HADOOP_HOME/conf:$PIG_HOME/pig-x.x.x-dev-withouthadoop.jar; ./pig
Newer Pig versions contain the prebuilt withouthadoop version (see this ticket), so you can skip the build step. Furthermore, when you run Pig it will pick up the withouthadoop jar from PIG_HOME rather than the bundled version, so you don't need to add withouthadoop.jar to PIG_CLASSPATH either (provided that you run Pig from $PIG_HOME/bin).
Back to your question:
Hadoop 0.20 and its modified variant (0.20-append?) can work even with the latest Pig distribution (0.11.1).
You just need to do the following:
unpack Pig 0.11.1
cd $PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/hadoop-core-x.x.jar:$HADOOP_HOME/lib/*:$HADOOP_HOME/conf; ./pig
If you still get "Failed to create DataStorage", it's worth starting Pig with -secretDebugCmd as Charles Menguy suggested, so that you can see whether Pig picks up the right Hadoop version.
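The PIG_CLASSPATH value used in the export lines above is just a colon-separated list of three entries. A small Python sketch of how it is put together follows. The function build_pig_classpath and the example paths are hypothetical, and the jar name must match the hadoop-core jar actually present in your installation.

```python
def build_pig_classpath(hadoop_home, core_jar):
    """Join the Hadoop core jar, lib wildcard and conf dir with ':'."""
    entries = [
        f"{hadoop_home}/{core_jar}",  # the cluster's own hadoop-core jar
        f"{hadoop_home}/lib/*",       # Hadoop's dependency jars
        f"{hadoop_home}/conf",        # cluster configuration directory
    ]
    return ":".join(entries)
```

Printing the result and comparing it with what -secretDebugCmd reports is one way to confirm that Pig sees the intended Hadoop jars.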
Did you remember to run start-all.sh from /usr/local/bin? I ran into the same problem and I basically retraced my steps in configuring Hadoop itself. I am able to use Pig now.

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a hadoop cluster with CDH4.2.0 installed and ran into this error. It's been fixed in later versions of hadoop but I don't have access to update the cluster. Is there a way to tell hadoop to use this jar when running my job through the command line arguments like
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket. Or must hadoop be completely recompiled with this new jar? I'm new to hadoop so I don't know all the command line options that are possible.
I'm not sure how that would work as when you're executing the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The ticket says this issue only occurs when you're using MR2, so unless you really need YARN you're probably better off using the MR1 library to run your map/reduce.

How to use Hadoop Streaming with LZO-compressed Sequence Files?

I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop streaming.
For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."
What do I need to do in order to process these input files with Hadoop Streaming?
I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but this doesn't seem to work -- my jobs keep failing for some unspecified reason. Are there other arguments I'm missing?
I've tried using a very simple identity script as both my mapper and reducer:
#!/usr/bin/env ruby
STDIN.each do |line|
  puts line
end
but this doesn't work.
LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.
I just tried this and it works:
hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
-inputformat SequenceFileAsTextInputFormat \
-output test_output \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper
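If you want a script mapper instead of IdentityMapper, here is a minimal Python sketch. It assumes that Hadoop Streaming hands each record to the mapper as one line of text, with the sequence file key (the row number) and the value separated by the first tab. The names split_record and map_lines are hypothetical.

```python
def split_record(line):
    """Split one streaming input line into (key, value) at the first tab."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def map_lines(lines):
    """Yield only the value of each record, dropping the row-number key."""
    for line in lines:
        _key, value = split_record(line)
        yield value


# As a streaming mapper script, one would wire it to standard input:
#   import sys
#   for out in map_lines(sys.stdin):
#       print(out)
```

Only the first tab is treated as the separator, so any tabs inside the value itself are kept intact.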
LZO compression has been removed from Hadoop 0.20.x onwards due to licensing issues. If you want to process LZO-compressed sequence files, the LZO native libraries have to be installed and configured in the Hadoop cluster.
Kevin's hadoop-lzo project is the current working solution I am aware of. I have tried it and it works.
Install the lzo-devel packages at the OS level (if not already done). These packages enable LZO compression at the OS level, without which Hadoop LZO compression won't work.
Follow the instructions in the hadoop-lzo readme and compile it. After the build, you get the hadoop-lzo-lib jar and the Hadoop LZO native libraries. Make sure you compile it on the machine (or a machine of the same architecture) where your cluster is configured.
The standard Hadoop native libraries are also required; they are provided in the distribution by default for Linux. If you are using Solaris, you would also need to build Hadoop from source in order to get the standard Hadoop native libraries.
Restart the cluster once all changes are made.
You may want to look at this https://github.com/kevinweil/hadoop-lzo
I had weird results using LZO, and my problem was resolved by switching to another codec:
-D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
Then things just work. You don't need to (and maybe shouldn't) change the -inputformat.
Version: 0.20.2-cdh3u4, 214dd731e3bdb687cb55988d3f47dd9e248c5690
