Verify compression in Hadoop is successful

Hi, I have used the code below to compress a file present in HDFS:
hadoop jar hadoop-streaming-2.6.0-cdh5.7.1.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input ${filename} \
-output location \
-mapper /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
And then decompressed it again using:
hadoop jar hadoop-streaming-2.6.0-cdh5.7.1.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.input.compress=true \
-Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input ${filename} \
-output location \
-mapper /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
But when I check the file size, it varies by a few bytes. For example, the initial file size was 43704541167 bytes, and after I compressed and decompressed it the size was 43704541183 bytes.
I would like to know if there is any way to confirm that the compression round trip completed without any data loss.
Thanks in advance.
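One way to confirm a lossless round trip, independent of the reported file sizes, is to compare a checksum of the actual content before and after. A sketch (the paths are placeholders for the original file and the decompressed output directory):
# Hash the concatenated content of both sides; identical digests
# mean no data was lost or altered. If the round trip may reorder
# lines across part files, pipe each stream through sort first.
hadoop fs -cat /path/to/original | md5sum
hadoop fs -cat /path/to/decompressed/part-* | md5sum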

Related

How to pass multiple input directories to a hadoop command using a loop

To run a script using Hadoop streaming, I use a bash script that looks like this:
IP1="/data/hdp/f1/part-*"
IP2="/data/hdp/f2/part-*"
OP="/data/hdp/op"
hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper "$PY $MAPPER_FILE" \
-input $IP1 -input $IP2 \
-output $OP
How do I generalize this to a case where I have 20 input directories? One approach is to specify them as
-input $IP1 -input $IP2 -input $IP3 ... -input $IP20
But I would like to know whether shell variables and loops/arrays can get it done, like this:
IP_LIST=${!IP*}
IP_CMD=''
for ip in $IP_LIST
do
IP_CMD=$IP_CMD"-input $"$ip" "
done
IP_ARRAY=($IP_CMD)
hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper "$PY $MAPPER_FILE" \
"${IP_ARRAY[#]}"
-output $OP
When I try this, I get an Input path does not exist: hdfs://... error.
FULL COMMAND THAT I AM USING AS IS...
IP1="/data/hdp/f1/part-*"
IP2="/data/hdp/f2/part-*"
OP="/data/hdp/op"
MAPPER_FILE="map_code.py"
REDUCER="reduce_code.py"
IP_LIST=${!IP*}
IP_CMD=''
for ip in $IP_LIST
do
IP_CMD=$IP_CMD"-input $"$ip" "
done
hadoop fs -rm -r -skipTrash $OP
cmd="hadoop jar $HADOOP_JAR_PATH \
-D mapred.reduce.tasks=00 \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=\
org.apache.hadoop.io.compress.GzipCodec \
-file $MAPPER_FILE\
-file $REDUCER \
-mapper $PY $MAPPER_FILE\
-reducer $PY $REDUCER\
-output $OP -cacheFile $DC#ref\
$IP_CMD"
eval $cmd
You could build the whole command as a string and, once it is complete, run it with the eval command.
In your example, append the rest of the command to IP_CMD and then run eval $IP_CMD.
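Alternatively, a sketch that avoids eval altogether by collecting the -input arguments in a bash array (assuming the same IP1/IP2 variables and mapper setup as above):
# Build "-input <path>" pairs in an array; the quoted array expansion
# passes each path as a separate, already-expanded argument.
INPUT_ARGS=()
for ip in "${!IP@}"          # iterates over the variable names IP1, IP2, ...
do
INPUT_ARGS+=(-input "${!ip}")  # indirect expansion yields each path
done
hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper "$PY $MAPPER_FILE" \
"${INPUT_ARGS[@]}" \
-output $OP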

MapReduce: Writing a sequence file using Python [Streaming]

I am trying to write a sequence file in MapReduce. I did it successfully with Java, but I am not sure how to do it with Python.
Thank you!
Hadoop Streaming accepts the command option -outputformat.
To generate output files as sequence files, use -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat.
For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-mapper MapperClass \
-reducer ReducerClass
By default, -inputformat and -outputformat are set to TextInputFormat and TextOutputFormat respectively.
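If a later streaming job needs to read those sequence files back as text, SequenceFileAsTextInputFormat can be used on the input side. A sketch reusing the directories above (myNextOutputDir is a placeholder):
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myOutputDir \
-output myNextOutputDir \
-inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
-mapper /bin/cat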

Hadoop streaming flat-files to gzip

I've been trying to gzip files (pipe-separated CSV) in Hadoop using the hadoop-streaming.jar. I've found the following thread on Stack Overflow:
Hadoop: compress file in HDFS?
and I tried both solutions (cat/cut for the mapper). Although I end up with a gzipped file in HDFS, it now has a tab character at the end of each line. Any ideas how to get rid of these tabs? The trailing tab is messing up my last column.
I've tried the following two commands (in lots of flavours):
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-Dmapred.reduce.tasks=0 \
-input <filename> \
-output <output-path> \
-mapper "cut -f 2"
and
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input <filename> \
-output <output-path> \
-mapper /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
I know that MapReduce outputs a tab-separated key-value pair, but "cut -f 2" (I also tried "cut -f 2 -d,") should only return the value part, not the tab. So why does every line end with a tab?
I hope someone can enlighten me.
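The usual explanation: streaming treats everything up to the first tab of a mapper's output line as the key and the rest as the value, so a line with no tab becomes a key with an empty (but non-null) value, and TextOutputFormat then writes key, separator, value, leaving a trailing tab. A commonly suggested fix, assuming a Hadoop 2.x cluster like the HDP install above, is to set the output separator to the empty string:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-Dmapreduce.output.textoutputformat.separator= \
-input <filename> \
-output <output-path> \
-mapper /bin/cat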

Hadoop global variable with streaming

I understand that I can pass some global value to my mappers via the Job and the Configuration.
But how can I do that using Hadoop Streaming (Python in my case)?
What is the right way?
Based on the docs, you can specify the command-line option -cmdenv name=value to set environment variables on each distributed machine, which you can then use in your mappers/reducers:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input input.txt \
-output output.txt \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py \
-cmdenv MY_PARAM=thing_I_need
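Inside the mapper or reducer, the value then arrives as an ordinary environment variable (os.environ['MY_PARAM'] in Python). A minimal bash-mapper sketch for illustration:
#!/bin/bash
# Hypothetical mapper: prefixes each input line with the value that
# was passed via -cmdenv MY_PARAM=thing_I_need
while read -r line; do
printf '%s\t%s\n' "$MY_PARAM" "$line"
done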

Hadoop Streaming Problems

I ran into these issues while using Hadoop Streaming. I'm writing code in Python.
1) Aggregate library package
According to the Hadoop streaming docs ( http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29 ), there is a built-in Aggregate class that can work both as a mapper and a reducer.
Here is the command:
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -combiner aggregate -reducer NONE -input input_files -output output_path
Executing this command makes the mapper fail with this error:
java.io.IOException: Cannot run program "aggregate": java.io.IOException: error=2, No such file or directory
However, if you run this command using aggregate as the reducer and not the combiner, the job works fine.
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -reducer aggregate -input input_files -output output_path
Does this mean I cannot use the aggregate class as the combiner?
2) Cannot use | as a separator for the generic options
This is an example command from the above link
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12
You cannot use | as an argument for map.output.key.field.separator. The error is
-D: command not found
11/08/03 10:48:02 ERROR streaming.StreamJob: Missing required options: input, output
(Update) You have to escape the | with a backslash, like this:
-D stream.map.output.field.separator=\|
3) Cannot specify the -D options at the end of the command as in the example. The error is:
-D: command not found
11/08/03 10:50:23 ERROR streaming.StreamJob: Unrecognized option: -D
Is the documentation flawed, or am I doing something wrong?
Any insight into what I'm doing wrong is appreciated. Thanks.
This question was asked 3 years ago, but today I still ran into the problem with the -D option, so I will add a little information for other people who have the same problem.
According to the Hadoop streaming manual:
bin/hadoop command [genericOptions] [commandOptions]
-D is a generic option, so you have to put it before any other options.
So in this case, the command should look like:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
