How to process 2 files with different inputformats in Hadoop Streaming? - hadoop

I have 2 input files in different formats: one needs SequenceFileInputFormat, the other TextInputFormat. I know that for Hadoop Streaming there is a possibility to specify 2 input files like:
hadoop jar hadoop-streaming-2.8.0.jar \
-input '/user/foo/dir1' -input '/user/foo/dir2' \
(rest of the command)
But how to specify also different -inputformat for those files?
I found that it's possible to do for Java with MultipleInputs like:
MultipleInputs.addInputPath(job, new Path(args[0]), <Input_Format_Class_1>);
MultipleInputs.addInputPath(job, new Path(args[1]), <Input_Format_Class_2>);
Can I do something like this with Hadoop Streaming?

Hadoop Streaming Options lists the various options for Hadoop Streaming; the one that might be of use in your case is
-inputformat JavaClassName
the default being TextInputFormat.
I have only tested this with TextInputFormat, but I reckon it should look like:
hadoop jar hadoop-streaming-2.8.0.jar \
-input '/user/foo/dir1' -inputformat TextInputFormat \
-input '/user/foo/dir2' -inputformat SequenceFileInputFormat \
(rest of the command)
Here is what was tested and worked:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0*.jar \
-file mapperB.py -mapper mapperB.py -file reducerB.py -reducer reducerB.py \
-input /tempfiles/big.txt -inputformat TextInputFormat \
-input /tempfiles/t.txt -inputformat TextInputFormat \
-output /tempfiles/output-X
Note: -file is deprecated; use the generic -files option instead.
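If the two inputs also need different handling inside the mapper, the script can check which file its current split came from: streaming exports the split path in the mapreduce_map_input_file environment variable (map_input_file on older releases). A minimal sketch, reusing the dir1/dir2 paths from the question; the tag names and sample lines are made up for illustration:

```python
import os

def tag_record(line, input_file):
    """Prefix each record with a tag derived from its source directory."""
    tag = "seq" if "/user/foo/dir1" in input_file else "text"
    return "%s\t%s" % (tag, line.rstrip("\n"))

# In a real streaming job the loop would read sys.stdin; here we simulate
# one split from dir1. The fallback path is only for standalone runs.
source = os.environ.get("mapreduce_map_input_file",
                        "hdfs:///user/foo/dir1/part-00000")
for line in ["alpha\n", "beta\n"]:
    print(tag_record(line, source))
```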

Related

How can I run Hadoop Streaming on Hadoop Cluster?

Currently I have a Hadoop cluster with 3 nodes (Ubuntu).
I want to run Python/R scripts with Hadoop Streaming, but I am not sure whether just launching a streaming job actually makes all nodes do work.
If it is possible, please point me in the right direction to run Streaming on the cluster.
Thanks
Hadoop streaming is a built-in jar/utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
In the above command, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
-input directory/file-name: Input location for the mapper.
-output directory-name: Output location for the reducer.
-mapper executable, script, or JavaClassName: The mapper executable (required).
-reducer executable, script, or JavaClassName: The reducer executable (required).
-file file-name: Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
Ex 1: A user-defined Python executable as the mapper. The option "-file myPythonScript.py" causes the Python executable to be shipped to the cluster machines as part of the job submission.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-reducer /bin/wc \
-file myPythonScript.py
Ex 2: Send a Java class as an argument to the mapper and/or the reducer
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /bin/wc
Source: Hadoop Streaming jar
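The stdin/stdout contract above can be made concrete with a small word-count pair sketched in Python. In a real job the mapper and reducer would live in separate scripts passed via -mapper and -reducer, and the framework would sort by key between them; here the shuffle is simulated with sorted():

```python
from itertools import groupby

def mapper(lines):
    """Read text lines, emit one 'word\t1' record per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    """Sum the counts per key; streaming delivers keys already sorted."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(v) for _, v in group))

if __name__ == "__main__":
    # Simulate the job locally: map, sort by key (the shuffle), reduce.
    mapped = sorted(mapper(["the cat sat", "the mat"]))
    for out in reducer(mapped):
        print(out)
```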

MapReduce: Writing Sequence file using Python[Streaming]

I am trying to write a sequence file in MapReduce. I did it successfully with Java, but I am not sure how to do it with Python.
Thank you!
Hadoop accepts the Streaming command option -outputformat.
To generate output files as sequence files, use -outputformat SequenceFileOutputFormat.
For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-mapper MapperClass \
-reducer ReducerClass
By default, -inputformat and -outputformat are set as TextInputFormat and TextOutputFormat respectively.

Hadoop streaming flat-files to gzip

I've been trying to gzip files (pipe separated csv) in hadoop using the hadoop-streaming.jar. I've found the following thread on stackoverflow:
Hadoop: compress file in HDFS?
and I tried both solutions (cat/cut for the mapper). Although I end up with a gzipped file in HDFS, it now has a tab character at the end of each line. Any ideas how to get rid of these? The trailing tab is messing up my last column.
I've tried the following two commands (in lots of flavours):
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-Dmapred.reduce.tasks=0 \
-input <filename> \
-output <output-path> \
-mapper "cut -f 2"
and
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input <filename> \
-output <output-path> \
-mapper /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
I know that MapReduce outputs a tab-separated key-value pair, but "cut -f 2" (I also tried "cut -f 2 -d,") should only return the value part, not the tab. So why does every line end with a tab?
I hope someone can enlighten me.
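One plausible explanation (an assumption, not verified against this cluster): when a mapper line contains no tab, streaming treats the whole line as the key with an empty value, and the default TextOutputFormat still writes key, then separator, then value, which leaves a trailing tab. A small Python simulation of that logic:

```python
def text_output_line(mapper_line, separator="\t"):
    """Mimic how streaming splits a mapper output line into key/value
    and how TextOutputFormat then serializes it."""
    if "\t" in mapper_line:
        key, value = mapper_line.split("\t", 1)
    else:
        key, value = mapper_line, ""  # no tab: whole line is the key
    # TextOutputFormat writes key + separator + value, so an empty
    # value still leaves a trailing separator on the line.
    return key + separator + value

print(repr(text_output_line("a|b|c")))  # pipe-separated line, no tab
```

If that is indeed the cause, a commonly suggested workaround is overriding the output separator property (mapred.textoutputformat.separator, or mapreduce.output.textoutputformat.separator on newer releases) to an empty string; whether that applies depends on your Hadoop version.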

Hadoop global variable with streaming

I understand that I can pass global values to my mappers via the Job and the Configuration.
But how can I do that with Hadoop Streaming (Python in my case)?
What is the right way?
Based on the docs you can specify a command line option (-cmdenv name=value) to set environment variables on each distributed machine that you can then use in your mappers/reducers:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input input.txt \
-output output.txt \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py \
-cmdenv MY_PARAM=thing_I_need
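On the Python side, the variable then shows up in os.environ. A minimal sketch of a matching mapper core; the fallback default is only there so the snippet runs standalone, in the real job the value comes from -cmdenv:

```python
import os

def make_mapper(param):
    """Return a map function that tags every record with the parameter."""
    def map_line(line):
        return "%s\t%s" % (param, line.rstrip("\n"))
    return map_line

# Set on every task by: -cmdenv MY_PARAM=thing_I_need
my_param = os.environ.get("MY_PARAM", "thing_I_need")
map_line = make_mapper(my_param)
print(map_line("some record\n"))
```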

Hadoop Streaming Problems

I ran into these issues while using Hadoop Streaming. I'm writing code in Python.
1) Aggregate library package
According to the Hadoop Streaming docs ( http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29 ), there is a built-in aggregate class which can work both as a mapper and a reducer.
Here is the command:
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -combiner aggregate -reducer NONE -input input_files -output output_path
Executing this command fails the mapper with this error:
java.io.IOException: Cannot run program "aggregate": java.io.IOException: error=2, No such file or directory
However, if you run this command using aggregate as the reducer and not the combiner, the job works fine.
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -reducer aggregate -input input_files -output output_path
Does this mean I cannot use the aggregate class as the combiner?
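For context, the aggregate package expects each mapper output line in the form function:key\tvalue, where the function prefix (e.g. LongValueSum) tells the reducer how to combine the values. A sketch of the core of such a mapper, assuming a simple word count:

```python
def aggregate_records(lines):
    """Emit records the aggregate reducer understands:
    'LongValueSum:<word>\t<count>' sums the counts per word."""
    for line in lines:
        for word in line.split():
            yield "LongValueSum:%s\t1" % word

# In the real job this loop would read sys.stdin.
for record in aggregate_records(["hadoop streaming hadoop"]):
    print(record)
```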
2) Cannot use | as a separator for the generic options
This is an example command from the above link
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12
You cannot use | as an argument for map.output.key.field.separator. The error is
-D: command not found
11/08/03 10:48:02 ERROR streaming.StreamJob: Missing required options: input, output
(Update) You have to escape the | with a \ like this:
-D stream.map.output.field.separator=\|
3) Cannot specify the -D options at the end of the command just like in the example. The Error is
-D: command not found
11/08/03 10:50:23 ERROR streaming.StreamJob: Unrecognized option: -D
Is the documentation flawed or I'm doing something wrong?
Any insight on what I'm doing wrong is appreciated. Thnx
This question was asked 3 years ago, but I still ran into the -D problem today, so I will add a little information for other people who hit the same issue.
According to the manual of hadoop streaming:
bin/hadoop command [genericOptions] [commandOptions]
-D is a generic option, so you have to put it before any other options.
So in this case, the command should look like:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
