Split class org.apache.hadoop.hive.ql.io.orc.OrcSplit not found - hadoop

I am trying to use ORC as the input format for Hadoop Streaming.
Here is how I run it:
export HADOOP_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /home/mr/mapper.py -mapper /home/mr/mapper.py \
-file /home/mr/reducer.py -reducer /home/mr/reducer.py \
-input /user/cloudera/input/users/orc \
-output /user/cloudera/output/simple \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
But I am getting this error:
Error: java.io.IOException: Split class org.apache.hadoop.hive.ql.io.orc.OrcSplit not found
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:363)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.ql.io.orc.OrcSplit not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:361)
... 7 more
It looks like the OrcSplit class should be in hive-exec.jar, which I put on HADOOP_CLASSPATH.

An easier solution is to have hadoop-streaming distribute the lib jars for you by using the -libjars argument. This argument takes a comma-separated list of jars. To take your example, you could do:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-libjars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar \
-file /home/mr/mapper.py -mapper /home/mr/mapper.py \
-file /home/mr/reducer.py -reducer /home/mr/reducer.py \
-input /user/cloudera/input/users/orc \
-output /user/cloudera/output/simple \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

I found the answer. My problem was that I had set the HADOOP_CLASSPATH variable on only one node, so the map tasks running on the other nodes could not load the class. I should either set it on every node or use the distributed cache.


How can I run Hadoop Streaming on Hadoop Cluster?

Currently I have a Hadoop cluster with 3 nodes (Ubuntu).
I want to run Python / R scripts with Hadoop Streaming, but I am not sure whether just executing Hadoop Streaming actually makes all the nodes do work or not.
If it is possible, please point me in the right direction for running Streaming on the cluster.
Hadoop streaming is a built-in jar/utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
In the above command, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
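To make the stdin/stdout contract concrete for the Python case asked about, here is a minimal sketch of a mapper script. The word-count logic and the function name map_lines are only illustrative assumptions, not something from the Hadoop docs; the only thing Streaming relies on is that the script reads lines from stdin and writes lines to stdout.

```python
#!/usr/bin/env python3
"""Illustrative streaming mapper: emits 'word<TAB>1' for every token."""
import sys


def map_lines(lines):
    """Yield one tab-separated record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


if __name__ == "__main__":
    # Hadoop Streaming feeds the input split on stdin, one record per line,
    # and collects whatever the script prints on stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

You can try it locally before submitting the job, for example with cat input.txt | python3 mapper.py, since the script does not depend on Hadoop itself.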
-input: directory/file-name Input location for mapper.
-output: directory-name Output location for reducer.
-mapper: executable or script or JavaClassName Required Mapper executable.
-reducer: executable or script or JavaClassName Required Reducer executable.
-file file-name: Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
Ex 1: A user-defined Python executable as the mapper. The option "-file myPythonScript.py" causes the Python executable to be shipped to the cluster machines as part of job submission.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-reducer /bin/wc \
-file myPythonScript.py
Ex 2: Pass a Java class as the argument to the mapper and/or the reducer
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /bin/wc
Source: Hadoop Streaming jar

How to process 2 files with different inputformats in Hadoop Streaming?

I have 2 files in different formats. One needs SequenceFileInputFormat, the other one TextInputFormat. I know that for Hadoop Streaming it is possible to specify 2 input files like:
hadoop jar hadoop-streaming-2.8.0.jar \
-input '/user/foo/dir1' -input '/user/foo/dir2' \
(rest of the command)
But how do I also specify a different -inputformat for each of those files?
I found that it's possible to do this in Java with MultipleInputs like:
MultipleInputs.addInputPath(job, new Path(args[0]), <Input_Format_Class_1>);
MultipleInputs.addInputPath(job, new Path(args[1]), <Input_Format_Class_2>);
Can I do something like this with Hadoop Streaming?
The Hadoop Streaming options include several that affect input handling; the one that might be of use in your case is
-inputformat JavaClassName
The default is TextInputFormat.
I have tested this using only TextInputFormat, but I reckon it should be like
hadoop jar hadoop-streaming-2.8.0.jar \
-input '/user/foo/dir1' -inputformat TextInputFormat \
-input '/user/foo/dir2' -inputformat SequenceFileInputFormat \
(rest of the command)
Here is what I tested, and it worked:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0*.jar \
-file mapperB.py -mapper mapperB.py -file reducerB.py -reducer reducerB.py \
-input /tempfiles/big.txt -inputformat TextInputFormat \
-input /tempfiles/t.txt -inputformat TextInputFormat \
-output /tempfiles/output-X
Note: the -file option is deprecated in favor of the generic -files option.

MapReduce: Writing Sequence file using Python[Streaming]

I am trying to write a sequence file from MapReduce. I did it successfully with Java, but I am not sure how to do it with Python.
Thank you!
Hadoop accepts the Streaming command option -outputformat.
To generate output files as Sequence files, use -outputformat SequenceFileOutputFormat.
For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-mapper MapperClass \
-reducer ReducerClass
By default, -inputformat and -outputformat are set as TextInputFormat and TextOutputFormat respectively.
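On the Python side nothing special is needed, as far as I understand it: the script still just prints text lines, and Streaming splits each line at the first tab into a key and a value before handing them to whatever -outputformat you chose. Below is a rough sketch of a reducer that could stand in for ReducerClass above. The summing logic and the name reduce_counts are my own assumptions for illustration, and it assumes the input arrives sorted by key, which is what the shuffle normally gives you.

```python
#!/usr/bin/env python3
"""Illustrative streaming reducer: sums integer values per key."""
import sys
from itertools import groupby


def reduce_counts(lines):
    """Yield 'key<TAB>total' for each run of consecutive lines sharing a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


if __name__ == "__main__":
    # The text printed here is what Streaming converts into key/value pairs;
    # the script never writes the SequenceFile bytes itself.
    for record in reduce_counts(sys.stdin):
        print(record)
```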

Hadoop global variable with streaming

I understand that I can pass some global value to my mappers via the Job and the Configuration.
But how can I do that using Hadoop Streaming (Python in my case)?
What is the right way?
Based on the docs you can specify a command line option (-cmdenv name=value) to set environment variables on each distributed machine that you can then use in your mappers/reducers:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input input.txt \
-output output.txt \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py \
-cmdenv MY_PARAM=thing_I_need
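On the script side the value is then just an ordinary environment variable. A minimal sketch of what mapper.py could look like, assuming the MY_PARAM name from the command above; the tagging logic, the tag_lines name and the fallback value are made up for illustration:

```python
#!/usr/bin/env python3
"""Illustrative mapper that reads a value passed with -cmdenv."""
import os
import sys


def tag_lines(lines, param):
    """Prefix every input line with the parameter as the key."""
    for line in lines:
        yield f"{param}\t{line.rstrip(chr(10))}"


if __name__ == "__main__":
    # MY_PARAM is set by '-cmdenv MY_PARAM=...'; the fallback is a
    # hypothetical default so the script also runs outside Hadoop.
    my_param = os.environ.get("MY_PARAM", "default_value")
    for record in tag_lines(sys.stdin, my_param):
        print(record)
```

For a local test you can set the variable yourself, e.g. MY_PARAM=thing_I_need python3 mapper.py < input.txt.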

Hadoop Streaming Problems

I ran into these issues while using Hadoop Streaming. I'm writing my code in Python.
1) Aggregate library package
According to the hadoop streaming docs ( http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29 ), there is an inbuilt Aggregate class which can work both as a mapper and a reducer.
Here is the command:
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -combiner aggregate -reducer NONE -input input_files -output output_path
Executing this command fails the mapper with this error:
java.io.IOException: Cannot run program "aggregate": java.io.IOException: error=2, No such file or directory
However, if you run this command using aggregate as the reducer and not the combiner, the job works fine.
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -reduce aggregate -reducer NONE -input input_files -output output_path
Does this mean I cannot use the aggregate class as the combiner?
2) Cannot use | as a separator for the generic options
This is an example command from the above link
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12
You cannot use | as an argument for map.output.key.field.separator. The error is
-D: command not found
11/08/03 10:48:02 ERROR streaming.StreamJob: Missing required options: input, output
(Update) You have to escape the | with a \ so that the shell does not treat it as a pipe, like this
-D stream.map.output.field.separator=\|
3) Cannot specify the -D options at the end of the command just like in the example. The Error is
-D: command not found
11/08/03 10:50:23 ERROR streaming.StreamJob: Unrecognized option: -D
Is the documentation flawed, or am I doing something wrong?
Any insight on what I'm doing wrong is appreciated. Thanks.
This question was asked 3 years ago, but I still ran into the problem with the -D option today, so I will add a little information for other people who hit the same problem.
According to the manual of hadoop streaming:
bin/hadoop command [genericOptions] [commandOptions]
-D is a generic option, so you have to put it before any of the streaming command options.
So in this case, the command should look like:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
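If you launch the job from a script, one way to avoid both the ordering mistake and the | quoting problem is to assemble the argument list in code. This is only a sketch: build_streaming_command is a helper I made up, not part of Hadoop, and the jar path and option values are placeholders.

```python
"""Illustrative helper that puts generic -D options before command options."""
import shlex


def build_streaming_command(streaming_jar, generic_options, command_options):
    """Return an argv list with every -D option ahead of the streaming options."""
    cmd = ["hadoop", "jar", streaming_jar]
    for name, value in generic_options.items():
        cmd += ["-D", f"{name}={value}"]
    for flag, value in command_options:
        cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    cmd = build_streaming_command(
        "hadoop-streaming.jar",  # placeholder path
        {"stream.map.output.field.separator": "|", "mapred.reduce.tasks": "12"},
        [("-input", "myInputDirs"), ("-output", "myOutputDir")],
    )
    # shlex.quote wraps the argument containing | in single quotes, so the
    # printed line can be pasted into a shell without the | becoming a pipe.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list straight to subprocess.run(cmd) would skip the shell entirely, in which case no escaping of | is needed at all.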
