Split class org.apache.hadoop.hive.ql.io.orc.OrcSplit not found - hadoop

I am trying to use ORC as the input format for Hadoop Streaming. Here is how I run it:
export HADOOP_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /home/mr/mapper.py -mapper /home/mr/mapper.py \
-file /home/mr/reducer.py -reducer /home/mr/reducer.py \
-input /user/cloudera/input/users/orc \
-output /user/cloudera/output/simple \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
But I am getting this error:
Error: java.io.IOException: Split class
org.apache.hadoop.hive.ql.io.orc.OrcSplit not found
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:363)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: java.lang.ClassNotFoundException: Class
org.apache.hadoop.hive.ql.io.orc.OrcSplit not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:361)
... 7 more
It looks like the OrcSplit class should be in hive-exec.jar.

An easier solution is to have hadoop-streaming distribute the lib jars for you by using the -libjars argument, which takes a comma-separated list of jars. To take your example, you could do:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-libjars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar \
-file /home/mr/mapper.py -mapper /home/mr/mapper.py \
-file /home/mr/reducer.py -reducer /home/mr/reducer.py \
-input /user/cloudera/input/users/orc \
-output /user/cloudera/output/simple \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

I found the answer. My problem was that I had set the HADOOP_CLASSPATH variable on only one node, so I should either set it on every node or use the distributed cache.

Related

How can I run Hadoop Streaming on a Hadoop cluster?

Currently I have a Hadoop cluster with 3 nodes (Ubuntu).
I want to run Python / R scripts with Hadoop Streaming, but I am not sure whether just executing Hadoop Streaming actually makes all the nodes work or not.
If it is possible, please give me directions for running Streaming on the cluster.
Thanks
Hadoop streaming is a built-in jar/utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
In the above command, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
-input directory/file-name: input location for the mapper.
-output directory-name: output location for the reducer.
-mapper executable or script or JavaClassName: the mapper executable (required).
-reducer executable or script or JavaClassName: the reducer executable (required).
-file file-name: makes the mapper, reducer, or combiner executable available locally on the compute nodes.
Ex 1: A user-defined Python executable as the mapper. The option "-file myPythonScript.py" causes the Python script to be shipped to the cluster machines as part of the job submission.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-reducer /bin/wc \
-file myPythonScript.py
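For illustration, myPythonScript.py could be as simple as the following sketch (hypothetical; any executable that reads stdin line by line and writes to stdout will do):
#!/usr/bin/env python
# Minimal streaming mapper sketch: emit each whitespace-separated
# word with a count of 1, one tab-separated pair per line.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
Remember to make the script executable (chmod +x myPythonScript.py), since streaming runs it directly on the task nodes.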
Ex 2: Pass a Java class as the mapper and/or the reducer
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /bin/wc
Source: Hadoop Streaming jar

How to process 2 files with different inputformats in Hadoop Streaming?

I have 2 files in different formats. One is SequenceFileInputFormat, the other one is TextInputFormat. I know that for Hadoop Streaming it is possible to specify 2 input files like:
hadoop jar hadoop-streaming-2.8.0.jar \
-input '/user/foo/dir1' -input '/user/foo/dir2' \
(rest of the command)
But how to specify also different -inputformat for those files?
I found that it's possible to do for Java with MultipleInputs like:
MultipleInputs.addInputPath(job, new Path(args[0]), <Input_Format_Class_1>);
MultipleInputs.addInputPath(job, new Path(args[1]), <Input_Format_Class_2>);
Can I do something like this with Hadoop Streaming?
Hadoop Streaming Options lists the various options for Hadoop Streaming; the one that might be of use in your case is
-inputformat JavaClassName
the default being TextInputFormat.
I have tested this using only TextInputFormat, but I reckon it should be like:
hadoop jar hadoop-streaming-2.8.0.jar \
-input '/user/foo/dir1' -inputformat TextInputFormat \
-input '/user/foo/dir2' -inputformat SequenceFileInputFormat \
(rest of the command)
Here is what I tested, and it worked:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0*.jar \
-file mapperB.py -mapper mapperB.py -file reducerB.py -reducer reducerB.py \
-input /tempfiles/big.txt -inputformat TextInputFormat \
-input /tempfiles/t.txt -inputformat TextInputFormat \
-output /tempfiles/output-X
Note: -file is deprecated; use the generic -files option instead.

MapReduce: Writing Sequence file using Python[Streaming]

I am trying to write a sequence file from MapReduce. I did it successfully with Java, but I am not sure how to do it with Python.
Thank you!
Hadoop accepts the streaming command option -outputformat.
To generate the output files as sequence files, use -outputformat SequenceFileOutputFormat.
For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-mapper MapperClass \
-reducer ReducerClass
By default, -inputformat and -outputformat are set as TextInputFormat and TextOutputFormat respectively.
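One detail to keep in mind when the mapper/reducer is a Python script: streaming splits each stdout line at the first tab into a key and a value, and those pairs are what SequenceFileOutputFormat writes into the sequence file. A minimal sketch of such a reducer (hypothetical):
#!/usr/bin/env python
# Whatever is printed as key<TAB>value becomes a (key, value)
# record in the resulting SequenceFile.
import sys

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    print("%s\t%s" % (key, value))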

Hadoop global variable with streaming

I understand that I can pass some global value to my mappers via the Job and the Configuration.
But how can I do that using Hadoop Streaming (Python in my case)?
What is the right way?
Based on the docs, you can specify the command line option -cmdenv name=value to set environment variables on each distributed machine, which you can then use in your mappers/reducers:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input input.txt \
-output output.txt \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py \
-cmdenv MY_PARAM=thing_I_need
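Inside the script the value then shows up as an ordinary environment variable, so mapper.py could read it like this (a minimal sketch):
#!/usr/bin/env python
# Read the value set via -cmdenv MY_PARAM=thing_I_need.
import os
import sys

my_param = os.environ.get("MY_PARAM", "")
for line in sys.stdin:
    # Tag each record with the parameter, purely as an illustration.
    print("%s\t%s" % (my_param, line.rstrip("\n")))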

Hadoop Streaming Problems

I ran into these issues while using Hadoop Streaming. I'm writing code in Python.
1) Aggregate library package
According to the hadoop streaming docs ( http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29 ), there is an inbuilt Aggregate class which can work both as a mapper and a reducer.
Here is the command:
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -combiner aggregate -reducer NONE -input input_files -output output_path
Executing this command fails the mapper with this error:
java.io.IOException: Cannot run program "aggregate": java.io.IOException: error=2, No such file or directory
However, if you run this command using aggregate as the reducer and not the combiner, the job works fine.
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -reducer aggregate -input input_files -output output_path
Does this mean I cannot use the aggregate class as the combiner?
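For reference, the Aggregate package expects the mapper to emit records of the form function:key<TAB>value; a minimal word-count mapper sketch along the lines of the docs' example:
#!/usr/bin/env python
# Each record names an aggregator; LongValueSum makes the aggregate
# reducer sum the 1s emitted for the same word.
import sys

for line in sys.stdin:
    for word in line.split():
        print("LongValueSum:%s\t1" % word)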
2) Cannot use | as a separator for the generic options
This is an example command from the above link
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12
You cannot use | as an argument for map.output.key.field.separator. The error is:
-D: command not found
11/08/03 10:48:02 ERROR streaming.StreamJob: Missing required options: input, output
(Update) You have to escape the | with a \ like this:
-D stream.map.output.field.separator=\|
3) Cannot specify the -D options at the end of the command as in the example. The error is:
-D: command not found
11/08/03 10:50:23 ERROR streaming.StreamJob: Unrecognized option: -D
Is the documentation flawed, or am I doing something wrong?
Any insight on what I'm doing wrong is appreciated. Thanks
This question was asked 3 years ago, but today I still ran into the problem with the -D option, so I will add a little information for other people who have the same problem.
According to the manual of Hadoop Streaming:
bin/hadoop command [genericOptions] [commandOptions]
-D is a generic option, so you have to put it before any other options.
So in this case, the command should look like:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
