Hadoop Streaming 1.0.3 Unrecognized -D command - hadoop

I am trying to chain some Streaming jobs (jobs written in Python). The chaining works, but I have a problem with the -D options. Here is the code,
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class OJs extends Configured implements Tool
{
    public int run(String[] args) throws Exception
    {
        //DOMINATION
        Path domin = new Path("diploma/join.txt");
        //dominationm.py
        Path domout = new Path("mapkeyout/");
        //dominationr.py
        String[] dom = new String[]
        {
            "-D mapred.reduce.tasks=0",
            "-file",    "/home/hduser/optimizingJoins/dominationm.py",
            "-mapper",  "dominationm.py",
            "-file",    "/home/hduser/optimizingJoins/dominationr.py",
            "-reducer", "dominationr.py",
            "-input",   domin.toString(),
            "-output",  domout.toString()
        };
        JobConf domConf = new StreamJob().createJob(dom);
        //run domination job
        JobClient.runJob(domConf);
        return 0;
    }//end run

    public static void main(String[] args) throws Exception
    {
        int res = ToolRunner.run(new Configuration(), new OJs(), args);
        System.exit(res);
    }//end main
}//end OJs
My problem is with the "-D mapred.reduce.tasks=0" option. I get this error,
ERROR streaming.StreamJob: Unrecognized option: -D...
where the ... stands for every syntax combination I tried, i.e.
"-D mapred.reduce.tasks=0"
"-Dmapred.reduce.tasks=0"
"-D", "mapred.reduce.tasks=0"
"-D", "mapred.reduce.tasks=", "0"
" -D mapred.reduce.tasks=0"
etc.
When there is a space before -D, the option is silently ignored and I don't get the number of reducers I specified. When there is no space, I get the error mentioned above.
What am I doing wrong?
EDIT
Substituting the -D option with -jobconf doesn't solve the problem. Here is the whole error output,
Warning: $HADOOP_HOME is deprecated.
12/10/04 00:25:02 ERROR streaming.StreamJob: Unrecognized option: -jobconf mapred.reduce.tasks=0
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-io <identifier> Optional.
-verbose
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
For more details about these options:
Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info
Exception in thread "main" java.lang.IllegalArgumentException:
at org.apache.hadoop.streaming.StreamJob.fail(StreamJob.java:549)
at org.apache.hadoop.streaming.StreamJob.exitUsage(StreamJob.java:486)
at org.apache.hadoop.streaming.StreamJob.parseArgv(StreamJob.java:246)
at org.apache.hadoop.streaming.StreamJob.createJob(StreamJob.java:143)
at OJs.run(OJs.java:135)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at OJs.main(OJs.java:183)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Moreover, I can't understand why Streaming recognizes the -D option when I run a job directly with Streaming, yet fails to recognize it when I run a job with Streaming through JobClient. Is it a problem of Streaming or a problem of sun.reflect? Where is the sun.reflect package in Ubuntu?

Looks like StreamJob doesn't support the -Dkey=value generic configuration options.
See http://wiki.apache.org/hadoop/HadoopStreaming; it looks like you need to use the following instead, which is explicitly called out as an example on that page:
-jobconf mapred.reduce.tasks=0

To begin with, the line
..."-D mapred.reduce.tasks=0"...
should be written as
..."-D", "mapred.reduce.tasks=0"...
This is the standard pattern for passing options,
"-commandname", "value"
Next, a program may or may not accept arguments. In the Hadoop context these arguments are called options, and there are two kinds of them: generic options and streaming (job-specific) options. The generic options are handled by GenericOptionsParser; job-specific options in the context of Hadoop Streaming are handled by StreamJob.
So, the way the -D option is set in the code of the initial question is wrong, because -D is a generic option and StreamJob can't handle generic options. StreamJob can handle -jobconf, however, which is a job-specific option. So the line
..."-D", "mapred.reduce.tasks=0"...
is correctly written as
..."-jobconf", "mapred.reduce.tasks=0"...
With -jobconf this warning is raised,
WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
To avoid this warning the -D option is needed, and consequently a GenericOptionsParser is needed to parse it.
To move on, when someone runs a streaming job using the command
bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-*.jar [ generic options] [ streaming( job specific) options]
what really happens? Why is there no problem in this case? In this case, both generic and job-specific options are parsed properly. This is possible because of the Tool interface, which takes care of the generic options through GenericOptionsParser. The job-specific options are handled by StreamJob() inside hadoop-streaming-*.jar.
Indeed, hadoop-streaming-*.jar has a file "HadoopStreaming.java" responsible for jobs submitted in the way above. The HadoopStreaming class calls ToolRunner.run() with two arguments. The first argument is a new StreamJob object and the second consists of all the command line options, i.e. [generic options] and [streaming (job-specific) options]. The GenericOptionsParser separates the generic from the job-specific options by parsing only the generic ones. Then the GenericOptionsParser returns the rest of the options, i.e. the job-specific ones, which are parsed by StreamJob(). StreamJob is invoked through Tool.run([job specific args]), where Tool = StreamJob. See this and this for an intuition of why Tool = StreamJob.
In conclusion,
GenericOptionsParser -> generic options,
StreamJob -> streaming (job-specific) options.
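To make the generic -D option work from Java code, one approach is therefore to hand the whole argument array to ToolRunner with StreamJob as the Tool, mirroring what HadoopStreaming.java itself does. A minimal sketch (the class name OJsDriver is made up; the paths are the ones from the question):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.ToolRunner;

// GenericOptionsParser (invoked by ToolRunner) consumes the generic "-D" option,
// and StreamJob then parses only the remaining streaming options.
public class OJsDriver
{
    public static void main(String[] args) throws Exception
    {
        String[] dom = new String[]
        {
            "-D", "mapred.reduce.tasks=0",   // generic option
            "-file",    "/home/hduser/optimizingJoins/dominationm.py",
            "-mapper",  "dominationm.py",
            "-file",    "/home/hduser/optimizingJoins/dominationr.py",
            "-reducer", "dominationr.py",
            "-input",   "diploma/join.txt",
            "-output",  "mapkeyout/"
        };
        int res = ToolRunner.run(new Configuration(), new StreamJob(), dom);
        System.exit(res);
    }
}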

Related

Reading file in hadoop streaming

I am trying to read an auxiliary file in my mapper; here is my code and the command I use.
mapper code:
#!/usr/bin/env python
from itertools import combinations
from operator import itemgetter
import sys

storage = {}
with open('inputData', 'r') as inputFile:
    for line in inputFile:
        first, second = line.split()
        storage[(first, second)] = 0

for line in sys.stdin:
    do_something()
And here is my command:
hadoop jar hadoop-streaming-2.7.1.jar \
-D stream.num.map.output.key.fields=2 \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options='-k1,1 -k2,2' \
-D mapred.map.tasks=20 \
-D mapred.reduce.tasks=10 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper mapper.py -file mapper.py \
-reducer reducer.py -file reducer.py \
-file inputData \
-input /data \
-output /result
But I keep getting the error below, which indicates that my mapper fails to read from stdin. After deleting the file-reading part, my code works, so I have pinpointed where the error occurs, but I don't know the correct way of reading the auxiliary file. Can anyone help?
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
The error you are getting means your mapper failed to write to its stdout stream for too long.
For example, a common cause is that your do_something() function contains a for loop with a continue statement under certain conditions. When that condition occurs too often in your input data, the script hits continue many times in a row without writing anything to stdout. Hadoop waits too long without seeing any output, so the task is considered failed.
Another possibility is that your input data file is too large, and it took too long to read. But I think that is considered setup time because it is before the first line of output. I am not sure though.
There are two relatively easy ways to solve this:
(developer side) Modify your code to output something every now and then. In the case of continue, write a short dummy symbol like '\n' to let Hadoop know your script is alive (see the sketch after this answer).
(system side) I believe you can set the following parameter with the -D option; it controls the timeout in milliseconds:
mapreduce.reduce.shuffle.read.timeout
I have never tried option 2. Usually I'd avoid streaming on data that requires filtering. Streaming, especially when done with a scripting language like Python, should be doing as little work as possible. My use cases are mostly post-processing output data from Apache Pig, where filtering has already been done in the Pig scripts and I need something that is not available in Jython.
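For option 1, the idea is simply to emit something whenever a record is skipped. The question's mapper is Python, but keeping to Java as in the other examples in this thread, a minimal sketch of the same pattern (shouldSkip() is a made-up placeholder for the filtering inside do_something()):
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Keep-alive pattern: whenever a record is filtered out, still emit a dummy
// newline so the framework keeps seeing mapper output.
public class KeepAliveMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            if (shouldSkip(line)) {        // placeholder for the filtering condition
                System.out.println();      // dummy '\n' so Hadoop knows the task is alive
                continue;
            }
            System.out.println(line);      // normal map output
        }
    }

    // Hypothetical filter; the real condition depends on the job.
    private static boolean shouldSkip(String line) {
        return line.trim().isEmpty();
    }
}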

Command line for hadoop streaming

I am trying to use hadoop streaming with a Java class as the mapper. To keep the problem simple, let us assume the Java code is the following:
import java.io.*;

class Test {
    public static void main(String args[]) {
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
            String input;
            while ((input = br.readLine()) != null) {
                System.out.println(input);
            }
        } catch (IOException io) {
            io.printStackTrace();
        }
    }
}
I can compile it with "javac Test.java" and run it from the command line as follows:
[abhattac#eat1-hcl4014 java]$ cat a.dat
abc
[abhattac#eat1-hcl4014 java]$ cat a.dat | java Test
abc
[abhattac#eat1-hcl4014 java]
Let us assume that I have a file in HDFS: a.dat
[abhattac#eat1-hcl4014 java]$ hadoop fs -cat /user/abhattac/a.dat
Abc
[abhattac#eat1-hcl4014 java]$ jar cvf Test.jar Test.class
added manifest
adding: Test.class(in = 769) (out= 485)(deflated 36%)
[abhattac#eat1-hcl4014 java]$
Now I try to use Test as the mapper in hadoop streaming. What do I provide for:
[1] the -mapper command line option? Should it be like the command below?
[2] the -file command line option? Do I need to make a jar file out of Test.class? If that is the case, do I need to include a MANIFEST.MF file to indicate the main class?
I tried all these options but none of them seem to work. Any help will be appreciated.
hadoop jar /export/apps/hadoop/latest/contrib/streaming/hadoop-streaming-1.2.1.45.jar -file Test.jar -mapper 'java Test' -input /user/abhattac/a.dat -output /user/abhattac/output
The command above doesn't work. The error message in the task log is:
stderr logs
Exception in thread "main" java.lang.NoClassDefFoundError: Test
Caused by: java.lang.ClassNotFoundException: Test
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
Since hadoop streaming just shovels work through stdin to a command line executable, you can run "java Test" against your Test.class just as you would locally. There's no need to package it into a jar.
I ran this successfully myself using your code:
hadoop jar hadoop-streaming.jar -file Test.class -mapper 'java Test' -input /input -output /output
SelimN is right that this is a pretty odd way to go about it, though, since you could just as well write a native Java mapper.
Streaming is usually used when you want to use a scripting language such as bash or python instead of using Java.

common.AbstractJob: Unexpected -libjars while processing Job-Specific Options

Hi all!
When I use RecommenderJob in my project, I hit an unexpected error. The arguments passed to the job are a String array with the following values:
[-libjars, /path/to/xxx.jar,/path/to/yyy.jar,
--input, hdfs://localhost:9000/tmp/x,
--output, hdfs://localhost:9000/tmp/y,
--similarityClassname,
org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity,
--numRecommendations, 6,
--tempDir, hdfs://localhost:9000/tmp/z]
After I run the job via the following code:
job.run(args);
it prints an ERROR as follows:
ERROR common.AbstractJob: Unexpected -libjars while processing Job-Specific Options:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in the
classpath.
Unexpected -libjars while processing Job-Specific Options:
Usage:
...
Does anybody know how to solve this? Thanks in advance!
Finally, I found the solution myself. We should not use
job.run(args);
to run the job, because it only deals with the Job-Specific Options. The correct approach is to run the job through ToolRunner, which processes the Generic Options before the Job-Specific Options and hence solves the problem:
ToolRunner.run(conf, job, args);
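Put together, a minimal driver might look like the sketch below. The class name RunRecommender is made up, and the import assumes Mahout's item-based org.apache.mahout.cf.taste.hadoop.item.RecommenderJob:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

// ToolRunner lets GenericOptionsParser strip the generic options (-libjars, -D, ...)
// before RecommenderJob parses its own job-specific options.
public class RunRecommender {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int exitCode = ToolRunner.run(conf, new RecommenderJob(), args);
        System.exit(exitCode);
    }
}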

Is there a combine Input format for hadoop streaming?

I have many small input files, and I want to combine them using an input format like CombineFileInputFormat to launch fewer mapper tasks. I know I can use the Java API to do this, but I don't know whether there is a streaming jar library that supports this while I'm using Hadoop streaming.
Hadoop streaming uses TextInputFormat by default, but any other input format can be used, including CombineFileInputFormat. You can change the input format from the command line using the -inputformat option. Be sure to use the old API and extend org.apache.hadoop.mapred.lib.CombineFileInputFormat; the new API isn't supported yet. Keep the generic -D options before the streaming options:
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar \
-Dmapred.max.split.size=524288000 \
-Dstream.map.input.ignoreKey=true \
-inputformat foo.bar.MyCombineFileInputFormat \
...
Example of CombineFileInputFormat
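For reference, a rough sketch of what such an old-API input format could look like; the class name foo.bar.MyCombineFileInputFormat matches the command above, everything else here is an assumption, and it simply delegates each file in the combined split to a plain LineRecordReader:
package foo.bar;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Sketch of an old-API combine input format for streaming: each file in the
// combined split is read with a LineRecordReader, so the mapper still sees lines.
public class MyCombineFileInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf,
                                                            Reporter reporter) throws IOException {
        return new CombineFileRecordReader<LongWritable, Text>(
                conf, (CombineFileSplit) split, reporter,
                (Class) CombinedLineRecordReader.class);
    }

    // Wrapper reader for one file inside the combined split; the constructor
    // signature (CombineFileSplit, Configuration, Reporter, Integer) is the one
    // CombineFileRecordReader instantiates via reflection.
    public static class CombinedLineRecordReader implements RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate;

        public CombinedLineRecordReader(CombineFileSplit split, Configuration conf,
                                        Reporter reporter, Integer index) throws IOException {
            FileSplit fileSplit = new FileSplit(
                    split.getPath(index), split.getOffset(index),
                    split.getLength(index), split.getLocations());
            delegate = new LineRecordReader(conf, fileSplit);
        }

        public boolean next(LongWritable key, Text value) throws IOException {
            return delegate.next(key, value);
        }
        public LongWritable createKey() { return delegate.createKey(); }
        public Text createValue() { return delegate.createValue(); }
        public long getPos() throws IOException { return delegate.getPos(); }
        public float getProgress() throws IOException { return delegate.getProgress(); }
        public void close() throws IOException { delegate.close(); }
    }
}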

Combiner hack for hadoop streaming

The current version of hadoop-streaming requires a Java class for the combiner, but I read somewhere that we can use a hack like the following:
hadoop jar ./contrib/streaming/hadoop-0.20.2-streaming.jar -input /testinput -output /testoutput -mapper "python /code/triples-mapper.py | sort | python /code/triples-reducer.py" -reducer /code/triples-reducer.py
However, this does not seem to work. What am I doing wrong?
I suspect that your problem lies somewhere in the following source:
http://svn.apache.org/viewvc/hadoop/common/tags/release-0.20.2/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeMapRed.java?view=markup
The splitArgs() method at line 69 tokenizes the command you passed:
python /code/triples-mapper.py | sort | python /code/triples-reducer.py
into a command to run, /code/triples-mapper.py (lines 131/132), and a set of arguments to pass in. All the tokens are then passed to ProcessBuilder (line 164).
Java API for ProcessBuilder
So your pipes are not being interpreted by the OS; they are passed in as arguments to your mapper (you should be able to confirm this by dumping the args passed to your mapper code, as sketched below).
This is all for 0.20.2, so it may have been 'fixed' to suit your purposes in a later version of Hadoop.
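The question's scripts are Python, where the equivalent check is printing sys.argv, but keeping to Java here, a throwaway diagnostic mapper could be as simple as the sketch below (it does not consume stdin, so the task itself may still fail, but the task's stderr log will show how the command line was tokenized):
// Hypothetical throwaway mapper: dump whatever arguments the framework passes
// to the mapper process, so the splitArgs() tokenization can be inspected.
public class ArgDumper {
    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) {
            System.err.println("arg[" + i + "] = " + args[i]);
        }
    }
}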
