Command line for hadoop streaming - hadoop

I am trying to use Hadoop streaming with a Java class as the mapper. To keep the problem simple, let us assume the Java code is like the following:
import java.io.*;

class Test {
    public static void main(String[] args) {
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
            String input;
            while ((input = br.readLine()) != null) {
                System.out.println(input);
            }
        } catch (IOException io) {
            io.printStackTrace();
        }
    }
}
I can compile it with "javac Test.java" and run it from the command line as follows:
[abhattac#eat1-hcl4014 java]$ cat a.dat
abc
[abhattac#eat1-hcl4014 java]$ cat a.dat | java Test
abc
[abhattac#eat1-hcl4014 java]
Let us assume that I have a file in HDFS: a.dat
[abhattac#eat1-hcl4014 java]$ hadoop fs -cat /user/abhattac/a.dat
Abc
[abhattac#eat1-hcl4014 java]$ jar cvf Test.jar Test.class
added manifest
adding: Test.class(in = 769) (out= 485)(deflated 36%)
[abhattac#eat1-hcl4014 java]$
Now I try to use Test as the mapper in Hadoop streaming. What do I provide for:
[1] the -mapper command line option. Should it be like the command below?
[2] the -file command line option. Do I need to make a jar file out of Test.class? If that is the case, do I need to include a MANIFEST.MF file to indicate the main class?
I tried all these options but none of them seem to work. Any help will be appreciated.
hadoop jar /export/apps/hadoop/latest/contrib/streaming/hadoop-streaming-1.2.1.45.jar -file Test.jar -mapper 'java Test' -input /user/abhattac/a.dat -output /user/abhattac/output
The command above doesn't work. The error message in task log is:
stderr logs
Exception in thread "main" java.lang.NoClassDefFoundError: Test
Caused by: java.lang.ClassNotFoundException: Test
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)

Since Hadoop streaming just shovels work through stdin to a command-line executable, you can simply run "java Test" against your Test.class just as you would locally. There's no need to package it into a jar.
I ran this successfully myself using your code:
hadoop jar hadoop-streaming.jar -file Test.class -mapper 'java Test' -input /input -output /output
SelimN is right that this is a pretty odd way to go about it, though, since you could just as well be writing a native Java mapper.
Streaming is usually used when you want to use a scripting language such as bash or python instead of using Java.
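If you do go the native route, a minimal sketch of such a mapper using the standard org.apache.hadoop.mapreduce API might look like the following (the EchoMapper name and the choice to emit each line as the key are illustrative, not from the question):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical identity-style mapper: emits each input line unchanged,
// mirroring what the streaming Test class does over stdin/stdout.
public class EchoMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
    }
}

In a regular (non-streaming) driver you would wire this up with Job.setMapperClass(EchoMapper.class) instead of passing a -mapper option.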

Related

Reading file in hadoop streaming

I am trying to read an auxiliary file in my mapper; here are my code and commands.
mapper code:
#!/usr/bin/env python
from itertools import combinations
from operator import itemgetter
import sys

storage = {}
with open('inputData', 'r') as inputFile:
    for line in inputFile:
        first, second = line.split()
        storage[(first, second)] = 0

for line in sys.stdin:
    do_something()
And here is my command:
hadoop jar hadoop-streaming-2.7.1.jar \
-D stream.num.map.output.key.fields=2 \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options='-k1,1 -k2,2' \
-D mapred.map.tasks=20 \
-D mapred.reduce.tasks=10 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper mapper.py -file mapper.py \
-reducer reducer.py -file reducer.py \
-file inputData \
-input /data \
-output /result
But I keep getting this error, which indicates that my mapper fails to read from stdin. After deleting the file-reading part, my code works, so I have pinpointed where the error occurs, but I don't know the correct way to read the file. Can anyone help?
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
The error you are getting means your mapper failed to write to its stdout stream for too long.
For example, a common reason for this error is that your do_something() function contains a for loop with a continue statement under certain conditions. When those conditions occur too often in your input data, the script hits continue many times in a row without writing anything to stdout. Hadoop waits too long without seeing any output, so the task is considered failed.
Another possibility is that your input data file is too large and takes too long to read. But I think that counts as setup time, because it happens before the first line of output. I am not sure, though.
There are two relatively easy ways to solve this:
(developer side) Modify your code to output something every now and then. In the case of continue, write a short dummy symbol like '\n' to let Hadoop know your script is alive (see the sketch at the end of this answer).
(system side) I believe you can set the following parameter with the -D option, which controls the wait timeout in milliseconds:
mapreduce.reduce.shuffle.read.timeout
I have never tried option 2. Usually I'd avoid streaming for data that requires filtering. Streaming, especially when done with a scripting language like Python, should be doing as little work as possible. My use cases are mostly post-processing output data from Apache Pig, where the filtering is already done in the Pig scripts and I need something that is not available in Jython.
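To make option 1 concrete, here is a rough sketch in Java (matching the style of the streaming mapper from the first question on this page rather than Python); KeepAliveMapper and shouldSkip() are made-up names, and shouldSkip() merely stands in for whatever filtering do_something() performs:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Rough sketch of option 1: a streaming mapper that emits a dummy newline
// every so often while skipping records, so Hadoop keeps seeing output and
// does not time the task out.
public class KeepAliveMapper {
    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String line;
        long skipped = 0;
        while ((line = br.readLine()) != null) {
            if (shouldSkip(line)) {
                skipped++;
                if (skipped % 100000 == 0) {
                    System.out.println();   // dummy output to show the task is alive
                }
                continue;
            }
            System.out.println(line);
        }
    }

    // Placeholder filter standing in for the question's do_something() logic.
    private static boolean shouldSkip(String line) {
        return line.trim().isEmpty();
    }
}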

Elasticsearch standalone JDBC river feeder missing main class

I'm trying to set up the feeder following these instructions: https://github.com/jprante/elasticsearch-jdbc#installation
I downloaded and unzipped the feeder
I don't quite understand this step:
run script with a command that starts org.xbib.tools.JDBCImporter with the lib directory on the classpath
What am I supposed to do?
If I try to run a sample script from bin, I get:
Bad substitution
Error: Could not find or load main class org.xbib.elasticsearch.plugin.jdbc.feeder.Runner
Where do I get the Java classes org.xbib.elasticsearch.plugin.jdbc.feeder.Runner and org.xbib.elasticsearch.plugin.jdbc.feeder.JDBCFeeder?
I figured out the solution: it was to set the installation folder in the script (not the elasticsearch folder but the jdbc folder!):
#!/bin/bash
#JDBC Directory -> important, change accordingly!
export JDBC_IMPORTER_HOME=~/Downloads/elasticsearch-jdbc-1.6.0.0
bin=$JDBC_IMPORTER_HOME/bin
lib=$JDBC_IMPORTER_HOME/lib
echo '{
...
...
}
}' | java \
-cp "${lib}/*" \
-Dlog4j.configurationFile=${bin}/log4j2.xml \
org.xbib.tools.Runner \
org.xbib.tools.JDBCImporter

hadoop map reduce -archives not unpacking archives

I hope you can help me. I've got a head-scratching problem with Hadoop map-reduce. I've been using the "-files" option successfully on a map-reduce job with Hadoop version 1.0.3. However, when I use the "-archives" option, it copies the files but does not uncompress them. What am I missing? The documentation says "Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes", but that's not what I'm seeing.
I have created 3 files - a text file "alice.txt", a zip file "bob.zip" (containing b1.txt and bdir/b2.txt), and a tar file "claire.tar" (containing c1.txt and cdir/c2.txt). I then invoke the hadoop job via
hadoop jar myJar myClass -files ./etc/alice.txt -archives ./etc/bob.zip,./etc/claire.tar <input_path> <output_path>
The files are indeed there and well-formed:
% ls -l etc/alice.txt etc/bob.zip etc/claire.tar
-rw-rw-r-- 1 hadoop hadoop 6 Aug 20 18:44 etc/alice.txt
-rw-rw-r-- 1 hadoop hadoop 282 Aug 20 18:44 etc/bob.zip
-rw-rw-r-- 1 hadoop hadoop 10240 Aug 20 18:44 etc/claire.tar
% tar tf etc/claire.tar
c1.txt
cdir/c2.txt
I then have my mapper test for the existence of the files in question, like so, where 'lineNumber' is the key passed into the mapper:
String key = Long.toString(lineNumber.get());
String[] files = {
    "alice.txt",
    "bob.zip",
    "claire.tar",
    "bdir",
    "cdir",
    "b1.txt",
    "b2.txt",
    "bdir/b2.txt",
    "c1.txt",
    "c2.txt",
    "cdir/c2.txt"
};
String fName = files[(int) (lineNumber.get() % files.length)];
String val = codeFile(fName);
output.collect(new Text(key), new Text(val));
The support routine 'codeFile' is:
private String codeFile(String fName) {
    Vector<String> clauses = new Vector<String>();
    clauses.add(fName);
    File f = new File(fName);
    if (!f.exists()) {
        clauses.add("nonexistent");
    } else {
        if (f.canRead()) clauses.add("readable");
        if (f.canWrite()) clauses.add("writable");
        if (f.canExecute()) clauses.add("executable");
        if (f.isDirectory()) clauses.add("dir");
        if (f.isFile()) clauses.add("file");
    }
    return Joiner.on(',').join(clauses);
}
Using the Guava 'Joiner' class.
The output values from the mapper look like this:
alice.txt,readable,writable,executable,file
bob.zip,readable,writable,executable,dir
claire.tar,readable,writable,executable,dir
bdir,nonexistent
b1.txt,nonexistent
b2.txt,nonexistent
bdir/b2.txt,nonexistent
cdir,nonexistent
c1.txt,nonexistent
c2.txt,nonexistent
cdir/c2.txt,nonexistent
So you see the problem - the archive files are there, but they are not unpacked. What am I missing? I have also tried using DistributedCache.addCacheArchive() instead of using -archives, but the problem is still there.
The distributed cache doesn't unpack the archive files into the local working directory of your task - there's a location on each task tracker for the job as a whole, and the archives are unpacked there.
You'll need to check the DistributedCache to find this location and look for the files there. The Javadocs for DistributedCache show an example mapper pulling this information.
You can use symbolic linking when defining the -files and -archives generic options; a symlink will then be created in the local working directory of the map/reduce tasks, making this easier:
hadoop jar myJar myClass -files ./etc/alice.txt#file1.txt \
-archives ./etc/bob.zip#bob,./etc/claire.tar#claire
And then you can use the fragment names in your mapper when trying to open files in the archive:
new File("bob").isDirectory() == true

Hadoop Streaming 1.0.3 Unrecognized -D command

I am trying to chain some streaming jobs (jobs written in Python). I got the chaining to work, but I have a problem with the -D option. Here is the code:
public class OJs extends Configured implements Tool
{
    public int run(String[] args) throws Exception
    {
        // DOMINATION
        Path domin = new Path("diploma/join.txt");
        // dominationm.py
        Path domout = new Path("mapkeyout/");
        // dominationr.py
        String[] dom = new String[]
        {
            "-D mapred.reduce.tasks=0",
            "-file", "/home/hduser/optimizingJoins/dominationm.py",
            "-mapper", "dominationm.py",
            "-file", "/home/hduser/optimizingJoins/dominationr.py",
            "-reducer", "dominationr.py",
            "-input", domin.toString(),
            "-output", domout.toString()
        };
        JobConf domConf = new StreamJob().createJob(dom);
        // run domination job
        JobClient.runJob(domConf);
        return 0;
    } // end run

    public static void main(String[] args) throws Exception
    {
        int res = ToolRunner.run(new Configuration(), new OJs(), args);
        System.exit(res);
    } // end main
} // end OJs
My problem is with the option "-D mapred.reduce.tasks=0". I get this error:
ERROR streaming.StreamJob: Unrecognized option: -D...
where the ... include any possible syntax combination, i.e.
"-D mapred.reduce.tasks=0"
"-Dmapred.reduce.tasks=0"
"-D", "mapred.reduce.tasks=0"
"-D", "mapred.reduce.tasks=", "0"
" -D mapred.reduce.tasks=0"
etc.
When there is a space before -D, the option is ignored: I don't get the number of reducers I specified. When there is no space, I get the error I mentioned.
What am I doing wrong?
EDIT
Substituting the -D option with -jobconf doesn't solve the problem. Here is the whole error output:
Warning: $HADOOP_HOME is deprecated.
12/10/04 00:25:02 ERROR streaming.StreamJob: Unrecognized option: -jobconf mapred.reduce.tasks=0
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-io <identifier> Optional.
-verbose
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
For more details about these options:
Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info
Exception in thread "main" java.lang.IllegalArgumentException:
at org.apache.hadoop.streaming.StreamJob.fail(StreamJob.java:549)
at org.apache.hadoop.streaming.StreamJob.exitUsage(StreamJob.java:486)
at org.apache.hadoop.streaming.StreamJob.parseArgv(StreamJob.java:246)
at org.apache.hadoop.streaming.StreamJob.createJob(StreamJob.java:143)
at OJs.run(OJs.java:135)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at OJs.main(OJs.java:183)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Moreover, I can't understand why Streaming recognizes the -D option when I run a job directly with Streaming, but -D option recognition fails when I run a job with Streaming through JobClient. Is this a problem with Streaming or a problem with sun.reflect? Where is the sun.reflect package in Ubuntu?
Looks like StreamJob doesn't support the -Dkey=value generic configuration options.
See http://wiki.apache.org/hadoop/HadoopStreaming; it looks like you need to use the following instead (it is explicitly called out as an example on that page):
-jobconf mapred.reduce.tasks=0
To begin with, the line
..."-D mapred.reduce.tasks=0"...
should be written as
..."-D", "mapred.reduce.tasks=0"...
This is the standard pattern for passing options:
"-commandname", "value"
To continue: a program may or may not accept arguments. In the Hadoop context these arguments are called options, and there are two kinds of them: generic options and streaming (job specific) options. The generic options are handled by GenericOptionsParser. Job specific options, in the context of Hadoop Streaming, are handled by StreamJob.
So the way the -D option is set in the code of the initial question is wrong, because -D is a generic option and StreamJob can't handle generic options. StreamJob can handle -jobconf, however, which is a job specific option. So the line
..."-D", "mapred.reduce.tasks=0"...
is written correctly as
..."-jobconf", "mapred.reduce.tasks=0"...
With -jobconf this warning is raised:
WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
To avoid this warning, the -D option is needed, and consequently a GenericOptionsParser is needed to parse the -D option.
To move on, when someone runs a streaming job using the command
bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-*.jar [ generic options] [ streaming( job specific) options]
what really happens? Why is there no problem in this case? In this case, both generic and job specific options are parsed properly. This is possible because of the Tool interface, which takes care of the generic options through GenericOptionsParser. The job specific options are handled by the StreamJob inside hadoop-streaming-*.jar.
Indeed, hadoop-streaming-*.jar has a class, HadoopStreaming, responsible for jobs submitted this way. The HadoopStreaming class calls ToolRunner.run() with two arguments: the first is a new StreamJob object, and the second consists of all the command line options, i.e. the [generic options] and [streaming (job specific) options]. The GenericOptionsParser separates generic from job specific options by parsing only the generic ones. It then returns the rest of the options, i.e. the job specific ones, which are parsed by StreamJob. StreamJob is invoked through Tool.run([job specific args]), where Tool = StreamJob. See this and this to get an intuition for why Tool = StreamJob.
In conclusion,
GenericOptionsParser -> generic options,
StreamJob -> streaming (job specific) options.
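Putting this together for the code in the question, a minimal sketch (the class name OJsViaToolRunner is mine; the paths and scripts are taken from the question) would run StreamJob itself through ToolRunner, so that GenericOptionsParser consumes the generic -D option and StreamJob only sees the streaming options:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.ToolRunner;

// Sketch: ToolRunner + GenericOptionsParser handle the generic "-D" option,
// then StreamJob (which implements Tool) parses the remaining streaming options.
public class OJsViaToolRunner {
    public static void main(String[] args) throws Exception {
        String[] jobArgs = new String[] {
            "-D", "mapred.reduce.tasks=0",                          // generic option
            "-file", "/home/hduser/optimizingJoins/dominationm.py",
            "-mapper", "dominationm.py",
            "-file", "/home/hduser/optimizingJoins/dominationr.py",
            "-reducer", "dominationr.py",
            "-input", "diploma/join.txt",
            "-output", "mapkeyout/"
        };
        int res = ToolRunner.run(new Configuration(), new StreamJob(), jobArgs);
        System.exit(res);
    }
}

With this arrangement, -D is consumed before StreamJob parses the remaining arguments, which is exactly the split described above.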

Combiner hack for hadoop streaming

The current version of hadoop-streaming requires a Java class for the combiner, but I read somewhere that we can use a hack like the following:
hadoop jar ./contrib/streaming/hadoop-0.20.2-streaming.jar -input /testinput -output /testoutput -mapper "python /code/triples-mapper.py | sort | python /code/triples-reducer.py" -reducer /code/triples-reducer.py
However, this does not seem to work. What am I doing wrong?
I suspect that your problem lies somewhere in the following source:
http://svn.apache.org/viewvc/hadoop/common/tags/release-0.20.2/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeMapRed.java?view=markup
The splitArgs() method at line 69 tokenizes the command you passed:
python /code/triples-mapper.py | sort | python /code/triples-reducer.py
into a command to run, /code/triples-mapper.py (lines 131/132), and a set of arguments to pass in. All the tokens are passed to ProcessBuilder (line 164).
Java Api for ProcessBuilder
So your pipes are not being interpreted by the OS; rather, they are passed in as arguments to your mapper (you should be able to confirm this by dumping the args passed to your mapper code).
This is all for 0.20.2, so it may have been 'fixed' to suit your purposes in a later version of Hadoop.
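To see what the mapper actually receives, the suggested args dump can be sketched like this in Java (mirroring the Test mapper from the first question on this page; a Python mapper would print sys.argv instead). ArgDumpMapper is a hypothetical name used only for this illustration:

import java.io.IOException;

// Hypothetical argument-dumping mapper: prints whatever arguments the streaming
// framework passed to it, so you can check whether the "|" and the downstream
// commands arrived as plain arguments instead of being interpreted as pipes.
public class ArgDumpMapper {
    public static void main(String[] args) throws IOException {
        for (int i = 0; i < args.length; i++) {
            System.err.println("arg[" + i + "] = " + args[i]);   // goes to the task's stderr log
        }
        // Drain stdin so the task does not fail with a broken pipe.
        while (System.in.read() != -1) {
        }
    }
}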
