common.AbstractJob: Unexpected -libjars while processing Job-Specific Options - hadoop

Hi all!
When I use RecommenderJob in my project, I ran into an unexpected error. The arguments passed to the job are a String array with the following values:
[-libjars, /path/to/xxx.jar,/path/to/yyy.jar,
--input, hdfs://localhost:9000/tmp/x,
--output, hdfs://localhost:9000/tmp/y,
--similarityClassname,
org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity,
--numRecommendations, 6,
--tempDir, hdfs://localhost:9000/tmp/z]
When I run the job with the following code:
job.run(args);
it prints an error like the following:
ERROR common.AbstractJob: Unexpected -libjars while processing Job-Specific Options:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in the
classpath.
Unexpected -libjars while processing Job-Specific Options:
Usage:
...
Does anybody know how to solve this? Thanks in advance!

Finally, I found the solution myself. We should not run the job with
job.run(args);
because that only handles the Job-Specific Options. Instead, run the job through ToolRunner, which processes the Generic Options before the Job-Specific Options and therefore solves the problem:
ToolRunner.run(conf, job, args);
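For example, a minimal driver might look like the sketch below. This is only an illustration: it assumes the job class is org.apache.mahout.cf.taste.hadoop.item.RecommenderJob (the question does not name the exact class) and that args is the String array shown above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RecommenderDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        RecommenderJob job = new RecommenderJob();
        // ToolRunner lets GenericOptionsParser consume the Generic Options
        // (-libjars, -D, -files, ...) before RecommenderJob.run() sees the
        // remaining Job-Specific Options (--input, --output, ...).
        int exitCode = ToolRunner.run(conf, job, args);
        System.exit(exitCode);
    }
}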

Related

java.lang.IllegalArgumentException: Both source file listing and source paths present

I am trying to copy files from HDFS to S3 using distcp by executing the following command
hadoop distcp -fs.s3a.access.key=AccessKey -fs.s3a.secret.key=SecrerKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
But I am getting the following error:
17/09/05 02:59:30 ERROR tools.DistCp: Invalid arguments:
java.lang.IllegalArgumentException: Both source file listing and source paths present
at org.apache.hadoop.tools.OptionsParser.parseSourceAndTargetPaths(OptionsParser.java:341)
at org.apache.hadoop.tools.OptionsParser.parse(OptionsParser.java:89)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:112)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:436)
Invalid arguments: Both source file listing and source paths present
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and
append new data to them if possible
-async Should distcp execution be blocking
To pass configuration parameters, you have to use -D
hadoop distcp -Dfs.s3a.access.key=AccessKey -Dfs.s3a.secret.key=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
Old Command
hadoop distcp -Dfs.s3a.access.key=AccessKey -Dfs.s3a.secret.key=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
Rectified Command
hadoop distcp -Dfs.s3n.awsAccessKeyId=AccessKey -Dfs.s3n.awsSecretAccessKey=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
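As a side note, if the bucket were addressed through the newer s3a connector (an s3a:// URI) instead of s3n, then the property names the original command was reaching for, fs.s3a.access.key and fs.s3a.secret.key, would be the right ones, e.g.:
hadoop distcp -Dfs.s3a.access.key=AccessKey -Dfs.s3a.secret.key=SecretKey \
s3a://testbdr/test2 hdfs://hostname:portnumber/tmp/test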

What is the intention of Sqoop options? Single - vs double -- minus difference?

In the example below, -username is preceded by a single -, whereas --connect, --table, and the other options are preceded by a double --. What is the intention behind these Sqoop options? Where should I use a single dash and where a double dash?
sqoop-import --connect jdbc:mysql://localhost:3306/db1 -username
root -password password --table tableName --hive-table tableName
--create-hive-table --hive-import --hive-home path/to/hive_home
Generic Hadoop arguments are preceded by a single dash character (-), whereas sqoop arguments start with two dashes (--), unless they are single character arguments such as -P.
Generic hadoop command-line arguments supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
FYI: You must supply the generic hadoop arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect).
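For illustration, a sketch of that ordering, reusing the placeholder connection string and table name from the question (mapred.job.name is just an arbitrary Hadoop property chosen to show where the generic -D goes, and -P, which prompts for the password, stands in for -password):
sqoop-import -D mapred.job.name=tableName_import \
  --connect jdbc:mysql://localhost:3306/db1 \
  --username root -P \
  --table tableName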

How can multiple files be specified with "-files" in the CLI of Amazon for EMR?

I am trying to start an Amazon EMR cluster via the AWS CLI, but I am a little confused about how I should specify multiple files. My current call is as follows:
aws emr create-cluster --steps Type=STREAMING,Name='Intra country development',ActionOnFailure=CONTINUE,Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra] \
--ami-version 3.1.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--auto-terminate \
--log-uri s3://betaestimationtest/logs
However, Hadoop now complains that it cannot find the reducer file:
Caused by: java.io.IOException: Cannot run program "reducer.py": error=2, No such file or directory
What am I doing wrong? The file does exist in the folder I specified.
To pass multiple files in a streaming step, you need to use file:// to pass the steps as a JSON file.
The AWS CLI shorthand syntax uses a comma as the delimiter to separate a list of args. So when we try to pass parameters like "-files","s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py", the shorthand syntax parser treats mapper.py and reducer.py as two separate parameters.
The workaround is to use the json format. Please see the examples below.
aws emr create-cluster --steps file://./mysteps.json --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate --log-uri s3://betaestimationtest/logs
mysteps.json looks like:
[
  {
    "Name": "Intra country development",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-files",
      "s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py",
      "-mapper",
      "mapper.py",
      "-reducer",
      "reducer.py",
      "-input",
      "s3://betaestimationtest/output_0_inter",
      "-output",
      "s3://betaestimationtest/output_1_intra"
    ]
  }
]
You can also find examples here: https://github.com/aws/aws-cli/blob/develop/awscli/examples/emr/create-cluster-examples.rst. See example 13.
Hope it helps!
You are specifying -files twice; you only need to specify it once. I forget whether the CLI needs the separator to be a space or a comma for multiple values, but you can try both.
You should replace:
Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
with:
Args=[-files,s3://betaestimationtest/mapper.py s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
or if that fails, with:
Args=[-files,s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
Add an escape for the comma separating the files:
Args=[-files,s3://betaestimationtest/mapper.py\\,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]

Passing directories to Hadoop streaming: some help needed

The context is that I am trying to run a streaming job on Amazon EMR (the web UI) with a bash script that I run like:
-input s3://emrdata/test_data/input -output s3://emrdata/test_data/output -mapper
s3://emrdata/test_data/scripts/mapperScript.sh -reducer NONE
The input directory has sub-directories in it and these sub-directories have gzipped data files.
The relevant part of mapperScript.sh that fails is :
for filename in "$input"/*; do
dir_name=`dirname $filename`
fname=`basename $filename`
echo "$fname">/dev/stderr
modelname=${fname}.model
modelfile=$model_location/$modelname
echo "$modelfile">/dev/stderr
inputfile=$dirname/$fname
echo "$inputfile">/dev/stderr
outputfile=$output/$fname
echo "$outputfile">/dev/stderr
# Will do some processing on the files in the sub-directories here
done # this is the loop for getting input from all sub-directories
Basically, I need to read the sub-directories in streaming mode, and when I run this, Hadoop complains:
2013-03-01 10:41:26,226 ERROR org.apache.hadoop.security.UserGroupInformation (main): PriviledgedActionException as:hadoop cause:java.io.IOException: Not a file: s3://emrdata/test_data/input/data1
2013-03-01 10:41:26,226 ERROR org.apache.hadoop.streaming.StreamJob (main): Error Launching job : Not a file: s3://emrdata/test_data/input/data1
I am aware that a similar question has been asked here.
The suggestion there was to write one's own InputFormat. I am wondering whether I am missing something else in the way my script is written or the EMR inputs are given, or whether writing my own InputFormat in Java is my only choice.
I have also tried giving EMR my input as "input/*", but no luck.
It seems that, while there may be some temporary workarounds, Hadoop inherently doesn't support this yet; as you can see, there is an open ticket on it here.
So inputpath/*/* may work for two levels of sub-directories, but it may fail for deeper nesting.
The best thing you can do for now is get a listing of the files (or of the folders without any sub-directories), build a comma-separated list of input paths from it, and pass that list instead. You can use simple tools like s3cmd for this, as sketched below.
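A rough sketch of that workaround, assuming s3cmd is configured and supports recursive listing (check your version); the bucket and prefix are the ones from the question:
# Build a comma-separated list of every object under the input prefix;
# s3cmd prints the object URL in the last column of each line.
INPUT_PATHS=$(s3cmd ls -r s3://emrdata/test_data/input/ | awk '{print $NF}' | paste -sd, -)

# Then pass the expanded list instead of the bare directory, e.g.:
#   -input "$INPUT_PATHS" -output s3://emrdata/test_data/output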

Hadoop Streaming 1.0.3 Unrecognized -D command

I am trying to chain some Streaming jobs (jobs written in Python). I did it, but I have a problem with the -D options. Here is the code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class OJs extends Configured implements Tool
{
    public int run(String[] args) throws Exception
    {
        // DOMINATION
        Path domin = new Path("diploma/join.txt");
        // dominationm.py
        Path domout = new Path("mapkeyout/");
        // dominationr.py
        String[] dom = new String[]
        {
            "-D mapred.reduce.tasks=0",
            "-file",    "/home/hduser/optimizingJoins/dominationm.py",
            "-mapper",  "dominationm.py",
            "-file",    "/home/hduser/optimizingJoins/dominationr.py",
            "-reducer", "dominationr.py",
            "-input",   domin.toString(),
            "-output",  domout.toString()
        };
        JobConf domConf = new StreamJob().createJob(dom);
        // run domination job
        JobClient.runJob(domConf);
        return 0;
    } // end run

    public static void main(String[] args) throws Exception
    {
        int res = ToolRunner.run(new Configuration(), new OJs(), args);
        System.exit(res);
    } // end main
} // end OJs
My problem is with the option "-D mapred.reduce.tasks=0". I get this error:
ERROR streaming.StreamJob: Unrecognized option: -D...
where the ... stands for any syntax combination I have tried, i.e.
"-D mapred.reduce.tasks=0"
"-Dmapred.reduce.tasks=0"
"-D", "mapred.reduce.tasks=0"
"-D", "mapred.reduce.tasks=", "0"
" -D mapred.reduce.tasks=0"
etc.
When I have a space before -D, then this command is ignored. I don't have the number of reducers I specified. When I don't have this space, I get the error I mentioned.
What am I doing wrong?
EDIT
Substituting the -D option with -jobconf doesn't solve the problem. Here is the whole error output:
Warning: $HADOOP_HOME is deprecated.
12/10/04 00:25:02 ERROR streaming.StreamJob: Unrecognized option: -jobconf mapred.reduce.tasks=0
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-io <identifier> Optional.
-verbose
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
For more details about these options:
Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info
Exception in thread "main" java.lang.IllegalArgumentException:
at org.apache.hadoop.streaming.StreamJob.fail(StreamJob.java:549)
at org.apache.hadoop.streaming.StreamJob.exitUsage(StreamJob.java:486)
at org.apache.hadoop.streaming.StreamJob.parseArgv(StreamJob.java:246)
at org.apache.hadoop.streaming.StreamJob.createJob(StreamJob.java:143)
at OJs.run(OJs.java:135)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at OJs.main(OJs.java:183)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Moreover, I can't understand why, when I run a job directly with Streaming, the -D option is recognized, but when I run a job with Streaming through JobClient, -D recognition fails. Is this a problem of Streaming or of sun.reflect? Where is the sun.reflect package in Ubuntu?
Looks like StreamJob doesn't support the -Dkey=value generic configuration options.
See http://wiki.apache.org/hadoop/HadoopStreaming; it looks like you need to use (and this is explicitly called out as an example on that page):
-jobconf mapred.reduce.tasks=0
To begin with, the line
..."-D mapred.reduce.tasks=0"...
should be written as
..."-D", "mapred.reduce.tasks=0"...
This is the standard pattern for passing options,
"-optionname", "value"
To continue, a program may or may not accept arguments. In the Hadoop context these arguments are called options, and there are two kinds of them: generic options and streaming (job-specific) options. The generic options are handled by GenericOptionsParser. Job-specific options, in the context of Hadoop Streaming, are handled by StreamJob.
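As a rough sketch of that split (GenericOptionsParser and getRemainingArgs() are the standard Hadoop classes and method; the surrounding demo class is made up for illustration):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class SplitOptionsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // GenericOptionsParser consumes the generic options (-D, -libjars,
        // -files, ...) into conf and hands back whatever is left over.
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        String[] jobSpecific = parser.getRemainingArgs();
        // jobSpecific now holds only the streaming (job-specific) options,
        // which is what StreamJob expects to see.
        System.out.println(Arrays.toString(jobSpecific));
    }
}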
So the way the -D option is set in the code of the initial question is wrong, because -D is a generic option and StreamJob can't handle generic options. StreamJob can, however, handle -jobconf, which is a job-specific option. So the line
..."-D", "mapred.reduce.tasks=0"...
is correctly written as
..."-jobconf", "mapred.reduce.tasks=0"...
With -jobconf this warning is raised,
WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
To avoid this warning, the -D option is needed, and consequently a GenericOptionsParser is needed to parse it.
To move on, when someone runs a streaming job using the command
bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-*.jar [ generic options] [ streaming( job specific) options]
what really happens? Why is there no problem in this case? In this case, both generic and job-specific options are parsed properly. This is possible because of the Tool interface, which takes care of the generic options through GenericOptionsParser. The job-specific options are handled by StreamJob inside hadoop-streaming-*.jar.
Indeed, hadoop-streaming-*.jar contains a class HadoopStreaming that is responsible for jobs submitted the way above. The HadoopStreaming class calls ToolRunner.run() with two arguments: a new StreamJob object, and all the command-line options, i.e. [generic options] and [streaming (job-specific) options]. The GenericOptionsParser separates the generic from the job-specific options by parsing only the generic ones; it then returns the rest of the options, i.e. the job-specific ones, which are parsed by StreamJob. StreamJob is invoked through Tool.run([job-specific args]), where Tool = StreamJob. See this and this to get an intuition for why Tool = StreamJob.
In conclusion,
GenericOptionsParser -> generic options,
StreamJob -> streaming (job-specific) options.
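Putting that together, one way to get the -D option recognized in the code from the question is to let ToolRunner drive StreamJob directly instead of calling new StreamJob().createJob(...) yourself. This is only a rough sketch under that assumption; the paths and scripts are the ones from the question, and StreamJob.run() is expected to both parse the options and submit the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.ToolRunner;

public class OJsViaToolRunner {
    public static void main(String[] args) throws Exception {
        String[] dom = new String[] {
            "-D", "mapred.reduce.tasks=0",  // generic option, handled by GenericOptionsParser
            "-file",    "/home/hduser/optimizingJoins/dominationm.py",
            "-mapper",  "dominationm.py",
            "-file",    "/home/hduser/optimizingJoins/dominationr.py",
            "-reducer", "dominationr.py",
            "-input",   "diploma/join.txt",
            "-output",  "mapkeyout/"
        };
        // ToolRunner strips the generic options into the Configuration and
        // passes the remaining streaming options on to StreamJob.run().
        int res = ToolRunner.run(new Configuration(), new StreamJob(), dom);
        System.exit(res);
    }
}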
