Error when running Hadoop Map Reduce for map-only job - hadoop

I want to run a map-only job in Hadoop MapReduce, here's my code:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJobName("import");
job.setMapperClass(Map.class);//Custom Mapper
job.setInputFormatClass(TextInputFormat.class);
job.setNumReduceTasks(0);
TextInputFormat.setInputPaths(job, new Path("/home/jonathan/input"));
But I get the error:
13/07/17 18:22:48 ERROR security.UserGroupInformation: PriviledgedActionException
as: jonathan cause:org.apache.hadoop.mapred.InvalidJobConfException:
Output directory not set.
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException:
Output directory not set.
Then I tried to use this:
job.setOutputFormatClass(org.apache.hadoop.mapred.lib.NullOutputFormat.class);
But it gives me a compilation error:
java: method setOutputFormatClass in class org.apache.hadoop.mapreduce.Job
cannot be applied to given types;
required: java.lang.Class<? extends org.apache.hadoop.mapreduce.OutputFormat>
found: java.lang.Class<org.apache.hadoop.mapred.lib.NullOutputFormat>
reason: actual argument java.lang.Class
<org.apache.hadoop.mapred.lib.NullOutputFormat> cannot be converted to
java.lang.Class<? extends org.apache.hadoop.mapreduce.OutputFormat>
by method invocation conversion
What am I doing wrong?

Map-only jobs still need an output location specified. As the error says, you're not specifying this.
I think you mean that your job produces no output at all. Hadoop still wants you to specify an output location, though nothing need be written.
You want org.apache.hadoop.mapreduce.lib.output.NullOutputFormat not org.apache.hadoop.mapred.lib.NullOutputFormat, which is what the second error indicates though it's subtle.

Related

Hadoop WordCount MapReduce: Getting invalid argument error for setInputFormatClass

I am trying to run a wordcount program but I am getting error for below code
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Error:- "The method setInputFormatClass(Class)
in the type Job is not applicable for the arguments
(Class)"
The likely problem (without seeing all of your code) is that you've mixed the two mapreduce APIs, the mapred and mapreduce.
Check the imports for the two classes. I'm guessing yours probably look like:
org.apache.hadoop.mapred.TextInputFormat
When it should be:
org.apache.hadoop.mapreduce.lib.input.TextInputFormat

Hadoop jar command error

While executing the JAR file command on HDFS getting error as below
#hadoop jar WordCountNew.jar WordCountNew /MRInput57/Input-Big.txt /MROutput57
15/11/06 19:46:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/11/06 19:46:32 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:8020/var/lib/hadoop-0.20/cache/mapred/mapred/staging/root/.staging/job_201511061734_0003
15/11/06 19:46:32 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /MRInput57/Input-Big.txt already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /MRInput57/Input-Big.txt already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:921)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:526)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:556)
at MapReduce.WordCountNew.main(WordCountNew.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
My Driver class Program is as below
public static void main(String[] args) throws IOException, Exception {
// Configutation details w. r. t. Job, Jar file
Configuration conf = new Configuration();
Job job = new Job(conf, "WORDCOUNTJOB");
// Setting Driver class
job.setJarByClass(MapReduceWordCount.class);
// Setting the Mapper class
job.setMapperClass(TokenizerMapper.class);
// Setting the Combiner class
job.setCombinerClass(IntSumReducer.class);
// Setting the Reducer class
job.setReducerClass(IntSumReducer.class);
// Setting the Output Key class
job.setOutputKeyClass(Text.class);
// Setting the Output value class
job.setOutputValueClass(IntWritable.class);
// Adding the Input path
FileInputFormat.addInputPath(job, new Path(args[0]));
// Setting the output path
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// System exit strategy
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Can someone please rectify the issue in my code?
Regards
Pranav
You need to check that the output directory doesn't already exist and delete it if it does. MapReduce can't (or won't) write files to a directory that exists. It needs to create the directory to be sure.
Add this:
Path outPath = new Path(args[1]);
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
if (dfs.exists(outPath)) {
dfs.delete(outPath, true);
}
Output directory which you are trying to create to store output is already present.So try to delete the previous directory of same name or change the name of output directory.
Output directory should not exist prior to execution of program. Either delete existing directory or provide new directory or remove output directory in your program.
I prefer deletion of output directory from command prompt before executing your program from command prompt.
From command prompt:
hdfs dfs -rm -r <your_output_directory_HDFS_URL>
From java:
Chris Gerken code is good enough.
As others have noted, you are getting the error because the output directory already exists, most likely because you have tried executing this job before.
You can remove the existing output directory right before you run the job, i.e.:
#hadoop fs -rm -r /MROutput57 && \
hadoop jar WordCountNew.jar WordCountNew /MRInput57/Input-Big.txt /MROutput57

Hadoop : ClassNotFound Error at MapReduce

Just to state my setup before posing the question,
Hadoop Version : 1.0.3
The default WordCount example is running fine. But when I created a new WordCount program according to this page http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
I compiled it and jar-ed it in similar fashion as given in the tutorial. But when I ran it using :
/usr/local/hadoop$ bin/hadoop jar wordcount.jar org.myorg.WordCount ../Space/input/ ../Space/output
I got the following error,
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.myorg.WordCount$Map
The whole error log has been pasted here : http://pastebin.com/GNbsfpg3
Where did I go wrong?
There are some clues in the error messages:
12/07/14 18:09:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/07/14 18:09:38 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
You'll need to share your driver code with us (where you create and configure the job), but it appears you are not configuring the 'job jar', that is to say the job client is not given a hint as to where your code is bundled into a jar, and hence when you run your job, the classes cannot be found when the map instances actually run.
You probably want something like
jobConf.setJarByClass(org.myorg.WordCount.class);
I had exactly the same problem, and got it fixed by adding the following on the main code
jobConf.setJarByClass(org.myorg.WordCount.class);
here you can find the full main function
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(org.myorg.WordCount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
I also got the above error.
What I did is that I just copied the jar files into all nodes of the cluster and set the classpath such that every slave node can access this jar.
And this worked for me.
It might help you.
before runJob, set conf like this:
conf.setJar("Your jar file name");
it may work, try it !

Giraph Shortest Paths Example ClassNotFoundException

I am trying to run the shortest paths example from the giraph incubator (https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example). However instead of executing the example from the giraph-*-dependencies.jar, I have created my own job jar. When I created a single Job file as presented in the example, I was getting
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.Test$SimpleShortestPathsVertexInputFormat
Then I have moved the inner classes (SimpleShortestPathsVertexInputFormat and SimpleShortestPathsVertexOutputFormat) to separates files and renamed them just in case (SimpleShortestPathsVertexInputFormat_v2, SimpleShortestPathsVertexOutputFormat_v2); the classes are not static anymore. This have solved the issues of class not found for the SimpleShortestPathsVertexInputFormat_v2, however I am still getting the same error for the SimpleShortestPathsVertexOutputFormat_v2. Below is my stack trace.
INFO mapred.JobClient: Running job: job_201205221101_0003
INFO mapred.JobClient: map 0% reduce 0%
INFO mapred.JobClient: Task Id : attempt_201205221101_0003_m_000005_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.utils.SimpleShortestPathsVertexOutputFormat_v2
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:898)
at org.apache.giraph.graph.BspUtils.getVertexOutputFormatClass(BspUtils.java:134)
at org.apache.giraph.bsp.BspOutputFormat.getOutputCommitter(BspOutputFormat.java:56)
at org.apache.hadoop.mapred.Task.initialize(Task.java:490)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:352)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.utils.SimpleShortestPathsVertexOutputFormat_v2
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:866)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:890)
... 9 more
I have inspected my job jar and all classes are there. Furthermore I am using hadoop 0.20.203 in a pseudo distributed mode. The way I launch my job is presented below.
hadoop jar giraphJobs.jar org.test.giraph.Test -libjars /path/to/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar /path/to/input /path/to/output 0 3
Also I have defined HADOOP_CLASSPATH for the giraph-*-dependencies.jar. I can run the PageRankBenchmark example without a problem (directly from the giraph-*-dependencies.jar), and the shortes path example works as well (also directly from the giraph-*-dependencies.jar). Other hadoop jobs work without a problem (somewhere I have read to test if my "cluster" works correctly). Does anyone came across similar problem? Any help will be appreciated.
Solution (sorry to post it like this but I can't answer my own question for a couple of more hours)
To solve this issue I had to add my Job jar to the -libjars (no changes to HADOOP_CLASSPATH where made). The command to launch job now looks like this.
hadoop jar giraphJobs.jar org.test.giraph.Test -libjars /path/to/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar,/path/to/job.jar /path/to/input /path/to/output 0 3
List of jars has to be comma separated. Though this has solved my problem. I am still curious why I have to pass my job jar as a "classpath" parameter? Can someone explain me what is the rational behind this? As I found it strange (to say the least) to invoke my job jar and then pass it again as a "classpath" jar. I am really curious about the explanation.
I found an alternative programmatic solution to the problem.
We need to modify the run() method in the following way -
...
#Override
public int run(String[] argArray) throws Exception {
Preconditions.checkArgument(argArray.length == 4,
"run: Must have 4 arguments <input path> <output path> " +
"<source vertex id> <# of workers>");
GiraphJob job = new GiraphJob(getConf(), getClass().getName());
// This is the addition - it will make hadoop look for other classes in the same jar that contains this class
job.getInternalJob().setJarByClass(getClass());
job.setVertexClass(getClass());
...
}
setJarByClass() will make hadoop look for the missing classes in the same jar that contains the class returned by getClass(), and we will not need to add the job jar name separately to the -libjars option.

How to set system environment variable from Mapper Hadoop?

The problem below the line is solved but I am facing another problem.
I am doing this :
DistributedCache.createSymlink(job.getConfiguration());
DistributedCache.addCacheFile(new URI
("hdfs:/user/hadoop/harsh/libnative1.so"),conf.getConfiguration());
and in the mapper :
System.loadLibrary("libnative1.so");
(i also tried
System.loadLibrary("libnative1");
System.loadLibrary("native1");
But I am getting this error:
java.lang.UnsatisfiedLinkError: no libnative1.so in java.library.path
I am totally clueless what should I set java.library.path to ..
I tried setting it to /home and copied every .so from distributed cache to /home/ but still it didn't work :(
Any suggestions / solutions please?
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
I want to set the system environment variable (specifically, LD_LIBRARY_PATH) of the machine where the mapper is running.
I tried :
Runtime run = Runtime.getRuntime();
Process pr=run.exec("export LD_LIBRARY_PATH=/usr/local/:$LD_LIBRARY_PATH");
But it throws IOException.
I also know about
JobConf.MAPRED_MAP_TASK_ENV
But I am using hadoop version 0.20.2 which has Job & Configuration instead of JobConf.
I am unable to find any such variable, and this is also not a Hadoop specific environment variable but a system environment variable.
Any solution/suggestion?
Thanks in advance..
Why dont you export this variable on all nodes of the cluster ?
Anyways, use the Configuration class as below while submitting the Job
Configuration conf = new Configuration();
conf.set("mapred.map.child.env",<string value>);
Job job = new Job(conf);
The format of the value is k1=v1,k2=v2

Resources