The main problem is that the program throws the following exception:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://quickstart.cloudera:8020/user/davide/wordcount/input already exists
The command I run to launch the job is the following:
hadoop jar wordcount.jar org.wordcount.WordCount /user/davide/wordcount/input /user/davide/wordcount/output
This seems correct: the output directory does not exist, as Hadoop requires.
In the Java file, the paths seem to be set correctly:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
I tried several solutions, but couldn't figure out what the problem is.
Thanks in advance.
The problem lies in your argument numbering: most likely because your jar's manifest already specifies a Main-Class, the class name you pass on the command line (org.wordcount.WordCount) ends up in args[0], so you need to use args[1] for the input and args[2] for the output. If you look closely, the error says Output directory hdfs://quickstart.cloudera:8020/user/davide/wordcount/input already exists - it is trying to use the input folder as the output.
To fix this:
FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
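If you want the driver to cope with either invocation (with or without the class name on the command line), a small defensive tweak is possible. This is only a sketch built on the job setup from the question:
// assumes the manifest's Main-Class causes the class name to be shifted into args[0]
int shift = (args.length == 3) ? 1 : 0;
FileInputFormat.addInputPath(job, new Path(args[shift]));
FileOutputFormat.setOutputPath(job, new Path(args[shift + 1]));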
Related
While executing the JAR file on HDFS, I am getting the error below:
#hadoop jar WordCountNew.jar WordCountNew /MRInput57/Input-Big.txt /MROutput57
15/11/06 19:46:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/11/06 19:46:32 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:8020/var/lib/hadoop-0.20/cache/mapred/mapred/staging/root/.staging/job_201511061734_0003
15/11/06 19:46:32 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /MRInput57/Input-Big.txt already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /MRInput57/Input-Big.txt already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:921)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:526)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:556)
at MapReduce.WordCountNew.main(WordCountNew.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
My driver class is as below:
public static void main(String[] args) throws Exception {
    // Configuration details w.r.t. Job, Jar file
    Configuration conf = new Configuration();
    Job job = new Job(conf, "WORDCOUNTJOB");
    // Setting the Driver class
    job.setJarByClass(MapReduceWordCount.class);
    // Setting the Mapper class
    job.setMapperClass(TokenizerMapper.class);
    // Setting the Combiner class
    job.setCombinerClass(IntSumReducer.class);
    // Setting the Reducer class
    job.setReducerClass(IntSumReducer.class);
    // Setting the Output Key class
    job.setOutputKeyClass(Text.class);
    // Setting the Output Value class
    job.setOutputValueClass(IntWritable.class);
    // Adding the input path
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Setting the output path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // System exit strategy
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Can someone please rectify the issue in my code?
Regards
Pranav
You need to check whether the output directory already exists and delete it if it does. MapReduce can't (or won't) write files to a directory that already exists; it insists on creating the output directory itself so it can be sure it is not overwriting existing data.
Add this:
Path outPath = new Path(args[1]);
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
if (dfs.exists(outPath)) {
    // remove the old output directory recursively before the job runs
    dfs.delete(outPath, true);
}
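(Assuming the driver shown in the question: conf is the Configuration already created there, the check goes right before job.waitForCompletion(true) is called, and org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path need to be imported.)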
The output directory you are trying to create to store the output is already present. Either delete the previous directory of the same name or change the name of the output directory.
The output directory should not exist prior to the execution of the program. Either delete the existing directory, provide a new directory, or remove the output directory in your program.
I prefer deleting the output directory from the command prompt before executing the program.
From the command prompt:
hdfs dfs -rm -r <your_output_directory_HDFS_URL>
From Java:
Chris Gerken's code above is good enough.
As others have noted, you are getting the error because the output directory already exists, most likely because you have tried executing this job before.
You can remove the existing output directory right before you run the job, i.e.:
#hadoop fs -rm -r /MROutput57 && \
hadoop jar WordCountNew.jar WordCountNew /MRInput57/Input-Big.txt /MROutput57
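One caveat: if /MROutput57 does not exist yet (for example on the very first run), hadoop fs -rm -r exits with a non-zero status and the && will skip the job. On newer Hadoop releases you can add -f (hadoop fs -rm -r -f /MROutput57) so the removal succeeds either way, or simply run the two commands separately.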
I have a jar called WordCountMain.jar. I would like to run this jar using the hadoop command on a multi-node cluster.
My user id is tagged to the queue named "omega", so if I run the jar using the command below, I get an error indicating that my id does not have SUBMIT_JOB access.
hadoop jar WordCountMain.jar /user/cloudera/inputs/words.txt /user/cloudera/output
The above command does not work on the multi-node cluster, but it works on a single-node CDH3 cluster.
How do I include the queue name when running the above jar?
Configuration conf = new Configuration();
Job job = new Job(conf,"word count");
job.getConfiguration().set("mapreduce.job.queuename","omega");
job.setJarByClass(WordCountCombinerMain.class);
Path inputFilePath = new Path(args[0]);
Path outputFilePath = new Path(args[1]);
FileInputFormat.addInputPath(job, inputFilePath);
FileOutputFormat.setOutputPath(job, outputFilePath);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(CWordCountMapper.class);
job.setCombinerClass(CWordCountCombiner1.class);
job.setReducerClass(CWordCountCombiner1.class);
//job.setReducerClass(CwordCountReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.waitForCompletion(true);
// job.submit(); // redundant: waitForCompletion(true) already submits the job
But I am getting the error below. It says that my MapReduce job is being submitted to the default queue. Can someone help me with this?
ERROR ipc.RPC: FailoverProxy: Failing this Call: submitJob for error(RemoteException): org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: User mytra cannot perform operation SUBMIT_JOB on queue default
Try one of the possible solutions below in your driver class.
Solution 1:
configuration.set("mapred.job.queue.name", "omega");
Solution 2:
String queueName = "omega";
job.getConfiguration().set("mapreduce.job.queuename", queueName);
You can use
-Dmapred.job.queue.name=yourpoolname or -Dmapreduce.job.queuename=yourpoolname
as a parameter to submit jobs to different queues.
Be aware that mapred.job.queue.name is a deprecated property name; the new name is mapreduce.job.queuename as of Hadoop 2.4.1.
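Note that the -D generic options are only picked up when the driver runs them through GenericOptionsParser; that is what the "Applications should implement Tool for the same" warning earlier in this thread is about. A minimal sketch of a Tool-based driver (the class, mapper, combiner and reducer names are the ones from the question and are assumed to exist in your project):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountCombinerMain extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D overrides parsed by ToolRunner,
        // including -Dmapreduce.job.queuename=omega
        Job job = new Job(getConf(), "word count");
        job.setJarByClass(WordCountCombinerMain.class);
        job.setMapperClass(CWordCountMapper.class);
        job.setCombinerClass(CWordCountCombiner1.class);
        job.setReducerClass(CWordCountCombiner1.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic -D options before run() sees args
        System.exit(ToolRunner.run(new Configuration(), new WordCountCombinerMain(), args));
    }
}
It can then be run as, for example, hadoop jar WordCountMain.jar WordCountCombinerMain -Dmapreduce.job.queuename=omega /user/cloudera/inputs/words.txt /user/cloudera/output (the -D option must come before the positional arguments).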
I am trying to run a Hadoop job on Amazon Elastic MapReduce. I have my data and jar located in AWS S3. When I set up the job flow, I pass the JAR arguments as:
s3n://my-hadoop/input s3n://my-hadoop/output
Below is my hadoop main function
public static void main(String[] args) throws Exception
{
    Configuration conf = new Configuration();
    Job job = new Job(conf, "MyMR");
    job.setJarByClass(MyMR.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(CountryReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
However, my job flow fails with the following log in stderr:
Exception in thread "main" java.lang.ClassNotFoundException: s3n://my-hadoop/input
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:180)
So how do I specify my input and output paths in AWS EMR?
So basically this is the classic error of not defining the main class when creating an executable jar. When the jar does not know its main class, the first argument is taken to be the main class name, hence the error here: s3n://my-hadoop/input is being interpreted as a class name.
So make sure that while you create the executable jar, you specify the main class in the manifest.
OR
Use args[1] and args[2] respectively for input and output, and execute the hadoop step something like the following:
ruby elastic-mapreduce -j $jobflow --jar s3:/my-jar-location/myjar.jar --arg com.somecompany.MyMainClass --arg s3:/input --arg s3:/output
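For the manifest route, one way to set the entry point when you package the jar is the jar tool's e flag (a sketch: myjar.jar and com.somecompany.MyMainClass are the names used in this answer, and classes/ is an assumed directory of compiled classes):
jar cfe myjar.jar com.somecompany.MyMainClass -C classes/ .
With the entry point recorded in the manifest, you no longer pass the class name as an argument, and args[0]/args[1] stay the input and output paths.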
I ran into the same problem. It happens because you need 3 arguments (rather than 2) when you submit a custom jar file: the first is your main class name, the second is the input path to your input file, and the third is the output path to your output folder.
You have probably solved this problem by now, anyway.
Just to state my setup before posing the question,
Hadoop version: 1.0.3
The default WordCount example is running fine. But when I created a new WordCount program according to this page http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
I compiled it and jar-ed it in a similar fashion to the one given in the tutorial. But when I ran it using:
/usr/local/hadoop$ bin/hadoop jar wordcount.jar org.myorg.WordCount ../Space/input/ ../Space/output
I got the following error:
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.myorg.WordCount$Map
The whole error log has been pasted here : http://pastebin.com/GNbsfpg3
Where did I go wrong?
There are some clues in the error messages:
12/07/14 18:09:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/07/14 18:09:38 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
You'll need to share your driver code with us (where you create and configure the job), but it appears you are not configuring the 'job jar'; that is to say, the job client is not given a hint as to where your code is bundled into a jar, and hence the classes cannot be found when the map instances actually run.
You probably want something like
jobConf.setJarByClass(org.myorg.WordCount.class);
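(With the new-API org.apache.hadoop.mapreduce.Job used in the code in the question, the equivalent call is job.setJarByClass(org.myorg.WordCount.class), as the next answer shows.)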
I had exactly the same problem and got it fixed by adding the following to the main code:
job.setJarByClass(org.myorg.WordCount.class);
Here you can find the full main function:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(org.myorg.WordCount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
I also got the above error.
What I did was simply copy the jar file to all nodes of the cluster and set the classpath so that every slave node could access it.
This worked for me.
It might help you.
Before runJob, set the conf like this:
conf.setJar("Your jar file name");
It may work, try it!
The problem below the line is solved, but I am facing another problem.
I am doing this:
DistributedCache.createSymlink(job.getConfiguration());
DistributedCache.addCacheFile(new URI
("hdfs:/user/hadoop/harsh/libnative1.so"), job.getConfiguration());
and in the mapper:
System.loadLibrary("libnative1.so");
I also tried System.loadLibrary("libnative1"); and System.loadLibrary("native1");
But I am getting this error:
java.lang.UnsatisfiedLinkError: no libnative1.so in java.library.path
I am totally clueless about what I should set java.library.path to.
I tried setting it to /home and copied every .so from the distributed cache to /home/, but it still didn't work.
Any suggestions/solutions please?
----------------------------------------
I want to set a system environment variable (specifically, LD_LIBRARY_PATH) on the machine where the mapper runs.
I tried:
Runtime run = Runtime.getRuntime();
Process pr = run.exec("export LD_LIBRARY_PATH=/usr/local/:$LD_LIBRARY_PATH");
But it throws an IOException.
I also know about
JobConf.MAPRED_MAP_TASK_ENV
But I am using Hadoop version 0.20.2, which has Job and Configuration instead of JobConf.
I am unable to find any such variable, and this is also not a Hadoop-specific environment variable but a system environment variable.
Any solution/suggestion?
Thanks in advance.
Why don't you export this variable on all nodes of the cluster?
Anyway, use the Configuration class as below while submitting the job:
Configuration conf = new Configuration();
conf.set("mapred.map.child.env", <string value>);
Job job = new Job(conf);
The format of the value is k1=v1,k2=v2.
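For the LD_LIBRARY_PATH case from the question, that would look roughly like this (a sketch; the /usr/local/ path is simply the one you tried in your Runtime.exec call):
Configuration conf = new Configuration();
// propagate LD_LIBRARY_PATH to the child JVMs that run the map tasks
conf.set("mapred.map.child.env", "LD_LIBRARY_PATH=/usr/local/");
Job job = new Job(conf);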