Running Hadoop Job Remotely - hadoop

I am trying to run a MapReduce job from outside the cluster.
For example, the Hadoop cluster runs on Linux machines, and we have a web application running on a Windows machine.
We want to run the Hadoop job from this remote web application, then retrieve the Hadoop output directory and present it as a graph.
We have written the following piece of code:
Configuration conf = new Configuration();
Job job = new Job(conf);
conf.set("mapred.job.tracker", "192.168.56.101:54311");
conf.set("fs.default.name", "hdfs://192.168.56.101:54310");
job.setJarByClass(Analysis.class) ;
//job.setOutputKeyClass(Text.class);
//job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
//job.set
job.setInputFormatClass(CustomFileInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.waitForCompletion(true);
This is the error we get. Even if we shut down the Hadoop 1.1.2 cluster, the error stays the same.
14/03/07 00:23:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/07 00:23:37 ERROR security.UserGroupInformation: PriviledgedActionException as:user cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-user\mapred\staging\user818037780\.staging to 0700
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-user\mapred\staging\user818037780\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:691)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:664)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:550)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:580)
at LineCounter.main(LineCounter.java:86)

When submitting from a remote system, you should run the job as a remote user. You can do that in your main class as follows:
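import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.security.UserGroupInformation;

public static void main(String a[]) {
    // Run the submission as the remote user "root"
    UserGroupInformation ugi
            = UserGroupInformation.createRemoteUser("root");
    try {
        ugi.doAs(new PrivilegedExceptionAction<Void>() {
            public Void run() throws Exception {
                Configuration conf = new Configuration();
                // Set properties before creating the Job: Job copies the Configuration when it is constructed
                conf.set("hadoop.job.ugi", "root");
                Job job = new Job(conf);
                // write your remaining piece of code here.
                return null;
            }
        });
    } catch (Exception e) {
        e.printStackTrace();
    }
}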
Also, when a MapReduce job is submitted, your Java classes and their dependent jars must be copied to the Hadoop cluster, where the job executes. You can read more here.
So you need to create a runnable jar of your code (with main class Analysis in your case) with all dependent jar files in its manifest classpath. Then run your jar file from the command line using
java -jar job-jar-with-dependencies.jar arguments
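As a rough illustration of what "manifest classpath" means: java -jar reads the Main-Class and Class-Path attributes from the jar's META-INF/MANIFEST.MF, and Class-Path entries are resolved relative to the jar's own location. The sketch below parses a manifest with the JDK's java.util.jar classes. The manifest text, the lib/ jar names, and the ManifestSketch class are assumptions made up for this example; only the main class name Analysis comes from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

final class ManifestSketch {
    // Hypothetical manifest for the job jar; the lib/ entries are placeholders.
    static final String MANIFEST_TEXT =
            "Manifest-Version: 1.0\r\n"
            + "Main-Class: Analysis\r\n"
            + "Class-Path: lib/hadoop-core-1.1.2.jar lib/commons-logging.jar\r\n"
            + "\r\n";

    // Parses manifest text and returns its main attribute section.
    static Attributes mainAttributes(String manifestText) {
        try {
            byte[] bytes = manifestText.getBytes(StandardCharsets.UTF_8);
            return new Manifest(new ByteArrayInputStream(bytes)).getMainAttributes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The class whose main method java -jar will start.
    static String mainClass(String manifestText) {
        return mainAttributes(manifestText).getValue(Attributes.Name.MAIN_CLASS);
    }

    // The space-separated dependency jars listed in Class-Path.
    static String[] classPath(String manifestText) {
        return mainAttributes(manifestText).getValue(Attributes.Name.CLASS_PATH).split(" ");
    }

    public static void main(String[] args) {
        System.out.println("Main-Class: " + mainClass(MANIFEST_TEXT));
        for (String entry : classPath(MANIFEST_TEXT)) {
            System.out.println("Class-Path entry: " + entry);
        }
    }
}
```

Build tools can generate such a manifest for you; the point here is only which two attributes the java -jar command depends on.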
HTH!

Related

Accumulo Datastore Throwing Exception about hadoop winutils.exe

I am trying to connect to my Accumulo instance remotely. I started a project with Maven and added all the libraries needed. In this code I set the connection parameters:
public class App{
public static void main(String [] argv){
HashMap<String,String> parametres=new HashMap<>();
parametres.put("accumulo.instance.id","******");
parametres.put("accumulo.zookeepers","accumulo-do");
parametres.put("accumulo.user","root");
parametres.put("accumulo.password","****");
parametres.put("accumulo.catalog","*******");
try
{
DataStore dataStore= DataStoreFinder.getDataStore(parametres);
System.out.println("Succés");
}catch (Exception e){
System.out.println("Exception de Accumulo");
System.out.println(e);
}
}
}
But when I try to run it, I get this error:
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:337)
at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:331)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStore.liftedTree1$1(AccumuloDataStore.scala:66)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStore.<init>(AccumuloDataStore.scala:65)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory.createDataStore(AccumuloDataStoreFactory.scala:50)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory.createDataStore(AccumuloDataStoreFactory.scala:37)
at org.geotools.data.DataAccessFinder.getDataStore(DataAccessFinder.java:130)
at org.geotools.data.DataStoreFinder.getDataStore(DataStoreFinder.java:89)
at test.App.main(App.java:48)
Can you tell me what causes this error?
I am not using Hadoop on Windows; my Hadoop cluster is running on Linux.
How can I prevent this?
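One likely reading of the trace: the Hadoop client classes (here org.apache.hadoop.util.Shell) run inside the local Windows JVM even though the cluster is on Linux. On Windows they look for winutils.exe under a Hadoop home directory taken from the hadoop.home.dir system property or the HADOOP_HOME environment variable; when neither is set the home is null, which would explain the null\bin\winutils.exe in the message. The sketch below mimics that lookup with plain JDK calls so you can see what your JVM would resolve. The class and method names are made up for illustration and are not Hadoop APIs.

```java
import java.io.File;

final class WinutilsLookupSketch {
    // Mirrors the lookup order: system property first, then environment variable.
    static String resolveHadoopHome(String systemProperty, String envVariable) {
        return systemProperty != null ? systemProperty : envVariable;
    }

    // Builds <home>\bin\winutils.exe; a null home concatenates as the text "null".
    static String winutilsPath(String hadoopHome) {
        return hadoopHome + "\\bin\\winutils.exe";
    }

    public static void main(String[] args) {
        String home = resolveHadoopHome(
                System.getProperty("hadoop.home.dir"), System.getenv("HADOOP_HOME"));
        String path = winutilsPath(home);
        System.out.println("Would look for: " + path);
        System.out.println("Exists: " + new File(path).isFile());
    }
}
```

If this reports a null home, a commonly used remedy is to point hadoop.home.dir (or HADOOP_HOME) at a directory whose bin folder contains winutils.exe before any Hadoop class is loaded.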

How to provide remotely located (like HDFS) keystore files to a Kafka client in a Spark direct-stream application

public static void main(String[] args){
Map<String, Object> kafkaParams = new HashMap<String, Object>();
if(args.length < 2){
logger.error("Please provide ssl key location and which env to connect to");
}
else{
ReadAndRelay.path = args[0];
ReadAndRelay.env = args[1];
}
kafkaParams.put("security.protocol", "SSL");
try{
if(ReadAndRelay.env.equals("dev")){
kafkaParams.put("group.id" , "group_id");
kafkaParams.put("ssl.keystore.location", ReadAndRelay.path+"/keystore.jks");
kafkaParams.put("ssl.truststore.location", ReadAndRelay.path+"/truststore.jks");
kafkaParams.put("bootstrap.servers", "bootstrap_servers");
}
}catch(Exception e){
e.printStackTrace();
}
kafkaParams.put("ssl.truststore.password", "truststore_password");
kafkaParams.put("ssl.keystore.password", "keystore_password");
kafkaParams.put("ssl.key.password", "key_password");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList("topic1","topic2");
SparkConf sparkConf = new SparkConf().setAppName("kafka-stream").setMaster("local[4]");
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, new Duration(2000));
final JavaInputDStream<ConsumerRecord<String, String>> stream =
KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
);
}
The code above works when the keystore files are at a local path. How can I keep the keystore files in a common location (like HDFS) and use them in the Spark application to create a direct stream (which creates the Kafka consumer), or to create a Kafka producer for each RDD (these will be executed on the worker nodes / executors)?
When I pass the HDFS file location in the Kafka client properties in the usual way, it throws an error saying the file was not found. What is the right way to provide files in HDFS to the Kafka client properties?
17/03/20 16:18:00 ERROR StreamingContext: Error starting the context, marking it as stopped
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:702)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:557)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:540)
at org.apache.spark.streaming.kafka010.Subscribe.onStart(ConsumerStrategy.scala:83)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.consumer(DirectKafkaInputDStream.scala:75)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.start(DirectKafkaInputDStream.scala:243)
at org.apache.spark.streaming.DStreamGraph$$anonfun$start$5.apply(DStreamGraph.scala:49)
at org.apache.spark.streaming.DStreamGraph$$anonfun$start$5.apply(DStreamGraph.scala:49)
at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach_quick(ParArray.scala:143)
at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach(ParArray.scala:136)
at scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:972)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:969)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:578)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:572)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:556)
at it.gis.servicemanagement.dcap.dsvs.spark.kafka_stream.ReadAndRelay.main(ReadAndRelay.java:168)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: hdfs:/namenode:9000/tmp/kafka_dev_certs/keystore.jks (No such file or directory)
at org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44)
at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)
at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:83)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:623)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:557)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:540)
at org.apache.spark.streaming.kafka010.Subscribe.onStart(ConsumerStrategy.scala:83)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.consumer(DirectKafkaInputDStream.scala:75)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.start(DirectKafkaInputDStream.scala:243)
at org.apache.spark.streaming.DStreamGraph$$anonfun$start$5.apply(DStreamGraph.scala:49)
at org.apache.spark.streaming.DStreamGraph$$anonfun$start$5.apply(DStreamGraph.scala:49)
at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach_quick(ParArray.scala:143)
at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach(ParArray.scala:136)
at scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:972)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:969)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: hdfs:/namenode:9000/tmp/kafka_dev_certs/keystore.jks (No such file or directory)
at org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:110)
at org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41)
... 25 more
Caused by: java.io.FileNotFoundException: hdfs://namenode:9000/tmp/kafka_dev_certs/keystore.jks (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:205)
at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:190)
at org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:126)
at org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:108)
... 26 more
17/03/20 16:18:00 INFO ReceiverTracker: ReceiverTracker stopped
17/03/20 16:18:00 INFO JobGenerator: Stopping JobGenerator immediately
17/03/20 16:18:00 INFO RecurringTimer: Stopped timer for JobGenerator after time -1
17/03/20 16:18:00 INFO JobGenerator: Stopped JobGenerator
17/03/20 16:18:00 INFO JobScheduler: Stopped JobScheduler
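The innermost cause in the trace above is java.io.FileInputStream, called from SslFactory$SecurityStore.load. FileInputStream only reads the local filesystem, so an hdfs:// string is treated as an ordinary (non-existent) local path. The sketch below reproduces just that behaviour with the JDK; the class and method names are invented for the example.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

final class LocalOnlyOpenSketch {
    // Returns true only if the string names a readable file on the local filesystem.
    static boolean opensAsLocalFile(String location) {
        try (FileInputStream in = new FileInputStream(location)) {
            return true;
        } catch (FileNotFoundException e) {
            return false; // same exception type as the innermost cause in the trace
        } catch (IOException e) {
            return false; // failure while closing the stream; treat as not usable
        }
    }

    public static void main(String[] args) {
        String hdfsUri = "hdfs://namenode:9000/tmp/kafka_dev_certs/keystore.jks";
        System.out.println(hdfsUri + " opens as local file: " + opensAsLocalFile(hdfsUri));
    }
}
```

This suggests the keystore and truststore have to be present on the local disk of every JVM that creates a Kafka client, whichever mechanism is used to get them there.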
Late to the party, as they say, but here is what I learned:
spark-submit --files <commaSeparatedList> ...
will copy the files to the working directory of every worker.
For instance, files shipped with spark-submit --files keystore.jks,truststore.jks ... can be used in Spark (Scala) as:
val df = spark
.readStream
.format("kafka")
...
.option("kafka.ssl.truststore.location", "truststore.jks")
.option("kafka.ssl.truststore.password", "")
.option("kafka.ssl.keystore.location", "keystore.jks")
.option("kafka.ssl.keystore.password", "")
...
.load()
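The bare file names work in the options above because a relative path is resolved against the JVM's current working directory, which, as this answer states, is where --files places the copies. A minimal JDK-only sketch of that resolution follows (WorkingDirSketch is a hypothetical name); running something like it on a worker would show whether the files are actually where the Kafka client will look.

```java
import java.io.File;

final class WorkingDirSketch {
    // Resolves a bare file name the way the JVM does: against the current working directory.
    static String resolve(String fileName) {
        return new File(fileName).getAbsolutePath();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"keystore.jks", "truststore.jks"}) {
            File resolved = new File(resolve(name));
            System.out.println(resolved + " readable: " + resolved.canRead());
        }
    }
}
```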
Other posts on SO (I don't have the links handy) suggested calling org.apache.spark.SparkFiles.get("keystore.jks")
and using that location in .option("kafka.ssl.truststore.location", ...), but that still resulted in a FileNotFoundException.
A similarity that I observed:
Files in the resources folder, e.g. resources/input.txt, can be read as Source.fromFile("/input.txt"), which is similar to what --files achieves.

Why is the MapReduce job pointing to localhost:8080?

I am working with a MapReduce job and executing it using ToolRunner's run method.
Here is my code:
public class MaxTemperature extends Configured implements Tool {
public static void main(String[] args) throws Exception {
System.setProperty("hadoop.home.dir", "/");
int exitCode = ToolRunner.run(new MaxTemperature(), args);
System.exit(exitCode);
}
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
System.out.println("Starting job");
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
int returnValue = job.waitForCompletion(true) ? 0:1;
if(job.isSuccessful()) {
System.out.println("Job was successful");
} else if(!job.isSuccessful()) {
System.out.println("Job was not successful");
}
return returnValue;
}
}
The job executed as expected. But when I looked into the logs that display the job tracking information, I found that MapReduce is pointing to localhost:8080 for tracking the job.
Here is a snapshot of the logs:
20521 [main] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
20670 [main] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_local1454583076_0001
20713 [main] WARN org.apache.hadoop.conf.Configuration - file:/tmp/hadoop-KV/mapred/staging/KV1454583076/.staging/job_local1454583076_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
20716 [main] WARN org.apache.hadoop.conf.Configuration - file:/tmp/hadoop-KV/mapred/staging/KV1454583076/.staging/job_local1454583076_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
20818 [main] WARN org.apache.hadoop.conf.Configuration - file:/tmp/hadoop-KV/mapred/local/localRunner/KV/job_local1454583076_0001/job_local1454583076_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
20820 [main] WARN org.apache.hadoop.conf.Configuration - file:/tmp/hadoop-KV/mapred/local/localRunner/KV/job_local1454583076_0001/job_local1454583076_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
**20826 [main] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/**
20827 [main] INFO org.apache.hadoop.mapreduce.Job - Running job: job_local1454583076_0001
20829 [Thread-10] INFO org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter set in config null
So my question is: why is MapReduce pointing to localhost:8080?
The url to track the job: http://localhost:8080/
There is no configuration file or properties file where I set this manually. Also, is it possible to change it to some other port? If yes, how can I achieve this?
The ports are configured in yarn-site.xml.
Check: yarn.resourcemanager.webapp.address
We need to override the default configuration: get a Configuration object, set the properties on it, and then create the Job object from this Configuration, as follows:
Configuration configuration = getConf();
//configuration.set("fs.defaultFS", "hdfs://192.**.***.2**");
//configuration.set("mapred.job.tracker", "jobtracker:jtPort");
configuration.set("mapreduce.jobtracker.address", "localhost:54311");
configuration.set("mapreduce.framework.name", "yarn");
configuration.set("yarn.resourcemanager.address", "127.0.0.1:8032");
//configuration.set("yarn.resourcemanager.webapp.address", "127.0.0.1:8032");
//Initialize the Hadoop job and set the jar as well as the name of the Job
Job job = new Job(configuration);
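A quick way to tell whether a run actually reached the cluster is the job ID in the log. In the question's log the ID starts with job_local and the following line comes from LocalJobRunner, which suggests the job ran in-process; as far as I can tell, the localhost:8080 URL is just a placeholder in that mode rather than a real web UI. The sketch below extracts the ID from a log line and checks the prefix. The class name and the second, cluster-style ID in main are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JobIdSketch {
    private static final Pattern JOB_ID = Pattern.compile("\\bjob_\\w+");

    // Pulls the first job ID out of a log line, or returns null if there is none.
    static String extractJobId(String logLine) {
        Matcher matcher = JOB_ID.matcher(logLine);
        return matcher.find() ? matcher.group() : null;
    }

    // IDs of the form job_local... indicate the in-process local runner.
    static boolean ranLocally(String jobId) {
        return jobId != null && jobId.startsWith("job_local");
    }

    public static void main(String[] args) {
        String line = "20827 [main] INFO org.apache.hadoop.mapreduce.Job - Running job: job_local1454583076_0001";
        String jobId = extractJobId(line);
        System.out.println(jobId + " ran locally: " + ranLocally(jobId));
        // Hypothetical cluster-style ID, for contrast only.
        System.out.println("job_1400000000000_0001 ran locally: " + ranLocally("job_1400000000000_0001"));
    }
}
```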

HBase Bulk Load MapReduce HFile exception (netty jar)

I am attempting to run a simple MapReduce process to write HFiles for later import into an HBase table.
When the job is submitted:
hbase com.pcoa.Driver /test /bulk pcoa
I am getting the following exception, indicating that netty-3.6.6.Final.jar does not exist in HDFS (it does, however, exist locally, as shown here).
-rw-r--r--+ 1 mbeening flprod 1206119 Sep 18 18:25 /dedge1/hadoop/hbase-0.96.1.1-hadoop2/lib/netty-3.6.6.Final.jar
I am afraid that I do not understand how to address this configuration(?) error.
Can anyone provide any advice to me?
Here is the Exception:
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost/dedge1/hadoop/hbase-0.96.1.1-hadoop2/lib/netty-3.6.6.Final.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1110)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:264)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:300)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:387)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286)
at com.pcoa.Driver.main(Driver.java:63)
Here is my driver routine:
public class Driver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "HBase Bulk Import");
job.setJarByClass(HBaseKVMapper.class);
job.setMapperClass(HBaseKVMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
job.setInputFormatClass(TextInputFormat.class);
HTable hTable = new HTable(conf, args[2]);
HFileOutputFormat.configureIncrementalLoad(job, hTable);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
I am not sure why, or whether, I had to do this (I didn't see anything like it in any of the startup docs),
but I ran this:
hdfs dfs -put /hadoop/hbase-0.96.1.1-hadoop2/lib/*.jar /hadoop/hbase-0.96.1.1-hadoop2/lib
And my MR job seems to run now.
If this is the wrong approach, please let me know.
Thanks!
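The exception path is the local jar path with hdfs://localhost in front, which is what you would expect if a path without a scheme was qualified against the default filesystem. The sketch below illustrates that qualification step using java.net.URI as a stand-in; it is not Hadoop's own Path logic, and QualifySketch and its method are hypothetical names.

```java
import java.net.URI;

final class QualifySketch {
    // Keeps paths that already carry a scheme; otherwise resolves them against the default filesystem.
    static String qualify(String defaultFs, String path) {
        URI candidate = URI.create(path);
        if (candidate.getScheme() != null) {
            return candidate.toString();
        }
        return URI.create(defaultFs).resolve(candidate).toString();
    }

    public static void main(String[] args) {
        String jar = "/dedge1/hadoop/hbase-0.96.1.1-hadoop2/lib/netty-3.6.6.Final.jar";
        System.out.println(qualify("hdfs://localhost", jar));
        System.out.println(qualify("hdfs://localhost", "file://" + jar));
    }
}
```

The first call yields the same hdfs://localhost/... location that appears in the exception, while the explicitly file-qualified path is left alone. Copying the jars into HDFS at the same path, as done above, makes the first form resolvable.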

Running another job in Hadoop

I don't understand how to make a job use the same output directory
to write a different file in it. I have tried commenting
and uncommenting this line, but it still doesn't work. I get the following
exception when I comment it out. In the code I am trying to run two
separate jobs with the same reducer but a different mapper.
EDIT: And no, the output of one job is not the input of the other; the reason
I want them in the same folder is that they are inputs to yet another MapReduce
job I want to run.
FileOutputFormat.setOutputPath(job, new Path(args[1]));
11/04/14 13:33:11 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:120)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:770)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at org.myorg.WordCount.main(WordCount.java:123)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
// if (otherArgs.length != 2) {
// System.err.println("Usage: wordcount <in> <out>");
// System.exit(2);
// }
Job job = new Job(conf, "Job1");
job.setJarByClass(WordCount.class);
job.setMapperClass(Mapper1.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
FileSystem hdfs = FileSystem.get(conf);
Path fromPath = new Path("/user/hadoop/output/part-r-00000");
Path toPath = new Path("/user/hadoop/output/output1");
// renaming to output1
boolean isRenamed = hdfs.rename(fromPath, toPath);
if (isRenamed)
{
System.out.println("Renamed to /user/hadoop/output/output1!");
}
else
{
System.out.println("Not Renamed!");
}
job = new Job(conf, "Job2");
job.setJarByClass(WordCount.class);
job.setMapperClass(Mapper2.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// FileInputFormat.addInputPath(job, new Path(args[0]));
// FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit( job.waitForCompletion(true) ? 0 : 1);
}
Adding the following to my code causes other errors:
job.setInputFormatClass(FileInputFormat.class);
job.setOutputFormatClass(FileOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Exception in thread "main" java.lang.RuntimeException: java.lang.InstantiationException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:768)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at org.myorg.WordCount.main(WordCount.java:135)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.InstantiationException
at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:30)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
... 9 more
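The InstantiationException in this trace is what reflection throws when asked to instantiate an abstract class, and FileInputFormat and FileOutputFormat are both abstract; concrete subclasses such as TextInputFormat and TextOutputFormat can be instantiated. The JDK-only sketch below shows the same failure with two stand-in classes. BaseFormat, ConcreteFormat and the helper are hypothetical names, not Hadoop types.

```java
final class AbstractFormatSketch {
    // Stand-ins for an abstract base format and a concrete subclass.
    abstract static class BaseFormat { }
    static class ConcreteFormat extends BaseFormat { }

    // Tries to create an instance through the no-argument constructor, as a framework would.
    static boolean canInstantiate(Class<?> type) {
        try {
            type.getDeclaredConstructor().newInstance();
            return true;
        } catch (InstantiationException e) {
            return false; // thrown for abstract classes
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("BaseFormat instantiable: " + canInstantiate(BaseFormat.class));
        System.out.println("ConcreteFormat instantiable: " + canInstantiate(ConcreteFormat.class));
    }
}
```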
You have to provide a new Configuration object for the second job. BTW, why aren't you using these methods for your input and output formats?
job.setInputFormatClass(FileInputFormat.class);
job.setOutputFormatClass(FileOutputFormat.class);
Here is a blog post about recursing jobs, which is much the same as what you are doing.
http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html
EDIT:
By the way, why do you intend to write into a folder that is the output of the previous job, i.e. the input of the new job? This will just result in another exception like "Output path already exists".
The files don't all need to be in the same directory. Your third job can have multiple input paths (directories or files).
See
FileInputFormat.addInputPaths(JobConf conf, String commaSeparatedPaths)
and friends...
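The second argument of addInputPaths is a single comma-separated string, so the output directories of the first two jobs can simply be joined. A small sketch of building that string follows; the two directory names and the InputPathsSketch class are placeholders invented for this example.

```java
import java.util.Arrays;
import java.util.List;

final class InputPathsSketch {
    // Joins directories into the comma-separated form that addInputPaths expects.
    static String commaSeparated(List<String> directories) {
        return String.join(",", directories);
    }

    public static void main(String[] args) {
        // Hypothetical output directories of the first two jobs.
        List<String> outputs = Arrays.asList("/user/hadoop/job1-out", "/user/hadoop/job2-out");
        System.out.println(commaSeparated(outputs));
    }
}
```

Each job then writes to its own, not yet existing, output directory, and the third job reads from both.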
