I am attempting to run a simple MapReduce process to write HFiles for later import into an HBase table.
When the job is submitted:
hbase com.pcoa.Driver /test /bulk pcoa
I am getting the following exception, indicating that netty-3.6.6.Final.jar does not exist in HDFS (it does, however, exist locally):
-rw-r--r--+ 1 mbeening flprod 1206119 Sep 18 18:25 /dedge1/hadoop/hbase-0.96.1.1-hadoop2/lib/netty-3.6.6.Final.jar
I am afraid I do not understand how to address this configuration(?) error.
Can anyone offer any advice?
Here is the Exception:
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost/dedge1/hadoop/hbase-0.96.1.1-hadoop2/lib/netty-3.6.6.Final.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1110)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:264)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:300)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:387)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286)
at com.pcoa.Driver.main(Driver.java:63)
Here is my driver routine:
public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "HBase Bulk Import");
        job.setJarByClass(HBaseKVMapper.class);

        job.setMapperClass(HBaseKVMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        job.setInputFormatClass(TextInputFormat.class);

        HTable hTable = new HTable(conf, args[2]);
        HFileOutputFormat.configureIncrementalLoad(job, hTable);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
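For context on the "later import" step: once this job has written the HFiles, they are typically loaded into the table with LoadIncrementalHFiles (a.k.a. completebulkload). Below is a minimal sketch, assuming the table name and output directory from the command line above and the org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles class shipped with HBase 0.96.

Configuration conf = HBaseConfiguration.create();
HTable hTable = new HTable(conf, "pcoa");               // same table as args[2] above
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path("/bulk"), hTable);           // same directory as args[1] above
hTable.close();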
I am not sure why (or whether) I had to do this; I didn't see anything like it in any of the setup docs. But I ran the following:
hdfs dfs -put /hadoop/hbase-0.96.1.1-hadoop2/lib/*.jar /hadoop/hbase-0.96.1.1-hadoop2/lib
...and now my MR job seems to run.
If this is the wrong approach, please let me know.
Thanks!
public static void main(String[] args) {
    Map<String, Object> kafkaParams = new HashMap<String, Object>();

    if (args.length < 2) {
        logger.error("Please provide ssl key location and which env to connect to");
    } else {
        ReadAndRelay.path = args[0];
        ReadAndRelay.env = args[1];
    }

    kafkaParams.put("security.protocol", "SSL");

    try {
        if (ReadAndRelay.env.equals("dev")) {
            kafkaParams.put("group.id", "group_id");
            kafkaParams.put("ssl.keystore.location", ReadAndRelay.path + "/keystore.jks");
            kafkaParams.put("ssl.truststore.location", ReadAndRelay.path + "/truststore.jks");
            kafkaParams.put("bootstrap.servers", "bootstrap_servers");
        }
    } catch (Exception e) {
        e.printStackTrace();
    }

    kafkaParams.put("ssl.truststore.password", "truststore_password");
    kafkaParams.put("ssl.keystore.password", "keystore_password");
    kafkaParams.put("ssl.key.password", "key_password");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("auto.offset.reset", "latest");
    kafkaParams.put("enable.auto.commit", false);

    Collection<String> topics = Arrays.asList("topic1", "topic2");

    SparkConf sparkConf = new SparkConf().setAppName("kafka-stream").setMaster("local[4]");
    JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, new Duration(2000));

    final JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            streamingContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
        );
}
The code above works when the keystore file paths are local. How can I keep the keystore files in a common location (like HDFS) and use them in the Spark application, either to create the direct stream (which creates the Kafka consumer) or to create a Kafka producer for each RDD (since these are executed on the worker nodes / executors)?
When I try to use the HDFS file location in the Kafka client properties as usual, it throws an error saying the file was not found. What is the right way to provide files stored in HDFS to the Kafka client properties?
17/03/20 16:18:00 ERROR StreamingContext: Error starting the context, marking it as stopped
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:702)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:557)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:540)
at org.apache.spark.streaming.kafka010.Subscribe.onStart(ConsumerStrategy.scala:83)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.consumer(DirectKafkaInputDStream.scala:75)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.start(DirectKafkaInputDStream.scala:243)
at org.apache.spark.streaming.DStreamGraph$$anonfun$start$5.apply(DStreamGraph.scala:49)
at org.apache.spark.streaming.DStreamGraph$$anonfun$start$5.apply(DStreamGraph.scala:49)
at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach_quick(ParArray.scala:143)
at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach(ParArray.scala:136)
at scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:972)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:969)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
at ... run in separate thread using org.apache.spark.util.ThreadUtils ... ()
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:578)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:572)
at org.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:556)
at it.gis.servicemanagement.dcap.dsvs.spark.kafka_stream.ReadAndRelay.main(ReadAndRelay.java:168)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: hdfs:/namenode:9000/tmp/kafka_dev_certs/keystore.jks (No such file or directory)
at org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44)
at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)
at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:83)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:623)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:557)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:540)
at org.apache.spark.streaming.kafka010.Subscribe.onStart(ConsumerStrategy.scala:83)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.consumer(DirectKafkaInputDStream.scala:75)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.start(DirectKafkaInputDStream.scala:243)
at org.apache.spark.streaming.DStreamGraph$$anonfun$start$5.apply(DStreamGraph.scala:49)
at org.apache.spark.streaming.DStreamGraph$$anonfun$start$5.apply(DStreamGraph.scala:49)
at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach_quick(ParArray.scala:143)
at scala.collection.parallel.mutable.ParArray$ParArrayIterator.foreach(ParArray.scala:136)
at scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:972)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:969)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: hdfs:/namenode:9000/tmp/kafka_dev_certs/keystore.jks (No such file or directory)
at org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:110)
at org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41)
... 25 more
Caused by: java.io.FileNotFoundException: hdfs://namenode:9000/tmp/kafka_dev_certs/keystore.jks (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:205)
at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:190)
at org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:126)
at org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:108)
... 26 more
17/03/20 16:18:00 INFO ReceiverTracker: ReceiverTracker stopped
17/03/20 16:18:00 INFO JobGenerator: Stopping JobGenerator immediately
17/03/20 16:18:00 INFO RecurringTimer: Stopped timer for JobGenerator after time -1
17/03/20 16:18:00 INFO JobGenerator: Stopped JobGenerator
17/03/20 16:18:00 INFO JobScheduler: Stopped JobScheduler
Late to the party, as they say, but here is what I learned:
spark-submit --files <commaSeparatedList> ...
will copy the listed files to the working directory of all the executors.
For instance, spark-submit --files keystore.jks,truststore.jks ... can then be used in Spark (Scala) as:
val df = spark
.readStream
.format("kafka")
...
.option("kafka.ssl.truststore.location", "truststore.jks")
.option("kafka.ssl.truststore.password", "")
.option("kafka.ssl.keystore.location", "keystore.jks")
.option("kafka.ssl.keystore.password", "")
...
.load()
Other posts on SO (I don't have the links handy) suggested using org.apache.spark.SparkFiles.get("keystore.jks") and passing that location to .option("kafka.ssl.truststore.location", ...), but that still resulted in a FileNotFoundException.
A similarity I observed: files in the resources folder, e.g. resources/input.txt, can be read with Source.fromFile("/input.txt"), which is similar to what is being achieved with --files.
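For the Java/DStream code in the question above, the same idea would look roughly like the sketch below. This assumes the job is launched with spark-submit --files /path/to/keystore.jks,/path/to/truststore.jks so that the bare file names resolve against the working directory the files are shipped to; whether that also covers the driver-side consumer that createDirectStream builds may depend on the deploy mode.

Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("security.protocol", "SSL");
// bare file names: the keystores were shipped alongside the job with --files (assumption, see above)
kafkaParams.put("ssl.keystore.location", "keystore.jks");
kafkaParams.put("ssl.truststore.location", "truststore.jks");
// ...bootstrap.servers, passwords, deserializers and the rest exactly as in the original code...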
I am trying to run a MapReduce job from outside the cluster. For example, the Hadoop cluster runs on Linux machines, and we have a web application running on a Windows machine. We want to run the Hadoop job from this remote web application, and then retrieve the Hadoop output directory and present it as a graph.
We have written the following piece of code:
Configuration conf = new Configuration();
Job job = new Job(conf);
conf.set("mapred.job.tracker", "192.168.56.101:54311");
conf.set("fs.default.name", "hdfs://192.168.56.101:54310");
job.setJarByClass(Analysis.class);
//job.setOutputKeyClass(Text.class);
//job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
//job.set
job.setInputFormatClass(CustomFileInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.waitForCompletion(true);
This is the error we get. Even if we shut down the Hadoop 1.1.2 cluster, the error is still the same.
14/03/07 00:23:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/07 00:23:37 ERROR security.UserGroupInformation: PriviledgedActionException as:user cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-user\mapred\staging\user818037780\.staging to 0700
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-user\mapred\staging\user818037780\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:691)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:664)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:550)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:580)
at LineCounter.main(LineCounter.java:86)
When running from a remote system, you should run as the remote user. You can do it in your main class as follows:
public static void main(String a[]) {
    UserGroupInformation ugi
        = UserGroupInformation.createRemoteUser("root");
    try {
        ugi.doAs(new PrivilegedExceptionAction<Void>() {
            public Void run() throws Exception {
                Configuration conf = new Configuration();
                Job job = new Job(conf);
                conf.set("hadoop.job.ugi", "root");
                // write your remaining piece of code here.
                return null;
            }
        });
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Also, when submitting a MapReduce job, your Java classes along with their dependent jars should be copied to the Hadoop cluster, where the job executes. You can read more here.
So you need to create a runnable jar of your code (with main class Analysis in your case) with all dependent jar files in its manifest classpath. Then run your jar file from the command line using
java -jar job-jar-with-dependencies.jar arguments
HTH!
I am using Amazon EMR. I have a map / reduce job configured like this:
private static final String TEMP_PATH_PREFIX = System.getProperty("java.io.tmpdir") + "/dmp_processor_tmp";
...
private Job setupProcessorJobS3() throws IOException, DataGrinderException {
String inputFiles = System.getProperty(DGConfig.INPUT_FILES);
Job processorJob = new Job(getConf(), PROCESSOR_JOBNAME);
processorJob.setJarByClass(DgRunner.class);
processorJob.setMapperClass(EntityMapperS3.class);
processorJob.setReducerClass(SelectorReducer.class);
processorJob.setOutputKeyClass(Text.class);
processorJob.setOutputValueClass(Text.class);
FileOutputFormat.setOutputPath(processorJob, new Path(TEMP_PATH_PREFIX));
processorJob.setOutputFormatClass(TextOutputFormat.class);
processorJob.setInputFormatClass(NLineInputFormat.class);
FileInputFormat.setInputPaths(processorJob, inputFiles);
NLineInputFormat.setNumLinesPerSplit(processorJob, 10000);
return processorJob;
}
In my mapper class, I have:
private Text outkey = new Text();
private Text outvalue = new Text();
...
outkey.set(entity.getEntityId().toString());
outvalue.set(input.getId().toString());
printLog("context write");
context.write(outkey, outvalue);
This last line (context.write(outkey, outvalue);) causes the exception below. Of course, neither outkey nor outvalue is null.
2013-10-24 05:48:48,422 INFO com.s1mbi0se.grinder.core.mapred.EntityMapperCassandra (main): Current Thread: Thread[main,5,main]Current timestamp: 1382593728422 context write
2013-10-24 05:48:48,422 ERROR com.s1mbi0se.grinder.core.mapred.EntityMapperCassandra (main): Error on entitymapper for input: 03a07858-4196-46dd-8a2c-23dd824d6e6e
java.lang.NullPointerException
at java.lang.System.arraycopy(Native Method)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1293)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1210)
at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
at org.apache.hadoop.io.Text.write(Text.java:281)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1077)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:698)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.s1mbi0se.grinder.core.mapred.EntityMapper.map(EntityMapper.java:78)
at com.s1mbi0se.grinder.core.mapred.EntityMapperS3.map(EntityMapperS3.java:34)
at com.s1mbi0se.grinder.core.mapred.EntityMapperS3.map(EntityMapperS3.java:14)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
2013-10-24 05:48:48,422 INFO com.s1mbi0se.grinder.core.mapred.EntityMapperS3 (main): Current Thread: Thread[main,5,main]Current timestamp: 1382593728422 Entity Mapper end
The first records in each task are processed fine. At some point during the task, I start getting this exception over and over, and after that the task doesn't process a single record anymore.
I tried setting TEMP_PATH_PREFIX to "s3://mybucket/dmp_processor_tmp", but the same thing happened.
Any idea why this is happening? What could be preventing Hadoop from writing to its output?
How can I run the code below from the command line?
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.*;
public class MyHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            HTable table = new HTable(conf, "test-table");
            Put put = new Put(Bytes.toBytes("test-key"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
            table.put(put);
        } finally {
            admin.close();
        }
    }
}
How do I set my HBase classpath? I am getting a huge string as my classpath.
UPDATE
root# vi MyHBase.java
hbase-0.92.2 root# java -classpath `hbase classpath`:./ /var/root/MyHBase
-sh: hbase: command not found
Exception in thread "main" java.lang.NoClassDefFoundError: /var/root/MyHBase
Caused by: java.lang.ClassNotFoundException: .var.root.MyHBase
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
I am able to do
hbase-0.92.2 root# bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.2, r1379292, Fri Aug 31 13:13:53 UTC 2012
hbase(main):001:0> exit
Am I doing anything wrong?
You can also do something like this:
# export HADOOP_CLASSPATH=`./hbase classpath`
and then bundle this Java class into a jar to run it within the Hadoop cluster, like this:
#hadoop jar <jarfile> <mainclass>
Use this
java -classpath `hbase classpath`:./ MyHBase
This gets the entire classpath string that HBase uses and also adds the current directory to the classpath.
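If the hbase script itself is not on your PATH (hence the "command not found" above), the same approach should work by calling the script by its path from inside the install directory and passing the class name rather than a file path, assuming MyHBase.class is in the current directory:
hbase-0.92.2 root# java -classpath `bin/hbase classpath`:./ MyHBase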
Cheers
Rags
I don't understand how to make a job use the same output directory to write a different file in it. I have tried commenting and uncommenting this line, but it still doesn't work. I get the following exception when I comment it out. In any case, in the code I am trying to run two separate jobs with the same reducer but a different mapper.
EDIT: And no, the output of one job is not the input of the other; the reason I want them in the same folder is that they are inputs to yet another MapReduce job I want to run.
FileOutputFormat.setOutputPath(job, new Path(args[1]));
11/04/14 13:33:11 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:120)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:770)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at org.myorg.WordCount.main(WordCount.java:123)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    // if (otherArgs.length != 2) {
    //     System.err.println("Usage: wordcount <in> <out>");
    //     System.exit(2);
    // }
    Job job = new Job(conf, "Job1");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Mapper1.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);

    FileSystem hdfs = FileSystem.get(conf);
    Path fromPath = new Path("/user/hadoop/output/part-r-00000");
    Path toPath = new Path("/user/hadoop/output/output1");

    // renaming to output1
    boolean isRenamed = hdfs.rename(fromPath, toPath);
    if (isRenamed) {
        System.out.println("Renamed to /user/hadoop/output/output1!");
    } else {
        System.out.println("Not Renamed!");
    }

    job = new Job(conf, "Job2");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Mapper2.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // FileInputFormat.addInputPath(job, new Path(args[0]));
    // FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Adding the following to my code causes other errors:
job.setInputFormatClass(FileInputFormat.class);
job.setOutputFormatClass(FileOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Exception in thread "main" java.lang.RuntimeException: java.lang.InstantiationException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:768)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at org.myorg.WordCount.main(WordCount.java:135)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.InstantiationException
at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:30)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
... 9 more
You have to provide a new Configuration object for the second job. BTW, why aren't you using these methods for your input and output formats?
job.setInputFormatClass(FileInputFormat.class);
job.setOutputFormatClass(FileOutputFormat.class);
Here is a blog post about Hadoop job recursion; it's quite similar to what you are doing:
http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html
EDIT:
By the way, why do you intend to write into a folder that is the output of the previous job, i.e. the input of the new job? That will just result in another exception, like "Output path already exists".
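Putting those points together, here is a sketch of how the second job could be wired up: a fresh Configuration, concrete input/output format classes (TextInputFormat and TextOutputFormat rather than the abstract FileInputFormat and FileOutputFormat, which cannot be instantiated; that is presumably what the InstantiationException above is about), and its own output directory. The "_job2" suffix is just a placeholder.

Configuration conf2 = new Configuration();
Job job2 = new Job(conf2, "Job2");
job2.setJarByClass(WordCount.class);
job2.setMapperClass(Mapper2.class);
job2.setCombinerClass(IntSumReducer.class);
job2.setReducerClass(IntSumReducer.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(IntWritable.class);
job2.setInputFormatClass(TextInputFormat.class);    // concrete class, unlike FileInputFormat
job2.setOutputFormatClass(TextOutputFormat.class);  // concrete class, unlike FileOutputFormat
FileInputFormat.addInputPath(job2, new Path(args[0]));
FileOutputFormat.setOutputPath(job2, new Path(args[1] + "_job2"));  // separate output dir
System.exit(job2.waitForCompletion(true) ? 0 : 1);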
The files don't all need to be in the same directory. Your third job can have multiple input paths (directories or files).
see
FileInputFormat.addInputPaths(JobConf conf, String commaSeparatedPaths)
and friends...
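With the newer org.apache.hadoop.mapreduce API used in the code above, a sketch for the third job might look like this (the second input path and the output path are placeholders):

Job job3 = new Job(new Configuration(), "Job3");
// several inputs in one call, comma separated
FileInputFormat.addInputPaths(job3, "/user/hadoop/output/output1,/user/hadoop/output/output2");
// or added one at a time
// FileInputFormat.addInputPath(job3, new Path("/user/hadoop/some/other/input"));
FileOutputFormat.setOutputPath(job3, new Path("/user/hadoop/output3"));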