Kite Dataset map-reduce

Kite Dataset map-reduce - hadoop

I am trying to do map-reduce with kite-dataset api.
I have followed below urls to refer.
https://community.cloudera.com/t5/Kite-SDK-includes-Morphlines/Map-Reduce-with-Kite/td-p/22165
https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-mapreduce/src/test/java/org/kitesdk/data/mapreduce/TestMapReduce.java
My code snippet as below
public class MapReduce {
private static final String sourceDatasetURI = "dataset:hive:test_avro";
private static final String destinationDatasetURI = "dataset:hive:test_parquet";
private static class LineCountMapper
extends Mapper<GenericData.Record, Void, Text, IntWritable> {
#Override
protected void map(GenericData.Record record, Void value,
Context context)
throws IOException, InterruptedException {
System.out.println("Record is "+record);
context.write(new Text(record.get("index").toString()), new IntWritable(1));
}
}
private Job createJob() throws Exception {
System.out.println("Inside Create Job");
Job job = new Job();
job.setJarByClass(getClass());
Dataset<GenericData.Record> inputDataset = Datasets.load(sourceDatasetURI, GenericData.Record.class);
Dataset<GenericData.Record> outputDataset = Datasets.load(destinationDatasetURI, GenericData.Record.class);
DatasetKeyInputFormat.configure(job).readFrom(inputDataset).withType(GenericData.Record.class);
job.setMapperClass(LineCountMapper.class);
DatasetKeyOutputFormat.configure(job).writeTo(outputDataset).withType(GenericData.Record.class);
job.waitForCompletion(true);
return job;
}
public static void main(String[] args) throws Exception {
MapReduce httAvroToParquet = new MapReduce();
httAvroToParquet.createJob();
}
}
I am using HDP 2.3.2 box ,creating assembly jar and submitting my application through spark-submit.
I am getting below error when I submit my application.
2015-12-18 04:09:07,156 WARN [main] org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2015-12-18 04:09:07,282 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: OutputCommitter set in config null
2015-12-18 04:09:07,333 WARN [main] org.kitesdk.data.spi.Registration: Not loading URI patterns in org.kitesdk.data.spi.hive.Loader
2015-12-18 04:09:07,334 INFO [main] org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive://{}:9083/default/test_parquet. Check that JARs for hive datasets are on the classpath.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive://{}:9083/default/test_parquet. Check that JARs for hive datasets are on the classpath.
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:478)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:458)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1560)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:458)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:377)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1518)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1515)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1448)
Caused by: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive://{}:9083/default/test_parquet. Check that JARs for hive datasets are on the classpath.
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:109)
at org.kitesdk.data.Datasets.load(Datasets.java:103)
at org.kitesdk.data.Datasets.load(Datasets.java:165)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.load(DatasetKeyOutputFormat.java:510)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.getOutputCommitter(DatasetKeyOutputFormat.java:473)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:476)
... 11 more
I am not getting what's wrong ? Is there any class-path problem ? If yes then where do I set it ?

You effectively have a classpath problem
Your project is missing org.kitesdk:kite-data-hive
You can
Add this jar to your fat jar before submitting it to Spark
Tells Spark to add it to your classpath when you submit

Related

Why is map reduce job poinitng to localhost:8080?

I am working with Map Reduce job and executing it using ToolRunner's run method.
Here is my code:
public class MaxTemperature extends Configured implements Tool {
public static void main(String[] args) throws Exception {
System.setProperty("hadoop.home.dir", "/");
int exitCode = ToolRunner.run(new MaxTemperature(), args);
System.exit(exitCode);
}
#Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
System.out.println("Starting job");
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
int returnValue = job.waitForCompletion(true) ? 0:1;
if(job.isSuccessful()) {
System.out.println("Job was successful");
} else if(!job.isSuccessful()) {
System.out.println("Job was not successful");
}
return returnValue;
}
}
The job executed well as expected. But when i looked into the logs which displays the information abou the job tracking, I found that the Map reduce is pointing to localhost:8080 for the tracking of the job.
Here is the snapshot of logs:
20521 [main] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
20670 [main] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_local1454583076_0001
20713 [main] WARN org.apache.hadoop.conf.Configuration - file:/tmp/hadoop-KV/mapred/staging/KV1454583076/.staging/job_local1454583076_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
20716 [main] WARN org.apache.hadoop.conf.Configuration - file:/tmp/hadoop-KV/mapred/staging/KV1454583076/.staging/job_local1454583076_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
20818 [main] WARN org.apache.hadoop.conf.Configuration - file:/tmp/hadoop-KV/mapred/local/localRunner/KV/job_local1454583076_0001/job_local1454583076_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
20820 [main] WARN org.apache.hadoop.conf.Configuration - file:/tmp/hadoop-KV/mapred/local/localRunner/KV/job_local1454583076_0001/job_local1454583076_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
**20826 [main] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/**
20827 [main] INFO org.apache.hadoop.mapreduce.Job - Running job: job_local1454583076_0001
20829 [Thread-10] INFO org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter set in config null
So my question is why is map reduce pointing to localhost:8080
The url to track the job: http://localhost:8080/
There is no configuration file or properties file where i manually set this. Also, is it possible that i can change it to some other port? If yes, how can i achieve this?

So the ports are configured in yarn-site.xml : yarn-site.xml
Check : yarn.resourcemanager.webapp.address

We need to change the default configuration and create a Configuration object and set the properties to this configuration object and then create a Job object using this Configuration as follows:
Configuration configuration = getConf();
//configuration.set("fs.defaultFS", "hdfs://192.**.***.2**");
//configuration.set("mapred.job.tracker", "jobtracker:jtPort");
configuration.set("mapreduce.jobtracker.address", "localhost:54311");
configuration.set("mapreduce.framework.name", "yarn");
configuration.set("yarn.resourcemanager.address", "127.0.0.1:8032");
//configuration.set("yarn.resourcemanager.webapp.address", "127.0.0.1:8032");
//Initialize the Hadoop job and set the jar as well as the name of the Job
Job job = new Job(configuration);

InvocationTargetException in Yarn task with Hadoop

While running Kafka -> Apache Apex ->Hbase, it is saying following exception in Yarn tasks:
com.datatorrent.stram.StreamingAppMasterService: Application master, appId=4, clustertimestamp=1479188884109, attemptId=2
2016-11-15 11:59:51,068 INFO org.apache.hadoop.service.AbstractService: Service com.datatorrent.stram.StreamingAppMasterService failed in state INITED; cause: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.fs.AbstractFileSystem.newInstance(AbstractFileSystem.java:130)
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:156)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:241)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:333)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:330)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:330)
at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:444)
And my DataTorrent log showing the following exception. I am running the app which communicates Kafka -> Apex -> Hbase streaming application.
Connecting to ResourceManager at hduser1/127.0.0.1:8032
16/11/15 17:47:38 WARN client.EventsAgent: Cannot read events for application_1479208737206_0008: java.io.FileNotFoundException: File does not exist: /user/hduser1/datatorrent/apps/application_1479208737206_0008/events/index.txt
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1893)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1834)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1814)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1786)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:552)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:362)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)
Adding the code :
public void populateDAG(DAG dag, Configuration conf){
KafkaSinglePortInputOperator in
= dag.addOperator("kafkaIn", new KafkaSinglePortInputOperator());
in.setInitialOffset(AbstractKafkaInputOperator.InitialOffset.EARLIEST.name());
LineOutputOperator out = dag.addOperator("fileOut", new LineOutputOperator());
dag.addStream("data", in.outputPort, out.input);}
LineOutputOperator extends AbstractFileOutputOperator
private static final String NL = System.lineSeparator();
private static final Charset CS = StandardCharsets.UTF_8;
#NotNull
private String baseName;
#Override
public byte[] getBytesForTuple(byte[] t) {
String result = new String(t, CS) + NL;
return result.getBytes(CS);
}
#Override
protected String getFileName(byte[] tuple) {
return baseName;
}
public String getBaseName() { return baseName; }
public void setBaseName(String v) { baseName = v; }
How to resolve this problem?
Thanks.

Can you share some details about your environment like what version of hadoop and apex ? Also, which log does this exception appear in ?
Just as a simple sanity check, can you run the simple maven archetype generated application described at: http://docs.datatorrent.com/beginner/
If that works, try running the fileIO and kafka applications at:
https://github.com/DataTorrent/examples/tree/master/tutorials
If those work ok we can look at the details of your code.

I got the solution,
The problem related to expiry of my license, So reinstalled new one and works fine for actual code.

DistributedCache - third party jar not found

I'm trying to get a hold of DistributedCache. I'm using Apache Hadoop 1.2.1 on two nodes.
I referred to the Cloudera post which is simply extended in the other posts that explain how to use third-party jars using -libjars
Note:
In my jar, I haven't included any jar libs. - neither Hadoop core nor commons lang.
The code :
public class WordCounter extends Configured implements Tool {
#Override
public int run(String[] args) throws Exception {
// TODO Auto-generated method stub
// Job job = new Job(getConf(), args[0]);
Job job = new Job(super.getConf(), args[0]);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setJarByClass(WordCounter.class);
FileInputFormat.setInputPaths(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.setMapperClass(WCMapper.class);
job.setReducerClass(WCReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
int jobState = job.waitForCompletion(true) ? 0 : 1;
return jobState;
}
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
if (args == null || args.length < 3) {
System.out.println("The below three arguments are expected");
System.out
.println("<job name> <hdfs path of the input file> <hdfs path of the output file>");
return;
}
WordCounter wordCounter = new WordCounter();
// System.exit(ToolRunner.run(wordCounter, args));
System.exit(ToolRunner.run(new Configuration(), wordCounter, args));
}
}
The Mapper class is naive, its only attempting to use the StringUtils from Apache Commons(and NOT hadoop)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
* #author 298790
*
*/
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private static IntWritable one = new IntWritable(1);
#Override
protected void map(
LongWritable key,
Text value,
org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
StringTokenizer strTokenizer = new StringTokenizer(value.toString());
Text token = new Text();
while (strTokenizer.hasMoreTokens()) {
token.set(strTokenizer.nextToken());
context.write(token, one);
}
System.out.println("Converting " + value + " to upper case "
+ StringUtils.upperCase(value.toString()));
}
}
The commands that I use :
bigdata#slave3:~$ export HADOOP_CLASSPATH=dumphere/lib/commons-lang3-3.1.jar
bigdata#slave3:~$
bigdata#slave3:~$ echo $HADOOP_CLASSPATH
dumphere/lib/commons-lang3-3.1.jar
bigdata#slave3:~$
bigdata#slave3:~$ echo $LIBJARS
dumphere/lib/commons-lang3-3.1.jar
bigdata#slave3:~$ hadoop jar dumphere/code/jars/hdp_3rdparty.jar com.hadoop.basics.WordCounter "WordCount" "/input/dumphere/Childhood_days.txt" "/output/dumphere/wc" -libjars ${LIBJARS}
The exception I get :
Warning: $HADOOP_HOME is deprecated.
14/08/13 21:56:05 INFO input.FileInputFormat: Total input paths to process : 1
14/08/13 21:56:05 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/08/13 21:56:05 WARN snappy.LoadSnappy: Snappy native library not loaded
14/08/13 21:56:05 INFO mapred.JobClient: Running job: job_201408111719_0190
14/08/13 21:56:06 INFO mapred.JobClient: map 0% reduce 0%
14/08/13 21:56:37 INFO mapred.JobClient: Task Id : attempt_201408111719_0190_m_000000_0, Status : FAILED
Error: java.lang.ClassNotFoundException: org.apache.commons.lang3.StringUtils
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at com.hadoop.basics.WCMapper.map(WCMapper.java:40)
at com.hadoop.basics.WCMapper.map(WCMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
14/08/13 21:56:42 INFO mapred.JobClient: Task Id : attempt_201408111719_0190_m_000000_1, Status : FAILED
Error: java.lang.ClassNotFoundException: org.apache.commons.lang3.StringUtils
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
at com.hadoop.basics.WCMapper.map(WCMapper.java:40)
at com.hadoop.basics.WCMapper.map(WCMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
The Cloudera post mentions :
The jar will be placed in distributed cache and will be made available to all of the job’s task attempts. More specifically, you will find the JAR in one of the ${mapred.local.dir}/taskTracker/archive/${user.name}/distcache/… subdirectories on local nodes.
But on that path, I'm not able to find the commons-lang3-3.1.jar
What am I missing?

Hadoop: ClassNotFoundException - org.apache.hcatalog.rcfile.RCFileMapReduceOutputFormat

I'm facing ClassNotFoundException, when I run my job for the class org.apache.hcatalog.rcfile.RCFileMapReduceOutputFormat.
I tried to pass the additional jar files with -libjars, still I am facing the same issue. Any suggestions will be greatly helpful. Thanks in advance.
Below is the command I am using and exception I am facing!
hadoop jar MyJob.jar MyDriver -libjars hcatalog-core-0.5.0-cdh4.4.0.jar inputDir OutputDir
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hcatalog/rcfile/RCFileMapReduceOutputFormat
at com.cloudera.sa.omniture.mr.OmnitureToRCFileJob.run(OmnitureToRCFileJob.java:91)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.cloudera.sa.omniture.mr.OmnitureToRCFileJob.main(OmnitureToRCFileJob.java:131)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.ClassNotFoundException: org.apache.hcatalog.rcfile.RCFileMapReduceOutputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 8 more
I implemented ToolRunner as well, below is the code which confirms that!
public class OmnitureToRCFileJob extends Configured implements Tool {
public static void main(String[] args) throws Exception {
OmnitureToRCFileJob processor = new OmnitureToRCFileJob();
String[] otherArgs = new GenericOptionsParser(processor.getConf(), args).getRemainingArgs();
System.exit(ToolRunner.run(processor.getConf(), processor, otherArgs));
}
}

Did you try running by giving full path of "hcatalog-core-0.5.0-cdh4.4.0.jar" jar file in your below line.
hadoop jar MyJob.jar MyDriver -libjars hcatalog-core-0.5.0-cdh4.4.0.jar inputDir OutputDir
or
Below configuration should also work for you
$ export LIBJARS= <fullpath>/hcatalog-core-0.5.0-cdh4.4.0.jar
$hadoop jar MyJob.jar MyDriver -libjars ${LIBJARS} inputDir OutputDir

If you look at hadoop command documentation, you can see that -libjars is a generic option. For parsing generic option, you got to override the ToolRunner.run() method in your driver class as follows :
public class TestDriver extends Configured implements Tool {
#Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
# Job configuration details
# Job submission
return 0;
}
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new TestDriver(), args);
System.exit(exitCode);
}
I'nk you are getting this exception from your driver code itself. Setting hcatalog-cor*.jar using -libjars option may not be available in client JVM(JVM in which driver code runs). Better you need to set this jar in HADOOP_CLASSPATH environment variable before executing the same using hadoop jar as follows
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:<PATH-TO-HCAT-LIB>/hcatalog-core-0.5.0-cdh4.4.0.jar;
hadoop jar MyJob.jar MyDriver -libjars hcatalog-core-0.5.0-cdh4.4.0.jar inputDir OutputDir

Id had the same problem but found out the jar command doesn't accept the --libjars argument.
"Specify comma separated jar files to include in the classpath. Applies only to job." --> Hadoop Cli Generic Options
Instead you should use the env vars to add additional or replace jars.
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CLASSPATH="./lib/*"

hadoop not running in the multinode cluster

I have a jar file "Tsp.jar" that I made myself. This same jar files executes well in single node cluster setup of hadoop. However when I run it on a cluster comprising 2 machines, a laptop and desktop it gives me an exception when the map function reach 50%. Here is the output
`hadoop#psycho-O:/usr/local/hadoop$ bin/hadoop jar Tsp.jar clust-Tsp_ip1 clust_Tsp_op4
11/04/27 16:13:06 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/04/27 16:13:06 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
11/04/27 16:13:06 INFO mapred.FileInputFormat: Total input paths to process : 1
11/04/27 16:13:06 INFO mapred.JobClient: Running job: job_201104271608_0001
11/04/27 16:13:07 INFO mapred.JobClient: map 0% reduce 0%
11/04/27 16:13:17 INFO mapred.JobClient: map 50% reduce 0%
11/04/27 16:13:20 INFO mapred.JobClient: Task Id : attempt_201104271608_0001_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Tsp$TspReducer
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841)
at org.apache.hadoop.mapred.JobConf.getCombinerClass(JobConf.java:853)
at org.apache.hadoop.mapred.Task$CombinerRunner.create(Task.java:1100)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:812)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Tsp$TspReducer
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833)
... 6 more
Caused by: java.lang.ClassNotFoundException: Tsp$TspReducer
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:807)
... 7 more
11/04/27 16:13:20 WARN mapred.JobClient: Error reading task outputemil-desktop
11/04/27 16:13:20 WARN mapred.JobClient: Error reading task outputemil-desktop
^Z
[1]+ Stopped bin/hadoop jar Tsp.jar clust-Tsp_ip1 clust_Tsp_op4
hadoop#psycho-O:~$ jps
4937 Jps
3976 RunJar
`
Alse the cluster worked fine executing the wordcount example. So I guess its the problem with the Tsp.jar file.
1) Is it necessary to have a jar file to run on a cluster?
2) Here I tried to run a jar file in the cluster which I made. But is still gives a warning that jar file is not found. Why is that?
3) What all should be taken care of when running a jar file? Like what all it must contain other than the program which I wrote? My jar file contains a a Tsp.class, Tsp$TspReducer.class and a Tsp$TspMapper.class. The terminal says it cant find Tsp$TspReducer when it is already there in the jar file.
Thankyou
EDIT
public class Tsp {
public static void main(String[] args) throws IOException {
JobConf conf = new JobConf(Tsp.class);
conf.setJobName("Tsp");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(TspMapper.class);
conf.setCombinerClass(TspReducer.class);
conf.setReducerClass(TspReducer.class);
FileInputFormat.addInputPath(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
public static class TspMapper extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
function findCost() {
}
public void map(LongWritable key,Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
find adjacency matrix from the input;
for(int i = 0; ...) {
.....
output.collect(new Text(string1), new Text(string2));
}
}
}
public static class TspReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
Text t1 = new Text();
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
String a;
a = values.next().toString();
output.collect(key,new Text(a));
}
}
}

You currently have
conf.setJobName("Tsp");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(TspMapper.class);
conf.setCombinerClass(TspReducer.class);
conf.setReducerClass(TspReducer.class);
and as the error is stating No job jar file set you are not setting a jar.
You will need to something similar to
conf.setJarByClass(Tsp.class);
From what I'm seeing, that should resolve the error seen here.

11/04/27 16:13:06 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Do what they say, when setting up your job, set the jar where the class is contained. Hadoop copies the jar into the DistributedCache (a filesystem on every node) and uses the classes out of it.

I had the exact same issue. Here is how I solved the problem(imagine your map reduce class is called A). After creating the job call:
job.setJarByClass(A.class);

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Kite Dataset map-reduce - hadoop

You effectively have a classpath problem Your project is missing org.kitesdk:kite-data-hive You can Add this jar to your fat jar before submitting it to Spark Tells Spark to add it to your classpath when you submit

Related

Why is map reduce job poinitng to localhost:8080?

InvocationTargetException in Yarn task with Hadoop

DistributedCache - third party jar not found

Hadoop: ClassNotFoundException - org.apache.hcatalog.rcfile.RCFileMapReduceOutputFormat

hadoop not running in the multinode cluster

Categories

Resources