Error from AVRO Mapreduce - hadoop

Getting the following error when I try to run mapreduce on avro:
14/02/26 20:07:50 INFO mapreduce.Job: Task Id : attempt_1393424169778_0002_m_000001_0, Status : FAILED
Error: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
How can I fix this?
I have Hadoop 2.2 up and running.
I'm using Avro 1.7.6.
Below is the code:
package avroColorCount;
import java.io.IOException;
import org.apache.avro.*;
import org.apache.avro.Schema.Type;
import org.apache.avro.mapred.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class MapredColorCount extends Configured implements Tool {
public static class ColorCountMapper extends AvroMapper<User, Pair<CharSequence, Integer>> {
#Override
public void map(User user, AvroCollector<Pair<CharSequence, Integer>> collector, Reporter reporter)
throws IOException {
CharSequence color = user.getFavoriteColor();
// We need this check because the User.favorite_color field has type ["string", "null"]
if (color == null) {
color = "none";
}
collector.collect(new Pair<CharSequence, Integer>(color, 1));
}
}
public static class ColorCountReducer extends AvroReducer<CharSequence, Integer,
Pair<CharSequence, Integer>> {
#Override
public void reduce(CharSequence key, Iterable<Integer> values,
AvroCollector<Pair<CharSequence, Integer>> collector,
Reporter reporter)
throws IOException {
int sum = 0;
for (Integer value : values) {
sum += value;
}
collector.collect(new Pair<CharSequence, Integer>(key, sum));
}
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MapredColorCount <input path> <output path>");
return -1;
}
JobConf conf = new JobConf(getConf(), MapredColorCount.class);
conf.setJobName("colorcount");
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
AvroJob.setMapperClass(conf, ColorCountMapper.class);
AvroJob.setReducerClass(conf, ColorCountReducer.class);
// Note that AvroJob.setInputSchema and AvroJob.setOutputSchema set
// relevant config options such as input/output format, map output
// classes, and output key class.
AvroJob.setInputSchema(conf, User.getClassSchema());
AvroJob.setOutputSchema(conf, Pair.getPairSchema(Schema.create(Type.STRING),
Schema.create(Type.INT)));
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MapredColorCount(), args);
System.exit(res);
}
}

You're using wrong version of avro library.
createDatumWriter method first appeared in GenericData class in version 1.7.5 of avro library. If Hadoop does not seem to find it, then it means that there is an earlier version of avro library (possibly 1.7.4) in your classpath.
First try to provide a correct version of library with HADOOP_CLASSPATH or -libjars option.
Unfortunately, it may be more tricky. In my case it was some other jar file that I loaded with my project but actually never used. I spent several weeks do find it. Hope now you will find it quicker.
Here is some handy code to help you analyze your classpath during your job run (use it inside working job, like WordCount example):
public static void printClassPath() {
ClassLoader cl = ClassLoader.getSystemClassLoader();
URL[] urls = ((URLClassLoader) cl).getURLs();
System.out.println("classpath BEGIN");
for (URL url : urls) {
System.out.println(url.getFile());
}
System.out.println("classpath END");
}
Hope it helps.

Viacheslav Rodionov's answer definitely points to the root cause. Thank you for posting! The following configuration setting then seemed to pick up the 1.7.6 library first and allowed my reducer code (where the createDatumWriter method was called) to complete successfully:
Configuration conf = getConf();
conf.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);
Job job = Job.getInstance(conf);

I ran exactly into the same problem and as Viacheslav suggested -- it's a version conflict between Avro installed with Hadoop distribution, and Avro version in your project.
And it seems the most reliable way to solve the problem -- simply just use Avro version installed with your Hadoop distro. Unless there is compelling reason to use different version.
Why is using default Avro version which comes with Hadoop distribution is good idea? Because in production hadoop environment you most likely will deal numerous other jobs and services running on the same shared hadoop infrastructure. And the all share the same jar dependencies which come with Hadoop distribution installed in your production environment.
Replacing jar version for specific mapreduce job maybe tricky but solvable task. However it creates a risk of introducing compatibility problem which may be very hard to detect and can backfire later somewhere else in your hadoop ecosystem.

Related

Map-reduce job giving ClassNotFound exception even though mapper is present when running with yarn?

I am running a hadoop job which is working fine when I am running it without yarn in pseudo-distributed mode, but it is giving me class not found exception when running with yarn
16/03/24 01:43:40 INFO mapreduce.Job: Task Id : attempt_1458775953882_0002_m_000003_1, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.hadoop.keyword.count.ItemMapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.keyword.count.ItemMapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 8 more
Here is the source-code for the job
Configuration conf = new Configuration();
conf.set("keywords", args[2]);
Job job = Job.getInstance(conf, "item count");
job.setJarByClass(ItemImpl.class);
job.setMapperClass(ItemMapper.class);
job.setReducerClass(ItemReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Here is the command I am running
hadoop jar ~/itemcount.jar /user/rohit/tweets /home/rohit/outputs/23mar-yarn13 vodka,wine,whisky
Edit Code, after suggestion
package com.hadoop.keyword.count;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
public class ItemImpl {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("keywords", args[2]);
Job job = Job.getInstance(conf, "item count");
job.setJarByClass(ItemImpl.class);
job.setMapperClass(ItemMapper.class);
job.setReducerClass(ItemReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static class ItemMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
JSONParser parser = new JSONParser();
#Override
public void map(Object key, Text value, Context output) throws IOException,
InterruptedException {
JSONObject tweetObject = null;
String[] keywords = this.getKeyWords(output);
try {
tweetObject = (JSONObject) parser.parse(value.toString());
} catch (ParseException e) {
e.printStackTrace();
}
if (tweetObject != null) {
String tweetText = (String) tweetObject.get("text");
if(tweetText == null){
return;
}
tweetText = tweetText.toLowerCase();
/* StringTokenizer st = new StringTokenizer(tweetText);
ArrayList<String> tokens = new ArrayList<String>();
while (st.hasMoreTokens()) {
tokens.add(st.nextToken());
}*/
for (String keyword : keywords) {
keyword = keyword.toLowerCase();
if (tweetText.contains(keyword)) {
output.write(new Text(keyword), one);
}
}
output.write(new Text("count"), one);
}
}
String[] getKeyWords(Mapper<Object, Text, Text, IntWritable>.Context context) {
Configuration conf = (Configuration) context.getConfiguration();
String param = conf.get("keywords");
return param.split(",");
}
}
public static class ItemReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
#Override
protected void reduce(Text key, Iterable<IntWritable> values, Context output)
throws IOException, InterruptedException {
int wordCount = 0;
for (IntWritable value : values) {
wordCount += value.get();
}
output.write(key, new IntWritable(wordCount));
}
}
}
Running in full distributed mode your TaskTracker/NodeManager (the thing running your mapper) is running in a separate JVM and it sounds like your class is not making it onto that JVM's classpath.
Try using the -libjars <csv,list,of,jars> command line arg on job invocation. This will have Hadoop distribute the jar to the TaskTracker JVM and load your classes from that jar. (Note, this copies the jar out to each node in your cluster and makes it available only for that specific job. If you have common libraries that would need to be invoked for a lot of jobs, you'd want to look into using the Hadoop distributed cache.)
You may also want to try yarn -jar ... when launching your job versus hadoop -jar ... since that's the new/preferred way to launch yarn jobs.
Can you check the content of your itemcount.jar ?( jar -tvf itemcount.jar). I faced this issue once only to find that the .class was missing from the jar.
I had the same error a few days ago.
Changing map and reduce classes to static fixed my problem.
Make your map and reduce classes inner classes.
Control constructors of map and reduce classes (i/o values and override statement)
Check your jar command
old one
hadoop jar ~/itemcount.jar /user/rohit/tweets /home/rohit/outputs/23mar-yarn13 vodka,wine,whisky
new
hadoop jar ~/itemcount.jar com.hadoop.keyword.count.ItemImpl /user/rohit/tweets /home/rohit/outputs/23mar-yarn13 vodka,wine,whisky
add packageName.mainclass after you specified .jar file
Try-catch
try {
tweetObject = (JSONObject) parser.parse(value.toString());
} catch (Exception e) { **// Change ParseException to Exception if you don't only expect Parse error**
e.printStackTrace();
return; **// return from function in case of any error**
}
}
extends Configured and implement Tool
public class ItemImpl extends Configured implements Tool{
public static void main (String[] args) throws Exception{
int res =ToolRunner.run(new ItemImpl(), args);
System.exit(res);
}
#Override
public int run(String[] args) throws Exception {
Job job=Job.getInstance(getConf(),"ItemImpl ");
job.setJarByClass(this.getClass());
job.setJarByClass(ItemImpl.class);
job.setMapperClass(ItemMapper.class);
job.setReducerClass(ItemReducer.class);
job.setMapOutputKeyClass(Text.class);//probably not essential but make it certain and clear
job.setMapOutputValueClass(IntWritable.class); //probably not essential but make it certain and clear
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
add public static map
add public static reduce
I'm not an expert about this topic but This implementation is from one of my working projects. Try this if doesn't work for you I would suggest you check the libraries you added to your project.
Probably first step will solve it but
If these steps doesn't work , share the code with us.

yarn stderr no logger appender and no stdout

I'm running a simple mapreduce program wordcount agian Apache Hadoop 2.6.0. The hadoop is running distributedly (several nodes). However, I'm not able to see any stderr and stdout from yarn job history. (but I can see the syslog)
The wordcount program is really simple, just for demo purpose.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static final Log LOG = LogFactory.getLog(WordCount.class);
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
LOG.info("LOG - map function invoked");
System.out.println("stdout - map function invoded");
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("mapreduce.job.jar","/space/tmp/jar/wordCount.jar");
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/jsun/input"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/user/jsun/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Note in the map function of Mapper class, I added two statements:
LOG.info("LOG - map function invoked");
System.out.println("stdout - map function invoded");
These two statements are to test whether I can see logging from hadoop server. I can successfully run the program. But if I go to localhost:8088 to see the application history and then "logs", I see nothing in "stdout", and in "stderr":
log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
I think there is some configuration needed to get those output, but not sure which piece of information is missing. I searched online as well as in stackoverflow. Some people mentioned container-log4j.properties but they are not specific about how to configure that file and where to put.
One thing to note is I also tried the job with Hortonworks Data Platform 2.2 and Cloudera 5.4. The result is the same. I remember when I dealt with some previous version of hadoop (hadoop 1.x), I can easily see the loggings from same place. So I guess this is something new in hadoop 2.x
=======
As a comparison, if I make the apache hadoop run in local mode (meaning LocalJobRunner), I can see some loggings in console like this:
[2015-09-08 15:57:25,992]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:998) INFO:kvstart = 26214396; length = 6553600
[2015-09-08 15:57:25,996]org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:402) INFO:Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
[2015-09-08 15:57:26,064]WordCount$TokenizerMapper.map(WordCount.java:28) INFO:LOG - map function invoked
stdout - map function invoded
[2015-09-08 15:57:26,075]org.apache.hadoop.mapred.LocalJobRunner$Job.statusUpdate(LocalJobRunner.java:591) INFO:
[2015-09-08 15:57:26,077]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1457) INFO:Starting flush of map output
[2015-09-08 15:57:26,077]org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1475) INFO:Spilling map output
These kind of loggings ("map function is invoked") is what I expected in hadoop server logging.
All the sysout written in Map-Reduce program can not be seen on console. It is because map-reduce run in multiple parallel copies across the cluster, so there is no concept of a single console with output.
However, The System.out.println() for map and reduce phases can be seen in the job logs. Easy way to access the logs is
open the jobtracker web console - http://localhost:50030/jobtracker.jsp
click on the completed job
click on map or reduce task
click on tasknumber
Go to task logs
Check stdout logs.
Please note that if you are not able to locate URL, just look into the console log for jobtracker URL.

ClassNotFoundException when running HBase map reduce job on cluster

I have been testing a map reduce job on a single node and it seems to work but now that I am trying to run it on a remote cluster I am getting a ClassNotFoundExcepton. My code is structured as follows:
public class Pivot {
public static class Mapper extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {
#Override
public void map(ImmutableBytesWritable rowkey, Result values, Context context) throws IOException {
(map code)
}
}
public static class Reducer extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {
public void reduce(ImmutableBytesWritable key, Iterable<ImmutableBytesWritable> values, Context context) throws IOException, InterruptedException {
(reduce code)
}
}
public static void main(String[] args) {
Configuration conf = HBaseConfiguration.create();
conf.set("fs.default.name", "hdfs://hadoop-master:9000");
conf.set("mapred.job.tracker", "hdfs://hadoop-master:9001");
conf.set("hbase.master", "hadoop-master:60000");
conf.set("hbase.zookeeper.quorum", "hadoop-master");
conf.set("hbase.zookeeper.property.clientPort", "2222");
Job job = new Job(conf);
job.setJobName("Pivot");
job.setJarByClass(Pivot.class);
Scan scan = new Scan();
TableMapReduceUtil.initTableMapperJob("InputTable", scan, Mapper.class, ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);
TableMapReduceUtil.initTableReducerJob("OutputTable", Reducer.class, job);
job.waitForCompletion(true);
}
}
The error I am receiving when I try to run this job is the following:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Pivot$Mapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
...
Is there something I'm missing? Why is the job having difficulty finding the mapper?
When running a job from Eclipse it's important to note that Hadoop requires you to launch your job from a jar. Hadoop requires this so it can send your code up to HDFS / JobTracker.
In your case i imagine you haven't bundled up your job classes into a jar, and then run the program 'from the jar' - resulting in a CNFE.
Try building a jar and running from the command line using hadoop jar myjar.jar ..., once this works then you can test running from within Eclipse

hadoop, how to include 3part jar while try to run mapred job

As we know, new need to pack all needed class into the job-jar and upload it to server. it's so slow, i will to know whether there is a way which to specify the thirdpart jar include executing map-red job, so that i could only pack my classes with out dependencies.
PS(i found there is a "-libjar" command, but i doesn't figure out how to use it. Here is the link http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/)
Those are called generic options.
So, to support those, your job should implement Tool.
Run your job like --
hadoop jar yourfile.jar [mainClass] args -libjars <comma seperated list of jars>
Edit:
To implement Tool and extend Configured, you do something like this in your MapReduce application --
public class YourClass extends Configured implements Tool {
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new YourClass(), args);
System.exit(res);
}
public int run(String[] args) throws Exception
{
//parse you normal arguments here.
Configuration conf = getConf();
Job job = new Job(conf, "Name of job");
//set the class names etc
//set the output data type classes etc
//to accept the hdfs input and outpur dir at run time
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}
For me I had to specify -libjar option before the arguments. Otherwise it was considered as an argument.

Multiple ways to write driver of Hadoop program - Which one to choose?

I have observed that there are multiple ways to write driver method of Hadoop program.
Following method is given in Hadoop Tutorial by Yahoo
public void run(String inputPath, String outputPath) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
// the keys are words (strings)
conf.setOutputKeyClass(Text.class);
// the values are counts (ints)
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(MapClass.class);
conf.setReducerClass(Reduce.class);
FileInputFormat.addInputPath(conf, new Path(inputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));
JobClient.runJob(conf);
}
and this method is given in Hadoop The Definitive Guide 2012 book by Oreilly.
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
While trying program given in Oreilly book I found that constructors of Job class are deprecated. As Oreilly book is based on Hadoop 2 (yarn) I was surprised to see that they have used deprecated class.
I would like to know which method everyone uses?
I use the former approach.If we go with overriding the run() method, we can use hadoop jar options like -D,-libjars,-files etc.,.All these are very much necessary in almost any hadoop project.
Not sure if we can use them through the main() method.
Slightly different to your first (Yahoo) block - you should be using the ToolRunner / Tool classes which take advantage of the GenericOptionsParser (as noted in Eswara's answer)
A template pattern would be something like:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class ToolExample extends Configured implements Tool {
#Override
public int run(String[] args) throws Exception {
// old API
JobConf jobConf = new JobConf(getConf());
// new API
Job job = new Job(getConf());
// rest of your config here
// determine success / failure (depending on your choice of old / new api)
// return 0 for success, non-zero for an error
return 0;
}
public static void main(String args[]) throws Exception {
System.exit(ToolRunner.run(new ToolExample(), args));
}
}

Resources