Avro Map Reduce - AvroInputFormat not found error - hadoop

This is what I have understood so far from reading various sources on the internet:
avro and avro-mapred are not part of CDH4 (Cloudera Distribution), so I have to set them up manually using HADOOP_CLASSPATH=avro.jar:avro-mapred.jar.
I have done that, and when I run my job on my pseudo-distributed cluster it throws the following exception:
13/12/27 00:47:40 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/12/27 00:47:40 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/27 00:47:41 INFO mapred.JobClient: Running job: job_201312221245_0017
13/12/27 00:47:42 INFO mapred.JobClient: map 0% reduce 0%
13/12/27 00:47:57 INFO mapred.JobClient: Task Id : attempt_201312221245_0017_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.avro.mapred.AvroInputFormat not found
I'm running the job as follows:
hadoop jar build/libs/hadoop-boilerplate-1.0.jar CustomerMapReduce transactions/input transactions/output1 -libjars /path/to/libs/avro-1.7.4.jar,/path/to/libs/avro-mapred-1.7.4.jar

You should implement Tool and use getConf() for the job configuration; that way the GenericOptionsParser runs and the -libjars argument is actually applied.
public class SomeClass extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        ...
    }
}
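For illustration, here is a minimal driver skeleton along those lines. It is only a sketch: it assumes the old mapred API that your logs suggest, reuses the CustomerMapReduce class name from your command line, and leaves out the Avro-specific wiring (AvroJob schema setup, mapper/reducer classes). The point is that ToolRunner parses the generic options before run() is called, so -libjars is honoured and the listed jars are shipped with the job:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CustomerMapReduce extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries whatever GenericOptionsParser parsed (-libjars, -D, ...)
        JobConf conf = new JobConf(getConf(), CustomerMapReduce.class);
        conf.setJobName("customer-avro-job");
        // Avro-specific setup (AvroJob.setInputSchema, mapper/reducer classes, ...) goes here
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options before calling run()
        System.exit(ToolRunner.run(new Configuration(), new CustomerMapReduce(), args));
    }
}
Note that generic options such as -libjars have to appear before the application arguments, i.e. hadoop jar build/libs/hadoop-boilerplate-1.0.jar CustomerMapReduce -libjars /path/to/libs/avro-1.7.4.jar,/path/to/libs/avro-mapred-1.7.4.jar transactions/input transactions/output1.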

Related

Hadoop - Mappers not emitting anything

I'm running the code below and no output is generated (well, the output folder and the reducer output file are created, but there is nothing within the part-r-00000 file). From the logs, I suspect the mappers are not emitting anything.
The code:
package com.telefonica.iot.tidoop.mrlib;
import com.telefonica.iot.tidoop.mrlib.utils.Constants;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;
public class Count extends Configured implements Tool {
private static final Logger LOGGER = Logger.getLogger(Count.class);
public static class UnitEmitter extends Mapper<Object, Text, Text, LongWritable> {
private final Text commonKey = new Text("common-key");
@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
context.write(commonKey, new LongWritable(1));
} // map
} // UnitEmitter
public static class Adder extends Reducer<Text, LongWritable, Text, LongWritable> {
@Override
public void reduce(Text key, Iterable<LongWritable> values, Context context)
throws IOException, InterruptedException {
long sum = 0;
for (LongWritable value : values) {
sum += value.get();
} // for
context.write(key, new LongWritable(sum));
} // reduce
} // Adder
public static class AdderWithTag extends Reducer<Text, LongWritable, Text, LongWritable> {
private String tag;
@Override
public void setup(Context context) throws IOException, InterruptedException {
tag = context.getConfiguration().get(Constants.PARAM_TAG, "");
} // setup
@Override
public void reduce(Text key, Iterable<LongWritable> values, Context context)
throws IOException, InterruptedException {
long sum = 0;
for (LongWritable value : values) {
sum += value.get();
} // for
context.write(new Text(tag), new LongWritable(sum));
} // reduce
} // AdderWithTag
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new Filter(), args);
System.exit(res);
} // main
@Override
public int run(String[] args) throws Exception {
// check the number of arguments, show the usage if it is wrong
if (args.length != 3) {
showUsage();
return -1;
} // if
// get the arguments
String input = args[0];
String output = args[1];
String tag = args[2];
// create and configure a MapReduce job
Configuration conf = this.getConf();
conf.set(Constants.PARAM_TAG, tag);
Job job = Job.getInstance(conf, "tidoop-mr-lib-count");
job.setNumReduceTasks(1);
job.setJarByClass(Count.class);
job.setMapperClass(UnitEmitter.class);
job.setCombinerClass(Adder.class);
job.setReducerClass(AdderWithTag.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.addInputPath(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
// run the MapReduce job
return job.waitForCompletion(true) ? 0 : 1;
} // run
private void showUsage() {
System.out.println("...");
} // showUsage
} // Count
The command executed, and the output logs:
$ hadoop jar target/tidoop-mr-lib-0.0.0-SNAPSHOT-jar-with-dependencies.jar com.telefonica.iot.tidoop.mrlib.Count -libjars target/tidoop-mr-lib-0.0.0-SNAPSHOT-jar-with-dependencies.jar tidoop/numbers tidoop/numbers_count onetag
15/11/05 17:24:52 INFO input.FileInputFormat: Total input paths to process : 1
15/11/05 17:24:52 WARN snappy.LoadSnappy: Snappy native library is available
15/11/05 17:24:53 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/11/05 17:24:53 INFO snappy.LoadSnappy: Snappy native library loaded
15/11/05 17:24:53 INFO mapred.JobClient: Running job: job_201507101501_23002
15/11/05 17:24:54 INFO mapred.JobClient: map 0% reduce 0%
15/11/05 17:25:00 INFO mapred.JobClient: map 100% reduce 0%
15/11/05 17:25:07 INFO mapred.JobClient: map 100% reduce 33%
15/11/05 17:25:08 INFO mapred.JobClient: map 100% reduce 100%
15/11/05 17:25:09 INFO mapred.JobClient: Job complete: job_201507101501_23002
15/11/05 17:25:09 INFO mapred.JobClient: Counters: 25
15/11/05 17:25:09 INFO mapred.JobClient: Job Counters
15/11/05 17:25:09 INFO mapred.JobClient: Launched reduce tasks=1
15/11/05 17:25:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5350
15/11/05 17:25:09 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/11/05 17:25:09 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/11/05 17:25:09 INFO mapred.JobClient: Rack-local map tasks=1
15/11/05 17:25:09 INFO mapred.JobClient: Launched map tasks=1
15/11/05 17:25:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8702
15/11/05 17:25:09 INFO mapred.JobClient: FileSystemCounters
15/11/05 17:25:09 INFO mapred.JobClient: FILE_BYTES_READ=6
15/11/05 17:25:09 INFO mapred.JobClient: HDFS_BYTES_READ=1968928
15/11/05 17:25:09 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108226
15/11/05 17:25:09 INFO mapred.JobClient: Map-Reduce Framework
15/11/05 17:25:09 INFO mapred.JobClient: Map input records=598001
15/11/05 17:25:09 INFO mapred.JobClient: Reduce shuffle bytes=6
15/11/05 17:25:09 INFO mapred.JobClient: Spilled Records=0
15/11/05 17:25:09 INFO mapred.JobClient: Map output bytes=0
15/11/05 17:25:09 INFO mapred.JobClient: CPU time spent (ms)=2920
15/11/05 17:25:09 INFO mapred.JobClient: Total committed heap usage (bytes)=355663872
15/11/05 17:25:09 INFO mapred.JobClient: Combine input records=0
15/11/05 17:25:09 INFO mapred.JobClient: SPLIT_RAW_BYTES=124
15/11/05 17:25:09 INFO mapred.JobClient: Reduce input records=0
15/11/05 17:25:09 INFO mapred.JobClient: Reduce input groups=0
15/11/05 17:25:09 INFO mapred.JobClient: Combine output records=0
15/11/05 17:25:09 INFO mapred.JobClient: Physical memory (bytes) snapshot=328683520
15/11/05 17:25:09 INFO mapred.JobClient: Reduce output records=0
15/11/05 17:25:09 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1466642432
15/11/05 17:25:09 INFO mapred.JobClient: Map output records=0
The content of the output file:
$ hadoop fs -cat /user/frb/tidoop/numbers_count/part-r-00000
[frb@cosmosmaster-gi tidoop-mr-lib]$ hadoop fs -ls /user/frb/tidoop/numbers_count/
Found 3 items
-rw-r--r-- 3 frb frb 0 2015-11-05 17:25 /user/frb/tidoop/numbers_count/_SUCCESS
drwxr----- - frb frb 0 2015-11-05 17:24 /user/frb/tidoop/numbers_count/_logs
-rw-r--r-- 3 frb frb 0 2015-11-05 17:25 /user/frb/tidoop/numbers_count/part-r-00000
Any hints about what is happening?
Weird. I'd try running the stock identity Mapper with your job.
If even the identity Mapper does not output anything, there must be something weird with your Hadoop installation or job configuration.
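A minimal sketch of that debugging step, reusing the conf, input and output variables from the run() method above (nothing here is specific to your code beyond those): run only the stock identity Mapper with zero reducers, so whatever the map tasks read is written straight to the output path.
// Debug job: identity mapper only, no combiner or reducer in the way.
Job job = Job.getInstance(conf, "identity-mapper-check");
job.setJarByClass(Count.class);
job.setMapperClass(Mapper.class);            // stock Mapper: emits (key, value) unchanged
job.setNumReduceTasks(0);                    // map output is written directly to the output path
job.setOutputKeyClass(LongWritable.class);   // byte offset produced by TextInputFormat
job.setOutputValueClass(Text.class);         // the input line itself
FileInputFormat.addInputPath(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
System.out.println(job.waitForCompletion(true) ? "done" : "failed");
If the part-m-* files are still empty after this, the problem is in the input data or the cluster, not in your mapper.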

getting Error while importing data from mongodb to hdfs

I am getting errors while importing data from mongodb to hdfs.
I am using:
Ambari Sandbox [Hortonworks] Hadoop 2.7
MongoDB version 3.0
These are the jar files I am including:
mongo-java-driver-2.11.4.jar
mongo-hadoop-core-1.3.0.jar
Here is the code I am using:
package com.mongo.test;
import java.io.*;
import org.apache.commons.logging.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.mapreduce.*;
import org.bson.*;
import com.mongodb.MongoClient;
import com.mongodb.hadoop.*;
import com.mongodb.hadoop.util.*;
public class ImportFromMongoToHdfs {
    private static final Log log = LogFactory.getLog(ImportFromMongoToHdfs.class);

    public static class ReadEmpDataFromMongo extends Mapper<Object, BSONObject, Text, Text> {
        public void map(Object key, BSONObject value, Context context) throws IOException, InterruptedException {
            System.out.println("Key: " + key);
            System.out.println("Value: " + value);
            String md5 = value.get("md5").toString();
            String name = value.get("name").toString();
            String dev = value.get("dev").toString();
            String salary = value.get("salary").toString();
            String location = value.get("location").toString();
            String output = "\t" + name + "\t" + dev + "\t" + salary + "\t" + location;
            context.write(new Text(md5), new Text(output));
        }
    }

    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();
        MongoConfigUtil.setInputURI(conf, "mongodb://10.25.3.196:27017/admin.emp");
        MongoConfigUtil.setCreateInputSplits(conf, false);
        System.out.println("Configuration: " + conf);
        final Job job = new Job(conf, "ReadWeblogsFromMongo");
        Path out = new Path("/mongodb3");
        FileOutputFormat.setOutputPath(job, out);
        job.setJarByClass(ImportFromMongoToHdfs.class);
        job.setMapperClass(ReadEmpDataFromMongo.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(com.mongodb.hadoop.MongoInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
This is the error I am getting back:
[root@sandbox ~]# hadoop jar /mongoinput/mongdbconnect.jar com.mongo.test.ImportFromMongoToHdfs
WARNING: Use "yarn jar" to launch YARN applications.
Configuration: Configuration: core-default.xml, core-site.xml
15/09/09 09:22:51 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/09 09:22:53 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.25.3.209:8050
15/09/09 09:22:53 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/09/09 09:22:54 INFO splitter.SingleMongoSplitter: SingleMongoSplitter calculating splits for mongodb://10.25.3.196:27017/admin.emp
15/09/09 09:22:54 INFO mapreduce.JobSubmitter: number of splits:1
15/09/09 09:22:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1441784509780_0003
15/09/09 09:22:55 INFO impl.YarnClientImpl: Submitted application application_1441784509780_0003
15/09/09 09:22:55 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1441784509780_0003/
15/09/09 09:22:55 INFO mapreduce.Job: Running job: job_1441784509780_0003
15/09/09 09:23:05 INFO mapreduce.Job: Job job_1441784509780_0003 running in uber mode : false
15/09/09 09:23:05 INFO mapreduce.Job: map 0% reduce 0%
15/09/09 09:23:12 INFO mapreduce.Job: Task Id : attempt_1441784509780_0003_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.mongodb.hadoop.MongoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getInputFormatClass(JobContextImpl.java:174)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:749)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class com.mongodb.hadoop.MongoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 8 more
15/09/09 09:23:18 INFO mapreduce.Job: Task Id : attempt_1441784509780_0003_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.mongodb.hadoop.MongoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getInputFormatClass(JobContextImpl.java:174)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:749)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class com.mongodb.hadoop.MongoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 8 more
15/09/09 09:23:24 INFO mapreduce.Job: Task Id : attempt_1441784509780_0003_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.mongodb.hadoop.MongoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getInputFormatClass(JobContextImpl.java:174)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:749)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class com.mongodb.hadoop.MongoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 8 more
15/09/09 09:23:32 INFO mapreduce.Job: map 100% reduce 0%
15/09/09 09:23:32 INFO mapreduce.Job: Job job_1441784509780_0003 failed with state FAILED due to: Task failed task_1441784509780_0003_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
15/09/09 09:23:32 INFO mapreduce.Job: Counters: 9
Job Counters
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=16996
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=16996
Total vcore-seconds taken by all map tasks=16996
Total megabyte-seconds taken by all map tasks=4249000
[root@sandbox ~]#
Does anyone know what is wrong?
Make sure you put the mongo-hadoop jars on the Hadoop classpath and restart Hadoop.
The error java.lang.ClassNotFoundException: Class com.mongodb.hadoop.MongoInputFormat should then be resolved.
You are getting the ClassNotFoundException because the job is unable to reach the jar "mongo-hadoop-core*.jar". You have to make "mongo-hadoop-core*.jar" available to your code.
There are several ways to resolve this error:
Create a fat jar for your program. The fat jar will contain all the necessary dependent jars. You can easily create a fat jar if you are using an IDE.
Use the "-libjars" argument while submitting your YARN job (see the sketch below).
Copy the mongo jars to the HADOOP_CLASSPATH location.
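For the -libjars route, a rough sketch of the submission; the jar paths are placeholders for wherever the connector jars live on your machine, and note that -libjars is only honoured when the driver implements Tool / goes through GenericOptionsParser, which the warning in your log also hints at:
export LIBJARS=/path/to/mongo-hadoop-core-1.3.0.jar,/path/to/mongo-java-driver-2.11.4.jar
export HADOOP_CLASSPATH=$(echo ${LIBJARS} | tr ',' ':')   # so the client-side JVM also sees the jars
hadoop jar /mongoinput/mongdbconnect.jar com.mongo.test.ImportFromMongoToHdfs -libjars ${LIBJARS}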
I have just resolved a problem like this. It is really a runtime error: setting HADOOP_CLASSPATH to point to the necessary external jar files was not enough, because at runtime Hadoop looks for jars inside the directories of the Hadoop installation itself. I realised we need to copy all the necessary external jar files into the Hadoop installation folders. So:
First, check the Hadoop classpath by typing:
hadoop classpath
Then copy the necessary external jar files into one of the directories on that classpath. For example, I copied mongo-hadoop-1.5.1.jar and some other jar files to the folder /usr/local/hadoop/share/hadoop/mapreduce.
Then it worked for me!
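In commands, that is roughly the following (the target directory is just my example; use one that actually appears in the hadoop classpath output on your cluster, and the jar names are the ones from the question):
hadoop classpath
cp mongo-hadoop-core-1.3.0.jar mongo-java-driver-2.11.4.jar /usr/local/hadoop/share/hadoop/mapreduce/
# restart the Hadoop daemons so the task JVMs pick the new jars up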

Hadoop Word Count working but not summing words up

I'm using Hadoop 1.2.1 and for some reason my Word Count output looks strange:
input file:
this is sparta this was sparta hello world goodbye world
output in hdfs:
goodbye 1
hello 1
is 1
sparta 1
sparta 1
this 1
this 1
was 1
world 1
world 1
code:
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
And here's some relevant console output:
14/01/04 16:17:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/01/04 16:17:37 INFO input.FileInputFormat: Total input paths to process : 1
14/01/04 16:17:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/04 16:17:37 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/04 16:17:38 INFO mapred.JobClient: Running job: job_201401041506_0013
14/01/04 16:17:39 INFO mapred.JobClient: map 0% reduce 0%
14/01/04 16:17:45 INFO mapred.JobClient: map 100% reduce 0%
14/01/04 16:17:52 INFO mapred.JobClient: map 100% reduce 33%
14/01/04 16:17:54 INFO mapred.JobClient: map 100% reduce 100%
14/01/04 16:17:55 INFO mapred.JobClient: Job complete: job_201401041506_0013
14/01/04 16:17:55 INFO mapred.JobClient: Counters: 26
14/01/04 16:17:55 INFO mapred.JobClient: Job Counters
14/01/04 16:17:55 INFO mapred.JobClient: Launched reduce tasks=1
14/01/04 16:17:55 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6007
14/01/04 16:17:55 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/04 16:17:55 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/01/04 16:17:55 INFO mapred.JobClient: Launched map tasks=1
14/01/04 16:17:55 INFO mapred.JobClient: Data-local map tasks=1
14/01/04 16:17:55 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=9167
14/01/04 16:17:55 INFO mapred.JobClient: File Output Format Counters
14/01/04 16:17:55 INFO mapred.JobClient: Bytes Written=77
14/01/04 16:17:55 INFO mapred.JobClient: FileSystemCounters
14/01/04 16:17:55 INFO mapred.JobClient: FILE_BYTES_READ=123
14/01/04 16:17:55 INFO mapred.JobClient: HDFS_BYTES_READ=169
14/01/04 16:17:55 INFO mapred.JobClient: FILE_BYTES_WRITTEN=122037
14/01/04 16:17:55 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=77
14/01/04 16:17:55 INFO mapred.JobClient: File Input Format Counters
14/01/04 16:17:55 INFO mapred.JobClient: Bytes Read=57
14/01/04 16:17:55 INFO mapred.JobClient: Map-Reduce Framework
14/01/04 16:17:55 INFO mapred.JobClient: Map output materialized bytes=123
14/01/04 16:17:55 INFO mapred.JobClient: Map input records=10
14/01/04 16:17:55 INFO mapred.JobClient: Reduce shuffle bytes=123
14/01/04 16:17:55 INFO mapred.JobClient: Spilled Records=20
14/01/04 16:17:55 INFO mapred.JobClient: Map output bytes=97
14/01/04 16:17:55 INFO mapred.JobClient: Total committed heap usage (bytes)=269619200
14/01/04 16:17:55 INFO mapred.JobClient: Combine input records=0
14/01/04 16:17:55 INFO mapred.JobClient: SPLIT_RAW_BYTES=112
14/01/04 16:17:55 INFO mapred.JobClient: Reduce input records=10
14/01/04 16:17:55 INFO mapred.JobClient: Reduce input groups=7
14/01/04 16:17:55 INFO mapred.JobClient: Combine output records=0
14/01/04 16:17:55 INFO mapred.JobClient: Reduce output records=10
14/01/04 16:17:55 INFO mapred.JobClient: Map output records=10
What can cause this? I'm very new to Hadoop, so i'm not sure where to look.
Thanks!
You're using an old API signature. In 1.x+ the reduce method changed to take an Iterable instead of an Iterator (which is what the old 0.x API used, so you will see Iterator in many examples in books and on the web). Because the parameter type is different, your method does not actually override Reducer.reduce; the inherited pass-through implementation runs instead, which is why every word is emitted with a count of 1.
http://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/mapreduce/Reducer.html#reduce%28KEYIN,%20java.lang.Iterable,%20org.apache.hadoop.mapreduce.Reducer.Context%29
Try
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
The @Override annotation tells your compiler to check that your reduce method is overriding the correct method signature in the parent class.
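To make that concrete (a sketch, not your exact class): with the Iterator parameter the method is merely an overload that Hadoop never calls, and adding @Override to it turns the mistake into a compile error instead of silently wrong output.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override  // compile error: "method does not override or implement a method from a supertype"
    public void reduce(Text key, Iterator<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // the framework only ever calls reduce(Text, Iterable<IntWritable>, Context)
    }
}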

Spring hadoop Mapper configuration

I'm using Hadoop 1.2.1 and Spring Hadoop 1.0.2
I wanted to check the Spring autowiring in a Hadoop Mapper. I wrote this configuration file:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:hdp="http://www.springframework.org/schema/hadoop"
xmlns:p="http://www.springframework.org/schema/p"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">
<context:property-placeholder location="configuration.properties"/>
<context:component-scan base-package="it.test"/>
<hdp:configuration id="hadoopConfiguration">
fs.default.name=${hd.fs}
</hdp:configuration>
<hdp:job id="my-job"
mapper="hadoop.mapper.MyMapper"
reducer="hadoop.mapper.MyReducer"
output-path="/root/Scrivania/outputSpring/out"
input-path="/root/Scrivania/outputSpring/in" jar="" />
<hdp:job-runner id="my-job-runner" job-ref="my-job" run-at-startup="true"/>
<hdp:hbase-configuration configuration-ref="hadoopConfiguration" zk-quorum="${hbase.host}" zk-port="${hbase.port}"/>
<bean id="hbaseTemplate" class="org.springframework.data.hadoop.hbase.HbaseTemplate">
<property name="configuration" ref="hbaseConfiguration"/>
</bean>
</beans>
Then I created this Mapper
public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
private static final Log logger = ....
@Autowired
private IHistoricalDataService hbaseService;
private List<HistoricalDataModel> data;
@SuppressWarnings({ "unchecked", "rawtypes" })
@Override
protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException {
super.cleanup(context);
}
@SuppressWarnings({ "rawtypes", "unchecked" })
@Override
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException {
super.setup(context);
try {
data = hbaseService.findAllHistoricalData();
logger.warn("Data "+data);
} catch (Exception e) {
String message = "Errore nel setup del contesto; messaggio errore: "+e.getMessage();
logger.fatal(message, e);
throw new InterruptedException(message);
}
}
@Override
protected void map(LongWritable key, Text value, org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException {
// TODO Auto-generated method stub
super.map(key, value, context);
}
}
As you can see, MyMapper does practically nothing; the only thing I want to print is the data variable, nothing else.
When I launch it in my IDE (Eclipse Luna) by a JUnit Test I can see only this prints:
16:19:11,902 INFO [XmlBeanDefinitionReader] Loading XML bean definitions from class path resource [application-context.xml]
16:19:12,540 INFO [GenericApplicationContext] Refreshing org.springframework.context.support.GenericApplicationContext#150e804: startup date [Mon Dec 02 16:19:12 CET 2013]; root of context hierarchy
16:19:12,693 INFO [PropertySourcesPlaceholderConfigurer] Loading properties file from class path resource [configuration.properties]
16:19:12,722 INFO [DefaultListableBeanFactory] Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory#109f81a: defining beans [org.springframework.context.support.PropertySourcesPlaceholderConfigurer#0,pinfClusteringHistoricalDataDao,historicalDataServiceImpl,clusterAnalysisSvcImpl,org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,hadoopConfiguration,clusterAnalysisJob,clusterAnalysisJobRunner,hbaseConfiguration,hbaseTemplate,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor]; root of factory hierarchy
16:19:13,516 INFO [JobRunner] Starting job [clusterAnalysisJob]
16:19:13,568 WARN [NativeCodeLoader] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16:19:13,584 WARN [JobClient] No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
16:19:13,619 INFO [FileInputFormat] Total input paths to process : 0
16:19:13,998 INFO [JobClient] Running job: job_local265750426_0001
16:19:14,065 INFO [LocalJobRunner] Waiting for map tasks
16:19:14,065 INFO [LocalJobRunner] Map task executor complete.
16:19:14,127 INFO [ProcessTree] setsid exited with exit code 0
16:19:14,134 INFO [Task] Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#b1258d
16:19:14,144 INFO [LocalJobRunner]
16:19:14,148 INFO [Merger] Merging 0 sorted segments
16:19:14,149 INFO [Merger] Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16:19:14,149 INFO [LocalJobRunner]
16:19:14,219 INFO [Task] Task:attempt_local265750426_0001_r_000000_0 is done. And is in the process of commiting
16:19:14,226 INFO [LocalJobRunner]
16:19:14,226 INFO [Task] Task attempt_local265750426_0001_r_000000_0 is allowed to commit now
16:19:14,251 INFO [FileOutputCommitter] Saved output of task 'attempt_local265750426_0001_r_000000_0' to /root/Scrivania/outputSpring/out
16:19:14,254 INFO [LocalJobRunner] reduce > reduce
16:19:14,255 INFO [Task] Task 'attempt_local265750426_0001_r_000000_0' done.
16:19:15,001 INFO [JobClient] map 0% reduce 100%
16:19:15,005 INFO [JobClient] Job complete: job_local265750426_0001
16:19:15,007 INFO [JobClient] Counters: 13
16:19:15,007 INFO [JobClient] File Output Format Counters
16:19:15,007 INFO [JobClient] Bytes Written=0
16:19:15,007 INFO [JobClient] FileSystemCounters
16:19:15,007 INFO [JobClient] FILE_BYTES_READ=22
16:19:15,007 INFO [JobClient] FILE_BYTES_WRITTEN=67630
16:19:15,007 INFO [JobClient] Map-Reduce Framework
16:19:15,008 INFO [JobClient] Reduce input groups=0
16:19:15,008 INFO [JobClient] Combine output records=0
16:19:15,008 INFO [JobClient] Reduce shuffle bytes=0
16:19:15,008 INFO [JobClient] Physical memory (bytes) snapshot=0
16:19:15,008 INFO [JobClient] Reduce output records=0
16:19:15,008 INFO [JobClient] Spilled Records=0
16:19:15,008 INFO [JobClient] CPU time spent (ms)=0
16:19:15,009 INFO [JobClient] Total committed heap usage (bytes)=111935488
16:19:15,009 INFO [JobClient] Virtual memory (bytes) snapshot=0
16:19:15,009 INFO [JobClient] Reduce input records=0
16:19:15,009 INFO [JobRunner] Completed job [clusterAnalysisJob]
16:19:15,028 WARN [SpringHadoopTest] Scrivo............ OOOOOOO
It seems that the job starts but my Mapper is never executed; can anybody suggest where I'm going wrong?
There is no autowiring of mappers or reducers. These classes are loaded by Hadoop so there is no application context associated with them at runtime. The application context is only available as part of the workflow orchestration of the jobs.
I don't know why your setup method isn't logging any messages; are you sure you specified the right class and package for the mapper?
-Thomas
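If the mapper really does need Spring beans, the usual workaround (a sketch, not something Spring for Apache Hadoop does for you) is to bootstrap a small context by hand inside setup() and fetch the bean from it. Here "mapper-context.xml" is a hypothetical file containing only the HBase/service beans, not the job definitions, and IHistoricalDataService is the poster's own interface:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    private IHistoricalDataService hbaseService;
    private ClassPathXmlApplicationContext ctx;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Hadoop instantiated this mapper with plain `new`, so @Autowired fields stay null.
        // Build a context explicitly and pull the bean out of it.
        ctx = new ClassPathXmlApplicationContext("mapper-context.xml");
        hbaseService = ctx.getBean(IHistoricalDataService.class);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // use hbaseService here
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        ctx.close();
    }
}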
Is it possible that your input file exists, but is empty? With no input splits, no mapper tasks would ever get created. Just a guess...

hadoop not running in the multinode cluster

I have a jar file "Tsp.jar" that I made myself. This same jar files executes well in single node cluster setup of hadoop. However when I run it on a cluster comprising 2 machines, a laptop and desktop it gives me an exception when the map function reach 50%. Here is the output
`hadoop@psycho-O:/usr/local/hadoop$ bin/hadoop jar Tsp.jar clust-Tsp_ip1 clust_Tsp_op4
11/04/27 16:13:06 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/04/27 16:13:06 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
11/04/27 16:13:06 INFO mapred.FileInputFormat: Total input paths to process : 1
11/04/27 16:13:06 INFO mapred.JobClient: Running job: job_201104271608_0001
11/04/27 16:13:07 INFO mapred.JobClient: map 0% reduce 0%
11/04/27 16:13:17 INFO mapred.JobClient: map 50% reduce 0%
11/04/27 16:13:20 INFO mapred.JobClient: Task Id : attempt_201104271608_0001_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Tsp$TspReducer
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841)
at org.apache.hadoop.mapred.JobConf.getCombinerClass(JobConf.java:853)
at org.apache.hadoop.mapred.Task$CombinerRunner.create(Task.java:1100)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:812)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Tsp$TspReducer
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833)
... 6 more
Caused by: java.lang.ClassNotFoundException: Tsp$TspReducer
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:807)
... 7 more
11/04/27 16:13:20 WARN mapred.JobClient: Error reading task outputemil-desktop
11/04/27 16:13:20 WARN mapred.JobClient: Error reading task outputemil-desktop
^Z
[1]+ Stopped bin/hadoop jar Tsp.jar clust-Tsp_ip1 clust_Tsp_op4
hadoop@psycho-O:~$ jps
4937 Jps
3976 RunJar
`
Also, the cluster worked fine executing the wordcount example, so I guess the problem is with the Tsp.jar file.
1) Is it necessary to have a jar file to run on a cluster?
2) Here I tried to run a jar file in the cluster which I made. But it still gives a warning that the jar file is not found. Why is that?
3) What all should be taken care of when running a jar file? Like, what must it contain other than the program which I wrote? My jar file contains a Tsp.class, Tsp$TspReducer.class and a Tsp$TspMapper.class. The terminal says it can't find Tsp$TspReducer when it is already there in the jar file.
Thank you
EDIT
public class Tsp {
public static void main(String[] args) throws IOException {
JobConf conf = new JobConf(Tsp.class);
conf.setJobName("Tsp");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(TspMapper.class);
conf.setCombinerClass(TspReducer.class);
conf.setReducerClass(TspReducer.class);
FileInputFormat.addInputPath(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
public static class TspMapper extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
function findCost() {
}
public void map(LongWritable key,Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
find adjacency matrix from the input;
for(int i = 0; ...) {
.....
output.collect(new Text(string1), new Text(string2));
}
}
}
public static class TspReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
Text t1 = new Text();
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
String a;
a = values.next().toString();
output.collect(key,new Text(a));
}
}
}
You currently have
conf.setJobName("Tsp");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(TspMapper.class);
conf.setCombinerClass(TspReducer.class);
conf.setReducerClass(TspReducer.class);
and, as the error states ("No job jar file set"), you are not setting a jar.
You will need to do something similar to
conf.setJarByClass(Tsp.class);
From what I'm seeing, that should resolve the error seen here.
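Applied to the driver from the question, that looks roughly like this (a sketch; only the setJarByClass call is new relative to the posted code):
JobConf conf = new JobConf(Tsp.class);
conf.setJarByClass(Tsp.class);   // ship the jar containing Tsp, Tsp$TspMapper and Tsp$TspReducer
conf.setJobName("Tsp");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(TspMapper.class);
conf.setCombinerClass(TspReducer.class);
conf.setReducerClass(TspReducer.class);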
11/04/27 16:13:06 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Do what the warning says: when setting up your job, set the jar where the class is contained. Hadoop copies the jar into the DistributedCache (a filesystem on every node) and uses the classes out of it.
I had the exact same issue. Here is how I solved the problem (imagine your MapReduce class is called A). After creating the job, call:
job.setJarByClass(A.class);
