First Hadoop program using map and reducer - hadoop

I'm trying to compile my first Hadoop program. I have as input file something like that:
1 54875451 2015 LA89LP
2 47451451 2015 LA89LP
3 878451 2015 LA89LP
4 54875 2015 LA89LP
5 2212 2015 LA89LP
When I run it I get map 100%, reduce 0%, and a java.lang.Exception: java.util.NoSuchElementException with a stack trace that includes:
java.util.NoSuchElementException
    at java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
I don't really understand why. Any help is really appreciated.
My Mapper and Reducer look like this:
public class Draft {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, Text> {

        private Text word = new Text();
        private Text word2 = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                String id = itr.nextToken();
                String price = itr.nextToken();
                String dateTransfer = itr.nextToken();
                String postcode = itr.nextToken();
                word.set(postcode);
                word2.set(price);
                context.write(word, word2);
            }
        }
    }

    public static class MaxReducer extends Reducer<Text, Text, Text, Text> {

        private Text word = new Text();
        private Text word2 = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context
                           ) throws IOException, InterruptedException {
            String max = "0";
            HashSet<String> S = new HashSet<String>();
            for (Text val : values) {
                String d = key.toString();
                String price = val.toString();
                if (S.contains(d)) {
                    if (Integer.parseInt(price) > Integer.parseInt(max)) max = price;
                } else {
                    S.add(d);
                    max = price;
                }
            }
            word.set(key.toString());
            word2.set(max);
            context.write(word, word2);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Draft");
        job.setJarByClass(Draft.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(MaxReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);   // output key type for the job (reducer output)
        job.setOutputValueClass(Text.class); // output value type for the job (reducer output)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This error occurs when some of your records have fewer than 4 fields. The code in your mapper assumes that each record contains 4 fields: id, price, dateTransfer and postcode, but some records may not contain all 4.
For example, if the record is:
1 54875451 2015
then the following line will throw a java.util.NoSuchElementException:
String postcode = itr.nextToken();
You are trying to read the postcode (assumed to be the 4th field), but there are only 3 fields in the input record.
To overcome this, change the tokenizing code in your map() method. Since you only emit postcode and price from map(), you can rewrite it as below:
String[] tokens = value.toString().split(" ");
String price = "";
String postcode = "";
if (tokens.length >= 2)
    price = tokens[1];
if (tokens.length >= 4)
    postcode = tokens[3];
if (!price.isEmpty() && !postcode.isEmpty()) {
    word.set(postcode);
    word2.set(price);
    context.write(word, word2);
}
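An alternative, more compact guard (a sketch, not the only option) is to skip any record with fewer than four fields entirely; this assumes the four-field layout id, price, dateTransfer, postcode from the question and the same word/word2 fields:
public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
    // split on whitespace; "\\s+" also tolerates multiple spaces or tabs
    String[] tokens = value.toString().split("\\s+");
    if (tokens.length < 4) {
        return; // malformed record: skip it instead of hitting NoSuchElementException
    }
    String price = tokens[1];
    String postcode = tokens[3];
    word.set(postcode);
    word2.set(price);
    context.write(word, word2);
}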

Related

Mapping with key as Text is not working, but parsing it to IntWritable works

I'm a beginner in Hadoop, and to learn I started doing an outer join on two tables: one has details about movies and the other has ratings.
Sample data for the movies table:
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller
Sample data for the ratings table:
userId,movieId,rating,timestamp
1,122,2.0,945544824
1,172,1.0,945544871
1,1221,5.0,945544788
1,1441,4.0,945544871
1,1609,3.0,945544824
1,1961,3.0,945544871
1,1972,1.0,945544871
2,441,2.0,1008942733
2,494,2.0,1008942733
2,1193,4.0,1008942667
2,1597,3.0,1008942773
2,1608,3.0,1008942733
2,1641,4.0,1008942733
MovieId is the primary key in the movies table and a foreign key in the ratings table, so I used movieId as the key in the mapper classes. I have used two mappers, one for the movies table and the other for the ratings table.
The code that I have written:
public class Join {

    public static class MovMapper extends Mapper<Object, Text, Text, Text> {
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            String[] arr = value.toString().split(",");
            word.set(arr[0]);
            //System.out.println(word.toString() + " mov");
            context.write(word, value);
        }
    }

    public static class RatMapper extends Mapper<Object, Text, Text, Text> {
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            String[] arr = value.toString().split(",");
            word.set(arr[1]);
            //System.out.println(word.toString() + " rat");
            context.write(word, value);
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context
                           ) throws IOException, InterruptedException {
            List<Text> rat = new ArrayList<Text>();
            Text mov = null;
            System.out.println("#######################################################################################");
            for (Text item : values) {
                if (item.toString().split(",").length == 3) {
                    mov = new Text(item);
                } else {
                    rat.add(new Text(item));
                }
                System.out.println("---->" + item);
            }
            System.out.println("item cnt: " + rat.size() + " mov" + mov + " key" + key + " byte: " + key.getBytes().toString());
            for (Text item : rat) {
                if (mov != null) {
                    context.write(item, mov);
                }
            }
            System.out.println("#######################################################################################");
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "join");
        job.setJarByClass(Join.class);
        job.setCombinerClass(JoinReducer.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, MovMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, RatMapper.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Problem: While mapping, records from the movies table and the ratings table are getting sent to different reduce tasks even though the movieId is the same. Surprisingly, when I convert movieId to IntWritable, records from both tables that match on the key end up in the same task.

Getting error: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable

I have written a MapReduce job for doing log file analysis. My mapper outputs Text as both the key and the value, and I have explicitly set the map output classes in my driver class.
But I still get the error: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
public class CompositeUserMapper extends Mapper<LongWritable, Text, Text, Text> {

    IntWritable a = new IntWritable(1);
    //Text txt = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Pattern p = Pattern.compile("\\b(\\d{8})\\b");
        Matcher m = p.matcher(line);
        String userId = "";
        String CompositeId = "";
        if (m.find()) {
            userId = m.group(1);
        }
        CompositeId = line.substring(line.indexOf("compositeId :") + 13).trim();
        context.write(new Text(CompositeId), new Text(userId));
        // TODO Auto-generated method stub
        super.map(key, value, context);
    }
}
My Driver class is as below:-
public class CompositeUserDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        CompositeUserDriver wd = new CompositeUserDriver();
        int res = ToolRunner.run(wd, args);
        System.exit(res);
    }

    public int run(String[] arg0) throws Exception {
        // TODO Auto-generated method stub
        Job job = new Job();
        job.setJarByClass(CompositeUserDriver.class);
        job.setJobName("Composite UserId Count");
        FileInputFormat.addInputPath(job, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job, new Path(arg0[1]));
        job.setMapperClass(CompositeUserMapper.class);
        job.setReducerClass(CompositeUserReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
        //return 0;
    }
}
Please advise how I can sort this problem out.
Remove the super.map(key, value, context); line from your mapper code: it calls the map() method of the parent class, which is an identity mapper that simply re-emits the key and value passed to it. In this case the key is the LongWritable byte offset from the beginning of the file, which does not match the declared Text map output key.
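For reference, the default Mapper.map() in the new API is (roughly) an identity function, which is why the extra call re-emits the LongWritable file offset as the key and triggers the type mismatch:
// Simplified from org.apache.hadoop.mapreduce.Mapper: the default map() just
// forwards the incoming key/value pair unchanged.
protected void map(KEYIN key, VALUEIN value, Context context)
        throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
}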

Writing text output from map function hadoop

Input :
a,b,c,d,e
q,w,34,r,e
1,2,3,4,e
In the mapper, I grab the value of the last field from each line, and I want to emit (e, (a,b,c,d)), i.e. (key, rest of the fields from the line).
Help appreciated.
Current code:
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();        // reads the input line by line
        String[] attr = line.split(",");       // extract each attribute value from the csv record
        context.write(attr[argno-1], line);    // gives error, seems to like only integer? how to override this?
    }
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // further processing: loads the chunk into a 2d ArrayList object for processing
    }
}

public static void main(String[] args) throws Exception {
    String line;
    String arguements[];
    Configuration conf = new Configuration();
    // compute the total number of attributes in the file
    FileReader infile = new FileReader(args[0]);
    BufferedReader bufread = new BufferedReader(infile);
    line = bufread.readLine();
    arguements = line.split(",");              // split the fields separated by comma
    conf.setInt("argno", arguements.length);   // saving that attribute count
    Job job = new Job(conf, "nb");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(Map.class); /* The method setMapperClass(Class<? extends Mapper>) in the type Job is not applicable for the arguments (Class<Map>) */
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
Please note the errors I face (see the comments in the code).
So this is simple. First parse your string to get the key and pass the rest of the line as the value. Then use the identity reducer, which will group all values for the same key together as a list in your output. It should be in the same format.
So your map function will output:
e, (a,b,c,d,e)
e, (q,w,34,r,e)
e, (1,2,3,4,e)
Then after the identity reduce it should output:
e, {a,b,c,d,e; q,w,34,r,e; 1,2,3,4,e}
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();        // reads the input line by line
        String[] attr = line.split(",");       // extract each attribute value from the csv record
        context.write(attr[argno-1], line);    // gives error, seems to like only integer? how to override this?
    }
}
public static void main(String[] args) throws Exception {
    String line;
    String arguements[];
    Configuration conf = new Configuration();
    // compute the total number of attributes in the file
    FileReader infile = new FileReader(args[0]);
    BufferedReader bufread = new BufferedReader(infile);
    line = bufread.readLine();
    arguements = line.split(",");              // split the fields separated by comma
    conf.setInt("argno", arguements.length);   // saving that attribute count
    Job job = new Job(conf, "nb");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(Map.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
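For reference, here is a minimal sketch of a mapper along these lines that compiles: it reads the column count back from the Configuration set in main() and wraps the String key and value in Text. The class name LastFieldMapper is hypothetical; renaming it away from Map avoids any clash with java.util.Map, which is one common cause of the setMapperClass compile error noted in the comments. The driver would then call job.setMapperClass(LastFieldMapper.class) and keep Text as the output key/value classes.
public static class LastFieldMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text outKey = new Text();
    private Text outValue = new Text();
    private int argno;

    @Override
    protected void setup(Context context) {
        // read the attribute count that the driver stored with conf.setInt("argno", ...)
        argno = context.getConfiguration().getInt("argno", 0);
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] attr = line.split(",");
        if (argno < 1 || attr.length < argno) {
            return; // skip records shorter than the header
        }
        outKey.set(attr[argno - 1]);     // last field, e.g. "e"
        outValue.set(line);              // the whole line as the value
        context.write(outKey, outValue); // Text key and Text value, as the job expects
    }
}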
Found alternate logic. Implemented, tested and verified.

HADOOP - Mapreduce - I obtain same value for all keys

I have a problem with MapReduce. Given as input a list of songs ("Songname"#"UserID"#"boolean"), I must produce as a result a list of songs that specifies how many different users listened to each one, i.e. an output of ("Songname", "timelistening").
I used a Hashtable to keep only one (song, user) pair.
With short files it works well, but when I feed it a list of about 1,000,000 records as input it returns the same value (20) for every record.
This is my mapper:
public static class CanzoniMapper extends Mapper<Object, Text, Text, IntWritable> {
    private IntWritable userID = new IntWritable(0);
    private Text song = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] caratteri = value.toString().split("#");
        if (caratteri[2].equals("1")) {
            song.set(caratteri[0]);
            userID.set(Integer.parseInt(caratteri[1]));
            context.write(song, userID);
        }
    }
}
This is my reducer:
public static class CanzoniReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        Hashtable<IntWritable, Text> doppioni = new Hashtable<IntWritable, Text>();
        for (IntWritable val : values) {
            doppioni.put(val, key);
        }
        result.set(doppioni.size());
        doppioni.clear();
        context.write(key, result);
    }
}
and main:
Configuration conf = new Configuration();
Job job = new Job(conf, "word count");
job.setJarByClass(Canzoni.class);
job.setMapperClass(CanzoniMapper.class);
//job.setCombinerClass(CanzoniReducer.class);
//job.setNumReduceTasks(2);
job.setReducerClass(CanzoniReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Any ideas?
Maybe I solved it. It's an input problem: there were too many records compared to the number of songs, so in that record list every song was listened to at least once by each user.
In my test I had 20 different users, so naturally the result is 20 for each song.
I need to increase the number of different songs.
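As a side note, a minimal sketch (not the poster's code; the class name is hypothetical) of a reducer that counts distinct listeners by copying the user id out of each IntWritable. Hadoop reuses the same IntWritable instance across the values iterator, so collecting the primitive int is the safer way to build a set:
public static class CanzoniDistinctReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // set of primitive user ids; val.get() copies the value out of the reused object
        HashSet<Integer> listeners = new HashSet<Integer>();
        for (IntWritable val : values) {
            listeners.add(val.get());
        }
        result.set(listeners.size());
        context.write(key, result);
    }
}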

Distributed Cache Hadoop not retrieving the file content

I am getting a garbage-like value instead of the data from the file I want to use as a distributed cache.
The Job Configuration is as follows:
Configuration config5 = new Configuration();
JobConf conf5 = new JobConf(config5, Job5.class);
conf5.setJobName("Job5");
conf5.setOutputKeyClass(Text.class);
conf5.setOutputValueClass(Text.class);
conf5.setMapperClass(MapThree4c.class);
conf5.setReducerClass(ReduceThree5.class);
conf5.setInputFormat(TextInputFormat.class);
conf5.setOutputFormat(TextOutputFormat.class);
DistributedCache.addCacheFile(new URI("/home/users/mlakshm/ap1228"), conf5);
FileInputFormat.setInputPaths(conf5, new Path(other_args.get(5)));
FileOutputFormat.setOutputPath(conf5, new Path(other_args.get(6)));
JobClient.runJob(conf5);
In the Mapper, I have the following code:
public class MapThree4c extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    private Set<String> prefixCandidates = new HashSet<String>();
    Text a = new Text();

    public void configure(JobConf conf5) {
        Path[] dates = new Path[0];
        try {
            dates = DistributedCache.getLocalCacheFiles(conf5);
            System.out.println("candidates: " + candidates);
            String astr = dates.toString();
            a = new Text(astr);
        } catch (IOException ioe) {
            System.err.println("Caught exception while getting cached files: " +
                StringUtils.stringifyException(ioe));
        }
    }

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line);
        st.nextToken();
        String t = st.nextToken();
        String uidi = st.nextToken();
        String uidj = st.nextToken();
        String check = null;
        output.collect(new Text(line), a);
    }
}
The output value I am getting from this mapper is [Lorg.apache.hadoop.fs.Path;@786c1a82
instead of the value from the distributed cache file.
That looks like what you get when you call toString() on an array, and if you look at the javadocs for DistributedCache.getLocalCacheFiles(), an array of Paths is exactly what it returns. If you need to actually read the contents of the files in the cache, you can open and read them with the standard Java APIs.
From your code:
Path[] dates = DistributedCache.getLocalCacheFiles(conf5);
Implies that:
String astr = dates.toString(); // a reference to the array above (i.e. dates), which is what you see in the output as [Lorg.apache.hadoop.fs.Path;@786c1a82
You need to do the following to see the actual paths:
for (Path cacheFile : dates) {
    output.collect(new Text(line), new Text(cacheFile.getName()));
}
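And to read the file contents rather than just the path, a minimal sketch (under the assumptions of the question's old mapred API and a single cached file) that loads the first cache file into the a field inside configure():
public void configure(JobConf conf5) {
    try {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf5);
        if (cacheFiles != null && cacheFiles.length > 0) {
            // open the local copy of the cached file with plain java.io
            StringBuilder contents = new StringBuilder();
            BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
            try {
                String cacheLine;
                while ((cacheLine = reader.readLine()) != null) {
                    contents.append(cacheLine).append("\n");
                }
            } finally {
                reader.close();
            }
            a.set(contents.toString()); // now holds the file data, not the array's toString()
        }
    } catch (IOException ioe) {
        System.err.println("Caught exception while reading cached file: "
                + StringUtils.stringifyException(ioe));
    }
}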
