Mapper behaving differently locally and on the cluster - Hadoop

I run a map-only job (on Hadoop) in order to sort the key/value pairs, because it is said that "Hadoop automatically sorts data emitted by mappers before being sent to reducers".
input file
2013-04-15 835352
2013-04-16 846299
2013-04-17 828286
2013-04-18 747767
2013-04-19 807924
I think emitting Map(second_column, first_column) should sort this file as shown in output1. It actually did when I ran this on my local machine, but when I run it on a cluster, the output looks like output2.
output1 file
747767 2013-04-18
807924 2013-04-19
828286 2013-04-17
835352 2013-04-15
846299 2013-04-16
output2 file
835352 2013-04-15
747767 2013-04-18
807924 2013-04-19
828286 2013-04-17
846299 2013-04-16
How can I guarantee that the output always looks like output1? I am open to other suggestions for sorting by the second column.
Mapper
public class MapAccessTime1 extends Mapper<LongWritable, Text, IntWritable, Text> {
    private IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        int val = 0;
        StringTokenizer tokenizer = new StringTokenizer(line);
        if (!line.startsWith("#")) {
            if (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken()); // first column: the date
            }
            if (tokenizer.hasMoreTokens()) {
                val = Integer.parseInt(tokenizer.nextToken()); // second column: the count
                one = new IntWritable(val);
                context.write(one, word);
            }
        }
    }
}

A map-only job doesn't do the shuffle and sort phase. Using an identity reducer solved my problem.
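For anyone hitting the same issue, here is a minimal driver sketch that adds the shuffle/sort back by running the default (identity) Reducer; the class name SortByCount and the argument handling are my own, everything else is standard Hadoop API and uses the MapAccessTime1 mapper from above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortByCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "sort by second column");
        job.setJarByClass(SortByCount.class);
        job.setMapperClass(MapAccessTime1.class);
        // The base Reducer class is the identity reducer: it just forwards
        // the (already shuffled and sorted) keys and values.
        job.setReducerClass(Reducer.class);
        // One reduce task gives a single, globally sorted output file.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}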

Related

If 2 mappers output the same key, what will the input to the reducer be?

I have the following doubt while learning MapReduce; it would be of great help if someone could answer it.
I have two mappers working on the same file - I configured them using MultipleInputs.
Mapper 1 expected output [after extracting a few columns of the file]:
a - 1234
b - 3456
c - 1345
Mapper 2 expected output [after extracting a few columns of the same file]:
a - Monday
b - Tuesday
c - Wednesday
And there is a reducer function that just outputs the key and value pairs that it gets as input.
So I expected the output to be as below, since I know that identical keys will be shuffled together into one list:
a - [1234,Monday]
b - [3456, Tuesday]
c - [1345, Wednesday]
But I am getting some weird output; I guess only one mapper is being run.
Should this not be expected? Will the output of each mapper be shuffled separately? Will both mappers run in parallel?
Excuse me if it's a lame question; please understand that I am new to Hadoop and MapReduce.
Below is the code
//Mapper1
public class numbermapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split(",");
        System.out.println("***Mapper number output " + parts[0] + " " + parts[1]);
        context.write(new Text(parts[0]), new Text(parts[1]));
    }
}
//Mapper2
public class weekmapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split(",");
        System.out.println("***Mapper week output " + parts[0] + " " + parts[2]);
        context.write(new Text(parts[0]), new Text(parts[2]));
    }
}
//Reducer
public class rjoinreducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Text values, Context context)
            throws IOException, InterruptedException {
        context.write(key, values);
    }
}
//Driver class
public class driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Reduce-side join");
        job.setJarByClass(numbermapper.class);
        job.setReducerClass(rjoinreducer.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, numbermapper.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, weekmapper.class);

        Path outputPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputPath);
        outputPath.getFileSystem(conf).delete(outputPath);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
And this is the output I got:
a Monday
b Tuesday
c Wednesday
Dataset used
a,1234,Monday
b,3456,Tuesday
c,1345,Wednesday
MultipleInputs was taking just one file and running one mapper on it, because I had given the same path for both mappers.
When I copied the dataset to a different file and ran the same program with two different files (same content, but different file names), I got the expected output.
So I now understand that the output from different mapper functions is also combined based on the key, not just the output from the same mapper function.
Thanks for trying to help....!!!
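As a side note: to actually see both mapper outputs grouped per key (the a - [1234, Monday] style output expected above), the reduce method must use the Iterable<Text> signature so that it really overrides Reducer.reduce. A minimal sketch, assuming the two MultipleInputs paths are distinct:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class rjoinreducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // All values for this key arrive in one Iterable after the shuffle,
        // regardless of which mapper emitted them.
        StringBuilder joined = new StringBuilder("[");
        for (Text value : values) {
            if (joined.length() > 1) {
                joined.append(", ");
            }
            joined.append(value.toString());
        }
        joined.append("]");
        context.write(key, new Text(joined.toString()));
    }
}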

Multiple Input Files Mapreduce Wordcount example done separately

I have been going through the Hadoop MapReduce framework and have tried out basic examples like WordCount and Max_temperature, working towards a MapReduce task for my project. I only want to know how to produce one word-count output file for each input file. Let me give an example:
FILE_1: Dog Cat Dog Bull
FILE_2: Cow Ox Tiger Dog Cat
FILE_3: Dog Cow Ox Tiger Bull
should give 3 output files, 1 for each input file as follows:-
Out_1: Dog 2, Cat 1, Bull 1
Out_2: Cow 1, Ox 1, Tiger 1, Dog 1, Cat 1
Out_3: Dog 1, Cow 1, Ox 1, Tiger 1, Bull 1
I went through the answers posted here Hadoop MapReduce - one output file for each input but couldn't grasp it properly.
Help please! Thanks
Each Reducer outputs one output file.
The number of output files depends on the number of reducers.
(A)
Assuming you want to process all three input files in a single MapReduce job: at the very minimum, you must set the number of reducers equal to the number of output files you want.
Since you are trying to do word counts per file, and not across files, you will have to ensure that all the contents of one file are processed by a single reducer. Using a custom Partitioner is one way to do this (a sketch follows option (B) below).
(B)
Another way is simply to run your MapReduce job three times, once for each input file, with the reducer count set to 1.
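A minimal sketch of the custom Partitioner mentioned in option (A), assuming (as in the answer below) a composite map output key of the form "filePath*word", so that every word from one file lands on the same reducer; the class name FilePartitioner is hypothetical:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys of the form "filePath*word" by the filePath part only,
// so every word of a given input file ends up in the same reducer
// (and therefore in the same output file).
public class FilePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String filePath = key.toString().split("\\*")[0];
        return (filePath.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
It would be registered in the driver with myJob.setPartitionerClass(FilePartitioner.class) together with myJob.setNumReduceTasks(3).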
I am also a newbie in Hadoop and found this question very interesting. This is how I resolved it.
public class Multiwordcnt {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job myJob = new Job(conf, "Multiwordcnt");
        String[] userargs = new GenericOptionsParser(conf, args).getRemainingArgs();

        myJob.setJarByClass(Multiwordcnt.class);
        myJob.setMapperClass(MyMapper.class);
        myJob.setReducerClass(MyReducer.class);
        myJob.setMapOutputKeyClass(Text.class);
        myJob.setMapOutputValueClass(IntWritable.class);
        myJob.setOutputKeyClass(Text.class);
        myJob.setOutputValueClass(IntWritable.class);

        myJob.setInputFormatClass(TextInputFormat.class);
        myJob.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(myJob, new Path(userargs[0]));
        FileOutputFormat.setOutputPath(myJob, new Path(userargs[1]));

        System.exit(myJob.waitForCompletion(true) ? 0 : 1);
    }
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        Text emitkey = new Text();
        IntWritable emitvalue = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // prefix each word with the path of the file it came from
                String filepathword = filePathString + "*" + tokenizer.nextToken();
                emitkey.set(filepathword);
                context.write(emitkey, emitvalue);
            }
        }
    }
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        Text emitkey = new Text();
        IntWritable emitvalue = new IntWritable();
        private MultipleOutputs<Text, IntWritable> multipleoutputs;

        public void setup(Context context) throws IOException, InterruptedException {
            multipleoutputs = new MultipleOutputs<Text, IntWritable>(context);
        }

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum = sum + value.get();
            }
            // split the composite "filePath*word" key back into its parts
            String pathandword = key.toString();
            String[] splitted = pathandword.split("\\*");
            String path = splitted[0];
            String word = splitted[1];

            emitkey.set(word);
            emitvalue.set(sum);
            System.out.println("word:" + word + "\t" + "sum:" + sum + "\t" + "path: " + path);
            // write each word count to an output file named after its source file
            multipleoutputs.write(emitkey, emitvalue, path);
        }

        public void cleanup(Context context) throws IOException, InterruptedException {
            multipleoutputs.close();
        }
    }
}

Hadoop MapReduce: return sorted list of words in a text file

So my task is to return an alphabetically sorted list of all words contained in a text file while keeping duplicates.
{To be or not to be} −→ {be be not or to to}
My idea is to take each word as the key as well as the value. This way, because Hadoop sorts the keys, they will automatically be sorted alphabetically. In the reduce phase I simply append all words with the same key (so basically identical words) to one single Text value.
public class WordSort {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                // transform to lower case
                String lower = word.toString().toLowerCase();
                context.write(new Text(lower), new Text(lower));
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            String result = "";
            for (Text value : values) {
                result += value.toString() + " ";
            }
            context.write(key, new Text(result));
        }
    }
}
However, my problem is: how do I return only the value in my output file? At the moment I have this:
be be be
not not
or or
to to to
So in every line I have the key first and then the values, but I just want to return the values so that I get this:
be be
not
or
to to
Is this even possible or do I have to just delete one entry from the value of each word?
Disclaimer: I'm not a Hadoop user, but I do a lot of Map/Reduce with CouchDB.
If you just need the keys, why don't you emit an empty value?
Moreover, it sounds like you don't want to reduce them at all, since you want to get a key for every occurrence.
I just tried this with the MaxTemperature example from Hadoop: The Definitive Guide, and the code below worked:
context.write(null, new Text(result));
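A slightly more explicit variant of the same idea (a sketch, not the exact code from the book example; the class name ValueOnlyReducer is hypothetical) declares the output key as NullWritable, which TextOutputFormat skips, so each output line contains only the value:
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ValueOnlyReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder result = new StringBuilder();
        for (Text value : values) {
            result.append(value.toString()).append(" ");
        }
        // NullWritable keys (like null keys) are not written by TextOutputFormat,
        // so each line is just the space-joined duplicates, e.g. "be be".
        context.write(NullWritable.get(), new Text(result.toString().trim()));
    }
}
The driver then needs job.setMapOutputKeyClass(Text.class), job.setMapOutputValueClass(Text.class) and job.setOutputKeyClass(NullWritable.class) so that the map and reduce output types no longer match.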

Creating Sequence File Format for Hadoop MR

I was working with Hadoop MapReduce, and had a question.
Currently, my mapper's input KV type is LongWritable, LongWritable, and its
output KV type is also LongWritable, LongWritable.
InputFileFormat is SequenceFileInputFormat.
Basically, what I want to do is convert a txt file into SequenceFile format so that I can use it in my mapper.
What I would like to do is the following. The input file is something like this:
1\t2 (key = 1, value = 2)
2\t3 (key = 2, value = 3)
and on and on...
I looked at this thread How to convert .txt file to Hadoop's sequence file format but realized that TextInputFormat only supports Key = LongWritable and Value = Text.
Is there any way to take the txt file and make a sequence file with KV = LongWritable, LongWritable?
Sure, basically the same way I described in the other thread you've linked, but you have to implement your own mapper.
Just a quick sketch for you:
public class LongLongMapper extends
        Mapper<LongWritable, Text, LongWritable, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, LongWritable, LongWritable>.Context context)
            throws IOException, InterruptedException {
        // assuming that your line contains key and value separated by \t
        String[] split = value.toString().split("\t");
        context.write(new LongWritable(Long.valueOf(split[0])),
                new LongWritable(Long.valueOf(split[1])));
    }

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJobName("Convert Text");
        job.setJarByClass(LongLongMapper.class);

        // use the converting mapper defined above
        job.setMapperClass(LongLongMapper.class);
        job.setReducerClass(Reducer.class);

        // increase if you need sorting or a special number of files
        job.setNumReduceTasks(0);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);

        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setInputFormatClass(TextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        // submit and wait for completion
        job.waitForCompletion(true);
    }
}
Each call of your map function gets one line of your input, so we just split it by your delimiter (tab) and parse each part into a long.
That's it.
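Once the sequence files are written, the job that actually uses them just switches its input format. A minimal downstream driver sketch, where the class name ConsumeSequenceFile, the mapper MyLongLongMapper, and the paths are placeholders for your own:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConsumeSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Consume LongWritable/LongWritable sequence file");
        job.setJarByClass(ConsumeSequenceFile.class);

        // Read the <LongWritable, LongWritable> records written by the conversion job.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/output"));

        // MyLongLongMapper stands in for your existing
        // Mapper<LongWritable, LongWritable, LongWritable, LongWritable>.
        job.setMapperClass(MyLongLongMapper.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);

        FileOutputFormat.setOutputPath(job, new Path("/final-output"));
        job.waitForCompletion(true);
    }
}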

Hadoop Task Side Effect File Example

Can I get an example of how to use task side effect files?
public class Map0t extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        IntWritable one = new IntWritable(1);
        StringTokenizer tokenizer = new StringTokenizer(value.toString(), ",");
        String x;
        String y;
        String z;
        x = tokenizer.nextToken();
        y = tokenizer.nextToken();
        z = tokenizer.nextToken();
        output.collect(new Text(x + " " + z), one);
    }
}
I want to write new Text(x+" "+y), new Text(z) as a side-effect output from the above mapper function to a different folder in HDFS.
I searched but could not find any example on how to use task side effect files.
Not an optimal approach, but one way I can think of is:
Open a file in HDFS in the mapper's setup(), write into it from map(), and then close the file in the mapper's cleanup(). One thing to make sure of is to use a unique file name in setup().
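A minimal sketch of that approach using the newer mapreduce API (the question's Map0t uses the old mapred API); the class name and the /side-output directory are made up for illustration:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map0tWithSideFile extends Mapper<LongWritable, Text, Text, IntWritable> {
    private FSDataOutputStream sideFile;
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        // The task attempt ID makes the file name unique per mapper task,
        // so parallel tasks do not clobber each other's side files.
        // "/side-output" is a made-up target directory.
        Path sidePath = new Path("/side-output/" + context.getTaskAttemptID().toString());
        sideFile = fs.create(sidePath);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString(), ",");
        String x = tokenizer.nextToken();
        String y = tokenizer.nextToken();
        String z = tokenizer.nextToken();
        // Normal output goes through the framework...
        context.write(new Text(x + " " + z), one);
        // ...while the side-effect record is written directly to HDFS.
        sideFile.writeBytes(x + " " + y + "\t" + z + "\n");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        sideFile.close();
    }
}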
