Hadoop Task Side Effect File Example - hadoop

Can I get an example of how to use task side effect files?
public class Map0t extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        IntWritable one = new IntWritable(1);
        StringTokenizer tokenizer = new StringTokenizer(value.toString(), ",");
        String x;
        String y;
        String z;
        x = tokenizer.nextToken();
        y = tokenizer.nextToken();
        z = tokenizer.nextToken();
        output.collect(new Text(x+" "+z), one);
    }
}
In the Mapper above, I want to write the pair (new Text(x+" "+y), new Text(z)) as a side effect to a different folder in HDFS.
I searched but could not find any example on how to use task side effect files.

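Not an optimal approach, but one way I can think of is:
Open a file in HDFS in the mapper's setup(), write to it from map(), and then close it in the mapper's cleanup(). (With the old API used in the question, the equivalent hooks are configure() and close().) One thing to make sure of is that each mapper uses a unique file name in setup(), since several map tasks, and possibly several attempts of the same task, run at once.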
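The open/write/close lifecycle described above can be sketched in plain Java. This is only an illustration under stated assumptions: it writes to the local filesystem rather than HDFS, and SideFileWriter, record() and the "side-<attempt id>.txt" naming scheme are hypothetical names, not part of Hadoop.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not a Hadoop class: it mimics the mapper lifecycle
// on the local filesystem.
class SideFileWriter implements AutoCloseable {
    private final Path file;
    private final BufferedWriter out;

    // Plays the role of setup(): open one file whose name is unique per task attempt.
    SideFileWriter(Path sideDir, String taskAttemptId) throws IOException {
        Files.createDirectories(sideDir);
        this.file = sideDir.resolve("side-" + taskAttemptId + ".txt");
        this.out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
    }

    // Plays the role of map(): for an input line "x,y,z", record "x y<TAB>z".
    void record(String line) throws IOException {
        String[] fields = line.split(",");
        if (fields.length < 3) {
            return; // skip malformed lines instead of throwing
        }
        out.write(fields[0] + " " + fields[1] + "\t" + fields[2]);
        out.newLine();
    }

    Path file() {
        return file;
    }

    // Plays the role of cleanup(): flush and close the side file.
    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

In a real job the stream would presumably come from Hadoop's FileSystem.create(...) instead of java.nio, and the unique part of the name from the task attempt ID, so that concurrent attempts never write to the same file.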

Related

Getting the partition id of input file in Hadoop

I need to know the row index of the partitions of the input file that I'm using. I could force this in the original file by concatenating the row index to the data but I'd rather have a way of doing this in Hadoop. I have this in my mapper...
String id = context.getConfiguration().get("mapreduce.task.partition");
But "id" is 0 in every case. In the "Hadoop: The Definitive Guide" it mentions accessing properties like the partition id "can be accessed from the context object passed to all methods of the Mapper or Reducer". It does not, from what I can tell, actually go into how to access this information.
I went through the documentation for the Context object and it seems like the above is the way to do it and the script does compile. But since I'm getting 0 for every value, I'm not sure if I'm actually using the right thing and I'm unable to find any detail online that could help in figuring this out.
Code used to test...
public class Test {
    public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String id = context.getConfiguration().get("mapreduce.task.partition");
            context.write(new Text("Test"), new Text(id + "_" + value.toString()));
        }
    }
    public static class TestReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for(Text value : values) {
                context.write(key, value);
            }
        }
    }
    public static void main(String[] args) throws Exception {
        if(args.length != 2) {
            System.err.println("Usage: Test <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(Test.class);
        job.setJobName("Test");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(TestMapper.class);
        job.setReducerClass(TestReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Two options are:
Use the offset instead of the row number
Track the line number in the mapper
For the first one, the key, which is a LongWritable, tells you the byte offset of the line being processed. Unless your lines are all exactly the same length, you won't be able to calculate the line number from an offset, but it does allow you to determine ordering if that's useful.
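For the special case of equal-length lines mentioned above, the arithmetic is a single division. The sketch below is plain Java with hypothetical names (OffsetToLine, lineNumber), and it assumes the line length in bytes, terminator included, is known in advance.

```java
class OffsetToLine {
    // Zero-based line number for a byte offset. Valid ONLY when every line
    // occupies exactly bytesPerLine bytes, line terminator included.
    static long lineNumber(long byteOffset, int bytesPerLine) {
        if (bytesPerLine <= 0 || byteOffset < 0 || byteOffset % bytesPerLine != 0) {
            throw new IllegalArgumentException("offset does not fall on a line boundary");
        }
        return byteOffset / bytesPerLine;
    }

    public static void main(String[] args) {
        // Three 4-byte lines ("abc\n") start at offsets 0, 4 and 8.
        System.out.println(lineNumber(8, 4)); // prints 2
    }
}
```

With variable-length lines the same offsets still sort in file order, but there is no way to turn them into line numbers without reading everything that comes before.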
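The second option is to just track it in the mapper. Keep in mind that each mapper only sees its own input split, so this counts lines within the split, not within the whole file. You could change your code to something like: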
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
    private long currentLineNum = 0;
    private Text test = new Text("Test");
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(test, new Text(currentLineNum + "_" + value));
        currentLineNum++;
    }
}
You could also represent your matrix as lines of tuples and include the row and column on every tuple, so that you have that information when you read the file. If you use a file of just space- or comma-separated values that make up a 2D array, it will be extremely hard to figure out which line (row) you are currently working on in the mapper.

NoSuchElementException in mapreduce

I am new to MapReduce and am getting a NoSuchElementException; please help.
The input file contains the text below:
this is a hadoop program
i am writing it for first time
Mapper class :
public class Mappers extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private Text word = new Text();
    private IntWritable singleWordCount = new IntWritable();
    private IntWritable one = new IntWritable(1);
    @Override
    public void map(LongWritable key, Text value, OutputCollector<IntWritable, IntWritable> output, Reporter reporter) throws IOException {
        StringTokenizer wordList = new StringTokenizer(value.toString());
        while (wordList.hasMoreTokens()) {
            int wordSize = wordList.nextToken().length();
            singleWordCount.set(wordSize);
            if(word != null && wordList != null && wordList.nextToken() != null){
                word.set(wordList.nextToken());
                output.collect(singleWordCount, one);
            }
        }
    }
}
This is the error I am getting
You're calling wordList.nextToken() three times in every iteration of the loop, but checking hasMoreTokens() only once. Every call to nextToken() makes StringTokenizer return the next token, so the tokenizer runs out partway through an iteration. With your first input line, "this is a hadoop program", the first iteration consumes this, is and a; the second consumes hadoop and program and then asks for a third token, which doesn't exist, causing the exception.
What you need to do is retrieve the token once per iteration and store it in a variable. Or, if you really need to retrieve two words in one iteration, always call hasMoreTokens() to check that there actually is another word before each call to nextToken().
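A minimal plain-Java sketch of the "retrieve once and store it" fix follows. It leaves out the Hadoop types so that it runs on its own; WordLengths and lengths() are hypothetical names, and in the mapper the output.collect(...) call would take the place of result.add(...).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class WordLengths {
    // Returns the length of every whitespace-separated token in the line.
    static List<Integer> lengths(String line) {
        List<Integer> result = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            // Exactly one nextToken() per successful hasMoreTokens() check.
            String token = tokens.nextToken();
            result.add(token.length());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lengths("this is a hadoop program")); // prints [4, 2, 1, 6, 7]
    }
}
```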

Multiple Input Files Mapreduce Wordcount example done separately

I have been working through the Hadoop framework for the MapReduce model and have tried basic examples such as WordCount and Max_temperature, with a view to creating a MapReduce task for my project. I only want to know how to run word count so that there is one output file for each input file. Let me give you an example:
FILE_1 Dog Cat Dog Bull
FILE_2 Cow Ox Tiger Dog Cat
FILE_3 Dog Cow Ox Tiger Bull
should give 3 output files, 1 for each input file as follows:-
Out_1 Dog 2,Cat 1,Bull 1
Out_2 Cow 1,Ox 1,Tiger 1,Dog 1,Cat 1
Out_3 Dog 1,Cow 1,Ox 1,Tiger 1,Bull 1
I went through the answers posted here Hadoop MapReduce - one output file for each input but couldn't grasp it properly.
Help please! Thanks
Each reducer writes one output file, so the number of output files depends on the number of reducers.
(A)
Assuming you want to process all three input files in a single MapReduce job:
At the very minimum, you must set the number of reducers equal to the number of output files you want.
Since you want word counts per file rather than across files, you also have to ensure that all the contents of one file are processed by a single reducer. Using a custom Partitioner is one way to do this.
(B)
Another way is to simply run your MapReduce job three times, once for each input file, with the reducer count set to 1.
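The routing rule behind option (A) can be sketched in plain Java. Everything here is an assumption made for illustration: the key layout "<file name>*<word>", the names FilePartitionSketch and partitionFor, and the idea that the list of input files is known up front. In a Hadoop job the same logic would live in the getPartition(key, value, numPartitions) method of a Partitioner subclass.

```java
import java.util.List;

class FilePartitionSketch {
    // Hypothetical key layout: "<file name>*<word>", for example "FILE_1*Dog".
    // Every key from the same file maps to the same reducer index.
    static int partitionFor(String key, List<String> inputFiles, int numReducers) {
        int star = key.indexOf('*');
        if (star < 0) {
            throw new IllegalArgumentException("key has no file prefix: " + key);
        }
        String file = key.substring(0, star);
        int index = inputFiles.indexOf(file);
        if (index < 0) {
            throw new IllegalArgumentException("unknown input file: " + file);
        }
        return index % numReducers;
    }
}
```

Indexing into a fixed file list, rather than hashing the file name, avoids two files colliding on the same reducer when the number of reducers equals the number of files.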
I am also new to Hadoop and found this question very interesting. This is how I solved it:
public class Multiwordcnt {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job myJob = new Job(conf, "Multiwordcnt");
        String[] userargs = new GenericOptionsParser(conf, args).getRemainingArgs();
        myJob.setJarByClass(Multiwordcnt.class);
        myJob.setMapperClass(MyMapper.class);
        myJob.setReducerClass(MyReducer.class);
        myJob.setMapOutputKeyClass(Text.class);
        myJob.setMapOutputValueClass(IntWritable.class);
        myJob.setOutputKeyClass(Text.class);
        myJob.setOutputValueClass(IntWritable.class);
        myJob.setInputFormatClass(TextInputFormat.class);
        myJob.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(myJob, new Path(userargs[0]));
        FileOutputFormat.setOutputPath(myJob, new Path(userargs[1]));
        System.exit(myJob.waitForCompletion(true) ? 0 : 1 );
    }
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        Text emitkey = new Text();
        IntWritable emitvalue = new IntWritable(1);
        public void map(LongWritable key , Text value, Context context) throws IOException, InterruptedException {
            String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()){
                String filepathword = filePathString + "*" + tokenizer.nextToken();
                emitkey.set(filepathword);
                context.write(emitkey, emitvalue);
            }
        }
    }
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        Text emitkey = new Text();
        IntWritable emitvalue = new IntWritable();
        private MultipleOutputs<Text,IntWritable> multipleoutputs;
        public void setup(Context context) throws IOException, InterruptedException {
            multipleoutputs = new MultipleOutputs<Text,IntWritable>(context);
        }
        public void reduce(Text key , Iterable <IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values){
                sum = sum + value.get();
            }
            String pathandword = key.toString();
            String[] splitted = pathandword.split("\\*");
            String path = splitted[0];
            String word = splitted[1];
            emitkey.set(word);
            emitvalue.set(sum);
            System.out.println("word:" + word + "\t" + "sum:" + sum + "\t" + "path: " + path);
            multipleoutputs.write(emitkey,emitvalue , path);
        }
        public void cleanup(Context context) throws IOException, InterruptedException {
            multipleoutputs.close();
        }
    }
}

Hadoop MapReduce: return sorted list of words in a text file

So my task is to return an alphabetically sorted list of all words contained in a text file while keeping duplicates.
{To be or not to be} −→ {be be not or to to}
My idea is to take each word as the key as well as the value. This way, because Hadoop sorts the keys, they will automatically be sorted alphabetically. In the Reduce phase I simply append all words with the same key (so basically identical words) to one single Text value.
public class WordSort {
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private Text word = new Text();
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                // transform to lower case
                String lower = word.toString().toLowerCase();
                context.write(new Text(lower), new Text(lower));
            }
        }
    }
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            String result = "";
            for (Text value : values){
                result += value.toString() + " ";
            }
            context.write(key, new Text(result));
        }
    }
}
However my problem is, how do I simply return the value in my output file? At the moment I have this:
be be be
not not
or or
to to to
So in every line I have the key first and then the values, but I just want to return the values so that I get this:
be be
not
or
to to
Is this even possible or do I have to just delete one entry from the value of each word?
Disclaimer: I'm not an Hadoop user, but I do a lot of Map/Reduce with CouchDB.
If you just need the keys, why don't you emit an empty value?
Moreover, it sounds like you don't want to reduce them at all, since you want to get a key for every occurrence.
I just tried this with the MaxTemperature example from Hadoop: The Definitive Guide, and the code below worked:
context.write(null, new Text(result));
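To make the intended result concrete without a cluster, here is a minimal plain-Java simulation of the same idea: group identical lowercased words under a sorted key, then emit only the value part of each group. It does not use Hadoop, and WordSortSketch and sortedLines are names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordSortSketch {
    static List<String> sortedLines(String text) {
        // "Map" and "shuffle": count occurrences per lowercased word.
        // TreeMap keeps the keys sorted, like the sort between map and reduce.
        TreeMap<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        // "Reduce": emit only the values (the word repeated), never the key.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            lines.add(String.join(" ", Collections.nCopies(entry.getValue(), entry.getKey())));
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(sortedLines("To be or not to be")); // prints [be be, not, or, to to]
    }
}
```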

Make use of the relation name/table name/file name in Hadoop's MapReduce

Is there a way to use the relation name in MapReduce's Map and Reduce? I am trying to do Set difference using Hadoop's MapReduce.
Input: 2 files, R and S, each containing a list of terms. (I will use t to denote a term.)
Objective: To find R - S, i.e. terms in R and not in S
Approach:
Mapper: Spits out t -> R or t -> S, depending on whether t comes from R or S. So, the map output has the t as the key and the file name as the value.
Reducer: If the value list for a t contains only R, then output t -> t.
Do I need to somehow tag the terms with the file name? Or is there another way?
Source code for something I did for Set Union (doesn't need file name anywhere in this case). Just wanted to use this as an example to illustrate the unavailability of filename in Mapper.
public class Union {
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            output.collect(value, value);
        }
    }
    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            // Emit each distinct term once, however many times it occurs.
            while (values.hasNext())
            {
                output.collect(key, values.next());
                break;
            }
        }
    }
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Union.class);
        conf.setJobName("Union");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.set("mapred.job.queue.name", "myQueue");
        conf.setNumReduceTasks(5);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
As you can see I can't identify which key -> value pair (input to the Mapper) came from which file. Am I overlooking something simple here?
Thanks much.
I would implement it just as you described. That is the way MapReduce was meant to be used.
I guess your problem was actually writing the same value n times into HDFS?
EDIT:
Pasted from my comment below:
Ah, I got it ;) I'm not really familiar with the "old" API, but you can query your Reporter with:
reporter.getInputSplit();
This returns an InputSplit, which for file-based input formats can be cast down to FileSplit. From the FileSplit object you can obtain the Path with split.getPath(), and from the Path object you just need to call the getName() method.
So this snippet should work for you:
FileSplit fsplit = (FileSplit) reporter.getInputSplit(); // the cast is required: getInputSplit() returns InputSplit
String yourFileName = fsplit.getPath().getName();
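Once each term is tagged with its file name, the reducer rule from the question ("output t only if its value list contains only R") can be checked with a small plain-Java simulation. This is a sketch, not Hadoop code: SetDifferenceSketch and difference are hypothetical names, and the two lists stand in for the contents of files R and S.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class SetDifferenceSketch {
    static List<String> difference(List<String> r, List<String> s) {
        // "Map" phase: emit (term, file name); grouping by term stands in for the shuffle.
        Map<String, Set<String>> sourcesByTerm = new TreeMap<>();
        for (String term : r) {
            sourcesByTerm.computeIfAbsent(term, k -> new HashSet<>()).add("R");
        }
        for (String term : s) {
            sourcesByTerm.computeIfAbsent(term, k -> new HashSet<>()).add("S");
        }
        // "Reduce" phase: keep a term only if none of its tags is S.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : sourcesByTerm.entrySet()) {
            if (!entry.getValue().contains("S")) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(difference(List.of("a", "b", "c", "b"), List.of("b", "d"))); // prints [a, c]
    }
}
```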

Resources