MapReduce TotalOrderPartitioner writes output to only one file? - hadoop

I am running a MapReduce job which reads the input and sorts it using multiple reducers.
I am able to get the output sorted with the number of reducers set to 5. However, the output is written to only 1 file, and I get 4 empty files along with it.
I am using an InputSampler and TotalOrderPartitioner for global sorting.
My driver looks as follows:
int numReduceTasks = 5;
Configuration conf = new Configuration();
Job job = new Job(conf, "DictionarySorter");
job.setJarByClass(SampleEMR.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
job.setNumReduceTasks(numReduceTasks);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, input);
FileOutputFormat.setOutputPath(job, new Path(output + ".dictionary.sorted." + getCurrentDateTime()));
job.setPartitionerClass(TotalOrderPartitioner.class);
Path inputDir = new Path("/others/partitions");
Path partitionFile = new Path(inputDir, "partitioning");
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
double pcnt = 1.0;
int numSamples = numReduceTasks;
int maxSplits = numReduceTasks - 1;
if (0 >= maxSplits)
    maxSplits = Integer.MAX_VALUE;
InputSampler.Sampler<LongWritable, Text> sampler = new InputSampler.RandomSampler<LongWritable, Text>(pcnt, numSamples, maxSplits);
InputSampler.writePartitionFile(job, sampler);
job.waitForCompletion(true);

Your RandomSampler parameters seem suspicious to me:
The first parameter freq is a probability, not a percentage. For pcnt = 1 you are sampling 100% of the records.
The second parameter numSamples should be bigger: it needs to be large enough to represent the distribution of your whole dataset.
Imagine you have the following keys: 4,7,8,9,4,1,2,5,6,3,2,4,7,4,8,1,7,1,8,9,9,9,9
Using freq = 0.3 and numSamples = 10, and for the sake of simplicity let's say freq = 0.3 means one out of every 3 keys is sampled. This will collect the following sample: 4,9,2,3,7,1,8,9, which will be sorted into 1,2,3,4,7,8,9,9. This sample has 8 elements, so all of them are kept, because it does not exceed the maximum number of samples numSamples = 10.
Out of this sample, the boundaries for your reducers will be something like 2,4,8,9. This means that if a pair has the key "1" it will end up in Reducer #1. A pair with key "2" will end up in Reducer #2. A pair with key "5" will end up in Reducer #3, etc... This would be a good distribution.
Now let's run your values on the same example keys. Your freq = 1, so you take every key into the sample, which means your sample starts out as the entire keyset. Except that you cap the number of samples at numSamples = 5, so only 5 elements are kept. Your final sample is then likely to end up as something like 9,9,9,9,9. In that case all your boundaries are the same, so all pairs always go to Reducer #5.
In my example it looks like we were very unlucky that the last keys were all the same. But if your original dataset is already sorted, this is likely to happen (and the boundary distribution is guaranteed to be bad) whenever you use a high frequency with a small number of samples.
This blog post has lots of details on Sampling and TotalOrderPartitioning.
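As a rough illustration of more typical values (the exact numbers below are just assumptions and depend on your data volume), you would usually combine a low frequency with a fairly large sample cap, for example:
// Sketch only: sample ~1% of the records, keep at most 10,000 of them,
// and read at most 10 input splits while sampling. Tune these to your data.
double freq = 0.01;
int numSamples = 10000;
int maxSplitsSampled = 10;
InputSampler.Sampler<LongWritable, Text> sampler =
        new InputSampler.RandomSampler<LongWritable, Text>(freq, numSamples, maxSplitsSampled);
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
InputSampler.writePartitionFile(job, sampler);
Here job and partitionFile are the ones already set up in your driver above.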

Related

Hadoop MapReduce to get percentage of each word

I am using Hadoop MapReduce to get the word and word count information. Besides the count of each word, I also need to find the percentage of occurrences of each word in the document.
If the document just contains the three words "hello", "world" and "kitty", the result should look like this:
word count percentage
hello 40 0.4
world 50 0.5
kitty 10 0.1
I can set a TOTAL_KEY to count all words, but the problem is that the total is only produced at the same time as the individual word counts. When each word is written to HDFS, the total is not yet available, so the percentage cannot be calculated at that point.
You can use a counter in your Mapper:
increment a global counter each time you emit a word from the mapper;
read the counter's value once the job has finished, to get the total number of words emitted;
pass that total to a second aggregation job and calculate the percentage there in the usual way.
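For illustration, the mapper's counter bookkeeping could look like the sketch below. Only the WinnerCountMapper class name and the MAP_COUNTER.TOTAL_TUPLE counter are taken from the driver further down; the rest of the body (a plain word-splitting map) is an assumption.
public class WinnerCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Global counter, read by the driver after the job finishes.
    public enum MAP_COUNTER { TOTAL_TUPLE }

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            word.set(token);
            context.write(word, ONE);
            // Every emitted word also increments the global total.
            context.getCounter(MAP_COUNTER.TOTAL_TUPLE).increment(1);
        }
    }
}
The driver can then read this counter once the first job has finished and hand the total to a second aggregation job through the configuration, as in the snippet below: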
job.setJarByClass(WinnerCount.class);
job.setMapperClass(WinnerCountMapper.class);
job.setCombinerClass(WinnerCountCombiner.class);
job.setReducerClass(WinnerCountReducer.class);
job.setInputFormatClass(ChessInputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, temp);
job.waitForCompletion(true);

Counters counters = job.getCounters();
Counter total = counters.findCounter(WinnerCountMapper.MAP_COUNTER.TOTAL_TUPLE);
// System.out.println(total.getDisplayName() + ":" + total.getValue());

/* if (hdfs.exists(output)) hdfs.delete(output, true); */
Configuration conf2 = new Configuration();
conf2.set("total", total.getValue() + "");
Job job2 = Job.getInstance(conf2, "aggregation");
job2.setJarByClass(WinnerCount.class);
job2.setMapperClass(AggregationMapper.class);
job2.setReducerClass(AggregationReducer.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
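On the second job's side, the reducer can read the total back from the configuration and divide each count by it. The sketch below is an assumption about what AggregationReducer might look like; only the class name and the "total" configuration key come from the snippet above, and it presumes the first job wrote plain "word<TAB>count" lines that the AggregationMapper re-emits as (word, count) Text pairs.
public class AggregationReducer extends Reducer<Text, Text, Text, Text> {

    private long total;

    @Override
    protected void setup(Context context) {
        // The driver stored the first job's counter value under the key "total".
        total = Long.parseLong(context.getConfiguration().get("total"));
    }

    @Override
    protected void reduce(Text word, Iterable<Text> counts, Context context)
            throws IOException, InterruptedException {
        for (Text count : counts) {
            long c = Long.parseLong(count.toString());
            double percentage = (double) c / total;
            // Emit "word <TAB> count <TAB> percentage".
            context.write(word, new Text(c + "\t" + percentage));
        }
    }
}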

Reducer that groups by two values

I have a case in which the Mapper emits data that belongs to a subgroup, and the subgroup belongs to a group.
I need to add up all the values in each subgroup and then find the minimal value among all subgroups of the group, for each of the groups.
So, I have an output from Mapper that looks like this
Group 1
group,subgroupId,value
Group1,1,2
Group1,1,3
Group1,1,4
Group1,2,1
Group1,2,2
Group1,3,1
Group1,3,2
Group1,3,5
Group 2
group,subgroupId,value
Group2,4,2
Group2,4,3
Group2,4,4
Group2,5,1
Group2,5,2
Group2,6,1
Group2,6,2
And my output should be
Group1, 1, (2+3+4)
Group1, 2, (1+2)
Group1, 3, (1+2+5)
Group1 min = min((2+3+4),(1+2),(1+2+5))
Same for Group 2.
So I practically need to group twice: first group by GROUP, and then inside of it group by SUBGROUPID.
So I should emit the minimal sum per group; in the given example my reducer should emit (2, 3), since the minimal sum is 3 and it comes from the subgroup with id 2.
So it seems this could best be solved by reducing twice: the first reduce would get elements grouped by subgroup id, and its output would be passed to a second reducer grouped by group id.
Does this make sense, and how would I implement it? I've seen ChainedMapper and ChainedReducer, but they don't seem to fit this purpose.
Thanks
If all the data can fit in the memory of one machine, you can simply do all of this in a single job, using a single reducer (job.setNumReduceTasks(1);) and two temp variables. The output is emitted in the cleanup phase of the reducer. Here is the pseudocode for that, if you use the new Hadoop API (which supports the cleanup() method):
int tempKey;
int tempMin;

setup() {
    tempMin = Integer.MAX_VALUE;
}

reduce(key, values) {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next();
    }
    if (sum < tempMin) {
        tempMin = sum;
        tempKey = key;
    }
}

cleanup() { // only in the new API
    emit(tempKey, tempMin);
}
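Fleshed out against the new API, that pseudocode might look roughly like the sketch below (a sketch only; it assumes the mapper emits the subgroup id as a Text key and the values as IntWritables, so adapt the types to your actual job):
public class MinSubgroupSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Subgroup key with the smallest sum seen so far, and that sum.
    private String tempKey;
    private int tempMin;

    @Override
    protected void setup(Context context) {
        tempMin = Integer.MAX_VALUE;
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all values of this subgroup.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Remember the subgroup with the smallest sum.
        if (sum < tempMin) {
            tempMin = sum;
            tempKey = key.toString();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emitted once, after all keys have been processed by this single reducer.
        context.write(new Text(tempKey), new IntWritable(tempMin));
    }
}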
Your approach (summarized below) is how I would do it.
Job 1:
Mapper: emits each value keyed by its subgroupid.
Combiner/Reducer (same class): sums the values for each subgroupid.
Job 2:
Mapper: emits each subgroup sum keyed by its groupid.
Combiner/Reducer (same class): finds the minimum value for each groupid.
This is best implemented as two jobs, for the following reasons:
It simplifies the mapper and reducer significantly (you don't need to worry about finding all the groupids the first time around). Finding the (groupid, subgroupid) pairs in a single mapper could be non-trivial, whereas writing the two separate mappers is trivial.
It follows the MapReduce programming guidelines given by Tom White in Hadoop: The Definitive Guide (Chapter 6).
An Oozie workflow can easily and simply accommodate the dependent jobs.
The intermediate output (key: subgroupid, value: sum for that subgroupid) should be small, limiting the use of network resources.
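A bare-bones driver that chains the two jobs could look like this (a sketch only; all the class names are hypothetical, the intermediate path is arbitrary, and in production the Oozie workflow mentioned above would express the dependency instead):
Configuration conf = new Configuration();

Job job1 = Job.getInstance(conf, "subgroup-sums");
job1.setJarByClass(GroupMinDriver.class);
job1.setMapperClass(SubgroupSumMapper.class);      // emits (subgroupid, value)
job1.setCombinerClass(SubgroupSumReducer.class);   // same class as the reducer
job1.setReducerClass(SubgroupSumReducer.class);    // emits (subgroupid, sum of values)
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job1, new Path(args[0]));
Path intermediate = new Path(args[1] + "-subgroup-sums");
FileOutputFormat.setOutputPath(job1, intermediate);
if (!job1.waitForCompletion(true)) {
    System.exit(1);
}

Job job2 = Job.getInstance(conf, "group-min");
job2.setJarByClass(GroupMinDriver.class);
job2.setMapperClass(GroupMinMapper.class);         // maps each subgroup sum to its groupid
job2.setCombinerClass(GroupMinReducer.class);      // taking a minimum is safe in a combiner
job2.setReducerClass(GroupMinReducer.class);       // emits (groupid, minimal subgroup sum)
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job2, intermediate);
FileOutputFormat.setOutputPath(job2, new Path(args[1]));
System.exit(job2.waitForCompletion(true) ? 0 : 1);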

How to set number of reducer dynamically based on my mapper output size?

I know that the number of mappers can be set based on my DFS split size by setting mapred.min.split.size to dfs.block.size.
Similarly, how can I set the number of reducers based on my mapper output size?
PS: I know that the options below can be used to manipulate the number of reducers:
mapred.tasktracker.reduce.tasks.maximum
mapred.reduce.tasks
The number of reducers cannot be set after job submission.
Think about it this way: the partitioner is called on the mapper output, and it needs to know the number of reducers in order to partition.
To set the number of reduce tasks dynamically:
The number of map tasks is usually driven by the number of DFS blocks in the input files, which is why people sometimes adjust their DFS block size to adjust the number of maps.
So in the code below, let us set the number of reduce tasks dynamically at runtime, to adjust to the number of map tasks.
In Java code:
long defaultBlockSize = 0;
int numOfReduce = 10; // fallback value; you can set anything here
long inputFileLength = 0;
try {
    // HDFS file system
    FileSystem fileSystem = FileSystem.get(this.getConf());

    // total length of the input file(s) stored in HDFS
    inputFileLength = fileSystem.getContentSummary(
            new Path(PROP_HDFS_INPUT_LOCATION)).getLength();

    // default block size of the input location
    defaultBlockSize = fileSystem.getDefaultBlockSize(
            new Path(PROP_HDFS_INPUT_LOCATION));

    if (inputFileLength > 0 && defaultBlockSize > 0) {
        // roughly two reduce tasks per input block
        numOfReduce = (int) (((inputFileLength / defaultBlockSize) + 1) * 2);
    }
    System.out.println("NumOfReduce : " + numOfReduce);
} catch (Exception e) {
    LOGGER.error(" Exception{} ", e);
}
job.setNumReduceTasks(numOfReduce);
If you want to set the number of mappers and reducers dynamically through the command line, you can use the options below:
-D mapred.map.tasks=5 -D mapred.reduce.tasks=5
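For example, a submission along these lines (the jar and driver class names here are just placeholders, and the driver must use ToolRunner/GenericOptionsParser for the -D options to be picked up):
hadoop jar wordcount.jar com.example.WordCountDriver -D mapred.reduce.tasks=5 /input/path /output/path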
We can also set the number of reducers in the driver code with job.setNumReduceTasks(n); the old JobConf API additionally offers setNumMapTasks(n), but the number of map tasks is only a hint to the framework.
I don't think you can dynamically change the number of reducers once the MapReduce job has started. As far as I know, there is no human interaction for providing values while the job is running; it has to be preconfigured. A MapReduce job is a batch process (it runs for a long time), so it would be hard for the user to know when it would ask for the number of reducers, and it is not designed to be interactive during this process. Hope you got the answer!

MapReduce Job distribution among reducers

I developed a small MapReduce program. When I opened the process log, I saw that one map task and two reduce tasks were created by the framework. I had only one file for input and got two output files. Now please tell me:
1) Is the number of mappers and reducers determined by the framework, or can it be changed?
2) Is the number of output files always equal to the number of reducers, i.e. does each reducer create its own output file?
3) How is one input file distributed among mappers? And is the output of one mapper distributed among multiple reducers by the framework, or can this be changed?
4) How do you manage the case where there are multiple input files, i.e. a directory containing the input files?
Please answer these questions. I am a beginner to MapReduce.
Let me attempt to answer your questions. Please tell me wherever you think I am incorrect.
1) Is the number of mappers and reducers determined by the framework, or can it be changed?
The total number of map tasks created depends on the total number of logical splits made out of the HDFS blocks. So fixing the number of map tasks may not always be possible, because different files can have different sizes and therefore different total numbers of blocks. With TextInputFormat, each logical split roughly equals one block, so fixing the total number of map tasks is not possible, since each file can produce a different number of blocks.
Unlike the number of mappers, the number of reducers can be fixed.
2) Is the number of output files always equal to the number of reducers, i.e. does each reducer create its own output file?
To a certain degree yes, but there are ways to create more than one output file from a reducer, e.g. MultipleOutputs.
3) How is one input file distributed among mappers? And is the output of one mapper distributed among multiple reducers by the framework, or can this be changed?
Each file in HDFS is composed of blocks. Those blocks are replicated and can reside on multiple nodes (machines). Map tasks are then scheduled to run on these blocks.
The level of concurrency with which map tasks can run depends on the number of processors each machine has.
E.g., if 10,000 map tasks are scheduled for a file, then depending on the total number of processors throughout the cluster, only about 100 may run concurrently at a time.
By default Hadoop uses the HashPartitioner, which calculates the hash code of each key sent from the Mapper to the framework and converts it into a partition.
E.g.:
public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
As you can see above, a partition is selected out of the total number of reducers based on the hash code. So, if your numReduceTasks = 4, the value returned will be between 0 and 3.
4) How do you manage the case where there are multiple input files, i.e. a directory containing the input files?
Hadoop supports a directory consisting of multiple files as the input to a job.
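For instance (the paths below are hypothetical), pointing the input format at a directory makes every file inside it part of the job's input:
// Every file under /user/hadoop/input-dir becomes part of the job's input.
FileInputFormat.addInputPath(job, new Path("/user/hadoop/input-dir"));
// addInputPath can be called several times to add further files or directories.
FileInputFormat.addInputPath(job, new Path("/user/hadoop/more-input"));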
As explained by 'SSaikia_JtheRocker', mapper tasks are created according to the total number of logical splits of the HDFS blocks.
I would like to add something to question #3: "How is one input file distributed among mappers? And is the output of one mapper distributed among multiple reducers by the framework, or can this be changed?"
For example, consider my word count program, which counts the number of words in a file, shown below:
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context) // Context is the output
            throws IOException, InterruptedException {
        // value = "How Are You"
        String line = value.toString(); // convert the Hadoop Text "How Are You" into a plain Java String
        StringTokenizer tokenizer = new StringTokenizer(line); // tokenizer yields the tokens {"How", "Are", "You"}
        while (tokenizer.hasMoreTokens()) // hasMoreTokens() returns true while tokens remain
        {
            value.set(tokenizer.nextToken()); // value is overwritten with the current token, e.g. "How"
            context.write(value, new IntWritable(1)); // emit (word, 1) pairs to the framework
            // How, 1
            // Are, 1
            // You, 1
        }
        // map() is called once for every line of input
    }
}
So in the above program, the line "How Are You" is split into 3 words by the StringTokenizer, and the while loop writes a (word, 1) pair for each of them; the map() method itself is called once per line of input, not once per word.
As for reducers, we can specify how many reducers we want the output to be generated by, using the job.setNumReduceTasks(5) statement. The code snippet below will give you an idea.
public class BooksMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use the programArgs array to retrieve the program arguments.
        String[] programArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = new Job(conf);
        job.setJarByClass(BooksMain.class);
        job.setMapperClass(BookMapper.class);
        job.setReducerClass(BookReducer.class);
        job.setNumReduceTasks(5);
        // job.setCombinerClass(BookReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TODO: Update the input path for the location of the inputs of the map-reduce job.
        FileInputFormat.addInputPath(job, new Path(programArgs[0]));
        // TODO: Update the output path for the output directory of the map-reduce job.
        FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
        // Submit the job and wait for it to finish.
        job.waitForCompletion(true);
        // Submit and return immediately:
        // job.submit();
    }
}

Need help in writing Map/Reduce job to find average

I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find the average time taken by n processes, given an input text file like the one below:
ProcessName Time
process1 10
process2 20
processn 30
I went through a few tutorials, but I'm still not able to get a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to directly store the average in some sort of variable?
Thanks.
Your mappers read the text file and apply the following map function to every line:
map: (key, value)
time = value[2]
emit("1", time)
All map calls emit the key "1", which will be processed by one single reduce function:
reduce: (key, values)
result = sum(values) / n
emit("1", result)
Since you're using Hadoop, you have probably seen the use of StringTokenizer in the map function; you can use it to extract just the time from a line. You can also think about how to compute n (the number of processes); for example, you could use a Counter in another job which just counts lines.
Update
If you were to execute this job as is, a tuple would have to be sent to the reducer for every line, potentially clogging the network if you run the Hadoop cluster on multiple machines.
A cleverer approach is to compute the sum of the times closer to the inputs, e.g. by specifying a combiner:
combine: (key, values)
emit(key, sum(values))
This combiner is then executed on the results of all map functions of the same machine, i.e., without any networking in between.
The reducer would then only get about as many tuples as there are machines in the cluster, rather than as many as there are lines in your log files.
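In Hadoop terms the combiner is just another Reducer implementation registered via job.setCombinerClass(...). A minimal sketch matching the pseudocode above, assuming the mapper emits Text keys (the constant "1") and LongWritable times:
public class TimeSumCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Pre-aggregate the times produced by the map tasks on this node.
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
The reducer itself stays as sketched above: it sums the pre-aggregated values and divides by n.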
Your mapper maps your inputs to the value that you want to take the average of. So let's say that your input is a text file formatted like
ProcessName Time
process1 10
process2 20
.
.
.
Then you would need to take each line in your file, split it, grab the second column, and output the value of that column as an IntWritable (or some other Writable numeric type). Since you want to take the average of all times, not grouped by process name or anything, you will have a single fixed key. Thus, your mapper would look something like
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    output.set(Integer.parseInt(fields[1]));
    context.write(one, output);
}
Your reducer takes these values and simply computes the average. It would look something like
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable value : values) {
        sum += value.get();
        count++;
    }
    average.set(sum / (double) count);
    context.write(key, average);
}
I'm making a lot of assumptions here, about your input format and what not, but they are reasonable assumptions and you should be able to adapt this to suit your exact needs.
Will my output always be a text file, or is it possible to directly store the average in some sort of variable?
You have a couple of options here. You can post-process the output of the job (written to a single file), or, since you're computing a single value, you can store the result in a counter, for example.
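As a sketch of the counter option (the names below are made up, and it assumes a single reduce task; note that Hadoop counters hold long values, so a fractional average has to be scaled, e.g. multiplied by 1000):
// In the reducer, store the scaled average in a user-defined counter:
//   public enum RESULT_COUNTER { AVERAGE_TIMES_1000 }
context.getCounter(RESULT_COUNTER.AVERAGE_TIMES_1000).setValue((long) (average * 1000));

// In the driver, after the job has finished:
job.waitForCompletion(true);
long scaled = job.getCounters().findCounter(RESULT_COUNTER.AVERAGE_TIMES_1000).getValue();
double avg = scaled / 1000.0;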
