What is the benefit of the reducers in Hadoop?

I don't see the value of reducers in Hadoop in the following scenarios:
1) The map tasks generate unique keys (so the map and reduce functionality could be merged together)
2) The output of the map tasks is too big (waiting for the reducers to begin work would exhaust memory)
3) The job's logic doesn't need grouping and sorting of the keys
Please correct me if I am wrong.
And if someone could give me a real example of the benefits of reducers and when they should be used, I would appreciate it.

A reducer is beneficial (or required) when you need to do operations like aggregation or grouping.
FYI: the reducer is meant for grouping the different values of a key that come from different mappers. So for a use case that does not require grouping or aggregation, there is no point in using a reducer (you can set the number of reducers to zero, meaning a map-only job).
One quick use case I can think of: you want to randomly split a big file into multiple part files. In this case you would supply the big file (let's say 100 GB) to a map-only job. Each map reads a chunk of the file and writes it out as a part file. A minimal sketch of such a job is below.
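Here is a minimal sketch of configuring a map-only job with the standard MapReduce API (the class name and paths are placeholders, not from the original answer):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlySplit {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only split");
        job.setJarByClass(MapOnlySplit.class);
        job.setMapperClass(Mapper.class); // the base Mapper is an identity mapper
        job.setNumReduceTasks(0);         // zero reducers: map output goes straight to HDFS
        FileInputFormat.addInputPath(job, new Path("/input/bigfile"));  // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/output/parts")); // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With zero reducers, the shuffle and sort phases are skipped entirely and each mapper writes its own part-m-NNNNN file directly. (With the default TextInputFormat, each output line is prefixed by its byte offset; a one-line custom mapper emitting only the value would drop that.)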

Related

Hadoop-2.4.1 custom partitioner to balance reducers

As we know, during the shuffle phase of Hadoop, each reducer reads data from all of the mappers' output (the intermediate data).
We also know that, by default, hash partitioning is used to assign keys to reducers.
My question is: how can we implement a different algorithm, e.g. a locality-aware one?
In short, you should not do it.
First, you have no control over where the mappers and reducers are executed on the cluster, so even when the complete output of a single mapper goes to a single reducer, there is a high probability that they would be on different hosts and the data would be transferred through the network.
Second, to make a reducer process the whole output of a mapper, you first have to make the mapper process the right part of the information. That means you would have to preprocess the data by partitioning it and then run a single mapper and a single reducer for each partition, but this preprocessing itself would take a lot of resources, so it is mostly pointless.
And finally, why do you need it? The core concept of MapReduce is manipulating key-value pairs, and the reducer in general should aggregate the list of values emitted by the mappers for the same key. This is why hash partitioning is used: to distribute N keys evenly between K reducers. Needing a different type of partitioner is a rare case. If you need data locality, you might prefer to work with an MPP database rather than Hadoop, for example.
If you really need a custom partitioner, here's an example of how it can be implemented: http://hadooptutorial.wikispaces.com/Custom+partitioner. Nothing special, just return a reducer number based on the key and value passed in and the number of reducers; a sketch follows below. Using the hash code of the host name modulo (%) the number of reducers would make the whole output of a single mapper go to a single reducer; you could also use the process PID % the number of reducers. But before doing this, check whether you really need this behavior.
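As a rough sketch (not the tutorial's code), a custom partitioner just extends Partitioner and maps each key/value pair to a reducer index:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Minimal sketch: route each key to a reducer using a rule of your choosing
// instead of the default HashPartitioner. Key/value types are illustrative.
public class CustomPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the hash is non-negative, then take the modulo.
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

It is wired into a job with job.setPartitionerClass(CustomPartitioner.class).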

Hadoop - set reducer number to 0 but write to same file?

My job is computationally intensive, so I am actually only using the distribution function of Hadoop, and I want all my output to be in one single file, so I have set the number of reducers to 1. My reducer is actually doing nothing...
If I explicitly set the number of reducers to 0, how can I make the mappers write all of their output into the same single output file? Thanks.
You can't do that in Hadoop. Your mappers each have to write to independent files. This makes them efficient (no contention or network transfer). If you want to combine all those files, you need a single reducer. Alternatively, you can let them be separate files, and combine the files when you download them (e.g., using HDFS's command-line cat or getmerge options).
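If you would rather merge from code than from the command line, Hadoop 1.x/2.x ship FileUtil.copyMerge, which concatenates every file in a directory into one target file (it was removed in Hadoop 3.0); a hedged sketch with placeholder paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Concatenate all part files from the job's output directory into a
        // single HDFS file (similar in spirit to hdfs dfs -getmerge).
        FileUtil.copyMerge(fs, new Path("/output/parts"),  // source directory
                           fs, new Path("/output/merged"), // destination file
                           false,                          // keep the source files
                           conf, null);                    // no separator string
    }
}
```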
EDIT: From your comment, I see that what you want is to avoid the hassle of writing a reducer. This is definitely possible. To do this, you can use the IdentityReducer. You can check its API here, and an explanation of 0 reducers vs. using the IdentityReducer is available here.
Finally, when I say that having multiple mappers generate a single output is not possible, I mean it is not possible with plain files in HDFS. You could do this with other types of output, like having all mappers write to a single database. This is OK if your mappers are not generating much output. Details on how this would work are available here.
cabad is correct for the most part. However, if you want to process the file with a single mapper into a single output file, you could use a FileInputFormat that marks the file as not splittable, and also set the number of reducers to 0. This gives up the performance of using multiple data nodes, but skips the shuffle and sort (a sketch follows below).
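A minimal sketch of such an input format, assuming text input (subclassing TextInputFormat is my choice here, not something specified in the answer):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// A TextInputFormat that refuses to split its input, so the whole file
// becomes one input split and is processed by a single map task.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, regardless of HDFS block size
    }
}
```

It is enabled with job.setInputFormatClass(NonSplittableTextInputFormat.class) together with job.setNumReduceTasks(0).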

Input Sampler in Hadoop

My understanding of the InputSampler is that it gets data from the record reader, samples keys, and then creates a partition file in HDFS.
I have a few queries about this sampler:
1) Is this sampling task a map task?
2) My data is on HDFS (distributed across the nodes of my cluster). Will this sampler run on the nodes which have the data to be sampled?
3) Will this consume my map slots?
4) Will the sampling run simultaneously with the map tasks of my MR job? I want to know whether it will affect the time consumed by the mappers by reducing the number of available slots.
I found that the InputSampler makes a seriously flawed assumption and is therefore not very helpful.
The idea is that it samples key values from the mapper input and then uses the resulting statistics to evenly partition the mapper output. The assumption is that the key types and key distribution are the same for the mapper input and output. In my experience, the mapper almost never sends the same key/value types to the reducer as it reads in, so the InputSampler is useless.
In the few cases where I had to sample in order to partition effectively, I ended up doing the sampling as part of the mapper (since only then did I know what keys were being produced) and writing the results out in the mapper's close() method to a directory (one set of stats per mapper). My partitioner then had to perform lazy initialization on its first call, reading the mapper-written files, assimilating the stats into a useful structure, and partitioning subsequent keys accordingly. A rough sketch of the mapper side of this is below.
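Here is a hedged sketch of that mapper-side stats collection, using the new API's cleanup() in place of the old API's close(); the stats directory and the key derivation are assumptions for illustration:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Count the keys actually emitted during map(), then dump one stats file per
// mapper; a custom partitioner can lazily load these files on its first call.
public class SamplingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Map<String, Long> keyCounts = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String key = line.toString().split("\t", 2)[0]; // assumed key derivation
        keyCounts.merge(key, 1L, Long::sum);
        ctx.write(new Text(key), new IntWritable(1));
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        FileSystem fs = FileSystem.get(ctx.getConfiguration());
        Path stats = new Path("/tmp/keystats/" + ctx.getTaskAttemptID()); // assumed dir
        try (FSDataOutputStream out = fs.create(stats)) {
            for (Map.Entry<String, Long> e : keyCounts.entrySet()) {
                out.writeBytes(e.getKey() + "\t" + e.getValue() + "\n");
            }
        }
    }
}
```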
Your only other real option is to guess at development time how the key values are distributed and hard-code that assumption into your partitioner.
Not very clean, but it was the best I could figure out.
This question was asked a long time ago, and several of its questions were left unanswered.
The only (and most upvoted) answer, by Chris, does not really answer the questions, but it gives an interesting point of view, though one that is a bit too pessimistic and misleading in my opinion, so I'll discuss it here as well.
Answers to the original questions
The sampling task is done in the call to InputSampler.writePartitionFile(job, sampler). That call is blocking, and the sampling is done during it, in the same thread.
That's why you don't need to call job.waitForCompletion() for it. It's not a MapReduce job; it simply runs in your client's process. Besides, a MapReduce job needs at least 20 seconds just to start, whereas sampling a small file only takes a couple of seconds.
Thus, the answer to all of your questions is simply "No". A sketch of the typical call sequence is below.
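For reference, a minimal sketch of the usual client-side sequence for total-order sorting (paths, reducer count, and sampler parameters are illustrative, not from the original answer):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "total-order-sort");
        job.setJarByClass(TotalOrderDriver.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class); // input keys are the sort keys
        job.setMapOutputKeyClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(4);
        job.setPartitionerClass(TotalOrderPartitioner.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // illustrative
        FileOutputFormat.setOutputPath(job, new Path("/output")); // illustrative
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/partitions"));                     // illustrative
        // Blocking call: the sampling happens right here, in the client process,
        // before the job is submitted. No map slots are consumed.
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.01, 1000, 10));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // the real MR job starts only now
    }
}
```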
More details from reading the code
If you look at the code of writePartitionFile(), you will find that it calls sampler.getSample(), which calls inputformat.getSplits() to get the list of all input splits to be sampled.
These input splits are then read sequentially to extract the samples. Each input split is read by a new record reader created within the same method. This means that your client is doing the reading and the sampling.
Your other nodes are not running any "map" or other processes; they are simply serving HDFS block data to your client for the input splits it needs for sampling.
Using Different key types between Map input and output
Now, to discuss the answer given by Chris. I agree that the InputSampler and TotalOrderPartitioner are probably flawed in some ways, since they are really not easy to understand and use... But they do not require the key types to be the same between map input and output.
The InputSampler uses the job's InputFormat (and its RecordReader) keys to create the partition file containing all the sampled keys. This file is then used by the TotalOrderPartitioner during the partitioning phase at the end of the mapper's processing to create the partitions.
The easiest solution is to create a custom RecordReader, for the InputSampler only, which performs the same key transformation as your Mapper.
To illustrate this, let's say your dataset contains pairs of (char, int) and your mapper transforms them into (int, int) by taking the character's ASCII value. For example, 'a' becomes 97.
If you want to perform total-order partitioning of this job, your InputSampler will sample letters such as 'a', 'b', 'c'. Then, during the partitioning phase, your mapper output keys will be integer values like 102 or 107, which are not comparable to the 'a', 'g' or 't' in the partition file used for partition distribution. This is inconsistent, and this is why it looks as though the input and output key types are assumed to be the same when you use the same InputFormat for sampling and for your MapReduce job.
So the solution is to write a custom InputFormat, with its own RecordReader, used only for the client-side sampling, which reads your input file and performs the same char-to-int transformation before returning each record. This way, the InputSampler writes the integer ASCII values produced by the custom record reader to the partition file, which preserves the distribution and is usable with your mapper's output keys. A sketch of such a sampling-only InputFormat is below.
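Here is a hedged sketch of that idea (the class name and the first-character key rule are assumptions made for the (char, int) example above):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Sampling-only input format: reads the same lines as the real job, but emits
// keys already transformed the way the mapper would (char -> ASCII int), so
// the InputSampler writes partition-file keys that match the map output keys.
public class AsciiKeySamplingInputFormat extends FileInputFormat<IntWritable, Text> {
    @Override
    public RecordReader<IntWritable, Text> createRecordReader(InputSplit split,
                                                              TaskAttemptContext ctx) {
        return new RecordReader<IntWritable, Text>() {
            private final LineRecordReader lines = new LineRecordReader();
            private final IntWritable key = new IntWritable();

            @Override public void initialize(InputSplit s, TaskAttemptContext c)
                    throws IOException { lines.initialize(s, c); }

            @Override public boolean nextKeyValue() throws IOException {
                if (!lines.nextKeyValue()) return false;
                // Same transformation as the mapper: first char -> ASCII value
                // (Text.charAt returns -1 for an empty line; mapped to 0 here).
                int c = lines.getCurrentValue().charAt(0);
                key.set(Math.max(c, 0));
                return true;
            }

            @Override public IntWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return lines.getCurrentValue(); }
            @Override public float getProgress() throws IOException { return lines.getProgress(); }
            @Override public void close() throws IOException { lines.close(); }
        };
    }
}
```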
It's not so easy to grasp from a few lines of explanation, but anybody interested in fully understanding how the InputSampler and TotalOrderPartitioner work should check out this page: http://blog.ditullio.fr/2016/01/04/hadoop-basics-total-order-sorting-mapreduce/
It explains in detail how to use them in different cases.

How do I limit the number of records sent to the reducer in a map reduce job?

I have a file with over 300,000 lines that is the input to a MapReduce job, and I want the job to process only the first 1000 lines of this file. Is there a good way to limit the number of records sent to the reducer?
A simple identity reducer is all I need to write out my output. Currently, the reducer writes out as many lines as there are in the input.
First, make sure your MapReduce program is set to use only one reducer. This has to be set explicitly; otherwise Hadoop might choose some other number, and then there is no good way to coordinate between reduce tasks to make sure they don't emit more than 1000 records in total. Then you can simply maintain an instance variable in your Reducer class that counts how many records it has seen and stops emitting them after 1000, as in the sketch below.
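A minimal sketch of such a limiting identity-style reducer (the types and class name are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Identity-style reducer that stops emitting after a fixed number of records.
// Requires job.setNumReduceTasks(1); with several reducers, each would apply
// its own limit and the total could exceed 1000.
public class LimitingReducer extends Reducer<Text, Text, Text, Text> {
    private static final int LIMIT = 1000;
    private int emitted = 0;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            if (emitted >= LIMIT) return; // drop everything past the limit
            context.write(key, value);
            emitted++;
        }
    }
}
```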
The other, probably simpler, way to do it would be to shorten your input file. Just delete the lines you don't need.
It's also worth noting that Hive and Pig are both frameworks that will do this type of thing for you. Writing "raw" MapReduce code is rare in practice; most people use one of those two.

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions.
All the reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset costs too much to do every time; it would be better to read it only once and pass the mapped data to multiple reduce functions.
Can I do this with Hadoop? I've searched the examples and the web, but I could not find any solution.
Maybe a simple solution would be to write a job that doesn't have a reduce function, so you pass all the mapped data directly to the output of the job. You just set the number of reducers to zero for the job.
Then you would write a job for each different reduce function that works on that data. This would mean storing all the mapped data on HDFS, though.
Another alternative might be to combine all your reduce functions into a single reducer which outputs to multiple files, using a different output for each function. Multiple outputs are mentioned in this article for Hadoop 0.19. I'm pretty sure this feature was broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.
Are you expecting every reducer to work on exactly the same mapped data? At the very least the "key" should differ, since it decides which reducer the data goes to.
You can write the output multiple times in the mapper, emitting ($i, $key) as the composite key (where $i is the index of the i-th reducer and $key is your original key). You then need to add a Partitioner to make sure these n records are distributed across the reducers based on $i, and a GroupingComparator to group the records by the original $key.
It's possible to do that, but not in a trivial way within one MR job.
You may use composite keys. Let's say you need two kinds of reducers, 'R1' and 'R2'. Add their ids as a prefix to your output keys in the mapper, so a key 'K' becomes 'R1:K' or 'R2:K'.
Then, in the reducer, pass the values to an implementation of R1 or R2 based on the prefix. A sketch of a matching partitioner is below.
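A hedged sketch of a partitioner for this composite-key approach (the class name and the choice to partition on the original key are my own, not from the answer):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitioner for composite keys of the form "R1:K" or "R2:K". Partitioning on
// the original key K keeps all values for a given K together regardless of the
// function prefix; a GroupingComparator can then group by K within a partition.
public class CompositeKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String original = key.toString().split(":", 2)[1]; // strip the "R1:"/"R2:" prefix
        return (original.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```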
I guess you want to run different reducers in a chain. In Hadoop, 'multiple reducers' means running multiple instances of the same reducer. I would propose that you run one reducer at a time, providing a trivial map function for all of them except the first. To minimize the time spent on data transfer, you can use compression.
Of course you can define multiple reducers. For the Job (Hadoop 0.20) just add:
job.setNumReduceTasks(<number>);
But your infrastructure has to support multiple reducers, meaning that you have to:
- have more than one CPU available
- adjust mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml accordingly
And of course your job has to meet some requirements. Without knowing exactly what you want to do, I can only give broad tips:
- the map-output keys must either be partitionable by % numreducers, OR you have to define your own partitioner:
job.setPartitionerClass(...)
for example with a random partitioner...
- the data must be reducible in the partitioned format... (references needed?)
You'll get multiple output files, one for each reducer. If you want a sorted output, you have to add another job that reads all the files (multiple map tasks this time...) and writes them out sorted with only one reducer...
Also have a look at the Combiner class, which acts as a local reducer. It means that you can already aggregate (reduce) in memory over the partial data emitted by map.
A very nice example is the WordCount example. Map emits each word with a count of 1: (word, 1). The combiner gets partial data from the map and emits (word, partial count) locally. The reducer does exactly the same, but now some (combined) word counts are already > 1. This saves bandwidth. A sketch of the wiring is below.
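As a sketch, enabling the combiner in WordCount is a single extra line in the driver (TokenizerMapper and IntSumReducer are the class names used in the standard Hadoop WordCount example):

```java
// The reducer is reused as the combiner: summing counts is associative and
// commutative, so pre-aggregating locally on each mapper's output is safe.
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class); // local, in-memory reduce per mapper
job.setReducerClass(IntSumReducer.class);
```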
I still don't get your problem; you can use the following sequence:
database --> map --> reduce (use cat or None depending on the requirement)
Then store the data representation you have extracted.
If you are saying that it is small enough to fit in memory, then storing it on disk shouldn't be an issue.
Also, your use of the MapReduce paradigm for the given problem is incorrect; using a single map function and multiple "different" reduce functions makes no sense. It shows that you are just using map to hand data out to different machines to do different things. You don't require Hadoop or any other special architecture for that.
