Write output of two different Hadoop jobs to same set of reducers - hadoop

I have a scenario where I need to run two Hadoop jobs calculating n-gram statistics for two different corpora and make sure that they write each n-gram (and it's score) to the same reducer (so that in future I can read the data locally and compare and contrast two scores from two corpora). For e.g. if job J1 executes one of its reducers on machine M and writes n-gram N locally, I would like job J2 to also write n-gram N to the same machine M.
I know how to compute n-gram statistics for a corpora (for reference, one can refer to this pubication from Google). I have also defined my custom partitioner (taking hash based on first two words in the n-gram). Now how do I make sure that two different runs of the same program (on two different corpora) end up writing corresponding output to the same reducers?

Check out MultipleInputs. By pointing two sibling mappers against sibling datasets you can avoid running an ID map on the combined set before reducing.

Related

How mapper and reducer tasks are assigned

When executing a MR job, Hadoop divides the input data into N Splits and then starts the corresponding N Map programs to process them separately.
1.How is the data divided (splited into different inputSplits)?
2.How is Split scheduled (how do you decide which TaskTracker machine the Map program that handles Split should run on)?
3.How to read the divided data?
4.How Reduce task assigned ?
In hadoop1.X
In hadoop 2.x
Both of the questions has some relationship , so I asked them together , you can show which of the part you are good at .
thanks in advance .
Data is stored/read in HDFS Blocks of a predefined size, and read by various RecordReader types by using byte scanners, and knowing how many bytes to read in order to determine when an InputSplit needs to be returned.
A good exercise to understand it better is to implement your own RecordReader and create small and large files of one small record, one large record, and many records. In the many records case, you try to split a record across two blocks, but that test case should be the same as one large record over two blocks.
Reduce tasks can be set by the client of the MapReduce action.
As of Hadoop 2 + YARN, that image is outdated

What is the benefit of the reducers in Hadoop?

I don't see a value for the reducers in Hadoop in the following scenario:
The Map Tasks generate unique keys (Because we can merge both the Map/Reduce functionality together)
The output size of the Map Tasks is too big (This will exhaust the memory if we wait for the reducers to begin the work)
If we have any functionality that doesn't need grouping and sorting of the keys
Please correct me if I am wrong.
And if someone could give me a real example of the benefits of the reducers and when it should be used, I will appreciate it.
Reducer is beneficial (or required) when you need to do operations like aggregation/grouping etc..
FYI : Reducer is meant for grouping different value for a key which comes from different mapper. So for a use case which do not require grouping/aggregation then there is no point of using reducer(you can set it to Zero , meaning Map-Only jobs).
One quick use-case i can think of is - you want to randomly split a big file to multiple part file. In this case you will supply big file (lets say 100G) to Map-Only jobs. All maps will read a chunk of file and write as a part of file.

Where to add specific function in the Map/Reduce Framework

I have a general question to the MAP/Reduce Framework.
I have a task, which can be separated into several partitions. For each partition, I need to run a computation intensive algorithm.
Then, according to the MAP/Reduce Framework, it seems that I have two choices:
Run the algorithm in the Map stage, so that in the reduce stage, there is no work needed to be done, except collect the results of each partition from the Map stage and do summarization
In the Map stage, just divide and send the partitions (with data) to the reduce stage. In the reduce stage, run the algorithm first, and then collect and summarize the results from each partitions.
Correct me if I misunderstand.
I am a beginner. I may not understand the MAP/Reduce very well. I only have basic parallel computing concept.
You're actually really confused. In a broad and general sense, the map portion takes the task and divides it among some n many nodes or so. Those n nodes that receive a fraction of the whole task do something with their piece. When finished computing some steps on their data, the reduce operation reassembles the data.
The REAL power of map-reduce is how scalable it is.
Given a dataset D running on a map-reduce cluster m with n nodes under it, each node is mapped 1/D pieces of the task. Then the cluster m with n nodes reduces those pieces into a single element. Now, take a node q to be a cluster n with p nodes under it. If m assigns q 1/D, q can map 1/D to (1/D)/p with respect to n. Then n's nodes can reduce the data back to q where q can supply its data to its neighbors for m.
Make sense?
In MapReduce, you have a Mapper and a Reducer. You also have a Partitioner and a Combiner.
Hadoop is a distributed file system that partitions(or splits, you might say) the file into blocks of BLOCK SIZE. These partitioned blocks are places on different nodes. So, when a job is submitted to the MapReduce Framework, it divides that job such that there is a Mapper for every input split(for now lets say it is the partitioned block). Since, these blocks are distributed onto different nodes, these Mappers also run on different nodes.
In the Map stage,
The file is divided into records by the RecordReader, the definition of record is controlled by InputFormat that we choose. Every record is a key-value pair.
The map() of our Mapper is run for every such record. The output of this step is again in key-value pairs
The output of our Mapper is partitioned using the Partitioner that we provide, or the default HashPartitioner. Here in this step, by partitioning, I mean deciding which key and its corresponding values go to which Reducer(if there is only one Reducer, its of no use anyway)
Optionally, you can also combine/minimize the output that is being sent to the reducer. You can use a Combiner to do that. Note that, the framework does not guarantee the number of times a Combiner will be called. It is only part of optimization.
This is where your algorithm on the data is usually written. Since these tasks run in parallel, it makes a good candidate for computation intensive tasks.
After all the Mappers complete running on all nodes, the intermediate data i.e the data at end of Map stage is copied to their corresponding reducer.
In the Reduce stage, the reduce() of our Reducer is run on each record of data from the Mappers. Here the record comprises of a key and its corresponding values, not necessarily just one value. This is where you generally run your summarization/aggregation logic.
When you write your MapReduce job you usually think about what can be done on each record of data in both the Mapper and Reducer. A MapReduce program can just contain a Mapper with map() implemented and a Reducer with reduce() implemented. This way you can focus more on what you want to do with the data and not bother about parallelizing. You don't have to worry about how the job is split, the framework does that for you. However, you will have to learn about it sooner or later.
I would suggest you to go through Apache's MapReduce tutorial or Yahoo's Hadoop tutorial for a good overview. I personally like yahoo's explanation of Hadoop but Apache's details are good and their explanation using word count program is very nice and intuitive.
Also, for
I have a task, which can be separated into several partitions. For
each partition, I need to run a computing intensive algorithm.
Hadoop distributed file system has data split onto multiple nodes and map reduce framework assigns a task to every every split. So, in hadoop, the process goes and executes where the data resides. You cannot define the number of map tasks to run, data does. You can however, specify/control the number of reduce tasks.
I hope I have comprehensively answered your question.

How does partitioning in MapReduce exactly work?

I think I have a fair understanding of the MapReduce programming model in general, but even after reading the original paper and some other sources many details are unclear to me, especially regarding the partitioning of the intermediate results.
I will quickly summarize my understanding of MapReduce so far: We have a potentially very large input data set, which is automatically split up into M different pieces by the MR-Framework. For each piece, the framework schedules one map task which is executed by one of the available processors/machines in my cluster. Each of the M map tasks outputs a set of Key-Value-Pairs, which is stored locally on the same machine that executed this map task. Each machine divides its disk into R partitions and distributes its computed intermediate key value pairs based on the intermediate keys among the partitions. Then, the framework starts for each distinct intermediate key one reduce task which is again executed by any of the available machines.
Now my questions are:
In some tutorials it sounds like there could be map and reduce tasks executed in parallel. Is this right? How could that be, assuming that for each distinct intermediate key only one reduce task is started? Do we not have to wait until the last map task is finished before we can start the first reduce task?
As we have one reduce task per distinct intermediate key, is it right that each reduce task requires the executing machine to load the corresponding partition from every other machine? Potentially, every machine can have a key-value-pair with the desired intermediate key, so for each reduce task we potentially have to query all other machines. Is that really efficient?
The original paper says that the number of partitions (R) is specified by the user. But isn’t a partition the input for a reduce task? Or more exactly: Isn’t the union of all partitions with the same number among all machines the input of one reduce task? That would mean, that R depends on the number of distinct intermediate keys which the user usually doesn’t know.
Conceptually it is clear what the input and outputs of the map and reduce functions/tasks are. But I think I haven’t yet understood MapReduce on the technical level. Could somebody please help me understanding?
You can start the reducer tasks while the map tasks are still running (using a feature known as slowstart), but the reducers can only run the copy phase (acquiring the completed results from the completed map tasks. It will need to wait for all the mappers to complete before it can actually perform the final sort and reduce.
A reduce task actually processes zero, one or more keys (rather than a discrete tasks for each key). Each reducer will need to acquire the map output from each map task that relates to its partition before these intermediate outputs are sorted and then reduced one key set at a time.
Back to the note in 2 - a reducer task (one for each partition) runs on zero, one or more keys rather than a single task for each discrete key.
It's also important to understand the spread and variation of your intermediate key as it is hashed and modulo'd (if using the default HashPartitioner) to determine which reduce partition should process that key. Say you had an even number of reducer tasks (10), and output keys that always hashed to an even number - then in this case the modulo of these hashs numbers and 10 will always be an even number, meaning that the odd numbered reducers would never process any data.
Addendum to what Chris said,
Basically, a partitioner class in Hadoop (e.g. Default HashPartitioner)
has to implement this function,
int getPartition(K key, V value, int numReduceTasks)
This function is responsible for returning you the partition number and you get the number of reducers you fixed when starting the job from the numReduceTasks variable, as seen for in the HashPartitioner.
Based on what integer the above function return, Hadoop selects node where the reduce task for a particular key should run.
Hope this helps.

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions.
All reduce algorithm work on the same dataset generated by the same map function. Reading the large dataset costs too much to do it every time, it would be better to read only once and pass the mapped data to multiple reduce functions.
Can I do this with Hadoop? I've searched the examples and the intarweb but I could not find any solutions.
Maybe a simple solution would be to write a job that doesn't have a reduce function. So you would pass all the mapped data directly to the output of the job. You just set the number of reducers to zero for the job.
Then you would write a job for each different reduce function that works on that data. This would mean storing all the mapped data on the HDFS though.
Another alternative might be to combine all your reduce functions into a single Reducer which outputs to multiple files, using a different output for each different function. Multiple outputs are mentioned in this article for hadoop 0.19. I'm pretty sure that this feature is broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.
Are you expecting every reducer to work on exactly same mapped data? But at least the "key" should be different since it decides which reducer to go.
You can write an output for multiple times in mapper, and output as key (where $i is for the i-th reducer, and $key is your original key). And you need to add a "Partitioner" to make sure these n records are distributed in reducers, based on $i. Then using "GroupingComparator" to group records by original $key.
It's possible to do that, but not in trivial way in one MR.
You may use composite keys. Let's say you need two kinds of the reducers, 'R1' and 'R2'. Add ids for these as a prefix to your o/p keys in the mapper. So, in the mapper, a key 'K' now becomes 'R1:K' or 'R2:K'.
Then, in the reducer, pass values to implementations of R1 or R2 based on the prefix.
I guess you want to run different reducers in a chain. In hadoop 'multiple reducers' means running multiple instances of the same reducer. I would propose you run one reducer at a time, providing trivial map function for all of them except the first one. To minimize time for data transfer, you can use compression.
Of course you can define multiple reducers. For the Job (Hadoop 0.20) just add:
job.setNumReduceTasks(<number>);
But. Your infrastructure has to support the multiple reducers, meaning that you have to
have more than one cpu available
adjust mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml accordingly
And of course your job has to match some specifications. Without knowing what you exactly want to do, I only can give broad tips:
the keymap-output have either to be partitionable by %numreducers OR you have to define your own partitioner:
job.setPartitionerClass(...)
for example with a random-partitioner ...
the data must be reduce-able in the partitioned format ... (references needed?)
You'll get multiple output files, one for each reducer. If you want a sorted output, you have to add another job reading all files (multiple map-tasks this time ...) and writing them sorted with only one reducer ...
Have a look too at the Combiner-Class, which is the local Reducer. It means that you can aggregate (reduce) already in memory over partial data emitted by map.
Very nice example is the WordCount-Example. Map emits each word as key and its count as 1: (word, 1). The Combiner gets partial data from map, emits (, ) locally. The Reducer does exactly the same, but now some (Combined) wordcounts are already >1. Saves bandwith.
I still dont get your problem you can use following sequence:
database-->map-->reduce(use cat or None depending on requirement)
then store the data representation you have extracted.
if you are saying that it is small enough to fit in memory then storing it on disk shouldnt be an issue.
Also your use of MapReduce paradigm for the given problem is incorrect, using a single map function and multiple "different" reduce function makes no sense, it shows that you are just using map to pass out data to different machines to do different things. you dont require hadoop or any other special architecture for that.

Resources