Data sharing in Hadoop Map Reduce chaining - hadoop

Is it possible to share a value between successive reducer and mapper?
Or is it possible to store the output of first reducer into memory and second mapper can access that from memory ?
Problem is ,
I had written a chain map reducer like Map1 -> Reducer1 --> Map2 --> Reducer2.
Map1 and Map2 is reading the same input file.
Reduce1 is deriving a value suppose 'X' as its output.
I need 'X' and input file for Map2.
How can we do this without reading the output file of Reduce1?
Is it possible store 'X' in memory to access for Mapper 2 ?

Each job is independent of each other, so without storing the output in intermediate location it's not possible to share the data across jobs.
FYI, in MapReduce model the map tasks don't talk to each other. Same is the case for reduce tasks also. Apache Giraph which runs on Hadoop uses communication between the mappers in the same job for iterative algorithms which requires the same job to be run again and again without communication between the mappers.
Not sure about the algorithm being implemented and why MR, but every MR algorithm can be implemented in BSP also. Here is a paper comparing BSP with MR. Some of the algorithms perform well in BSP when compared to MR. Apache Hama is an implementation of the BSP model, the way Apache Hadoop is an implementation of MR.

If number of distinct rows produced by Reducer1 is small (say you have 10000 (id,price) tuples), using two stage processing is preferred. You can load results from first map/reduce into memory in each Map2 mapper and filter input data. So, no uneeded data will be transferred via network and all data will be processed locally. With use of combineres amount of data can be even less.
In case of huge amount of distinct rows looks like you need to read data twice.

Related

How mapper and reducer tasks are assigned

When executing a MR job, Hadoop divides the input data into N Splits and then starts the corresponding N Map programs to process them separately.
1.How is the data divided (splited into different inputSplits)?
2.How is Split scheduled (how do you decide which TaskTracker machine the Map program that handles Split should run on)?
3.How to read the divided data?
4.How Reduce task assigned ?
In hadoop1.X
In hadoop 2.x
Both of the questions has some relationship , so I asked them together , you can show which of the part you are good at .
thanks in advance .
Data is stored/read in HDFS Blocks of a predefined size, and read by various RecordReader types by using byte scanners, and knowing how many bytes to read in order to determine when an InputSplit needs to be returned.
A good exercise to understand it better is to implement your own RecordReader and create small and large files of one small record, one large record, and many records. In the many records case, you try to split a record across two blocks, but that test case should be the same as one large record over two blocks.
Reduce tasks can be set by the client of the MapReduce action.
As of Hadoop 2 + YARN, that image is outdated

What's the difference between shuffle phase and combiner phase?

i'm pretty confused about the MapReduce Framework. I'm getting confused reading from different sources about that. By the way, this is my idea of a MapReduce Job
1. Map()-->emit <key,value>
2. Partitioner (OPTIONAL) --> divide
intermediate output from mapper and assign them to different
reducers
3. Shuffle phase used to make: <key,listofvalues>
4. Combiner, component used like a minireducer wich perform some
operations on datas and then pass those data to the reducer.
Combiner is on local not HDFS, saving space and time.
5. Reducer, get the data from the combiner, perform further
operation(probably the same as the combiner) then release the
output.
6. We will have n outputs parts, where n is the number
of reducers
It is basically right? I mean, i found some sources stating that combiner is the shuffle phase and it basically groupby each record by key...
Combiner is NOT at all similar to the shuffling phase. What you describe as shuffling is wrong, which is the root of your confusion.
Shuffling is just copying keys from map to reduce, it has nothing to do with key generation. It is the first phase of a Reducer, with the other two being sorting and then reducing.
Combining is like executing a reducer locally, for the output of each mapper. It basically acts like a reducer (it also extends the Reducer class), which means that, like a reducer, it groups the local values that the mapper has emitted for the same key.
Partitioning is, indeed, assigning the map output keys to specific reduce tasks, but it is not optional. Overriding the default HashPartitioner with an implementation of your own is optional.
I tried to keep this answer minimal, but you can find more information on the book Hadoop: The Definitive Guide by Tom White, as Azim suggests, and some related things in this post.
Think of combiner as a mini-reducer phase that only works on the output of map task within each node before it emits it to the actual reducer.
Taking the classical WordCount example, map phase output would be (word,1) for each word the map task processes. Lets assume the input to be processed is
"She lived in a big house with a big garage on the outskirts of a big
city in India"
Without a combiner, map phase would emit (big,1) three times and (a,1) three times and (in,1) two times. But when a combiner is used, the map phase would emit (big,3), (a,3) and (in,2). Note that the individual occurrences of each of these words is aggregated locally within the map phase before it emits its output to reduce phase. In use cases where Combiner is used, it would optimise to ensure network traffic from map to reduce is minimised due to local aggregation.
During the shuffle phase, output from various map phases are redirected to the correct reducer phase. This is handled internally by the framework. If a partitioner is used, it would be helpful to shuffle the input to reduce accordingly.
I don't think that combiner is a part of Shuffle and Sort phase.
Combiner, itself is one of the phases(optional) of the job lifecycle.
The pipelining of these phases could be like:
Map --> Partition --> Combiner(optional) --> Shuffle and Sort --> Reduce
Out of these phases, Map, Partition and Combiner operate on the same node.
Hadoop dynamically selects nodes to run Reduce Phase depend upon the availability and accessibility of the resources in best possible way.
Shuffle and Sort, an important middle level phase works across the Map and Reduce nodes.
When a client submits a job, Map Phase starts working on input file which is stored across nodes in the form of blocks.
Mappers process each line of the file one by one and put the result generated into some memory buffer of 100MB(local memory to each mapper). When this buffer gets filled till a certain threshold, by default 80%, this buffer is sorted and then stored into the disk(as file). Each Mapper can generate multiple such intermediate sorted splits or files. When Mapper is done with all the lines of the block, all such splits are merged together(to form a single file), sorted(on the basis of key) and then Combiner phase starts working on this single file. Note that, if there is no Paritition phase, only one intermediate file will be produced, but in case of Parititioning multiple files get generated depending upon the developers logic. Below image from Oreilly Hadoop: The Definitive guide, may help you in understanding this concept in more details.
Later, Hadoop copies merged file from each of the Mapper nodes to the Reducer nodes depending upon the key value. That is all the records of the same key will be copied to the same Reducer node.
I think, you may know in depth about SS and Reduce Phase work, so not going into more details for these topics.
Also, for more information, I would suggest you to read Oreilly Hadoop: The Definitive guide. Its awesome book for Hadoop.

Where to add specific function in the Map/Reduce Framework

I have a general question to the MAP/Reduce Framework.
I have a task, which can be separated into several partitions. For each partition, I need to run a computation intensive algorithm.
Then, according to the MAP/Reduce Framework, it seems that I have two choices:
Run the algorithm in the Map stage, so that in the reduce stage, there is no work needed to be done, except collect the results of each partition from the Map stage and do summarization
In the Map stage, just divide and send the partitions (with data) to the reduce stage. In the reduce stage, run the algorithm first, and then collect and summarize the results from each partitions.
Correct me if I misunderstand.
I am a beginner. I may not understand the MAP/Reduce very well. I only have basic parallel computing concept.
You're actually really confused. In a broad and general sense, the map portion takes the task and divides it among some n many nodes or so. Those n nodes that receive a fraction of the whole task do something with their piece. When finished computing some steps on their data, the reduce operation reassembles the data.
The REAL power of map-reduce is how scalable it is.
Given a dataset D running on a map-reduce cluster m with n nodes under it, each node is mapped 1/D pieces of the task. Then the cluster m with n nodes reduces those pieces into a single element. Now, take a node q to be a cluster n with p nodes under it. If m assigns q 1/D, q can map 1/D to (1/D)/p with respect to n. Then n's nodes can reduce the data back to q where q can supply its data to its neighbors for m.
Make sense?
In MapReduce, you have a Mapper and a Reducer. You also have a Partitioner and a Combiner.
Hadoop is a distributed file system that partitions(or splits, you might say) the file into blocks of BLOCK SIZE. These partitioned blocks are places on different nodes. So, when a job is submitted to the MapReduce Framework, it divides that job such that there is a Mapper for every input split(for now lets say it is the partitioned block). Since, these blocks are distributed onto different nodes, these Mappers also run on different nodes.
In the Map stage,
The file is divided into records by the RecordReader, the definition of record is controlled by InputFormat that we choose. Every record is a key-value pair.
The map() of our Mapper is run for every such record. The output of this step is again in key-value pairs
The output of our Mapper is partitioned using the Partitioner that we provide, or the default HashPartitioner. Here in this step, by partitioning, I mean deciding which key and its corresponding values go to which Reducer(if there is only one Reducer, its of no use anyway)
Optionally, you can also combine/minimize the output that is being sent to the reducer. You can use a Combiner to do that. Note that, the framework does not guarantee the number of times a Combiner will be called. It is only part of optimization.
This is where your algorithm on the data is usually written. Since these tasks run in parallel, it makes a good candidate for computation intensive tasks.
After all the Mappers complete running on all nodes, the intermediate data i.e the data at end of Map stage is copied to their corresponding reducer.
In the Reduce stage, the reduce() of our Reducer is run on each record of data from the Mappers. Here the record comprises of a key and its corresponding values, not necessarily just one value. This is where you generally run your summarization/aggregation logic.
When you write your MapReduce job you usually think about what can be done on each record of data in both the Mapper and Reducer. A MapReduce program can just contain a Mapper with map() implemented and a Reducer with reduce() implemented. This way you can focus more on what you want to do with the data and not bother about parallelizing. You don't have to worry about how the job is split, the framework does that for you. However, you will have to learn about it sooner or later.
I would suggest you to go through Apache's MapReduce tutorial or Yahoo's Hadoop tutorial for a good overview. I personally like yahoo's explanation of Hadoop but Apache's details are good and their explanation using word count program is very nice and intuitive.
Also, for
I have a task, which can be separated into several partitions. For
each partition, I need to run a computing intensive algorithm.
Hadoop distributed file system has data split onto multiple nodes and map reduce framework assigns a task to every every split. So, in hadoop, the process goes and executes where the data resides. You cannot define the number of map tasks to run, data does. You can however, specify/control the number of reduce tasks.
I hope I have comprehensively answered your question.

Same-machine-as-data processing on reduce side of map reduce

One of the big benefits of Hadoop MapReduce is the fact that Map processes take place on the same machine that the data they operate upon resides (to the extent possible). But can this be or is this perhaps already true of the Reduce side? For example, in the extreme case of a Map-only job, all of the output data ends up on the same machine as the corresponding input data (right?). But in an intermediate case in which the output is somewhat correlated with the output, it seems reasonable to partition the output and to the extent possible keep it on same machine at it started on.
Is this possible? Does this already happen?
Inputs to the Reducers can reside on any node(local or remote) and not necessarily on the same machine where they are running. As Mappers complete their output gets written onto the local FS of the machine where they are running. Once this is done the intermediate output is needed by the machines that are about to run the reduce task. One thing to note here is that all the values corresponding to a particular key go the same reducer. So, it's not always possible that the input to Reducers is local, since different sets of key/value pairs are processed by different Mappers running on different machines.
Now, before the Mapper output is sent to Reducers for further processing, the data is partitioned based on keys and each partition goes to a Reducer and all the key/value pairs in that partition get processed by that Reducer. During the process a lot of data shuffling takes place. So it's not possible to maintain the data locality in case of Reducers.
Hope this answers the question.
If you know that the data for a particular reducer is already on the right node after the map phase, and the algorithm allows for it (see this blog post about it) you should insert your reducer as a combiner. Combiners are like miniature reducers that only get to see co-located data. Often you can dramatically improve performance because the combiner output can be orders of magnitude smaller than the map output, so what's left to shuffle is trivial.
Of course, if indeed the map phase leaves your data already correctly partitioned, why use a reducer at all? Why not create a second map job that simulates a reducer?

How does Hadoop/MapReduce scale when input data is NOT stored?

The intended use for Hadoop appears to be for when the input data is distributed (HDFS) and already stored local to the nodes at the time of the mapping process.
Suppose we have data which does not need to be stored; the data can be generated at runtime. For example, the input to the mapping process is to be every possible IP address. Is Hadoop capable of efficiently distributing the Mapper work across nodes? Would you need to explicitly define how to split the input data (i.e. the IP address space) to different nodes, or does Hadoop handle that automatically?
Let me first clarify a comment you made. Hadoop is designed to support potentially massively parallel computation across a potentially large number of nodes regardless of where the data comes from or goes. The Hadoop design favors scalability over performance when it has to. It is true that being clever about where the data starts out and how that data is distributed can make a significant difference in how well/quickly a hadoop job can run.
To your question and example, if you will generate the input data you have the choice of generating it before the first job runs or you can generate it within the first mapper. If you generate it within the mapper then you can figure out what node the mapper's running on and then generate just the data that would be reduced in that partition (Use a partitioner to direct data between mappers and reducers)
This is going to be a problem you'll have with any distributed platform. Storm, for example, lets you have some say in which bolt instance will will process each tuple. The terminology might be different, but you'll be implementing roughly the same shuffle algorithm in Storm as you would Hadoop.
You are probably trying to run a non-MapReduce task on a map reduce cluster then. (e.g. IP scanning?) There may be more appropriate tools for this, your know...
A thing few people do not realize is that MapReduce is about checkpointing. It was developed for huge clusters, where you can expect machines to fail during the computation. By having checkpointing and recovery built-in into the architecture, this reduces the consequences of failures and slow hosts.
And that is why everything goes from disk to disk in MapReduce. It's checkpointed before, and it's checkpointed after. And if it fails, only this part of the job is re-run.
You can easily outperform MapReduce by leaving away the checkpointing. If you have 10 nodes, you will win easily. If you have 100 nodes, you will usually win. If you have a major computation and 1000 nodes, chances are that one node fails and you wish you had been doing similar checkpointing...
Now your task doesn't sound like a MapReduce job, because the input data is virtual. It sounds much more as if you should be running some other distributed computing tool; and maybe just writing your initial result to HDFS for later processing via MapReduce.
But of course there are way to hack around this. For example, you could use /16 subnets as input. Each mapper reads a /16 subnet and does it's job on that. It's not that much fake input to generate if you realize that you don't need to generate all 2^32 IPs, unless you have that many nodes in your cluster...
Number of Mappers depends on the number of Splits generated by the implementation of the InputFormat.
There is NLineInputFormat, which you could configure to generate as many splits as there are lines in the input file. You could create a file where each line is an IP range. I have not used it personally and there are many reports that it does not work as expected.
If you really need it, you could create your own implementation of the InputFormat which generates the InputSplits for your virtual data and force as many mappers as you need.

Resources