I have a large dataset in text files (1,000,000 lines), and each line has 128 columns.
Now I am trying to build a k-d tree from this large dataset, and I want to use MapReduce for the calculations.
My brute-force approach to the problem:
1) Write a MapReduce job to find the variance of each column and select the column with the highest variance.
2) Taking (column name, variance value) as input, write another MapReduce job to split the input data into two parts: one part has all the rows whose value in the given column is less than the input value, and the other part has all the rows whose value is greater.
3) For each part, repeat steps 1 and 2, and continue the process until you are left with 500 values in each part.
The (column name, variance value) pair forms a single node of my tree, so with the brute-force approach, for a tree of height 10 I need to run 1024 MapReduce jobs.
My questions:
1) Is there any way I can improve efficiency by running fewer MapReduce jobs?
2) I am reading the same data every time. Is there any way I can avoid that?
3) Are there any other frameworks, like Pig or Hive, that are efficient for this kind of task?
4) Are there any frameworks with which I can save the data into a data store and easily retrieve it?
Please help.
Why don't you try using Apache Spark (https://spark.apache.org/) here? This seems like a perfect use case for Spark.
With an MR job per node of the tree you have O(2^n) jobs (where n is the height of the tree), which is bad given YARN's per-job overhead. But with simple programming tricks you can bring that down to O(n).
Here are some ideas:
Add an extra partition column in front of your key; this column is a nodeID (each node in your tree has a unique ID). This creates independent data flows and ensures that keys from different branches of the tree do not mix, so all of the variances are calculated in the context of their nodeID in waves, one wave per layer of nodes. This removes the need for an MR job per node with very little change to the code, and ensures you have O(n) jobs rather than O(2^n);
The data is not sorted around the split value, so while splitting, elements from the parent list have to travel to their destination child lists, and there will be network traffic between the cluster nodes. Thus, caching the whole dataset on a cluster with multiple machines might not give significant improvements;
After calculating a couple of levels of the tree, certain nodeIDs may have few enough rows to fit in the memory of a mapper or reducer. You could then continue processing that sub-tree completely in memory and avoid costly MR jobs, which reduces the number of MR jobs (or the amount of data) as processing gets closer to the bottom of the tree;
Another optimisation is to write a single MR job whose mapper does the splitting around the selected value of each node, outputs the parts via MultipleOutputs, and emits the keys with the child nodeIDs of the next tree level to the reducer, which calculates the variance of the columns within the child lists. Of course, the very first run has no split value, but all subsequent runs will have multiple split values, one for each child nodeID.
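The nodeID trick above can be sketched in plain Python. This is an illustrative single-process simulation, not Hadoop code; the split value here is the column mean, which is an assumption — a real job would use whatever split rule it actually selects:

```python
# Illustrative simulation of the "nodeID as key prefix" idea: one pass
# per tree LEVEL instead of one job per tree NODE.
from collections import defaultdict

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def level_pass(tagged_rows):
    """One 'MR job': group rows by nodeID, pick the highest-variance
    column per node, split each node's rows into two child nodes."""
    by_node = defaultdict(list)              # shuffle: group by nodeID
    for node_id, row in tagged_rows:
        by_node[node_id].append(row)

    out = []
    for node_id, rows in by_node.items():    # one reduce() per nodeID
        ncols = len(rows[0])
        col = max(range(ncols), key=lambda c: variance([r[c] for r in rows]))
        split = sum(r[col] for r in rows) / len(rows)  # assumed split rule
        for r in rows:                       # emit with child nodeID
            child = node_id * 2 + (1 if r[col] < split else 2)
            out.append((child, r))
    return out

rows = [(0, (1.0, 10.0)), (0, (2.0, 50.0)), (0, (3.0, 90.0)), (0, (4.0, 30.0))]
level1 = level_pass(rows)        # all of tree level 1 in a single pass
level2 = level_pass(level1)      # all of tree level 2 in a single pass
```

Each call to `level_pass` stands in for one MR job that processes every node of a level at once, which is why the job count is linear in the tree height.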
Related
I am trying to understand MapReduce, so this is a very noob question. I am looking at the picture below. From my understanding, which might very well be wrong, there are four nodes in the mapping phase and then seven nodes in the shuffle phase. Every key:value pair is moved to a different node. My question is: what happens if there are only three nodes in the shuffle phase? If you have four equal-sized key:value pairs, can you move them arbitrarily to nodes, so it doesn't matter that one is twice the size of the rest, or do you split one of the pairs and spread it out evenly?
This image doesn't show actual nodes. Instead, each shuffle/reduce rectangle is a single call to the reduce() function. There are 7 of them because 7 distinct keys were emitted by the mapper stage. These calls are distributed among reduce tasks. You configure the number of reduce tasks yourself with job.setNumReduceTasks(5). If you have one reduce task, all calls happen there. If you have two reduce tasks, some calls happen in the first reduce task and others in the second (as controlled by the Partitioner). If you have 1000 reduce tasks, only some of them will get reduce() calls; the others won't process any data at all.
Reduce tasks are started as separate processes on physical cluster nodes. They may or may not all start simultaneously (depending on how many resources you have and also on your scheduler).
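The distribution of reduce() calls across reduce tasks can be sketched like this in plain Python. It is illustrative only; `assign_reduce_tasks` is a hypothetical helper mimicking setNumReduceTasks plus the default hash partitioner:

```python
# How 7 distinct mapper keys map onto a configurable number of reduce
# tasks, mimicking job.setNumReduceTasks(n) + the default partitioner.
def assign_reduce_tasks(keys, num_reduce_tasks):
    """Return {task_index: [keys whose reduce() call runs in that task]}."""
    tasks = {i: [] for i in range(num_reduce_tasks)}
    for k in keys:
        tasks[hash(k) % num_reduce_tasks].append(k)  # default hash partitioner
    return tasks

keys = ["a", "b", "c", "d", "e", "f", "g"]   # 7 distinct keys from the mappers
three = assign_reduce_tasks(keys, 3)   # 7 reduce() calls spread over 3 tasks
one = assign_reduce_tasks(keys, 1)     # all 7 calls happen in a single task
```

With 1000 tasks, most entries of the returned dict would simply stay empty, matching the answer's point that some reduce tasks process no data at all.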
I have a list of numbers and want to compute the difference of consecutive numbers in that list. I'm working on RDDs in Apache Spark.
Example:
Input: [1,2,5,7,8,10,13,17,20,20,21]
Output: [1,3,2,1,2,3,4,3,0,1]
I'm wondering if this is possible using the mapreduce paradigm without duplicating the input RDD.
You can use org.apache.spark.mllib.rdd.RDDFunctions.sliding.
Returns an RDD created by grouping items of its parent RDD into fixed-size blocks by passing a sliding window over them. The ordering is first based on the partition index and then on the ordering of items within each partition. This is similar to sliding in Scala collections, except that it becomes an empty RDD if the window size is greater than the total number of items. It needs to trigger a Spark job if the parent RDD has more than one partition and the window size is greater than 1.
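The logic that sliding(2) provides, sketched in plain Python — this is not Spark code, just the equivalent computation on an in-memory list, using the example from the question:

```python
# sliding(2) yields windows of two consecutive elements; mapping each
# window to (second - first) gives the consecutive differences.
def consecutive_diffs(xs):
    windows = [xs[i:i + 2] for i in range(len(xs) - 1)]  # sliding(2)
    return [b - a for a, b in windows]                   # map to diffs

print(consecutive_diffs([1, 2, 5, 7, 8, 10, 13, 17, 20, 20, 21]))
# -> [1, 3, 2, 1, 2, 3, 4, 3, 0, 1]
```

In Spark, the windows would come from `sliding(2)` on the RDD (ordered by partition index, then position within each partition), and the subtraction would be an ordinary `map`.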
I have a general question to the MAP/Reduce Framework.
I have a task, which can be separated into several partitions. For each partition, I need to run a computation intensive algorithm.
Then, according to the MAP/Reduce Framework, it seems that I have two choices:
Run the algorithm in the map stage, so that in the reduce stage there is no work to be done except collecting the results of each partition from the map stage and summarizing them.
In the map stage, just divide the data and send the partitions to the reduce stage. In the reduce stage, run the algorithm first, then collect and summarize the results from each partition.
Correct me if I misunderstand.
I am a beginner. I may not understand the MAP/Reduce very well. I only have basic parallel computing concept.
You're actually really confused. In a broad and general sense, the map portion takes the task and divides it among some number n of nodes. Each of those n nodes receives a fraction of the whole task and does something with its piece. When the nodes finish computing their steps on their data, the reduce operation reassembles the results.
The REAL power of map-reduce is how scalable it is.
Given a dataset D on a map-reduce cluster m with n nodes under it, each node is mapped one piece of D. The cluster m with its n nodes then reduces those pieces into a single result. Now, suppose a node q is itself a cluster with p nodes under it. If m assigns q a piece of D, q can in turn map that piece across its own p nodes, have them reduce the data back to q, and then q can supply its result upward to its neighbors in m.
Make sense?
In MapReduce, you have a Mapper and a Reducer. You also have a Partitioner and a Combiner.
Hadoop has a distributed file system (HDFS) that partitions (or splits, you might say) a file into blocks of BLOCK SIZE. These partitioned blocks are placed on different nodes. So, when a job is submitted to the MapReduce framework, it divides the job such that there is a Mapper for every input split (for now, let's say an input split is a partitioned block). Since these blocks are distributed onto different nodes, the Mappers also run on different nodes.
In the Map stage,
The file is divided into records by the RecordReader, the definition of record is controlled by InputFormat that we choose. Every record is a key-value pair.
The map() of our Mapper is run for every such record. The output of this step is again in key-value pairs
The output of our Mapper is partitioned using the Partitioner that we provide, or the default HashPartitioner. Here, by partitioning, I mean deciding which key and its corresponding values go to which Reducer (if there is only one Reducer, partitioning makes no difference anyway).
Optionally, you can also combine/minimize the output that is being sent to the reducer, using a Combiner. Note that the framework does not guarantee the number of times a Combiner will be called; it is only an optimization.
This is where your algorithm on the data is usually written. Since these tasks run in parallel, it makes a good candidate for computation intensive tasks.
After all the Mappers on all nodes complete, the intermediate data (i.e. the data at the end of the map stage) is copied to its corresponding reducers.
In the reduce stage, the reduce() of our Reducer is run on each record of data from the Mappers. Here, a record comprises a key and its corresponding values, not necessarily just one value. This is where you generally run your summarization/aggregation logic.
When you write your MapReduce job you usually think about what can be done on each record of data in both the Mapper and Reducer. A MapReduce program can just contain a Mapper with map() implemented and a Reducer with reduce() implemented. This way you can focus more on what you want to do with the data and not bother about parallelizing. You don't have to worry about how the job is split, the framework does that for you. However, you will have to learn about it sooner or later.
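The whole pipeline described above can be sketched in a few lines of plain Python using the classic word count. This is a single-process simulation; real Hadoop runs map() and reduce() on different nodes, and `map_fn`/`reduce_fn` are hypothetical names standing in for your Mapper and Reducer:

```python
# Minimal map -> shuffle/sort -> reduce simulation: word count.
from itertools import groupby

def map_fn(record):
    for word in record.split():
        yield (word, 1)                 # emit key-value pairs

def reduce_fn(key, values):
    return (key, sum(values))           # summarize per key

records = ["the quick fox", "the lazy dog", "the fox"]
intermediate = [kv for r in records for kv in map_fn(r)]     # map stage
intermediate.sort(key=lambda kv: kv[0])                      # shuffle & sort
counts = dict(reduce_fn(k, [v for _, v in group])            # reduce stage
              for k, group in groupby(intermediate, key=lambda kv: kv[0]))
print(counts)   # e.g. {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Notice that you only wrote per-record logic (`map_fn`, `reduce_fn`); the framework's job is everything between them.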
I would suggest going through Apache's MapReduce tutorial or Yahoo's Hadoop tutorial for a good overview. I personally like Yahoo's explanation of Hadoop, but Apache's details are good, and their explanation using the word count program is very nice and intuitive.
Also, for
I have a task, which can be separated into several partitions. For
each partition, I need to run a computing intensive algorithm.
The Hadoop distributed file system splits the data onto multiple nodes, and the MapReduce framework assigns a task to every split. So, in Hadoop, the process goes and executes where the data resides. You cannot define the number of map tasks to run; the data determines it. You can, however, specify/control the number of reduce tasks.
I hope I have comprehensively answered your question.
In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair.
What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
First of all, shuffling is the process of transferring data from the mappers to the reducers, so I think it is obvious that it is necessary for the reducers, since otherwise they wouldn't be able to have any input (or input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.
Sorting saves time for the reducer, helping it easily distinguish when a new reduce task should start. To put it simply, it starts a new reduce task when the next key in the sorted input data is different from the previous one. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method, which takes a key -> list(values) input, so it has to group values by key. This is easy to do if the input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers).
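That grouping step can be sketched in plain Python: once the pairs are sorted by key, a single linear scan yields each key with its list of values, which is exactly the shape reduce() expects. Illustrative only:

```python
# Why pre-sorted input makes the reducer's job easy: sorting puts equal
# keys next to each other, so one pass (itertools.groupby) produces the
# key -> [values] groups without any hashing or extra memory.
from itertools import groupby

shuffled = [("b", 2), ("a", 1), ("b", 5), ("a", 4)]   # pairs from many mappers
shuffled.sort(key=lambda kv: kv[0])                   # the merge-sort step

grouped = [(k, [v for _, v in g])
           for k, g in groupby(shuffled, key=lambda kv: kv[0])]
print(grouped)   # [('a', [1, 4]), ('b', [2, 5])]
```

Without the sort, `groupby` would emit the same key multiple times, and the reducer would have to buffer all data to regroup it.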
Partitioning, which you mentioned in one of the answers, is a different process. It determines to which reducer a (key, value) pair output by the map phase will be sent. The default Partitioner uses a hash of the key to distribute pairs to the reduce tasks, but you can override it and use your own custom Partitioner.
A great source of information for these steps is this Yahoo tutorial (archived).
A nice graphical representation of this is the following (shuffle is called "copy" in this figure):
Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).
UPDATE: Since you are looking for something more official, you can also read Tom White's book "Hadoop: The Definitive Guide". Here is the interesting part for your question.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation, so I guess it is pretty credible and official...
Let's revisit the key phases of a MapReduce program.
The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair.
The combine phase is done by combiners. The combiner should combine key/value pairs with the same key. Each combiner may run zero, one, or multiple times.
The shuffle and sort phase is done by the framework. Data from all mappers are grouped by the key, split among reducers and sorted by the key. Each reducer obtains all values associated with the same key. The programmer may supply custom compare functions for sorting and a partitioner for data split.
The partitioner decides which reducer will get a particular key value pair.
The reducer obtains sorted key/[values list] pairs, sorted by the key. The value list contains all values with the same key produced by mappers. Each reducer emits zero, one or multiple output key/value pairs for each input key/value pair.
Have a look at this JavaCodeGeeks article by Maria Jurcovicova and this MSSQLTips article by Datta for a better understanding.
Below is the image from the safaribooksonline article.
I thought I'd add some points missing from the above answers. This diagram, taken from here, clearly shows what's really going on.
To restate the real purpose of each step:
Split: Improves parallel processing by distributing the processing load across different nodes (Mappers), which saves overall processing time.
Combine: Shrinks the output of each Mapper. This saves the time spent moving the data from one node to another.
Sort (Shuffle & Sort): Makes it easy for the runtime to schedule (spawn/start) new reducers: while going through the sorted item list, whenever the current key is different from the previous one, it can spawn a new reducer.
Some data processing requirements don't need sorting at all. Syncsort has made the sorting in Hadoop pluggable. Here is a nice blog from them on sorting. The process of moving the data from the mappers to the reducers is called shuffling; check this article for more information on it.
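The Combine step above can be sketched in plain Python. Illustrative only; a real Hadoop Combiner is written like a Reducer, and the framework decides when (and whether) to invoke it:

```python
# Combiner sketch: pre-aggregate one mapper's local output before the
# shuffle, so fewer key/value pairs cross the network.
mapper_output = [("x", 1), ("y", 1), ("x", 1), ("x", 1)]  # one mapper's pairs

def combine(pairs):
    """Local mini-reduce: merge pairs that share a key."""
    totals = {}
    for k, v in pairs:
        totals[k] = totals.get(k, 0) + v
    return sorted(totals.items())

combined = combine(mapper_output)
print(combined)   # [('x', 3), ('y', 1)] -- 2 pairs shipped instead of 4
```

This only works because summing is associative and commutative, which is why combiners must be usable zero, one, or many times without changing the final result.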
I've always assumed this was necessary as the output from the mapper is the input for the reducer, so it was sorted based on the keyspace and then split into buckets for each reducer input. You want to ensure all the same values of a Key end up in the same bucket going to the reducer so they are reduced together. There is no point sending K1,V2 and K1,V4 to different reducers as they need to be together in order to be reduced.
Tried explaining it as simply as possible
Shuffling is the process by which intermediate data from the mappers is transferred to 0, 1, or more reducers. Each reducer receives 1 or more keys and their associated values, depending on the number of reducers (for a balanced load). Further, the values associated with each key are locally sorted.
Because of its size, a distributed dataset is usually stored in partitions, with each partition holding a group of rows. This also improves parallelism for operations like a map or filter. A shuffle is any operation over a dataset that requires redistributing data across its partitions. Examples include sorting and grouping by key.
A common method for shuffling a large dataset is to split the execution into a map and a reduce phase. The data is then shuffled between the map and reduce tasks. For example, suppose we want to sort a dataset with 4 partitions, where each partition is a group of 4 blocks. The goal is to produce another dataset with 4 partitions, but this time sorted by key.
In a sort operation, for example, each square is a sorted subpartition with keys in a distinct range. Each reduce task then merge-sorts subpartitions of the same shade.
The above diagram shows this process. Initially, the unsorted dataset is grouped by color (blue, purple, green, orange). The goal of the shuffle is to regroup the blocks by shade (light to dark). This regrouping requires an all-to-all communication: each map task (a colored circle) produces one intermediate output (a square) for each shade, and these intermediate outputs are shuffled to their respective reduce task (a gray circle).
The text and image were largely taken from here.
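The range-partition-then-merge pattern described above can be simulated in a few lines of plain Python (a smaller 3-partition example). Illustrative only; the bucket boundaries are hand-picked here, whereas a real job would sample the data to choose them:

```python
# Distributed sort simulation: each "map task" splits its partition into
# range buckets (one per reduce task); each "reduce task" merge-sorts
# the sub-buckets of its range.
import heapq

partitions = [[7, 2, 9], [4, 8, 1], [6, 3, 5]]        # unsorted input
bounds = [3, 6]                                        # 3 key ranges

def bucket_of(x):
    return sum(x > b for b in bounds)                  # 0, 1, or 2

# Map side: each task sorts its own partition and splits it by range.
buckets = [[sorted(p for p in part if bucket_of(p) == i)
            for part in partitions] for i in range(3)]

# Reduce side: each task merge-sorts the sub-buckets for its range.
sorted_parts = [list(heapq.merge(*buckets[i])) for i in range(3)]
print(sorted_parts)   # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The `buckets` step is the all-to-all communication: every map task produces one intermediate output per reduce task, exactly like the squares and shades in the description above.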
There are only two things that MapReduce does NATIVELY: sort and (implemented by sort) scalable GroupBy.
Most applications and design patterns over MapReduce are built on these two operations, which are provided by shuffle and sort.
This is a good read; hope it helps. As for the sorting you are asking about, I think it is for the merge operation in the last step of the map phase. When the map operation is done and the result needs to be written to local disk, a multi-way merge is performed on the spills generated from the buffer. For a merge operation, having each partition sorted in advance is helpful.
Well,
In MapReduce there are two important phases, called Mapper and Reducer. Both are important, but the Mapper is mandatory, while in some programs the Reducer is optional. Now, to your question.
Shuffling and sorting are two important operations in MapReduce. First, the Hadoop framework takes structured/unstructured data and separates it into keys and values.
The Mapper program then separates and arranges the data into keys and values to be processed, generating K2/V2 pairs. These values must be processed and rearranged in the proper order to get the desired solution. This shuffle and sort is done on the local system (the framework takes care of it), and after processing, the framework cleans up the local intermediate data.
We also use a combiner and a partitioner here to optimize the shuffle and sort process. After proper arrangement, the key/value pairs are passed to the Reducer to get the client's desired output. Finally, the Reducer produces the desired output.
K1,V1 -> (Mapper, which we write) -> K2,V2 -> (shuffle and sort, done by the framework) -> K2,list(V2) -> (Reducer) -> K3,V3, the output.
Please note that all these steps are logical operations only; they do not change the original data.
Your question: What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
Short answer: to process the data into the desired output. Shuffling aggregates the data; reduce produces the expected output.
I'm trying to understand the MapReduce model, and I need advice because I'm not sure how the file with the intermediate results of the map function is sorted and partitioned. Most of my knowledge about MapReduce comes from the MapReduce papers of Jeffrey Dean & Sanjay Ghemawat and from Hadoop: The Definitive Guide.
The file with the intermediate results of the map function is composed of small sorted and partitioned files. These small files are divided into partitions corresponding to the reduce workers, and then the small files are merged into one file. I need to know how the partitioning of the small files is done. At first I thought that every partition has some range of keys.
For example: if we have integer keys in the range <1;100> and the file is divided into three partitions, then the first partition could contain values with keys in the range <1;33>, the second partition keys in the range <34;66>, and the third partition keys in <67;100>. The same partitioning would apply to the merged file too.
But I'm not sure about this. Every partition is sent to a corresponding reduce worker. In our example, if we have two reduce workers, the partitions with the first two key ranges (<1;33> and <34;66>) might be sent to the first worker and the last partition to the second worker. But if I'm wrong and the files are divided in another way (i.e. the partitions don't each have their own range of possible keys), then different reduce workers could end up with results for the same keys. So I would need to somehow merge the results of those reduce workers, right? Could I send the results to the master node and merge them there?
In short: I need an explanation of how the files in the map phase are divided (if my description is wrong), and of how and where I can process the results of the reduce workers.
I hope I described my problem enough to understand. I can explain it more, of course.
Thanks a lot for your answers.
There is a Partitioner class that does this. Each key/value pair in the intermediate file is passed to the partitioner along with the total number of reducers (partitions) and the partitioner returns the partition number that should handle that specific key/value pair.
There is a default partitioner that does an OK job of partitioning, but if you want better control or if you have a specially formatted (e.g. complex) key then you can and should write your own partitioner.
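A custom range partitioner like the one described in the question (integer keys 1..100 split into <1;33>, <34;66>, <67;100>) could look like this in spirit. This is a plain-Python sketch for illustration; `range_partition` plays the role of a Hadoop Partitioner's getPartition(), which receives a key/value pair and the number of partitions and returns a partition number:

```python
# Range partitioner sketch for integer keys in 1..100 over 3 partitions.
def range_partition(key):
    """Return the partition index (0, 1, or 2) for an integer key."""
    if key <= 33:
        return 0
    if key <= 66:
        return 1
    return 2

pairs = [(10, "a"), (40, "b"), (90, "c"), (33, "d")]
placed = [(range_partition(k), k, v) for k, v in pairs]
print(placed)   # [(0, 10, 'a'), (1, 40, 'b'), (2, 90, 'c'), (0, 33, 'd')]
```

Because every key maps to exactly one partition, and each partition goes to exactly one reduce worker, no two reduce workers ever receive the same key, so no extra merging on the master is needed.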