Why are group by operations considered expensive in mapreduce?

Can anyone explain to me how jobs are mapped and reduced in Hadoop, and why group by operations are considered expensive?

I wouldn't really say expensive, but it does affect performance: for an order by or sort, the processing needed to order the records is much greater. The work done by the comparator and partitioner becomes huge when millions or billions of records are being sorted.
I hope that answers your question.

Performance in Hadoop is affected by two main factors:
1- Processing: the execution time spent running the map and reduce tasks on the cluster nodes.
2- Communication: shuffling data. Some operations need to send data from one node to another for processing (like a global sort).
A group by affects both factors: during the shuffle, half of the data set might be moved between nodes.
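To make the shuffle cost concrete, here is a minimal sketch of a group-by (sum per key) written against the org.apache.hadoop.mapreduce API. The grouping itself happens in the shuffle: every map output record is partitioned, sorted, and copied to the reducer that owns its key. The tab-separated input layout and the class names are illustrative, and the driver wiring is omitted.

```java
// Minimal sketch of a "group by" in MapReduce. The grouping is done by the shuffle:
// every record emitted by map() must be partitioned, sorted by key, and copied to
// the reducer that owns that key, which is where the cost comes from.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GroupBySum {

  // Emits (groupKey, amount) for every input line of the form "groupKey<TAB>amount".
  public static class GroupMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      ctx.write(new Text(parts[0]), new LongWritable(Long.parseLong(parts[1])));
    }
  }

  // Receives all values for one key after the shuffle and sums them.
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new LongWritable(sum));
    }
  }
}
```

Everything that happens between map() and reduce() here, the partitioning, sorting and copying of intermediate data, is the communication cost described above.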

Related

Hadoop Performance When retrieving Data Only

We know that the performance of Hadoop may be increased by adding more data nodes. My question is: if we want to retrieve the data only, without the need to process or analyze it, will adding more data nodes be useful? Or will it not increase performance at all, because we only have retrieve operations and no computations or MapReduce jobs?
I will try to answer in parts:
If you only retrieve information from a Hadoop cluster or HDFS, it is similar to the cat command in Linux, meaning you are only reading data, not processing it (see the small sketch after this answer).
If you want calculations like SUM, AVG or any other aggregate function on top of your data, then the concept of REDUCE, and hence MapReduce, comes into the picture.
So Hadoop is useful, or worth it, when your data is huge and you also do calculations. I think there is no performance benefit in reading a small amount of data from HDFS compared to reading a large amount of data from HDFS (just think of storing your data in an RDBMS and only running select * statements on a daily basis), but when your data grows exponentially and you want to do calculations, your RDBMS query would take a long time to execute.
For MapReduce to work efficiently on huge data sets, you need a good number of nodes and computing power, depending on your use case.
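For reference, here is a minimal sketch of what retrieval-only access looks like programmatically: the file is simply opened and streamed out of HDFS, roughly what hdfs dfs -cat does, with no map or reduce task involved. The path is illustrative.

```java
// Retrieval-only access to HDFS: open the file and stream it out.
// No MapReduce job runs; this is pure reading, no computation.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);                       // default FS from fs.defaultFS
    try (FSDataInputStream in = fs.open(new Path("/data/events.log"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);           // just copy bytes to stdout
    }
  }
}
```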

joins vs distributed cache in hadoop

What is the difference between joins and the distributed cache in Hadoop? I am really confused about map-side join and reduce-side join and how they work, and how the distributed cache is different when processing data in a MapReduce job. Please share an example.
Regards,
Ravi
Let's say you have 2 files of data with the following records:
word -> frequency
Same words can be present in both files.
Your task is to merge these files, compute total frequency for each term, and produce the aggregated file.
Map-side joins.
Useful when the data on both sides of the join is already pre-sorted by key. In this case, it is a simple merge of two streams with linear complexity (a plain-Java sketch of such a merge is shown below). In our example, the word-frequency data has to be pre-sorted alphabetically by word in both files.
Pros: works with virtually unlimited input data (does not have to fit in memory).
Does not require a reducer, so it is very efficient.
Cons: requires your input data to be pre-sorted (for example, as the result of a previous map/reduce job).
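For illustration, here is a minimal plain-Java sketch of merging two pre-sorted word-frequency files in one linear pass. The file names are assumptions, and inside Hadoop this merging would be driven by the join input format over the map tasks' splits rather than by a standalone program.

```java
// Merge two files that are both sorted alphabetically by word, summing the
// frequencies of words that appear in both. One linear pass, no reducer needed.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SortedMerge {
  public static void main(String[] args) throws IOException {
    try (BufferedReader a = new BufferedReader(new FileReader("counts-a.txt"));
         BufferedReader b = new BufferedReader(new FileReader("counts-b.txt"))) {
      String lineA = a.readLine();
      String lineB = b.readLine();
      while (lineA != null && lineB != null) {
        String[] pa = lineA.split("\t");           // word<TAB>frequency
        String[] pb = lineB.split("\t");
        int cmp = pa[0].compareTo(pb[0]);
        if (cmp == 0) {                            // same word in both files: sum frequencies
          System.out.println(pa[0] + "\t" + (Long.parseLong(pa[1]) + Long.parseLong(pb[1])));
          lineA = a.readLine();
          lineB = b.readLine();
        } else if (cmp < 0) {                      // word only present in file A
          System.out.println(lineA);
          lineA = a.readLine();
        } else {                                   // word only present in file B
          System.out.println(lineB);
          lineB = b.readLine();
        }
      }
      // drain whichever stream still has records left
      for (; lineA != null; lineA = a.readLine()) System.out.println(lineA);
      for (; lineB != null; lineB = b.readLine()) System.out.println(lineB);
    }
  }
}
```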
Reduce-side joins.
Useful when the files are not sorted yet and they are too large to fit in memory, so you have to merge them using a distributed sort with reducer(s).
Pros: works with virtually unlimited input data (does not have to fit in memory).
Cons: requires a reduce phase.
Distributed cache.
Useful when the input word-frequency files are NOT sorted, and one of the two files is small enough to fit in memory. In this case you can ship it to every node as a distributed cache and load it into memory as a hash table Map<String, Integer>. Each mapper then streams the largest input file as key-value pairs and looks up the values of the smaller file in the hash map.
Pros: efficient, linear complexity based on the size of the largest input set. Does not require a reducer.
Cons: requires one of the inputs to fit in memory.
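Here is a hedged sketch of the distributed cache variant: the small file is shipped to each mapper, loaded into a HashMap in setup(), and the large file is streamed through map(). The file names, the tab-separated layout, and the driver call mentioned in the comment are assumptions.

```java
// Distributed-cache join: the small word->frequency file is loaded into memory
// once per mapper, and the big file is streamed record by record through map().
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheJoinMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private final Map<String, Integer> smallTable = new HashMap<>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Assumes the driver registered the small file with
    // job.addCacheFile(new URI("hdfs:///data/small-counts.txt#small-counts.txt")),
    // so it is available under that symlink name in the task's working directory.
    try (BufferedReader reader = new BufferedReader(new FileReader("small-counts.txt"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t");                // word<TAB>frequency
        smallTable.put(parts[0], Integer.parseInt(parts[1]));
      }
    }
  }

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t");          // record from the big file
    int total = Integer.parseInt(parts[1]) + smallTable.getOrDefault(parts[0], 0);
    ctx.write(new Text(parts[0]), new LongWritable(total)); // merged total frequency
  }
}
```

The complexity stays linear in the size of the large file, which is why this works well whenever one side of the join fits in memory.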

Where to add specific function in the Map/Reduce Framework

I have a general question about the Map/Reduce framework.
I have a task, which can be separated into several partitions. For each partition, I need to run a computation intensive algorithm.
Then, according to the MAP/Reduce Framework, it seems that I have two choices:
Run the algorithm in the Map stage, so that in the Reduce stage there is no work to be done except collecting the results of each partition from the Map stage and summarizing them.
In the Map stage, just divide and send the partitions (with data) to the Reduce stage. In the Reduce stage, run the algorithm first, and then collect and summarize the results from each partition.
Correct me if I misunderstand.
I am a beginner; I may not understand Map/Reduce very well. I only have basic parallel computing concepts.
You're actually really confused. In a broad and general sense, the map portion takes the task and divides it among some n nodes. Each of those n nodes receives a fraction of the whole task and does something with its piece. When they have finished computing their steps on their data, the reduce operation reassembles the results.
The REAL power of map-reduce is how scalable it is.
Given a dataset D running on a map-reduce cluster m with n nodes under it, each node is mapped a 1/n share of the task. The cluster m with its n nodes then reduces those pieces back into a single result. Now, take one node q that is itself a cluster with p nodes under it. If m assigns q its share, q can in turn map that share across its p nodes, those nodes reduce their pieces back to q, and q supplies its result back to m alongside its neighbors.
Make sense?
In MapReduce, you have a Mapper and a Reducer. You also have a Partitioner and a Combiner.
Hadoop has a distributed file system (HDFS) that partitions (or splits, you might say) a file into blocks of BLOCK SIZE. These partitioned blocks are placed on different nodes. So, when a job is submitted to the MapReduce framework, it divides that job such that there is a Mapper for every input split (for now, let's say a split is the partitioned block). Since these blocks are distributed onto different nodes, the Mappers also run on different nodes.
In the Map stage,
The file is divided into records by the RecordReader; what counts as a record is controlled by the InputFormat we choose. Every record is a key-value pair.
The map() of our Mapper is run for every such record. The output of this step is again key-value pairs.
The output of our Mapper is partitioned using the Partitioner we provide, or the default HashPartitioner. Here, by partitioning, I mean deciding which key and its corresponding values go to which Reducer (if there is only one Reducer, it is of no use anyway).
Optionally, you can also combine/minimize the output that is being sent to the reducer. You can use a Combiner to do that. Note that the framework does not guarantee how many times a Combiner will be called; it is only an optimization.
This is where your algorithm on the data is usually written. Since these tasks run in parallel, the Map stage is a good candidate for computation-intensive work.
After all the Mappers finish running on all nodes, the intermediate data (the data at the end of the Map stage) is copied to its corresponding reducers.
In the Reduce stage, the reduce() of our Reducer is run on each record of data from the Mappers. Here a record comprises a key and its corresponding values, not necessarily just one value. This is where you generally run your summarization/aggregation logic.
When you write your MapReduce job you usually think about what can be done on each record of data in both the Mapper and Reducer. A MapReduce program can just contain a Mapper with map() implemented and a Reducer with reduce() implemented. This way you can focus more on what you want to do with the data and not bother about parallelizing. You don't have to worry about how the job is split, the framework does that for you. However, you will have to learn about it sooner or later.
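As a concrete (hedged) illustration of how those pieces are wired together, here is a self-contained word-count job. The input/output paths are illustrative, and the explicit HashPartitioner line only spells out the default.

```java
// A Mapper, Combiner, Partitioner and Reducer wired up in one word-count job.
// The framework splits the input, runs one map task per split, shuffles by key,
// and runs reduce() once per key.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCount {

  // map() runs once per record produced by the RecordReader; this is where
  // the per-record, computation-heavy work belongs.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        ctx.write(new Text(tokens.nextToken()), ONE);
      }
    }
  }

  // reduce() runs once per key with all of that key's values; this is where
  // the summarization/aggregation logic belongs.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);        // optional; may run zero or more times
    job.setPartitionerClass(HashPartitioner.class);   // the default; decides which reducer gets a key
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/input"));     // one map task per input split
    FileOutputFormat.setOutputPath(job, new Path("/output"));  // one output file per reduce task
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```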
I would suggest going through Apache's MapReduce tutorial or Yahoo's Hadoop tutorial for a good overview. I personally like Yahoo's explanation of Hadoop, but Apache's details are good, and their explanation using the word count program is very nice and intuitive.
Also, for
I have a task, which can be separated into several partitions. For each partition, I need to run a computation-intensive algorithm.
The Hadoop distributed file system has the data split onto multiple nodes, and the MapReduce framework assigns a task to every split. So, in Hadoop, the process goes and executes where the data resides. You cannot define the number of map tasks to run; the data does. You can, however, specify/control the number of reduce tasks.
I hope I have comprehensively answered your question.

How to know in advance the number of records for each reducer in hadoop map reduce \ Ignore partitions based on size

I have a job that splits data into groups, and I need to keep only the large partitions (above a certain threshold).
Is there a method to do this?
One solution is to iterate over all the items and store them in memory, flushing them only once they reach a certain size (a sketch of this appears below).
However, this solution may require a very large amount of memory.
I don't think there is a general straightforward solution (apart from storing until the size is reached). Maybe if you provide more details this will give us more inspiration?
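For reference, here is a hedged sketch of the buffering idea from the question: a reducer that holds one group's values in memory and emits the group only if it crosses a size threshold. The threshold, the value types, and the assumption that a single group fits in memory are all illustrative.

```java
// Keep only "large" groups: buffer the values of each key and emit them only
// if the group size is at or above a threshold; smaller groups are dropped.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LargeGroupsOnlyReducer extends Reducer<Text, Text, Text, Text> {

  private static final int MIN_GROUP_SIZE = 1000;  // illustrative threshold

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    List<Text> buffer = new ArrayList<>();
    for (Text value : values) {
      buffer.add(new Text(value));   // copy: Hadoop reuses the value object between iterations
    }
    if (buffer.size() >= MIN_GROUP_SIZE) {
      for (Text value : buffer) {
        ctx.write(key, value);       // only large groups are written out
      }
    }
  }
}
```

The memory cost is bounded by the largest single group, which is exactly the limitation the question points out.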

Map side join in Hadoop loses advantage data locality?

My question is related to the map-side join in Hadoop.
I was reading Pro Hadoop the other day and I did not understand the following sentence:
"The map-side join provides a framework for performing operations on multiple sorted
datasets. Although the individual map tasks in a join lose much of the advantage of data locality,
the overall job gains due to the potential for the elimination of the reduce phase and/or the
great reduction in the amount of data required for the reduce."
How can it lose the advantage of data locality if the sorted data sets are stored on HDFS? Won't the JobTracker in Hadoop run the task on a TaskTracker on the same node where the data set's block is located?
Please correct my understanding.
The statement is correct. You do not lose all data locality, only part of it. Let's see how it works:
We usually distinguish the smaller and the bigger side of the join.
The smaller partitions of the join are shipped to the nodes where the corresponding bigger partitions are stored. As a result, we lose data locality for one of the joined datasets.
I don't know what David means, but to me this is because you only have a map phase: you just go there and finish the job by bringing the different tables together, without any data locality gains from HDFS.
This is the process followed in a map-side join:
Suppose we have two datasets R and S, where R is large and S is small enough to fit into main memory.
The smaller dataset S is loaded into main memory iteratively to match its pairs against R.
In this case, we achieve data locality for R but not for S.
