What is the difference between joins and the distributed cache in Hadoop? I am really confused about map-side joins and reduce-side joins and how they work. How is the distributed cache different when processing data in a MapReduce job? Please share an example.
Regards,
Ravi
Let's say you have 2 files of data with the following records:
word -> frequency
Same words can be present in both files.
Your task is to merge these files, compute total frequency for each term, and produce the aggregated file.
Map side joins.
Useful when the data on both sides of the join is already sorted by key. In this case, the join is a simple merge of two streams with linear complexity. In our example, the word-frequency data has to be pre-sorted alphabetically by word in both files.
Pros: works with virtually unlimited input data (does not have to fit in memory).
Does not require a reducer, thus it is very efficient.
Cons: requires your input data to be pre-sorted (for example, as a result of a previous map/reduce job)
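To make the "simple merge of two streams" concrete, here is a minimal Python sketch of the idea, not Hadoop code. It assumes each input is already sorted by word and that a word appears at most once per file; the function name merge_sorted_counts is made up for this illustration.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, int]

def merge_sorted_counts(left: Iterable[Record], right: Iterable[Record]) -> Iterator[Record]:
    """Merge two word-sorted (word, frequency) streams, summing matching words."""
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    # Advance whichever stream holds the smaller word; only one record
    # per stream is held in memory at any time.
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield (l[0], l[1] + r[1])
            l, r = next(left_it, None), next(right_it, None)
        elif l[0] < r[0]:
            yield l
            l = next(left_it, None)
        else:
            yield r
            r = next(right_it, None)
    # Drain whatever remains in either stream.
    while l is not None:
        yield l
        l = next(left_it, None)
    while r is not None:
        yield r
        r = next(right_it, None)
```

Because each stream is read once from front to back, the cost is linear in the total number of records, which is the property the answer relies on.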
Reduce joins.
Useful when the files are not sorted yet and they are too large to fit in memory, so you have to merge them using a distributed sort with one or more reducers.
Pros: works with virtually unlimited input data (does not have to fit in memory).
Cons: requires reduce phase
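As a rough single-process illustration of the reduce-side approach (a sketch only, with the framework's distributed sort replaced by Python's sorted and all function names invented for this example):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(files):
    """Emit every (word, frequency) record from every input file unchanged."""
    for records in files:
        for word, freq in records:
            yield (word, freq)

def shuffle(pairs):
    """Stand-in for the framework's distributed sort: order pairs by key."""
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(sorted_pairs):
    """Sum the frequencies of each run of identical words."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(freq for _, freq in group))

def reduce_side_join(files):
    return list(reduce_phase(shuffle(map_phase(files))))
```

For example, reduce_side_join([[("dog", 4), ("apple", 3)], [("cat", 1), ("apple", 2)]]) groups the two "apple" records together after the sort, so the reducer can add them up without the inputs having been sorted beforehand.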
Distributed cache.
Useful when our input word-frequency files are NOT sorted, and one of the two files is small enough to fit in memory. In this case you can ship it through the distributed cache and load it in memory as a hash table Map<String, Integer>. Each mapper will then stream the larger input file as key-value pairs and look up the values of the smaller file in the hash map.
Pros: Efficient, linear complexity based on the size of the larger input. Does not require a reducer.
Cons: Requires one of the inputs to fit in memory.
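The lookup pattern can be sketched in plain Python as below. This is only an illustration of the idea, not the Hadoop DistributedCache API; load_cache and map_with_cache are hypothetical names. Note that this sketch only emits words that occur in the large file, so words present only in the small file would need separate handling.

```python
def load_cache(small_records):
    """Load the small file into an in-memory hash table (word -> frequency)."""
    return dict(small_records)

def map_with_cache(large_records, cache):
    """Stream the large file and add the cached frequency, if any, to each record."""
    for word, freq in large_records:
        yield (word, freq + cache.get(word, 0))
```

The large input is never sorted or held in memory; each record costs one hash lookup, which is where the linear complexity comes from.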
There is something that is not clear to me about the overall functioning of a MapReduce programming environment.
Consider 1k random unsorted words in the form (word, 1) coming out of one or more mappers. Suppose I want the reducer to save them all in a single huge sorted file. How does that work? Does the reducer itself sort all the words automatically? What should the reduce function do? What if I have just one reducer with limited RAM and disk?
When the reducer gets the data, the data has already been sorted on the map side.
The process is like this:
Map side:
1. Each input split is processed by a map task, and the map output is temporarily placed in a circular memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default, when it is 80% full), a spill file is created in the local file system.
2. Before writing to disk, the thread first divides the data into as many partitions as there are reduce tasks, so that each reduce task corresponds to one partition. The partitioning aims to avoid some reduce tasks being assigned large amounts of data while others get none. The data within each partition is sorted. If a Combiner is set, it is run on the sorted result at this point.
3. By the time the map task outputs its last record, there may be many spill files, and these need to be merged. Sorting and combining are performed repeatedly during the merge for two purposes: to minimize the amount of data written to disk each time, and to minimize the amount of data transferred over the network during the subsequent copy phase. The result is a single partitioned and sorted file. To further reduce the amount of data sent over the network, you can compress the map output by setting mapred.compress.map.output to true.
4. The data of each partition is copied to the corresponding reduce task.
Reduce side:
1. The reduce task receives data from different map tasks, and the data sent from each map is sorted. If the amount of data received on the reduce side is small, it is kept in memory. If it exceeds a certain proportion of the buffer size, the data is merged and written to disk.
2. As the number of spill files increases, a background thread merges them into larger sorted files. In fact, on both the map side and the reduce side, MapReduce repeatedly performs sorting and merging.
3. The merge process generates many intermediate files (written to disk), but MapReduce keeps the data written to disk as small as possible, and the result of the last merge is not written to disk but fed directly to the reduce function.
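The spill-then-merge pattern described in these steps can be sketched in a few lines of Python. This is a simplified model, not Hadoop's implementation: in-memory lists stand in for spill files, buffer_limit is a record count rather than a byte size, and the name sort_with_spills is invented for this example.

```python
import heapq

def sort_with_spills(records, buffer_limit):
    """Sort a stream using a bounded buffer, sorted runs, and a final merge."""
    runs = []    # each run stands in for one sorted spill file on disk
    buffer = []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= buffer_limit:
            runs.append(sorted(buffer))  # sort the buffer, then "spill" it
            buffer = []
    if buffer:
        runs.append(sorted(buffer))
    # heapq.merge lazily merges the sorted runs, so the final sorted stream
    # can be handed to the consumer without being materialized first.
    return heapq.merge(*runs)
```

With a buffer limit of 2, the input d, a, c, b, e produces the runs [a, d], [b, c] and [e], and the merge yields a, b, c, d, e one record at a time, much like the last merge feeding the reduce function directly.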
I would like to know whether the order of the data records matters (performance-wise) when joining two tables.
P.S. I am not using any map-side join or bucket join.
Thank you!
On the one hand, order should not matter: during a shuffle join the files are read by mappers in parallel, a file may be split between several mappers or, vice versa, one mapper may read several files, and the mappers' output is then passed to each reducer. Even if the data was sorted, it is not read and distributed in its original order because of this parallelism.
On the other hand, sorting improves compression, depending on the data entropy, because similar data compresses better. Files that are sorted and compressed are therefore smaller and are read faster during join query execution. This may improve join speed because mappers read the data faster, and the internal indexes in ORC work efficiently if the data was sorted by the filter columns during load and PPD (predicate pushdown) is enabled. The size of a sorted and compressed file can be reduced by a factor of 3 or even more, which in turn results in about 3 times fewer mappers.
Sorting is efficient when you are writing and sorting once and reading many times.
What if the output is so big that it does not fit into the reducer's RAM?
For example, consider a sorting task. In this case the output is as big as the input. If you use a single reducer, the data does not all fit into RAM. How does the sorting take place then?
I think I have got the answer.
Yes, it is possible to perform such a task in a single reducer, even if the data is bigger than the reducer's memory. In the shuffle phase, the reducer copies the data from the mappers into its memory and sorts it until the memory fills up. When it spills, that part of the data is stored on the reducer's local disk, and the reducer starts to receive new values. When it spills again, it merges the new data with the previously stored file. The merged file stays sorted (probably using an external merge sort). Once the shuffle is done, the intermediate key-value pairs are stored in sorted order, and the reduce task is then performed on that data. Because the data is sorted, it is easy to do the aggregation in memory by loading one chunk of the data at a time.
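To show why sorted input makes the aggregation cheap in memory, here is a small Python sketch (my own illustration, not Hadoop code; aggregate_sorted is a made-up name). It assumes the pairs arrive sorted by key and keeps only the current key and its running total in memory.

```python
def aggregate_sorted(pairs):
    """Sum values per key over a key-sorted stream, one key at a time."""
    current_key = None
    total = 0
    for key, value in pairs:
        if key != current_key:
            # A new key starts: the previous key is complete, so emit it.
            if current_key is not None:
                yield (current_key, total)
            current_key, total = key, 0
        total += value
    if current_key is not None:
        yield (current_key, total)
```

Since all records of a key are adjacent in sorted data, a key can be emitted as soon as the next key appears, regardless of how large the whole input is.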
When joining datasets, you have the option to tell Pig that the keys might be skewed, as in the statement below.
... JOIN data1 BY my-join-key USING 'skewed' …
Pig will get an estimate of the my-join-key values to see if some values occur with much higher frequency than others. There is some overhead cost for doing this (10% or so, but it depends on many factors).
How exactly is this information used in the map/reduce jobs? If there is skew, will Pig try to partition the keys so that they are more balanced across reducers?
In this scenario, will Pig replicate the smaller dataset across the mapper tasks, or will it just use more reducers?
As per the documentation:
Skewed join does not place a restriction on the size of the input
keys. It accomplishes this by splitting the left input on the join
predicate and streaming the right input. The left input is sampled to
create the histogram.
Skewed join can be used when the underlying data is sufficiently
skewed and you need a finer control over the allocation of reducers to
counteract the skew. It should also be used when the data associated
with a given key is too large to fit in memory.
Pig spawns a mapper which parses the data and observes the key distribution, based on which reducer key allocation is made.
Pig makes no attempt to replicate the smaller dataset to the mappers (I think you mean a replicated join here). The right side of the join is streamed to the reducer splits based upon the skew in the left side of the join.
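As a rough illustration of the idea (this is not Pig's actual code; the function names, the max_per_reducer threshold and the round-robin routing are all assumptions made for the sketch), a histogram of sampled left-side keys can decide how many reducer slots a key is spread over, with right-side records sent to every slot of their key:

```python
from collections import Counter
from itertools import count

def plan_reducers(sampled_left_keys, max_per_reducer):
    """Map each sampled key to the number of reducer slots it should be split over."""
    histogram = Counter(sampled_left_keys)
    # Ceiling division: a key seen 5 times with a limit of 2 gets 3 slots.
    return {key: -(-n // max_per_reducer) for key, n in histogram.items()}

def route_left(key, plan, counters):
    """Left records of a key rotate over that key's slots (round-robin)."""
    slots = plan.get(key, 1)
    return next(counters.setdefault(key, count())) % slots

def route_right(key, plan):
    """Right records go to every slot that may hold matching left records."""
    return list(range(plan.get(key, 1)))
```

In this sketch the slot numbers are per key rather than global reducer IDs. The point is only that the left input is split according to the histogram, while the matching right records are streamed to each of the resulting splits.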
My question is related to map-side joins in Hadoop.
I was reading Pro Hadoop the other day and did not understand the following sentence:
"The map-side join provides a framework for performing operations on multiple sorted
datasets. Although the individual map tasks in a join lose much of the advantage of data locality,
the overall job gains due to the potential for the elimination of the reduce phase and/or the
great reduction in the amount of data required for the reduce."
How can it lose the advantage of data locality if the sorted datasets are stored on HDFS? Won't the job tracker in Hadoop run the task on the same node where the dataset's block is located?
Please correct my understanding.
The statement is correct. You do not lose all data locality, only part of it. Let's see how it works:
We usually distinguish between the smaller and the bigger side of the join.
The smaller partitions of the join are distributed to the places where the corresponding bigger partitions are stored. As a result, we lose data locality for one of the joined datasets.
I don't know what David means, but to me this is because you have only a map phase: you just go there and finish the job by bringing the different tables together, without any gain from HDFS?
This is the process followed in a map-side join:
Suppose we have two datasets, R and S, where R is large and S is small, and assume that both of them fit into main memory.
The smaller dataset is loaded into main memory iteratively to match its records with R.
In this case, we achieve data locality for R but not for S.