Join vs COGROUP in PIG - hadoop

Are there any advantages (w.r.t. performance / number of MapReduce jobs) when I use COGROUP instead of JOIN in Pig?
http://developer.yahoo.com/hadoop/tutorial/module6.html talks about the difference in the type of output they produce. But, ignoring the "output schema", is there any significant difference in performance?

There are no major performance differences. The reason I say this is that both end up as a single MapReduce job that sends the same data forward to the reducers: both need to send all of the records forward with the foreign key as the key. If anything, COGROUP might be a bit faster because it does not compute the cross product across the matching records; it keeps them in separate bags.
If one of your data sets is small, you can use a join option called "replicated join". This will distribute the second data set across all map tasks and load it into main memory. That way, the entire join can be done in the mapper and no reducer is needed. In my experience, this is well worth it, because the bottleneck in joins and cogroups is shuffling the entire data set to the reducers. You can't do this with COGROUP, to my knowledge.
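A minimal Pig Latin sketch of the three variants, assuming two made-up relations (orders and customers) loaded from made-up paths:

orders = LOAD 'orders' AS (order_id:int, cust_id:int, amount:double);
custs  = LOAD 'customers' AS (cust_id:int, name:chararray);

-- Regular join: one MapReduce job, cross product of matches per key.
j  = JOIN orders BY cust_id, custs BY cust_id;

-- COGROUP: same shuffle, but the matches stay in separate bags per key.
cg = COGROUP orders BY cust_id, custs BY cust_id;

-- Replicated join: custs must fit in memory; the join happens map-side, no reducer.
rj = JOIN orders BY cust_id, custs BY cust_id USING 'replicated';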

Related

In Hive, is partitioning faster or bucketing faster?

This is an interview question I faced: if we have 1 TB of data in HDFS, which method in Hive gives us faster performance, i.e. partitioning or bucketing?
I told them that depending on the data we choose either partitioning or bucketing, but the interviewer wasn't satisfied with my answer.
What would be the proper answer (along with an example)?
Your answer is correct: it really depends on the data and what exactly you want to do with it.
Partitioning is used for distributing load horizontally in a logical fashion. It optimizes performance, but sometimes it can lead to partitions that contain very little data. That results in bad performance, because MapReduce works better on a few big files than on many small files.
Here, bucketing can help, because bucketing guarantees that all the data for a given value of the bucketing column stays together. E.g., if we bucket the employee table and use emp_id as the bucketing column, the value of this column is hashed into a user-defined number of buckets (which should be chosen with the number of records in mind). Records with the same emp_id will always be stored in the same bucket. At the same time, one bucket may hold many emp_id values, giving a better-sized chunk of data for MapReduce processing. Bucketing is especially helpful if you want to perform a map-side join.
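A hedged HiveQL sketch of such a bucketed table (the column list and the choice of 32 buckets are assumptions for illustration):

CREATE TABLE employee (
  emp_id BIGINT,
  name   STRING,
  dept   STRING
)
CLUSTERED BY (emp_id) INTO 32 BUCKETS
STORED AS ORC;

-- Older Hive versions need this so that inserts actually honour the declared bucket count:
SET hive.enforce.bucketing = true;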
Your answer is correct.
Hive partitioning is an effective method to improve query performance on larger tables. Partitioning allows you to store data in separate sub-directories under the table location. It greatly helps queries that filter on the partition key(s).
Bucketing improves join performance when the bucket key and the join keys match. Bucketing in Hive distributes the data into different buckets based on the hash of the bucket key. It also reduces the I/O scans during the join if the join happens on the same keys (columns).
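For illustration, a sketch combining the two (table name, columns, and bucket count are assumptions, not from the question):

CREATE TABLE sales (
  order_id BIGINT,
  cust_id  BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (sale_date STRING)        -- pruned when queries filter on sale_date
CLUSTERED BY (cust_id) INTO 32 BUCKETS   -- usable for bucketed map-side joins on cust_id
STORED AS ORC;

-- Lets Hive exploit the bucket layout when both join sides are bucketed on the join key:
SET hive.optimize.bucketmapjoin = true;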

How does GreenPlum handle multiple large joins and simultaneous workloads?

Our product is extracts from our database; they can be as large as 300 GB+ in file format. To produce them we join multiple large tables (close to 1 TB in size in some cases). We do not aggregate data at all; these are pure extracts. How does Greenplum handle this kind of large data set? (The join keys are 3+ column keys and not every table has the same keys to join with; the only common key is the first key, and if the data were distributed by that there would be a lot of skew, since the data itself is not balanced.)
You should use writable external tables for these kinds of large data extracts, because they can leverage gpfdist and write data out in parallel. It will be very fast.
https://gpdb.docs.pivotal.io/510/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
Also, your use case doesn't really indicate skew. Skew would be either storing the data distributed by a poor column choice like gender_code, or processing skew where you filter by a column or columns for which only a few segments have the data.
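A hedged sketch of what that could look like (the host, port, file name, and table/column names are assumptions):

CREATE WRITABLE EXTERNAL TABLE extract_out (LIKE big_fact)
LOCATION ('gpfdist://etl-host:8081/extract1.out')
FORMAT 'CSV' (DELIMITER ',')
DISTRIBUTED RANDOMLY;

-- The join runs in parallel on the segments, and each segment streams its
-- output straight to gpfdist, bypassing the master.
INSERT INTO extract_out
SELECT f.*, d.attr1
FROM   big_fact f
JOIN   dim_table d ON (f.k1 = d.k1 AND f.k2 = d.k2 AND f.k3 = d.k3);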
In general, Greenplum Database handles this kind of load just fine. The query is executed in parallel on the segments.
Your bottleneck is likely the final export from the database: if you use SQL (or COPY), everything has to go through the master to the client, which takes time and is slow.
As Jon pointed out, consider using an external table and write out the data as it comes out of the query. Also avoid any kind of sort operation in your query if possible; it is unnecessary anyway, because the data ends up unordered in the external table files.

Hive: Does the order of the data records matter for joining tables?

I would like to know if the order of the data records matters (performance-wise) when joining two tables.
P.S. I am not using any map-side join or bucket join.
Thank you!
On the one hand, order should not matter, because during a shuffle join the files are read by mappers in parallel; a file may also be split among several mappers, or vice versa, one mapper can read several files, and the mapper output is then passed to the reducers. So even if the data was sorted, it is read and distributed out of its original order due to parallelism.
On the other hand, sorting improves compression, depending on the data entropy: similar data compresses better. Therefore sorted, compressed files are smaller and will be read faster during join query execution. This can improve join speed because the mappers read the data faster, and the internal indexes in ORC work efficiently if the data was sorted by the filter columns during load and PPD (predicate push-down) is enabled. A sorted and compressed file can shrink by 3x or even more, which results in 3x fewer mappers.
Sorting is efficient when you write and sort once and read many times.
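A sketch of the load-time sorting idea in HiveQL (table and column names are assumptions; SORT BY orders the data within each reducer, which is usually enough for better compression and tighter ORC indexes):

SET hive.optimize.ppd = true;                 -- enable predicate push-down into ORC

CREATE TABLE events_orc (
  user_id    BIGINT,
  event_type STRING,
  ts         TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

INSERT OVERWRITE TABLE events_orc
SELECT user_id, event_type, ts
FROM   events_staging
SORT BY user_id;    -- similar values end up adjacent: smaller files, useful ORC min/max indexes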

How will Pig handle Skewed Joins?

When joining datasets, you have the option to tell Pig that the keys might be skewed, as in the statement below.
... JOIN data1 BY my-join-key USING 'skewed' ...
Pig will estimate the distribution of my-join-key values to see if some values occur with much higher frequency than others. There is some overhead cost for doing this (10% or so, but it depends on many factors).
How exactly is this information used in the MapReduce jobs? If there is skew, will Pig try to partition the keys so that they are more balanced across reducers?
In this scenario, will Pig replicate the smaller dataset across map tasks, or will it just use more reducers?
As per the documentation:
Skewed join does not place a restriction on the size of the input
keys. It accomplishes this by splitting the left input on the join
predicate and streaming the right input. The left input is sampled to
create the histogram.
Skewed join can be used when the underlying data is sufficiently
skewed and you need a finer control over the allocation of reducers to
counteract the skew. It should also be used when the data associated
with a given key is too large to fit in memory.
Pig spawns a sampling job that parses the data and observes the key distribution; the reducer key allocation is made based on that.
Pig makes no attempt to replicate the smaller dataset to the mappers (I think you mean a replicated join here). The right side of the join is streamed to the reducer splits, based upon the skew in the left side of the join.
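A hedged Pig Latin sketch (relation names, schemas, and the reducer count are placeholders taken loosely from the question):

big   = LOAD 'data1' AS (my_join_key:chararray, payload:chararray);
other = LOAD 'data2' AS (my_join_key:chararray, attr:chararray);

-- The left input (big) is sampled to build the key histogram; hot keys are split
-- across several reducers, and the matching right-side rows are streamed to each split.
j = JOIN big BY my_join_key, other BY my_join_key USING 'skewed' PARALLEL 20;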

Map-side join in Hadoop loses the advantage of data locality?

My question is related to map-side join in Hadoop.
I was reading Pro Hadoop the other day and did not understand the following sentence:
"The map-side join provides a framework for performing operations on multiple sorted
datasets. Although the individual map tasks in a join lose much of the advantage of data locality,
the overall job gains due to the potential for the elimination of the reduce phase and/or the
great reduction in the amount of data required for the reduce."
How can it lose the advantage of data locality if the sorted data sets are stored on HDFS? Won't the JobTracker in Hadoop run the task on a TaskTracker on the same node where the data set's block is located?
Please correct my understanding.
The statement is correct. You do not lose all data locality, only part of it. Let's see how it works:
We usually distinguish the smaller and the bigger side of the join.
Partitions of the smaller side are shipped to the places where the corresponding partitions of the bigger side are stored. As a result, we lose data locality for one of the joined datasets.
I don't know what David means, but to me it is because you only have a map phase: you just bring the different tables together there and finish the job, without the usual locality gains from HDFS.
This is the process followed in a map-side join:
Suppose we have two datasets R and S, where R is large and S is small enough to fit into main memory.
The smaller dataset S is loaded into main memory, and R's records are matched against it as they are read.
In this case, we achieve data locality for R but not for S.
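For illustration, a hedged HiveQL sketch of the same idea (table names and the size threshold are assumptions): the small table is hash-loaded into every mapper's memory and the big table is joined map-side, with no reduce phase.

SET hive.auto.convert.join = true;               -- let Hive convert to a map join for small tables
SET hive.mapjoin.smalltable.filesize = 25000000; -- "small" threshold in bytes

SELECT /*+ MAPJOIN(s) */ r.id, r.value, s.label
FROM   big_table r
JOIN   small_table s ON (r.id = s.id);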

Resources