Pig Inner Join produces a Job with 1 hanging reducer - hadoop

I have a Pig script with an inner join of two different data sources. This join happens to be the first MapReduce-causing operation, with the only operations beforehand being filters and foreachs. When the join executes, everything goes through the map phase quickly, but in the reduce phase all the reducers except one finish fast. That one just sits at the reduce part of the phase chugging over data at a very slow pace, to the point that it can take an hour or more just waiting on that one reducer to complete. I have tried increasing the number of reducers as well as switching to a skewed join, but nothing seems to help.
Any ideas for things to look into?
I also ran an explain to see if I could spot anything, but that just shows a simple single-job flow with nothing remarkable.

Likely what is happening is that a single key has a huge number of instances on both sides and it's exploding.
For example, if you join:
left:    right:
x,4      x,'f'
x,5      x,'g'
x,6      x,'h'
y,7      x,'i'
... you will get 12 pairs with key x! So you can imagine that if you have 1,000 rows with one key in one data set and 2,000 rows with the same key in the other, you will get 2 million pairs from that one key alone. The single reducer unfortunately has to take the brunt of this explosion.
Adding reducers or using a skew join isn't going to help here, because at the end of the day, a single reducer needs to handle this one big explosion of pairs.
Here are a few things to check:
It sounds like only a single join key is causing this issue since only one reducer is getting hammered. The common culprit is NULL. Can the column in either of these be NULL? If so, it'll get a huge explosion! Try filtering out NULL on the foreign key of both relations before running through the join and see if there is a difference. Or, instead of NULL... perhaps you have some sort of default value or a single value that shows up a lot.
Try to figure out how many of each key there actually are, and figure out what the explosion will look like. Something like (warning: I'm not actually testing this code, hopefully it works):
A1 = LOAD ... -- load dataset 1
B1 = GROUP A1 BY fkey1;
C1 = FOREACH B1 GENERATE group as fkey1, COUNT_STAR(A1) as cnt1; -- alias group so we can join on it
A2 = LOAD ... -- load dataset 2
B2 = GROUP A2 BY fkey2;
C2 = FOREACH B2 GENERATE group as fkey2, COUNT_STAR(A2) as cnt2;
D = JOIN C1 BY fkey1, C2 BY fkey2; -- join the per-key counts
E = FOREACH D GENERATE C1::fkey1, (cnt1 * cnt2) as cnt; -- multiply out the counts
F = ORDER E BY cnt DESC; -- biggest explosion first
STORE F INTO ...
Similarly, it may have nothing to do with an explosion. One of your relations might just have a single key a ton of times. For example, in the word count example, the reducer that ends up with the word "the" is going to have a lot more counting to do than the one that gets "zebra". I don't think this is the case here since only one of your reducers is getting hammered, which is why I think #1 is probably the case.
If one key shows a huge count, that's your answer, and you also know which key is causing the issue.

Related

Hive count distinct UDAF

I've read a question on SO:
I ran into a Hive query calculating a count distinct without grouping,
which runs very slow. So I was wondering how is this functionality
implemented in Hive, is there a UDAFCountDistinct for this?
And the answer:
To achieve count distinct, Hive relies on the GenericUDAFCount. There
is no UDAF specifically implemented for count distinct. Those
'distinct by' keys will be a part of the partitioning key of the
MapReduce Shuffle phase, this way they are 'distincted' quite
naturally.
As per your case, it runs slowly because there will be only one
reducer to process massive detailed data. You can use a group by
before counting to get more parallelism:
select count(1) from (select id from tbl group by id) tmp;
However I don't understand a few things:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
Why will there be only one reducer in this case?
Why will the weird inner query cause more partitions?
I'll try to explain.
Part 1:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
The UDAF GenericUDAFCount is capable of both count and count distinct. So how does it achieve count distinct?
Let's take the following query as an example:
select category, count(distinct brand) from market group by category;
One MapReduce Job will be launched for this query.
Distinct-by keys are the expressions (columns) inside count(distinct ...); in this case, brand.
Partition-by keys are the fields used to calculate a hash code for a record at the map phase. This hash value is then used to decide which partition a record should go to. Usually, the partition-by keys lie in the group by part of a SQL query. In this case, it's category.
The actual output-key of mappers will be the composition of partition-by key and a distinct-by key. For the above case, a mapper's output key may be like (drink, Pepsi).
This design makes all rows with the same group-by key fall into the same reducer.
The value part of mappers' output doesn’t matter here.
Later, at the shuffle phase, records are sorted according to the sort-by keys, which are the same as the output keys.
Then at the reduce phase, within each individual reducer, all records are sorted first by category and then by brand. This makes it easy to get the result of the count(distinct) aggregation: each distinct (category, brand) pair is guaranteed to be processed only once, so the aggregation turns into a count(*) within each group. The input key of a call to the reduce method is one of these distinct pairs. The reducer keeps track of the composite key; whenever the category part changes, we know a new group has arrived and we start counting that group from 1.
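To make the partition-by vs. output-key distinction concrete, here is a minimal Java sketch of the same technique in plain MapReduce terms (these are not Hive's actual classes; the composite-key layout and class name are made up for illustration): the map output key carries the full (category, brand) pair, but the partitioner hashes only the category part, so all brands of one category meet at one reducer, already sorted by brand.
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Map output key is "category\tbrand"; only the category part decides which
// reducer the record goes to, while sorting still uses the whole composite key.
public class CategoryOnlyPartitioner extends Partitioner<Text, NullWritable> {
    @Override
    public int getPartition(Text compositeKey, NullWritable value, int numPartitions) {
        String category = compositeKey.toString().split("\t", 2)[0];
        return (category.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}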
Part 2:
Why will there be only one reducer in this case?
When calculating count distinct without group by like this:
select count(distinct brand) from market
There will be just one reducer taking all the work. Why? Because the partition-by key doesn't exist; or, put differently, all records have the same hash code. So they all fall into the same reducer.
Part 3:
Why will the weird inner query cause more partitions?
The inner query's partition-by key is the group by key, id. There's a good chance that the id values are quite evenly distributed, so the records are processed by many different reducers. Then, after the inner query, it's safe to conclude that all the id values are distinct from each other, so a simple count(1) is all that's needed.
But do note that the outer query will still launch only one reducer. Why doesn't it suffer? Because no detailed values are needed for count(1), and map-side aggregation hugely cuts down the amount of data processed by the reducers.
One more thing, this rewriting is not guaranteed to perform better since it introduces an extra MR stage.

A join operation using Hadoop MapReduce

How do I join two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest that I emit the records based on a common key, add them to, say, a HashMap in the reducer, and then take a cross product. (e.g. Join of two datasets in Mapreduce/Hadoop)
This solution is very good and works for the majority of cases, but in my case the issue is rather different. I am dealing with data that has billions of records, and taking a cross product of the two sets is impossible because in many cases the HashMap would end up holding a few million objects, so I run into a heap space error.
I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, so I want to know if there is any solution that can help me avoid this issue.
Don't know if this is still relevant for anyone, but I'm facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:
When processing a line of type A, look up the key in Cassandra. If it exists, merge the A record into the existing value (the B elements). If not, create the key and add the A elements as the value.
When processing a line of type B, look up the key in Cassandra. If it exists, merge the B record into the existing value (the A elements). If not, create the key and add the B elements as the value.
This would require additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), don't think it should be much of a problem.
You should look into how Pig does skewed joins. The idea is that if your data contains too many values with the same key (even if there is no data skew), you can create artificial keys and spread the key distribution. This makes sure that each reducer gets fewer records than it otherwise would. For example, if you were to suffix "1" to 50% of your key "K1" and "2" to the other 50%, half of those records would end up on one reducer (key "K11") and the other half on another (key "K12").
If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm.
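For illustration, here is a minimal Java sketch of that salting idea (the hot key "K1", the number of salts, and the class name are made up; this is not Pig's actual skewed-join implementation). The skewed side gets a random salt appended to the hot key; the other relation would have to be replicated once per salt so each salted reducer still sees its join partners, which is the bookkeeping Pig's skewed join automates.
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Spreads a known hot key ("K1") over 10 salted variants so its records are
// split across several reducers. Records of the other relation would need to
// be emitted once per salt (omitted here for brevity).
public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int NUM_SALTS = 10;
    private final Random random = new Random();
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String joinKey = line.toString().split(",")[0];
        if ("K1".equals(joinKey)) {
            outKey.set(joinKey + "#" + random.nextInt(NUM_SALTS));   // K1#0 .. K1#9
        } else {
            outKey.set(joinKey);
        }
        context.write(outKey, line);
    }
}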

How should I select top 10% of the table?

I need to select the top x% of rows of a table in Pig. Could someone tell me how to do it without writing a UDF?
Thanks!
As mentioned before, first you need to count the number of rows in your table and then obviously you can do:
A = load 'X' as (row);
B = group A all;
C = foreach B generate COUNT(A) as total; -- 'count' is a reserved word in Pig, so use another alias
D = LIMIT A C.total / 10; -- you might need a cast to long here
The catch is that dynamic argument support for LIMIT was introduced in Pig 0.10. If you're working with an earlier version, a suggestion using the TOP function is offered here.
Not sure how you would go about pulling a percentage, but if you know your table size is 100 rows, you can use the LIMIT command to get the top 10% for example:
A = load 'myfile' as (t, u, v);
B = order A by t desc; -- descending, so the "top" rows come first
C = limit B 10;
(Above example adapted from http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+LIMIT+Operator)
As for dynamically limiting to 10%, I'm not sure you can do this without knowing how 'big' the table is, and I'm pretty sure you couldn't do it in a UDF; you'd need to run a job to count the number of rows, then another job to do the LIMIT query.
I won't write the Pig code, as it would take a while to write and test, but I would do it like this if you need an exact solution (if not, there are simpler methods); a simplified sketch of the threshold idea follows the steps below:
Get a sample from your input. Say a few thousand data points or so.
Sort this and find the n quantiles, where n should be somewhere in the order of the number of reducers you have or somewhat larger.
Count the data points for each quantile.
At this point the min point of the top 10% will fall into one of these intervals. Find this interval (this is easy as the counts will tell you exactly where it is), and using the sum of the counts of the larger quantiles together with the relevant quantile find the 10% point in this interval.
Go over your data again and filter out everything but the points larger than the one you just found.
Portions of this might require UDFs.
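As a simplified, in-memory Java illustration of the threshold idea only (steps 3-5 are collapsed here; in practice the sample would come from a sampling job and the final filter from another pass over the data, and all names are made up):
import java.util.Arrays;

// Toy version: sort a sample and read off the 90th-percentile value;
// anything >= that value is (approximately) in the top 10%.
public class TopTenPercentThreshold {
    static double estimateThreshold(double[] sample) {
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        int index = Math.min((int) Math.floor(sorted.length * 0.9), sorted.length - 1);
        return sorted[index];
    }

    public static void main(String[] args) {
        double[] sample = {3.2, 8.1, 0.5, 9.9, 4.4, 7.7, 1.1, 6.3, 2.2, 5.0};
        // A final pass over the full data would then keep only rows >= threshold.
        System.out.println("keep rows with value >= " + estimateThreshold(sample));
    }
}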

How to implement self-join/cross-product with hadoop?

It is a common task to perform some evaluation on pairs of items:
Examples: de-duplication, collaborative filtering, similar items, etc.
This is basically a self-join or cross-product over the same source of data.
To do a self join, you can follow the "reduce-side join" pattern. The mapper emits the join/foreign key as key, and the record as the value.
So, let's say we wanted to do a self-join on "city" (the middle column) on the following data:
don,baltimore,12
jerry,boston,19
bob,baltimore,99
cameron,baltimore,13
james,seattle,1
peter,seattle,2
The mapper would emit the key->value pairs:
(baltimore -> don,12)
(boston -> jerry,19)
(baltimore -> bob,99)
(baltimore -> cameron,13)
(seattle -> james,1)
(seattle -> peter,2)
In the reducer, we'll get this:
(baltimore -> [(don,12), (bob,99), (cameron,13)])
(boston -> [(jerry,19)])
(seattle -> [(james,1), (peter,2)])
From here, you can do the inner join logic if you so choose: just match up every item with every other item. To do this, load the data into an ArrayList, then do an N x N loop over the items to compare each one to the others.
Realize that reduce-side joins are expensive. They send pretty much all of the data to the reducers if you don't filter anything out. Also, be careful about loading the data into memory in the reducers -- you may blow your heap on a hot join key by loading all of its data into an ArrayList.
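Here is a rough Java sketch of that self-join on the city data above (the class names and the comma-separated field layout are assumed for illustration): the mapper keys each record on the city, and the reducer buffers the group in an ArrayList and pairs every record with every other one.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: "don,baltimore,12" -> (baltimore, "don,12")
class CityJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] f = line.toString().split(",");
        context.write(new Text(f[1]), new Text(f[0] + "," + f[2]));
    }
}

// Reducer: buffers every record of one city, then emits each pairing.
// The buffer lives in memory, so a hot key can blow the heap (see above).
class CityJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text city, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> records = new ArrayList<>();
        for (Text v : values) {
            records.add(v.toString());      // copy: Hadoop reuses the Text object
        }
        for (int i = 0; i < records.size(); i++) {
            for (int j = 0; j < records.size(); j++) {
                if (i != j) {
                    context.write(city, new Text(records.get(i) + " | " + records.get(j)));
                }
            }
        }
    }
}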
The above is a bit different than the typical reduce-side join. The idea is the same when joining two data sets: the foreign key is the key, and the record is the value. The only difference is that the values could be coming from two or more data sets. You can use MultipleInputs to have different mappers parse different input sets, then have the reducer collect data from both.
Cross product in the case where you don't have any constraints is a nightmare. I.e.,
select * from tablea, tableb;
There are a number of ways to do this. None of them are particularly efficient. If you want this type of behavior, leave me a comment and I'll spend more time explaining a way to do this.
If you can figure out some sort of join key which is a fundamental key to similarity, you are much better off.
Plug for my book: MapReduce Design Patterns. It should be published in a few months, but if you are really interested I can email you the chapter on joins.
One typically uses the reducer to perform whatever logic is required on the join. The trick is to map the dataset twice, possibly adding some marker to the value indicating which run it is. Then a self join is no different from any other kind of join.
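For the two-dataset (or map-twice) variant, here is a minimal Java sketch using MultipleInputs (the paths, record layout, and class names are placeholders, not anyone's production code): each mapper prefixes its values with an "A" or "B" marker so the reducer can separate the two sides before pairing them; for a self-join you would simply point both inputs at the same dataset.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TaggedJoin {

    // Lines of dataset A look like "key,valueA"; tag them with "A\t".
    public static class TagAMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable o, Text line, Context ctx) throws IOException, InterruptedException {
            String[] f = line.toString().split(",", 2);
            ctx.write(new Text(f[0]), new Text("A\t" + f[1]));
        }
    }

    // Lines of dataset B look like "key,valueB"; tag them with "B\t".
    public static class TagBMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable o, Text line, Context ctx) throws IOException, InterruptedException {
            String[] f = line.toString().split(",", 2);
            ctx.write(new Text(f[0]), new Text("B\t" + f[1]));
        }
    }

    // Split each key's group by tag, then pair every A value with every B value.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
            List<String> as = new ArrayList<>();
            List<String> bs = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("A\t")) as.add(s.substring(2)); else bs.add(s.substring(2));
            }
            for (String a : as)
                for (String b : bs)
                    ctx.write(key, new Text(a + "," + b));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(TaggedJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, TagAMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, TagBMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}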

Configure Map Side join for multiple mappers in Hadoop Map/Reduce

I have a question about configuring a map-side inner join for multiple mappers in Hadoop Map/Reduce.
Suppose I have two very large data sets A and B, I use the same partition and sort algorithm to split them into smaller parts. For A, assume I have a(1) to a(10), and for B I have b(1) to b(10). It is assured that a(1) and b(1) contain the same keys, a(2) and b(2) have the same keys, and so on. I would like to setup 10 mappers, specifically, mapper(1) to mapper(10). To my understanding, Map/Side join is a pre-processing task prior to the mapper, therefore, I would like to join a(1) and b(1) for mapper(1), to join a(2) and b(2) for mapper(2), and so on.
After reading some reference materials, it is still not clear to me how to configure these ten mappers. I understand that using CompositeInputFormat I would be able to join two files, but it seems that this configures only one mapper and joins the 20 files pair after pair (in 10 sequential tasks). How can I configure all ten mappers and join the ten pairs at the same time in a genuine Map/Reduce fashion (10 tasks in parallel)? To my understanding, ten mappers would require ten CompositeInputFormat settings because the files to join are all different. I strongly believe this is practical and doable, but I couldn't figure out what exact commands I should use.
Any hint and suggestion is highly welcome and appreciated.
Shi
Thanks a lot for the replies, David and Thomas!
I appreciate your emphasis on the prerequisites of the map-side join. Yes, I am aware of the sorting, the API, etc. After reading your comments, I think my actual problem is finding the correct expression for joining multiple splits of two files with CompositeInputFormat. For example, I have dataA and dataB sorted and reduced into 2 files each:
/A/dataA-r-00000
/A/dataA-r-00001
/B/dataB-r-00000
/B/dataB-r-00001
The expression command I am using now is:
inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"/A/dataA-r-00000"),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"/B/dataB-r-00000"))
It works, but as you mentioned, it only starts two mappers (because the inner join prevents splitting) and could be very inefficient if the files are big. If I want to use more mappers (say another 2 mappers to join dataA-r-00001 and dataB-r-00001), how should I construct the expression? Is it something like:
String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/dataA-r-00000'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/dataB-r-00000'), tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/dataA-r-00001'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/dataB-r-00001'))" ;
But I think that could be mistaken, because the command above would actually perform an inner join of four files (which would result in nothing in my case, because files *r-00000 and *r-00001 have non-overlapping keys).
Or I could just use the two dirs as inputs, like:
String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/'))" ;
Will the inner join match the pairs automatically according to the file endings, say "00000" with "00000" and "00001" with "00001"? I am stuck at this point because I need to construct the expression and pass it to
conf.set("mapred.join.expr", joinexpression);
In short, how should I build the proper expression if I want to use more mappers to join multiple pairs of files simultaneously?
There are map-side and reduce-side joins.
You proposed to use a map-side join, which is executed inside a mapper, not before it.
Both sides must have the same key and value types. So you can't join a LongWritable and a Text, although they might have the same value.
There are a few more subtle things to note:
The input files have to be sorted, so they will most likely be reducer output.
You can control the number of mappers in the join map phase by setting the number of reducers in the jobs that sorted the datasets.
The whole procedure basically works like this: you have dataset A and dataset B, and both share the same key type, let's say LongWritable.
Run two jobs that sort the two datasets by their keys; both jobs HAVE TO set the number of reducers to an equal number, say 2.
This will result in 2 sorted files for each dataset.
Now you set up the job that joins the datasets; this job will spawn with 2 mappers. It could be more if you set the number of reducers higher in the previous jobs.
Do whatever you like in the reduce step.
If the number of files to be joined is not equal, it will result in an exception during job setup.
Setting up a join is kind of painful, mainly because you have to use the old API for the mapper and reducer if your version is less than 0.21.x; a rough sketch of the job setup follows at the end of this answer.
This document describes very well how it works. Scroll all the way to the bottom, sadly this documentation is somehow missing in the latest Hadoop docs.
Another good reference is "Hadoop the Definitive Guide", which explains all of this in more detail and with examples.
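To make the procedure above concrete, here is a rough sketch of the job setup with the old org.apache.hadoop.mapred API (the paths, class names, and the trivial mapper are placeholders, and it assumes /A and /B were produced by jobs with the same number of reducers and the same partitioner). Pointing the expression at the two directories, rather than at individual part files, is what should let the job pair up part-00000 with part-00000, part-00001 with part-00001, and so on, spawning one mapper per pair:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class MapSideJoinDriver {

    // Each map call receives the join key plus a tuple holding one value per input.
    public static class JoinMapper extends MapReduceBase
            implements Mapper<Text, TupleWritable, Text, Text> {
        public void map(Text key, TupleWritable value,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            out.collect(key, new Text(value.get(0) + "\t" + value.get(1)));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapSideJoinDriver.class);
        conf.setJobName("map-side inner join");

        // Join expression over the two sorted output directories.
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class, "/A", "/B"));
        conf.setInputFormat(CompositeInputFormat.class);

        conf.setMapperClass(JoinMapper.class);
        conf.setNumReduceTasks(0);          // the join itself needs no reduce phase
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path("/joined"));

        JobClient.runJob(conf);
    }
}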
I think you're missing the point. You don't control the number of mappers. It's the number of reducers that you have control over. Simply emit the correct keys from your mapper. Then run 10 reducers.

Resources