How to implement self-join/cross-product with hadoop? - hadoop

It is common task to make some evaluation on pairs of items:
Examples: de-duplication, collaborative filtering, similar items etc
This is basically self-join or cross-product with the same source of data.

To do a self join, you can follow the "reduce-side join" pattern. The mapper emits the join/foreign key as key, and the record as the value.
So, let's say we wanted to do a self-join on "city" (the middle column) on the following data:
don,baltimore,12
jerry,boston,19
bob,baltimore,99
cameron,baltimore,13
james,seattle,1
peter,seattle,2
The mapper would emit the key->value pairs:
(baltimore -> don,12)
(boston -> jerry,19)
(baltimore -> bob,99)
(baltimore -> cameron,13)
(seattle -> james,1)
(seattle -> peter,2)
In the reducer, we'll get this:
(baltimore -> [(don,12), (bob,99), (cameron,13)])
(boston -> [(jerry,19)])
(seattle -> [(james,1), (peter,2)])
From here, you can do the inner join logic, if you so choose. To do this, you'd just match up every item for every other item. To do this, load it up the data into an array list, then do a N x N loop over the items to compare each to each other.
Realize that reduce-side joins are expensive. They send pretty much all of the data to the reducers if you don't filter anything out. Also, be careful of loading the data up into memory in the reducers-- you may blow your heap on a hot join key by loading all of the data in an array list.
The above is a bit different than the typical reduce-side join. The idea is the same when joining two data sets: the foreign key is the key, and the record is the value. The only difference is that the values could be coming from two or more data sets. You can use MultipleInputs to have different mappers parse different input sets, then have the reducer collect data from both.
Cross product in the case where you don't have any constraints is a nightmare. I.e.,
select * from tablea, tableb;
There are a number of ways to do this. None of them are particularly efficient. If you want this type of behavior, leave me a comment and I'll spend more time explaining a way to do this.
If you can figure out some sort of join key which is a fundamental key to similarity, you are much better off.
Plug for my book: MapReduce Design Patterns. It should be published in a few months, but if you are really interested I can email you the chapter on joins.

One typically uses the reducer to perform whatever logic is required on the join. The trick is to map the dataset twice, possibly adding some marker to the value indicating which run it is. Then a self join is no different from any other kind of join.

Related

Hive count distinct UDAF2

I've read a question on SO:
I ran into a Hive query calculating a count distinct without grouping,
which runs very slow. So I was wondering how is this functionality
implemented in Hive, is there a UDAFCountDistinct for this?
And the answer:
To achieve count distinct, Hive relies on the GenericUDAFCount. There
is no UDAF specifically implemented for count distinct. Those
'distinct by' keys will be a part of the partitioning key of the
MapReduce Shuffle phase, this way they are 'distincted' quite
natually.
As per your case, it runs slowly because there will be only one
reducer to process massive detailed data. You can use a group by
before counting to get more parallelism:
select count(1) from (select id from tbl group by id) tmp;
However I don't understand a few things:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
Why there will be only one reducer in this case?
Why the weird inner query will cause more partitions?
I'll try to explain.
Part 1:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
The UDAF GenericUDAFCount is capable of both count and count distinct. How does it work to achieve count distinct?
Let's take the following query as an example:
select category, count(distinct brand) from market group by category;
One MapReduce Job will be launched for this query.
distinct-by keys are the expressions(columns) within count(distinct ..., in this case, brand.
partition-by keys are the fields used to calculate a hash code for a record at map phase. And then this hash value is used to decided which partition a record should go. Usually, partition-by keys lies in the group by part of a SQL query. In this case, it's category.
The actual output-key of mappers will be the composition of partition-by key and a distinct-by key. For the above case, a mapper's output key may be like (drink, Pepsi).
This design makes all rows with the same group-by key fall into the same reducer.
The value part of mappers' output doesn’t matter here.
Later at the Shuffle phase, records are sort according to the sort-by keys, which is the same as the output key.
Then at reduce phase, at each individual reducer, all records are sorted first by category then by brand. This makes it easy to get the result of the count(distinct ) aggregation. Each distinct (category, brand) pair is guaranteed to be processed only once. The aggregation has been turned into a count(*) at each group. The input key of a call to the reduce method will be one of these distinct pairs. Reducer processes keep track of the composited key. Whenever the category part changes, we know a new group has come and we start counting this group from 1.
Part 2:
Why there will be only one reducer in this case?
When calculating count distinct without group by like this:
select count(distinct brand) from market
There will be just one reducer taking all the work. Why? Because the partition-by key doesn’t exist, or we can say that all records has the same hash code. So they will fall into the same reducer.
Part 3:
Why the weird inner query will cause more partitions?
The inner query's partition-by key is the group by key, id. There’s a chance that id values are quite evenly distributed, so records are processed by many different reducers. Then after the inner query, it's safe to conclude that all the id are different from each other. So now a simple count(1) is all that's needed.
But do note that the output will launch only one reducer. Why doesn’t it suffer? Because no detailed values are needed for count(1), map-side aggregation hugely cut down the amount of data processed by reducers.
One more thing, this rewriting is not guaranteed to perform better since it introduces an extra MR stage.

(Spark) Is there any possible way to optimize two large rdd join when both of them is too large for memory(means cannot use broadcast)?

As title.
Is there any possible way to optimize two large rdd join when both of them is too large for memory? In this case I suppose we cannot use broadcast for map side join.
If I have to join this two rdd, and both of them is too large to fit in memory:
country_rdd:
(id, country)
income_rdd:
(id, (income, month, year))
joined_rdd = income_rdd.join(country_rdd)
Is there any possible way to reduce the shuffling here? Or anything I can do to tuning the join performance?
Besides, the joined_rdd will be further calculated and reduced only by country and time, not relevant to id anymore. Eg: my final result = income for different country in different years. What's the best practice to do that?
I used to consider do some pre-partition, but seems if I only need to do join once that won't help much?
In general case (no a priori knowledge of the key properties) it is not possible. Shuffle is a essential part of the join and cannot be avoided.
In specific cases you can reduce shuffling in two ways:
Design your own Partitioner which takes advantage of pre-existing data distribution. For example if you know that data is sorted by key you can use that knowledge to limit the shuffle.
If you apply inner join, and only a fraction of keys occurs in both RDDs you can:
Create Bloom filters on each datasets. Lets call these leftFilter and rightFilter.
Filter RDD with opposite filters (leftRDD with rightFilter, rightRDD with leftFilter)
Join filtered RDD

How to decide when to use a Map-Side Join or Reduce-Side while writing an MR code in java?

How to decide when to use a Map-Side Join or Reduce-Side while writing an MR code in java?
Map side join performs join before data reached to Map. Map function expects a strong prerequisites before joining data at map side. Both method have some pros and cons. Map side join is efficient compare to reduce side but it require strict format.
Prerequisites:
Data should be partitioned and sorted in particular way.
Each input data should be divided in same number of partition.
Must be sorted with same key.
All the records for a particular key must reside in the same partition.
Reduce side join also called as Repartitioned join or Repartitioned sort merge join and also it is mostly used join type.
It will have to go through sort and shuffle phase which would incur network overhead.Reduce side join uses few terms like data source, tag and group key lets be familiar with it.
Data Source is referring to data source files, probably taken from RDBMS
Tag would be used to tag every record with it’s source name, so that it’s source can be identified at any given point of time be it is in map/reduce phase. why it is required will cover it later.
Group key is referring column to be used as join key between two data sources.
As we know we are going to join this data on reduce side we must prepare in a way that it can be used for joining in reduce phase. let’s have a look what are the steps needs to be perform.
For more information check this link:
http://hadoopinterviews.com/map-side-join-reduce-side-join/
You will use mapside join if one of your table can be fit in memory which will reduce overhead on your sort and shuffle data.
Reduce-Side joins are more simple than Map-Side joins since the input datasets need not to be structured. But it is less efficient as both datasets have
to go through the MapReduce shuffle phase. the records with the same key are brought together in the reducer.

A join operation using Hadoop MapReduce

How to take a join of two record sets using Map Reduce ? Most of the solutions including those posted on SO suggest that I emit the records based on common key and in the reducer add them to say a HashMap and then take a cross product. (eg. Join of two datasets in Mapreduce/Hadoop)
This solution is very good and works for majority of the cases but in my case my issue is rather different. I am dealing with a data which has got billions of records and taking a cross product of two sets is impossible because in many cases the hashmap will end up having few million objects. So I encounter a Heap Space Error.
I need a much more efficient solution. The whole point of MR is to deal with very high amount of data I want to know if there is any solution that can help me avoid this issue.
Don't know if this is still relevant for anyone, but I facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:
When running on a line of type A, look for the key in Cassandra. If exists - merge A records into the existing value (B elements). If not - create a key, and add A elements as value.
When running on a line of type B, look for the key in Cassandra. If exists - merge B records into the existing value (A elements). If not - create a key, and add B elements as value.
This would require additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), don't think it should be much of a problem.
You should look into how Pig does skew joins. The idea is that if your data contains too many values with the same key (even if there is no data skew) , you can create artificial keys and spread the key distribution. This would make sure that each reducer gets less number of records than otherwise. For e.g. if you were to suffix "1" to 50% of your key "K1" and "2" the other 50% you will end with half the records on the reducer one (1K1) and the other half goes to 2K2.
If the distribution of the keys values are not known before hand you could some kind of sampling algorithm.

join a record of a text file with all the other records in the same file in mapreduce

this article xrds:article in the subsection "An Example of the Tradeoff" describes a way (the first one) that every single record is joined with all the other records of the input file. I wonder how could that be possible in mapreduce without passing the whole input file in only one mapper.
There are three major types of joins (there are a few others out there) for MapReduce.
Reduce Side Join - For both data sets, you output the "foreign key" as the output key of the mapper. You use something like MultipleInputs to load two data sets at once. In the reducer, data from both data sets is brought together by foreign key, which allows you to do the join logic (like Cartesian product, perhaps) there. This is general purpose and will work for just about every situation.
Replicated Join - You push out the smaller data set into the DistributedCache. In each matter, you load the smaller data set from there into memory. As records pass through the mapper, join the data up against the in-memory data set. This is what you suggest in your question. It should be only used when the smaller data set can be stored in memory.
Composite Join - This one is a bit niche because it needs to be set up. If two data sets are sorted and partitioned by the foreign key, then you can do a composite join using CompositeInputFormat. It basically does a merge-like operation that is pretty efficient.
Shameless plug for my book MapReduce Design Patterns: there is a whole chapter on joins (chapter 5).
Check out the code examples for the book here: https://github.com/adamjshook/mapreducepatterns/tree/master/MRDP/src/main/java/mrdp/ch5

Resources