How can I join two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest that I emit the records keyed on the common key and, in the reducer, add them to, say, a HashMap and then take a cross product (e.g. Join of two datasets in Mapreduce/Hadoop).
This solution is very good and works for the majority of cases, but my situation is rather different. I am dealing with data that has billions of records, and taking a cross product of the two sets is impossible because in many cases the HashMap ends up holding a few million objects, so I run into a heap space error.
I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, so I want to know if there is any solution that can help me avoid this issue.
Don't know if this is still relevant for anyone, but I'm facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:
When running on a line of type A, look for the key in Cassandra. If exists - merge A records into the existing value (B elements). If not - create a key, and add A elements as value.
When running on a line of type B, look for the key in Cassandra. If exists - merge B records into the existing value (A elements). If not - create a key, and add B elements as value.
This would require an additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.
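The merge-by-key idea above can be sketched roughly as follows. This is only an illustration: a plain in-memory dict stands in for the key-value store (it is not a Cassandra client), and `process_line` and `joined_pairs` are hypothetical names.

```python
from collections import defaultdict

# Stand-in for the key-value store: key -> records of each type seen so far.
# (Assumption: a real store would persist this instead of holding it in memory.)
store = defaultdict(lambda: {"A": [], "B": []})


def process_line(kind, key, record):
    """Merge one record of type 'A' or 'B' into the entry for its key.

    If the key does not exist yet, the defaultdict creates it."""
    store[key][kind].append(record)


def joined_pairs(key):
    """Cross product of the A and B records stored under one key."""
    entry = store[key]
    return [(a, b) for a in entry["A"] for b in entry["B"]]


lines = [("A", "k1", "a1"), ("B", "k1", "b1"), ("B", "k1", "b2"), ("A", "k2", "a9")]
for kind, key, record in lines:
    process_line(kind, key, record)
```

The lines can arrive in any order, since each one only appends to the entry for its own key.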
You should look into how Pig does skew joins. The idea is that if your data contains too many values with the same key (even if there is no data skew), you can create artificial keys and spread out the key distribution. This makes sure that each reducer gets fewer records than it otherwise would. For example, if you were to prefix "1" to 50% of the records with key "K1" and "2" to the other 50%, half of the records would end up on one reducer (1K1) and the other half would go to another (2K1).
If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm.
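A minimal sketch of the artificial-key idea, in plain Python rather than Pig or Hadoop. The function names and `NUM_SALTS` are made up for illustration. One assumption goes beyond the description above: for the join to stay complete, the records from the other data set are copied to every artificial key.

```python
import random
from collections import defaultdict

NUM_SALTS = 2  # assumed number of artificial sub-keys per key (single digit)


def salt_large_side(records, rng):
    """Prefix each key of the heavy side with one random salt, e.g. K1 -> 1K1 or 2K1."""
    for key, value in records:
        yield (f"{rng.randrange(NUM_SALTS) + 1}{key}", value)


def replicate_other_side(records):
    """Emit each record of the other side once per salt so every sub-key finds its match."""
    for key, value in records:
        for salt in range(1, NUM_SALTS + 1):
            yield (f"{salt}{key}", value)


def reduce_join(left, right):
    """Group by salted key (one group per 'reducer') and join within each group."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    out = []
    for salted_key, (lefts, rights) in groups.items():
        original_key = salted_key[1:]  # strip the one-character salt
        out.extend((original_key, l, r) for l in lefts for r in rights)
    return out


rng = random.Random(0)
skewed = [("K1", i) for i in range(10)]
other = [("K1", "x")]
result = reduce_join(salt_large_side(skewed, rng), replicate_other_side(other))
```

Each salted group holds only part of the records for K1, which is what keeps the per-reducer memory down.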
I have two Spark dataframes DFa and DFb, they have same schema, ('country', 'id', 'price', 'name').
DFa has around 610 million rows,
DFb has 3,000 million rows.
Now I want to find all rows from DFa and DFb that have the same id, where id looks like "A6195A55-ACB4-48DD-9E57-5EAF6A056C80".
It's a SQL inner join, but when I run a Spark SQL inner join, one task gets killed because its container uses too much memory, which causes a Java heap memory error. My cluster has limited resources, so tuning the YARN and Spark configuration is not a feasible approach.
Is there any other solution to deal with this? A non-Spark solution is also acceptable if the runtime is acceptable.
More generally, can anyone suggest algorithms and solutions for finding common elements in two very large datasets?
First compute 64 bit hashes of your ids. The comparison will be a lot faster on the hashes, than on the string ids.
My basic idea is:
Build a hash table from DFa.
As you compute the hashes for DFb, you do a lookup in the table. If there's nothing there then drop the entry (no match). If you get a hit compare the actual IDs to make sure you don't get a false positive.
The complexity is O(N). Not knowing how many overlaps you expect this is the best you can do since you might have to output everything, because it all matches.
The naive implementation would use about 6 GB of RAM for the table (assuming 80% occupancy and that you use a flat hash table), but you can do better.
Since we already have the hash, we only need to know whether it exists. So you only need one bit to mark that, which reduces the memory usage by a lot (you need 64x less memory per entry, but you need to lower the occupancy). However, this is not a common data structure, so you'll need to implement it yourself.
But there's something even better, something even more compact. That is called bloom filter. This will introduce some more false positives, but we had to double check anyway because we didn't trust the hash, so it's not a big downside. The best part is that you should be able to find libraries for it already available.
So everything together it looks like this:
Compute hashes from DFa and build a bloom filter.
Compute hashes from DFb and check against the bloom filter. If you get a match, look up the ID in DFa to make sure it's a real match and add it to the result.
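The two steps above might look roughly like this in plain Python. It is a toy sketch, not a Spark job: the `BloomFilter` class, its sizing parameters and `common_ids` are all hypothetical, and the "look up the ID in DFa" step is modelled as a second pass over the A-side ids that keeps only the candidates in memory.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step size
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def common_ids(ids_a, ids_b, num_bits=1 << 16, num_hashes=4):
    """Return the ids of ids_b that also occur in ids_a, in ids_b order."""
    bloom = BloomFilter(num_bits, num_hashes)
    for id_ in ids_a:
        bloom.add(id_)
    # Cheap filter: drops ids that are definitely absent, may keep false positives.
    candidates = [id_ for id_ in ids_b if bloom.might_contain(id_)]
    # Exact check: second pass over A, holding only the candidates in memory.
    candidate_set = set(candidates)
    confirmed = {id_ for id_ in ids_a if id_ in candidate_set}
    return [id_ for id_ in candidates if id_ in confirmed]


ids_a = ["A6195A55-ACB4-48DD-9E57-5EAF6A056C80", "id-x", "id-y"]
ids_b = ["id-y", "id-z", "A6195A55-ACB4-48DD-9E57-5EAF6A056C80"]
matches = common_ids(ids_a, ids_b)
```

A Bloom filter never reports a stored id as absent, so the exact check only has to weed out false positives.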
This is a typical use case in any big data environment. You can use map-side joins, where the smaller table is cached and broadcast to all the executors.
You can read more about broadcasted joins here
this is related to cassandra time series modeling when time can go backward, but I think I have a better scenario to explain why the topic is important.
Imagine I have a simple table
CREATE TABLE measures(
key text,
measure_time timestamp,
value int,
PRIMARY KEY (key, measure_time))
The purpose of the clustering key is to have data arranged in timestamp order (ascending by default; a decreasing order would need CLUSTERING ORDER BY (measure_time DESC)). This leads to very efficient range-based queries, which for a given key result in sequential disk reads (which are intrinsically fast).
Many times I have seen suggestions to use a generated timeuuid as the timestamp value (using now()), which is obviously intrinsically ordered. But you can't always do that. In what seems to me a very common pattern, you can't use it if:
1) your user wants to query on the actual time when the measure was taken, not the time when the measure was written.
2) you use multiple writing threads
So, I want to understand what happens if I write data in an unordered fashion (with respect to measure_time column).
I have personally tested that if I insert timestamp-unordered values, Cassandra indeed reports them to me in a timestamp-ordered fashion when I run a select.
But what happens "under the hood"? In my opinion, it is impossible that data are still ordered on disk. At some point in fact data need to be flushed on disk. Imagine you flush a data set in the time range [0,10]. What if the next data set to flush has measures with timestamp=9? Are data re-arranged on disk? At what cost?
Hope I was clear, I couldn't find any explanation about this on Datastax site but I admit I'm quite a novice on Cassandra. Any pointers appreciated
Sure. Once written, an SSTable file is immutable. Your timestamp=9 will end up in another SSTable, and C* will have to merge and sort data from both SSTables if you request both timestamp=10 and timestamp=9. That is less efficient than reading from a single SSTable.
The Compaction process may merge those two SSTables into new single one. See
And try to avoid very wide rows/partitions, which is what you will get if you have a lot of measurements (i.e. a lot of measure_time values) for a single key.
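As a rough illustration of the merge described above, the sketch below treats each SSTable as an immutable, already sorted list of (measure_time, value) pairs for one key. It is a simplification under stated assumptions: it ignores tombstones, write timestamps and everything else a real read path or compaction handles, and `merge_sorted_runs` is a made-up name.

```python
import heapq

# Two immutable, individually sorted runs for the same key.
sstable_1 = ((0, "a"), (5, "b"), (10, "c"))  # flushed first, covers [0, 10]
sstable_2 = ((9, "late"), (12, "d"))         # flushed later, contains timestamp=9


def merge_sorted_runs(*runs):
    """Merge already-sorted runs into one sorted sequence without re-sorting.

    A read that spans both runs needs this merge; compaction does a similar
    merge once and writes the result out as a single new run."""
    return list(heapq.merge(*runs))


rows_in_time_order = merge_sorted_runs(sstable_1, sstable_2)
```

Nothing is rearranged in the existing runs; the ordering the client sees comes from the merge.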
Following the pointers in an eBay tech blog and a DataStax developers blog, I model some event log data in Cassandra 1.2. As a partition key, I use “yymmddhh|bucket”, where bucket is any number between 0 and the number of nodes in the cluster.
The Data model
cqlsh:Log> CREATE TABLE transactions (yymmddhh varchar, bucket int,
rId int, created timeuuid, data map, PRIMARY
KEY((yymmddhh, bucket), created) );
(rId identifies the resource that fired the event.)
(map holds key-value pairs derived from a JSON; the keys change, but not much)
I assume that this translates into a composite primary/row key with X buckets per hour.
My column names are then timeuuids. Querying this data model works as expected (I can query time ranges.)
The problem is the performance: the time to insert a new row increases continuously.
So I am doing something wrong, but can't pinpoint the problem.
When I use the timeuuid as a part of the row key, the performance remains stable on a high level, but this would prevent me from querying it (a query without the row key of course throws an error message about "filtering").
Any help? Thanks!
Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to remain below about 0.005s per insert.
The core question remains:
How is my usage of the "map" datatype inefficient? And what would be an efficient way to handle thousands of inserts with only slight variation in the keys?
The keys I use to put data into the map mostly remain the same. I understood the DataStax documentation (can't post a link due to reputation limitations, sorry, but it is easy to find) to say that each key creates an additional column -- or does it create one new column per "map"?? That would be... hard for me to believe.
I suggest you model your rows a little differently. The collections aren't very good to use in cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements in it the size field will overflow and even though the server sends all of the elements back to the client, the client only sees the N % 2^16 first elements (so if you have 2^16 + 3 elements it will look to the client as if there are only 3 elements).
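The arithmetic behind the overflow can be checked with a few lines of Python. This only mimics a two-byte unsigned size field as described above; it is not actual protocol or driver code, and `size_seen_by_client` is a hypothetical name.

```python
import struct


def size_seen_by_client(actual_count):
    """Element count that survives a round trip through a 2-byte unsigned field."""
    wrapped = actual_count % (1 << 16)    # only the low 16 bits fit in the field
    packed = struct.pack(">H", wrapped)   # two bytes, big-endian
    return struct.unpack(">H", packed)[0]


small = size_seen_by_client(100)
overflowed = size_seen_by_client(2**16 + 3)
```

Here `small` stays 100, while `overflowed` comes back as 3, matching the 2^16 + 3 example.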
If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not think that using collections gives you worse performance, I'm not really sure how that would happen.
CQL3 collections are basically just a hack on top of the storage model (and I don't mean hack in any negative sense), you can make a MAP-like row that is not constrained by the above limitation yourself:
CREATE TABLE transactions (
yymmddhh VARCHAR,
bucket INT,
created TIMEUUID,
rId INT,
key VARCHAR,
value VARCHAR,
PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
);
(Notice that I moved rId and the map key into the primary key, I don't know what rId is, but I assume that this would be correct)
This has two drawbacks compared to using a MAP: it requires you to reassemble the map when you query the data (you get back a row per map entry), and it uses a little more space since C* will insert a few extra columns. The upside is that there is no problem with collections getting too big.
In the end it depends a lot on how you want to query your data. Don't optimize for insertions, optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed so you know what to query for, so as I said: it depends a lot on how you want to query your data).
You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will have a much better spread.
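To put rough numbers on the bucket argument, here is a small back-of-the-envelope calculation. It assumes each bucket's partition lands on a node independently and uniformly at random, which ignores replication and the real token layout. The three-node cluster is a hypothetical figure, and both function names are made up.

```python
def prob_shared_node(num_buckets, num_nodes):
    """Probability that at least two buckets land on the same node."""
    p_all_distinct = 1.0
    for i in range(num_buckets):
        p_all_distinct *= max(num_nodes - i, 0) / num_nodes
    return 1.0 - p_all_distinct


def expected_nodes_hit(num_buckets, num_nodes):
    """Expected number of distinct nodes that receive at least one bucket."""
    return num_nodes * (1.0 - (1.0 - 1.0 / num_nodes) ** num_buckets)


NODES = 3  # assumed cluster size, for illustration only
p_three_buckets = prob_shared_node(3, NODES)     # 7/9, roughly 0.78
nodes_three = expected_nodes_hit(3, NODES)       # 19/9, roughly 2.1 of 3 nodes
nodes_three_hundred = expected_nodes_hit(300, NODES)
```

Under these assumptions three buckets leave a node idle most of the time, while three hundred buckets almost surely touch every node.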
Use yymmddhh as the row key and bucket+timeUUID as the column name, where each bucket holds 20 (or some other fixed number of) records; the buckets can be managed using a counter column family.
This article (xrds:article), in the subsection "An Example of the Tradeoff", describes a way (the first one) in which every single record is joined with all the other records of the input file. I wonder how that could be possible in MapReduce without passing the whole input file to a single mapper.
There are three major types of joins (there are a few others out there) for MapReduce.
Reduce Side Join - For both data sets, you output the "foreign key" as the output key of the mapper. You use something like MultipleInputs to load two data sets at once. In the reducer, data from both data sets is brought together by foreign key, which allows you to do the join logic (like Cartesian product, perhaps) there. This is general purpose and will work for just about every situation.
Replicated Join - You push out the smaller data set into the DistributedCache. In each mapper, you load the smaller data set from there into memory. As records pass through the mapper, join the data up against the in-memory data set. This is what you suggest in your question. It should only be used when the smaller data set can be stored in memory.
Composite Join - This one is a bit niche because it needs to be set up. If two data sets are sorted and partitioned by the foreign key, then you can do a composite join using CompositeInputFormat. It basically does a merge-like operation that is pretty efficient.
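To make the first two join types concrete, here is a small single-process sketch in Python. It only imitates the data flow (grouping by foreign key for the reduce-side join, an in-memory lookup table for the replicated join). It does not use Hadoop, MultipleInputs or the DistributedCache, the composite join is left out, and the function names are illustrative.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Group both inputs by foreign key (the shuffle), then join per key (the reducer)."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return [(key, l, r)
            for key in sorted(grouped)
            for l in grouped[key][0]
            for r in grouped[key][1]]


def replicated_join(large, small):
    """Load the small side into memory, then stream the large side past it (the mapper)."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(key, value, match)
            for key, value in large
            for match in lookup.get(key, [])]


users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]
by_reduce = reduce_side_join(users, orders)
by_replication = replicated_join(orders, users)
```

Both calls produce an inner join: user 2 has no orders and order 3 has no user, so neither shows up.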
Shameless plug for my book MapReduce Design Patterns: there is a whole chapter on joins (chapter 5).
Check out the code examples for the book here:
I'm looking at solutions to a problem that involves reading keyed data from more than one file. In a single map step I need all the values for a particular key in the same place at the same time. I see the discussion of "the shuffle" in White's book and am tempted to wonder whether, once merging is done and the input to a reducer is sorted by key, all the data for a key is there... and whether you can count on that.
The bigger picture is that I want to do a poor-man's triple-store federation, and the triples I want to load into an in-memory store don't all come from the same file. It's a vertical (?) partition where the values for a particular key are in different files. Said another way, the columns for a complete record each come from different files. Does Hadoop re-assemble that? At least for a single key at a time.
In short: yes. In a Hadoop job, the partitioner chooses which reducer receives which (key, value) pairs. Quote from the Yahoo tutorial section on partitioning: "It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same". This is also necessary for many of the types of algorithms typically solved with map reduce (such as distributed sorting, which is what you're describing).
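The guarantee quoted above can be illustrated with a toy partitioner in Python. This is a sketch, not Hadoop code: `zlib.crc32` is just a deterministic stand-in for the key's hash, and `partition_for` and `shuffle` are made-up names.

```python
import zlib


def partition_for(key, num_reducers):
    """Pick a reducer from the key alone, so every mapper makes the same choice."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair from every mapper to its reducer's partition."""
    partitions = [dict() for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            target = partitions[partition_for(key, num_reducers)]
            target.setdefault(key, []).append(value)
    return partitions


# Two "files" (mapper outputs) that each hold different columns of the same records.
names_file = [("s1", "name=Ann"), ("s2", "name=Bo")]
ages_file = [("s1", "age=30")]
partitions = shuffle([names_file, ages_file], num_reducers=4)
s1_values = partitions[partition_for("s1", 4)]["s1"]
```

Because the destination depends only on the key, the values for "s1" from both files meet in a single partition.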