So I'm just working on some Hadoop training getting to learn the lay of the land, and I'm attempting to do a reduce side join, which I have functioning, except for the secondary sort.
So the basics:
Two files
One has player,team,salary
Another has player,team,home runs
Output should be team,player,salary,home runs
The New York Mets should be partitioned into one file, while all the other crappy teams should be put into another.
Each of these files should be sorted by team, and secondarily by player salary.
I'm using a key of team,playerID to join on and that works, but I have no idea how I would sort by salary since only one of the two files has it.
Is this a possible task or can this only be accomplished via a map side join?
For this "The New York Mets should be partitioned into one file, while all the other crappy teams should be put into another."
You can use a custom partitioner that returns 0 for the New York Mets and 1 for all other teams.
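A minimal sketch of that partitioning rule, written in plain Python rather than as a Hadoop Partitioner subclass. It assumes the map output key is the string "team,playerID" described in the question; the player IDs and the name get_partition are made up for illustration. In a real job the same logic would go into getPartition of a custom Partitioner, with the job configured for two reducers.

```python
METS = "New York Mets"

def get_partition(key: str, num_partitions: int) -> int:
    """Return the reducer index for a composite "team,playerID" key."""
    team = key.split(",", 1)[0].strip()
    if num_partitions < 2:
        return 0  # with a single reducer everything lands in one file
    return 0 if team == METS else 1

# Hypothetical keys, only to show the routing.
keys = ["New York Mets,p01", "Boston Red Sox,p02", "New York Mets,p03"]
partitions = [get_partition(k, 2) for k in keys]
print(partitions)
```

Because the partition depends only on the team part of the key, every record for a given team ends up in the same output file.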
Regarding sorting on salary, you are right: a map-side join is the efficient way to do it. If the data set is not very big, you can sort in the reducer by looping twice over the reducer input. In the first loop you build an in-memory collection that holds the sorted data, and in the second loop you emit it. But this is highly inefficient on larger data sets: if each team has many players, it will be slow and can run out of memory.
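The two-pass idea could look roughly like the sketch below. It is plain Python, not Hadoop code, and it assumes things the thread does not state: that the reducer sees all records of one team in a single call, that each value carries a tag ("S" for the salary file, "H" for the home run file), and that ascending salary order is wanted. The player names are invented.

```python
def reduce_team(team, tagged_values):
    """Join salary and home-run records for one team, then sort by salary."""
    salaries, home_runs = {}, {}
    # First pass: buffer everything in memory, split by source tag.
    for tag, player, amount in tagged_values:
        (salaries if tag == "S" else home_runs)[player] = amount
    # Inner join on player: keep only players present in both files.
    joined = [(team, p, salaries[p], home_runs[p])
              for p in salaries if p in home_runs]
    # Second pass: emit in salary order (ascending is an assumption).
    joined.sort(key=lambda row: row[2])
    return joined

values = [("H", "alice", 30), ("S", "bob", 500), ("S", "alice", 900),
          ("H", "bob", 12), ("S", "carol", 100)]
rows = reduce_team("New York Mets", values)
for row in rows:
    print(",".join(str(field) for field in row))
```

The two dictionaries are exactly the in-memory collection the answer warns about: they grow with the number of players per team.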
I am creating a DynamoDB table to support an Alexa Skill for use as a podcast player. The way I envision the table is to use the episode number as the Partition Key and the PublicationDate as the optional Sort Key. I have two concerns about designing my table schema in this way.
First, say I wanted to query the table to get the latest episode - I'm not sure that I can do it in this fashion, as a query requires an equivalence operation on the Partition Key (episode = X), which I wouldn't know in advance. Am I correct in believing that a scan would be quite an expensive operation if the podcast has a large number of episodes (say more than 1000)?
I would need to look at each item in the table, compare its episode number (Partition Key value) to the previous returned Item and update a variable with the more recent Item each time one was found until all Items in the table were cycled through in this way.
Secondly, DynamoDB best practices say two things which work incongruently in my use-case (probably a sign that my design is flawed). First, the Partition Key should be unique or close to unique. Second, queries should be expected to be more or less uniformly dispersed amongst the keys. In my case, though, while the Partition Key would indeed be unique, I would expect the vast majority of queries to be targeting the latest Partition Key in the table, for the Item containing data for the latest podcast episode. What would be the impact on performance if, say for example, the skill gets 1000 queries on any given day all aimed at a single Partition Key?
Does anyone have a better table architecture solution for this type of data?
Thanks to everyone in advance!
Question 1:
First, say I wanted to query the table to get the latest episode - I'm
not sure that I can do it in this fashion, as a query requires an
equivalence operation on the Partition Key (episode = X), which I
wouldn't know in advance. Am I correct in believing that a scan would
be quite an expensive operation if the podcast has a large number of
episodes (say more than 1000)?
You are right that you would NOT be able to query for the latest episode, because each episode is in its own Partition. Partitions are almost like different isolated tables, so there is no way to query across all Partitions without Scanning (as you said).
Question 2:
Secondly, DynamoDB best practices say two things which work
incongruently in my use-case (probably a sign that my design is
flawed). First, the Partition Key should be unique or close to unique.
Second, queries should be expected to be more or less uniformly
dispersed amongst the keys. In my case, though, while the Partition
Key would indeed be unique, I would expect the vast majority of
queries to be targeting the latest Partition Key in the table, for the
Item containing data for the latest podcast episode. What would be the
impact on performance if, say for example, the skill gets 1000 queries
on any given day all aimed at a single Partition Key?
The issue here is twofold. AWS expects you to read from (and write to) each partition equally, or close to equally, so you are going to pay for Write Units (and Read Units) on the partitions you are NOT using.
Exactly how much more that is going to cost you depends on how often you QUERY the database. However, reading is much cheaper than writing, and 1000 reads is basically nothing on a table with 1000 items, i.e. you MIGHT be able to get away with it, but it's not ideal.
Alternate Table Schema / Key Design
What other Queries will you make, i.e. other than "Check for latest Episode"?
How many Podcasts are added per day? week? year?
Are there multiple 'shows' or categories that could be used for Partition Keys that might have more even distribution and could be 'known'?
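To make the difference concrete, here is a toy in-memory model in plain Python. It is not the DynamoDB API, and the class ToyTable, the show name, and the episode data are all hypothetical. It contrasts the original design (episode number as Partition Key, so finding the latest episode touches every item) with one possible alternative, assumed here only for illustration, where a known value such as the show name is the Partition Key and the episode number is the Sort Key.

```python
class ToyTable:
    """Toy model: partition key -> {sort key: item}. Not a real DynamoDB client."""

    def __init__(self):
        self.partitions = {}

    def put(self, pk, sk, item):
        self.partitions.setdefault(pk, {})[sk] = item

    def query_latest(self, pk):
        """Needs a known partition key; returns the item with the highest sort key."""
        rows = self.partitions.get(pk, {})
        return rows[max(rows)] if rows else None

    def scan_latest(self):
        """No known key: look at every item and count how many were touched."""
        best, touched = None, 0
        for rows in self.partitions.values():
            for sk, item in rows.items():
                touched += 1
                if best is None or sk > best[0]:
                    best = (sk, item)
        return (best[1] if best else None), touched

episodes = [(1, "2020-01-01", "Episode 1"), (2, "2020-01-08", "Episode 2"),
            (3, "2020-01-15", "Episode 3")]

by_episode = ToyTable()   # original design: episode number is the partition key
by_show = ToyTable()      # assumed alternative: show name is the partition key
for number, date, title in episodes:
    by_episode.put(number, date, {"title": title})
    by_show.put("my-show", number, {"title": title})

scanned, touched = by_episode.scan_latest()
queried = by_show.query_latest("my-show")
print(scanned["title"], touched, queried["title"])
```

In DynamoDB itself, the second design would correspond to a Query on the known Partition Key with ScanIndexForward set to false and a Limit of 1. Note that this still sends most reads to a single Partition Key, so the hot-key concern from the question is not removed by this layout alone.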
How to decide when to use a Map-Side Join or Reduce-Side while writing an MR code in java?
A map-side join performs the join before the data reaches the map function, and it places strong prerequisites on the input data. Both methods have pros and cons: a map-side join is more efficient than a reduce-side join, but it requires strictly formatted input.
Prerequisites:
Data should be partitioned and sorted in a particular way.
Each input data set should be divided into the same number of partitions.
Each must be sorted by the same key.
All the records for a particular key must reside in the same partition.
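These prerequisites are what make a single-pass merge possible. The sketch below is plain Python rather than Hadoop code and handles one pair of matching partitions; it assumes both inputs are lists of (key, value) pairs already sorted by the join key. The function name merge_join and the sample data are illustrative only.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1                      # left key has no partner yet, advance left
        elif lk > rk:
            j += 1                      # right key has no partner yet, advance right
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left record of this key with that run.
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out

left = [("a", 1), ("b", 2), ("b", 3), ("d", 4)]
right = [("b", "x"), ("c", "y"), ("d", "z")]
joined = merge_join(left, right)
print(joined)
```

Neither input is ever re-sorted or shuffled, which is why the approach only works when partition i of one data set holds exactly the same keys as partition i of the other.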
A reduce-side join, also called a repartitioned join or repartitioned sort-merge join, is the most commonly used join type.
It has to go through the sort and shuffle phase, which incurs network overhead. A reduce-side join uses a few terms (data source, tag, and group key), so let's get familiar with them.
Data source refers to the data source files, probably taken from an RDBMS.
Tag is used to tag every record with its source name, so that its source can be identified at any point, whether in the map or the reduce phase. Why it is required will be covered later.
Group key refers to the column used as the join key between the two data sources.
Since we are going to join this data on the reduce side, we must prepare it so that it can be used for joining in the reduce phase. Let's have a look at the steps that need to be performed.
For more information check this link:
http://hadoopinterviews.com/map-side-join-reduce-side-join/
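The roles of data source, tag, and group key can be sketched in plain Python as below. This simulates the map, shuffle, and reduce phases in memory and is not Hadoop code; the function names, the tags "salary" and "homers", and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

def map_record(source_tag, record, key_index):
    """Map phase: emit (group key, (tag, record)) so the source stays identifiable."""
    return record[key_index], (source_tag, record)

def shuffle(pairs):
    """Shuffle phase: bring all values with the same group key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_join(group_key, tagged, left_tag, right_tag):
    """Reduce phase: split by tag, then pair every left record with every right one."""
    left = [rec for tag, rec in tagged if tag == left_tag]
    right = [rec for tag, rec in tagged if tag == right_tag]
    return [(group_key, l, r) for l, r in product(left, right)]

salaries = [("p1", "NYM", 500), ("p2", "BOS", 300)]   # data source 1
homers = [("p1", "NYM", 20), ("p3", "BOS", 5)]        # data source 2

pairs = [map_record("salary", r, 0) for r in salaries]
pairs += [map_record("homers", r, 0) for r in homers]

result = []
for key, tagged in shuffle(pairs).items():
    result.extend(reduce_join(key, tagged, "salary", "homers"))
print(result)
```

Without the tag, the reducer would receive a mixed list of records for each group key and could not tell which side of the join each one belongs to.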
Use a map-side join if one of your tables fits in memory; this reduces the overhead of sorting and shuffling data.
Reduce-side joins are simpler than map-side joins, since the input datasets do not need to be structured. But they are less efficient, as both datasets have to go through the MapReduce shuffle phase. The records with the same key are brought together in the reducer.
This is related to Cassandra time series modeling when time can go backward, but I think I have a better scenario to explain why the topic is important.
Imagine I have a simple table
CREATE TABLE measures(
key text,
measure_time timestamp,
value int,
PRIMARY KEY (key, measure_time))
WITH CLUSTERING ORDER BY (measure_time DESC);
The purpose of the clustering key is to have data arranged in decreasing timestamp order. This leads to very efficient range-based queries, which for a given key result in sequential disk reads (which are intrinsically fast).
Many times I have seen suggestions to use a generated timeuuid as the timestamp value (using now()), and this is obviously intrinsically ordered. But you can't always do that. In what seems to me a very common pattern, you can't use it if:
1) your user wants to query on the actual time when the measure was taken, not the time when the measure was written.
2) you use multiple writing threads
So, I want to understand what happens if I write data in an unordered fashion (with respect to measure_time column).
I have personally tested that if I insert timestamp-unordered values, Cassandra indeed reports them to me in a timestamp-ordered fashion when I run a select.
But what happens "under the hood"? In my opinion, it is impossible that data are still ordered on disk. At some point in fact data need to be flushed on disk. Imagine you flush a data set in the time range [0,10]. What if the next data set to flush has measures with timestamp=9? Are data re-arranged on disk? At what cost?
Hope I was clear. I couldn't find any explanation about this on the Datastax site, but I admit I'm quite a novice on Cassandra. Any pointers appreciated.
Sure. Once written, an SSTable file is immutable. Your timestamp=9 will end up in another SSTable, and C* will have to merge and sort data from both SSTables if you request both timestamp=10 and timestamp=9. That is less efficient than reading from a single SSTable.
The compaction process may merge those two SSTables into a single new one. See http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
And try to avoid very wide rows/partitions, which will be the case if you have a lot of measurements (i.e. a lot of measure_time values) for a single key.
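The merge across SSTables described above can be pictured with a small Python simulation. This is only a conceptual sketch, not how Cassandra is implemented: it models each SSTable as an immutable tuple of (measure_time, value) rows sorted in descending order, and it ignores overwrites, tombstones, and write timestamps. The function names and sample values are hypothetical.

```python
import heapq

def flush(memtable):
    """Freeze a memtable into an immutable run sorted by measure_time DESC."""
    return tuple(sorted(memtable.items(), reverse=True))

def read_partition(sstables):
    """Read path: merge several sorted runs into one ordered result."""
    merged = heapq.merge(*sstables, key=lambda row: row[0], reverse=True)
    return list(merged)

def compact(sstables):
    """Compaction: rewrite several runs as a single new immutable run."""
    return [tuple(read_partition(sstables))]

first = flush({0: 10, 5: 11, 10: 12})   # covers the time range [0, 10]
second = flush({9: 99, 12: 13})         # a late measure with timestamp=9

rows = read_partition([first, second])
times = [t for t, _ in rows]
print(times)

compacted = compact([first, second])
print(len(compacted), compacted[0] == tuple(rows))
```

The first run is never modified when the late row arrives; the ordering seen by the client comes from the merge at read time, and compaction later pays the cost of rewriting both runs as one.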
How do I take a join of two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest that I emit the records based on a common key and, in the reducer, add them to, say, a HashMap and then take a cross product (e.g. Join of two datasets in Mapreduce/Hadoop).
This solution is very good and works for the majority of cases, but my issue is rather different. I am dealing with data that has billions of records, and taking a cross product of two sets is impossible because in many cases the HashMap will end up holding a few million objects. So I encounter a Heap Space Error.
I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, so I want to know if there is any solution that can help me avoid this issue.
Don't know if this is still relevant for anyone, but I am facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:
When running on a line of type A, look for the key in Cassandra. If exists - merge A records into the existing value (B elements). If not - create a key, and add A elements as value.
When running on a line of type B, look for the key in Cassandra. If exists - merge B records into the existing value (A elements). If not - create a key, and add B elements as value.
This would require an additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.
You should look into how Pig does skew joins. The idea is that if your data contains too many values with the same key, you can create artificial keys and spread out the key distribution. This makes sure that each reducer gets fewer records than it otherwise would. For example, if you prefix 50% of the records with key "K1" with "1" and the other 50% with "2", half the records end up on one reducer (1K1) and the other half on another (2K1).
If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm.
This article (xrds:article), in the subsection "An Example of the Tradeoff", describes a way (the first one) in which every single record is joined with all the other records of the input file. I wonder how that could be possible in MapReduce without passing the whole input file to a single mapper.
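A small simulation of the artificial-key idea, in plain Python. This is a sketch in the spirit of the description above and does not claim to reproduce Pig's implementation. It assumes one side of the join is the heavy one: each heavy record gets one random salt, while every record from the other side is copied to all salts so that no matching pair is lost. All names and the sample data are hypothetical.

```python
import random
from collections import defaultdict

def salt_large_side(records, num_salts, rng):
    """Each heavy-side record goes to exactly one salted key."""
    for key, value in records:
        yield (rng.randrange(num_salts), key), value

def replicate_small_side(records, num_salts):
    """Each other-side record is copied to every salted key."""
    for key, value in records:
        for salt in range(num_salts):
            yield (salt, key), value

def join_salted(large, small, num_salts, seed=0):
    rng = random.Random(seed)
    groups = defaultdict(lambda: ([], []))   # salted key -> (large values, small values)
    for salted_key, value in salt_large_side(large, num_salts, rng):
        groups[salted_key][0].append(value)
    for salted_key, value in replicate_small_side(small, num_salts):
        groups[salted_key][1].append(value)
    out = []
    # Each group stands in for the input of one reduce call.
    for (_, key), (lefts, rights) in groups.items():
        out.extend((key, l, r) for l in lefts for r in rights)
    return out

large = [("K1", i) for i in range(6)]
small = [("K1", "x"), ("K1", "y")]
joined = sorted(join_salted(large, small, num_salts=2))
print(joined)
```

Each reduce group now buffers only about 1/num_salts of the heavy side, at the price of shipping the other side num_salts times.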
There are three major types of joins (there are a few others out there) for MapReduce.
Reduce Side Join - For both data sets, you output the "foreign key" as the output key of the mapper. You use something like MultipleInputs to load two data sets at once. In the reducer, data from both data sets is brought together by foreign key, which allows you to do the join logic (like Cartesian product, perhaps) there. This is general purpose and will work for just about every situation.
Replicated Join - You push out the smaller data set into the DistributedCache. In each mapper, you load the smaller data set from there into memory. As records pass through the mapper, join the data up against the in-memory data set. This is what you suggest in your question. It should only be used when the smaller data set can be stored in memory.
Composite Join - This one is a bit niche because it needs to be set up. If two data sets are sorted and partitioned by the foreign key, then you can do a composite join using CompositeInputFormat. It basically does a merge-like operation that is pretty efficient.
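As an illustration of the second type, a replicated join boils down to a hash lookup inside the mapper. The sketch below is plain Python, not Hadoop code: the list passed as small stands in for the file loaded from the DistributedCache, and the names and sample records are made up.

```python
def replicated_join(small, large_stream):
    """Join a stream of (key, value) records against an in-memory lookup table."""
    # Setup step: load the smaller data set into memory once per mapper.
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    # Map step: each incoming record is joined on the fly, with no shuffle.
    for key, value in large_stream:
        for match in lookup.get(key, ()):
            yield key, value, match

teams = [("NYM", "New York Mets"), ("BOS", "Boston Red Sox")]   # small side
players = [("NYM", "p1"), ("BOS", "p2"), ("NYM", "p3"), ("CHC", "p4")]

joined = list(replicated_join(teams, players))
print(joined)
```

Because the large side is only streamed through, memory use depends on the small side alone, which is the reason for the restriction stated above.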
Shameless plug for my book MapReduce Design Patterns: there is a whole chapter on joins (chapter 5).
Check out the code examples for the book here: https://github.com/adamjshook/mapreducepatterns/tree/master/MRDP/src/main/java/mrdp/ch5