Configure Map Side join for multiple mappers in Hadoop Map/Reduce

I have a question about configuring a map-side inner join for multiple mappers in Hadoop.
Suppose I have two very large data sets A and B, and I use the same partition and sort algorithm to split them into smaller parts. For A, assume I have a(1) to a(10), and for B I have b(1) to b(10). It is assured that a(1) and b(1) contain the same keys, a(2) and b(2) have the same keys, and so on. I would like to set up 10 mappers, specifically mapper(1) to mapper(10). To my understanding, a map-side join is a pre-processing task prior to the mapper; therefore, I would like to join a(1) and b(1) for mapper(1), a(2) and b(2) for mapper(2), and so on.
After reading some reference materials, it is still not clear to me how to configure these ten mappers. I understand that using CompositeInputFormat I would be able to join two files, but it seems to configure only one mapper and join the 20 files pair after pair (in 10 sequential tasks). How can I configure all ten mappers and join the ten pairs at the same time in a genuine Map/Reduce fashion (10 tasks in parallel)? To my understanding, ten mappers would require ten CompositeInputFormat settings because the files to join are all different. I strongly believe this is practical and doable, but I can't figure out the exact commands I should use.
Any hints and suggestions are highly welcome and appreciated.
Shi
Thanks a lot for the replies, David and Thomas!
I appreciate your emphasis on the prerequisites of the map-side join. Yes, I am aware of the sorting, the API, etc. After reading your comments, I think my actual problem is what the correct expression is for joining multiple splits of two files in CompositeInputFormat. For example, I have dataA and dataB sorted and reduced into 2 files each:
/A/dataA-r-00000
/A/dataA-r-00001
/B/dataB-r-00000
/B/dataB-r-00001
The expression command I am using now is:
inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"/A/dataA-r-00000"),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"/B/dataB-r-00000"))
It works, but as you mentioned, it only starts two mappers (because the inner join prevents splitting) and could be very inefficient if the files are big. If I want to use more mappers (say another 2 mappers to join dataA-r-00001 and dataB-r-00001), how should I construct the expression? Is it something like:
String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/dataA-r-00000'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/dataB-r-00000'), tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/dataA-r-00001'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/dataB-r-00001'))" ;
But I think that would be mistaken, because the command above actually performs an inner join of four files (which would return nothing in my case because the *r-00000 and *r-00001 files have non-overlapping keys).
Or I could just use the two dirs as inputs, like:
String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/'))" ;
Will the inner join match the pairs automatically according to the file endings, say "00000" with "00000" and "00001" with "00001"? I am stuck at this point because I need to construct the expression and pass it to
conf.set("mapred.join.expr", joinexpression);
In short, how should I build the proper expression if I want to use more mappers to join multiple pairs of files simultaneously?
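For completeness, here is roughly the driver fragment I am experimenting with (old mapred API, classes from org.apache.hadoop.mapred and org.apache.hadoop.mapred.join); whether building the expression with CompositeInputFormat.compose over the two directories pairs the partitions the way I hope is exactly what I am unsure about:

JobConf conf = new JobConf(MapSideJoinJob.class);       // MapSideJoinJob and JoinMapper are my own classes
conf.setJobName("map-side join of /A and /B");
conf.setInputFormat(CompositeInputFormat.class);
String joinexpression = CompositeInputFormat.compose(
    "inner", KeyValueTextInputFormat.class,
    new Path("/A"), new Path("/B"));
conf.set("mapred.join.expr", joinexpression);
conf.setMapperClass(JoinMapper.class);                  // receives (Text key, TupleWritable value)
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setNumReduceTasks(0);                              // or add a reducer if needed
FileOutputFormat.setOutputPath(conf, new Path("/joined"));
JobClient.runJob(conf);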

There are map-side and reduce-side joins.
You proposed to use a map-side join, which is executed inside a mapper, not before it.
Both sides must have the same key and value types. So you can't join a LongWritable and a Text, although they might have the same value.
There are a few more subtle things to note:
The input files have to be sorted, so they will most likely be reducer output.
You can control the number of mappers in the join's map phase by setting the number of reducers in the jobs that sorted the datasets.
The whole procedure basically works like this: You have dataset A and dataset B, both share the same key, let's say LongWritable.
Run two jobs that sort the two datasets by their keys; both jobs HAVE TO set the number of reducers to the same number, say 2.
This will result in 2 sorted files for each dataset.
Now you set up the job that joins the datasets; this job will spawn 2 mappers. It could be more if you set the number of reducers higher in the previous jobs.
Do whatever you like in the reduce step.
If the number of files to be joined is not equal, you will get an exception during job setup.
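A minimal sketch of one of those pre-sorting jobs, just to make the reducer-count point concrete (old mapred API; class names and paths are placeholders, not taken from your setup):

// Classes from org.apache.hadoop.mapred and org.apache.hadoop.mapred.lib.
// Pre-sort dataset A; an identical job (same key/value types, same partitioner,
// and the SAME number of reducers) has to be run for dataset B.
JobConf sortA = new JobConf(PreSortJob.class);          // PreSortJob is a placeholder driver class
sortA.setJobName("pre-sort dataset A");
sortA.setInputFormat(KeyValueTextInputFormat.class);
sortA.setMapperClass(IdentityMapper.class);             // just pass records through
sortA.setReducerClass(IdentityReducer.class);           // the shuffle/sort does the actual work
sortA.setNumReduceTasks(2);                             // MUST equal the count used for dataset B
sortA.setOutputKeyClass(Text.class);
sortA.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(sortA, new Path("/raw/A"));
FileOutputFormat.setOutputPath(sortA, new Path("/A"));  // yields two sorted partition files
JobClient.runJob(sortA);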
Setting up a join is kind of painful, mainly because you have to use the old API for mapper and reducer if your version is less than 0.21.x.
This document describes very well how it works; scroll all the way to the bottom. Sadly, this documentation is missing from the latest Hadoop docs.
Another good reference is "Hadoop the Definitive Guide", which explains all of this in more detail and with examples.

I think you're missing the point. You don't control the number of mappers. It's the number of reducers that you have control over. Simply emit the correct keys from your mapper. Then run 10 reducers.
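For what it's worth, a rough sketch of that reduce-side approach (new API; the tab-separated record layout and the "/A/"-versus-"/B/" path check are assumptions, not something from your setup):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tag each record with its source so the reducer can pair A and B records per key.
public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Text outKey = new Text();
  private final Text outVal = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t", 2);
    if (fields.length < 2) return;                      // skip malformed lines
    String path = ((FileSplit) context.getInputSplit()).getPath().toString();
    String tag = path.contains("/A/") ? "A" : "B";      // which dataset this split came from
    outKey.set(fields[0]);                              // the join key
    outVal.set(tag + "\t" + fields[1]);
    context.write(outKey, outVal);
  }
}
// Driver: job.setNumReduceTasks(10); each reducer then sees all tagged A and B
// records for its keys together and can join them inside reduce().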

Related

Why split points are out of order on Hadoop total order partitioner?

I use the Hadoop total order partitioner and a random sampler as the input sampler.
But when I increase my slave nodes and reduce tasks to 8, I get the following error:
Caused by: java.io.IOException: Split points are out of order
I don't know the reason for this error.
How should I set the three parameters of the InputSampler.RandomSampler function?
Two possible problems
You have duplicate keys
You are using a different comparator for the input sampler and the task on which you are running the total order partitioner
You can diagnose this by downloading the partition file and examining its contents. The partitions file is the value of total.order.partitioner.path if it is set or _partition.lst otherwise. If your keys are text, you can run hdfs dfs -text path_to_partition_file | less to get a look. This may also work for other key types, but I haven't tried it.
If there are duplicate lines in the partition file, you have duplicate keys, otherwise you're probably using the wrong comparator.
How to fix
Duplicate Keys
My best guess is that your keys are so unbalanced that an even division of records among partitions is generating partitions with identical split points.
To solve this you have several options:
Choose a value to use as a key that better distinguishes your inputs (probably not possible, but much better if you can)
Use fewer partitions and reducers (not as scalable or certain as the next solution, but simpler to implement, especially if you have only a few duplicates). Divide the original number of partitions by the largest number of duplicate entries. For example, if your partition key file lists a, a, b, c, c, c, d, e as split points, then you have 9 reducers (8 split points) and a maximum of 3 duplicates, so use 3 reducers (3 = floor(9/3)); if your sampling is good, you'll probably end up with proper split points. For complete stability you would need to be able to re-run the partition step if it has duplicate entries, to guard against the occasional over-sampling of the unbalanced keys, but at that level of complexity you may as well look into the next solution.
Read the partitions file, rewrite it without duplicate entries, count the number of entries (call it num_non_duplicates) and use num_non_duplicates+1 reducers; a rough sketch of that rewrite step follows this list. The reducers with the duplicated keys will have much more work than the other reducers and run longer. If the reduce operation is commutative and associative, you may be able to mitigate this by using combiners.
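A rough sketch of that rewrite step might look like the following; it assumes Text keys and that the partition file is a SequenceFile of (key, NullWritable) pairs, which is what InputSampler.writePartitionFile produces (the class and path names are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DedupPartitionFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);    // the existing partition file, e.g. _partition.lst
    Path out = new Path(args[1]);   // the de-duplicated copy

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, NullWritable.class);

    Text key = new Text();
    Text prev = null;
    int numNonDuplicates = 0;
    while (reader.next(key, NullWritable.get())) {
      if (prev == null || !prev.equals(key)) {          // keep only the first of each run
        writer.append(key, NullWritable.get());
        numNonDuplicates++;
        prev = new Text(key);
      }
    }
    reader.close();
    writer.close();
    System.out.println("use " + (numNonDuplicates + 1) + " reducers");
  }
}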
Using the wrong comparator
Make sure you have mapred.output.key.comparator.class set identically in both the call to writePartitionFile and the job that uses TotalOrderPartitioner.
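A hedged sketch of a driver that keeps those pieces consistent (new API, InputSampler and TotalOrderPartitioner from org.apache.hadoop.mapreduce.lib.partition; the sampler numbers are arbitrary illustrations, not recommendations, and MyComparator is hypothetical):

// Input format and input paths must already be set on the job before sampling.
// Set the comparator (if any) and partitioner BEFORE writePartitionFile, so the
// sampler and the actual job order keys the same way.
Job job = Job.getInstance(conf, "total order sort");    // conf is your Configuration
job.setMapOutputKeyClass(Text.class);
job.setNumReduceTasks(8);
job.setPartitionerClass(TotalOrderPartitioner.class);
// job.setSortComparatorClass(MyComparator.class);      // only if you use a custom comparator

TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
    new Path("/tmp/partitions.lst"));

// RandomSampler(freq, numSamples, maxSplitsSampled): sample roughly 10% of the
// records, keep at most 10,000 samples, and read from at most 10 splits.
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<>(0.1, 10000, 10);
InputSampler.writePartitionFile(job, sampler);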
Extra stuff you don't need to read but might enjoy:
The "Split points are out of order" error message comes from this code:
RawComparator<K> comparator =
    (RawComparator<K>) job.getOutputKeyComparator();
for (int i = 0; i < splitPoints.length - 1; ++i) {
  if (comparator.compare(splitPoints[i], splitPoints[i+1]) >= 0) {
    throw new IOException("Split points are out of order");
  }
}
The line comparator.compare(splitPoints[i], splitPoints[i+1]) >= 0 means that a pair of split points is rejected if they are either identical or out-of-order.
1 or 2 reducers will never generate this error since there can't be more than 1 split point and the loop will never execute.
Are you sure you are generating enough keys?
From the javadoc: TotalOrderPartitioner
The input file must be sorted with the same comparator and contain
JobContextImpl.getNumReduceTasks() - 1 keys.

A join operation using Hadoop MapReduce

How do I join two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest that I emit the records based on the common key and, in the reducer, add them to, say, a HashMap and then take a cross product. (e.g. Join of two datasets in Mapreduce/Hadoop)
This solution is very good and works for the majority of cases, but my issue is rather different. I am dealing with data that has billions of records, and taking a cross product of the two sets is impossible because in many cases the HashMap would end up holding a few million objects. So I run into a heap space error.
I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, so I want to know if there is any solution that can help me avoid this issue.
Don't know if this is still relevant for anyone, but I'm facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:
When running on a line of type A, look for the key in Cassandra. If it exists, merge the A record into the existing value (the B elements). If not, create the key and add the A elements as its value.
When running on a line of type B, look for the key in Cassandra. If it exists, merge the B record into the existing value (the A elements). If not, create the key and add the B elements as its value.
This would require an additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.
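Roughly, the lookup-then-merge step per record would look like the sketch below; the KeyValueStore interface here is a stand-in for whatever Cassandra client you use, not a real API:

// Illustrative only: KeyValueStore stands in for a real Cassandra client.
interface KeyValueStore {
  String get(String key);                               // returns null if the key is absent
  void put(String key, String value);
}

class CrossProductWriter {
  private final KeyValueStore store;

  CrossProductWriter(KeyValueStore store) { this.store = store; }

  // side is "A" or "B"; called once per input record
  void process(String key, String record, String side) {
    String existing = store.get(key);
    if (existing == null) {
      store.put(key, side + ":" + record);                         // first record for this key
    } else {
      store.put(key, existing + "|" + side + ":" + record);        // merge into the existing value
    }
  }
}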
You should look into how Pig does skew joins. The idea is that if your data contains too many values with the same key (even if there is no data skew), you can create artificial keys and spread out the key distribution. This makes sure that each reducer gets fewer records than it would otherwise. For example, if you were to prefix "1" to 50% of the records for key "K1" and "2" to the other 50%, you would end up with half the records on one reducer (key "1K1") and the other half on another (key "2K1").
If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm.
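A hedged sketch of that salting idea in a mapper; SALT_BUCKETS, extractKey() and isHotKey() are made up for illustration, and note that for a join the records from the other relation would have to be replicated to every salted bucket:

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Spread records of known-hot keys over several artificial keys ("salting").
public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {
  private static final int SALT_BUCKETS = 2;            // how many ways to spread a hot key
  private final Random random = new Random();
  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String key = extractKey(line);
    if (isHotKey(key)) {
      outKey.set(random.nextInt(SALT_BUCKETS) + key);   // e.g. "0K1" or "1K1"
    } else {
      outKey.set(key);
    }
    context.write(outKey, line);
  }

  private String extractKey(Text line) {                // hypothetical: key is the first column
    return line.toString().split("\t", 2)[0];
  }

  private boolean isHotKey(String key) {                // hypothetical: a known skewed key
    return "K1".equals(key);
  }
}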

Hadoop and Cassandra processing rows in sorted order

I want to fill a Cassandra database with a list of strings that I then process using Hadoop. What I want to do is run through all the strings in order using a Hadoop cluster and record how much overlap there is between each string and the next, in order to find the longest common substring.
My question is: will the InputFormat object allow me to read out the data in sorted order, or will my strings be read out "randomly" (according to how Cassandra decides to distribute them) throughout every machine in the cluster? Is the MapReduce process designed to process each row by itself, without the intent of looking at two rows consecutively like I'm asking for?
First of all, the Mappers will read the data in whatever order they get it from the InputFormat. I'm not a Cassandra expert, but I don't expect that will be in sorted order.
If you want sorted order, you should use an identity mapper (one that does nothing) whose output key is the string itself. Then they will be sorted before being passed to the reduce step. But it gets a little more complicated since you can have more than one reducer. With only one reducer, everything is globally sorted. With more than one, each reducer's input is sorted, but the input across reducers might not be sorted. That is, adjacent strings might not go to the same reducer. You would need a custom partitioner to handle that.
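A minimal version of that identity-style mapper might look like this (it assumes the strings arrive as Text values; with the Cassandra InputFormat you would adapt the input key/value types accordingly):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emit the string itself as the key so the shuffle sorts by it.
public class SortByStringMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, NullWritable.get());
  }
}
// With more than one reducer you would also set a range partitioner (for example
// TotalOrderPartitioner) so that all of reducer i's strings sort before reducer i+1's.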
Lastly, you mentioned that you're doing longest common substring: are you looking for the longest substring among each pair of strings? Among consecutive pairs of strings? Among all strings? Each of these possibilities will affect how you need to structure your MapReduce job.

Hadoop: Reducing the result to a single value

I started learning Hadoop, and am a bit confused by MapReduce. For tasks where the result is naturally a list of key-value pairs everything seems clear. But I don't understand how I should solve tasks where the result is a single value (say, the sum of squared input decimals, or the centre of mass of input points).
On the one hand, I could put all the mapper results under the same key. But as far as I understand, in that case the single reducer would have to process the whole set of data (calculate the sum, or the mean coordinates). It doesn't look like a good solution.
Another approach I can imagine is to group the mapper results. Say, the mapper that processed examples 0-999 will produce key 0, the one that processed 1000-1999 will produce key 1, and so on. Since there will still be multiple reducer results, it will be necessary to build a chain of reducers (reducing is repeated until only one result remains). This looks much more computationally effective, but a bit complicated.
I still hope that Hadoop has an off-the-shelf tool that chains reducers to maximise the efficiency of reducing the whole data set to a single value, although I have failed to find one.
What is the best practice for solving tasks where the result is a single value?
If you are able to reformulate your task in terms of a commutative reduce, you should look at Combiners. Either way, you should take a look at them; they can significantly reduce the amount of data to shuffle.
From my point of view, you are tackling the problem from the wrong angle.
Take the problem where you need to sum the squares of your input: let's assume you have many large text input files consisting of one number per line.
Then ideally you want to parallelize your sums in the mapper and then just sum up the sums in the reducer.
E.g.:
map: (input "x", temporary sum "s") -> s+=(x*x)
At the end of the map phase, each mapper emits its temporary sum with a single global key.
In the reduce stage, you basically get all the sums from your mappers and sum them up. Note that this is fairly small (n times a single number, where n is the number of mappers) in relation to your huge input files, so a single reducer is really not a scalability bottleneck.
You want to cut down the communication cost between the mapper and the reducer, not proxy all your data to a single reducer and read through it there; that would not parallelize anything.
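A hedged sketch of that pattern (input assumed to be one number per line; one partial sum per mapper, emitted under a single global key in cleanup()):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SumOfSquares {
  public static class PartialSumMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private double partialSum = 0.0;

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
      double x = Double.parseDouble(line.toString().trim());
      partialSum += x * x;                               // accumulate locally, emit nothing yet
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      // One record per mapper; the shared key makes all partial sums meet in one cheap reduce call.
      ctx.write(new Text("sum"), new DoubleWritable(partialSum));
    }
  }

  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> partials, Context ctx)
        throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable p : partials) {
        total += p.get();                                // sum of the per-mapper sums
      }
      ctx.write(key, new DoubleWritable(total));
    }
  }
}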
I think your analysis of the specific use cases you bring up is spot on. These use cases still fall into a rather inclusive scope of what you can do with Hadoop, and there are certainly other things that Hadoop just wasn't designed to handle. If I had to solve the same problem, I would follow your first approach unless I knew the data was too big, in which case I'd follow your two-step approach.

Sorting the values before they are sent to the reducer

I'm thinking about building a small testing application in hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want to have "the 10 worst values for each key" from my reducer function (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10), or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Does anyone have a code fragment showing how to use this with the Hadoop 0.20 API?
Sounds definitely like a secondary-sort problem. Take a look at "Hadoop: The Definitive Guide" if you like; it's from O'Reilly, and you can also access it online. They describe a pretty good implementation there.
I implemented it myself too. Basically it works this way:
The partitioner makes sure that all key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the GroupingComparator, which forms the groups. One group is passed as an iterator to one reduce() call, so a partition can contain multiple groups. However, the number of partitions should equal the number of reducers. The grouping also allows you to do some sorting, as it implements a compareTo method.
With this method you can control that the 10 best/worst/highest/lowest (however you define it) keys reach the reducer first. So after you have read those 10 keys, you can leave the reduce method without any further iterations.
Hope that was helpful :-)
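As a rough sketch of the wiring (recent mapreduce API; the composite key and comparator classes are placeholders you would implement yourself, e.g. along the lines of the SecondarySort example that ships with Hadoop):

// Assumed: CompositeKey holds {naturalKey, relevance} and the comparators below
// are your own implementations.
Job job = Job.getInstance(conf, "top 10 per key");                  // conf is your Configuration
job.setMapOutputKeyClass(CompositeKey.class);
job.setMapOutputValueClass(Text.class);
job.setPartitionerClass(NaturalKeyPartitioner.class);               // partition on naturalKey only
job.setSortComparatorClass(CompositeKeyComparator.class);           // naturalKey asc, relevance desc
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // one reduce() call per naturalKey

// In the reducer the values then arrive already ordered by relevance, so:
// int taken = 0;
// for (Text value : values) { write(value); if (++taken == 10) break; }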
It sounds like you want to use a Combiner, which defines what to do with the values you create on the map side before they are sent to the Reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs, <key, [set of {score, data}]>, does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
Update 2
So, now that we know that the rank is cumulative and, as a result, you can't filter the data early using combiners, the only thing to do is what you suggested -- get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545).
If I understand the question properly, you'll need to use a TotalOrderPartitioner.
