What is the use of the grouping comparator in Hadoop MapReduce?

I would like to know why a grouping comparator is used in the secondary sort of MapReduce.
According to the secondary sorting example in the Definitive Guide:
We want the sort order for keys to be by year (ascending) and then by
temperature (descending):
1900 35°C
1900 34°C
1900 34°C
...
1901 36°C
1901 35°C
By setting a partitioner to partition by the year part of the key, we can guarantee that
records for the same year go to the same reducer. This still isn’t enough to achieve our
goal, however. A partitioner ensures only that one reducer receives all the records for
a year; it doesn’t change the fact that the reducer groups by key within the partition.
Since we would have already written our own partitioner, which takes care of sending the map output keys to a particular reducer, why should we also group them?
Thanks in advance

In support of the chosen answer I add:
Following on from this explanation
**Input**:
symbol time price
a 1 10
a 2 20
b 3 30
**Map output**: create composite keys/values like so:
> symbol-time time-price
>
>**a-1** 1-10
>
>**a-2** 2-20
>
>**b-3** 3-30
The Partitioner: will route the a-1 and a-2 keys to the same reducer despite the keys being different. It will also route b-3 to a separate reducer.
GroupComparator: once the composite key/value pairs arrive at the reducer, instead of the reducer getting
>(**a-1**,{1-10})
>
>(**a-2**,{2-20})
which would happen because the composite keys are unique,
the group comparator ensures the reducer gets:
(a-1,{**1-10,2-20**})
**in a single reduce method call.** The key presented with the grouped values is the one that comes first in the group; this can be controlled by the key comparator.
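For readers who want to see what such a composite key looks like in code, here is a hedged sketch (the class and field names are mine, not from the original answer) of a WritableComparable that pairs the symbol with the time:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class SymbolTimeKey implements WritableComparable<SymbolTimeKey> {
    private final Text symbol = new Text();               // natural key: decides partition and group
    private final LongWritable time = new LongWritable(); // augment: decides order inside the group

    public void set(String s, long t) { symbol.set(s); time.set(t); }
    public Text getSymbol() { return symbol; }

    @Override
    public void write(DataOutput out) throws IOException {
        symbol.write(out);
        time.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        symbol.readFields(in);
        time.readFields(in);
    }

    @Override
    public int compareTo(SymbolTimeKey other) {
        int cmp = symbol.compareTo(other.symbol);          // compare natural key first
        return cmp != 0 ? cmp : time.compareTo(other.time); // then time ascending
    }
}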

Let me improve the statement "... take care of the map output keys going to particular reducer".
Reducer Instance vs reduce method:
One JVM is created per reduce task, and each of these has a single instance of the Reducer class. This is the Reducer instance (I call it the Reducer from now on). Within each Reducer, the reduce method is called multiple times depending on the key grouping. Each time reduce is called, 'valuein' holds the list of map output values grouped by the key you define in the grouping comparator. By default, the grouping comparator uses the entire map output key.
In the example, the map output key is changed to 'year and temperature' to achieve sorting. Unless you define a grouping comparator that uses only the 'year' part of the map output key, you cannot make all records of the same year go to the same reduce method call.
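As a hedged sketch of what such a grouping comparator could look like (assuming, for illustration only, that the map output key is a Text of the form "year temperature"):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class YearGroupingComparator extends WritableComparator {
    protected YearGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Only the year part decides which records share one reduce() call
        int yearA = Integer.parseInt(a.toString().split(" ")[0]);
        int yearB = Integer.parseInt(b.toString().split(" ")[0]);
        return Integer.compare(yearA, yearB);
    }
}

With this in place, all "1900 ..." keys are presented to one reduce call, while the sort comparator (which still looks at the whole composite key) keeps the temperatures in descending order within it.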

You need to introduce an intermediate key that is a composite of the year and temperature; partition on the natural key (the year) and introduce a comparator that will sort on the entire composite key. You're right that by partitioning on the year you'll get all the data for a year in the same reducer, so the comparator will effectively sort the data for each year by the temperature.

The default partitioner calculates the hash of the key, and keys which have the same hash value are sent to the same reducer. If you emit a composite (natural + augmented) key in your mapper and you want keys that share the same natural key to go to the same reducer, you have to implement a custom partitioner.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SimplePartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text compositeKey, LongWritable value, int numReduceTasks) {
        // Split the composite key into its natural and augmented parts
        String naturalKey = compositeKey.toString().split("separator")[0];
        // Mask the sign bit so the partition number is never negative
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
And now, if you want all the relevant rows within a partition of data to arrive in a single reduce call, you must also implement a grouping comparator which considers only the natural key:
public class SimpleGroupingComparator extends WritableComparator {
    protected SimpleGroupingComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable compositeKey1, WritableComparable compositeKey2) {
        // Group on the natural key part only
        String naturalKey1 = compositeKey1.toString().split("separator")[0];
        String naturalKey2 = compositeKey2.toString().split("separator")[0];
        return naturalKey1.compareTo(naturalKey2);
    }
}
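For context, here is a minimal sketch of how these two classes might be wired into a job using the new mapreduce API; the key/value types follow the partitioner above, and a sort comparator over the full composite key is assumed but not shown:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "secondary-sort");
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setPartitionerClass(SimplePartitioner.class);               // natural key picks the reducer
job.setGroupingComparatorClass(SimpleGroupingComparator.class); // natural key picks the reduce() call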

Related

Map IDs to matrix rows in Hadoop/MapReduce

I have data about users buying products. I want to create a binary matrix of size |users| x |products| such that the element (i,j) in the matrix is 1 iff user_i has bought product_j, else the value is 0.
Now, my data looks something like
userA, productX
userB, productY
userA, productZ
...
UserIds and productIds are all strings. My problem is, how to map these IDs to row indices (for users) and column indices (for products) in the matrix.
There are over a million unique userIds and roughly 3 million productIds.
To make the problem well defined: given (userId, productId) input like the above, how do I convert it to something like
1,1
2,2
1,3
where userA is mapped to row 1 of the matrix, userB is mapped to row 2, productX is mapped to column 1, and so on.
Given the size of data, I would have to use Hadoop Map-Reduce but can't think of a foolproof way of efficiently doing this.
This can be solved if we can do the following:
1. Dump unique userIds.
2. Dump unique productIds.
3. Map each unique userId in (1) to a row index.
4. Map each unique productId in (2) to a column index.
I can do (1) and (2) easily, but I'm having trouble coming up with an efficient approach to (3) ((4) is solved once (3) is).
I have a couple of solutions but they are not foolproof.
Solution 1 (naive) for step 3 above
Map all userIds and emit the same key (say "1") for all map tasks.
Have a long counter initialized to 0 in setup() of the reducer.
In the reduce(), emit the counter value along with the input userId and increment the counter by 1.
This would be very inefficient since all 100 million userIds would be processed by a single reducer.
Solution 2 for step 3 above
While mapping userIds, emit each userId against a key which is an integer uniformly sampled from 1,2,3....N (where N is configurable. N = 100 for example). In a way, we are partitioning the input set.
Within the mapper, use Hadoop counters to count the number of userIds assigned to each of those random partitions.
In the reducer setup, first access the counters in the mapping stage to determine how many IDs were assigned to each partition. Use these counters to determine the start and end values for that partition.
Iterate (while counting) over each userId in reduce and generate matrix rowId as start_of_partition + counter.
context.write(userId, matrix row Id)
This method should work, but I am not sure how to handle cases where reducer tasks fail or get killed.
I believe there should be ways of doing this which I am not aware of. Can we use hashing/modulo to achieve this? How would we handle collisions at scale?
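To make Solution 2 above concrete, here is a hedged sketch of the reduce side under stated assumptions: the class name, the choice of the sampled partition number as the map output key, and the partition.count.&lt;i&gt; configuration property are all illustrative, not from the question.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer for "Solution 2": the map output key is the sampled
// partition number (1..N) and the value is the userId.
public class IdAssignReducer extends Reducer<IntWritable, Text, Text, LongWritable> {

    @Override
    protected void reduce(IntWritable partition, Iterable<Text> userIds, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // Start offset = total count of all lower-numbered sample partitions.
        // Assumption: the mapper-side counter values were copied into the job
        // configuration under a hypothetical property partition.count.<i>.
        long nextId = 0;
        for (int i = 0; i < partition.get(); i++) {
            nextId += conf.getLong("partition.count." + i, 0);
        }
        for (Text userId : userIds) {
            context.write(new Text(userId), new LongWritable(nextId++));
        }
    }
}

Regarding the concern about failed or killed reduce tasks: if the userIds within each partition are given a deterministic order (e.g. by a secondary sort), a re-executed task emits exactly the same id assignments, so retries are harmless.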

Sending Items to specific partitions

I'm looking for a way to send structures to pre-determined partitions so that they can be used by another RDD
Let's say I have two RDDs of key-value pairs:
val a:RDD[(Int, Foo)]
val b:RDD[(Int, Foo)]
val aStructure = a.reduceByKey(//reduce into large data structure)
b.mapPartitions { iter =>
  val usefulItem = aStructure(samePartitionKey)
  iter.map(//process iterator)
}
How could I go about setting up the partitioning such that the specific data structure I need is present for the mapPartitions call, without the extra overhead of sending over all the values (which would happen if I used a broadcast variable)?
One thought I have been having is to store the objects in HDFS but I'm not sure if that would be a suboptimal solution.
Another thought I am currently exploring is whether there is some way I can create a custom Partition or Partitioner that could hold the data structure (Although that might get too complicated and become problematic)
thank you for your help!
edit:
Pangea makes a very good point that I should offer some more specifics. Essentially I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large.
My hope is to do a MapPartitions within the RDD of vectors where I can compare each vector to the inverted index. The issue is that I only NEED one inverted index object per partition and doing a join would cause me to have a lot of copies of that index.
val vectors: RDD[(Int, SparseVector)]
val invertedIndexes: RDD[(Int, InvIndex)] = a.reduceByKey(generateInvertedIndex)
vectors.mapPartitions { iter =>
  val invIndex = invertedIndexes(samePartitionKey)
  iter.map(invIndex.calculateSimilarity(_))
}
A Partitioner is a function that, given a generic element, will return in which partition it belongs. It also decides the number of partitions.
There's a form of reduceByKey that takes a partitioner as an argument.
If I understand your question correctly, you want the data to be partitioned while doing the reduce.
See the example:
// create example data
val a =sc.parallelize(List( (1,1),(1,2), (2,3),(2,4) ) )
// create simple sample partitioner - 2 partitions, one for odd
// one for even key.hashCode. You should put your partitioning logic here
val p = new Partitioner { def numPartitions: Int = 2; def getPartition(key:Any) = key.hashCode % 2 }
// your reduceByKey function. Sample: just add
val f = (a:Int,b:Int) => a + b
val rdd = a.reduceByKey(p, f)
// here your rdd will be partitioned the way you want with the number
// of partitions you want
rdd.partitions.size
res8: Int = 2
rdd.map() .. // go on with your processing

How to know Hadoop reducer assigned records

I'm using a custom partitioner that assigns records to the reducers randomly. Then the reducers start processing.
Is there a way I can know how many records were assigned to each reducer before the reducers start working?
A partitioner does not assign records to the reducers randomly; it has predefined logic.
When we write a custom partitioner, we write the logic for how records should be distributed among the reducers.
For instance, suppose you are dealing with data that contains an age field.
You can decide how your input will be processed at the reducers.
First of all, you have to configure the number of reducers you want for the particular job, which is done in the driver program of the MapReduce job.
Suppose you have configured 3 reducers.
While writing the custom partitioner you would define the logic, for instance:
if (ageInt <= 20) {
    return 0;
}
// else if the age is between 20 and 50, assign partition 1
else if (ageInt > 20 && ageInt <= 50) {
    return 1 % numReduceTasks;
}
// otherwise assign partition 2
else {
    return 2 % numReduceTasks;
}
All records that fall into the category of age 20 or less would go to the first reducer.
Even before executing the job, you can count the number of records in each category based on your conditions.
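A complete partitioner along these lines might look like the following sketch; the class name and the assumption that the age arrives as the Text map output key are mine, not from the original answer:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AgePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        int ageInt = Integer.parseInt(key.toString().trim());
        if (numReduceTasks == 0) {
            return 0;               // map-only job: nothing to partition
        }
        if (ageInt <= 20) {
            return 0;
        } else if (ageInt <= 50) {
            return 1 % numReduceTasks;
        } else {
            return 2 % numReduceTasks;
        }
    }
}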

Hive cluster by vs order by vs sort by

As far as I understand:
sort by only sorts within each reducer;
order by orders things globally, but shoves everything into one reducer;
cluster by intelligently distributes stuff into reducers by the key hash and does a sort by.
So my question is: does cluster by guarantee a global order? distribute by puts the same keys into the same reducers, but what about adjacent keys?
The only document I can find on this is here and from the example it seems like it orders them globally. But from the definition I feel like it doesn't always do that.
A shorter answer: yes, CLUSTER BY guarantees global ordering, provided you're willing to join the multiple output files yourself.
The longer version:
ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn't sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.
Make sense? So CLUSTER BY is basically the more scalable version of ORDER BY.
Let me clarify first: clustered by only distributes your keys into different buckets; clustered by ... sorted by gets the buckets sorted.
With a simple experiment (see below) you can see that you will not get global order by default. The reason is that default partitioner splits keys using hash codes regardless of actual key ordering.
However you can get your data totally ordered.
Motivation is "Hadoop: The Definitive Guide" by Tom White (3rd edition, Chapter 8, p. 274, Total Sort), where he discusses TotalOrderPartitioner.
I will answer your TotalOrdering question first, and then describe several sort-related Hive experiments that I did.
Keep in mind: what I'm describing here is a proof of concept; I was able to handle a single example using Cloudera's CDH3 distribution.
Originally I hoped that org.apache.hadoop.mapred.lib.TotalOrderPartitioner would do the trick. Unfortunately it did not, because it looks like Hive partitions by value, not key. So I patched it (I should have subclassed it, but I did not have time for that):
Replace
public int getPartition(K key, V value, int numPartitions) {
return partitions.findPartition(key);
}
with
public int getPartition(K key, V value, int numPartitions) {
return partitions.findPartition(value);
}
Now you can set (patched) TotalOrderPartitioner as your Hive partitioner:
hive> set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
hive> set total.order.partitioner.natural.order=false;
hive> set total.order.partitioner.path=/user/yevgen/out_data2;
I also used
hive> set hive.enforce.bucketing = true;
hive> set mapred.reduce.tasks=4;
in my tests.
File out_data2 tells TotalOrderPartitioner how to bucket values.
You generate out_data2 by sampling your data. In my tests I used 4 buckets and keys from 0 to 10. I generated out_data2 using an ad-hoc approach:
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.hive.ql.io.HiveKey;
import org.apache.hadoop.fs.FileSystem;
public class TotalPartitioner extends Configured implements Tool{
public static void main(String[] args) throws Exception{
ToolRunner.run(new TotalPartitioner(), args);
}
@Override
public int run(String[] args) throws Exception {
Path partFile = new Path("/home/yevgen/out_data2");
FileSystem fs = FileSystem.getLocal(getConf());
HiveKey key = new HiveKey();
NullWritable value = NullWritable.get();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, getConf(), partFile, HiveKey.class, NullWritable.class);
key.set( new byte[]{1,3}, 0, 2);//partition at 3; 1 came from Hive -- do not know why
writer.append(key, value);
key.set( new byte[]{1, 6}, 0, 2);//partition at 6
writer.append(key, value);
key.set( new byte[]{1, 9}, 0, 2);//partition at 9
writer.append(key, value);
writer.close();
return 0;
}
}
Then I copied the resulting out_data2 to HDFS (into /user/yevgen/out_data2).
With these settings I got my data bucketed/sorted (see last item in my experiment list).
Here are my experiments.
Create sample data
bash> echo -e "1\n3\n2\n4\n5\n7\n6\n8\n9\n0" > data.txt
Create basic test table:
hive> create table test(x int);
hive> load data local inpath 'data.txt' into table test;
Basically this table contains values from 0 to 9 without order.
Demonstrate how table copying works (really, the mapred.reduce.tasks parameter, which sets the MAXIMAL number of reduce tasks to use):
hive> create table test2(x int);
hive> set mapred.reduce.tasks=4;
hive> insert overwrite table test2
select a.x from test a
join test b
on a.x=b.x; -- dummy join to force a non-trivial map-reduce
bash> hadoop fs -cat /user/hive/warehouse/test2/000001_0
1
5
9
Demonstrate bucketing. You can see that keys are assigned at random without any sort order:
hive> create table test3(x int)
clustered by (x) into 4 buckets;
hive> set hive.enforce.bucketing = true;
hive> insert overwrite table test3
select * from test;
bash> hadoop fs -cat /user/hive/warehouse/test3/000000_0
4
8
0
Bucketing with sorting. Results are partially sorted, not totally sorted
hive> create table test4(x int)
clustered by (x) sorted by (x desc)
into 4 buckets;
hive> insert overwrite table test4
select * from test;
bash> hadoop fs -cat /user/hive/warehouse/test4/000001_0
1
5
9
You can see that the values are sorted in ascending order. Looks like a Hive bug in CDH3?
Getting partially sorted without cluster by statement:
hive> create table test5 as
select x
from test
distribute by x
sort by x desc;
bash> hadoop fs -cat /user/hive/warehouse/test5/000001_0
9
5
1
Use my patched TotalOrderPartitioner:
hive> set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
hive> set total.order.partitioner.natural.order=false;
hive> set total.order.partitioner.path=/user/training/out_data2;
hive> create table test6(x int)
clustered by (x) sorted by (x) into 4 buckets;
hive> insert overwrite table test6
select * from test;
bash> hadoop fs -cat /user/hive/warehouse/test6/000000_0
1
2
0
bash> hadoop fs -cat /user/hive/warehouse/test6/000001_0
3
4
5
bash> hadoop fs -cat /user/hive/warehouse/test6/000002_0
7
6
8
bash> hadoop fs -cat /user/hive/warehouse/test6/000003_0
9
CLUSTER BY does not produce global ordering.
The accepted answer (by Lars Yencken) misleads by stating that the reducers will receive non-overlapping ranges. As Anton Zaviriukhin correctly points out with the BucketedTables documentation, CLUSTER BY is basically DISTRIBUTE BY (the same as bucketing) plus SORT BY within each bucket/reducer. And DISTRIBUTE BY simply hashes and mods into buckets; while the hashing function may preserve order (hash of i > hash of j if i > j), the mod of the hash value does not.
Here's a better example showing overlapping ranges
http://myitlearnings.com/bucketing-in-hive/
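As a tiny illustration of that last point in plain Java (nothing Hive-specific): even if the hash preserved order perfectly, taking the modulus for bucketing interleaves the key ranges across buckets:

public class ModExample {
    public static void main(String[] args) {
        int buckets = 2;
        for (int key = 0; key <= 9; key++) {
            System.out.println("key " + key + " -> bucket " + (key % buckets));
        }
        // Bucket 0 gets 0,2,4,6,8 and bucket 1 gets 1,3,5,7,9,
        // so both buckets span the whole 0..9 range: overlapping ranges.
    }
}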
As I understand it, the short answer is no.
You'll get overlapping ranges.
From the SortBy documentation:
"Cluster By is a short-cut for both Distribute By and Sort By."
"All rows with the same Distribute By columns will go to the same reducer."
But there is no information that Distribute By guarantees non-overlapping ranges.
Moreover, from the DDL BucketedTables documentation:
"How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets."
I suppose that Cluster By in a Select statement uses the same principle to distribute rows between reducers, because its main use is for populating bucketed tables with data.
I created a table with 1 integer column "a", and inserted numbers from 0 to 9 there.
Then I set the number of reducers to 2:
set mapred.reduce.tasks = 2;
And selected data from this table with a Cluster By clause:
select * from my_tab cluster by a;
And received the result I expected:
0
2
4
6
8
1
3
5
7
9
So, the first reducer (number 0) got the even numbers (because their value mod 2 gives 0)
and the second reducer (number 1) got the odd numbers (because their value mod 2 gives 1).
So that's how "Distribute By" works.
And then "Sort By" sorts the results inside each reducer.
Use case: when there is a large dataset, one should go for sort by, since with sort by all the configured reducers sort the data internally before it is clubbed together, which enhances performance. With order by, performance for a larger dataset suffers because all the data is passed through a single reducer, which increases the load and hence makes the query take longer to execute.
(The original answer showed example outputs of Order By, Sort By, and Cluster By as screenshots taken on an 11-node cluster.)
What I observed: the figures for sort by, cluster by, and distribute by are the SAME, but the internal mechanism is different. With DISTRIBUTE BY, rows with the same column value go to one reducer, e.g. DISTRIBUTE BY(City): Bangalore data goes to one reducer, Delhi data to another.
Cluster by is per-reducer sorting, not global. In many books it is described incorrectly or confusingly. It has a particular use case: say you distribute each department to a specific reducer and then sort by employee name within each department, without caring about the order of department numbers. That is where cluster by should be used, and it is more performant because the workload is distributed among the reducers.
SortBy: N or more sorted files with overlapping ranges.
OrderBy: single output, i.e. fully ordered.
Distribute By: each of the N reducers gets non-overlapping ranges of the column, but the output of each reducer is not sorted.
For more information http://commandstech.com/hive-sortby-vs-orderby-vs-distributeby-vs-clusterby/
ClusterBy: referring to the same example as above, if we use Cluster By x, the two reducers will further sort the rows on x.
If I understood it correctly
1. sort by - only sorts the data within each reducer.
2. order by - orders things globally by pushing the entire data set to a single reducer. If we have a lot of (skewed) data, this process will take a lot of time.
3. cluster by - intelligently distributes data into reducers by the key hash and does a sort by, but does not guarantee global ordering. One key (k1) can be placed into two reducers: the first reducer may get 10K rows of k1 data, the second one might get 1K rows of k1 data.

Is relational database able to leverage the way of consistent hashing to do the partition table?

Assume we have a user table to be partitioned by user id, an integer 1, 2, 3, ..., n. Can I use consistent hashing to partition the table?
The benefit would be that if the number of partitions is increased or decreased, the existing assignments could mostly stay the same.
Question A.
Is it a good idea to use a consistent hashing algorithm to partition the table?
Question B.
Does any relational database have this supported built in?
I guess some NoSQL databases already use it,
but database here refers to relational databases.
I just encountered this question in an interview. My first reaction was to answer "mod by the number of partitions", but I was then challenged that repartitioning the table into more pieces would cause problems.
After researching some wiki reference pages like Partition (database),
I believe my idea belongs to composite partitioning:
Composite partitioning allows for certain combinations of the above partitioning schemes, by for example first applying a range partitioning and then a hash partitioning. Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key space to a size that can be listed.
It also introduces some concepts like consistent hashing and hash tables.
But some links like Partition (database) are kind of old. If someone can find a more recent reference, that would be better. My answer is indeed incomplete; I hope someone can answer it better!
UPDATE
It looks like Jonathan Ellis already mentioned this in his blog: the Cassandra distributed database supports two partitioning schemes now, the traditional consistent hashing scheme and an order-preserving partitioner.
http://spyced.blogspot.com/2009/05/consistent-hashing-vs-order-preserving.html
From Tom White's blog, a sample implementation of consistent hashing in Java:
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;
public class ConsistentHash<T> {
private final HashFunction hashFunction;
private final int numberOfReplicas;
private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();
public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
Collection<T> nodes) {
this.hashFunction = hashFunction;
this.numberOfReplicas = numberOfReplicas;
for (T node : nodes) {
add(node);
}
}
public void add(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.put(hashFunction.hash(node.toString() + i), node);
}
}
public void remove(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.remove(hashFunction.hash(node.toString() + i));
}
}
public T get(Object key) {
if (circle.isEmpty()) {
return null;
}
int hash = hashFunction.hash(key);
if (!circle.containsKey(hash)) {
SortedMap<Integer, T> tailMap = circle.tailMap(hash);
hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
}
return circle.get(hash);
}
}
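The HashFunction type used above is not part of the JDK; in the blog post it is a small abstraction over the actual hash. A minimal hypothetical version, just to make the sample compile, could look like this:

// Hypothetical one-method interface matching the usage in ConsistentHash above.
// Tom White's post uses an MD5-based implementation; any well-distributed
// 32-bit hash works for illustration.
public interface HashFunction {
    int hash(Object key);
}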
About Oracle hash partitioning (an excerpt from the Oracle help doc follows below):
After some research, Oracle actually does support something like consistent hashing through its default hash partitioning. How it does this is not published, but it apparently works the way a HashMap grows, with some partitions hidden, so when you add or remove a partition, Oracle has to do very little work to redistribute the data among partitions. The algorithm only guarantees an even split when the number of partitions is a power of 2, such as 4; otherwise it merges or splits some partitions.
The magic is that to grow from four partitions to five, it actually splits one partition into two, and to shrink from four partitions to three, it merges two partitions into one.
If anyone has more insight, please add a more detailed answer.
Hash Partitioning
Hash partitioning maps data to partitions based on a hashing algorithm that Oracle applies to the partitioning key that you identify. The hashing algorithm evenly distributes rows among partitions, giving partitions approximately the same size.
Hash partitioning is the ideal method for distributing data evenly across devices. Hash partitioning is also an easy-to-use alternative to range partitioning, especially when the data to be partitioned is not historical or has no obvious partitioning key.
Note:
You cannot change the hashing algorithms used by partitioning.
About MySQL hash partitioning (part of it from the MySQL help doc):
It provides two hash-based partitioning functions.
One is PARTITION BY HASH.
The other is PARTITION BY KEY.
Partitioning by key is similar to partitioning by hash, except that where hash partitioning employs a user-defined expression, the hashing function for key partitioning is supplied by the MySQL server. MySQL Cluster uses MD5() for this purpose; for tables using other storage engines, the server employs its own internal hashing function which is based on the same algorithm as PASSWORD().
The syntax rules for CREATE TABLE ... PARTITION BY KEY are similar to those for creating a table that is partitioned by hash.
The major differences are listed here:
•KEY is used rather than HASH.
•KEY takes only a list of one or more column names. Beginning with MySQL 5.1.5, the column or columns used as the partitioning key must comprise part or all of the table's primary key, if the table has one.
CREATE TABLE k1 (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20)
)
PARTITION BY KEY()
PARTITIONS 2;
If there is no primary key but there is a unique key, then the unique key is used for the partitioning key:
CREATE TABLE k1 (
id INT NOT NULL,
name VARCHAR(20),
UNIQUE KEY (id)
)
PARTITION BY KEY()
PARTITIONS 2;
However, if the unique key column were not defined as NOT NULL, then the previous statement would fail.
But it doesn't say how it partitions; one would have to look into the code.
