Consider this class (from Hadoop: The Definitive Guide, 3rd edition):
import java.io.*;
import org.apache.hadoop.io.*;

public class TextPair implements WritableComparable<TextPair> {
  private Text first;
  private Text second;
  public TextPair() {
    set(new Text(), new Text());
  }
  public TextPair(String first, String second) {
    set(new Text(first), new Text(second));
  }
  public TextPair(Text first, Text second) {
    set(first, second);
  }
  public void set(Text first, Text second) {
    this.first = first;
    this.second = second;
  }
  public Text getFirst() {
    return first;
  }
  public Text getSecond() {
    return second;
  }
  @Override
  public void write(DataOutput out) throws IOException {
    // Serialize both fields in order; readFields() must read them back the same way.
    first.write(out);
    second.write(out);
  }
  @Override
  public void readFields(DataInput in) throws IOException {
    first.readFields(in);
    second.readFields(in);
  }
  @Override
  public int hashCode() {
    return first.hashCode() * 163 + second.hashCode();
  }
  @Override
  public boolean equals(Object o) {
    if (o instanceof TextPair) {
      TextPair tp = (TextPair) o;
      return first.equals(tp.first) && second.equals(tp.second);
    }
    return false;
  }
  @Override
  public String toString() {
    return first + "\t" + second;
  }
  @Override
  public int compareTo(TextPair tp) {
    // Order by first, then by second.
    int cmp = first.compareTo(tp.first);
    if (cmp != 0) {
      return cmp;
    }
    return second.compareTo(tp.second);
  }
  // ^^ TextPair
  // vv TextPairComparator
  public static class Comparator extends WritableComparator {
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
    public Comparator() {
      super(TextPair.class);
    }
    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      try {
        // Length of the serialized first field: size of the vint prefix plus the string bytes.
        int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
        int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
        int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
        if (cmp != 0) {
          return cmp;
        }
        return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
                                       b2, s2 + firstL2, l2 - firstL2);
      } catch (IOException e) {
        throw new IllegalArgumentException(e);
      }
    }
  }
  static {
    // Register the raw comparator as the default comparator for TextPair.
    WritableComparator.define(TextPair.class, new Comparator());
  }
  // ^^ TextPairComparator
  // vv TextPairFirstComparator
  public static class FirstComparator extends WritableComparator {
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
    public FirstComparator() {
      super(TextPair.class);
    }
    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      try {
        int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
        int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
        return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
      } catch (IOException e) {
        throw new IllegalArgumentException(e);
      }
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      if (a instanceof TextPair && b instanceof TextPair) {
        return ((TextPair) a).first.compareTo(((TextPair) b).first);
      }
      return super.compare(a, b);
    }
  }
  // ^^ TextPairFirstComparator
  // vv TextPair
}
// ^^ TextPair
There are two kinds of comparators defined:
one sorts by first and then by second; this is the default comparator.
The other sorts by first ONLY; this is FirstComparator.
If I have to use FirstComparator for sorting my keys, how do I achieve that?
That is, how do I override the default comparator with the FirstComparator defined above?
I assume you expect a method something like setComparator(firstComparator). In the new API (org.apache.hadoop.mapreduce) the closest thing is Job.setSortComparatorClass(), which takes a RawComparator class such as TextPair.FirstComparator; in the old API it is JobConf.setOutputKeyComparatorClass(). If you set nothing, the keys are sorted (on the mapper side) with the comparator registered for the Writable type representing the keys, which here gives the same order as compareTo(). In your case, that order checks the first value and then the second one. In other words, the keys will be sorted by the first value and, then, the keys having the same first value will be sorted by their second value.
All in all, this means that by default your keys are already sorted by the first value (and by the second value if the first one can't decide). So there is little to gain from sorting with a different comparator (FirstComparator) that looks only at the first value: the order by first value is the same, and only the order among keys with equal first values becomes unspecified.
On the other hand, if you want to sort the keys completely differently, you can either set that comparator as the sort comparator or move its logic into the compareTo() method of the Writable class representing your key (and into the registered raw comparator, if there is one). If you already have such a comparator and want to reuse it, you can instantiate it and invoke it in the compareTo() method of the TextPair Writable.
You might also want to take a look at the grouping comparator (Job.setGroupingComparatorClass()), which decides which keys are passed together to the same call of the reduce() method. Since you didn't describe exactly what you want to achieve, I can't say for sure whether this will be helpful or not.
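To make the difference between the sort order and the grouping concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The names Pair, SortVsGroupSketch, FULL and FIRST_ONLY are made up for this illustration: FULL plays the role of TextPair.compareTo() and FIRST_ONLY the role of FirstComparator. It only imitates what the framework does (sort, then cut the sorted keys into groups), it is not Hadoop's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Plain-JDK stand-in for TextPair (hypothetical, not a Hadoop class). */
final class Pair {
    final String first;
    final String second;

    Pair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }
}

final class SortVsGroupSketch {
    // Same order as TextPair.compareTo(): first, then second.
    static final Comparator<Pair> FULL = (a, b) -> {
        int cmp = a.first.compareTo(b.first);
        return cmp != 0 ? cmp : a.second.compareTo(b.second);
    };

    // Same order as FirstComparator: first only.
    static final Comparator<Pair> FIRST_ONLY = (a, b) -> a.first.compareTo(b.first);

    /** Sorts with sortCmp, then starts a new group whenever groupCmp sees a different key. */
    static List<List<Pair>> sortAndGroup(List<Pair> keys,
                                         Comparator<Pair> sortCmp,
                                         Comparator<Pair> groupCmp) {
        List<Pair> sorted = new ArrayList<>(keys);
        sorted.sort(sortCmp);
        List<List<Pair>> groups = new ArrayList<>();
        for (Pair p : sorted) {
            boolean newGroup = groups.isEmpty()
                || groupCmp.compare(groups.get(groups.size() - 1).get(0), p) != 0;
            if (newGroup) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(p);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Pair> keys = Arrays.asList(
            new Pair("b", "2"), new Pair("a", "2"), new Pair("b", "1"), new Pair("a", "1"));
        // Sort by (first, second), group by first: two groups, each ordered by second.
        System.out.println(sortAndGroup(keys, FULL, FIRST_ONLY));
        // Sort and group by the full key: every key is its own group.
        System.out.println(sortAndGroup(keys, FULL, FULL));
    }
}
```

With the full order for sorting and the first-only order for grouping, each group holds all keys that share a first value, already ordered by second value. This is the combination that a secondary sort relies on.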
Unit testing, as the name says, implies testing a single unit of code (most of the time a method/function/procedure). If you want to unit-test your reduce method you have to provide the interesting input cases and to check that the method under test outputs the expected result. More concretely, you have to create/mock a sorted Iterable over your keys and invoke your reduce function with it. Unit testing a reduce method shouldn't rely on the execution of the corresponding map method.


I want to write a Hadoop MapReduce Join in Java

I'm completely new to the Hadoop framework and I want to write a MapReduce program that joins two tables, R and S, on the x attribute. The structure of the two tables is:
R (tag : char, x : int, y : varchar(30))
S (tag : char, x : int, z : varchar(30))
For example we have for R table :
r 10 r-10-0
r 11 r-11-0
r 12 r-12-0
r 21 r-21-0
And for S table :
s 11 s-11-0
s 21 s-41-0
s 21 s-41-1
s 12 s-31-0
s 11 s-31-1
The result should look like :
r 11 r-11-0 s 11 s-11-0
Can anyone help me please ?
A join in MapReduce is hard to describe to someone who is new to the framework, but here is a working implementation for your situation. I definitely recommend reading chapter 9 of Hadoop: The Definitive Guide, 4th Edition, which describes how to implement joins in MapReduce very well.
First of all, you might consider using higher-level frameworks such as Pig, Hive, or Spark, because they provide a join operation out of the box.
Secondly, there are several ways to implement a join in MapReduce, depending on the nature of your data, including the map-side join and the reduce-side join. In this answer I have implemented the reduce-side join.
We need a mapper for each dataset. In your case the same mapper could serve both datasets, but in many situations different datasets need different mappers, so I have defined two mappers to keep this solution general.
The map output key is a TextPair with two attributes: one is the key used to join the data, and the other is a tag that specifies which dataset the record belongs to. The tag is 0 for the first dataset and 1 for the second.
Because TextPair sorts by join key and then by tag, for each join key the record of the first dataset is the first one received by the reducer, and all records of the second dataset with that key are received after it. TextPair.FirstComparator, used as the grouping comparator, ensures that all of them arrive in the same reduce() call. This line of job configuration does the trick for us:
job.setGroupingComparatorClass(TextPair.FirstComparator.class);
So in the reducer the first record we receive for a key is the record from dataset1, followed by the records from dataset2. All that remains is to write the joined records out.
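The data flow described above can be sketched in plain Java without Hadoop, using the sample R and S rows from the question. Everything here (ReduceSideJoinSketch, Tagged, map, join) is a hypothetical stand-in for the map, shuffle and reduce phases, not the actual job. Unlike the reducer in this answer, the sketch checks the tag, so keys that occur only in S produce no output; it still assumes at most one R row per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReduceSideJoinSketch {
    /** One shuffled record: join key, dataset tag ("0" = R, "1" = S), original line. */
    static final class Tagged {
        final String key;
        final String tag;
        final String line;

        Tagged(String key, String tag, String line) {
            this.key = key;
            this.tag = tag;
            this.line = line;
        }
    }

    /** "Map" step: fields are space-separated and the second field is the join key. */
    static Tagged map(String line, String tag) {
        String[] data = line.split(" ");
        return new Tagged(data[1], tag, line);
    }

    static List<String> join(List<String> r, List<String> s) {
        List<Tagged> shuffled = new ArrayList<>();
        for (String line : r) {
            shuffled.add(map(line, "0"));
        }
        for (String line : s) {
            shuffled.add(map(line, "1"));
        }
        // "Sort" step: by join key, then by tag, so the R row leads each key's records.
        shuffled.sort((a, b) -> {
            int cmp = a.key.compareTo(b.key);
            return cmp != 0 ? cmp : a.tag.compareTo(b.tag);
        });
        // "Reduce" step: walk the sorted records, grouping on the join key only.
        List<String> out = new ArrayList<>();
        String currentKey = null;
        String left = null; // the R row of the current key, if any
        for (Tagged t : shuffled) {
            if (!t.key.equals(currentKey)) {
                currentKey = t.key;
                left = null;
            }
            if (t.tag.equals("0")) {
                left = t.line;
            } else if (left != null) {
                out.add(left + "\t" + t.line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> r = Arrays.asList("r 10 r-10-0", "r 11 r-11-0", "r 12 r-12-0", "r 21 r-21-0");
        List<String> s = Arrays.asList(
            "s 11 s-11-0", "s 21 s-41-0", "s 21 s-41-1", "s 12 s-31-0", "s 11 s-31-1");
        for (String row : join(r, s)) {
            System.out.println(row);
        }
    }
}
```

The sort by (key, tag) is what lets the reduce step keep only one R row in memory per key while streaming the S rows past it.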
Mapper for dataset1:
public class JoinDataSet1Mapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Fields are space-separated; data[1] is the join key, "0" tags dataset1.
    String[] data = value.toString().split(" ");
    context.write(new TextPair(data[1], "0"), value);
  }
}
Mapper for dataset2:
public class JoinDataSet2Mapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Same layout as dataset1; "1" tags dataset2.
    String[] data = value.toString().split(" ");
    context.write(new TextPair(data[1], "1"), value);
  }
}
Reducer (with the partitioner nested inside it):
public class JoinReducer extends Reducer<TextPair, Text, NullWritable, Text> {
  // Partition on the join key only, so both datasets' records for a key reach the same reducer.
  public static class KeyPartitioner extends Partitioner<TextPair, Text> {
    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
      return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }
  @Override
  protected void reduce(TextPair key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> iter = values.iterator();
    // The first value is taken to be the dataset1 record (tag "0" sorts first).
    // It is copied because Hadoop reuses the value object on each iteration.
    Text stationName = new Text(iter.next());
    while (iter.hasNext()) {
      Text record = iter.next();
      Text outValue = new Text(stationName.toString() + "\t" + record.toString());
      context.write(NullWritable.get(), outValue);
    }
  }
}
Custom key:
public class TextPair implements WritableComparable<TextPair> {
  private Text first;
  private Text second;
  public TextPair() {
    set(new Text(), new Text());
  }
  public TextPair(String first, String second) {
    set(new Text(first), new Text(second));
  }
  public TextPair(Text first, Text second) {
    set(first, second);
  }
  public void set(Text first, Text second) {
    this.first = first;
    this.second = second;
  }
  public Text getFirst() {
    return first;
  }
  public Text getSecond() {
    return second;
  }
  @Override
  public void write(DataOutput out) throws IOException {
    first.write(out);
    second.write(out);
  }
  @Override
  public void readFields(DataInput in) throws IOException {
    first.readFields(in);
    second.readFields(in);
  }
  @Override
  public int hashCode() {
    return first.hashCode() * 163 + second.hashCode();
  }
  @Override
  public boolean equals(Object o) {
    if (o instanceof TextPair) {
      TextPair tp = (TextPair) o;
      return first.equals(tp.first) && second.equals(tp.second);
    }
    return false;
  }
  @Override
  public String toString() {
    return first + "\t" + second;
  }
  @Override
  public int compareTo(TextPair tp) {
    // Sort order: join key first, then dataset tag.
    int cmp = first.compareTo(tp.first);
    if (cmp != 0) {
      return cmp;
    }
    return second.compareTo(tp.second);
  }
  // Compares only the first field (the join key); used for grouping.
  public static class FirstComparator extends WritableComparator {
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
    public FirstComparator() {
      super(TextPair.class);
    }
    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      try {
        int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
        int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
        return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
      } catch (IOException e) {
        throw new IllegalArgumentException(e);
      }
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      if (a instanceof TextPair && b instanceof TextPair) {
        return ((TextPair) a).first.compareTo(((TextPair) b).first);
      }
      return super.compare(a, b);
    }
  }
}
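public class JoinJob extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "Join two DataSet");
    job.setJarByClass(getClass());
    Path ncdcInputPath = new Path(getConf().get("job.input1.path"));
    Path stationInputPath = new Path(getConf().get("job.input2.path"));
    Path outputPath = new Path(getConf().get("job.output.path"));
    MultipleInputs.addInputPath(job, ncdcInputPath,
        TextInputFormat.class, JoinDataSet1Mapper.class);
    MultipleInputs.addInputPath(job, stationInputPath,
        TextInputFormat.class, JoinDataSet2Mapper.class);
    FileOutputFormat.setOutputPath(job, outputPath);
    // The job-configuration calls are missing from the post as quoted here;
    // the calls below (and setJarByClass above) are assumed from the classes defined earlier.
    job.setPartitionerClass(JoinReducer.KeyPartitioner.class);
    job.setGroupingComparatorClass(TextPair.FirstComparator.class);
    job.setMapOutputKeyClass(TextPair.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(NullWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new JoinJob(), args);
    System.exit(exitCode);
  }
}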

Part of key changes when iterating through values when using composite key - Hadoop

I have implemented Secondary sort on Hadoop and I don't really understand the behavior of the framework.
I have created a composite key which contains original key and part of value, that is used for sorting.
To achieve this I have implemented my own partitioner
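public class CustomPartitioner extends Partitioner<CoupleAsKey, LongWritable> {
  public int getPartition(CoupleAsKey couple, LongWritable value, int numPartitions) {
    // Mask the sign bit: a negative hash would otherwise yield a negative partition number.
    return (Long.hashCode(couple.getKey1()) & Integer.MAX_VALUE) % numPartitions;
  }
}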
My own group comparator
public class GroupComparator extends WritableComparator {
  protected GroupComparator() {
    super(CoupleAsKey.class, true);
  }
  @Override
  public int compare(WritableComparable w1, WritableComparable w2) {
    CoupleAsKey c1 = (CoupleAsKey) w1;
    CoupleAsKey c2 = (CoupleAsKey) w2;
    // Group on the original key only; key2 is ignored.
    return Long.compare(c1.getKey1(), c2.getKey1());
  }
}
And defined the couple in the following way
public class CoupleAsKey implements WritableComparable<CoupleAsKey> {
  private long key1;
  private long key2;
  public CoupleAsKey() {
  }
  public CoupleAsKey(long key1, long key2) {
    this.key1 = key1;
    this.key2 = key2;
  }
  public long getKey1() {
    return key1;
  }
  public void setKey1(long key1) {
    this.key1 = key1;
  }
  public long getKey2() {
    return key2;
  }
  public void setKey2(long key2) {
    this.key2 = key2;
  }
  @Override
  public void write(DataOutput output) throws IOException {
    // Mirror of readFields(): same fields, same order.
    output.writeLong(key1);
    output.writeLong(key2);
  }
  @Override
  public void readFields(DataInput input) throws IOException {
    key1 = input.readLong();
    key2 = input.readLong();
  }
  @Override
  public int compareTo(CoupleAsKey o2) {
    // Sort by the original key, then by the part of the value used for sorting.
    int cmp = Long.compare(key1, o2.getKey1());
    if (cmp != 0) {
      return cmp;
    }
    return Long.compare(key2, o2.getKey2());
  }
  @Override
  public String toString() {
    return key1 + "," + key2 + ",";
  }
}
And here is the driver
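Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// The registration calls are missing from the post as quoted; these two are assumed.
job.setPartitionerClass(CustomPartitioner.class);
job.setGroupingComparatorClass(GroupComparator.class);
FileInputFormat.addInputPath(job, new Path("/home/marko/WORK/Whirlpool/input.csv"));
FileOutputFormat.setOutputPath(job, new Path("/home/marko/WORK/Whirlpool/output"));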
Now, this works, but what is really strange is that while iterating in reducer for a key, second part of the key (the value part) changes in each iteration. Why and how?
protected void reduce(CoupleAsKey key, Iterable<LongWritable> values, Context context)
    throws IOException, InterruptedException {
  for (LongWritable value : values) {
    // key.key2 changes during iterations, why?
    context.write(key, value);
  }
}
The definition says that "if you want all your relevant rows within a partition of data sent to a single reducer you must implement a grouping comparator". This only ensures that this set of keys is sent to a single reduce() call; it does not turn the composite key into one that contains only the part of the key on which the grouping was done.
Hadoop reuses the key and value objects: every time you advance the values iterator, the framework deserializes the next record into those same objects, so the key always reflects the record the current value came from. We normally do not observe this, because by default the values are grouped on the whole (non-composite) key, and thus, even when the value changes, the contents of the key remain the same.
You can try printing the key inside the loop: the object reference stays the same, while its contents (here key2) change with every iteration.
Alternatively, you can try applying a grouping comparator to an IntWritable key with input like the following (you will have to write your own grouping logic to do so):
1 a
1 b
2 c
3 c
3 d
4 a
and you shall see that with every iteration of value, your key is also changing.
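The effect can be imitated in plain Java without Hadoop. In the sketch below, Couple stands in for CoupleAsKey and GroupValues for the values iterator; both are hypothetical and heavily simplified (the real framework deserializes bytes into the key, here the fields are just assigned). The point is only that a single key object is overwritten each time the iterator advances.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class KeyReuseSketch {
    /** Mutable stand-in for CoupleAsKey. */
    static final class Couple {
        long key1;
        long key2;
    }

    /** Iterates the values of one group and overwrites the shared key on every step. */
    static final class GroupValues implements Iterator<Long> {
        private final long[][] records; // each record: {key1, key2, value}
        private final Couple sharedKey;
        private int pos = 0;

        GroupValues(long[][] records, Couple sharedKey) {
            this.records = records;
            this.sharedKey = sharedKey;
        }

        @Override
        public boolean hasNext() {
            return pos < records.length;
        }

        @Override
        public Long next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            long[] record = records[pos++];
            // Same key object, new contents: this is why key2 appears to change.
            sharedKey.key1 = record[0];
            sharedKey.key2 = record[1];
            return record[2];
        }
    }

    /** Imitates a reduce() call: records what the key looks like next to each value. */
    static List<String> reduce(long[][] group) {
        Couple key = new Couple();
        Iterator<Long> values = new GroupValues(group, key);
        List<String> seen = new ArrayList<>();
        while (values.hasNext()) {
            long value = values.next();
            seen.add(key.key1 + "," + key.key2 + "=" + value);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Three records grouped on key1 == 1, sorted by key2.
        System.out.println(reduce(new long[][] {{1, 10, 100}, {1, 20, 200}, {1, 30, 300}}));
    }
}
```

A practical consequence: if you need a key (or value) after the loop has moved on, copy it first instead of keeping the reference.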

Hadoop Raw comparator

I am trying to implement the following in a RawComparator, but I am not sure how to write it.
The timestamp field here is of type LongWritable.
if (this.getNaturalKey().compareTo(o.getNaturalKey()) != 0) {
  return this.getNaturalKey().compareTo(o.getNaturalKey());
} else if (!this.timeStamp.equals(o.timeStamp)) {
  // equals(), not !=, because LongWritable is an object
  return timeStamp.compareTo(o.timeStamp);
} else {
  return 0;
}
I found a hint here, but I am not sure how to implement this when dealing with a LongWritable type.
Thanks for your help.
Let's say I have a CompositeKey that represents a pair (String stockSymbol, long timestamp).
We can do a primary grouping pass on the stockSymbol field to get all of the data of one type together, and then the "secondary sort" during the shuffle phase uses the timestamp member to sort the time-series points, so that they arrive at the reducer partitioned and in sorted order.
public class CompositeKey implements WritableComparable<CompositeKey> {
  // natural key is (stockSymbol)
  // composite key is a pair (stockSymbol, timestamp)
  private String stockSymbol;
  private long timestamp;
  ......// Getters and setters omitted for clarity here
  @Override
  public void readFields(DataInput in) throws IOException {
    this.stockSymbol = in.readUTF();
    this.timestamp = in.readLong();
  }
  @Override
  public void write(DataOutput out) throws IOException {
    // Mirror of readFields(): same fields, same order.
    out.writeUTF(stockSymbol);
    out.writeLong(timestamp);
  }
  @Override
  public int compareTo(CompositeKey other) {
    if (this.stockSymbol.compareTo(other.stockSymbol) != 0) {
      return this.stockSymbol.compareTo(other.stockSymbol);
    } else if (this.timestamp != other.timestamp) {
      return timestamp < other.timestamp ? -1 : 1;
    } else {
      return 0;
    }
  }
}
Now the CompositeKey comparator would be:
public class CompositeKeyComparator extends WritableComparator {
  protected CompositeKeyComparator() {
    super(CompositeKey.class, true);
  }
  @Override
  public int compare(WritableComparable wc1, WritableComparable wc2) {
    CompositeKey ck1 = (CompositeKey) wc1;
    CompositeKey ck2 = (CompositeKey) wc2;
    int comparison = ck1.getStockSymbol().compareTo(ck2.getStockSymbol());
    if (comparison == 0) {
      // stock symbols are equal here
      if (ck1.getTimestamp() == ck2.getTimestamp()) {
        return 0;
      } else if (ck1.getTimestamp() < ck2.getTimestamp()) {
        return -1;
      } else {
        return 1;
      }
    } else {
      return comparison;
    }
  }
}
Are you asking how to compare the LongWritable type provided by Hadoop?
If so, the answer is to use the compare() method. For more details, scroll down here.
The best way to implement a RawComparator correctly is to extend WritableComparator and override its compare() method. WritableComparator is very well written, so you can easily understand it.
From what I see, it is already implemented in the LongWritable class:
/** A Comparator optimized for LongWritable. */
public static class Comparator extends WritableComparator {
  public Comparator() {
    super(LongWritable.class);
  }
  @Override
  public int compare(byte[] b1, int s1, int l1,
                     byte[] b2, int s2, int l2) {
    // Read the two longs straight from the serialized bytes; no objects are created.
    long thisValue = readLong(b1, s1);
    long thatValue = readLong(b2, s2);
    return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
  }
}
That byte-level comparison is the override of the RawComparator method.
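For the composite key quoted in the question (a string followed by a long), the same idea can be sketched on raw bytes in plain Java. This is an illustration under assumptions, not Hadoop code: RawCompositeCompareSketch and its methods are made-up names, the key is assumed to be serialized with writeUTF() followed by writeLong(), and the symbols are assumed to be ASCII so that byte order matches String.compareTo(). In a real raw comparator that extends WritableComparator, the static helpers readUnsignedShort(), compareBytes() and readLong() would play the roles of the hand-written parts below.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

final class RawCompositeCompareSketch {
    /** Serializes a key the way the sketch assumes: writeUTF(symbol), then writeLong(timestamp). */
    static byte[] serialize(String symbol, long timestamp) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(symbol);
            out.writeLong(timestamp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Compares two serialized keys starting at s1 and s2 without deserializing them. */
    static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        // writeUTF() prefixes the string with its byte length as an unsigned 16-bit value.
        int len1 = ((b1[s1] & 0xff) << 8) | (b1[s1 + 1] & 0xff);
        int len2 = ((b2[s2] & 0xff) << 8) | (b2[s2 + 1] & 0xff);
        // Natural key first: compare the string bytes.
        int cmp = compareBytes(b1, s1 + 2, len1, b2, s2 + 2, len2);
        if (cmp != 0) {
            return cmp;
        }
        // Natural keys are equal: fall back to the big-endian long that follows the string.
        long t1 = ByteBuffer.wrap(b1, s1 + 2 + len1, Long.BYTES).getLong();
        long t2 = ByteBuffer.wrap(b2, s2 + 2 + len2, Long.BYTES).getLong();
        return Long.compare(t1, t2);
    }

    /** Lexicographic comparison of two byte ranges, treating bytes as unsigned. */
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    public static void main(String[] args) {
        byte[] a = serialize("AAPL", 5L);
        byte[] b = serialize("AAPL", 9L);
        System.out.println(compare(a, 0, b, 0) < 0); // same symbol, earlier timestamp sorts first
    }
}
```

The timestamp is compared with Long.compare() on the decoded value rather than byte by byte, because a bytewise comparison of two's-complement longs would order negative values after positive ones.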

Sort order with Hadoop MapRed

I'd like to know how I can change the sort order of my simple WordCount program after the reduce task. I've already written another map to order by value instead of by key, but the output is still in ascending order.
Is there an easy way to do this (change the sort order)?
If you are using the older API (mapred.*), then set the OutputKeyComparatorClass in the job conf:
jobConf.setOutputKeyComparatorClass(ReverseComparator.class);
ReverseComparator can be something like this:
static class ReverseComparator extends WritableComparator {
  private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
  public ReverseComparator() {
    super(Text.class);
  }
  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // Text.Comparator.compare() declares no IOException, so no try/catch is needed here.
    return (-1) * TEXT_COMPARATOR.compare(b1, s1, l1, b2, s2, l2);
  }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    if (a instanceof Text && b instanceof Text) {
      return (-1) * ((Text) a).compareTo((Text) b);
    }
    return super.compare(a, b);
  }
}
In the new API (mapreduce.*), I think you need to use the Job.setSortComparatorClass() method.
This one is almost the same as above, it just looks a bit simpler:
class MyKeyComparator extends WritableComparator {
  protected MyKeyComparator() {
    super(Text.class, true);
  }
  @Override
  public int compare(WritableComparable w1, WritableComparable w2) {
    Text key1 = (Text) w1;
    Text key2 = (Text) w2;
    // Negate the natural order to sort in descending order.
    return -1 * key1.compareTo(key2);
  }
}
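Then add it to the job:
job.setSortComparatorClass(MyKeyComparator.class);
In the two cast lines
Text key1 = (Text) w1;
Text key2 = (Text) w2;
(and in the super() call) you can change the Text type to whatever key type you use.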
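Independently of Hadoop, the reversing trick itself can be sketched in plain Java. DescendingSketch and its methods are hypothetical names used only for this illustration. The sketch swaps the arguments instead of multiplying by -1, which is a slightly safer general pattern: a comparator is allowed to return Integer.MIN_VALUE, and negating that value gives Integer.MIN_VALUE again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class DescendingSketch {
    /** Reverses any comparator by swapping its arguments rather than negating its result. */
    static <T> Comparator<T> descending(Comparator<T> ascending) {
        return (a, b) -> ascending.compare(b, a);
    }

    /** Returns a copy of the counts sorted from largest to smallest. */
    static List<Integer> sortDescending(List<Integer> counts) {
        List<Integer> sorted = new ArrayList<>(counts);
        sorted.sort(descending(Comparator.<Integer>naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        // Word counts as they might come out of a WordCount job.
        System.out.println(sortDescending(Arrays.asList(3, 10, 1)));
    }
}
```

The same argument swap works inside a WritableComparator subclass: calling the underlying compare with the two operands exchanged reverses the order without any arithmetic on the result.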
