What is the significance of RawComparator, and in what scenarios do we use it? - hadoop

What is RawComparator and its significance?
Is it mandatory to use RawComparator for every MapReduce program?

A RawComparator operates directly on the byte representation of objects.
It is not mandatory to use one in every MapReduce program.
MapReduce is fundamentally a batch processing system, and is not suitable for interactive analysis. You can't run a query and get results back in a few seconds or less. Queries typically take minutes or more, so it's best for offline use, where there isn't a human sitting in the processing loop waiting for results.
If you still want to reduce the time taken by a MapReduce job, then a RawComparator is one of the main levers.
Use of RawComparator:
Intermediate key-value pairs are passed from the Mapper to the Reducer. Before these values reach the Reducer from the Mapper, shuffle and sort steps are performed.
Sorting is faster because the RawComparator compares the keys byte by byte. Without a RawComparator, the intermediate keys would have to be completely deserialized just to perform a comparison.
Example:
public class IndexPairComparator extends WritableComparator {
    protected IndexPairComparator() {
        super(IndexPair.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare the first int of each serialized IndexPair
        int i1 = readInt(b1, s1);
        int i2 = readInt(b2, s2);
        int comp = (i1 < i2) ? -1 : (i1 == i2) ? 0 : 1;
        if (0 != comp)
            return comp;
        // First ints are equal; compare the second int, 4 bytes further in
        int j1 = readInt(b1, s1 + 4);
        int j2 = readInt(b2, s2 + 4);
        return (j1 < j2) ? -1 : (j1 == j2) ? 0 : 1;
    }
}
In the above example, we did not implement RawComparator directly. Instead, we extended WritableComparator, which internally implements RawComparator.
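Note that the example doesn't show how the job is told to use this comparator. A minimal driver sketch (new API; the IndexPair key class comes from the example above, while the value type and job name are assumptions):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;

public class IndexPairDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "indexpair-sort");
        job.setMapOutputKeyClass(IndexPair.class);
        job.setMapOutputValueClass(IntWritable.class); // value type is an assumption
        // Sort map output keys byte-by-byte instead of deserializing
        // each IndexPair just to call compareTo()
        job.setSortComparatorClass(IndexPairComparator.class);
        // ... mapper, reducer, and input/output path setup omitted ...
    }
}
Alternatively, WritableComparator.define(IndexPair.class, new IndexPairComparator()) in a static block of IndexPair registers it as the default comparator for that key type; the next answer uses exactly that approach.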

I know I am answering an old question.
Here is another example of writing a RawComparator for a WritableComparable:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

public class CompositeWritable2 implements WritableComparable<CompositeWritable2> {

    private Text textData1;
    private LongWritable longData;
    private Text textData2;

    static {
        WritableComparator.define(CompositeWritable2.class, new Comparator());
    }

    /**
     * Empty constructor
     */
    public CompositeWritable2() {
        textData1 = new Text();
        longData = new LongWritable();
        textData2 = new Text();
    }

    /**
     * Comparator
     *
     * @author CuriousCat
     */
    public static class Comparator extends WritableComparator {

        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
        private static final LongWritable.Comparator LONG_COMPARATOR = new LongWritable.Comparator();

        public Comparator() {
            super(CompositeWritable2.class);
        }

        /*
         * (non-Javadoc)
         *
         * @see org.apache.hadoop.io.WritableComparator#compare(byte[], int, int, byte[], int, int)
         */
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            int cmp;
            try {
                // Find the length of the first text property (vint header + data)
                int textData11Len = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int textData12Len = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
                // Compare the first text data as bytes
                cmp = TEXT_COMPARATOR.compare(b1, s1, textData11Len, b2, s2, textData12Len);
                if (cmp != 0) {
                    return cmp;
                }
                // Compare the next 8 bytes, starting right after the first text
                // property. 8 is hard-coded because the second property is a long.
                // All offsets must be relative to s1/s2, the start of each key.
                cmp = LONG_COMPARATOR.compare(b1, s1 + textData11Len, 8, b2, s2 + textData12Len, 8);
                if (cmp != 0) {
                    return cmp;
                }
                // Move the offsets past the long property
                int offset1 = s1 + textData11Len + 8;
                int offset2 = s2 + textData12Len + 8;
                // Find the length of the second text property
                int textData21Len = WritableUtils.decodeVIntSize(b1[offset1]) + readVInt(b1, offset1);
                int textData22Len = WritableUtils.decodeVIntSize(b2[offset2]) + readVInt(b2, offset2);
                // Compare the second text data as bytes
                return TEXT_COMPARATOR.compare(b1, offset1, textData21Len, b2, offset2, textData22Len);
            } catch (IOException ex) {
                throw new IllegalArgumentException("Failed in CompositeWritable's RawComparator!", ex);
            }
        }
    }

    /**
     * @return the textData1
     */
    public Text getTextData1() {
        return textData1;
    }

    /**
     * @return the longData
     */
    public LongWritable getLongData() {
        return longData;
    }

    /**
     * @return the textData2
     */
    public Text getTextData2() {
        return textData2;
    }

    /**
     * Setter method
     */
    public void set(Text textData1, LongWritable longData, Text textData2) {
        this.textData1 = textData1;
        this.longData = longData;
        this.textData2 = textData2;
    }

    /*
     * (non-Javadoc)
     *
     * @see org.apache.hadoop.io.Writable#write(java.io.DataOutput)
     */
    @Override
    public void write(DataOutput out) throws IOException {
        textData1.write(out);
        longData.write(out);
        textData2.write(out);
    }

    /*
     * (non-Javadoc)
     *
     * @see org.apache.hadoop.io.Writable#readFields(java.io.DataInput)
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        textData1.readFields(in);
        longData.readFields(in);
        textData2.readFields(in);
    }

    /*
     * (non-Javadoc)
     *
     * @see java.lang.Comparable#compareTo(java.lang.Object)
     */
    @Override
    public int compareTo(CompositeWritable2 o) {
        int cmp = textData1.compareTo(o.getTextData1());
        if (cmp != 0) {
            return cmp;
        }
        cmp = longData.compareTo(o.getLongData());
        if (cmp != 0) {
            return cmp;
        }
        return textData2.compareTo(o.getTextData2());
    }
}
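A quick hedged way to sanity-check a raw comparator like the one above is to confirm it orders serialized keys exactly as compareTo() orders the objects. A small main-method sketch (test values made up):
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class CompositeWritable2Check {

    // Serialize the key the same way the framework would
    private static byte[] toBytes(CompositeWritable2 w) throws Exception {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        w.write(new DataOutputStream(baos));
        return baos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        CompositeWritable2 a = new CompositeWritable2();
        a.set(new Text("alpha"), new LongWritable(1L), new Text("x"));
        CompositeWritable2 b = new CompositeWritable2();
        b.set(new Text("alpha"), new LongWritable(2L), new Text("x"));

        byte[] ba = toBytes(a);
        byte[] bb = toBytes(b);
        int raw = new CompositeWritable2.Comparator()
                .compare(ba, 0, ba.length, bb, 0, bb.length);
        // Both comparisons should agree that a < b (prints "-1 vs -1")
        System.out.println(Integer.signum(raw) + " vs " + Integer.signum(a.compareTo(b)));
    }
}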

Related

Use an int variable for a range to get a random number

I am still fairly new to Java. I want to make a game with 3 character types that have different stats. I am using int values for each type so that their attack value is a range instead of just a constant value. Since each character has a different range, I want to substitute an int variable instead of an actual number in the method that gets a random number. Here is my code.
package battleme;

import java.util.Random;

/**
 *
 * @author Kitten
 */
class Character {
    String name;
    int life;
    int playerAttacklow;
    int playerAttackhigh;
    int playerDefense;
    int playerLevel;
    int currentXP;
    int currentGold;

    public Character(String name, int life, int playerAttacklow,
            int playerAttachhigh, int playerDefense,
            int playerLevel, int currentXP, int currentGold) {
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getLife() {
        return life;
    }

    public void setLife(int life) {
        this.life = life;
    }

    public int getPlayerAttacklow() {
        return playerAttacklow;
    }

    public void setPlayerAttacklow(int playerAttacklow) {
        this.playerAttacklow = playerAttacklow;
    }

    public int getPlayerAttackhigh() {
        return playerAttackhigh;
    }

    public void setPlayerAttackhigh(int playerAttackhigh) {
        this.playerAttackhigh = playerAttackhigh;
    }

    public int getPlayerDefense() {
        return playerDefense;
    }

    public void setPlayerDefense(int playerDefense) {
        this.playerDefense = playerDefense;
    }

    public int getPlayerLevel() {
        return playerLevel;
    }

    public void setPlayerLevel(int playerLevel) {
        this.playerLevel = playerLevel;
    }

    public int getCurrentXP() {
        return currentXP;
    }

    public void setCurrentXP(int currentXP) {
        this.currentXP = currentXP;
    }

    public int getCurrentGold() {
        return currentGold;
    }

    public void setCurrentGold(int currentGold) {
        this.currentGold = currentGold;
    }

    // the problem child
    int ActualAttackGen(int playerAttackhigh, int playerAttacklow) {
        Random rn = new Random();
        int randomNum;
        randomNum = rn.nextInt((playerAttackhigh - playerAttacklow) + 1) + playerAttacklow;
        return randomNum;
    }
}
package battleme;

public class BattleMe {
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // TODO code application logic here
        Character Warrior = new Character("Warrior", 30, 2, 10, 3, 1, 1, 15);
        Character Rouge = new Character("Rouge", 25, 3, 6, 2, 1, 1, 15);
        Character Mage = new Character("Mage", 18, 2, 8, 1, 1, 1, 15);
        // trying to run the problem child
        System.out.println(Warrior.ActualAttackGen(Warrior.playerAttackhigh, Warrior.playerAttacklow));
    }
}
Whenever I try to run this, I always get a value of 0. Please help!
In your constructor you have to assign the passed values to the respective members of your Character class:
public Character(String name, int life, int playerAttacklow,
        int playerAttachhigh, int playerDefense,
        int playerLevel, int currentXP, int currentGold) {
    this.name = name;
    ....
}
BTW: It is good coding practice to distinguish member and parameter names. Usually one prefixes one of them (or both), e.g. member myName, parameter aName. Then you do not have to reference the member with "this.":
myName = aName;
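For completeness, a full version of the constructor might look like this (field names are taken from the Character class above; the parameter spelled playerAttachhigh in the original is assumed to mean playerAttackhigh):
public Character(String name, int life, int playerAttacklow,
        int playerAttackhigh, int playerDefense,
        int playerLevel, int currentXP, int currentGold) {
    // Assign every constructor parameter to its field; without these
    // assignments the fields keep their defaults (0 for int, null for String)
    this.name = name;
    this.life = life;
    this.playerAttacklow = playerAttacklow;
    this.playerAttackhigh = playerAttackhigh;
    this.playerDefense = playerDefense;
    this.playerLevel = playerLevel;
    this.currentXP = currentXP;
    this.currentGold = currentGold;
}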

How to get the keys sorted by a custom comparator in a MapReduce job in Hadoop?

Consider this class (from Hadoop: The Definitive Guide, 3rd edition):
import java.io.*;

import org.apache.hadoop.io.*;

public class TextPair implements WritableComparable<TextPair> {

    private Text first;
    private Text second;

    public TextPair() {
        set(new Text(), new Text());
    }

    public TextPair(String first, String second) {
        set(new Text(first), new Text(second));
    }

    public TextPair(Text first, Text second) {
        set(first, second);
    }

    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public Text getFirst() {
        return first;
    }

    public Text getSecond() {
        return second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof TextPair) {
            TextPair tp = (TextPair) o;
            return first.equals(tp.first) && second.equals(tp.second);
        }
        return false;
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }

    @Override
    public int compareTo(TextPair tp) {
        int cmp = first.compareTo(tp.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(tp.second);
    }
    // ^^ TextPair

    // vv TextPairComparator
    public static class Comparator extends WritableComparator {

        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

        public Comparator() {
            super(TextPair.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            try {
                int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
                int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
                if (cmp != 0) {
                    return cmp;
                }
                return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
                                               b2, s2 + firstL2, l2 - firstL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }
        }
    }

    static {
        WritableComparator.define(TextPair.class, new Comparator());
    }
    // ^^ TextPairComparator

    // vv TextPairFirstComparator
    public static class FirstComparator extends WritableComparator {

        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

        public FirstComparator() {
            super(TextPair.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            try {
                int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
                return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            if (a instanceof TextPair && b instanceof TextPair) {
                return ((TextPair) a).first.compareTo(((TextPair) b).first);
            }
            return super.compare(a, b);
        }
    }
    // ^^ TextPairFirstComparator
    // vv TextPair
}
// ^^ TextPair
There are two kinds of comparators defined:
one sorts by first and then by second; this is the default comparator.
The other sorts by first ONLY; this is the FirstComparator.
If I have to use FirstComparator for sorting my keys, how do I achieve that?
That is, how do I override my default comparator with the FirstComparator I defined above?
Secondly, how would I unit-test this, since the output of the map job is not sorted?
If I have to use FirstComparator for sorting my keys, how do I achieve that? That is, how do I override my default comparator with the FirstComparator I defined above?
I assume you expect a method something like setComparator(firstComparator). As far as I know there is no such method. The keys are sorted (on the mapper side) using the compareTo() of the Writable type representing the keys. In your case, the compareTo() method checks the first value and then the second one. In other words, the keys will be sorted by the first value, and then the keys in the same group (i.e. having the same first value) will be sorted by their second value.
All in all, this means that your keys will always be sorted by the first value (+ by the second value if the first one isn't able to make the decision). Which in turn means that there is no need for a different comparator (FirstComparator) that looks only at the first value, because that is already achieved with the compareTo() method of your TextPair class.
On the other hand, if FirstComparator sorts the keys completely differently, the only solution is to move the logic from FirstComparator into the compareTo() method of the Writable class representing your key. I don't see any reason why you wouldn't do that. If you already have FirstComparator and want to reuse it, you can instantiate it and invoke it from the compareTo() method of the TextPair Writable.
You might also want to take a look at the grouping comparator, which is used to decide which keys end up in the same call of the reduce() method. Since you didn't describe exactly what you want to achieve, I can't say for sure whether this will be helpful or not.
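For the grouping route, a hedged driver fragment (new API; job is assumed to be an org.apache.hadoop.mapreduce.Job, and the comparator classes come from the TextPair code above):
// Sort map output by the full (first, second) pair, but group the values
// reaching a single reduce() call by the first field only
job.setMapOutputKeyClass(TextPair.class);
job.setSortComparatorClass(TextPair.Comparator.class);
job.setGroupingComparatorClass(TextPair.FirstComparator.class);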
Secondly, how would I unit-test this, since the output of the map job is not sorted?
Unit testing, as the name says, implies testing a single unit of code (most of the time a method/function/procedure). If you want to unit-test your reduce method, you have to provide the interesting input cases and check that the method under test outputs the expected result. More concretely, you have to create/mock a sorted Iterable over your keys and invoke your reduce function with it. Unit testing a reduce method shouldn't rely on the execution of the corresponding map method.
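The comparators themselves can also be unit-tested in isolation, without running any job. A minimal JUnit 4 sketch (TextPair and FirstComparator come from the question; the test data is made up):
import static org.junit.Assert.assertTrue;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.junit.Test;

public class TextPairFirstComparatorTest {

    // Serialize a TextPair the same way the framework would before comparing
    private static byte[] serialize(TextPair tp) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        tp.write(new DataOutputStream(baos));
        return baos.toByteArray();
    }

    @Test
    public void sortsOnFirstFieldOnly() throws IOException {
        byte[] a = serialize(new TextPair("apple", "zzz"));
        byte[] b = serialize(new TextPair("banana", "aaa"));
        TextPair.FirstComparator cmp = new TextPair.FirstComparator();
        // "apple" < "banana", no matter what the second fields contain
        assertTrue(cmp.compare(a, 0, a.length, b, 0, b.length) < 0);

        byte[] c = serialize(new TextPair("apple", "aaa"));
        // Equal first fields compare as equal; second fields are ignored
        assertTrue(cmp.compare(a, 0, a.length, c, 0, c.length) == 0);
    }
}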

Serializing a long string in Hadoop

I have a class which implements WritableComparable in Hadoop. This class has two string variables, one short and one very long. I use writeChars to write these variables and readLine to read them, but I seem to get some sort of error. What is the best way to serialize such a long String in Hadoop?
I think you can use BytesWritable to make this more efficient. Check the custom key below, which uses a BytesWritable callId.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

public class CustomMRKey implements WritableComparable<CustomMRKey> {

    private BytesWritable callId;
    private IntWritable mapperType;

    /**
     * Default constructor
     */
    public CustomMRKey() {
        set(new BytesWritable(), new IntWritable());
    }

    /**
     * Constructor
     *
     * @param callId
     * @param mapperType
     */
    public CustomMRKey(BytesWritable callId, IntWritable mapperType) {
        set(callId, mapperType);
    }

    /**
     * Sets the call id and mapper type
     *
     * @param callId
     * @param mapperType
     */
    public void set(BytesWritable callId, IntWritable mapperType) {
        this.callId = callId;
        this.mapperType = mapperType;
    }

    /**
     * This method returns the callId
     *
     * @return callId
     */
    public BytesWritable getCallId() {
        return callId;
    }

    /**
     * This method sets the callId given a callId
     *
     * @param callId
     */
    public void setCallId(BytesWritable callId) {
        this.callId = callId;
    }

    /**
     * This method returns the mapper type
     *
     * @return mapperType
     */
    public IntWritable getMapperType() {
        return mapperType;
    }

    /**
     * This method is set to store the mapper type
     *
     * @param mapperType
     */
    public void setMapperType(IntWritable mapperType) {
        this.mapperType = mapperType;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        callId.readFields(in);
        mapperType.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        callId.write(out);
        mapperType.write(out);
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof CustomMRKey) {
            CustomMRKey key = (CustomMRKey) obj;
            return callId.equals(key.callId)
                    && mapperType.equals(key.mapperType);
        }
        return false;
    }

    @Override
    public int compareTo(CustomMRKey key) {
        int cmp = callId.compareTo(key.getCallId());
        if (cmp != 0) {
            return cmp;
        }
        return mapperType.compareTo(key.getMapperType());
    }
}
To use it in, say, mapper code, you can generate the BytesWritable part of the key using something like the following.
You can call it as:
CustomMRKey customKey = new CustomMRKey(new BytesWritable(), new IntWritable());
customKey.setCallId(makeKey(value, this.resultKey));
customKey.setMapperType(this.mapTypeIndicator);
Then the makeKey method is something like the following (note that keyFields, Constants.MR_DEFAULT_KEY_SIZE and value.getString(...) come from the answerer's own code base; Text itself has no getString method):
public BytesWritable makeKey(Text value, BytesWritable key) throws IOException {
    try {
        ByteArrayOutputStream byteKey = new ByteArrayOutputStream(Constants.MR_DEFAULT_KEY_SIZE);
        for (String field : keyFields) {
            byte[] bytes = value.getString(field).getBytes();
            byteKey.write(bytes, 0, bytes.length);
        }
        if (key == null) {
            return new BytesWritable(byteKey.toByteArray());
        } else {
            key.set(byteKey.toByteArray(), 0, byteKey.size());
            return key;
        }
    } catch (Exception ex) {
        throw new IOException("Could not generate key", ex);
    }
}
Hope this helps.
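A likely reason the original writeChars/readLine combination fails (an inference, not something stated in this thread): writeChars emits raw UTF-16 characters with no length prefix or terminator, while readLine reads up to a newline, so the two are not symmetric. Another common option is org.apache.hadoop.io.Text, which writes a vint length followed by UTF-8 bytes and therefore round-trips very long strings safely (it is not subject to the 64 KB limit of DataOutput.writeUTF). A hedged sketch (class and field names are made up):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TwoStrings implements Writable {
    private final Text shortField = new Text();
    private final Text longField = new Text();

    public void set(String shortValue, String longValue) {
        shortField.set(shortValue);
        longField.set(longValue);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Each Text writes its own length prefix, so readFields can
        // recover both strings regardless of their size or content
        shortField.write(out);
        longField.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        shortField.readFields(in);
        longField.readFields(in);
    }
}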

Hadoop Raw comparator

I am trying to implement the following in a RawComparator but am not sure how to write it.
The timestamp field here is of type LongWritable.
if (this.getNaturalKey().compareTo(o.getNaturalKey()) != 0) {
    return this.getNaturalKey().compareTo(o.getNaturalKey());
} else if (this.timeStamp != o.timeStamp) {
    return timeStamp.compareTo(o.timeStamp);
} else {
    return 0;
}
I found a hint here, but I'm not sure how to implement this for a LongWritable type:
http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/serialization/id3548156
Thanks for your help
Let's say I have a CompositeKey that represents a pair (String stockSymbol, long timestamp).
We can do a primary grouping pass on the stockSymbol field to get all of the data of one type together, and then our "secondary sort" during the shuffle phase uses the timestamp member to sort the time series points so that they arrive at the reducer partitioned and in sorted order.
public class CompositeKey implements WritableComparable<CompositeKey> {
    // natural key is (stockSymbol)
    // composite key is the pair (stockSymbol, timestamp)
    private String stockSymbol;
    private long timestamp;

    // ... getters and setters omitted for clarity ...

    @Override
    public void readFields(DataInput in) throws IOException {
        this.stockSymbol = in.readUTF();
        this.timestamp = in.readLong();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.stockSymbol);
        out.writeLong(this.timestamp);
    }

    @Override
    public int compareTo(CompositeKey other) {
        if (this.stockSymbol.compareTo(other.stockSymbol) != 0) {
            return this.stockSymbol.compareTo(other.stockSymbol);
        } else if (this.timestamp != other.timestamp) {
            return timestamp < other.timestamp ? -1 : 1;
        } else {
            return 0;
        }
    }
}
Now the CompositeKey comparator would be:
public class CompositeKeyComparator extends WritableComparator {
    protected CompositeKeyComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable wc1, WritableComparable wc2) {
        CompositeKey ck1 = (CompositeKey) wc1;
        CompositeKey ck2 = (CompositeKey) wc2;
        int comparison = ck1.getStockSymbol().compareTo(ck2.getStockSymbol());
        if (comparison == 0) {
            // stock symbols are equal here
            if (ck1.getTimestamp() == ck2.getTimestamp()) {
                return 0;
            } else if (ck1.getTimestamp() < ck2.getTimestamp()) {
                return -1;
            } else {
                return 1;
            }
        } else {
            return comparison;
        }
    }
}
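Note that this comparator deserializes the keys before comparing (the super(CompositeKey.class, true) call tells WritableComparator to create instances and populate them from the bytes), so it is not yet a byte-level RawComparator. A hedged raw sketch, assuming exactly the write() layout above (writeUTF, which prefixes the string with a 2-byte length, followed by writeLong) and ASCII stock symbols (so that byte order matches String order):
import org.apache.hadoop.io.WritableComparator;

public class CompositeKeyRawComparator extends WritableComparator {

    protected CompositeKeyRawComparator() {
        super(CompositeKey.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // writeUTF stores a 2-byte unsigned length before the string bytes
        int sym1Len = readUnsignedShort(b1, s1);
        int sym2Len = readUnsignedShort(b2, s2);
        // Compare the symbol bytes directly, without deserializing
        int cmp = compareBytes(b1, s1 + 2, sym1Len, b2, s2 + 2, sym2Len);
        if (cmp != 0) {
            return cmp;
        }
        // The 8-byte long timestamp follows immediately after the string
        long ts1 = readLong(b1, s1 + 2 + sym1Len);
        long ts2 = readLong(b2, s2 + 2 + sym2Len);
        return Long.compare(ts1, ts2);
    }
}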
Are you asking about a way to compare the LongWritable type provided by Hadoop?
If yes, then the answer is to use the compare() method.
The best way to implement RawComparator correctly is to extend WritableComparator and override its compare() method. WritableComparator is very well written, so you can easily understand it.
It is already implemented, from what I can see, in the LongWritable class:
/** A Comparator optimized for LongWritable. */
public static class Comparator extends WritableComparator {
    public Comparator() {
        super(LongWritable.class);
    }

    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        long thisValue = readLong(b1, s1);
        long thatValue = readLong(b2, s2);
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
}
That byte comparison is the override of RawComparator's compare(). LongWritable also registers this comparator in a static block via WritableComparator.define(), so it is picked up automatically whenever LongWritable keys are sorted.

Sort order with Hadoop MapRed

Well,
I'd like to know how I can change the sort order of my simple WordCount program after the reduce task. I've already added another map job to order by value instead of by key, but it is still sorted in ascending order.
Is there a simple way to do this (change the sort order)?
Thanks,
Vellozo
If you are using the older API (mapred.*), then set the OutputKeyComparatorClass in the job conf:
jobConf.setOutputKeyComparatorClass(ReverseComparator.class);
ReverseComparator can be something like this:
static class ReverseComparator extends WritableComparator {
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

    public ReverseComparator() {
        super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Negate the byte-level comparison to reverse the order
        return -1 * TEXT_COMPARATOR.compare(b1, s1, l1, b2, s2, l2);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        if (a instanceof Text && b instanceof Text) {
            return -1 * ((Text) a).compareTo((Text) b);
        }
        return super.compare(a, b);
    }
}
In the new API (mapreduce.*), I think you need to use the Job.setSortComparatorClass() method.
This one is almost the same as above, just a bit simpler:
class MyKeyComparator extends WritableComparator {
    protected MyKeyComparator() {
        super(Text.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        Text key1 = (Text) w1;
        Text key2 = (Text) w2;
        return -1 * key1.compareTo(key2);
    }
}
Then add it to the job:
job.setSortComparatorClass(MyKeyComparator.class);
You can change the Text casts above to whatever key type you use.
