Does default sorting in MapReduce use the Comparator defined in the WritableComparable class or the compareTo() method? - hadoop

How does the sort happen in MapReduce before the output is passed from the mapper to the reducer? If my output key from the mapper is of type IntWritable, does it use the comparator defined in the IntWritable class or the compareTo() method of that class, and if so, how is the call made? If not, how is the sort performed and how is the call made?

Map job outputs are first collected and then sent to the Partitioner, which is responsible for determining which Reducer the data will be sent to (it is not yet grouped by reduce() call at this point, though). The default Partitioner uses the key's hashCode() method and a modulo with the number of Reducers to do that.
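For reference, the default hash partitioning logic is essentially the following (a minimal sketch; HashPartitionerSketch is just a name for illustration):
import org.apache.hadoop.mapreduce.Partitioner;
// Minimal sketch of the default hash partitioning logic (new mapreduce API).
public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result is always in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}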
After that, the Comparator will be called to perform a sort on the map outputs. The flow looks like this:
Collector --> Partitioner --> Spill --> Comparator --> Local Disk (HDFS) <-- MapOutputServlet
Each Reducer will then copy the data from the mapper that has been assigned to it by the partitioner, and pass it through a Grouper that will determine how records are grouped for a single Reducer function call:
MapOutputServlet --> Copy to Local Disk (HDFS) --> Group --> Reduce
Before a reduce() call, the records will also go through a sorting phase that determines in which order they arrive at the reducer. The sorter (a WritableComparator) will call the key's compareTo() method (from the WritableComparable interface).
To give you a better idea, here is how you would implement a basic compareTo(), grouping comparator and sorting comparator for a custom composite key:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
public class CompositeKey implements WritableComparable<CompositeKey> {
    private IntWritable primaryField = new IntWritable();
    private IntWritable secondaryField = new IntWritable();
    // No-argument constructor so the framework can instantiate the key before readFields().
    public CompositeKey() {
    }
    public CompositeKey(IntWritable p, IntWritable s) {
        this.primaryField.set(p.get());
        this.secondaryField.set(s.get());
    }
    public IntWritable getPrimaryField() {
        return primaryField;
    }
    public IntWritable getSecondaryField() {
        return secondaryField;
    }
    @Override
    public void write(DataOutput out) throws IOException {
        this.primaryField.write(out);
        this.secondaryField.write(out);
    }
    @Override
    public void readFields(DataInput in) throws IOException {
        this.primaryField.readFields(in);
        this.secondaryField.readFields(in);
    }
    // Used (through the default partitioner) to send map outputs with the same
    // primary field to the same reducer instance. If the hash source is simple
    // (a primitive wrapper or so), delegating to its hashCode() is good enough.
    @Override
    public int hashCode() {
        return this.primaryField.hashCode();
    }
    @Override
    public int compareTo(CompositeKey other) {
        if (this.getPrimaryField().equals(other.getPrimaryField())) {
            return this.getSecondaryField().compareTo(other.getSecondaryField());
        } else {
            return this.getPrimaryField().compareTo(other.getPrimaryField());
        }
    }
}
public class CompositeGroupingComparator extends WritableComparator {
    public CompositeGroupingComparator() {
        super(CompositeKey.class, true);
    }
    // Group by the primary field only, so one reduce() call sees all values
    // that share the same primary field.
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey first = (CompositeKey) a;
        CompositeKey second = (CompositeKey) b;
        return first.getPrimaryField().compareTo(second.getPrimaryField());
    }
}
public class CompositeSortingComparator extends WritableComparator {
    public CompositeSortingComparator() {
        super(CompositeKey.class, true);
    }
    // Sort by the full composite key: primary field first, then secondary field.
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey first = (CompositeKey) a;
        CompositeKey second = (CompositeKey) b;
        return first.compareTo(second);
    }
}
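A rough sketch of the driver-side wiring for these classes (assuming a new-API Job instance; the grouping and sorting comparators are registered explicitly, while partitioning falls back to the default hash partitioner and CompositeKey.hashCode()):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
// Hypothetical driver fragment showing where the classes above plug in.
Job job = Job.getInstance(new Configuration(), "composite-key-example");
job.setMapOutputKeyClass(CompositeKey.class);
job.setSortComparatorClass(CompositeSortingComparator.class);      // order within each reducer
job.setGroupingComparatorClass(CompositeGroupingComparator.class); // one reduce() call per primary field
// The default HashPartitioner uses CompositeKey.hashCode(), which hashes only the
// primary field, so records with the same primary field go to the same reducer.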

The MapReduce framework takes care of the comparison for us for all the default data types like IntWritable, DoubleWritable, etc. But if you have a user-defined key type, you need to implement the WritableComparable interface.
WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
Note that hashCode() is frequently used in Hadoop to partition keys. It's important that your implementation of hashCode() returns the same result across different instances of the JVM. Note also that the default hashCode() implementation in Object does not satisfy this property.
Example:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
    // Some data
    private int counter;
    private long timestamp;
    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }
    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }
    public int compareTo(MyWritableComparable o) {
        int thisValue = this.counter;
        int thatValue = o.counter;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + counter;
        result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
        return result;
    }
}
From: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/WritableComparable.html

Related

hadoop partitioner not working

public class Partitioner_2 implements Partitioner<Text,Text>{
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int hashValue = 0;
        for (char c : key.toString().split("\\|\\|")[0].toCharArray()) {
            hashValue += (int) c;
        }
        return Math.abs(hashValue * 127) % numPartitions;
    }
}
That is my partitioner code, and the key is of the form "str1||str2". I would like to send all keys that have the same value for str1 to the same reducer.
My GroupComparator and KeyComparator are as follows:
public static class GroupComparator_2 extends WritableComparator {
    protected GroupComparator_2() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        Text kw1 = (Text) w1;
        Text kw2 = (Text) w2;
        String k1 = kw1.toString().split("||")[0].trim();
        String k2 = kw2.toString().split("||")[0].trim();
        return k1.compareTo(k2);
    }
}
public static class KeyComparator_2 extends WritableComparator {
    protected KeyComparator_2() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        Text key1 = (Text) w1;
        Text key2 = (Text) w2;
        String kw1_key1 = key1.toString().split("||")[0];
        String kw1_key2 = key2.toString().split("||")[0];
        int cmp = kw1_key1.compareTo(kw1_key2);
        if (cmp == 0) {
            String kw2_key1 = key1.toString().split("||")[1].trim();
            String kw2_key2 = key2.toString().split("||")[1].trim();
            cmp = kw2_key1.compareTo(kw2_key2);
        }
        return cmp;
    }
}
The errors I am currently receiving are:
KeywordKeywordCoOccurrence_2.java:92: interface expected here
public class Partitioner_2 implements Partitioner<Text,Text>{
^
KeywordKeywordCoOccurrence_2.java:94: method does not override or implement a method from a supertype
@Override
^
KeywordKeywordCoOccurrence_2.java:147: setPartitionerClass(java.lang.Class<? extends org.apache.hadoop.mapreduce.Partitioner>) in org.apache.hadoop.mapreduce.Job cannot be applied to (java.lang.Class<KeywordKeywordCoOccurrence_2.Partitioner_2>)
job.setPartitionerClass(Partitioner_2.class);
But as far as I can tell, I have overridden the getPartition() method, which is the only method in the Partitioner interface. Any help in identifying what I am doing wrong and how to fix it would be much appreciated.
Thanks in advance!
Partitioner is an abstract class in the new mapreduce API (that you're apparently using).
So you should define it as:
public class Partitioner_2 extends Partitioner<Text, Text> {
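For completeness, the corrected class would then look roughly like this (a sketch that keeps the original character-sum hashing on the part of the key before "||"):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
// New-API partitioner: extend the abstract class and override getPartition().
public class Partitioner_2 extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int hashValue = 0;
        for (char c : key.toString().split("\\|\\|")[0].toCharArray()) {
            hashValue += (int) c;
        }
        return Math.abs(hashValue * 127) % numPartitions;
    }
}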

Custom Partitioner Error

I am writing my own custom Partitioner (old API). Below is the code where I am extending the Partitioner class:
public static class WordPairPartitioner extends Partitioner<WordPair,IntWritable> {
    @Override
    public int getPartition(WordPair wordPair, IntWritable intWritable, int numPartitions) {
        return wordPair.getWord().hashCode() % numPartitions;
    }
}
Setting the JobConf:
conf.setPartitionerClass(WordPairPartitioner.class);
WordPair Class contains:
private Text word;
private Text neighbor;
Questions:
1. I am getting the error: "actual argument class (WordPairPartitioner) cannot convert to Class (? extends Partitioner)".
2. Is this the right way to write a custom partitioner, or do I need to override some other functionality as well?
I believe you are mixing up the old API (classes from org.apache.hadoop.mapred.*) and the new API (classes from org.apache.hadoop.mapreduce.*).
Using the old API, you may do as follows:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
public static class WordPairPartitioner implements Partitioner<WordPair,IntWritable> {
    @Override
    public int getPartition(WordPair wordPair, IntWritable intWritable, int numPartitions) {
        return wordPair.getWord().hashCode() % numPartitions;
    }
    @Override
    public void configure(JobConf arg0) {
    }
}
In addition to Amar's answer, you should handle the possibility of hashCode() returning a negative number by bit-masking it before taking the modulo:
@Override
public int getPartition(WordPair wordPair, IntWritable intWritable, int numPartitions) {
    // Clear the sign bit first so the result is always in [0, numPartitions).
    return (wordPair.getWord().hashCode() & 0x7FFFFFFF) % numPartitions;
}

How to sort JPA entities using TreeSet

I have a few entity beans with sets of other entities as attributes. I want to sort them and guarantee that they remain sorted after every insertion.
I tried it this way:
@Entity
@Table(name = "Document")
public class Document implements Serializable, Comparable<Document> {
    ...
    @ManyToOne
    private Project relativeProject;
    ...
    @Override
    public int compareTo(Document d) {
        long data1 = getDate().getTimeInMillis();
        long data2 = d.getDate().getTimeInMillis();
        if (data1 > data2)
            return 1;
        else if (data2 < data1)
            return -1;
        else
            return 0;
    }
    @Override
    public boolean equals(Object d) {
        return (getIdDocument() == ((Document) d).getIdDocument());
    }
    @Override
    public int hashCode() {
        return Long.valueOf(getIdDocument()).hashCode();
    }
}
@Entity
@Table(name="Project")
public class Project implements Serializable, Comparable<Project> {
    ...
    @OneToMany(mappedBy="relativeProject", fetch = FetchType.EAGER)
    @Sort(type = SortType.NATURAL)
    private SortedSet<Document> formedByDocuments;
    public Project() {
        this.formedByDocuments = new TreeSet<Document>();
    }
    ...
}
But it does not work. The problem is that, even though all the needed entries are in the database, when a session bean returns a Project some Documents are missing. Moreover, the entries are not sorted at all in the database.
If I do not sort at all (using a HashSet) and republish the project, everything works fine and I get all the elements in the set (but not sorted, of course).
Can someone help me to find out what's wrong with my sorting?
I'm assuming getIdDocument() returns an object and not a primitive.
In that case, you need to use equals and not ==
public boolean equals(Object d) {
    return getIdDocument().equals(((Document) d).getIdDocument());
}
Edit:
Looks like the problem is in the second if statement of the compareTo() method, if(data2 < data1). This should be if(data1 < data2).
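With that fix applied, the comparison can also be written more simply with Long.compare(), which avoids the swapped operands altogether (a sketch):
@Override
public int compareTo(Document d) {
    // Compare by date; returns a negative, zero, or positive value.
    return Long.compare(getDate().getTimeInMillis(), d.getDate().getTimeInMillis());
}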

Can Hadoop mapper produce multiple keys in output?

Can a single Mapper class produce multiple key-value pairs (of the same type) in a single run?
We output the key-value pair in the mapper like this:
context.write(key, value);
Here's a trimmed down (and exemplified) version of the Key:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class MyKey extends ObjectWritable implements WritableComparable<MyKey> {
    public enum KeyType {
        KeyType1,
        KeyType2
    }
    private KeyType keyTupe;
    private Long field1;
    private Integer field2 = -1;
    private String field3 = "";
    public KeyType getKeyType() {
        return keyTupe;
    }
    public void settKeyType(KeyType keyType) {
        this.keyTupe = keyType;
    }
    public Long getField1() {
        return field1;
    }
    public void setField1(Long field1) {
        this.field1 = field1;
    }
    public Integer getField2() {
        return field2;
    }
    public void setField2(Integer field2) {
        this.field2 = field2;
    }
    public String getField3() {
        return field3;
    }
    public void setField3(String field3) {
        this.field3 = field3;
    }
    @Override
    public void readFields(DataInput datainput) throws IOException {
        keyTupe = KeyType.valueOf(datainput.readUTF());
        field1 = datainput.readLong();
        field2 = datainput.readInt();
        field3 = datainput.readUTF();
    }
    @Override
    public void write(DataOutput dataoutput) throws IOException {
        dataoutput.writeUTF(keyTupe.toString());
        dataoutput.writeLong(field1);
        dataoutput.writeInt(field2);
        dataoutput.writeUTF(field3);
    }
    @Override
    public int compareTo(MyKey other) {
        if (getKeyType().compareTo(other.getKeyType()) != 0) {
            return getKeyType().compareTo(other.getKeyType());
        } else if (getField1().compareTo(other.getField1()) != 0) {
            return getField1().compareTo(other.getField1());
        } else if (getField2().compareTo(other.getField2()) != 0) {
            return getField2().compareTo(other.getField2());
        } else if (getField3().compareTo(other.getField3()) != 0) {
            return getField3().compareTo(other.getField3());
        } else {
            return 0;
        }
    }
    public static class MyKeyComparator extends WritableComparator {
        public MyKeyComparator() {
            super(MyKey.class);
        }
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return compareBytes(b1, s1, l1, b2, s2, l2);
        }
    }
    static { // register this comparator
        WritableComparator.define(MyKey.class, new MyKeyComparator());
    }
}
And this is how we tried to output both keys in the Mapper:
MyKey key1 = new MyKey();
key1.settKeyType(KeyType.KeyType1);
key1.setField1(1L);
key1.setField2(23);
MyKey key2 = new MyKey();
key2.settKeyType(KeyType.KeyType2);
key2.setField1(1L);
key2.setField3("abc");
context.write(key1, value1);
context.write(key2, value2);
Our job's output format class is: org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
I'm stating this because in other output format classes I've seen the output not being appended but just committed in their implementation of the write method.
Also, we are using the following classes for Mapper and Context:
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Context
Writing to the context multiple times in one map task is perfectly fine.
However, you may have several problems with your key class. Whenever you implement WritableComparable for a key, you should also implement equals(Object) and hashCode() methods. These aren't part of the WritableComparable interface, since they are defined in Object, but you must provide implementations.
The default partitioner uses the hashCode() method to decide which reducer each key/value pair goes to. If you don't provide a sane implementation, you can get strange results.
As a rule of thumb, whenever you implement hashCode() or any sort of comparison method, you should provide an equals(Object) method as well. You will have to make sure it accepts an Object as the parameter, as this is how it is defined in the Object class (whose implementation you are probably overriding).
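For example, equals() and hashCode() for the MyKey class above could look roughly like this (a sketch using java.util.Objects for null-safety; the enum's ordinal is used in hashCode() because Enum.hashCode() is identity-based and therefore not stable across JVMs):
// Needs: import java.util.Objects;
@Override
public boolean equals(Object obj) {
    if (this == obj) return true;
    if (!(obj instanceof MyKey)) return false;
    MyKey other = (MyKey) obj;
    return keyTupe == other.keyTupe
        && Objects.equals(field1, other.field1)
        && Objects.equals(field2, other.field2)
        && Objects.equals(field3, other.field3);
}
@Override
public int hashCode() {
    int result = (keyTupe == null) ? 0 : keyTupe.ordinal();
    result = 31 * result + Objects.hashCode(field1);
    result = 31 * result + Objects.hashCode(field2);
    result = 31 * result + Objects.hashCode(field3);
    return result;
}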

Hadoop/MapReduce: Reading and writing classes generated from DDL

Can someone walk me through the basic workflow of reading and writing data with classes generated from DDL?
I have defined some struct-like records using DDL. For example:
class Customer {
    ustring FirstName;
    ustring LastName;
    ustring CardNo;
    long LastPurchase;
}
I've compiled this to get a Customer class and included it into my project. I can easily see how to use this as input and output for mappers and reducers (the generated class implements Writable), but not how to read and write it to file.
The JavaDoc for the org.apache.hadoop.record package talks about serializing these records in Binary, CSV or XML format. How do I actually do that? Say my reducer produces IntWritable keys and Customer values. What OutputFormat do I use to write the result in CSV format? What InputFormat would I use to read the resulting files in later, if I wanted to perform analysis over them?
Ok, so I think I have this figured out. I'm not sure if it is the most straightforward way, so please correct me if you know a simpler workflow.
Every class generated from DDL implements the Record interface, and consequently provides two methods:
serialize(RecordOutput out) for writing
deserialize(RecordInput in) for reading
RecordOutput and RecordInput are utility interfaces provided in the org.apache.hadoop.record package. There are a few implementations (e.g. XmlRecordOutput, BinaryRecordOutput, CsvRecordOutput).
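For instance, writing a single record in CSV form looks roughly like this (a sketch; customer stands in for a populated instance of the generated Customer class):
// Sketch: serialize one generated-record instance as a CSV line (throws IOException).
java.io.ByteArrayOutputStream bytes = new java.io.ByteArrayOutputStream();
org.apache.hadoop.record.CsvRecordOutput csvOut = new org.apache.hadoop.record.CsvRecordOutput(bytes);
customer.serialize(csvOut);
System.out.println(bytes.toString()); // one CSV line per record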
As far as I know, you have to implement your own OutputFormat or InputFormat classes to use these. This is fairly easy to do.
For example, the OutputFormat I talked about in the original question (one that writes Integer keys and Customer values in CSV format) would be implemented like this:
private static class CustomerOutputFormat
        extends TextOutputFormat<IntWritable, Customer> {
    public RecordWriter<IntWritable, Customer> getRecordWriter(FileSystem ignored,
            JobConf job,
            String name,
            Progressable progress)
            throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new CustomerRecordWriter(fileOut);
    }
    protected static class CustomerRecordWriter
            implements RecordWriter<IntWritable, Customer> {
        protected DataOutputStream outStream;
        public CustomerRecordWriter(DataOutputStream out) {
            this.outStream = out;
        }
        public synchronized void write(IntWritable key, Customer value) throws IOException {
            // Write the key as a tagged CSV field, then let the generated record
            // serialize its own fields on the same line.
            CsvRecordOutput csvOutput = new CsvRecordOutput(outStream);
            csvOutput.writeInt(key.get(), "id");
            value.serialize(csvOutput);
        }
        public synchronized void close(Reporter reporter) throws IOException {
            outStream.close();
        }
    }
}
Creating the InputFormat is much the same. Because the CSV format is one entry per line, we can use a LineRecordReader internally to do most of the work.
private static class CustomerInputFormat extends FileInputFormat<IntWritable, Customer> {
    public RecordReader<IntWritable, Customer> getRecordReader(
            InputSplit genericSplit,
            JobConf job,
            Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new CustomerRecordReader(job, (FileSplit) genericSplit);
    }
    private class CustomerRecordReader implements RecordReader<IntWritable, Customer> {
        private LineRecordReader lrr;
        public CustomerRecordReader(Configuration job, FileSplit split)
                throws IOException {
            this.lrr = new LineRecordReader(job, split);
        }
        public IntWritable createKey() {
            return new IntWritable();
        }
        public Customer createValue() {
            return new Customer();
        }
        public synchronized boolean next(IntWritable key, Customer value)
                throws IOException {
            LongWritable offset = new LongWritable();
            Text line = new Text();
            if (!lrr.next(offset, line))
                return false;
            CsvRecordInput cri = new CsvRecordInput(
                    new ByteArrayInputStream(line.toString().getBytes()));
            key.set(cri.readInt("id"));
            value.deserialize(cri);
            return true;
        }
        public float getProgress() {
            return lrr.getProgress();
        }
        public synchronized long getPos() throws IOException {
            return lrr.getPos();
        }
        public synchronized void close() throws IOException {
            lrr.close();
        }
    }
}
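Finally, a hypothetical driver fragment (old org.apache.hadoop.mapred API) showing where these formats would be registered; MyAnalysisJob is just a placeholder class name:
// Hypothetical old-API driver wiring for the formats above.
JobConf conf = new JobConf(MyAnalysisJob.class);
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Customer.class);
conf.setOutputFormat(CustomerOutputFormat.class); // write IntWritable/Customer pairs as CSV
// A follow-up job that reads those CSV files back would use:
// conf.setInputFormat(CustomerInputFormat.class);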
