readFields with a complex type in Hadoop

I have this class:
public class Stripe implements WritableComparable<Stripe> {
    private List<Term> occorrenze = new ArrayList<Term>();
    public Stripe() {}
    public void readFields(DataInput in) throws IOException {
        // ...
    }
}
public class Term implements WritableComparable<Term> {
    private Text key;
    private IntWritable frequency;
    public void readFields(DataInput in) throws IOException {
        // ...
    }
}
Stripe is a list of Term (each a pair of Text and IntWritable).
How can I write the readFields method so that it reads the complex type Stripe from a DataInput?

To serialize a list you need to write out the length of the list, followed by the elements themselves. A simple readFields / write method pair for Stripe could be:
public void readFields(DataInput in) throws IOException {
    occorrenze.clear(); // the instance may be reused between records
    int cnt = in.readInt();
    for (int x = 0; x < cnt; x++) {
        Term term = new Term();
        term.readFields(in);
        occorrenze.add(term);
    }
}
public void write(DataOutput out) throws IOException {
    out.writeInt(occorrenze.size());
    for (Term term : occorrenze) {
        term.write(out);
    }
}
You could make this more efficient by writing the length as a VInt (WritableUtils.writeVInt / readVInt) rather than an int, and by keeping a pool of Term objects that readFields re-uses, which saves on object creation and garbage collection.
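The length-then-elements pattern can be tried without a Hadoop cluster. The following is a minimal sketch that uses only java.io; TermSketch and StripeSketch are hypothetical stand-ins for Term and Stripe (a String and an int replace Text and IntWritable), and roundTrip is a helper added purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Term: a String key and an int frequency.
class TermSketch {
    String key = "";
    int frequency;

    TermSketch() {}

    TermSketch(String key, int frequency) {
        this.key = key;
        this.frequency = frequency;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeInt(frequency);
    }

    void readFields(DataInput in) throws IOException {
        key = in.readUTF();
        frequency = in.readInt();
    }
}

// Hypothetical stand-in for Stripe: a list of terms.
class StripeSketch {
    final List<TermSketch> occorrenze = new ArrayList<>();

    void write(DataOutput out) throws IOException {
        out.writeInt(occorrenze.size());   // 1. the element count
        for (TermSketch term : occorrenze) {
            term.write(out);               // 2. each element, in order
        }
    }

    void readFields(DataInput in) throws IOException {
        occorrenze.clear();                // drop state left from a previous record
        int cnt = in.readInt();
        for (int x = 0; x < cnt; x++) {
            TermSketch term = new TermSketch();
            term.readFields(in);
            occorrenze.add(term);
        }
    }

    // Serializes to a byte array and reads the bytes back into a fresh instance.
    static StripeSketch roundTrip(StripeSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            StripeSketch copy = new StripeSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Clearing the list at the start of readFields matters because the same instance is typically reused for many records, so stale elements would otherwise accumulate.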

You could use ArrayWritable which is a list of writables of the same type.


I want to write a Hadoop MapReduce Join in Java

I'm completely new to the Hadoop framework and I want to write a MapReduce program that joins two tables R and S on the attribute x. The structure of the two tables is:
R (tag : char, x : int, y : varchar(30))
S (tag : char, x : int, z : varchar(30))
For example we have for R table :
r 10 r-10-0
r 11 r-11-0
r 12 r-12-0
r 21 r-21-0
And for S table :
s 11 s-11-0
s 21 s-41-0
s 21 s-41-1
s 12 s-31-0
s 11 s-31-1
The result should look like :
r 11 r-11-0 s 11 s-11-0
Can anyone help me, please?
Describing a join in MapReduce to someone new to the framework is difficult, but here is a working implementation for your situation. I definitely recommend reading section 9 of Hadoop: The Definitive Guide, 4th Edition, which describes how to implement joins in MapReduce very well.
First of all, you might consider a higher-level framework such as Pig, Hive or Spark, because they provide a join operation as part of their core.
Secondly, there are several ways to implement a join in MapReduce, depending on the nature of your data, including map-side joins and reduce-side joins. In this answer I have implemented a reduce-side join:
We need a mapper for each dataset. In your case the same mapper could serve both, but in many situations different datasets need different mappers, so I have defined two to keep the solution general.
The map output key is a TextPair with two attributes: the key used to join the data, and a tag that specifies which dataset the record belongs to. The tag is 0 for the first dataset and 1 for the second.
TextPair sorts on both fields, so for each join key the record from the first dataset reaches the reducer before all the records from the second dataset with that key. TextPair.FirstComparator compares the join key only and is used to group those records into a single reduce call. This line of code does the trick:
job.setGroupingComparatorClass(TextPair.FirstComparator.class);
So in the reducer the first record we receive comes from dataset1 and the rest come from dataset2. All that remains is to write those records out.
Mapper for dataset1:
public class JoinDataSet1Mapper
        extends Mapper<LongWritable, Text, TextPair, Text> {
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] data = value.toString().split(" ");
        // data[1] is the join attribute x; tag "0" marks the first dataset
        context.write(new TextPair(data[1], "0"), value);
    }
}
Mapper for dataset2:
public class JoinDataSet2Mapper
        extends Mapper<LongWritable, Text, TextPair, Text> {
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] data = value.toString().split(" ");
        // tag "1" marks the second dataset
        context.write(new TextPair(data[1], "1"), value);
    }
}
Reducer:
public class JoinReducer extends Reducer<TextPair, Text, NullWritable, Text> {
    public static class KeyPartitioner extends Partitioner<TextPair, Text> {
        public int getPartition(TextPair key, Text value, int numPartitions) {
            // partition on the join key only, so both datasets meet in the same reducer
            return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
    protected void reduce(TextPair key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        // the first value is the dataset1 record; copy it, because Hadoop reuses the Text instance
        Text stationName = new Text(iter.next());
        while (iter.hasNext()) {
            Text record = iter.next();
            Text outValue = new Text(stationName.toString() + "\t" + record.toString());
            context.write(NullWritable.get(), outValue);
        }
    }
}
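The data flow of the reduce-side join can be followed in plain Java before running it on a cluster. The sketch below is an illustration only, not the Hadoop job itself: ReduceSideJoinSketch, tag and join are hypothetical names, a TreeMap stands in for the shuffle sort, and a string of the form joinKey, tab, tag stands in for TextPair. Unlike the reducer above, it checks the tag explicitly, so a key that appears only in the second dataset produces no output. It assumes at most one first-dataset record per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoinSketch {

    // Same formula as KeyPartitioner: clear the sign bit, then take the modulus.
    static int partition(String joinKey, int numPartitions) {
        return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // "Map" step: key every line by joinKey + TAB + datasetTag.
    private static void tag(List<String> lines, String datasetTag,
                            Map<String, List<String>> shuffled) {
        for (String line : lines) {
            String joinKey = line.split(" ")[1];
            shuffled.computeIfAbsent(joinKey + "\t" + datasetTag, k -> new ArrayList<>())
                    .add(line);
        }
    }

    static List<String> join(List<String> r, List<String> s) {
        // The TreeMap stands in for the shuffle sort: for one join key,
        // the tag-0 entry is visited before the tag-1 entry.
        TreeMap<String, List<String>> shuffled = new TreeMap<>();
        tag(r, "0", shuffled);
        tag(s, "1", shuffled);

        List<String> joined = new ArrayList<>();
        String currentKey = null;
        String left = null; // first-dataset record for the current join key, if any
        for (Map.Entry<String, List<String>> entry : shuffled.entrySet()) {
            String[] parts = entry.getKey().split("\t");
            if (!parts[0].equals(currentKey)) {
                currentKey = parts[0]; // a new group starts here
                left = null;
            }
            for (String record : entry.getValue()) {
                if (parts[1].equals("0")) {
                    left = record;
                } else if (left != null) {
                    joined.add(left + "\t" + record);
                }
            }
        }
        return joined;
    }
}
```

With the R and S rows from the question, join returns five rows, one for every S record whose x value also occurs in R.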
Custom key:
public class TextPair implements WritableComparable<TextPair> {
    private Text first;
    private Text second;
    public TextPair() {
        set(new Text(), new Text());
    }
    public TextPair(String first, String second) {
        set(new Text(first), new Text(second));
    }
    public TextPair(Text first, Text second) {
        set(first, second);
    }
    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }
    public Text getFirst() {
        return first;
    }
    public Text getSecond() {
        return second;
    }
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }
    public boolean equals(Object o) {
        if (o instanceof TextPair) {
            TextPair tp = (TextPair) o;
            return first.equals(tp.first) && second.equals(tp.second);
        }
        return false;
    }
    public String toString() {
        return first + "\t" + second;
    }
    // sort order: by first, then by second
    public int compareTo(TextPair tp) {
        int cmp = first.compareTo(tp.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(tp.second);
    }
    // compares the first field only, directly on the serialized bytes
    public static class FirstComparator extends WritableComparator {
        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
        public FirstComparator() {
            super(TextPair.class);
        }
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            try {
                // serialized length of the first Text = size of its VInt prefix + the byte count it encodes
                int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
                return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }
        }
        public int compare(WritableComparable a, WritableComparable b) {
            if (a instanceof TextPair && b instanceof TextPair) {
                return ((TextPair) a).first.compareTo(((TextPair) b).first);
            }
            return super.compare(a, b);
        }
    }
}
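The interplay between the full sort order (compareTo) and the first-field-only comparison (FirstComparator) can be shown with ordinary comparators. This is a sketch under assumptions: GroupingSketch, SORT, GROUP and group are hypothetical names, and a two-element String array stands in for a TextPair.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {
    // Each key is {joinKey, datasetTag}, standing in for TextPair(first, second).

    // Like TextPair.compareTo: order by join key, then by tag.
    static final Comparator<String[]> SORT = (a, b) -> {
        int cmp = a[0].compareTo(b[0]);
        return cmp != 0 ? cmp : a[1].compareTo(b[1]);
    };

    // Like TextPair.FirstComparator: look at the join key only.
    static final Comparator<String[]> GROUP = (a, b) -> a[0].compareTo(b[0]);

    // Sorts with SORT, then starts a new group whenever GROUP sees a different key.
    static List<List<String>> group(List<String[]> keys) {
        List<String[]> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<String>> groups = new ArrayList<>();
        String[] previous = null;
        for (String[] key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key[0] + "/" + key[1]);
            previous = key;
        }
        return groups;
    }
}
```

Within every group the tag-0 key comes first, which is the property the reducer relies on when it treats the first value as the record from the first dataset.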
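Job:
public class JoinJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "Join two DataSet");
        job.setJarByClass(getClass());
        Path ncdcInputPath = new Path(getConf().get("job.input1.path"));
        Path stationInputPath = new Path(getConf().get("job.input2.path"));
        Path outputPath = new Path(getConf().get("job.output.path"));
        MultipleInputs.addInputPath(job, ncdcInputPath,
                TextInputFormat.class, JoinDataSet1Mapper.class);
        MultipleInputs.addInputPath(job, stationInputPath,
                TextInputFormat.class, JoinDataSet2Mapper.class);
        FileOutputFormat.setOutputPath(job, outputPath);
        // partition and group on the join key only; sort on the full TextPair
        job.setPartitionerClass(JoinReducer.KeyPartitioner.class);
        job.setGroupingComparatorClass(TextPair.FirstComparator.class);
        job.setMapOutputKeyClass(TextPair.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new JoinJob(), args);
        System.exit(exitCode);
    }
}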

Hadoop and Custom Writable Issue

I am using Hadoop 2.7 and I have an issue when using a custom Writable modelled on "TextPair" (page 104 of the Definitive Guide). My program works fine when I use just Text, whereas it outputs "test.TextTuple@3b86249a test.TextTuple@63cd18fd" when I use the custom class.
Any idea what is wrong with my code (below)?
public class KWMapper extends Mapper<LongWritable, Text, TextTuple, TextTuple> {
    public void map(LongWritable k, Text v, Mapper.Context c) throws IOException, InterruptedException {
        String keywordRelRecord[] = v.toString().split(",");
        String subTopicID = keywordRelRecord[0];
        String paperID = keywordRelRecord[1];
        //set the KEY
        TextTuple key = new TextTuple();
        key.setNaturalKey(new Text(subTopicID));
        key.setSecondaryKey(new Text("K"));
        //set the VALUE
        TextTuple value = new TextTuple();
        value.setNaturalKey(new Text(paperID));
        value.setSecondaryKey(new Text("K"));
        c.write(key, value);
    }
}
public class TDMapper extends Mapper<LongWritable, Text, TextTuple, TextTuple> {
    public void map(LongWritable k, Text v, Mapper.Context c) throws IOException, InterruptedException {
        String topicRecord[] = v.toString().split(",");
        String superTopicID = topicRecord[0];
        String subTopicID = topicRecord[1].substring(1, topicRecord[1].length() - 1);
        TextTuple key = new TextTuple();
        key.setNaturalKey(new Text(subTopicID));
        key.setSecondaryKey(new Text("T"));
        TextTuple value = new TextTuple();
        value.setNaturalKey(new Text(superTopicID));
        value.setSecondaryKey(new Text("T"));
        c.write(key, value);
    }
}
public class TDKRReducer extends Reducer<TextTuple, TextTuple, Text, Text> {
    public void reduce(TextTuple k, Iterable<TextTuple> values, Reducer.Context c) throws IOException, InterruptedException {
        for (TextTuple val : values) {
            c.write(k.getNaturalKey(), val.getNaturalKey());
        }
    }
}
public class TDDriver {
    public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException {
        // This class supports the user in configuring the execution
        Configuration confStage1 = new Configuration();
        Job job1 = new Job(confStage1, "TopDecKeywordRel");
        // Setting the driver class
        // Setting the input files and processing them using the corresponding mapper class
        MultipleInputs.addInputPath(job1, new Path(args[0]), TextInputFormat.class, TDMapper.class);
        MultipleInputs.addInputPath(job1, new Path(args[1]), TextInputFormat.class, KWMapper.class);
        // Setting the reducer class
        // Setting the output class for the key-value pairs
        // Setting the output file
        Path outputPA = new Path(args[2]);
        FileOutputFormat.setOutputPath(job1, outputPA);
        // Submitting the job and monitoring its execution
        System.exit(job1.waitForCompletion(true) ? 0 : 1);
    }
}
public class TextTuple implements Writable, WritableComparable<TextTuple> {
    private Text naturalKey;
    private Text secondaryKey;
    public TextTuple() {
        this.naturalKey = new Text();
        this.secondaryKey = new Text();
    }
    public void setNaturalKey(Text naturalKey) {
        this.naturalKey = naturalKey;
    }
    public void setSecondaryKey(Text secondaryKey) {
        this.secondaryKey = secondaryKey;
    }
    public Text getNaturalKey() {
        return naturalKey;
    }
    public Text getSecondaryKey() {
        return secondaryKey;
    }
    public void write(DataOutput out) throws IOException {
        // ...
    }
    public void readFields(DataInput in) throws IOException {
        // ...
    }
    // This comparator controls the sort order of the keys.
    public int compareTo(TextTuple o) {
        // comparing the naturalKey
        int compareValue = this.naturalKey.compareTo(o.naturalKey);
        if (compareValue == 0) {
            compareValue = this.secondaryKey.compareTo(o.secondaryKey);
        }
        return -1 * compareValue;
    }
}

Custom object as value for Mapper output

I have an object constructed as follows:
class ObjExample {
    String s;
    Object[] objArray; // each element can be a primitive or an array of primitives
}
I know that to use it as an output type for a mapper or reducer, I have to implement WritableComparable for it.
But I am confused about how to write readFields(), write() and compareTo() for this kind of class.
You can wrap the field s in a Text and objArray in an ArrayWritable. Each element of objArray would itself be an array (also an ArrayWritable) of primitives. Here is a possible implementation:
public static final class ObjExample implements WritableComparable<ObjExample> {
    public final Text s = new Text(); // wrapped String
    public final ArrayOfArrays objArray = new ArrayOfArrays();
    public int compareTo(ObjExample o) {
        // your logic here, example:
        return s.compareTo(o.s);
    }
    public void write(DataOutput dataOutput) throws IOException {
        s.write(dataOutput);
        objArray.write(dataOutput);
    }
    public void readFields(DataInput dataInput) throws IOException {
        s.readFields(dataInput);
        objArray.readFields(dataInput);
    }
    // set size of the objArray
    public void setSize(int n) {
        objArray.set(new IntArray[n]);
    }
    // set i-th element of the objArray to an array of elements
    public void setElement(int i, IntWritable... elements) {
        IntArray subArr = new IntArray();
        subArr.set(elements);
        objArray.get()[i] = subArr;
    }
}
You will need two more classes to make it work:
// array of primitives
public static final class IntArray extends ArrayWritable {
    public IntArray() {
        // you can specify any other primitive wrapper (DoubleWritable, Text, ...)
        super(IntWritable.class);
    }
}
// array of arrays
public static final class ArrayOfArrays extends ArrayWritable {
    public ArrayOfArrays() {
        super(IntArray.class);
    }
}
Example of construction of the object:
ObjExample o = new ObjExample();
o.setSize(2); // allocate the outer array before setting its elements
o.setElement(0, new IntWritable(0)); // single primitive
o.setElement(1, new IntWritable(1), new IntWritable(2)); // array of primitives
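The byte layout behind an array of arrays is a length followed by the elements, applied at both levels. The sketch below illustrates that layout with java.io only, as an assumption-laden stand-in rather than the Hadoop classes themselves: NestedArraySketch is a hypothetical name, writeUTF replaces Text, and plain int arrays replace IntArray.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class NestedArraySketch {

    // Layout: s, the outer length, then for every row its length followed by its ints.
    static byte[] write(String s, int[][] rows) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
            out.writeInt(rows.length);
            for (int[] row : rows) {
                out.writeInt(row.length);
                for (int value : row) {
                    out.writeInt(value);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in exactly the order write() produced them.
    static int[][] readRows(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readUTF(); // skip over s
            int[][] rows = new int[in.readInt()][];
            for (int i = 0; i < rows.length; i++) {
                rows[i] = new int[in.readInt()];
                for (int j = 0; j < rows[i].length; j++) {
                    rows[i][j] = in.readInt();
                }
            }
            return rows;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The read side never needs to know the sizes in advance, because each length is stored immediately before the data it describes.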

Custom WritableComparable displays object reference as output

I am new to Hadoop and Java, and I feel there is something obvious I am just missing. I am using Hadoop 1.0.3 if that means anything.
My goal in using Hadoop is to take a bunch of files and parse them one file at a time (as opposed to line by line). Each file will produce multiple key-values, but the context of the other lines is important. The key and value are multi-value/composite, so I have implemented WritableComparable for the key and Writable for the value. Because the processing of each file takes a bit of CPU, I want to save the output of the mapper, then run multiple reducers later on.
For the composite keys, I followed [1].
The problem is that the output is just Java object references rather than the composite key and value. Example:
LinkKeyWritable@bd2f9730 LinkValueWritable@8752408c
I am not sure if the problem is related to not reducing the data at all or
Here is my main class:
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(Parser.class);
    PerFileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
}
And my Mapper class:
public class RawMap extends MapReduceBase implements
        Mapper<NullWritable, Text, LinkKeyWritable, LinkValueWritable> {
    public void map(NullWritable key, Text value,
            OutputCollector<LinkKeyWritable, LinkValueWritable> output,
            Reporter reporter) throws IOException {
        String json = value.toString();
        SerpyReader reader = new SerpyReader(json);
        GoogleParser parser = new GoogleParser(reader);
        for (String page : reader.getPages()) {
            String content = reader.readPageContent(page);
            for (Link link : parser.getLinks()) {
                LinkKeyWritable linkKey = new LinkKeyWritable(link);
                LinkValueWritable linkValue = new LinkValueWritable(link);
                output.collect(linkKey, linkValue);
            }
        }
    }
}
Link is basically a struct of various information that gets split between LinkKeyWritable and LinkValueWritable:
public class LinkKeyWritable implements WritableComparable<LinkKeyWritable> {
    protected Link link;
    public LinkKeyWritable() {
        link = new Link();
    }
    public LinkKeyWritable(Link link) {
        super();
        this.link = link;
    }
    public void readFields(DataInput in) throws IOException {
        link.batchDay = in.readLong();
        link.source = in.readUTF();
        link.domain = in.readUTF();
        link.path = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
        // mirror of readFields: same fields, same order
        out.writeLong(link.batchDay);
        out.writeUTF(link.source);
        out.writeUTF(link.domain);
        out.writeUTF(link.path);
    }
    public int compareTo(LinkKeyWritable o) {
        return ComparisonChain.start()
                .compare(link.batchDay, o.link.batchDay)
                .compare(link.source, o.link.source)
                .compare(link.domain, o.link.domain)
                .compare(link.path, o.link.path)
                .result();
    }
    public int hashCode() {
        return Objects.hashCode(link.batchDay, link.source, link.domain, link.path);
    }
    public boolean equals(final Object obj) {
        if (obj instanceof LinkKeyWritable) {
            final LinkKeyWritable o = (LinkKeyWritable) obj;
            return Objects.equal(link.batchDay, o.link.batchDay)
                    && Objects.equal(link.source, o.link.source)
                    && Objects.equal(link.domain, o.link.domain)
                    && Objects.equal(link.path, o.link.path);
        }
        return false;
    }
}
public class LinkValueWritable implements Writable {
    protected Link link;
    public LinkValueWritable() {
        link = new Link();
    }
    public LinkValueWritable(Link link) {
        this.link = new Link();
        this.link.type = link.type;
        this.link.description = link.description;
    }
    public void readFields(DataInput in) throws IOException {
        link.type = in.readUTF();
        link.description = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
        // mirror of readFields: same fields, same order
        out.writeUTF(link.type);
        out.writeUTF(link.description);
    }
    public int hashCode() {
        return Objects.hashCode(link.type, link.description);
    }
    public boolean equals(final Object obj) {
        if (obj instanceof LinkValueWritable) {
            final LinkValueWritable o = (LinkValueWritable) obj;
            return Objects.equal(link.type, o.link.type)
                    && Objects.equal(link.description, o.link.description);
        }
        return false;
    }
}
I think the answer is in the implementation of TextOutputFormat, specifically the LineRecordWriter's writeObject method:
/**
 * Write the object to the byte stream, handling Text as a special
 * case.
 * @param o the object to print
 * @throws IOException if the write throws, we pass it on
 */
private void writeObject(Object o) throws IOException {
    if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
    } else {
        out.write(o.toString().getBytes(utf8));
    }
}
As you can see, if your key or value is not a Text object, it calls the toString method on it and writes that out. Since you've left toString unimplemented in your key and value, it uses the Object class's implementation, which writes out the class name and hash code.
I'd say that you should try writing an appropriate toString method or using a different OutputFormat.
It looks like you have a list of objects just like you wanted. You need to implement toString() on your Writable if you want a human-readable version printed out instead of an ugly Java reference.
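The effect of a missing toString() is easy to reproduce outside Hadoop. In this minimal sketch all class names are hypothetical: WithoutToString and WithToString stand in for a key class before and after the fix, and render imitates only the non-Text branch of writeObject.

```java
// A key class that inherits Object.toString().
class WithoutToString {
    final String domain;
    final String path;

    WithoutToString(String domain, String path) {
        this.domain = domain;
        this.path = path;
    }
}

// The same fields, with a readable tab-separated toString().
class WithToString {
    final String domain;
    final String path;

    WithToString(String domain, String path) {
        this.domain = domain;
        this.path = path;
    }

    @Override
    public String toString() {
        return domain + "\t" + path;
    }
}

class ToStringSketch {
    // Imitates the non-Text branch of writeObject: the object is rendered via toString().
    static String render(Object o) {
        return o.toString();
    }
}
```

Rendering a WithoutToString instance gives the class name, an @ sign and a hexadecimal hash code, which is the same shape as the output in the question. Rendering a WithToString instance gives the field values.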

Type mismatch in key from map, using SequenceFileInputFormat correctly

I am trying to run a recommender example from chapter 6 (listings 6.1 to 6.4) of the ebook Mahout in Action. There are two mapper/reducer pairs. Here is the code:
Mapper - 1
public class WikipediaToItemPrefsMapper extends
        Mapper<LongWritable,Text,VarLongWritable,VarLongWritable> {
    private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
    public void map(LongWritable key,
                    Text value,
                    Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Matcher m = NUMBERS.matcher(line);
        m.find(); // the first number on the line is the user ID
        VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
        VarLongWritable itemID = new VarLongWritable();
        while (m.find()) {
            itemID.set(Long.parseLong(m.group()));
            context.write(userID, itemID);
        }
    }
}
Reducer - 1
public class WikipediaToUserVectorReducer extends
        Reducer<VarLongWritable,VarLongWritable,VarLongWritable,VectorWritable> {
    public void reduce(VarLongWritable userID,
                       Iterable<VarLongWritable> itemPrefs,
                       Context context)
            throws IOException, InterruptedException {
        Vector userVector = new RandomAccessSparseVector(
                Integer.MAX_VALUE, 100);
        for (VarLongWritable itemPref : itemPrefs) {
            userVector.set((int)itemPref.get(), 1.0f);
        }
        //LongWritable userID_lw = new LongWritable(userID.get());
        context.write(userID, new VectorWritable(userVector));
        //context.write(userID_lw, new VectorWritable(userVector));
    }
}
The reducer outputs a userID and a userVector, and it looks like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}, provided FileInputFormat and TextInputFormat are used in the driver.
I want to use another pair of mapper-reducer to process this data further:
Mapper - 2
public class UserVectorToCooccurenceMapper extends
        Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable> {
    public void map(VarLongWritable userID,
                    VectorWritable userVector,
                    Context context)
            throws IOException, InterruptedException {
        Iterator<Vector.Element> it = userVector.get().iterateNonZero();
        while (it.hasNext()) {
            int index1 = it.next().index();
            Iterator<Vector.Element> it2 = userVector.get().iterateNonZero();
            while (it2.hasNext()) {
                int index2 = it2.next().index();
                context.write(new IntWritable(index1),
                              new IntWritable(index2));
            }
        }
    }
}
Reducer - 2
public class UserVectorToCooccurenceReducer extends
        Reducer<IntWritable,IntWritable,IntWritable,VectorWritable> {
    public void reduce(IntWritable itemIndex1,
                       Iterable<IntWritable> itemIndex2s,
                       Context context)
            throws IOException, InterruptedException {
        Vector cooccurrenceRow = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
        for (IntWritable intWritable : itemIndex2s) {
            int itemIndex2 = intWritable.get();
            cooccurrenceRow.set(itemIndex2, cooccurrenceRow.get(itemIndex2) + 1.0);
        }
        context.write(itemIndex1, new VectorWritable(cooccurrenceRow));
    }
}
This is the driver I am using:
public final class RecommenderJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Job job_preferenceValues = new Job(getConf());
        FileInputFormat.setInputPaths(job_preferenceValues, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job_preferenceValues, new Path(args[1]));
        Job job_cooccurence = new Job(getConf());
        SequenceFileInputFormat.setInputPaths(job_cooccurence, new Path(args[1]));
        FileOutputFormat.setOutputPath(job_cooccurence, new Path(args[2]));
        return 0;
    }
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new RecommenderJob(), args);
    }
}
The error that I get is: Type mismatch in key from map: expected org.apache.mahout.math.VarLongWritable, received org.apache.hadoop.io.IntWritable
While Googling for a fix, I found that my issue is similar to this question. The difference is that I am already using SequenceFileInputFormat and SequenceFileOutputFormat, I believe correctly. I also see that does more or less something similar. My understanding, from the Yahoo Tutorial, is:
SequenceFileOutputFormat rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and presents the data to the next Mapper in the same manner as it was emitted by the previous Reducer.
What am I doing wrong? Will really appreciate some pointers from someone.. I spent the day trying to fix this and got nowhere :(
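Your second mapper has the following signature:
public class UserVectorToCooccurenceMapper extends
        Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable>
so it emits <IntWritable, IntWritable> pairs. But your driver code declares VarLongWritable as the map output key class for the second job, which is what the error message reports.
The reducer is expecting <IntWritable, IntWritable> as input, so you should just amend your driver code to:
job_cooccurence.setMapOutputKeyClass(IntWritable.class);
job_cooccurence.setMapOutputValueClass(IntWritable.class);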
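The error itself comes down to comparing the class declared for the map output key with the class of each key the mapper emits. The following is a simplified imitation of that check, not Hadoop's actual code: TypeCheckSketch and check are hypothetical names, and java.lang.Long and java.lang.Integer stand in for VarLongWritable and IntWritable.

```java
class TypeCheckSketch {
    // Returns the error text for a mismatch, or "ok" when the classes are identical.
    static String check(Class<?> declaredKeyClass, Object emittedKey) {
        if (emittedKey.getClass() != declaredKeyClass) {
            return "Type mismatch in key from map: expected "
                    + declaredKeyClass.getName() + ", received "
                    + emittedKey.getClass().getName();
        }
        return "ok";
    }
}
```

The comparison is on exact class identity, so the declared map output classes in the driver have to name the very types the mapper writes, independently of what the input format delivers to the mapper.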
